A reinforcement learning agent that evolves from random movement to a precision shooter using only raw visual input.
| Member | Role & Contributions |
|---|---|
| Gamze Çetin (2102967) | Algorithm & Architecture: Implemented the PPO algorithm and the CNN Feature Extractor backbone. |
| Fazıl Eren Çiftdemir (2103573) | Environment Dynamics: Designed the custom Reward Function and configured VizDoom scenario parameters. |
| Melis Bahar Kurşun (2101834) | Training & Analysis: Managed the training pipeline, hyperparameter optimization, and Tensorboard visualization. |
evolution_side_by_side.gif: The agent's journey from chaos to mastery.
To encourage survival and effective combat, we implemented a custom Shaped Reward Function.
| Component | Weight | Purpose |
|---|---|---|
| Living Bonus | +0.05 | Incentivizes maximizing episode duration. |
| Health Delta | +0.10 | Penalizes damage heavily; encourages dodging. |
| Ammo Penalty | -0.03 | Discourages "spray and pray"; forces precision. |
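Below is a minimal sketch of how this shaping could be applied on top of the scenario's base reward. The function and variable names (`shape_reward`, `prev_health`, etc.) are illustrative rather than the project's actual identifiers; only the weights come from the table above.

```python
LIVING_BONUS = 0.05   # granted every step the agent stays alive
HEALTH_WEIGHT = 0.10  # scales the change in health between steps
AMMO_PENALTY = 0.03   # charged per round of ammo spent

def shape_reward(base_reward, prev_health, curr_health, prev_ammo, curr_ammo):
    """Combine the scenario's base reward with the shaping terms above."""
    reward = base_reward + LIVING_BONUS                      # reward staying alive
    reward += HEALTH_WEIGHT * (curr_health - prev_health)    # losing health is costly
    reward -= AMMO_PENALTY * max(0, prev_ammo - curr_ammo)   # wasted shots are penalized
    return reward
```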
We utilized Proximal Policy Optimization (PPO) due to its proven stability in continuous and discrete control tasks from visual inputs.
- Input: `(100, 160, 1)` Grayscale Tensor (Raw Pixels)
- Backbone: CNN Feature Extractor (Conv2d Layers + ReLU)
- Action Space: `Discrete(3)` (Turn Left, Turn Right, Attack)
- Hyperparameters (wired into the training sketch after this list):
  - Learning Rate: `1e-4` (tuned for stability)
  - Batch Size: `256`
  - Gamma: `0.99`
  - Steps: `2M`
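A rough sketch of how these settings plug into Stable-Baselines3 is shown below. The `DoomLikeEnv` class is only a stand-in that exposes the same observation and action spaces; in the real project it is replaced by the VizDoom scenario wrapper, and the log directory and save path are assumptions.

```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

class DoomLikeEnv(gym.Env):
    """Stand-in with the project's spaces; the real env wraps a VizDoom scenario."""
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Box(0, 255, (100, 160, 1), np.uint8)
        self.action_space = gym.spaces.Discrete(3)  # turn left, turn right, attack

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        return self.observation_space.sample(), 0.0, False, False, {}

env = VecFrameStack(DummyVecEnv([DoomLikeEnv]), n_stack=4)  # 4-frame stacking

model = PPO(
    "CnnPolicy",               # CNN feature extractor (Conv2d + ReLU)
    env,
    learning_rate=1e-4,        # tuned for stability
    batch_size=256,
    gamma=0.99,
    tensorboard_log="./runs",  # log directory name is an assumption
    verbose=1,
)
model.learn(total_timesteps=2_000_000)  # 2M steps
model.save("ppo_doom")                  # save path is an assumption
```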
The following graphs demonstrate the agent's learning progress over 2 Million Timesteps.
Analysis: This graph illustrates the agent's overall performance. The blue line shows the raw reward per episode, which is highly volatile due to the random spawning of enemies. The orange line (Moving Average) reveals the true trend:
- 0 - 1M Steps: The agent is in the "Exploration" phase, struggling to find a winning strategy. Rewards are low.
- 1M - 2M Steps: A sharp increase indicates the "Exploitation" phase. The agent has learned that Aligning + Shooting yields positive reinforcement.
- Significance: The steady climb proves the PPO algorithm successfully optimized the policy against the custom reward function.
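For reference, a curve like the orange moving average can be produced from SB3's `Monitor` logs roughly as follows; the CSV path and the 100-episode window are assumptions, not necessarily the exact values behind Figure 1.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes episodes were recorded with SB3's Monitor wrapper; the first line
# of monitor.csv is a JSON header, hence skiprows=1.
df = pd.read_csv("monitor.csv", skiprows=1)

raw = df["r"]                              # reward per episode
smoothed = raw.rolling(window=100).mean()  # 100-episode moving average

plt.plot(raw, alpha=0.3, label="Raw episode reward")
plt.plot(smoothed, label="Moving average")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.show()
```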
Analysis: This metric tracks how many frames the agent survived before dying or winning.
- Correlation: Notice how this graph mirrors Figure 1. As the agent gets better at killing enemies (higher reward), it also lives longer.
- Validation: This confirms the agent isn't "gaming" the system by finding a quick-suicide loop to avoid penalties. It is genuinely surviving the onslaught.
Analysis: This is the raw internal metric (`rollout/ep_rew_mean`) logged directly by Stable-Baselines3 during training.
- Purpose: It serves as a verification of the custom plots.
- Insight: The curve is smoother here because SB3 applies internal smoothing. It clearly documents the final convergence at a reward of approximately 30, matching our custom analysis.
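If needed, the same `rollout/ep_rew_mean` series can be read back out of the TensorBoard event files, for example with the snippet below; the run directory name is an assumption.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load the event file written by Stable-Baselines3 (run directory is an assumption).
acc = EventAccumulator("./runs/PPO_1")
acc.Reload()

for event in acc.Scalars("rollout/ep_rew_mean"):
    print(event.step, event.value)
```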
The Problem: Initially, the agent would spin continuously to locate enemies but refused to fire. It settled on a strategy of optimizing the "Living Bonus" by passively navigating, rather than risking engagement.
The Solution: We reshaped the reward structure to make passive survival impossible:
- Increased Health Weight (`0.10`): Taking damage became too expensive to ignore.
- Result: The agent realized that the only way to preserve health was to eliminate the threat (the enemies) before they could fire, forcing it to transition from passive spinning to aggressive shooting.
The Problem: The agent struggled to detect and hit enemies at a distance. Due to the low resolution (100x160), distant monsters appeared as barely distinguishable clusters of pixels against the background, causing the agent to miss frequently.
The Solution:
- Frame Stacking (Motion Perception): We implemented `VecFrameStack` (stacking 4 sequential frames); see the conceptual sketch after this list.
- Impact: Instead of relying on a single static blurry image, the agent perceives motion. This allows it to track the trajectory of distant, moving enemies effectively, even when they are just a few pixels large.
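Conceptually, the stacking can be pictured as below: keep the last four grayscale frames and concatenate them along the channel axis. This is a toy illustration with blank placeholder frames, not the project's wrapper code.

```python
import numpy as np
from collections import deque

# Keep the 4 most recent (100, 160, 1) grayscale frames.
frames = deque(maxlen=4)
for _ in range(4):
    frames.append(np.zeros((100, 160, 1), dtype=np.uint8))  # placeholder frames

# Concatenate along the channel axis: the CNN sees a (100, 160, 4) observation,
# so a distant enemy's change in position between frames becomes visible.
stacked = np.concatenate(frames, axis=-1)
print(stacked.shape)  # (100, 160, 4)
```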



