Day 02 The Logic of Survival
Module 2 • Reward Engineering

The Brain in the Labyrinth

Training your drone for terrestrial search & rescue (SAR) missions using PPO (Proximal Policy Optimization).

Protocol 1: Defining Motivation

An AI doesn't "want" to explore. It only wants to maximize its Reward. We need to create a mathematical definition of "Success."

Reward Checklist

  • Positive (+): Moving forward along the tube's axis.
  • Negative (-): Collision with walls (instant fail).
  • Negative (-): Existential penalty (a small cost on every step, which encourages speed).

Step 1.1: Update the Helper Script

Modify your CaveExplorerAgent.cs file. Update the OnActionReceived method with this logic:

public override void OnActionReceived(ActionBuffers actions)
{
    // ... (Keep movement code from Day 1) ...
    float moveForward = actions.ContinuousActions[0];

    // REWARD 1: Forward progress
    // Project the velocity onto the drone's own forward (local Z) axis.
    // The result is positive when flying forward and negative when drifting backward.
    // Note: this matches "along the tube's axis" only while the drone faces down the tube.
    float speed = Vector3.Dot(rBody.velocity, transform.forward);

    // Small reward every step, scaled by forward speed, once it exceeds a small threshold
    if (speed > 0.1f) {
        AddReward(0.01f * speed);
    }

    // Time Penalty (forces efficiency)
    AddReward(-0.0005f);
}
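
To get a feel for the numbers before training, the per-step reward can be reproduced outside Unity. The following is a minimal Python sketch of the same arithmetic as the C# logic above; the function names step_reward and dot are made up for this illustration and are not part of ML-Agents.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def step_reward(forward_speed, speed_threshold=0.1,
                speed_scale=0.01, time_penalty=0.0005):
    """Reward for one step, using the constants from OnActionReceived."""
    reward = 0.0
    # Forward-progress reward only applies above the speed threshold
    if forward_speed > speed_threshold:
        reward += speed_scale * forward_speed
    # The time penalty applies on every step, moving or not
    reward -= time_penalty
    return reward


# Forward speed is the velocity projected onto the drone's forward axis
velocity = (0.0, 0.0, 2.0)   # illustrative: 2 units/s straight ahead
forward = (0.0, 0.0, 1.0)
speed = dot(velocity, forward)

print("flying forward:", step_reward(speed))
print("hovering:      ", step_reward(0.0))
print("flying backward:", step_reward(-2.0))
```

Hovering and flying backward both cost the drone the time penalty on every step, while any speed above the threshold earns more than the penalty takes away. That gap is what pushes the policy toward steady forward flight.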

Protocol 2: The Training Configuration

ML-Agents needs a set of hyperparameters that tell it how to learn.

Step 2.1: Create YAML Config

  1. In your Project Root (outside the Assets folder), create a file named trainer_config.yaml.
  2. Paste the following PPO configuration:
behaviors:
  CaveExplorer:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
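
A few of these values only make sense in relation to each other. The Python sketch below is a rough sanity check rather than anything ML-Agents runs itself: it copies the numbers from the config above into a plain dict and derives how often the policy is updated. The summarize function is a hypothetical helper, and the "effective horizon" of 1 / (1 - gamma) is a common rule of thumb, not an ML-Agents setting.

```python
# Values copied by hand from trainer_config.yaml above
config = {
    "batch_size": 1024,
    "buffer_size": 10240,
    "num_epoch": 3,
    "max_steps": 500000,
    "gamma": 0.99,
}


def summarize(cfg):
    """Derive a few quantities that follow from the PPO settings."""
    minibatches = cfg["buffer_size"] // cfg["batch_size"]
    return {
        # An integer ratio means the buffer splits evenly into minibatches
        "buffer_is_multiple_of_batch":
            cfg["buffer_size"] % cfg["batch_size"] == 0,
        "minibatches_per_epoch": minibatches,
        # Each filled buffer is passed over num_epoch times
        "gradient_steps_per_update": minibatches * cfg["num_epoch"],
        # How many times the buffer fills before max_steps is reached
        "approx_policy_updates": cfg["max_steps"] // cfg["buffer_size"],
        # Rule of thumb: rewards beyond this many steps barely count
        "effective_horizon_steps": round(1.0 / (1.0 - cfg["gamma"])),
    }


for name, value in summarize(config).items():
    print(f"{name}: {value}")
```

With these settings the buffer fills roughly 48 times over the whole run, so if the reward curve is still noisy near the end, raising max_steps is usually the first thing to try.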

Protocol 3: Parallel Training

One drone training alone can take hours to learn. Many drones training in parallel gather experience far faster.

Step 3.1: Create the Training Set

  1. Create an Empty GameObject named Environment.
  2. Drag your Cave_Segment_01 and DroneAgent inside this object.
  3. Make Environment a Prefab (drag it into the Project window).
  4. Delete the original Environment object from the scene.
  5. Drag 10 to 20 instances of the Environment prefab into the scene.
  6. Space them out so they don't overlap (e.g., (0,0,0), (0,0,50), (0,0,100)...).
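
If you would rather compute the positions than eyeball them, the spacing from step 6 can be sketched in a few lines of Python. This is only an illustration of the layout arithmetic: environment_positions is a made-up helper, the spacing of 50 comes from the example coordinates above, and the columns option is an added assumption for laying copies out in a grid instead of one long line.

```python
def environment_positions(count, spacing=50.0, columns=1):
    """Return (x, y, z) positions for `count` environment copies.

    With columns=1 the copies line up along Z, as in the example
    coordinates. More columns spread them along X as well.
    """
    if count < 0 or columns < 1:
        raise ValueError("count must be >= 0 and columns >= 1")
    positions = []
    for i in range(count):
        x = (i % columns) * spacing
        z = (i // columns) * spacing
        positions.append((x, 0.0, z))
    return positions


# Ten copies in a single line along Z
for pos in environment_positions(10):
    print(pos)
```

Pick a spacing larger than the longest dimension of your cave segment, so that neighboring copies never overlap.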

Protocol 4: Execute Training

This is the moment of truth.

Step 4.1: Start the Python Trainer

  1. Open your terminal / Anaconda Prompt.
  2. Activate your environment: conda activate agents (or equivalent).
  3. Navigate to your project root folder.
  4. Run:
mlagents-learn trainer_config.yaml --run-id=CaveRun1

You should see the Unity Logo and "Listening on port 5004".
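
You will probably run this command many times. In recent ML-Agents releases, reusing a run ID that already has results makes mlagents-learn refuse to start unless you pass --resume (continue that run) or --force (overwrite it). Check mlagents-learn --help for your installed version. The Python sketch below only assembles the command line so the options are easy to see. build_train_command is a hypothetical helper, not part of ML-Agents.

```python
import shlex


def build_train_command(config_path, run_id, resume=False, force=False):
    """Assemble the mlagents-learn argument list for a training run."""
    command = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if resume:
        # Continue training from the checkpoint saved under this run ID
        command.append("--resume")
    if force:
        # Overwrite any existing results stored under this run ID
        command.append("--force")
    return command


first_run = build_train_command("trainer_config.yaml", "CaveRun1")
continued = build_train_command("trainer_config.yaml", "CaveRun1", resume=True)

# Print the commands as you would type them in the terminal
print(shlex.join(first_run))
print(shlex.join(continued))
```

A simple habit that avoids the problem altogether is to bump the run ID (CaveRun2, CaveRun3, ...) whenever you change the reward function, so each experiment keeps its own results.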

Step 4.2: Start Unity

Press Play in the Unity Editor.

Observation: Initially, the drones will spin and crash randomly. This is exploration. After ~5-10 minutes, you should see them starting to hover and move forward steadily.


Protocol 5: Monitoring (TensorBoard)

While it trains, we watch the metrics.

  1. Open a second terminal window.
  2. Activate your environment.
  3. Run:
tensorboard --logdir results

Open your browser to http://localhost:6006.

What to look for:

  • Cumulative Reward: Should go UP.
  • Episode Length: Should go UP (drones survive longer).
  • Policy Loss: Its magnitude should go DOWN (the policy changes less as it settles on a strategy).
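
Training curves are noisy, so "up" and "down" refer to the smoothed trend, not to every individual point. The Python sketch below shows one simple way to judge a trend: smooth the series with a moving average and compare the start to the end. The numbers are hypothetical, not output from a real run, and moving_average and trend are illustrative helpers, not part of TensorBoard.

```python
def moving_average(values, window=3):
    """Smooth a metric series with a simple sliding-window mean."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def trend(values, window=3, tolerance=1e-9):
    """Return 'up', 'down', or 'flat' based on the smoothed series."""
    smoothed = moving_average(values, window)
    change = smoothed[-1] - smoothed[0]
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


# Hypothetical cumulative reward values, one per summary
cumulative_reward = [-0.8, -0.5, -0.6, 0.1, 0.4, 0.9, 1.3]
print("Cumulative Reward trend:", trend(cumulative_reward))
```

TensorBoard's smoothing slider does a similar job visually. Turn it up if the raw curve is too jagged to read.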

Geographic Inquiry: Optimal Paths

In real-world SAR, "optimal" isn't just about speed; it's also about safety. Does your reward function encourage the drone to fly recklessly fast? Try increasing the wall-crash penalty to make it more cautious.
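
One way to reason about this question is to ask how many steps of forward flight it takes to "earn back" a single crash. The Python sketch below uses the reward constants from Protocol 1; the crash penalty values (1.0 and 5.0) and the speed of 2.0 are assumptions chosen for illustration, and steps_to_offset_crash is a made-up helper.

```python
def steps_to_offset_crash(crash_penalty, forward_speed,
                          speed_threshold=0.1, speed_scale=0.01,
                          time_penalty=0.0005):
    """Steps of steady forward flight needed to cancel one crash penalty."""
    if forward_speed <= speed_threshold:
        # No forward reward is paid at or below the threshold
        return float("inf")
    net_per_step = speed_scale * forward_speed - time_penalty
    if net_per_step <= 0:
        return float("inf")
    return crash_penalty / net_per_step


# Assumed values: compare a mild and a harsh crash penalty at the same speed
for penalty in (1.0, 5.0):
    steps = steps_to_offset_crash(penalty, forward_speed=2.0)
    print(f"penalty {penalty}: about {steps:.0f} steps to break even")
```

The fewer steps it takes to break even, the more a fast, risky policy pays off. Raising the crash penalty, or lowering the speed reward, shifts the balance toward caution.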

Proceed to Day 3: Deployment →