Training your drone for terrestrial search & rescue (SAR) missions using PPO (Proximal Policy Optimization).
An AI doesn't "want" to explore. It only wants to maximize its Reward. We need to create a mathematical definition of "Success."
Modify your CaveExplorerAgent.cs file. Update the OnActionReceived method with this logic:
public override void OnActionReceived(ActionBuffers actions)
{
    // ... (Keep movement code from Day 1) ...
    float moveForward = actions.ContinuousActions[0];

    // REWARD 1: Velocity matching
    // We want the agent to move forward (Z-axis) relative to itself,
    // but we only reward it if it is actually moving in that direction.
    // The dot product gives the component of velocity along the drone's nose.
    float speed = Vector3.Dot(rBody.velocity, transform.forward);

    // Small reward every step for moving forward
    if (speed > 0.1f)
    {
        AddReward(0.01f * speed);
    }

    // Time penalty (forces efficiency)
    AddReward(-0.0005f);
}
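To get a feel for the numbers before training, the per-step reward can be reproduced outside Unity. The following is a minimal Python sketch that mirrors the C# logic above; the function name `step_reward` and its parameter names are made up for illustration, and only the constants (0.1, 0.01, 0.0005) come from the code.

```python
def step_reward(forward_speed: float,
                speed_threshold: float = 0.1,
                speed_scale: float = 0.01,
                time_penalty: float = 0.0005) -> float:
    """Reward for one decision step, mirroring the C# snippet.

    forward_speed stands in for Vector3.Dot(rBody.velocity, transform.forward),
    i.e. the velocity component along the drone's own forward axis.
    """
    reward = 0.0
    # Forward-motion bonus, paid only above the threshold.
    if forward_speed > speed_threshold:
        reward += speed_scale * forward_speed
    # Time penalty, paid on every step regardless of motion.
    reward -= time_penalty
    return reward


if __name__ == "__main__":
    # Hovering and flying backwards both cost the time penalty;
    # only forward flight above the threshold earns a net positive reward.
    for v in (-3.0, 0.0, 0.05, 0.2, 2.0):
        print(f"speed={v:5.2f}  reward={step_reward(v):+.4f}")
```

Note that the bonus grows linearly with speed, so under this definition alone a faster drone always earns more per step.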
We need to tell ML-Agents how to learn (Hyperparameters).
trainer_config.yaml:

behaviors:
  CaveExplorer:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
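Several of these values only make sense in relation to each other. The sketch below is rough arithmetic on the numbers from the config, not ML-Agents code. It assumes the usual PPO reading that the policy is updated each time `buffer_size` experiences have been collected, and it uses the common rule of thumb 1 / (1 - gamma) as the "effective horizon" of the discount factor. The variable names are made up for illustration.

```python
# Values copied from trainer_config.yaml above.
batch_size = 1024
buffer_size = 10240
num_epoch = 3
max_steps = 500_000
gamma = 0.99

# Each update splits the buffer into minibatches of batch_size.
minibatches_per_epoch = buffer_size // batch_size

# The buffer is passed over num_epoch times per update.
gradient_steps_per_update = minibatches_per_epoch * num_epoch

# Approximate number of policy updates in the whole run
# (assumes one update per full buffer).
approx_policy_updates = max_steps // buffer_size

# Rule-of-thumb number of future steps that still matter to the agent.
effective_horizon = 1.0 / (1.0 - gamma)

print("minibatches per epoch:    ", minibatches_per_epoch)
print("gradient steps per update:", gradient_steps_per_update)
print("approx. policy updates:   ", approx_policy_updates)
print("effective horizon (steps):", round(effective_horizon))
```

If `buffer_size` is not a multiple of `batch_size`, the last minibatch is uneven, so keeping it a multiple (as here, 10 x 1024) is a sensible default.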
One drone takes hours to learn. 50 drones learn in minutes.
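A back-of-the-envelope way to see why parallel agents help: every policy update needs a full experience buffer, and all agents sharing the same behavior fill it together. The Python sketch below only does that arithmetic, using `buffer_size: 10240` from the config. The helper name `steps_per_agent_to_fill_buffer` is hypothetical, and real wall-clock speedup also depends on physics and rendering cost, which this ignores.

```python
import math


def steps_per_agent_to_fill_buffer(num_agents: int,
                                   buffer_size: int = 10240) -> int:
    """Steps each agent must take before one policy update can happen,
    assuming all agents contribute experiences at the same rate."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return math.ceil(buffer_size / num_agents)


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(f"{n:2d} agent(s): "
              f"{steps_per_agent_to_fill_buffer(n)} steps each per update")
```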
1. Create an empty object named Environment.
2. Put Cave_Segment_01 and DroneAgent inside this object.
3. Make Environment a Prefab (drag it into the Project window).
4. Place multiple copies of the Environment prefab into the scene.

This is the moment of truth.
In a terminal, run conda activate agents (or equivalent), then start the trainer:
mlagents-learn trainer_config.yaml --run-id=CaveRun1
You should see the Unity Logo and "Listening on port 5004".
Press Play in the Unity Editor.
Observation: Initially, the drones will spin and crash randomly. This is exploration. After ~5-10 minutes, you should see them starting to hover and move forward steadily.
While it trains, we watch the metrics.
tensorboard --logdir results
Open your browser to http://localhost:6006.
In real-world SAR, "optimal" isn't just speed; it's also safety. Does your reward function encourage the drone to fly recklessly fast? Try increasing the wall-crash penalty to make it more cautious.
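One way to reason about this question before retraining is to compare episode returns by hand. The Python sketch below reuses the per-step reward constants from OnActionReceived and adds a one-off crash penalty. Everything else is an assumption made up for illustration: the two flight profiles (a fast drone at 5 m/s that crashes after 100 steps, a cautious one at 2 m/s that survives 200 steps) and the penalty values 1.0 and 2.0.

```python
def episode_return(speed: float, steps: int,
                   crashed: bool = False,
                   crash_penalty: float = 0.0) -> float:
    """Total reward for an episode flown at constant forward speed.

    Per-step reward follows the tutorial: 0.01 * speed when speed > 0.1,
    minus a 0.0005 time penalty. crash_penalty is a hypothetical one-off
    deduction applied if the episode ends in a wall collision.
    """
    per_step = -0.0005
    if speed > 0.1:
        per_step += 0.01 * speed
    total = per_step * steps
    if crashed:
        total -= crash_penalty
    return total


if __name__ == "__main__":
    cautious = episode_return(speed=2.0, steps=200)
    for penalty in (1.0, 2.0):
        reckless = episode_return(speed=5.0, steps=100,
                                  crashed=True, crash_penalty=penalty)
        better = "reckless" if reckless > cautious else "cautious"
        print(f"penalty={penalty}: reckless={reckless:.2f}, "
              f"cautious={cautious:.2f} -> {better} flight pays more")
```

With these made-up profiles, the smaller penalty still leaves reckless flight slightly ahead and the larger one reverses the ranking. The crash penalty has to outweigh the extra speed reward a fast drone collects before it crashes.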