RL Algorithms

🥷 Agents

All agents implement IAgent and operate on VectorN state vectors. Tabular agents work with discrete state indices; deep agents use neural networks built on the shared NeuralNetwork class.


Tabular Agents

Constructor: (int numStates, int numActions, Func<VectorN, int> stateMapper)

Q-Learning

Class: QLearning

Off-policy TD(0) with max-Q target. The classic model-free control algorithm.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)
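The max-Q target described above can be sketched in a few lines. This is a minimal illustration of the standard tabular Q-learning rule, not the library's QLearning implementation. The class name `TabularQSketch`, the method `QLearningUpdate`, and the plain `double[,]` table are assumptions made for this example; the default values mirror the hyperparameters listed above.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class TabularQSketch
{
    // Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    public static void QLearningUpdate(
        double[,] q, int state, int action, double reward,
        int nextState, bool done,
        double learningRate = 0.1, double gamma = 0.99)
    {
        // Terminal transitions do not bootstrap from the next state.
        double maxNext = 0.0;
        if (!done)
        {
            maxNext = double.NegativeInfinity;
            for (int a = 0; a < q.GetLength(1); a++)
                maxNext = Math.Max(maxNext, q[nextState, a]);
        }

        double target = reward + gamma * maxNext;
        q[state, action] += learningRate * (target - q[state, action]);
    }
}
```

Because the target uses the maximum over next actions rather than the action the behaviour policy actually takes, the update is off-policy.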

SARSA

Class: SARSA

On-policy TD(0): updates Q using the action actually taken, not the greedy action.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)
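For reference, the textbook form of the SARSA update is shown below. It is assumed here that $\alpha$ corresponds to LearningRate and $\gamma$ to Gamma; $a_{t+1}$ is the action the behaviour policy actually selects in $s_{t+1}$.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
```

Replacing $Q(s_{t+1}, a_{t+1})$ with $\max_{a'} Q(s_{t+1}, a')$ gives the off-policy Q-learning target.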

Monte Carlo Control

Class: MonteCarloControl

First-visit Monte Carlo with full episode returns. Updates only at end of episode.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)
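A minimal sketch of first-visit Monte Carlo control on a finished episode follows. It assumes a constant-step-size update toward the return (suggested by the LearningRate hyperparameter, but not confirmed by the source); `MonteCarloSketch` and its methods are hypothetical names, not the library's API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class MonteCarloSketch
{
    // Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards in time.
    public static double[] DiscountedReturns(IReadOnlyList<double> rewards, double gamma = 0.99)
    {
        var returns = new double[rewards.Count];
        double g = 0.0;
        for (int t = rewards.Count - 1; t >= 0; t--)
        {
            g = rewards[t] + gamma * g;
            returns[t] = g;
        }
        return returns;
    }

    // Uses only the first occurrence of each (state, action) pair in the episode.
    public static void FirstVisitUpdate(
        double[,] q,
        IReadOnlyList<(int State, int Action, double Reward)> episode,
        double learningRate = 0.1, double gamma = 0.99)
    {
        var rewards = new double[episode.Count];
        for (int t = 0; t < episode.Count; t++)
            rewards[t] = episode[t].Reward;

        double[] returns = DiscountedReturns(rewards, gamma);
        var seen = new HashSet<(int, int)>();

        for (int t = 0; t < episode.Count; t++)
        {
            int s = episode[t].State;
            int a = episode[t].Action;
            if (!seen.Add((s, a))) continue; // later visits are skipped
            q[s, a] += learningRate * (returns[t] - q[s, a]);
        }
    }
}
```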

Exposes: GetQTable() → Matrix, GetQValues(state) → VectorN


Value-Based (Deep) Agents

Require Initialize(observationSize, actionSize, seed) and a replay buffer.

DQN

Class: DQN

Deep Q-Network with target network and experience replay.

Hyperparameters:

  • HiddenLayers ([64, 64])
  • Activation (ReLU)
  • LearningRate (0.001)
  • Gamma (0.99)
  • TargetUpdateFrequency (100)
  • BatchSize (32)
  • MinBufferSize (64)

Exposes: GetQValues(state) → VectorN

Double DQN

Class: DoubleDQN

Extends DQN: selects actions with the online network and evaluates them with the target network, which reduces overestimation bias.

Hyperparameters: Same as DQN
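The difference between the two targets is easiest to see side by side. The sketch below works on plain `double[]` Q-value vectors for the next state; `DoubleDqnSketch` and its method names are hypothetical and do not reflect how the library's DQN or DoubleDQN classes are structured internally.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class DoubleDqnSketch
{
    public static int ArgMax(double[] values)
    {
        int best = 0;
        for (int i = 1; i < values.Length; i++)
            if (values[i] > values[best]) best = i;
        return best;
    }

    // DQN: r + gamma * max_a Q_target(s', a)
    public static double DqnTarget(double reward, double[] targetNext, bool done, double gamma = 0.99)
    {
        if (done) return reward;
        return reward + gamma * targetNext[ArgMax(targetNext)];
    }

    // Double DQN: r + gamma * Q_target(s', argmax_a Q_online(s', a))
    public static double DoubleDqnTarget(
        double reward, double[] onlineNext, double[] targetNext, bool done, double gamma = 0.99)
    {
        if (done) return reward;
        int selected = ArgMax(onlineNext);            // selection: online network
        return reward + gamma * targetNext[selected]; // evaluation: target network
    }
}
```

When the two networks disagree about the best action, the Double DQN target is never larger than the DQN target, which is where the reduction in overestimation comes from.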

Dueling DQN

Class: DuelingDQN

Three-network architecture: a shared network feeds a value stream and an advantage stream, which are combined as $Q(s,a) = V(s) + A(s,a) - \overline{A}(s)$, where $\overline{A}(s)$ is the mean advantage over actions.

Hyperparameters:

  • SharedLayers ([64])
  • ValueLayers ([32])
  • AdvantageLayers ([32])
  • Activation (ReLU)
  • LearningRate (0.001)
  • Gamma (0.99)
  • TargetUpdateFrequency (100)
  • BatchSize (32)
  • MinBufferSize (64)
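The aggregation step of the dueling architecture can be sketched independently of any network code. The example assumes the subtracted term is the mean advantage over actions; `DuelingSketch.Combine` is a hypothetical name used only here.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class DuelingSketch
{
    // Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
    public static double[] Combine(double value, double[] advantages)
    {
        double mean = 0.0;
        foreach (double a in advantages) mean += a;
        mean /= advantages.Length;

        var q = new double[advantages.Length];
        for (int i = 0; i < q.Length; i++)
            q[i] = value + advantages[i] - mean;
        return q;
    }
}
```

Subtracting the mean makes the decomposition identifiable: the average of the resulting Q-values equals V(s), so a constant cannot be shifted freely between the two streams.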

Policy Gradient Agents

Learn a policy directly (no Q-table). Support entropy bonuses and baselines.

REINFORCE

Class: REINFORCE

Monte Carlo policy gradient with optional baseline. Updates at end of each episode.

Hyperparameters:

  • HiddenLayers ([32])
  • Activation (ReLU)
  • LearningRate (0.01)
  • Gamma (0.99)
  • UseBaseline (true)

Exposes: GetActionProbabilities(state) → VectorN

Actor-Critic (A2C)

Class: ActorCritic

Per-step TD actor-critic with entropy bonus for exploration.

Hyperparameters:

  • ActorHiddenLayers ([32])
  • CriticHiddenLayers ([32])
  • Activation (ReLU)
  • ActorLearningRate (0.001)
  • CriticLearningRate (0.002)
  • Gamma (0.99)
  • EntropyCoefficient (0.01)

Exposes: GetActionProbabilities(state) → VectorN, GetValue(state) → double

PPO (Proximal Policy Optimization)

Class: PPO

Clipped surrogate objective with GAE (Generalized Advantage Estimation) and mini-batch updates.

Hyperparameters:

  • ActorHiddenLayers ([64, 64])
  • CriticHiddenLayers ([64, 64])
  • Activation (ReLU)
  • ActorLearningRate (0.0003)
  • CriticLearningRate (0.001)
  • Gamma (0.99)
  • Lambda (0.95): GAE λ
  • ClipEpsilon (0.2)
  • UpdateEpochs (4)
  • MiniBatchSize (64)
  • EntropyCoefficient (0.01)

Exposes: GetActionProbabilities(state) → VectorN, GetValue(state) → double
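The two ingredients named in the PPO description, GAE and the clipped surrogate, can be sketched as plain functions. This is an illustration under stated assumptions: a single trajectory with no episode boundary in the middle, a `values` array that carries one extra bootstrap entry at the end, and hypothetical names (`PpoSketch`, `Gae`, `ClippedSurrogate`) that are not part of the library.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class PpoSketch
{
    // delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    // A_t     = delta_t + gamma * lambda * A_{t+1}
    // values must have rewards.Length + 1 entries; the last one is the
    // bootstrap value of the final state (0 if that state is terminal).
    public static double[] Gae(double[] rewards, double[] values,
                               double gamma = 0.99, double lambda = 0.95)
    {
        if (values.Length != rewards.Length + 1)
            throw new ArgumentException("values needs one more entry than rewards.");

        var advantages = new double[rewards.Length];
        double running = 0.0;
        for (int t = rewards.Length - 1; t >= 0; t--)
        {
            double delta = rewards[t] + gamma * values[t + 1] - values[t];
            running = delta + gamma * lambda * running;
            advantages[t] = running;
        }
        return advantages;
    }

    // Per-sample objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    // where ratio = pi_new(a|s) / pi_old(a|s).
    public static double ClippedSurrogate(double ratio, double advantage, double clipEpsilon = 0.2)
    {
        double clipped = Math.Max(1.0 - clipEpsilon, Math.Min(1.0 + clipEpsilon, ratio));
        return Math.Min(ratio * advantage, clipped * advantage);
    }
}
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + ClipEpsilon, so a single batch cannot push the policy arbitrarily far from the one that collected the data.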


Continuous Control Agents

DDPG (Deep Deterministic Policy Gradient)

Class: DDPG

Actor-critic for continuous action spaces. Deterministic policy with Polyak-averaged target networks.

Hyperparameters:

  • ActorHiddenLayers ([64, 64])
  • CriticHiddenLayers ([64, 64])
  • Activation (ReLU)
  • ActorLearningRate (1e-4)
  • CriticLearningRate (1e-3)
  • Gamma (0.99)
  • Tau (0.005): soft update rate
  • BatchSize (64)
  • MinBufferSize (128)
  • ActionScale (1.0)
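Polyak averaging, controlled by Tau, blends a small fraction of the online weights into the target weights at every update. The sketch below operates on flat `double[]` parameter arrays, which is an assumption made for brevity; `SoftUpdateSketch` is a hypothetical name, and the library's networks may store their weights differently.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class SoftUpdateSketch
{
    // target <- tau * online + (1 - tau) * target, element by element.
    public static void SoftUpdate(double[] target, double[] online, double tau = 0.005)
    {
        if (target.Length != online.Length)
            throw new ArgumentException("Parameter arrays must have the same length.");

        for (int i = 0; i < target.Length; i++)
            target[i] = tau * online[i] + (1.0 - tau) * target[i];
    }
}
```

With tau = 0.005 the target network moves 0.5% of the way toward the online network per update, which keeps the critic's bootstrap targets slowly varying.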

🎁 Policies (Exploration Strategies)​

All policies implement IPolicy with SelectAction(VectorN qValues) for discrete and SelectAction(VectorN mean, VectorN std) for continuous. Each supports Decay() per episode and Clone().

Epsilon-Greedy

Class: EpsilonGreedy

Random action with probability ε, greedy otherwise. Standard discrete exploration.

Properties:

  • Epsilon (1.0): current exploration rate
  • EpsilonMin (0.01): minimum ε
  • EpsilonDecay (0.995): multiplicative decay per episode
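How the three properties interact can be shown with a small stand-alone class. This is a sketch, not the library's EpsilonGreedy: the class name `EpsilonGreedySketch`, the `double[]` input, and the clamping of the decayed value at EpsilonMin are assumptions based on the descriptions above.

```csharp
using System;

// Hypothetical class for illustration only; not part of CSharpNumerics.
public sealed class EpsilonGreedySketch
{
    private readonly Random _rng;

    public double Epsilon { get; private set; }
    public double EpsilonMin { get; }
    public double EpsilonDecay { get; }

    public EpsilonGreedySketch(double epsilon = 1.0, double epsilonMin = 0.01,
                               double epsilonDecay = 0.995, int? seed = null)
    {
        Epsilon = epsilon;
        EpsilonMin = epsilonMin;
        EpsilonDecay = epsilonDecay;
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public int SelectAction(double[] qValues)
    {
        // Explore: uniform random action with probability Epsilon.
        if (_rng.NextDouble() < Epsilon)
            return _rng.Next(qValues.Length);

        // Exploit: index of the largest Q-value (first one wins ties).
        int best = 0;
        for (int a = 1; a < qValues.Length; a++)
            if (qValues[a] > qValues[best]) best = a;
        return best;
    }

    // Called once per episode: multiplicative decay with a floor.
    public void Decay() => Epsilon = Math.Max(EpsilonMin, Epsilon * EpsilonDecay);
}
```

Under these assumptions and the default values, ε falls from 1.0 to the 0.01 floor after roughly 920 episodes, since 0.995^n = 0.01 gives n ≈ 919.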

Softmax Policy

Class: SoftmaxPolicy

Boltzmann exploration: action probabilities are proportional to $e^{Q(s,a)/\tau}$, where $\tau$ is the temperature.

Properties:

  • Temperature (1.0): current temperature
  • TemperatureMin (0.01)
  • TemperatureDecay (0.995)
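A numerically stable way to turn Q-values into Boltzmann probabilities is to subtract the maximum before exponentiating, which matters at low temperatures such as the 0.01 floor. The sketch below is illustrative; `SoftmaxSketch.Probabilities` is a hypothetical name and says nothing about how SoftmaxPolicy is implemented.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class SoftmaxSketch
{
    // p(a) = exp(Q(a) / tau) / sum_b exp(Q(b) / tau)
    public static double[] Probabilities(double[] qValues, double temperature = 1.0)
    {
        if (temperature <= 0.0)
            throw new ArgumentOutOfRangeException(nameof(temperature), "Temperature must be positive.");

        // Subtracting the maximum leaves the result unchanged but avoids overflow.
        double max = qValues[0];
        for (int i = 1; i < qValues.Length; i++)
            max = Math.Max(max, qValues[i]);

        var probs = new double[qValues.Length];
        double sum = 0.0;
        for (int i = 0; i < probs.Length; i++)
        {
            probs[i] = Math.Exp((qValues[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.Length; i++)
            probs[i] /= sum;
        return probs;
    }
}
```

High temperatures flatten the distribution toward uniform; as the temperature approaches zero the policy approaches greedy selection.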

Gaussian Noise

Class: GaussianNoise

Additive i.i.d. Gaussian noise for continuous action exploration: $a = \mu + \mathcal{N}(0, \sigma^2)$.

Properties:

  • Sigma (0.1): noise standard deviation
  • SigmaMin (0.01)
  • SigmaDecay (0.999)

Ornstein-Uhlenbeck Process

Class: OrnsteinUhlenbeck

Temporally correlated noise for smooth continuous exploration. Mean-reverting process.

Properties:

  • Theta (0.15): mean reversion rate
  • Mu (0.0): long-run mean
  • Sigma (0.2): volatility
  • SigmaMin (0.01)
  • SigmaDecay (1.0)
  • Dt (1.0): time step
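One common discretisation of the Ornstein-Uhlenbeck process is the Euler-Maruyama step shown below. Whether the library's OrnsteinUhlenbeck class uses exactly this scheme (in particular the square-root scaling of Dt) is an assumption; `OrnsteinUhlenbeckSketch`, `Step`, and `NextGaussian` are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class OrnsteinUhlenbeckSketch
{
    // Standard normal sample via the Box-Muller transform.
    public static double NextGaussian(Random rng)
    {
        double u1 = 1.0 - rng.NextDouble(); // in (0, 1], so Log never sees 0
        double u2 = rng.NextDouble();
        return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }

    // x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * z,  z ~ N(0, 1)
    public static double Step(double x, double z,
                              double theta = 0.15, double mu = 0.0,
                              double sigma = 0.2, double dt = 1.0)
    {
        return x + theta * (mu - x) * dt + sigma * Math.Sqrt(dt) * z;
    }
}
```

In use, the noise state is carried between steps, for example `x = OrnsteinUhlenbeckSketch.Step(x, OrnsteinUhlenbeckSketch.NextGaussian(rng))`, and added to the actor's output. Because each sample depends on the previous one, consecutive perturbations are correlated rather than independent.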

🏠 Environments​

All environments implement IEnvironment with Reset(seed?), Step(int) (discrete), and Step(VectorN) (continuous). State is always VectorN.

GridWorld

Class: GridWorld

2D grid navigation. Start at (0,0), goal at (rows-1, cols-1). Reward: +1 at goal, -0.01 per step.

| Property | Value |
|---|---|
| Constructor | (rows, cols, walls?, goal?) |
| Observation | VectorN([row, col]), size 2 |
| Actions | 4 (Up, Right, Down, Left) |
| Discrete | ✅ |

Exposes: StateToIndex(state) → flat index, StateCount → total states

CartPole

Class: CartPole

Classic control: balance a pole on a cart. The episode ends if the pole angle exceeds 12° or the cart leaves the track bounds.

| Property | Value |
|---|---|
| Constructor | (none) |
| Observation | VectorN([x, ẋ, θ, θ̇]), size 4 |
| Actions | 2 (Left, Right) |
| Discrete | ✅ |

MountainCar

Class: MountainCar

Drive an underpowered car up a hill. Requires momentum from both sides.

| Property | Value |
|---|---|
| Constructor | (none) |
| Observation | VectorN([position, velocity]), size 2 |
| Actions | 3 (Left, Neutral, Right) |
| Discrete | ✅ |
| Settable | MaxSteps (200) |

Pendulum

Class: Pendulum

Swing up and balance an inverted pendulum with continuous torque.

| Property | Value |
|---|---|
| Constructor | (none) |
| Observation | VectorN([cos θ, sin θ, θ̇]), size 3 |
| Actions | 1 (continuous torque) |
| Discrete | ❌ |
| Settable | MaxTorque (2.0), MaxSpeed (8.0), MaxSteps (200) |

Plume (GIS)

Class: PlumeEnvironment

RL environment wrapping the GIS Gaussian plume simulator. The agent takes mitigation actions (deploy barriers, activate filters) to minimise population exposure over a transient plume scenario.

| Property | Value |
|---|---|
| Constructor | (emissionRate, windSpeed, windDirection, stackHeight, sourcePosition, grid, timeFrame, stability?) |
| Observation | VectorN([maxConc, meanConc, exposedFrac, windSpeed, windDirX, windDirY, emissionRate, normTime]), size 8 |
| Actions | 6 (None, BarrierN, BarrierE, BarrierS, BarrierW, ActivateFilter) |
| Discrete | ✅ |
| Settable | Threshold (1e-6), ActionCost (0.05), BarrierEfficiency (0.4), FilterEfficiency (0.5) |

Exposes: MaxSteps → number of time steps per episode

See the [GIS-RL Integration](../../Simulation Engines/Geo Engine.md#-gis-rl-integration) section for full usage.

Flight (Game Engine)

Class: FlightEnv

RL environment wrapping the 6DOF flight dynamics engine. The agent flies an aircraft through waypoints while maintaining stable flight.

| Property | Value |
|---|---|
| Constructor | (config?, waypoints?, dt?, maxSteps?, waypointRadius?, initAltitude?, initAirspeed?) |
| Observation | VectorN([altitude, airspeed, α, β, roll, pitch, yaw, p, q, r, dist, headingErr]), size 12 |
| Actions | 4 (continuous: throttle, pitch, roll, yaw) |
| Discrete | ❌ |
| Settable | All constructor params |

Reward: positive for approaching waypoints, bonus on arrival, penalty for stall/crash/excessive bank. Small fuel bonus for lower throttle.

using CSharpNumerics.ML.ReinforcementLearning.Environments;

var env = new FlightEnv(initAltitude: 1000, initAirspeed: 60);
var (state, info) = env.Reset(seed: 42);

// Continuous action: [throttle, pitch, roll, yaw]
var action = new VectorN(new[] { 0.7, 0.1, 0.0, 0.0 });
var (nextState, reward, done, _, stepInfo) = env.Step(action);

Dogfight (Game Engine)

Class: DogfightEnv

Multi-agent pursuit-evasion with two aircraft. Formulated from the pursuer's perspective; the evader follows a scripted evasion policy.

| Property | Value |
|---|---|
| Constructor | (config?, dt?, maxSteps?, killRadius?, killAspect?, initSeparation?) |
| Observation | VectorN([pursuer(6), relEvader(6), rates(3), engagement(3)]), size 18 |
| Actions | 4 (continuous: throttle, pitch, roll, yaw) |
| Discrete | ❌ |

Reward: positive for closing range and maintaining firing solution; large bonus for intercept (range < killRadius and aspect < killAspect).

var env = new DogfightEnv(killRadius: 300, killAspect: 0.52);
var (state, _) = env.Reset();
// Train with PPO or DDPG for pursuit-evasion behaviour

Fluid Navigation (Game Engine)

Class: FluidNavigationEnv

Agent navigates a point-mass through a 2D wind/current field to reach a target while minimizing energy. The wind field is a live GameFluidSolver2D simulation.

| Property | Value |
|---|---|
| Constructor | (gridSize?, dt?, maxSteps?, targetRadius?, agentMass?, maxThrust?) |
| Observation | VectorN([agentX, agentY, agentVx, agentVy, windU, windV, targetDx, targetDy]), size 8 |
| Actions | 2 (continuous: thrustX, thrustY) |
| Discrete | ❌ |

Reward: positive for approaching target, bonus on arrival, penalty per step and proportional to thrust magnitude (fuel cost).

var env = new FluidNavigationEnv(gridSize: 32, maxThrust: 2.0);
var (state, _) = env.Reset(seed: 7);
var action = new VectorN(new[] { 0.5, 0.3 });
var (next, reward, done, _, _) = env.Step(action);

🔁 Replay Buffers

ReplayBuffer

Class: ReplayBuffer

Uniform random sampling from a circular buffer of transitions.

Constructor: (int capacity, int? seed = null)
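A circular buffer with uniform sampling can be sketched as follows. This is a generic stand-in, not the library's ReplayBuffer: the class name `RingReplaySketch<T>`, the generic item type, and sampling with replacement are assumptions. Only the `(capacity, seed)` constructor shape and the Add, Sample, Count, and Capacity members follow the documentation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class for illustration only; not part of CSharpNumerics.
public sealed class RingReplaySketch<T>
{
    private readonly T[] _items;
    private readonly Random _rng;
    private int _next; // slot that the next Add call writes to

    public int Count { get; private set; }
    public int Capacity => _items.Length;

    public RingReplaySketch(int capacity, int? seed = null)
    {
        if (capacity <= 0)
            throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    // Once the buffer is full, the oldest transition is overwritten.
    public void Add(T item)
    {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;
        if (Count < _items.Length) Count++;
    }

    // Uniform sampling with replacement over the filled part of the buffer.
    public List<T> Sample(int batchSize)
    {
        if (Count == 0)
            throw new InvalidOperationException("Cannot sample from an empty buffer.");

        var batch = new List<T>(batchSize);
        for (int i = 0; i < batchSize; i++)
            batch.Add(_items[_rng.Next(Count)]);
        return batch;
    }
}
```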

PrioritizedReplayBuffer

Class: PrioritizedReplayBuffer

Prioritized experience replay: transitions with higher TD-error are sampled more frequently.

Constructor: (int capacity, double alpha = 0.6, double beta = 0.4, int? seed = null)
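The roles of alpha and beta in the usual formulation of prioritized replay can be sketched as two small functions. It is assumed, not confirmed by the source, that the constructor parameters carry their conventional meaning: alpha shapes the sampling distribution and beta controls the importance-sampling correction. `PrioritySketch` and its methods are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration only; not part of CSharpNumerics.
public static class PrioritySketch
{
    // P(i) = p_i^alpha / sum_k p_k^alpha
    public static double[] SamplingProbabilities(double[] priorities, double alpha = 0.6)
    {
        var probs = new double[priorities.Length];
        double sum = 0.0;
        for (int i = 0; i < probs.Length; i++)
        {
            probs[i] = Math.Pow(priorities[i], alpha);
            sum += probs[i];
        }
        for (int i = 0; i < probs.Length; i++)
            probs[i] /= sum;
        return probs;
    }

    // w_i = (N * P(i))^(-beta), normalised so the largest weight is 1.
    public static double[] ImportanceWeights(double[] probabilities, double beta = 0.4)
    {
        int n = probabilities.Length;
        var weights = new double[n];
        double max = 0.0;
        for (int i = 0; i < n; i++)
        {
            weights[i] = Math.Pow(n * probabilities[i], -beta);
            max = Math.Max(max, weights[i]);
        }
        for (int i = 0; i < n; i++)
            weights[i] /= max;
        return weights;
    }
}
```

Setting alpha to 0 recovers uniform sampling, and beta = 1 fully compensates for the non-uniform sampling in the loss.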


📊 Diagnostics & Visualisation

All diagnostic tools return List<Serie> or Matrix, ready for the existing export/charting pipeline.

Training Curves

Built into TrainingResult (returned by every experiment):

var result = RLExperiment.For(env)
    .WithAgent(agent)
    .WithPolicy(policy)
    .WithEpisodes(500)
    .Run();

List<Serie> returns = result.ReturnCurve;           // (episode, return)
List<Serie> losses = result.LossCurve;              // (step, loss)
List<Serie> exploration = result.ExplorationCurve;  // (episode, ε)

Q-Value Heatmap

Visualise Q-values for tabular agents on GridWorld:

// Max Q per state: one value per cell
List<Serie> heatmap = QValueHeatmap.GetMaxQValues(agent, env);

// Q-values for a specific action
List<Serie> actionQ = QValueHeatmap.GetQValuesForAction(agent, env, action: 1);

// Full Q-table as Matrix (states × actions)
Matrix qTable = QValueHeatmap.GetQTableMatrix(agent);

// Greedy policy: best action per state
List<Serie> policy = QValueHeatmap.GetGreedyPolicy(agent, env);

Policy Visualisation

Visualise action probabilities for any agent:

// Action probabilities per state (one List<Serie> per action)
var probs = PolicyVisualizer.GetActionProbabilities(env,
    state => agent.GetActionProbabilities(state));

// Softmax probabilities from Q-values (tabular agents)
var softmax = PolicyVisualizer.GetSoftmaxProbabilities(agent, env, temperature: 0.5);

// Policy entropy per state (high = uncertain, low = deterministic)
var entropy = PolicyVisualizer.GetPolicyEntropy(env,
    state => agent.GetActionProbabilities(state));

// Dominant action per state
var dominant = PolicyVisualizer.GetDominantAction(env,
    state => agent.GetActionProbabilities(state));

Value Function Surface

Sample V(s) or max-Q(s) across continuous state spaces:

// 1D slice (e.g. cart position, other dims fixed)
var vFn = ValueFunctionSurface.ValueFunction(actorCriticAgent);
List<Serie> curve = ValueFunctionSurface.Sample1D(
    s => vFn(new VectorN(new[] { s[0], 0, 0, 0 })),
    min: -2.4, max: 2.4, numPoints: 100);

// 2D surface (e.g. position × velocity)
var maxQ = ValueFunctionSurface.MaxQFunction(dqnAgent);
var surface = ValueFunctionSurface.Sample2D(maxQ,
    minX: -1.2, maxX: 0.6, numX: 50,
    minY: -0.07, maxY: 0.07, numY: 50);

// Convert to Matrix for heatmap rendering
Matrix heatmap = surface.ToMatrix();

Available extractors:

| Method | Agent type | Returns |
|---|---|---|
| ValueFunctionSurface.MaxQFunction(DQN) | DQN, DoubleDQN | $\max_a Q(s,a)$ |
| ValueFunctionSurface.MaxQFunction(DuelingDQN) | DuelingDQN | $\max_a Q(s,a)$ |
| ValueFunctionSurface.ValueFunction(ActorCritic) | A2C | $V(s)$ |
| ValueFunctionSurface.ValueFunction(PPO) | PPO | $V(s)$ |

Interfaces Summary

| Interface | Purpose | Key methods |
|---|---|---|
| IAgent | RL agent contract | SelectAction, SelectContinuousAction, Train, TrainBatch, EndEpisode, Clone, Get/SetHyperParameters |
| IEnvironment | Environment contract | Reset, Step(int), Step(VectorN), ObservationSize, ActionSize, IsDiscrete |
| IPolicy | Exploration policy | SelectAction(VectorN qValues), SelectAction(VectorN mean, VectorN std), Decay, Clone |
| IReplayBuffer | Experience storage | Add, Sample, Count, Capacity |