RL Algorithms

🥷 Agents

All agents implement IAgent and operate on VectorN state vectors. Tabular agents work with discrete state indices; deep agents use neural networks built on the shared NeuralNetwork class.


Tabular Agents

Constructor: (int numStates, int numActions, Func<VectorN, int> stateMapper)

Q-Learning

Class: QLearning

Off-policy TD(0) with max-Q target. The classic model-free control algorithm.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)
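
As a sketch, the tabular constructor can be wired to GridWorld (whose StateToIndex and StateCount are documented below) through the experiment builder shown under Training Curves. Class and method names are taken from this page; the named-argument labels are assumptions:

```csharp
// Sketch: tabular Q-Learning on a 5×5 GridWorld.
// StateToIndex maps the VectorN observation to a flat state index.
var env = new GridWorld(rows: 5, cols: 5);
var agent = new QLearning(
    numStates: env.StateCount,
    numActions: 4,                        // Up, Right, Down, Left
    stateMapper: s => env.StateToIndex(s));
var policy = new EpsilonGreedy();         // defaults: ε = 1.0, decay 0.995
var result = RLExperiment.For(env)
    .WithAgent(agent).WithPolicy(policy).WithEpisodes(500).Run();
```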

SARSA

Class: SARSA

On-policy TD(0) — updates Q using the action actually taken, not the greedy action.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)

Monte Carlo Control

Class: MonteCarloControl

First-visit Monte Carlo with full episode returns. Updates only at end of episode.

Hyperparameters:

  • LearningRate (0.1)
  • Gamma (0.99)

Exposes: GetQTable() → Matrix, GetQValues(state) → VectorN


Value-Based (Deep) Agents

These agents require a call to Initialize(observationSize, actionSize, seed) before training and learn from a replay buffer.

DQN

Class: DQN

Deep Q-Network with target network and experience replay.

Hyperparameters:

  • HiddenLayers ([64, 64])
  • Activation (ReLU)
  • LearningRate (0.001)
  • Gamma (0.99)
  • TargetUpdateFrequency (100)
  • BatchSize (32)
  • MinBufferSize (64)

Exposes: GetQValues(state) → VectorN
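
A minimal setup sketch for a deep agent. The hyperparameter defaults above apply unless overridden; how the replay buffer is attached to the training loop is not shown here and is assumed to be handled by the experiment runner:

```csharp
// Sketch: DQN on CartPole. Initialize must be called before training.
var env = new CartPole();
var agent = new DQN();
agent.Initialize(env.ObservationSize, env.ActionSize, seed: 42);
var buffer = new ReplayBuffer(capacity: 10_000, seed: 42); // capacity is illustrative
```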

Double DQN

Class: DoubleDQN

Extends DQN — selects actions with the online network, evaluates with the target network. Reduces overestimation bias.

Hyperparameters: Same as DQN

Dueling DQN

Class: DuelingDQN

Three-network architecture: a shared trunk feeding separate value and advantage streams, combined as Q(s,a) = V(s) + A(s,a) − Ā(s), where Ā(s) is the advantage averaged over actions.

Hyperparameters:

  • SharedLayers ([64])
  • ValueLayers ([32])
  • AdvantageLayers ([32])
  • Activation (ReLU)
  • LearningRate (0.001)
  • Gamma (0.99)
  • TargetUpdateFrequency (100)
  • BatchSize (32)
  • MinBufferSize (64)
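
The mean-advantage combination can be illustrated in isolation; DuelingQ below is a hypothetical helper, not part of the library:

```csharp
using System.Linq;

// Dueling combination: Q(s,a) = V(s) + A(s,a) − mean_a A(s,a).
// Subtracting the mean advantage keeps the V/A decomposition identifiable.
static double DuelingQ(double value, double[] advantages, int action) =>
    value + advantages[action] - advantages.Average();

// V(s) = 1.0, A(s,·) = [2, 0, −2] → mean advantage 0, so Q(s, 0) = 3.0
double q = DuelingQ(1.0, new[] { 2.0, 0.0, -2.0 }, action: 0);
```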

Policy Gradient Agents

These agents learn a parameterised policy directly (no Q-table) and support entropy bonuses and baselines.

REINFORCE

Class: REINFORCE

Monte Carlo policy gradient with optional baseline. Updates at end of each episode.

Hyperparameters:

  • HiddenLayers ([32])
  • Activation (ReLU)
  • LearningRate (0.01)
  • Gamma (0.99)
  • UseBaseline (true)

Exposes: GetActionProbabilities(state) → VectorN

Actor-Critic (A2C)

Class: ActorCritic

Per-step TD actor-critic with entropy bonus for exploration.

Hyperparameters:

  • ActorHiddenLayers ([32])
  • CriticHiddenLayers ([32])
  • Activation (ReLU)
  • ActorLearningRate (0.001)
  • CriticLearningRate (0.002)
  • Gamma (0.99)
  • EntropyCoefficient (0.01)

Exposes: GetActionProbabilities(state) → VectorN, GetValue(state) → double

PPO (Proximal Policy Optimization)

Class: PPO

Clipped surrogate objective with GAE (Generalized Advantage Estimation) and mini-batch updates.

Hyperparameters:

  • ActorHiddenLayers ([64, 64])
  • CriticHiddenLayers ([64, 64])
  • Activation (ReLU)
  • ActorLearningRate (0.0003)
  • CriticLearningRate (0.001)
  • Gamma (0.99)
  • Lambda (0.95) — GAE λ
  • ClipEpsilon (0.2)
  • UpdateEpochs (4)
  • MiniBatchSize (64)
  • EntropyCoefficient (0.01)

Exposes: GetActionProbabilities(state) → VectorN, GetValue(state) → double
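
The GAE recursion that the Lambda parameter controls can be sketched as follows. ComputeGae is a hypothetical helper, assuming a single rollout with no terminal states:

```csharp
// GAE: δ_t = r_t + γ·V(s_{t+1}) − V(s_t);  A_t = δ_t + γλ·A_{t+1},
// accumulated backwards through the rollout.
static double[] ComputeGae(double[] rewards, double[] values, double nextValue,
                           double gamma = 0.99, double lambda = 0.95)
{
    var advantages = new double[rewards.Length];
    double gae = 0.0;
    for (int t = rewards.Length - 1; t >= 0; t--)
    {
        double vNext = (t == rewards.Length - 1) ? nextValue : values[t + 1];
        double delta = rewards[t] + gamma * vNext - values[t];
        gae = delta + gamma * lambda * gae;
        advantages[t] = gae;
    }
    return advantages;
}
```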


Continuous Control Agents

DDPG (Deep Deterministic Policy Gradient)

Class: DDPG

Actor-critic for continuous action spaces. Deterministic policy with Polyak-averaged target networks.

Hyperparameters:

  • ActorHiddenLayers ([64, 64])
  • CriticHiddenLayers ([64, 64])
  • Activation (ReLU)
  • ActorLearningRate (1e-4)
  • CriticLearningRate (1e-3)
  • Gamma (0.99)
  • Tau (0.005) — soft update rate
  • BatchSize (64)
  • MinBufferSize (128)
  • ActionScale (1.0)

🎁 Policies (Exploration Strategies)

All policies implement IPolicy with SelectAction(VectorN qValues) for discrete and SelectAction(VectorN mean, VectorN std) for continuous. Each supports Decay() per episode and Clone().

Epsilon-Greedy

Class: EpsilonGreedy

Random action with probability ε, greedy otherwise. Standard discrete exploration.

Properties:

  • Epsilon (1.0) — current exploration rate
  • EpsilonMin (0.01) — minimum ε
  • EpsilonDecay (0.995) — multiplicative decay per episode
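
The schedule is multiplicative with a floor; calling Decay() once per episode behaves like this sketch:

```csharp
using System;

// Multiplicative ε decay with the documented defaults and floor.
double epsilon = 1.0;
const double epsilonMin = 0.01, epsilonDecay = 0.995;
for (int episode = 0; episode < 500; episode++)
    epsilon = Math.Max(epsilonMin, epsilon * epsilonDecay);
// After 500 episodes, ε = 0.995^500 ≈ 0.08, still above the 0.01 floor.
```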

Softmax Policy

Class: SoftmaxPolicy

Boltzmann exploration — action probabilities proportional to e^{Q(s,a)/τ}.

Properties:

  • Temperature (1.0) β€” current temperature
  • TemperatureMin (0.01)
  • TemperatureDecay (0.995)

Gaussian Noise

Class: GaussianNoise

Additive i.i.d. Gaussian noise for continuous action exploration: a = μ + N(0, σ²).

Properties:

  • Sigma (0.1) — noise standard deviation
  • SigmaMin (0.01)
  • SigmaDecay (0.999)

Ornstein-Uhlenbeck Process

Class: OrnsteinUhlenbeck

Temporally correlated noise for smooth continuous exploration. Mean-reverting process.

Properties:

  • Theta (0.15) — mean reversion rate
  • Mu (0.0) — long-run mean
  • Sigma (0.2) — volatility
  • SigmaMin (0.01)
  • SigmaDecay (1.0)
  • Dt (1.0) — time step
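
One update step of the process, as an Euler–Maruyama sketch with the default parameters (the library's internal state handling may differ):

```csharp
using System;

// OU step: x ← x + θ(μ − x)·dt + σ·√dt·N(0,1). Temporal correlation comes
// from updating x in place across steps rather than sampling i.i.d. noise.
var rng = new Random(0);
double x = 0.0, theta = 0.15, mu = 0.0, sigma = 0.2, dt = 1.0;
// Box–Muller transform for a standard normal sample
double u1 = 1.0 - rng.NextDouble(), u2 = rng.NextDouble();
double n = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
x += theta * (mu - x) * dt + sigma * Math.Sqrt(dt) * n;
```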

🏠 Environments

All environments implement IEnvironment with Reset(seed?), Step(int) (discrete), and Step(VectorN) (continuous). State is always VectorN.

GridWorld

Class: GridWorld

2D grid navigation. Start at (0,0), goal at (rows-1, cols-1). Reward: +1 at goal, -0.01 per step.

  • Constructor: (rows, cols, walls?, goal?)
  • Observation: VectorN([row, col]), size 2
  • Actions: 4 (Up, Right, Down, Left)
  • Discrete: ✅

Exposes: StateToIndex(state) → flat index, StateCount → total states

CartPole

Class: CartPole

Classic control: balance a pole on a cart. The episode ends if the pole angle exceeds 12° or the cart leaves bounds.

  • Constructor: (none)
  • Observation: VectorN([x, ẋ, θ, θ̇]), size 4
  • Actions: 2 (Left, Right)
  • Discrete: ✅

MountainCar

Class: MountainCar

Drive an underpowered car up a hill. Requires momentum from both sides.

  • Constructor: (none)
  • Observation: VectorN([position, velocity]), size 2
  • Actions: 3 (Left, Neutral, Right)
  • Discrete: ✅
  • Settable: MaxSteps (200)

Pendulum

Class: Pendulum

Swing up and balance an inverted pendulum with continuous torque.

  • Constructor: (none)
  • Observation: VectorN([cos θ, sin θ, θ̇]), size 3
  • Actions: 1 (continuous torque)
  • Discrete: ❌
  • Settable: MaxTorque (2.0), MaxSpeed (8.0), MaxSteps (200)

Plume (GIS)

Class: PlumeEnvironment

RL environment wrapping the GIS Gaussian plume simulator. The agent takes mitigation actions (deploy barriers, activate filters) to minimise population exposure over a transient plume scenario.

  • Constructor: (emissionRate, windSpeed, windDirection, stackHeight, sourcePosition, grid, timeFrame, stability?)
  • Observation: VectorN([maxConc, meanConc, exposedFrac, windSpeed, windDirX, windDirY, emissionRate, normTime]), size 8
  • Actions: 6 (None, BarrierN, BarrierE, BarrierS, BarrierW, ActivateFilter)
  • Discrete: ✅
  • Settable: Threshold (1e-6), ActionCost (0.05), BarrierEfficiency (0.4), FilterEfficiency (0.5)

Exposes: MaxSteps → number of time steps per episode

See the GIS-RL Integration section below for full usage.


🔁 Replay Buffers

ReplayBuffer

Class: ReplayBuffer

Uniform random sampling from a circular buffer of transitions.

Constructor: (int capacity, int? seed = null)
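
Conceptually the buffer behaves like the sketch below. MiniBuffer and Transition are illustrative types, not the library's:

```csharp
using System;

// Minimal circular buffer with uniform random sampling.
// The library stores (s, a, r, s', done) transitions; this record stands in.
record Transition(int State, int Action, double Reward, int Next, bool Done);

class MiniBuffer
{
    private readonly Transition[] _data;
    private readonly Random _rng;
    private int _next, _count;

    public MiniBuffer(int capacity, int? seed = null)
    {
        _data = new Transition[capacity];
        _rng = seed is int s ? new Random(s) : new Random();
    }

    // Overwrites the oldest transition once the buffer is full.
    public void Add(Transition t)
    {
        _data[_next] = t;
        _next = (_next + 1) % _data.Length;
        _count = Math.Min(_count + 1, _data.Length);
    }

    // Uniform sampling with replacement from the filled portion.
    public Transition[] Sample(int batchSize)
    {
        var batch = new Transition[batchSize];
        for (int i = 0; i < batchSize; i++)
            batch[i] = _data[_rng.Next(_count)];
        return batch;
    }
}
```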

PrioritizedReplayBuffer

Class: PrioritizedReplayBuffer

Prioritized experience replay — transitions with higher TD error are sampled more frequently.

Constructor: (int capacity, double alpha = 0.6, double beta = 0.4, int? seed = null)
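
The alpha and beta parameters control the sampling distribution and the importance-sampling correction. A sketch of the standard formulas, outside the library:

```csharp
using System;
using System.Linq;

// P(i) = p_i^α / Σ_j p_j^α ;  w_i = (N·P(i))^(−β), normalised by max w.
double[] priorities = { 2.0, 1.0, 0.5 };  // e.g. |TD error| per transition
double alpha = 0.6, beta = 0.4;
double[] scaled = priorities.Select(p => Math.Pow(p, alpha)).ToArray();
double total = scaled.Sum();
double[] probs = scaled.Select(p => p / total).ToArray();
double[] weights = probs
    .Select(p => Math.Pow(priorities.Length * p, -beta)).ToArray();
double wMax = weights.Max();
double[] normalised = weights.Select(w => w / wMax).ToArray();
```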


📊 Diagnostics & Visualisation

All diagnostic tools return List<Serie> or Matrix — ready for the existing export/charting pipeline.

Training Curves

Built into TrainingResult (returned by every experiment):

var result = RLExperiment.For(env).WithAgent(agent).WithPolicy(policy).WithEpisodes(500).Run();

List<Serie> returns = result.ReturnCurve; // (episode, return)
List<Serie> losses = result.LossCurve; // (step, loss)
List<Serie> exploration = result.ExplorationCurve; // (episode, ε)

Q-Value Heatmap

Visualise Q-values for tabular agents on GridWorld:

// Max Q per state β€” one value per cell
List<Serie> heatmap = QValueHeatmap.GetMaxQValues(agent, env);

// Q-values for a specific action
List<Serie> actionQ = QValueHeatmap.GetQValuesForAction(agent, env, action: 1);

// Full Q-table as Matrix (states Γ— actions)
Matrix qTable = QValueHeatmap.GetQTableMatrix(agent);

// Greedy policy β€” best action per state
List<Serie> policy = QValueHeatmap.GetGreedyPolicy(agent, env);

Policy Visualisation

Visualise action probabilities for any agent:

// Action probabilities per state (one List<Serie> per action)
var probs = PolicyVisualizer.GetActionProbabilities(env,
    state => agent.GetActionProbabilities(state));

// Softmax probabilities from Q-values (tabular agents)
var softmax = PolicyVisualizer.GetSoftmaxProbabilities(agent, env, temperature: 0.5);

// Policy entropy per state (high = uncertain, low = deterministic)
var entropy = PolicyVisualizer.GetPolicyEntropy(env,
    state => agent.GetActionProbabilities(state));

// Dominant action per state
var dominant = PolicyVisualizer.GetDominantAction(env,
    state => agent.GetActionProbabilities(state));

Value Function Surface

Sample V(s) or max-Q(s) across continuous state spaces:

// 1D slice (e.g. cart position, other dims fixed)
var vFn = ValueFunctionSurface.ValueFunction(actorCriticAgent);
List<Serie> curve = ValueFunctionSurface.Sample1D(
    s => vFn(new VectorN(new[] { s[0], 0, 0, 0 })),
    min: -2.4, max: 2.4, numPoints: 100);

// 2D surface (e.g. position × velocity)
var maxQ = ValueFunctionSurface.MaxQFunction(dqnAgent);
var surface = ValueFunctionSurface.Sample2D(maxQ,
    minX: -1.2, maxX: 0.6, numX: 50,
    minY: -0.07, maxY: 0.07, numY: 50);

// Convert to Matrix for heatmap rendering
Matrix heatmap = surface.ToMatrix();

Available extractors:

  • ValueFunctionSurface.MaxQFunction(DQN): DQN, DoubleDQN → max_a Q(s,a)
  • ValueFunctionSurface.MaxQFunction(DuelingDQN): DuelingDQN → max_a Q(s,a)
  • ValueFunctionSurface.ValueFunction(ActorCritic): A2C → V(s)
  • ValueFunctionSurface.ValueFunction(PPO): PPO → V(s)

📐 Interfaces Summary

  • IAgent (RL agent contract): SelectAction, SelectContinuousAction, Train, TrainBatch, EndEpisode, Clone, Get/SetHyperParameters
  • IEnvironment (environment contract): Reset, Step(int), Step(VectorN), ObservationSize, ActionSize, IsDiscrete
  • IPolicy (exploration policy): SelectAction(VectorN qValues), SelectAction(VectorN mean, VectorN std), Decay, Clone
  • IReplayBuffer (experience storage): Add, Sample, Count, Capacity