n this example we take the previous example of Bagging Duelling Double Q Learning and add ‘noisy layers’ to the estimatin of advantage. Adding noise within the network is an alterantive to epsilon-greedy exploration. As the performance of the system improves the amount of noise will reduce in the network, as noise is a parameter in the network which is optimised through the normal network optimization procedure.

## A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day

• Weekend arrival numbers are 50% average arrival numbers

• Weekday arrival numbers are 120% average arrival numbers

• Distribution of inter-arrival time is inverse exponential

• Average length of stay is 7 days (default)

• Distribution of length of stay is inverse exponential

• The RL agent may request a change in bed numbers once a day (default)

• The allowed bed change requests are -20, -10, 0, 10, 20

• Bed changes take 2 days to occur (default)

• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed

• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without bed.

## Reinforcement learning introduction

### RL involves:

- Trial and error search
- Receiving and maximising reward (often delayed)
- Linking state -> action -> reward
- Must be able to sense something of their environment
- Involves uncertainty in sensing and linking action to reward
- Learning -> improved choice of actions over time
- All models find a way to balance best predicted action vs. exploration

### Elements of RL

*Environment*: all observable and unobservable information relevant to us*Observation*: sensing the environment*State*: the perceived (or perceivable) environment*Agent*: senses environment, decides on action, receives and monitors rewards*Action*: may be discrete (e.g. turn left) or continuous (accelerator pedal)*Policy*(how to link state to action; often based on probabilities)*Reward signal*: aim is to accumulate maximum reward over time*Value function*of a state: prediction of likely/possible long-term reward*Q*: prediction of likely/possible long-term reward of an*action**Advantage*: The difference in Q between actions in a given state (sums to zero for all actions)*Model*(optional): a simulation of the environment

### Types of model

*Model-based*: have model of environment (e.g. a board game)*Model-free*: used when environment not fully known*Policy-based*: identify best policy directly*Value-based*: estimate value of a decision*Off-policy*: can learn from historic data from other agent*On-policy*: requires active learning from current decisions

## Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state *value*. The other component models the *advantage*, the difference in Q between different actions (the mean value is subtracted from all values, so that the advtantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved + the discounted maximum Q value (the best action taken) of next best action, where gamma is the discount rate.

For a presenation on Q, see https://youtu.be/o22P5rCAAEQ

## Key DQN components

## General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time) and as the network tries to correct for errors it can become unstable and suddenly lose signficiant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

## Training networks

Double DQN contains two networks. This ammendment, from simple DQN, is to decouple training of Q for current state and target Q derived from next state which are closely correlated when comparing input features.

The *policy network* is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best *action* (best predicted Q) is taken from the *policy network*, but the *policy network* is updated using the predicted Q value of the next state from the *target network* (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the *policy network*, but the the *policy network* is updated to better predict the Q value of that action from the *target network*. The *policy network* is copied across to the *target network* every *n* steps (e.g. 1000).

## Bagging (Bootstrap Aggregation)

Each network is trained from the same memory, but have different starting weights and are trained on different bootstrap samples from that memory. In this example actions are chosen randomly from each of the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to have some measure of uncertainty of action by looking at the distribution of actions recommended from the different nets. Bagging may also be used to aid exploration during stages where networks are providing different suggested action.

## Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propogation.

A modification to normal nosiy layers is to use layers with ‘factorized gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

## References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:160204621 http://arxiv.org/abs/1602.04621

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:170610295 http://arxiv.org/abs/1706.10295

Code for the nosiy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

## Code structure

## Code

## Results

Adding noisy layers maintains or improves the performance of bagging DDQN, but without the need for epsilon-greedy exploration.