A classic Health Operations problem is where to locate ambulances between calls in order to minimise time from emergency call to ambulance on-scene. This is known as the Ambulance Location problem.

We have developed a simple simulation of this problem and have tested a range of deep reinforcement learning agents.

The paper describing the results is:

Allen M, Pearn K, Monks T. (2021) Developing an OpenAI Gym-compatible framework and simulation environment for testing Deep Reinforcement Learning agents solving the Ambulance Location Problem. arXiv:2101.04434 http://arxiv.org/abs/2101.04434

The Health Service Modelling Associates (HSMA) Programme is an initiative in which staff working in health, social care and policing organisations can access free online training in modelling, simulation and AI techniques targeted at their application areas. Staff are released for a day a week for 1 year. In the first 3 months, they receive 140 hours of live online training. They then pitch project ideas that use their new skills to tackle an issue of importance for their organisation. Selected projects receive 9 months of mentoring support from expert modelling and data science academics and practitioners. Previous HSMA projects have led to multi-million pound investments in new services, and significant career progression for those taking part.

The programme teaches skills in:

– Python and R programming (no prior knowledge of programming required)

– Discrete Event Simulation using SimPy for modelling pathway and queuing problems

– Network Analysis and System Dynamics for modelling whole systems

– Geographic modelling and visualisation techniques for dealing with location issues

– Cellular Automata and Agent Based Simulation using MESA for modelling behaviour

– Artificial Intelligence approaches, such as Machine Learning to teach machines to make decisions, and Natural Language Processing to extract information from free text data

– Forecasting methods to make predictions about the future

All the approaches taught are completely Free and Open Source.

HSMA 4 launches in October 2021 and, for the first time ever, is open to anyone working in health, social care or policing organisations anywhere in England. If you’re interested in the programme, you can find out more about it here: https://arc-swp.nihr.ac.uk/training-type/hsma/

In addition, if you are an existing Operational Research or Data Science academic or practitioner, and would be interested in accessing the training and then providing mentoring support in the programme, please also register your interest at the link above.

“Pay more attention to the time you got things wrong!”

In this example we take the previous example of Noisy Duelling Double Q Learning and add a Prioritised Replay memory. Using this type of memory, state/actions that have the highest loss (error) when training our policy network are sampled more frequently than state/actions that have low loss.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% of average arrival numbers
• Weekday arrival numbers are 120% of average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.
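As a quick check on the arrival settings listed above, five weekdays at 120% and two weekend days at 50% average out to the stated daily mean. The sketch below is illustrative only: the function name `daily_arrival_rate` and the Monday = 0 day indexing are assumptions, not taken from the simulation code.

```python
def daily_arrival_rate(day_of_week, mean_arrivals=50.0):
    """Expected arrivals for one day (assumed indexing: 0 = Monday ... 6 = Sunday)."""
    is_weekend = day_of_week >= 5
    # Multipliers come from the model description: 50% at weekends, 120% on weekdays
    multiplier = 0.5 if is_weekend else 1.2
    return mean_arrivals * multiplier


# Expected arrivals for each day of one week, and their mean
weekly_rates = [daily_arrival_rate(day) for day in range(7)]
weekly_mean = sum(weekly_rates) / 7
```

With the default of 50 arrivals per day this gives 60 on weekdays and 25 at weekends, and the weekly mean returns to 50.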

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy: how to link state to action (often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.
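The aggregation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the network code itself: the function name `duelling_q_values` and the example numbers are made up for the sketch.

```python
def duelling_q_values(state_value, advantages):
    """Combine a single state value with per-action advantages to give Q per action."""
    mean_advantage = sum(advantages) / len(advantages)
    # Subtract the mean so that the advantages sum to zero
    centred = [a - mean_advantage for a in advantages]
    return [state_value + a for a in centred]


# Illustrative numbers only: one state value and five raw advantage outputs
q_values = duelling_q_values(state_value=-10.0, advantages=[1.0, 3.0, 2.0, 0.0, -1.0])
```

Because the centred advantages sum to zero, the mean of the resulting Q values equals the state value.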

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.
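As a rough sketch of the Bellman update target described above: the function name, the default `gamma` of 0.99 and the `done` flag for terminal states are assumptions made for illustration.

```python
def bellman_target(reward, next_q_values, gamma=0.99, done=False):
    """Target Q for a state/action: immediate reward + discounted best Q of the next state."""
    if done:
        # No future reward after a terminal state (assumed convention)
        return reward
    return reward + gamma * max(next_q_values)


# Illustrative numbers only
target = bellman_target(reward=-5.0, next_q_values=[-3.0, -1.0, -2.0], gamma=0.9)
```

Here the best next-state Q is -1.0, so the target is -5.0 + 0.9 * -1.0.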

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting the ‘current’ understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.
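The two update styles mentioned above (infrequent copying, or gradual blending) might look like the following sketch. Weights are represented as plain lists of floats for simplicity; the function names, `tau` and `sync_every` are illustrative assumptions rather than names from the example code.

```python
def hard_update(policy_weights):
    """Infrequent update: copy the policy network weights across wholesale."""
    return list(policy_weights)


def soft_update(target_weights, policy_weights, tau=0.01):
    """Gradual update: move each target weight a small step towards the policy weight."""
    return [(1 - tau) * t + tau * p for t, p in zip(target_weights, policy_weights)]


def maybe_sync(step, target_weights, policy_weights, sync_every=1000):
    """Copy the policy weights only every `sync_every` steps; otherwise leave the target alone."""
    if step % sync_every == 0:
        return hard_update(policy_weights)
    return target_weights
```

Either way, the target network changes more slowly than the policy network, which is what damps the instability.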

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
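The split between selecting the action and valuing it can be sketched as below. The function name, argument names and default `gamma` are assumptions for illustration; the two lists stand in for the Q outputs of the policy and target networks for the next state.

```python
def double_dqn_target(reward, policy_next_q, target_next_q, gamma=0.99, done=False):
    """Double DQN target: policy net picks the next action, target net supplies its Q."""
    if done:
        return reward
    # Action selection uses the policy network's Q values for the next state
    best_action = max(range(len(policy_next_q)), key=lambda a: policy_next_q[a])
    # Action evaluation uses the target network's Q value for that same action
    return reward + gamma * target_next_q[best_action]


# Illustrative numbers: the policy net prefers action 1, which the target net values at 3.0
target = double_dqn_target(
    reward=1.0,
    policy_next_q=[1.0, 5.0, 2.0],
    target_next_q=[4.0, 3.0, 9.0],
    gamma=0.5,
)
```

Note that the target network's own highest value (9.0, for action 2) is not used, which is how the method avoids taking a maximum over noisy estimates from a single network.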

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.
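The outer-product construction described above can be illustrated with the standard library alone. This is a simplified sketch: the function name is hypothetical, and the published method also passes each vector through a scaling function before taking the outer product, which is omitted here.

```python
import random


def factorised_noise(n_inputs, n_outputs, rng):
    """Build an (n_outputs x n_inputs) noise matrix from two random vectors."""
    eps_in = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
    eps_out = [rng.gauss(0.0, 1.0) for _ in range(n_outputs)]
    # Outer product: one noise value per weight, built from only two vectors
    matrix = [[o * i for i in eps_in] for o in eps_out]
    return eps_in, eps_out, matrix


rng = random.Random(42)
eps_in, eps_out, noise = factorised_noise(n_inputs=4, n_outputs=3, rng=rng)

# Random numbers drawn with factorised noise versus one independent draw per weight
samples_factorised = len(eps_in) + len(eps_out)
samples_independent = len(eps_in) * len(eps_out)
```

For a layer with 4 inputs and 3 outputs this needs 7 random numbers rather than 12; the saving grows quickly with layer size.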

Prioritised replay

In standard DQN, samples are taken randomly from the memory (replay buffer). In prioritised replay, samples are taken in proportion to their loss when training the network; where the network has the greatest error in predicting the target value of a state/action, those samples will be sampled more frequently (which will reduce the error in the network until the sample is no longer prioritised). In other words, the training focuses more heavily on samples it gets most wrong, and spends less time training on samples that it can already predict accurately.

This priority may also be used as a weight for training the network, but this is not implemented here; we use loss just for sampling.

When we use the loss for priority we add a small value (1e-5) to the loss. This avoids any sample having zero priority (and never having a chance of being sampled). For frequency of sampling we also raise the loss to the power of ‘alpha’ (default value of 0.6). Smaller values of alpha will compress the differences between samples, making the priority weighting less significant in the frequency of sampling.
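The priority calculation above can be sketched as follows. The small constant (1e-5) and the default alpha (0.6) come from the text; the function names and the example losses are assumptions made for illustration, and a real memory would hold transitions rather than bare loss values.

```python
import random


def sampling_probabilities(losses, alpha=0.6, epsilon=1e-5):
    """Turn per-sample losses into sampling probabilities."""
    # The small constant stops any sample having zero priority
    priorities = [(loss + epsilon) ** alpha for loss in losses]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(losses, batch_size, rng, alpha=0.6):
    """Draw memory indices (with replacement) in proportion to priority."""
    probs = sampling_probabilities(losses, alpha=alpha)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


losses = [0.0, 0.1, 1.0, 10.0]
probs_default = sampling_probabilities(losses, alpha=0.6)
# alpha = 0 removes the effect of loss altogether, giving uniform sampling
probs_flat = sampling_probabilities(losses, alpha=0.0)
```

Setting alpha to 0 recovers the uniform sampling of a standard replay buffer, which shows how smaller alpha values flatten the differences between samples.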

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Prioritised replay: Schaul T, Quan J, Antonoglou I, et al. (2016) Prioritized Experience Replay. arXiv:1511.05952 http://arxiv.org/abs/1511.05952

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

In this example we take the previous example of Bagging Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network will reduce, as noise is a parameter in the network which is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% of average arrival numbers
• Weekday arrival numbers are 120% of average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.
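One way to picture a reward with a maximum of zero is sketched below. The exact penalty used in the simulation is not given here, so the equal one-point penalty per free bed and per patient without a bed is an assumption, as is the function name.

```python
def bed_reward(beds, patients):
    """Hypothetical reward: zero when beds match patients, negative otherwise."""
    free_beds = max(beds - patients, 0)
    patients_without_bed = max(patients - beds, 0)
    # Assumed equal weighting of the two kinds of mismatch
    return -(free_beds + patients_without_bed)
```

Under this assumption, 350 beds with 340 patients and 340 beds with 350 patients both score -10, and only an exact match scores the maximum of 0.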

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy: how to link state to action (often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.
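To make "discounted over time" concrete, the sketch below sums a sequence of rewards with each later reward scaled by a further factor of gamma. The function name and the default gamma are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward t steps ahead is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step is: reward + gamma * (return from the next step)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative numbers: -1 now, -2 next step, -4 the step after, with gamma = 0.5
example_return = discounted_return([-1.0, -2.0, -4.0], gamma=0.5)
```

In this example the return is -1 + 0.5 * -2 + 0.25 * -4 = -3.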

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.

The overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting the ‘current’ understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

All networks are trained from the same memory, but each has different starting weights and is trained on different bootstrap samples from that memory. In this example actions are chosen randomly from each of the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to give some measure of uncertainty of action by looking at the distribution of actions recommended by the different nets. Bagging may also be used to aid exploration during stages where networks are suggesting different actions.
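A bootstrap sample is simply a sample of the memory drawn with replacement, so each network sees a slightly different view of the same experience. The sketch below uses a list of integers as a stand-in for stored transitions; the function names, and the reading that one network is picked at random to supply each action, are assumptions for illustration.

```python
import random


def bootstrap_sample(memory, rng):
    """Draw len(memory) items from memory with replacement."""
    return [rng.choice(memory) for _ in memory]


def choose_network(n_networks, rng):
    """Pick which network supplies the next action (one reading of random choice)."""
    return rng.randrange(n_networks)


memory = list(range(100))  # stand-in for stored state/action/reward transitions
rng = random.Random(0)

# One bootstrap sample per network, here for five networks
samples = [bootstrap_sample(memory, rng) for _ in range(5)]
chosen = choose_network(5, rng)
```

Because sampling is with replacement, each sample will typically repeat some transitions and miss others, which is what keeps the networks from becoming identical.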

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.
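The effect of sigma can be seen in a stripped-down, single-output version of a noisy layer. This is only a sketch of the idea: the function name is hypothetical, the bias term is left out, and in a real layer `mu` and `sigma` would be trainable parameters rather than plain lists.

```python
import random


def noisy_linear(inputs, mu, sigma, rng):
    """One output of a noisy layer: each weight is mu + sigma * (fresh normal draw)."""
    output = 0.0
    for x, m, s in zip(inputs, mu, sigma):
        eps = rng.gauss(0.0, 1.0)
        output += (m + s * eps) * x
    return output


rng = random.Random(1)
# With sigma at zero the layer behaves like an ordinary linear layer
quiet = noisy_linear([1.0, 2.0], mu=[0.5, 0.25], sigma=[0.0, 0.0], rng=rng)
# With non-zero sigma the same inputs give a different output on each call
noisy = noisy_linear([1.0, 2.0], mu=[0.5, 0.25], sigma=[0.1, 0.1], rng=rng)
```

As training drives sigma towards zero for a weight, that weight's contribution becomes deterministic, which is how exploration fades without an explicit schedule.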

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:1602.04621 http://arxiv.org/abs/1602.04621

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

In this example we take the previous example of Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network will reduce, as noise is a parameter in the network which is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% of average arrival numbers
• Weekday arrival numbers are 120% of average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.
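Assuming "inverse exponential" in the list above refers to the usual exponential distribution, sampling inter-arrival times and lengths of stay might look like the sketch below. The function names are hypothetical; `random.expovariate` takes a rate (1 / mean), which is why the length of stay is passed as `1.0 / mean_los_days`.

```python
import random


def sample_inter_arrival_time(arrivals_per_day, rng):
    """Time in days until the next arrival, given an average arrival rate per day."""
    return rng.expovariate(arrivals_per_day)


def sample_length_of_stay(mean_los_days, rng):
    """Length of stay in days, with the given mean."""
    return rng.expovariate(1.0 / mean_los_days)


rng = random.Random(123)
gap = sample_inter_arrival_time(50.0, rng)
stays = [sample_length_of_stay(7.0, rng) for _ in range(20000)]
mean_stay = sum(stays) / len(stays)
```

With enough samples the mean length of stay settles close to the 7-day default, although individual stays vary widely.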

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy: how to link state to action (often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting the ‘current’ understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).
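The arrangement described above, where epsilon-greedy code stays in place but epsilon falls straight to zero after the fully random period, could be sketched like this. The names `epsilon_for_step`, `random_steps` and `choose_action` are assumptions for illustration.

```python
import random


def epsilon_for_step(step, random_steps):
    """Fully random actions during the initial period, then no epsilon exploration."""
    return 1.0 if step < random_steps else 0.0


def choose_action(q_values, epsilon, rng):
    """Standard epsilon-greedy choice over a list of Q values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the highest predicted Q
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
greedy_action = choose_action([0.1, 0.9, 0.3], epsilon_for_step(500, 100), rng)
```

Once epsilon is zero the choice is always greedy, so any remaining exploration has to come from the noise inside the network.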

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

In this example we take the previous example of Duelling Double Q Learning and add ‘bagging’ where we have multiple Duelling Double Q Learning Networks (each with their own target network).

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% of average arrival numbers
• Weekday arrival numbers are 120% of average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.
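The delay between requesting and receiving a bed change is what makes this problem harder than it first looks, as the agent has to act on where demand will be in two days. A minimal sketch of such a delay is given below; the class name and the exact timing (a request made on one day taking effect two daily steps later) are assumptions, not taken from the simulation code.

```python
from collections import deque

# Allowed bed change requests, indexed by action number (values from the model description)
ALLOWED_CHANGES = (-20, -10, 0, 10, 20)


class DelayedBedChanges:
    """Hypothetical helper that applies each requested change after a fixed delay."""

    def __init__(self, beds, delay_days=2):
        self.beds = beds
        # Queue of changes waiting to take effect; starts with nothing pending
        self.pending = deque([0] * delay_days)

    def step(self, action_index):
        """Register today's request and apply the request made delay_days ago."""
        self.pending.append(ALLOWED_CHANGES[action_index])
        self.beds += self.pending.popleft()
        return self.beds


ward = DelayedBedChanges(beds=500)
beds_by_day = [ward.step(4), ward.step(2), ward.step(2)]
```

In this example a request for 20 extra beds on the first day leaves the bed count at 500 for two steps, and it rises to 520 on the third.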

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy: how to link state to action (often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting the ‘current’ understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

All networks are trained from the same memory, but each has different starting weights and is trained on different bootstrap samples from that memory. In this example actions are chosen randomly from each of the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to give some measure of uncertainty of action by looking at the distribution of actions recommended by the different nets. Bagging may also be used to aid exploration during stages where networks are suggesting different actions.
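The alternative mentioned above, taking the most common recommended action, also gives a simple measure of how much the networks agree. The sketch below is illustrative: the function name and the example recommendations are made up, and agreement is measured simply as the share of networks backing the winning action.

```python
from collections import Counter


def action_agreement(recommended_actions):
    """Return the most common action and the fraction of networks recommending it."""
    counts = Counter(recommended_actions)
    action, votes = counts.most_common(1)[0]
    return action, votes / len(recommended_actions)


# Illustrative recommendations from five networks (action indices)
action, agreement = action_agreement([3, 3, 2, 3, 4])
```

Here three of the five networks recommend action 3, so agreement is 0.6; low agreement suggests the networks are still uncertain about that state.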

days under capacity 54.0
days over capacity 306.0
average patients 497.0
average beds 540.0
% occupancy 91.9
dtype: float64

Bagging D3QN gives us the best performance and the most stable and consistent results we have seen so far. The downside is that more networks need to be trained, so this method is slower.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:1602.04621 http://arxiv.org/abs/1602.04621

Previously, we looked at using a Double Deep Q Network to manage beds in a simple hospital simulation.

Here we look at a refinement, a Duelling Double Deep Q Network, that can sometimes improve performance (see Wang et al., 2016, https://arxiv.org/abs/1511.06581).

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.
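The aggregation step can be illustrated with a few lines of NumPy. This is a sketch of the arithmetic only, not the network itself: duelling_q and the example numbers are hypothetical, and in a real network the state value and advantages would be the outputs of the two branches.

```python
import numpy as np

def duelling_q(state_value, advantages):
    """Combine state value and advantages: Q = V + (A - mean(A))."""
    advantages = np.asarray(advantages, dtype=float)
    # Subtract the mean so that the advantages sum to zero
    centred = advantages - advantages.mean()
    return state_value + centred

# Hypothetical outputs of the two branches for a state with five actions
q = duelling_q(-10.0, [2.0, 4.0, -1.0, 1.0, -1.0])
print(q)
```

Because the centred advantages sum to zero, the mean of the resulting Q values equals the state value.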

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

Default arrivals = 50/day

Weekend arrival numbers are 50% average arrival numbers

Weekday arrival numbers are 120% average arrival numbers

Distribution of inter-arrival time is inverse exponential

Average length of stay is 7 days (default)

Distribution of length of stay is inverse exponential

The RL agent may request a change in bed numbers once a day (default)

The allowed bed change requests are -20, -10, 0, 10, 20

Bed changes take 2 days to occur (default)

The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed

The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without bed.
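One reward that fits this description is the negative of the absolute difference between beds and patients. This is an assumption made for illustration (the simulation's own reward calculation is not shown here), and the function name reward is hypothetical.

```python
def reward(beds, patients):
    """Hypothetical reward: 0 when beds match patients exactly, otherwise
    minus one for each unoccupied bed or each patient without a bed."""
    return -abs(beds - patients)

print(reward(540, 497))  # 43 unoccupied beds
print(reward(480, 497))  # 17 patients without a bed
print(reward(497, 497))  # beds match patients
```

With a reward of this shape the best achievable value is 0, and any mismatch in either direction is penalised equally.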

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration
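One common way of balancing best predicted action against exploration is an epsilon-greedy rule. The sketch below is a generic illustration, not necessarily the rule used in the models here; epsilon_greedy and the example Q values are hypothetical.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action); otherwise take the
    action with the best predicted Q."""
    if rng.random() < epsilon:
        return int(rng.integers(0, len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
# Hypothetical predicted Q for five actions (third action is best)
q_values = np.array([-8.0, -3.0, -1.0, -4.0, -9.0])

# With no exploration the best predicted action is always taken
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
# With epsilon = 0.5 about half of the choices are random
actions = [epsilon_greedy(q_values, epsilon=0.5, rng=rng) for _ in range(1000)]
print(greedy_action, np.bincount(actions, minlength=5))
```

Epsilon is usually started high and reduced over time, so that the agent explores a lot early on and increasingly relies on its predictions as they improve.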

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy (how to link state to action; often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.
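As a small worked example of that calculation (a sketch with made-up numbers; bellman_target and the done flag for the end of an episode are hypothetical additions for illustration):

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Target Q = reward + gamma * best Q of the next state.
    If the episode has ended there is no next state to add."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Immediate reward of -5, and three possible actions in the next state
target = bellman_target(reward=-5.0,
                        next_q_values=[-40.0, -30.0, -35.0],
                        gamma=0.99)
print(f'{target:.1f}')
```

The best next action has a Q of -30, so the target is -5 + (0.99 x -30) = -34.7.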

Key DQN components (common to both Double Deep Q Networks and Duelling Deep Q Networks)

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time) and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.
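The two styles of target network update mentioned above (an infrequent full copy, or a gradual blend) can be sketched with plain NumPy arrays standing in for network weights. The function names hard_update and soft_update, the value of tau and the arrays themselves are assumptions for illustration only.

```python
import numpy as np

def hard_update(policy_weights):
    """Copy the policy network weights across to the target network."""
    return [w.copy() for w in policy_weights]

def soft_update(target_weights, policy_weights, tau):
    """Move target weights a small step (tau) towards the policy weights."""
    return [(1 - tau) * t + tau * p
            for t, p in zip(target_weights, policy_weights)]

# Stand-in weights: a list of arrays, one per layer
policy = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

# Gradual update: the target moves 1% of the way towards the policy
soft = soft_update(target, policy, tau=0.01)

# Infrequent update: full copy every sync_every steps
sync_every = 1000
copies = 0
for step in range(1, 3001):
    # ... a training step that changes the policy weights would go here ...
    if step % sync_every == 0:
        target = hard_update(policy)
        copies += 1
print(copies, soft[0][0, 0])
```

In both cases the target network changes more slowly than the policy network, which is what gives the training targets their stability.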

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Though performance is broadly similar to Double Deep Q Networks, I have found the Duelling version to be a little more consistent in performance from one training run to another.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock.

Default arrivals = 50/day

Weekend arrival numbers are 50% average arrival numbers

Weekday arrival numbers are 120% average arrival numbers

Distribution of inter-arrival time is inverse exponential

Average length of stay is 7 days (default)

Distribution of length of stay is inverse exponential

The RL agent may request a change in bed numbers once a day (default)

The allowed bed change requests are -20, -10, 0, 10, 20

Bed changes take 2 days to occur (default)

The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without bed.

Reinforcement learning introduction

RL involves:

Trial and error search

Receiving and maximising reward (often delayed)

Linking state -> action -> reward

Must be able to sense something of their environment

Involves uncertainty in sensing and linking action to reward

Learning -> improved choice of actions over time

All models find a way to balance best predicted action vs. exploration

Elements of RL

Environment: all observable and unobservable information relevant to us

Observation: sensing the environment

State: the perceived (or perceivable) environment

Agent: senses environment, decides on action, receives and monitors rewards

Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)

Policy (how to link state to action; often based on probabilities)

Reward signal: aim is to accumulate maximum reward over time

Value function of a state: prediction of likely/possible long-term reward

Q: prediction of likely/possible long-term reward of an action

Advantage: The difference in Q between actions in a given state (sums to zero for all actions)

Model (optional): a simulation of the environment

Types of model

Model-based: have model of environment (e.g. a board game)

Model-free: used when environment not fully known

Policy-based: identify best policy directly

Value-based: estimate value of a decision

Off-policy: can learn from historic data from other agent

On-policy: requires active learning from current decisions

Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.
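Written out, the relationship described above takes the following form. The symbols are notation assumed here rather than taken from the original post: s and a are the current state and action, r is the immediate reward, s' is the next state, and a' ranges over the actions available in it.

```latex
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
```

A gamma close to 1 gives future rewards almost as much weight as immediate ones; a smaller gamma makes the agent more short-sighted.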

Key DQN components

General method for Q learning

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time) and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state, which are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
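The way the two networks share the work when building training targets can be sketched in NumPy. The arrays below are made-up Q predictions for the next state from each network, and double_dqn_targets and the dones flag (1 where an episode has ended) are hypothetical names used for illustration; a real implementation would take these values from the networks themselves.

```python
import numpy as np

def double_dqn_targets(rewards, next_q_policy, next_q_target, dones, gamma):
    """Policy net chooses the best next action; target net supplies the Q
    value of that action."""
    best_next_actions = np.argmax(next_q_policy, axis=1)
    rows = np.arange(len(rewards))
    next_q = next_q_target[rows, best_next_actions]
    # No future reward is added where the episode has ended (done = 1)
    return rewards + gamma * next_q * (1 - dones)

# A batch of two transitions, each with three possible next actions
rewards = np.array([-5.0, -2.0])
dones = np.array([0.0, 1.0])
next_q_policy = np.array([[-30.0, -20.0, -25.0], [-10.0, -12.0, -11.0]])
next_q_target = np.array([[-28.0, -22.0, -19.0], [-9.0, -13.0, -10.0]])

targets = double_dqn_targets(rewards, next_q_policy, next_q_target,
                             dones, gamma=0.9)
print(targets)
```

For the first transition the policy network prefers the second action, so the target uses the target network's value for that action (-22) rather than the target network's own maximum (-19).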

Sometimes we have functions, or complete models, that may be run in parallel across CPU cores. This may save significant time when we have access to computers with multiple cores. Here we use the joblib library to show how we can run functions in parallel (pip install joblib if not already installed).

Import libraries

import numpy as np
import time
from joblib import Parallel, delayed

Mimic something to be run in parallel

Here we will create a function that takes 1 second to run, to mimic a model or a complex function.

def my_slow_function(x):
    """A very slow function, which takes 1 second to double a number"""
    # A 1 second delay
    time.sleep(1)
    return x * 2

Running our function sequentially in a ‘for’ loop

# Get start time
start = time.time()
# Run functions 8 times with different input (using a list comprehension)
trial_output = [my_slow_function(i) for i in range(8)]
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print(f'Time taken: {time_taken:.1f} seconds')
Out:
[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 8.0 seconds

Running our function in parallel using joblib

n_jobs is the maximum number of CPU cores to use. If set to -1, all available cores will be used.

# Get start time
start = time.time()
# Run functions 8 times with different input using joblib
trial_output = \
    Parallel(n_jobs=-1)(delayed(my_slow_function)(i) for i in range(8))
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print(f'Time taken: {time_taken:.1f} seconds')
Out:
[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 1.3 seconds

That’s a good improvement in speed!

Checking pseudo-random number generation

Pseudo-random number generators, if not provided with a seed, use the computer clock to generate the seed. This means some methods of parallel processing will generate sets of random numbers that may be the same. By default joblib uses the loky backend, which prevents this from occurring, but let’s check.

def numpy_random():
    """Generate a random number using NumPy"""
    return np.random.random()

Parallel(n_jobs=-1)(delayed(numpy_random)() for i in range(5))
Out:
[0.5268839074941227,
0.12883536669358964,
0.14084785209998263,
0.4166795926896423,
0.19189235808368665]
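If we also want results that are reproducible from run to run, one option (not part of the example above) is to give each task its own explicitly seeded generator. The sketch below uses NumPy's SeedSequence to spawn independent child seeds from a single parent seed; seeded_random and the parent seed of 42 are hypothetical choices. The function is run in a plain list comprehension here, with the equivalent joblib call shown as a comment.

```python
import numpy as np

def seeded_random(seed_sequence):
    """Draw one random number from a generator with its own seed."""
    rng = np.random.default_rng(seed_sequence)
    return rng.random()

# One parent seed spawns an independent child seed for each task
child_seeds = np.random.SeedSequence(42).spawn(5)
results = [seeded_random(s) for s in child_seeds]
# With joblib the same function could be run in parallel as:
# Parallel(n_jobs=-1)(delayed(seeded_random)(s) for s in child_seeds)
print(results)
```

Each task gets a different stream of random numbers, but repeating the run with the same parent seed gives the same set of results.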

This is a very simple example of producing an interactive visualisation using Holoviews (which calls on Bokeh). These visualisations can be viewed in Jupyter notebooks, or may be saved as a single html page which needs only a web browser to see. Here we show room temperature and humidity, with the plots allowing the choice of which room to show.

To create these interactive plots you will need to install pyviz, holoviews and bokeh as described below.

Install libraries if needed

From terminal run:

conda install -c pyviz holoviews bokeh

holoviews --install-examples

Import libraries

import numpy as np
import pandas as pd
import holoviews as hv
import panel as pn
from holoviews import opts
hv.extension('bokeh')

Create some dummy data

# Set length of data set to create
length = 25
# Build strings for location data
location1 = ['Window'] * length
location2 = ['Porch'] * length
location3 = ['Fridge'] * length
# Set temperature to normal distribution (mu, sigma, length)
temperature1 = np.random.normal(25, 5, length)
temperature2 = np.random.normal(15, 3, length)
temperature3 = np.random.normal(4, 0.5, length)
# Set humidity to uniform distribution (min, max, length)
humidity1 = np.random.uniform(30, 60, length)
humidity2 = np.random.uniform(60, 80, length)
humidity3 = np.random.uniform(80, 99, length)
# Record mean temperature/humidity (use np.repeat to repeat a single value)
mean_temp1 = np.repeat(np.mean(temperature1), length)
mean_temp2 = np.repeat(np.mean(temperature2), length)
mean_temp3 = np.repeat(np.mean(temperature3), length)
mean_humidity1 = np.repeat(np.mean(humidity1), length)
mean_humidity2 = np.repeat(np.mean(humidity2), length)
mean_humidity3 = np.repeat(np.mean(humidity3), length)
# Concatenate three sets of data into single list/arrays
location = location1 + location2 + location3
temperature = np.concatenate((temperature1, temperature2, temperature3))
mean_temperature = np.concatenate((mean_temp1, mean_temp2, mean_temp3))
humidity = np.concatenate((humidity1, humidity2, humidity3))
mean_humidity = np.concatenate((mean_humidity1, mean_humidity2, mean_humidity3))
# Create list of days
days = list(range(1,length + 1))
day = days * 3 # times 3 as there are three locations
# Transfer data to pandas DataFrame
data = pd.DataFrame()
data['day'] = day
data['location'] = location
data['temperature'] = temperature
data['humidity'] = humidity
data['mean_temperature'] = mean_temperature
data['mean_humidity'] = mean_humidity
data.head()
Out:
   day location  temperature   humidity  mean_temperature  mean_humidity
0    1   Window    26.081745  49.611333         25.222169       45.43133
1    2   Window    31.452276  39.027559         25.222169       45.43133
2    3   Window    19.031828  58.825912         25.222169       45.43133
3    4   Window    21.309825  52.741160         25.222169       45.43133
4    5   Window    13.529042  39.977335         25.222169       45.43133

Build bar chart

# Make holoviews data table
key_dimensions = ['location']
value_dimensions = ['day', 'temperature', 'humidity', 'mean_temperature', 'mean_humidity']
hv_data = hv.Table(data, key_dimensions, value_dimensions)
# Build bar charts
bars1 = hv_data.to.bars(['day'], ['temperature'])
bars2 = hv_data.to.bars(['day'], ['humidity']).opts(color='Red')
# Compose plot
bar_plot = bars1 + bars2
# Show plot (only works in Jupyter notebook)
bar_plot

Here we combine the line and scatter charts. We viewed them individually before, though this is not actually necessary.

# Compose plot (* creates overlays of two or more plots)
combined_plot = line1 * scatter1 + line2 * scatter2
# Show plot
combined_plot

Save to html

The interactive plot may be saved as html which may be shared with, and viewed by, anyone (there is no need for anything other than a standard web browser to view the interactive plot).