Prioritised Replay Noisy Duelling Double Deep Q Learning – controlling a simple hospital bed system

“Pay more attention to the time you got things wrong!”

In this example we take the previous example of Noisy Duelling Double Q Learning and add a Prioritised Replay memory. Using this type of memory, state/actions that have the highest loss (error) when training our policy network are sampled more frequently than state/actions that have low loss.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two streams. One component reduces to a single value, which models the state value. The other component models the advantage, the difference in Q between different actions (the mean advantage is subtracted from all values, so that the advantage always sums to zero across actions). These are aggregated to produce Q for each action.
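As an illustration of this aggregation, the following PyTorch sketch shows a minimal duelling head (the layer sizes and class name are assumptions for this illustration, not the notebook's exact network):

import torch
import torch.nn as nn

class DuellingHead(nn.Module):
    """Minimal sketch of a duelling head: a value stream and an advantage
    stream are combined, with the mean advantage subtracted so that the
    advantages sum to zero across actions."""

    def __init__(self, n_observations, n_actions, hidden_size=128):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(n_observations, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1))
        self.advantage = nn.Sequential(
            nn.Linear(n_observations, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, n_actions))

    def forward(self, x):
        v = self.value(x)                           # state value V(s)
        a = self.advantage(x)                       # advantages A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a) for each action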

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$
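As a purely illustrative calculation (the numbers here are invented for this sketch), if the immediate reward is -5, gamma is 0.99 and the best Q value available from the next state is -50, the Bellman target works out as follows:

reward = -5
gamma = 0.99
max_next_q = -50

q_target = reward + gamma * max_next_q
print(q_target)  # -54.5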

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.
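A minimal sketch of this periodic update in PyTorch is shown below (the stand-in network and the update interval are assumptions for this illustration):

import copy
import torch.nn as nn

# Stand-in for the full Q network, and a target network that starts as a copy
policy_net = nn.Linear(4, 5)
target_net = copy.deepcopy(policy_net)

TARGET_UPDATE_INTERVAL = 1000  # e.g. copy weights across every 1,000 steps

for step in range(5000):
    # ... normal training of policy_net would happen here ...
    if step % TARGET_UPDATE_INTERVAL == 0:
        target_net.load_state_dict(policy_net.state_dict())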

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
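The following function sketches how the Double DQN target can be computed for a batch of transitions (the function and argument names are assumptions, not the notebook's exact code; `dones` is assumed to be a float tensor of 0s and 1s):

import torch

def double_dqn_target(policy_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Sketch of the Double DQN target: the policy network picks the best
    next action, and the target network supplies that action's Q value."""
    with torch.no_grad():
        # Action selection uses the policy network
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation uses the target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Terminal states contribute no future reward
        return rewards + gamma * next_q * (1 - dones)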

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value drawn from the normal distribution. This random value is used to add noise to the output. The parameters controlling the extent of the noise for each weight, sigma, are stored within the layer and are trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorised Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.
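A sketch of a linear layer with factorised Gaussian noise is shown below, loosely following the pattern used in Lapan's book (the initial sigma value and class name are assumptions for this illustration):

import math
import torch
import torch.nn as nn

class NoisyFactorizedLinear(nn.Linear):
    """Sketch of a linear layer with factorised Gaussian noise. The sigma
    parameters are learned by back-propagation alongside the ordinary
    weights and biases."""

    def __init__(self, in_features, out_features, sigma_zero=0.4):
        super().__init__(in_features, out_features)
        sigma_init = sigma_zero / math.sqrt(in_features)
        self.sigma_weight = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))

    @staticmethod
    def _scale_noise(size):
        # f(x) = sign(x) * sqrt(|x|), applied to standard normal samples
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Outer product of two small noise vectors gives the full weight noise
        eps_in = self._scale_noise(self.in_features)
        eps_out = self._scale_noise(self.out_features)
        weight = self.weight + self.sigma_weight * torch.outer(eps_out, eps_in)
        bias = self.bias + self.sigma_bias * eps_out
        return nn.functional.linear(x, weight, bias)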

Prioritised replay

In standard DQN, samples are taken randomly from the memory (replay buffer). In prioritised replay, samples are taken in proportion to their loss when training the network: where the network has the greatest error in predicting the target value of a state/action, those samples are drawn more frequently (which reduces the error in the network until the sample is no longer prioritised). In other words, training focuses more heavily on the samples it gets most wrong, and spends less time on samples it can already predict accurately.

This priority may also be used as a weight when training the network, but that is not implemented here; we use the loss only for sampling.

When we use the loss for priority we add a small value (1e-5) to the loss. This avoids any sample having zero priority (and so never having a chance of being sampled). For frequency of sampling we also raise the loss to the power of ‘alpha’ (default value 0.6). Smaller values of alpha compress the differences between samples, making the priority weighting less significant in the frequency of sampling.
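A minimal list-based sketch of this sampling scheme is shown below (class and method names are assumptions for this illustration; the notebook's actual memory will differ, and a production implementation would normally use a sum-tree for efficiency):

import numpy as np

class SimplePrioritisedReplay:
    """Sketch of a prioritised replay buffer: samples are drawn with
    probability proportional to (loss + 1e-5) ** alpha."""

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.memory = []
        self.priorities = []

    def add(self, experience):
        if len(self.memory) >= self.capacity:
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(experience)
        # Give new experiences the current maximum priority so they are
        # sampled at least once before their true loss is known
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, batch_size):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.memory), batch_size, p=probs)
        return indices, [self.memory[i] for i in indices]

    def update_priorities(self, indices, losses):
        # After training, store each sample's new loss (plus a small constant)
        for i, loss in zip(indices, losses):
            self.priorities[i] = float(loss) + 1e-5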

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Prioritised replay: Schaul T, Quan J, Antonoglou I, et al. (2016) Prioritized Experience Replay. arXiv:1511.05952 http://arxiv.org/abs/1511.05952

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

Code is available at:

https://github.com/MichaelAllen1966/learninghospital/blob/master/05_prioritised_replay_noisy_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding prioritised replay leads to more rapid learning.

Noisy Bootstrapped Aggregation (‘Bagging’) Duelling Double Deep Q Learning – controlling a simple hospital bed system

In this example we take the previous example of Bagging Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network reduces, because the noise scale is a parameter of the network that is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two streams. One component reduces to a single value, which models the state value. The other component models the advantage, the difference in Q between different actions (the mean advantage is subtracted from all values, so that the advantage always sums to zero across actions). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

For a presentation on Q, see https://youtu.be/o22P5rCAAEQ

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

Each network is trained from the same memory, but the networks have different starting weights and are trained on different bootstrap samples from that memory. In this example, at each step an action is chosen from one of the networks picked at random (an alternative could be to take the most common action recommended by the networks, or an average output), as sketched below. This bagging method may also be used to obtain some measure of uncertainty about the action by looking at the distribution of actions recommended by the different nets. Bagging may also aid exploration during stages where the networks suggest different actions.
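A small sketch of this action-selection step (the function names are assumptions, and `q_estimates` is assumed to be a list of per-network arrays of Q values for the current state):

import random
import numpy as np

def choose_bagged_action(q_estimates):
    """Follow the greedy action of one randomly chosen network,
    as in this example."""
    chosen_net = random.choice(q_estimates)
    return int(np.argmax(chosen_net))

def vote_bagged_action(q_estimates):
    """Alternative: take the most common greedy action across all networks;
    the spread of votes also gives a rough measure of uncertainty."""
    greedy_actions = [int(np.argmax(q)) for q in q_estimates]
    return int(np.bincount(greedy_actions).argmax())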

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value drawn from the normal distribution. This random value is used to add noise to the output. The parameters controlling the extent of the noise for each weight, sigma, are stored within the layer and are trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorised Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:1602.04621 http://arxiv.org/abs/1602.04621

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

https://github.com/MichaelAllen1966/learninghospital/blob/master/07_noisy_bagging_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding noisy layers maintains or improves the performance of bagging DDQN, but without the need for epsilon-greedy exploration.

Noisy Duelling Double Deep Q Learning – controlling a simple hospital bed system

In this example we take the previous example of Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network reduces, because the noise scale is a parameter of the network that is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two streams. One component reduces to a single value, which models the state value. The other component models the advantage, the difference in Q between different actions (the mean advantage is subtracted from all values, so that the advantage always sums to zero across actions). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value drawn from the normal distribution. This random value is used to add noise to the output. The parameters controlling the extent of the noise for each weight, sigma, are stored within the layer and are trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorised Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:1706.10295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

https://github.com/MichaelAllen1966/learninghospital/blob/master/03_noisy_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding noisy layers maintains or improves the performance of duelling DQN, but without the need for epsilon-greedy exploration.

Bootstrapped Aggregation (‘Bagging’) Duelling Double Deep Q Learning – controlling a simple hospital bed system

We again use a variation on Deep Q Learning, to control staffed bed numbers in a simple hospital simulation. This follows on from Double Deep Q Learning and Duelling Double Deep Q Learning.

In this example we take the previous example of Duelling Double Q Learning and add ‘bagging’ where we have multiple Duelling Double Q Learning Networks (each with their own target network).

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two streams. One component reduces to a single value, which models the state value. The other component models the advantage, the difference in Q between different actions (the mean advantage is subtracted from all values, so that the advantage always sums to zero across actions). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

Each network is trained from the same memory, but the networks have different starting weights and are trained on different bootstrap samples from that memory. In this example, at each step an action is chosen from one of the networks picked at random (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to obtain some measure of uncertainty about the action by looking at the distribution of actions recommended by the different nets. Bagging may also aid exploration during stages where the networks suggest different actions.

Code structure

Code

The code for this experiment:

https://github.com/MichaelAllen1966/learninghospital/blob/master/06_bagging_duelling_ddqn_simple_hospital_sim_1%20.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

days under capacity     54.0
days over capacity     306.0
average patients       497.0
average beds           540.0
% occupancy             91.9
dtype: float64

Bagging D3QN gives us the best performance and the most stable and consistent results we have seen so far. The downside is that more networks need to be trained, so this method is slower.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:1602.04621 http://arxiv.org/abs/1602.04621

Duelling Double Deep Q Network (D3QN) controlling a simple hospital bed system

Previously, we looked at using a Double Deep Q Network to manage beds in a simple hospital simulation.

Here we look at a refinement, a Duelling Double Deep Q Network, which can sometimes improve performance (see Wang et al., 2016, https://arxiv.org/abs/1511.06581).

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

Key DQN components (common to both Double Deep Q Networks and Duelling Deep Q Networks)

General method for Q learning:

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Code structure

Code

The code for this experiment may be found here:
https://github.com/MichaelAllen1966/learninghospital/blob/master/02_duelling_ddqn_simple_hospital_sim_1.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

Though performance is broadly similar to Double Deep Q Networks, I have found the Duelling version to be a little more consistent in performance from one training run to another.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581

Double Deep Q Network (DDQN) – controlling a simple hospital bed system

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock.

  • Default arrivals = 50/day
  • Weekend arrival numbers are 50% average arrival numbers
  • Weekday arrival numbers are 120% average arrival numbers
  • Distribution of inter-arrival time is inverse exponential
  • Average length of stay is 7 days (default)
  • Distribution of length of stay is inverse exponential
  • The RL agent may request a change in bed numbers once a day (default)
  • The allowed bed change requests are -20, -10, 0, 10, 20
  • Bed changes take 2 days to occur (default)
  • The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward plus the discounted maximum Q value of the next state (i.e. assuming the best action is then taken), where gamma is the discount rate:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning

The overall aim is to create a neural network that predicts Q. Improvement comes both from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. used to assess the Q of the next state) are updated only infrequently (or gradually), so do not suffer from this instability.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; these are otherwise closely correlated, as their input features are very similar.

The policy network is used to select the action (the action with the best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Code structure

Code

Deep Q Learning Experiment:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/01_ddqn_simple_hospital_sim_1.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

References

van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

128. Parallel processing in Python

Sometimes we have functions, or complete models, that may be run in parallel across CPU cores. This may save significant time when we have access to computers with multiple cores. Here we use the joblib library to show how we can run functions in parallel (pip install joblib if not already installed).

Import libraries

import numpy as np
import time
from joblib import Parallel, delayed

Mimic something to be run in parallel

Here we will create a function that takes 1 second to run, to mimic a model or a complex function.

def my_slow_function(x):
    """A very slow function, which takes 1 second to double a number"""
    
    # A 1 second delay
    time.sleep(1)
    
    return x * 2

Running our function sequentially in a ‘for’ loop

# Get start time
start = time.time()
# Run functions 8 times with different input (using a list comprehension)
trial_output = [my_slow_function(i) for i in range(8)]
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print (f'Time taken: {time_taken:.1f} seconds')

OUT:

[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 8.0 seconds

Running our function in parallel using joblib

n_jobs is the maximum number of CPU cores to use. If set to -1, all available cores will be used.

# Get start time
start = time.time()
# Run functions 8 times with different input using joblib
trial_output = \
    Parallel(n_jobs=-1)(delayed(my_slow_function)(i) for i in range(8))
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print (f'Time taken: {time_taken:.1f} seconds')

OUT:

[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 1.3 seconds

That’s a good improvement in speed!

Checking pseudo-random number generation

Pseudo-random number generators, if not provided with a seed, use the computer clock to generate the seed. This means some methods of parallel processing will generate sets of random numbers that may be the same. By default joblib uses the loky backend, which prevents this occurring, but let’s check.

def numpy_random():
    """Generate a random number using NumPy"""
    return np.random.random()

Parallel(n_jobs=-1)(delayed(numpy_random)() for i in range(5))

Out:
[0.5268839074941227,
 0.12883536669358964,
 0.14084785209998263,
 0.4166795926896423,
 0.19189235808368665]
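If you prefer not to rely on the backend's behaviour, one option (a sketch, not part of the original code) is to pass an explicit seed to each job:

def numpy_random_seeded(seed):
    """Generate a random number from an explicitly seeded generator"""
    rng = np.random.default_rng(seed)
    return rng.random()

Parallel(n_jobs=-1)(delayed(numpy_random_seeded)(seed) for seed in range(5))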


123: A basic example of creating an interactive plot with HoloViews and Bokeh


This is a very simple example of producing an interactive visualisation using HoloViews (which calls on Bokeh). These visualisations can be viewed in Jupyter notebooks, or may be saved as a single html page which needs only a web browser to view. Here we show room temperature and humidity, with the plots allowing the choice of which room to show.

To create these interactive plots you will need to install pyviz, holoviews and bokeh as described below.

Install libraries if needed

From terminal run:

conda install -c pyviz holoviews bokeh

holoviews --install-examples

Import libraries

import numpy as np
import pandas as pd
import holoviews as hv
import panel as pn
from holoviews import opts
hv.extension('bokeh')

Create some dummy data

# Set length of data set to create
length = 25

# Build strings for location data
location1 = ['Window'] * length
location2 = ['Porch'] * length
location3 = ['Fridge'] * length

# Set temperature to normal distribution (mu, sigma, length)
temperature1 = np.random.normal(25, 5, length)
temperature2 = np.random.normal(15, 3, length)
temperature3 = np.random.normal(4, 0.5, length)

# Set humidity to uniform distribution (min, max, length)
humidity1 = np.random.uniform(30, 60, length)
humidity2 = np.random.uniform(60, 80, length)
humidity3 = np.random.uniform(80, 99, length)

# Record mean temperature/humidity (use np.repeat to repeat a single value)
mean_temp1 = np.repeat(np.mean(temperature1), length)
mean_temp2 = np.repeat(np.mean(temperature2), length)
mean_temp3 = np.repeat(np.mean(temperature3), length)

mean_humidity1 = np.repeat(np.mean(humidity1), length)
mean_humidity2 = np.repeat(np.mean(humidity2), length)
mean_humidity3 = np.repeat(np.mean(humidity3), length)

# Concatenate three sets of data into single list/arrays
location = location1 + location2 + location3
temperature = np.concatenate((temperature1, temperature2, temperature3))
mean_temperature = np.concatenate((mean_temp1, mean_temp2, mean_temp3))
humidity = np.concatenate((humidity1, humidity2, humidity3))
mean_humidity = np.concatenate((mean_humidity1, mean_humidity2, mean_humidity3))

# Create list of days
days = list(range(1,length + 1))
day = days * 3 # times 3 as there are three locations

# Transfer data to pandas DataFrame
data = pd.DataFrame()
data['day'] = day
data['location'] = location
data['temperature'] = temperature
data['humidity'] = humidity
data['mean_temperature'] = mean_temperature
data['mean_humidity'] = mean_humidity

data.head()

Out:

day	location	temperature	humidity	mean_temperature	mean_humidity
0	1	Window	26.081745	49.611333	25.222169	45.43133
1	2	Window	31.452276	39.027559	25.222169	45.43133
2	3	Window	19.031828	58.825912	25.222169	45.43133
3	4	Window	21.309825	52.741160	25.222169	45.43133
4	5	Window	13.529042	39.977335	25.222169	45.43133

Build bar chart

# Make holoviews data table
key_dimensions   = ['location']
value_dimensions = ['day', 'temperature', 'humidity', 'mean_temperature', 'mean_humidity']
hv_data = hv.Table(data, key_dimensions, value_dimensions)

# Build bar charts
bars1 = hv_data.to.bars(['day'], ['temperature'])
bars2 = hv_data.to.bars(['day'], ['humidity']).opts(color='Red')

# Compose plot
bar_plot = bars1 + bars2

# Show plot (only work in Jupyter notebook)
bar_plot

Build scatter chart

# Build scatter charts
scatter1 = hv_data.to.scatter(['day'], ['temperature'])
scatter2 = hv_data.to.scatter(['day'], ['humidity']).opts(color='Red')

# Compose plot
scatter_plot = scatter1 + scatter2

# Show plot
scatter_plot

Build line chart for mean temperature and humidity

# Build line charts
line1 = hv_data.to.curve(['day'], ['mean_temperature'])
line2 = hv_data.to.curve(['day'], ['mean_humidity']).opts(color='Red')

# Compose plot
line_chart = line1 + line2

# Show plot
line_chart

Combine line and scatter charts

Here we combine the line and scatter charts. We viewed them individually before, though this is not actually necessary.

# Compose plot (* creates overlays of two or more plots)
combined_plot = line1 * scatter1 + line2 * scatter2

# Show plot
combined_plot

Save to html

The interactive plot may be saved as html which may be shared with, and viewed by, anyone (there is no need for anything other than a standard web browser to view the interactive plot).

hv.save(combined_plot, 'holoviews_example.html')

117 (code only): Genetic Algorithms 2 – a multiple objective genetic algorithm (NSGA-II)

See here for description of this code.

import random as rn
import numpy as np
import matplotlib.pyplot as plt
# For use in Jupyter notebooks only:
%matplotlib inline

# Create reference solutions
# --------------------------

def create_reference_solutions(chromosome_length, solutions):
    """
    Function to create reference chromosomes that will mimic an ideal solution 
    """
    references = np.zeros((solutions, chromosome_length))
    number_of_ones = int(chromosome_length / 2)
    
    for solution in range(solutions):
        # Build an array with an equal mix of zero and ones
        reference = np.zeros(chromosome_length)
        reference[0: number_of_ones] = 1
    
        # Shuffle the array to mix the zeros and ones
        np.random.shuffle(reference)
        references[solution,:] = reference
    
    return references


# Evaluate solutions
# ------------------

def calculate_fitness(reference, population):
    """
    Calculate how many binary digits in each solution are the same as our 
    reference solution.
    """
    # Create an array of True/False compared to reference
    identical_to_reference = population == reference
    # Sum number of genes that are identical to the reference
    fitness_scores = identical_to_reference.sum(axis=1)
    
    return fitness_scores 


def score_population(population, references):
    """
    Loop through all reference solutions and request score/fitness of
    population against that reference solution.
    """
    scores = np.zeros((population.shape[0], references.shape[0]))
    for i, reference in enumerate(references):
        scores[:,i] = calculate_fitness(reference, population)
        
    return scores 

# Calculate crowding and select a population based on crowding scores
# -------------------------------------------------------------------
    
def calculate_crowding(scores):
    """
    Crowding is based on a vector for each individual
    All scores are normalised between low and high. For any one score, all
    solutions are sorted in order low to high. Crowding for chromosome x
    for that score is the difference between the next highest and next
    lowest score. Total crowding value sums all crowding for all scores
    """
    
    population_size = len(scores[:, 0])
    number_of_scores = len(scores[0, :])

    # create crowding matrix of population (row) and score (column)
    crowding_matrix = np.zeros((population_size, number_of_scores))

    # normalise scores (np.ptp gives max - min for each score column)
    normed_scores = (scores - scores.min(0)) / np.ptp(scores, axis=0)

    # calculate crowding distance for each score in turn
    for col in range(number_of_scores):
        crowding = np.zeros(population_size)

        # end points have maximum crowding
        crowding[0] = 1
        crowding[population_size - 1] = 1

        # Sort each score (to calculate crowding between adjacent scores)
        sorted_scores = np.sort(normed_scores[:, col])

        sorted_scores_index = np.argsort(
            normed_scores[:, col])

        # Calculate crowding distance for each individual
        crowding[1:population_size - 1] = \
            (sorted_scores[2:population_size] -
             sorted_scores[0:population_size - 2])

        # resort to original order (two steps)
        re_sort_order = np.argsort(sorted_scores_index)
        sorted_crowding = crowding[re_sort_order]

        # Record crowding distances
        crowding_matrix[:, col] = sorted_crowding

    # Sum crowding distances of each score
    crowding_distances = np.sum(crowding_matrix, axis=1)

    return crowding_distances


def reduce_by_crowding(scores, number_to_select):
    """
    This function selects a number of solutions based on a tournament of
    crowding distances. Two members of the population are picked at
    random. The one with the higher crowding distance is always picked.
    """
    population_ids = np.arange(scores.shape[0])

    crowding_distances = calculate_crowding(scores)

    picked_population_ids = np.zeros((number_to_select))

    picked_scores = np.zeros((number_to_select, len(scores[0, :])))

    for i in range(number_to_select):

        population_size = population_ids.shape[0]

        fighter1ID = rn.randint(0, population_size - 1)

        fighter2ID = rn.randint(0, population_size - 1)

        # If fighter # 1 is better
        if crowding_distances[fighter1ID] >= crowding_distances[
            fighter2ID]:

            # add solution to picked solutions array
            picked_population_ids[i] = population_ids[
                fighter1ID]

            # Add score to picked scores array
            picked_scores[i, :] = scores[fighter1ID, :]

            # remove selected solution from available solutions
            population_ids = np.delete(population_ids, (fighter1ID),
                                       axis=0)

            scores = np.delete(scores, (fighter1ID), axis=0)

            crowding_distances = np.delete(crowding_distances, (fighter1ID),
                                           axis=0)
        else:
            picked_population_ids[i] = population_ids[fighter2ID]

            picked_scores[i, :] = scores[fighter2ID, :]

            population_ids = np.delete(population_ids, (fighter2ID), axis=0)

            scores = np.delete(scores, (fighter2ID), axis=0)

            crowding_distances = np.delete(
                crowding_distances, (fighter2ID), axis=0)

    # Convert to integer 
    picked_population_ids = np.asarray(picked_population_ids, dtype=int)
    
    return (picked_population_ids)

# Pareto selection
# ---------------
    
def identify_pareto(scores, population_ids):
    """
    Identifies a single Pareto front, and returns the population IDs of
    the selected solutions.
    """
    
    population_size = scores.shape[0]
    # Create a starting list of items on the Pareto front
    # All items start off as being labelled as on the Pareto front
    pareto_front = np.ones(population_size, dtype=bool)
    # Loop through each item. This will then be compared with all other items
    for i in range(population_size):
        # Loop through all other items
        for j in range(population_size):
            # Check if our 'i' point is dominated by our 'j' point
            if all(scores[j] >= scores[i]) and any(scores[j] > scores[i]):
                # j dominates i. Label 'i' point as not on Pareto front
                pareto_front[i] = 0
                # Stop further comparisons with 'i' (no more comparisons needed)
                break
    # Return ids of scenarios on pareto front
    return population_ids[pareto_front]


def build_pareto_population(
        population, scores, minimum_population_size, maximum_population_size):
    """
    As necessary repeats Pareto front selection to build a population within
    defined size limits. Will reduce a Pareto front by applying crowding 
    selection as necessary.    
    """
    unselected_population_ids = np.arange(population.shape[0])
    all_population_ids = np.arange(population.shape[0])
    pareto_front = []
    while len(pareto_front) < minimum_population_size:
        temp_pareto_front = identify_pareto(
                scores[unselected_population_ids, :], unselected_population_ids)
        
        # Check size of total Pareto front.
        # If larger than maximum size reduce new pareto front by crowding
        combined_pareto_size = len(pareto_front) + len(temp_pareto_front)
        if combined_pareto_size > maximum_population_size:
            # Select just enough of the new front to fill the population
            # up to its maximum allowed size
            number_to_select = maximum_population_size - len(pareto_front)
            selected_individuals = (reduce_by_crowding(
                    scores[temp_pareto_front], number_to_select))
            temp_pareto_front = temp_pareto_front[selected_individuals]
        
        # Add latest pareto front to full Pareto front
        pareto_front = np.hstack((pareto_front, temp_pareto_front))
        
    
        # Update unselected population ID by using sets to find IDs in all
        # ids that are not in the selected front
        unselected_set = set(all_population_ids) - set(pareto_front)
        unselected_population_ids = np.array(list(unselected_set))

    population = population[pareto_front.astype(int)]
    return population

# Population functions
# --------------------
    
def create_population(individuals, chromosome_length):
    """
    Create random population with given number of individuals and chromosome
    length.
    """
    
    # Set up an initial array of all zeros
    population = np.zeros((individuals, chromosome_length))
    # Loop through each row (individual)
    for i in range(individuals):
        # Choose a random number of ones to create
        ones = rn.randint(0, chromosome_length)
        # Change the required number of zeros to ones
        population[i, 0:ones] = 1
        # Shuffle row
        np.random.shuffle(population[i])
    
    return population

def breed_by_crossover(parent_1, parent_2):
    """
    Combine two parent chromosomes by crossover to produce two children.
    """
    # Get length of chromosome
    chromosome_length = len(parent_1)
    
    # Pick crossover point, avoiding ends of chromosome
    crossover_point = rn.randint(1,chromosome_length-1)
    
    # Create children. np.hstack joins two arrays
    child_1 = np.hstack((parent_1[0:crossover_point],
                        parent_2[crossover_point:]))
    
    child_2 = np.hstack((parent_2[0:crossover_point],
                        parent_1[crossover_point:]))
    
    # Return children
    return child_1, child_2


def randomly_mutate_population(population, mutation_probability):
    """
    Randomly mutate population with a given individual gene mutation
    probability. Individual gene may switch between 0/1.
    """
    
    # Apply random mutation
    random_mutation_array = np.random.random(size=(population.shape))
    
    random_mutation_boolean = \
        random_mutation_array <= mutation_probability

    population[random_mutation_boolean] = \
    np.logical_not(population[random_mutation_boolean])
    
    # Return mutation population
    return population


def breed_population(population):
    """
    Create child population by repeatedly calling breeding function (two parents
    producing two children), applying genetic mutation to the child population,
    combining parent and child population, and removing duplicate chromosomes.
    """
    # Create an empty list for new population
    new_population = []
    population_size = population.shape[0]
    # Create new population generating two children at a time
    for i in range(int(population_size/2)):
        parent_1 = population[rn.randint(0, population_size-1)]
        parent_2 = population[rn.randint(0, population_size-1)]
        child_1, child_2 = breed_by_crossover(parent_1, parent_2)
        new_population.append(child_1)
        new_population.append(child_2)
    
    # Add the child population to the parent population
    # In this method we allow parents and children to compete to be kept
    population = np.vstack((population, np.array(new_population)))
    population = np.unique(population, axis=0)
    
    return population

# *************************************
# ******** MAIN ALGORITHM CODE ********
# *************************************

# Set general parameters
chromosome_length = 50
starting_population_size = 5000
maximum_generation = 250
minimum_population_size = 500
maximum_population_size = 1000

# Create two reference solutions 
# (this is used just to illustrate GAs)
references = create_reference_solutions(chromosome_length, 2)

# Create starting population
population = create_population(
        starting_population_size, chromosome_length)
#population = np.unique(population, axis=0)

# Now we'll go through the generations of genetic algorithm

for generation in range(maximum_generation):
    if generation % 10 == 0:
        print ('Generation (out of %i): %i '%(maximum_generation, generation))

    # Breed
    population = breed_population(population)
    
    # Score population
    scores = score_population(population, references)
    
    # Build pareto front
    population = build_pareto_population(
            population, scores, minimum_population_size, maximum_population_size)


# Get final pareto front
scores = score_population(population, references)
population_ids = np.arange(population.shape[0]).astype(int)
pareto_front = identify_pareto(scores, population_ids)
population = population[pareto_front, :]
scores = scores[pareto_front]

# Plot Pareto front (for two scores only)  
x = scores[:, 0]/chromosome_length*100
y = scores[:, 1]/chromosome_length*100
plt.xlabel('Objective A - % maximum obtainable')
plt.ylabel('Objective B - % maximum obtainable')

plt.scatter(x,y)
plt.savefig('pareto.png')
plt.show()