Prioritised Replay Noisy Duelling Double Deep Q Learning – controlling a simple hospital bed system

“Pay more attention to the time you got things wrong!”

In this example we take the previous example of Noisy Duelling Double Q Learning and add a Prioritised Replay memory. Using this type of memory, state/actions that have the highest loss (error) when training our policy network are sampled more frequently than state/actions that have low loss.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.
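
As a concrete illustration of this reward structure, a minimal sketch is shown below. The function and its exact form are assumptions for illustration only; the actual reward is defined in the simulation environment linked under 'Code'.

def illustrative_reward(n_beds, n_patients):
    """Illustrative reward only: 0 is the best possible reward, and every
    unoccupied bed or patient without a bed reduces the reward by 1.
    The actual reward is defined in env_simple_hospital_bed_1.py."""
    return -abs(n_beds - n_patients)

# Example: 540 staffed beds and 497 patients gives a reward of -43
print(illustrative_reward(540, 497))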

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.
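
A minimal sketch of this duelling aggregation, written in PyTorch, is shown below (layer sizes and names are illustrative assumptions, not taken from the notebook):

import torch.nn as nn

class DuellingHead(nn.Module):
    """Illustrative duelling head: a shared body splits into a state-value
    stream and an advantage stream, which are recombined into Q values."""

    def __init__(self, n_inputs, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantage A(s, a)

    def forward(self, x):
        x = self.body(x)
        v = self.value(x)
        a = self.advantage(x)
        # Subtract the mean advantage so advantages sum to zero,
        # then add the state value to give Q for each action
        return v + a - a.mean(dim=1, keepdim=True)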

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
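
The sketch below illustrates this double DQN target calculation in PyTorch (variable names are illustrative assumptions; dones is assumed to be a 0/1 float tensor marking terminal steps):

import torch

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones,
                       gamma=0.99):
    """Illustrative double DQN target: the policy network chooses the best
    next action, the target network provides its Q value."""
    with torch.no_grad():
        # Policy network selects the best action for each next state
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # Target network evaluates the Q value of that action
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Zero out Q for terminal states, then apply the Bellman equation
        return rewards + gamma * next_q * (1 - dones)

# Every n steps (e.g. 1000) the policy network weights are copied across:
# target_net.load_state_dict(policy_net.state_dict())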

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.
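
A minimal sketch of such a noisy linear layer with factorized Gaussian noise is shown below, following the general approach described in Lapan (see references); details such as the default sigma are illustrative assumptions rather than the notebook's exact code:

import math
import torch
import torch.nn as nn

class NoisyFactorizedLinear(nn.Linear):
    """Illustrative linear layer with factorized Gaussian noise.
    The noise scale (sigma) is a trainable parameter, so exploration
    is reduced automatically as training progresses."""

    def __init__(self, in_features, out_features, sigma_zero=0.4, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        sigma_init = sigma_zero / math.sqrt(in_features)
        self.sigma_weight = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.register_buffer('eps_input', torch.zeros(1, in_features))
        self.register_buffer('eps_output', torch.zeros(out_features, 1))
        if bias:
            self.sigma_bias = nn.Parameter(
                torch.full((out_features,), sigma_init))

    def forward(self, x):
        # Two small noise vectors (input-sized and output-sized) ...
        self.eps_input.normal_()
        self.eps_output.normal_()
        func = lambda e: torch.sign(e) * torch.sqrt(torch.abs(e))
        eps_in = func(self.eps_input)
        eps_out = func(self.eps_output)
        # ... whose outer product gives a full noise matrix for the weights
        noise = torch.mul(eps_in, eps_out)
        bias = self.bias
        if bias is not None:
            bias = bias + self.sigma_bias * eps_out.t().squeeze(1)
        return nn.functional.linear(
            x, self.weight + self.sigma_weight * noise, bias)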

Prioritised replay

In standard DQN, samples are taken randomly from the memory (replay buffer). In prioritised replay, samples are taken in proportion to their loss when training the network; where the network has the greatest error in predicting the target value of a state/action, those samples are drawn more frequently (which reduces the error in the network until the sample is no longer prioritised). In other words, training focuses more heavily on the samples it gets most wrong, and spends less time on samples it can already predict accurately.

This priority may also be used as a weight when training the network, but that is not implemented here; we use the loss only for sampling.

When we use the loss as the priority we add a small value (1e-5) to the loss. This avoids any sample having zero priority (and so never having a chance of being sampled). For the frequency of sampling we also raise the loss to the power of ‘alpha’ (default value 0.6). Smaller values of alpha compress the differences between samples, making the priority weighting less significant in the frequency of sampling.
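
A minimal sketch of this prioritised sampling, using a simple list-based buffer, is shown below (an illustration of the idea, not the notebook's exact implementation):

import numpy as np

class SimplePrioritisedReplay:
    """Illustrative prioritised replay buffer. Samples are drawn with
    probability proportional to (loss + 1e-5) ** alpha."""

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.memory = []      # stores (state, action, reward, next_state, done)
        self.priorities = []  # one priority per stored sample

    def add(self, experience):
        # New samples get the current maximum priority, so each is seen at least once
        if len(self.memory) >= self.capacity:
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(experience)
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, batch_size):
        # Sampling probability proportional to priority raised to alpha
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.memory), batch_size, p=probs)
        return [self.memory[i] for i in indices], indices

    def update_priorities(self, indices, losses):
        # Priority = loss + small constant, so no sample has zero priority
        for i, loss in zip(indices, losses):
            self.priorities[i] = float(loss) + 1e-5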

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:151106581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:170610295 http://arxiv.org/abs/1706.10295

Prioritised replay: Schaul T, Quan J, Antonoglou I, et al (2016). Prioritized Experience Replay. arXiv:151105952 http://arxiv.org/abs/1511.05952

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

Code is available at:

https://github.com/MichaelAllen1966/learninghospital/blob/master/05_prioritised_replay_noisy_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding prioritised replay leads to more rapid learning.

Noisy Bootstrapped Aggregation (‘Bagging’) Duelling Double Deep Q Learning – controlling a simple hospital bed system

In this example we take the previous example of Bagging Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network reduces, as the noise scale is a parameter of the network that is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

For a presentation on Q, see https://youtu.be/o22P5rCAAEQ

Key DQN components

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

Each network is trained from the same memory, but the networks have different starting weights and are trained on different bootstrap samples from that memory. In this example the action is chosen at random from the actions recommended by the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to give some measure of uncertainty in the choice of action by looking at the distribution of actions recommended by the different nets. Bagging may also aid exploration during stages where the networks suggest different actions.
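
As a rough sketch of this ensemble action choice (illustrative only, assuming each network maps a state tensor to a vector of Q values; the notebook's own implementation may differ):

import random
import numpy as np

def choose_action_bagging(policy_nets, state, method='random_net'):
    """Illustrative action choice from an ensemble of trained networks.
    'random_net' follows the recommendation of one randomly chosen network
    (as described above); 'vote' takes the most common recommended action."""
    recommended = [int(net(state).argmax()) for net in policy_nets]
    if method == 'random_net':
        return random.choice(recommended)
    # Majority vote across the ensemble; the spread of 'recommended' also
    # gives a crude measure of uncertainty in the choice of action
    return int(np.bincount(recommended).argmax())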

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:160204621 http://arxiv.org/abs/1602.04621

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:170610295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

https://github.com/MichaelAllen1966/learninghospital/blob/master/07_noisy_bagging_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding noisy layers maintains or improves the performance of bagging DDQN, but without the need for epsilon-greedy exploration.

Noisy Duelling Double Deep Q Learning – controlling a simple hospital bed system

In this example we take the previous example of Duelling Double Q Learning and add ‘noisy layers’ to the estimation of advantage. Adding noise within the network is an alternative to epsilon-greedy exploration. As the performance of the system improves, the amount of noise in the network reduces, as the noise scale is a parameter of the network that is optimised through the normal network optimisation procedure.

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Noisy layers

Noisy layers are an alternative to epsilon-greedy exploration (here, we leave the epsilon-greedy code in the model, but set it to reduce to zero immediately after the period of fully random action choice).

For every weight in the layer we have a random value that we draw from the normal distribution. This random value is used to add noise to the output. The parameters for the extent of noise for each weight, sigma, are stored within the layer and get trained as part of the standard back-propagation.

A modification to normal noisy layers is to use layers with ‘factorized Gaussian noise’. This reduces the number of random numbers to be sampled (so is less computationally expensive). There are two random vectors, one with the size of the input, and the other with the size of the output. A random matrix is created by calculating the outer product of the two vectors.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:151106581 http://arxiv.org/abs/1511.06581

Noisy networks: Fortunato M, Azar MG, Piot B, et al. (2019) Noisy Networks for Exploration. arXiv:170610295 http://arxiv.org/abs/1706.10295

Code for the noisy layers comes from:

Lapan, M. (2020). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition. Packt Publishing.

Code structure

Code

https://github.com/MichaelAllen1966/learninghospital/blob/master/03_noisy_duelling_ddqn_simple_hospital_sim_1.ipynb

Results

Adding noisy layers maintains or improves the performance of duelling DQN, but without the need for epsilon-greedy exploration.

Bootstrapped Aggregation (‘Bagging’) Duelling Double Deep Q Learning – controlling a simple hospital bed system

We again use a variation on Deep Q Learning, to control staffed bed numbers in a simple hospital simulation. This follows on from Double Deep Q Learning and Duelling Double Deep Q Learning.

In this example we take the previous example of Duelling Double Q Learning and add ‘bagging’ where we have multiple Duelling Double Q Learning Networks (each with their own target network).

A simple hospital simulation model

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Duelling Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Bagging (Bootstrap Aggregation)

Each network is trained from the same memory, but the networks have different starting weights and are trained on different bootstrap samples from that memory. In this example the action is chosen at random from the actions recommended by the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to give some measure of uncertainty in the choice of action by looking at the distribution of actions recommended by the different nets. Bagging may also aid exploration during stages where the networks suggest different actions.

Code structure

Code

The code for this experiment:

https://github.com/MichaelAllen1966/learninghospital/blob/master/06_bagging_duelling_ddqn_simple_hospital_sim_1%20.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

days under capacity     54.0
days over capacity     306.0
average patients       497.0
average beds           540.0
% occupancy             91.9
dtype: float64

Bagging D3QN gives us the best performance and the most stable and consistent results we have seen so far. The downside is that more networks need to be trained, so this method is slower.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:160204621 http://arxiv.org/abs/1602.04621

Duelling Double Deep Q Network (D3QN) controlling a simple hospital bed system

Previously, we looked at using a Double Deep Q Network to manage beds in a simple hospital simulation.

Here we look at a refinement, a Duelling Double Deep Q Network, that can sometimes improve performance (see Wang et al, 2016, https://arxiv.org/abs/1511.06581)

Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which will model the state value. The other component models the advantage, the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). These are aggregated to produce Q for each action.

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

Key DQN components (common to both Double Deep Q Networks and Duelling Deep Q Networks)

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Code structure

Code

The code for this experiment may be found here:
https://github.com/MichaelAllen1966/learninghospital/blob/master/02_duelling_ddqn_simple_hospital_sim_1.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

Though performance is broadly similar to Double Deep Q Networks, I have found the Duelling version to be a little more consistent in performance from one training run to another.

References

Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:151106581 http://arxiv.org/abs/1511.06581

Double Deep Q Network (DDQN) – controlling a simple hospital bed system.

This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock.

  • Default arrivals = 50/day
  • Weekend arrival numbers are 50% average arrival numbers
  • Weekday arrival numbers are 120% average arrival numbers
  • Distribution of inter-arrival time is inverse exponential
  • Average length of stay is 7 days (default)
  • Distribution of length of stay is inverse exponential
  • The RL agent may request a change in bed numbers once a day (default)
  • The allowed bed change requests are -20, -10, 0, 10, 20
  • Bed changes take 2 days to occur (default)
  • The simulation is loaded with the average number of patients present

The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the penalty for each unoccupied bed or patient without a bed.

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy (how to link state to action; often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agent
  • On-policy: requires active learning from current decisions

Deep Q Networks for Reinforcement Learning

Q = The expected future rewards discounted over time. This is what we are trying to maximise.

The aim is to teach a network to take the current state observations and recommend the action with greatest Q.

Q is learned through the Bellman equation: the Q of any state and action is the immediate reward received plus the discounted maximum Q value over actions in the next state, where gamma is the discount rate.

$$Q(s,a)=r + \gamma \max_{a'} Q(s',a')$$

Key DQN components

General method for Q learning

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state, which are closely correlated because their input features are similar.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).

Code structure

Code

Deep Q Learning Experiment:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/01_ddqn_simple_hospital_sim_1.ipynb

The simple hospital bed simulation that the Deep Q Learning experiment uses:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/learning_hospital/simpy_envs/env_simple_hospital_bed_1.py

Results

References

van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:150906461 http://arxiv.org/abs/1509.06461

128. Parallel processing in Python

Sometimes we have functions, or complete models, that may be run in parallel across CPU cores. This may save significant time when we have access to computers with multiple cores. Here we use the joblib library to show how we can run functions in parallel (pip install joblib if not already installed).

Import libraries

import numpy as np
import time
from joblib import Parallel, delayed

Mimic something to be run in parallel

Here we will create a function that takes 1 second to run, to mimic a model or a complex function.

def my_slow_function(x):
    """A very slow function, which takes 1 second to double a number"""
    
    # A 1 second delay
    time.sleep(1)
    
    return x * 2

Running our function sequentially in a ‘for’ loop

# Get start time
start = time.time()
# Run functions 8 times with different input (using a list comprehension)
trial_output = [my_slow_function(i) for i in range(8)]
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print (f'Time taken: {time_taken:.1f} seconds')

OUT:

[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 8.0 seconds

Running our function in parallel using joblib

n_jobs is the maximum number of CPU cores to use. If set to -1, all available cores will be used.

# Get start time
start = time.time()
# Run functions 8 times with different input using joblib
trial_output = \
    Parallel(n_jobs=-1)(delayed(my_slow_function)(i) for i in range(8))
print(trial_output)
# Get time taken
time_taken = time.time() - start
# Print time taken
print (f'Time taken: {time_taken:.1f} seconds')


[0, 2, 4, 6, 8, 10, 12, 14]
Time taken: 1.3 seconds

That’s a good improvement in speed!

Checking pseudo-random number generation

Pseudo-random number generators, if not provided with a seed, use the computer clock to generate the seed. This means some methods of parallel processing will generate sets of random numbers that may be the same. By default joblib uses the loky backend, which prevents this occurring, but let’s check.

def numpy_random():
    """Generate a random number using NumPy"""
    return np.random.random()

Parallel(n_jobs=-1)(delayed(numpy_random)() for i in range(5))

Out:
[0.5268839074941227,
 0.12883536669358964,
 0.14084785209998263,
 0.4166795926896423,
 0.19189235808368665]
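
The numbers above differ between jobs, as expected. If fully reproducible results are needed across runs and backends, one option (a sketch, not part of the original example) is to pass an explicit seed to each job:

import numpy as np
from joblib import Parallel, delayed

def numpy_random_with_seed(seed):
    """Generate a random number from an explicitly seeded generator,
    so results are reproducible across runs and backends."""
    rng = np.random.default_rng(seed)
    return rng.random()

# Each parallel job gets its own seed
Parallel(n_jobs=-1)(delayed(numpy_random_with_seed)(seed) for seed in range(5))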


127. Feature expansion

Here we use survival on the Titanic to demonstrate how features may be expanded by looking for interaction between features. This frequently needs to be followed by feature selection to reduce the number of features (while keeping the important interactions).

Simple models such as logistic regression do not incorporate complex interactions between features. If two features produce more than an additive effect, this will not be fitted by logistic regression. In order to allow for feature interaction we need to add terms that create new features from the product of each pair of features.

When we use polynomial expansion of features, we create new features that are the product of two features. For example, if we had three features, A, B and C, a full polynomial expansion would produce the following extra features:

  • A.A, A.B, A.C
  • B.A, B.B, B.C
  • C.A, C.B, C.C

But we will reduce this in two ways:

  • Remove duplicate terms (e.g. A.B and B.A are the same, so we only need A.B)
  • Use the interaction_only argument to remove powers of single features (e.g. A.A, B.B)
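
As a small illustration of what interaction_only produces (using a made-up three-feature array rather than the Titanic data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three illustrative features (columns A, B, C)
X_demo = np.array([[1, 3, 5],
                   [2, 4, 6]])

poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
X_demo_poly = poly.fit_transform(X_demo)

# Columns are now: A, B, C, A*B, A*C, B*C (duplicates and squared terms removed)
print(X_demo_poly)
# [[ 1.  3.  5.  3.  5. 15.]
#  [ 2.  4.  6.  8. 12. 24.]]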

A danger of polynomial expansion is that the model may start to over-fit to the training data. This may be dealt with in one (or both) of two ways:

  • Increase the regularisation strength in the model (reduce the value of C in the logistic regression model)
  • Use feature selection to pick only the most important features (which now may include polynomial features)

The methods described here build on previous notebooks, notably on logistic regression, k-fold, sampling, regularisation, and model-based forward feature selection.

Load modules

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

Download data

Run the following code if data for Titanic survival has not been previously downloaded.

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data

data = pd.read_csv('data/processed_data.csv')
# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

# Split data into two DataFrames
X_df = data.drop('Survived',axis=1)
y_df = data['Survived']

# Convert to NumPy arrays
X = X_df.values
y = y_df.values
# Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_df)

Let’s look at the shape of our data sets (the first value is the number of samples, and the second value is the number of features):

print ('Shape of X:', X.shape)
print ('Shape of X_poly:', X_poly.shape)
Shape of X: (891, 24)
Shape of X_poly: (891, 300)

Woah – we’ve gone from 24 features to 300! But are they any use?

Training and testing normal and expanded models with varying regularisation

The following code:

  • Defines a list of regularisation values (lower values lead to greater regularisation)
  • Sets up lists to hold results for each k-fold split
  • Starts a loop for each regularisation value, and loops through:
    • Print regularisation level (to show progress)
    • Sets up lists to record replicates from k-fold stratification
    • Sets up the k-fold splits using sklearns StratifiedKFold method
    • Trains two logistic regression models (regular and polynomial), and tests them, for each k-fold split
    • Adds each k-fold training/test accuracy to the lists
  • Record average accuracy from k-fold stratification (so each regularisation level has one accuracy result recorded for training and test sets)

We pass the regularisation strength to the model during fitting using the argument C.

# Define function to standardise data

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std
# Training and testing normal and polynomial models

reg_values = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

# Set up lists to hold results
training_acc_results = []
test_acc_results = []
training_acc_results_poly = []
test_acc_results_poly = []

# Set up splits
skf = StratifiedKFold(n_splits = 5)
skf.get_n_splits(X, y)
skf.get_n_splits(X_poly, y)

# Set up model type

for reg in reg_values:
    # Show progress
    print(reg, end=' ')
    
    # Set up lists for results for each of k splits
    training_k_results = []
    test_k_results = []
    training_k_results_poly = []
    test_k_results_poly = []
    # Loop through the k-fold splits
    for train_index, test_index in skf.split(X, y):
        
        # Normal (non-polynomial model)
        
        # Get X and Y train/test
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Standardise X data
        X_train_std, X_test_std = standardise_data(X_train, X_test)
        # Fit model with regularisation (C)
        model = LogisticRegression(C=reg, solver='lbfgs', max_iter=1000)
        model.fit(X_train_std,y_train)
        # Predict training and test set labels
        y_pred_train = model.predict(X_train_std)
        y_pred_test = model.predict(X_test_std)
        # Calculate accuracy of training and test sets
        accuracy_train = np.mean(y_pred_train == y_train)
        accuracy_test = np.mean(y_pred_test == y_test)
        # Record accuracy for each k-fold split
        training_k_results.append(accuracy_train)
        test_k_results.append(accuracy_test)
        
        # Polynomial model (same as above except use X with polynomial features)
        
        # Get X and Y train/test
        X_train, X_test = X_poly[train_index], X_poly[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Standardise X data
        X_train_std, X_test_std = standardise_data(X_train, X_test)
        # Fit model with regularisation (C)
        model = LogisticRegression(C=reg, solver='lbfgs', max_iter=1000)
        model.fit(X_train_std,y_train)
        # Predict training and test set labels
        y_pred_train = model.predict(X_train_std)
        y_pred_test = model.predict(X_test_std)
        # Calculate accuracy of training and test sets
        accuracy_train = np.mean(y_pred_train == y_train)
        accuracy_test = np.mean(y_pred_test == y_test)
        # Record accuracy for each k-fold split
        training_k_results_poly.append(accuracy_train)
        test_k_results_poly.append(accuracy_test)
        
    # Record average accuracy for each k-fold split
    training_acc_results.append(np.mean(training_k_results))
    test_acc_results.append(np.mean(test_k_results))
    training_acc_results_poly.append(np.mean(training_k_results_poly))
    test_acc_results_poly.append(np.mean(test_k_results_poly))

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

# Define data for chart
x = reg_values
y1 = training_acc_results
y2 = test_acc_results
y3 = training_acc_results_poly
y4 = test_acc_results_poly

# Set up figure
fig = plt.figure(figsize=(5,5))
ax1 = fig.add_subplot(111)

# Plot training set accuracy
ax1.plot(x, y1,
        color = 'k',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='k',
        markeredgecolor='k',
        label  = 'Training set accuracy')

# Plot test set accuracy
ax1.plot(x, y2,
        color = 'r',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='r',
        markeredgecolor='r',
        label  = 'Test set accuracy')

# Plot training set accuracy (poly model)
ax1.plot(x, y3,
        color = 'g',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='g',
        markeredgecolor='g',
        label  = 'Training set accuracy (poly)')

# Plot test set accuracy (poly model)
ax1.plot(x, y4,
        color = 'b',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='b',
        markeredgecolor='b',
        label  = 'Test set accuracy (poly)')

# Customise axes
ax1.grid(True, which='both')
ax1.set_xlabel('Regularisation\n(lower value = greater regularisation)')
ax1.set_ylabel('Accuracy')
ax1.set_xscale('log')

# Add legend
ax1.legend()

# Show plot
plt.show()

Show the best test set accuracy measured:

best_test_non_poly = np.max(test_acc_results)
best_test_poly = np.max(test_acc_results_poly)
print ('Best accuracy for non-poly and poly were {0:0.3f} and {1:0.3f}'.format(
    best_test_non_poly, best_test_poly))
Best accuracy for non-poly and poly were 0.789 and 0.816

Note in the above figure that:

  • Polynomial expansion has increased the accuracy of both training and test sets. Test set accuracy was increased by over 2%
  • We do not know which polynomial terms are most useful (below we will use feature reduction to identify those)
  • The polynomial X data suffers from more over-fitting than the non-polynomial set (there is a larger difference between training and test set accuracies)

Feature reduction after feature expansion

We will revisit the code we used previously to pick the best features.

In the previous method we ranked all features in their ability to improve model performance. Here, because there are many more features we will look at the influence of the top 20 (and if we see model performance is still increasing with additional features we could come back and change that limit).

We will amend the previous code as well to use simple accuracy (rather than the ROC Area Under Curve in our previous example).

# Transfer polynomial X into a pandas DataFrame (as method use Pandas)
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names())

# Create list to store accuracies and chosen features
accuracy_by_feature_number = []
chosen_features = []

# Initialise chosen features list and run tracker
available_features = list(poly.get_feature_names())
run = 0
number_of_features = len(list(X))

# Loop through feature list to select next feature
maximum_features_to_choose = 20

for i in range(maximum_features_to_choose):

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, maximum_features_to_choose))
    
    # Reset best feature and accuracy
    best_result = 0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = chosen_features.copy()
        # Create a list of features from features already chosen + 1 new feature
        features_to_use.append(feature)
        # Get data for features, and convert to NumPy array
        X_np = X_poly_df[features_to_use].values
        
        # Set up lists to hold results for each selected features
        test_accuracy_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Standardise X data
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
                        
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
            test_accuracy_results.append(accuracy_test)
          
        # Get average result from all k-fold splits
        feature_accuracy = np.mean(test_accuracy_results)
    
        # Update chosen feature and result if this feature is a new best
        if feature_accuracy > best_result:
            best_result = feature_accuracy
            best_feature = feature
    
    # k-fold splits are complete    
    # Add best accuracy to record of accuracy by feature number
    accuracy_by_feature_number.append(best_result)
    chosen_features.append(best_feature)
    available_features.remove(best_feature)

# Put results in DataFrame
results = pd.DataFrame()
results['feature to add'] = chosen_features
results['accuracy'] = accuracy_by_feature_number
results
    feature to add    accuracy
0   x1 x10            0.792352
1   x0 x2             0.818170
2   x13 x19           0.824924
3   x3 x12            0.828320
4   x4 x16            0.830568
5   x0 x3             0.831716
6   x5 x7             0.835081
7   x5 x18            0.836211
8   x19               0.837335
9   x2 x16            0.838458
10  x8 x19            0.838464
11  x2 x19            0.840712
12  x11 x18           0.840718
13  x3 x16            0.841835
14  x6                0.841835
15  x14               0.841835
16  x22               0.841835
17  x0 x6             0.841835
18  x0 x14            0.841835
19  x0 x22            0.841835

Plot results:

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, maximum_features_to_choose+1))

plt.plot(chart_x, accuracy_by_feature_number,
        label = 'Accuracy')

plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.show()

Neat! By selecting our best features from our expanded model we now have a little over 84% accuracy! It looks like we need about 15 features for our optimum model, nearly all of which are polynomial terms. Note that sklearn’s polynomial method outputs feature names in relation to the original X index. Our ‘best’ feature is a product of X1 and X10. Let’s see what those are:

X_index_names = list(X_df)
print ('X1:',X_index_names[1])
print ('X10:',X_index_names[10])
X1: Age
X10: male

So looking at just the ages of male passengers is the best single predictor of survival.
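As a quick illustration (a sketch, not part of the original feature-selection loop), the same interaction term can be built directly from the original DataFrame X_df:

# Build the Age * male interaction directly (equivalent to the 'x1 x10'
# polynomial feature identified above): Age for male passengers, 0 otherwise
age_male_interaction = X_df['Age'] * X_df['male']
print (age_male_interaction.head())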

126. Feature selection 3 (model backward elimination)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally remove features that least reduce model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

And we will assess performance with a Receiver Operator Characteristic (ROC) curve. See here.
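As a brief reminder of how ROC AUC is calculated with sklearn (a minimal sketch using made-up labels and predicted probabilities; the code later in this notebook uses the same roc_curve and auc functions):

import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# False positive rate and true positive rate at each probability threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Area under the ROC curve (1.0 = perfect, 0.5 = no better than chance)
roc_auc = auc(fpr, tpr)
print ('ROC AUC: {0:0.3f}'.format(roc_auc))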

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Backward feature elimination

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The backward elimination method:

  • Keeps a list of the features removed so far (in order of removal)
  • Keeps a list of features still in use
  • Loops through the features still in use:
    • Calculates the drop in ROC AUC when each feature is removed (using stratified k-fold validation)
    • Selects the feature whose removal causes the smallest drop in performance
    • Adds that feature to the removed list and takes it out of the list of features in use

This method uses a while loop to keep removing features until only one remains. An alternative would be to use a for loop with a maximum number of features to remove.

# Create list to store accuracies and chosen features
roc_auc_by_feature_number = []
chosen_features = []

# Initialise list of available features and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Create initial reference performance
reference_auc = 1.0 # used to compare reduction in AUC

# Loop through feature list to select next feature
while len(available_features)> 1:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features-1))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and best (smallest) drop in performance
    best_result = 1.0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = available_features.copy()
        # Create a list of features to use by removing 1 feature
        features_to_use.remove(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up list to hold results from each k-fold split
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Standardise X data
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        # We are looking for the smallest drop in performance
        drop_in_performance = reference_auc - feature_auc
        if drop_in_performance < best_result:
            best_result = drop_in_performance
            best_feature = feature
            best_auc = feature_auc
                
    # k-fold splits are complete    
    # Add best AUC to record of AUC by feature number
    roc_auc_by_feature_number.append(best_auc)
    chosen_features.append(best_feature)    
    available_features.remove(best_feature)
    reference_auc = best_auc

# Add last remaining feature
chosen_features += available_features
roc_auc_by_feature_number.append(0)
    
# Put results in DataFrame
# Reverse order of lists with [::-1] so best features first
results = pd.DataFrame()
results['feature removed'] = chosen_features[::-1]
results['ROC AUC'] = roc_auc_by_feature_number[::-1]

Show results

The table is now in the order of preferred features, though our code worked in the reverse direction, incrementally removing the feature that made the least difference to the model.

results
    feature removed        ROC AUC
0   male                   0.000000
1   Pclass                 0.766733
2   Age                    0.833036
3   SibSp                  0.843055
4   Embarked_S             0.848686
5   CabinNumberImputed     0.853125
6   CabinLetter_C          0.855046
7   CabinLetter_missing    0.855504
8   CabinLetterImputed     0.855506
9   Embarked_missing       0.855533
10  EmbarkedImputed        0.855479
11  CabinLetter_T          0.855479
12  CabinLetter_F          0.855159
13  CabinLetter_B          0.855124
14  Embarked_Q             0.854781
15  Embarked_C             0.853826
16  Fare                   0.853760
17  CabinNumber            0.853030
18  CabinLetter_E          0.852737
19  CabinLetter_A          0.852083
20  AgeImputed             0.851654
21  CabinLetter_D          0.850486
22  CabinLetter_G          0.848673
23  Parch                  0.847432

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features removed')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could eliminate all but 5-6 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.
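For example, a reduced feature set could be taken straight from the top of the results table (a sketch; results and X are the DataFrames defined in this section, with the best features listed first):

# Keep the six features found most important by backward elimination
number_of_features_to_keep = 6
reduced_features = list(results['feature removed'][:number_of_features_to_keep])
X_reduced = X[reduced_features]
print (reduced_features)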

125. Feature selection 2 (model forward selection)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally add features that most increase model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will go through the following steps:

  • Download and save pre-processed data
  • Split data into features (X) and label (y)
  • Loop through features to select the feature that most increases ROC AUC
  • Plot results

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

And we will assess performance with a Receiver Operator Characteristic (ROC) curve. See here.
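As a brief illustration of stratified k-fold splitting (a minimal sketch with a small made-up data set; the code below uses the same StratifiedKFold class with five splits):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Small made-up data set: 8 samples, 2 features, balanced binary labels
X_example = np.arange(16).reshape(8, 2)
y_example = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each fold keeps the class proportions the same as the whole data set
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X_example, y_example):
    print ('Train:', train_index, 'Test:', test_index)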

Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb:

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Forward feature selection

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The forward selection method:

  • Keeps a list of selected features
  • Keeps a list of features still available for selection
  • Loops through available features:
    • Calculates added value for each feature (using stratified k-fold validation)
    • Selects feature that adds most value
    • Adds selected feature to selected features list and removes it from available features list

This method uses a while loop to keep exploring features until no more are available. An alternative would be to use a for loop with a maximum number of features to select.

# Create list to store accuracies and chosen features
roc_auc_by_feature_number = []
chosen_features = []

# Initialise list of available features and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Loop through feature list to select next feature
while len(available_features)> 0:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and best ROC AUC
    best_result = 0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = chosen_features.copy()
        # Create a list of features from features already chosen + 1 new feature
        features_to_use.append(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up list to hold results from each k-fold split
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Standardise X data
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        if feature_auc > best_result:
            best_result = feature_auc
            best_feature = feature
    
    # k-fold splits are complete    
    # Add best AUC to record of AUC by feature number
    roc_auc_by_feature_number.append(best_result)
    chosen_features.append(best_feature)
    available_features.remove(best_feature)

# Put results in DataFrame
results = pd.DataFrame()
results['feature to add'] = chosen_features
results['ROC AUC'] = roc_auc_by_feature_number
results
    feature to add         ROC AUC
0   male                   0.766733
1   Pclass                 0.833036
2   Age                    0.843055
3   SibSp                  0.848686
4   Embarked_S             0.853125
5   CabinLetter_E          0.855740
6   CabinNumberImputed     0.856158
7   CabinLetter_T          0.856160
8   CabinLetter_D          0.856167
9   CabinLetter_F          0.856307
10  EmbarkedImputed        0.856121
11  Embarked_missing       0.856094
12  CabinLetterImputed     0.855637
13  CabinLetter_missing    0.855744
14  Fare                   0.855302
15  CabinLetter_A          0.854307
16  CabinLetter_B          0.853215
17  CabinNumber            0.852896
18  Embarked_C             0.851377
19  Embarked_Q             0.851324
20  CabinLetter_C          0.849972
21  AgeImputed             0.848673
22  CabinLetter_G          0.847432
23  Parch                  0.845842

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could use just 5-7 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.

Note that the accuracy (ROC AUC) of the model appears to decline with a greater number of features.
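For example, a model restricted to the first seven selected features could be checked as follows (a sketch re-using the results DataFrame, X, y and the standardise_data function defined above):

# Keep the first seven features chosen by forward selection
number_of_features_to_keep = 7
reduced_features = list(results['feature to add'][:number_of_features_to_keep])
X_np = X[reduced_features].values
y_np = y.values

# Stratified k-fold cross-validation with the reduced feature set
accuracy_results = []
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X_np, y_np):
    X_train, X_test = X_np[train_index], X_np[test_index]
    y_train, y_test = y_np[train_index], y_np[test_index]
    X_train_std, X_test_std = standardise_data(X_train, X_test)
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_std, y_train)
    accuracy_results.append(np.mean(model.predict(X_test_std) == y_test))

print ('Mean accuracy with {} features: {:.3f}'.format(
    number_of_features_to_keep, np.mean(accuracy_results)))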