In this example we take the previous example of Duelling Double Q Learning and add ‘bagging’: we train multiple Duelling Double Q Learning networks, each with its own target network.
A simple hospital simulation model
This is an example of a simple hospital bed model where a Reinforcement learning (RL) agent has to learn how to manage the bed stock:
• Default arrivals = 50/day
• Weekend arrival numbers are 50% of average arrival numbers
• Weekday arrival numbers are 120% of average arrival numbers
• Distribution of inter-arrival time is inverse exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is inverse exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present
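The arrival settings above can be checked with a little arithmetic. The following is a minimal sketch rather than the simulation's actual code: the function and constant names are hypothetical, the day coding (0-4 for weekdays, 5-6 for the weekend) is an assumption, and the 'inverse exponential' distributions are interpreted as exponentially distributed times.

```python
import random

DEFAULT_ARRIVALS_PER_DAY = 50   # default average arrivals per day
WEEKDAY_FACTOR = 1.2            # weekdays: 120% of average
WEEKEND_FACTOR = 0.5            # weekends: 50% of average
MEAN_LENGTH_OF_STAY = 7         # days (default)


def arrivals_per_day(day_of_week):
    """Mean arrivals for a day; 0-4 = weekday, 5-6 = weekend (assumed coding)."""
    factor = WEEKEND_FACTOR if day_of_week >= 5 else WEEKDAY_FACTOR
    return DEFAULT_ARRIVALS_PER_DAY * factor


def sample_inter_arrival_time(day_of_week, rng=random):
    """Sample the gap (in days) to the next arrival, assuming an exponential distribution."""
    return rng.expovariate(arrivals_per_day(day_of_week))


def sample_length_of_stay(rng=random):
    """Sample a length of stay (in days), assuming an exponential distribution."""
    return rng.expovariate(1 / MEAN_LENGTH_OF_STAY)


# Five weekdays at 60/day and two weekend days at 25/day average out to the default
weekly_mean = sum(arrivals_per_day(day) for day in range(7)) / 7
print(weekly_mean)
```

The weekday and weekend factors are consistent with the default: (5 x 60 + 2 x 25) / 7 gives 50 arrivals per day on average.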
The RL agent must learn to maximise the long term reward (return). The maximum reward = 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without bed.
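One way such a reward could look is sketched below. This is an illustration under stated assumptions, not the simulation's actual reward function: the name `reward` is hypothetical, and the sketch assumes an equal penalty of 1 for each empty bed and for each patient without a bed.

```python
def reward(beds, patients):
    """Return 0 when beds exactly match patients, otherwise a negative penalty.

    Assumes (for illustration only) a penalty of 1 per empty bed
    and 1 per patient without a bed.
    """
    free_beds = max(beds - patients, 0)
    patients_without_bed = max(patients - beds, 0)
    return -(free_beds + patients_without_bed)


print(reward(540, 540))  # beds match patients: best possible reward
print(reward(540, 500))  # 40 empty beds
print(reward(500, 540))  # 40 patients without a bed
```

With this shape the best the agent can do at any step is a reward of 0, so maximising return is the same as minimising the accumulated mismatch between beds and patients.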
Reinforcement learning introduction
- Trial and error search
- Receiving and maximising reward (often delayed)
- Linking state -> action -> reward
- Must be able to sense something of their environment
- Involves uncertainty in sensing and linking action to reward
- Learning -> improved choice of actions over time
- All models find a way to balance best predicted action vs. exploration
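A common way to balance the best predicted action against exploration is an epsilon-greedy rule. The sketch below is one illustrative option and is not necessarily the method used in this example; `epsilon_greedy` and its parameters are hypothetical names.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the action with highest Q."""
    if rng.random() < epsilon:
        # Explore: choose any action index uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: choose the index of the largest predicted Q
    return max(range(len(q_values)), key=lambda action: q_values[action])


greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
print(greedy_choice)
```

Epsilon is typically started high and reduced over time, so the agent explores widely early on and relies more on its predictions later.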
Elements of RL
- Environment: all observable and unobservable information relevant to us
- Observation: sensing the environment
- State: the perceived (or perceivable) environment
- Agent: senses environment, decides on action, receives and monitors rewards
- Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
- Policy (how to link state to action; often based on probabilities)
- Reward signal: aim is to accumulate maximum reward over time
- Value function of a state: prediction of likely/possible long-term reward
- Q: prediction of likely/possible long-term reward of an action
- Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
- Model (optional): a simulation of the environment
Types of model
- Model-based: have model of environment (e.g. a board game)
- Model-free: used when environment not fully known
- Policy-based: identify best policy directly
- Value-based: estimate value of a decision
- Off-policy: can learn from historic data from other agents
- On-policy: requires active learning from current decisions
Duelling Deep Q Networks for Reinforcement Learning
Q = The expected future rewards discounted over time. This is what we are trying to maximise.
The aim is to teach a network to take the current state observations and recommend the action with greatest Q.
Duelling is very similar to Double DQN, except that the policy net splits into two. One component reduces to a single value, which models the state value. The other component models the advantage: the difference in Q between different actions (the mean value is subtracted from all values, so that the advantage always sums to zero). The two components are aggregated to produce Q for each action.
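The aggregation step can be illustrated with plain numbers. This is a minimal sketch using Python lists rather than network layers; `duelling_q` is a hypothetical name and the input values are made up for illustration.

```python
def duelling_q(state_value, raw_advantages):
    """Combine a state value and per-action advantages into Q for each action."""
    # Subtract the mean so that the advantages sum to zero
    mean_advantage = sum(raw_advantages) / len(raw_advantages)
    advantages = [a - mean_advantage for a in raw_advantages]
    # Q for each action is the state value plus that action's advantage
    return [state_value + a for a in advantages]


# Five actions, as in the bed model (-20, -10, 0, 10, 20)
q_values = duelling_q(state_value=-10.0, raw_advantages=[1.0, 2.0, 3.0, 4.0, 5.0])
print(q_values)
```

Because the centred advantages sum to zero, the mean of the resulting Q values equals the state value.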
Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best available next action), where gamma is the discount rate.
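Written out, the relationship described above takes the following standard form. The symbols are notation assumed here: s and a are the current state and action, r is the immediate reward, s' is the next state, a' ranges over the actions available in s', and gamma is the discount rate.

```latex
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
```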
Key DQN components
General method for Q learning:
The overall aim is to create a neural network that predicts Q. Improvement comes from greater accuracy in predicting the currently understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).
Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.
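The two update styles mentioned above (infrequent copying and gradual updating) can be sketched with weights held in plain lists. The function names and the `tau` mixing parameter are assumptions for illustration; a real implementation would operate on the networks' parameter tensors.

```python
def hard_update(policy_weights, target_weights, step, sync_every=1000):
    """Copy the policy weights to the target only every sync_every steps."""
    if step % sync_every == 0:
        return list(policy_weights)
    return target_weights


def soft_update(policy_weights, target_weights, tau=0.01):
    """Move the target weights a small fraction tau towards the policy weights."""
    return [tau * p + (1 - tau) * t
            for p, t in zip(policy_weights, target_weights)]


print(hard_update([1.0, 2.0], [0.0, 0.0], step=999))   # no copy yet
print(hard_update([1.0, 2.0], [0.0, 0.0], step=1000))  # copied across
print(soft_update([1.0], [0.0], tau=0.5))
```

Either way the target changes more slowly than the policy network, which is what damps the feedback loop described above.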
Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state; the two are closely correlated because their input features are so similar.
The policy network is used to select action (action with best predicted Q) when playing the game.
When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
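The training target described above can be sketched for a single transition. This is an illustrative sketch with Q values passed in as plain lists; the function name, the `done` flag for terminal states and the example numbers are assumptions, not taken from the experiment's code.

```python
def double_dqn_target(reward, policy_q_next, target_q_next, gamma=0.99, done=False):
    """Target Q for the action taken, using both networks' views of the next state.

    policy_q_next: Q values for the next state from the policy network
    target_q_next: Q values for the next state from the target network
    """
    if done:
        # No future reward after a terminal state (assumed convention)
        return reward
    # The policy network selects the best next action...
    best_action = max(range(len(policy_q_next)), key=lambda a: policy_q_next[a])
    # ...and the target network supplies the Q value of that action
    return reward + gamma * target_q_next[best_action]


target = double_dqn_target(
    reward=-5.0,
    policy_q_next=[-3.0, -1.0, -2.0],
    target_q_next=[-4.0, -2.0, -0.5],
    gamma=0.9,
)
print(target)
```

Note that the target network's own best action (index 2 here) is ignored: the action is chosen by the policy network and only valued by the target network.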
Bagging (Bootstrap Aggregation)
Each network is trained from the same memory, but has different starting weights and is trained on different bootstrap samples from that memory. In this example the action is chosen at random from those recommended by the networks (an alternative could be to take the most common action recommended by the networks, or an average output). This bagging method may also be used to give some measure of uncertainty of action, by looking at the distribution of actions recommended by the different nets. Bagging may also aid exploration during stages where the networks are suggesting different actions.
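The ideas in this paragraph can be sketched without any networks, by working with the actions each net would recommend. All function names here are hypothetical, and the `agreement` measure is one simple assumption about how disagreement between nets could be summarised.

```python
import random
from collections import Counter


def bootstrap_sample(memory, batch_size, rng=random):
    """Draw a training batch from memory with replacement (a bootstrap sample)."""
    return rng.choices(memory, k=batch_size)


def random_member_action(actions_per_net, rng=random):
    """Choose at random among the actions recommended by the networks."""
    return rng.choice(actions_per_net)


def majority_action(actions_per_net):
    """Alternative: take the most commonly recommended action."""
    return Counter(actions_per_net).most_common(1)[0][0]


def agreement(actions_per_net):
    """Fraction of networks backing the most common action (1.0 = all agree)."""
    top_count = Counter(actions_per_net).most_common(1)[0][1]
    return top_count / len(actions_per_net)


# Bed-change actions recommended by five networks for the same state
suggested = [10, 10, 0, 10, 20]
print(majority_action(suggested), agreement(suggested))
```

When agreement is low, choosing a random member's action naturally produces more varied behaviour, which is how bagging can aid exploration.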
The code for this experiment:
The simple hospital bed simulation that the Deep Q Learning experiment uses:
days under capacity     54.0
days over capacity     306.0
average patients       497.0
average beds           540.0
% occupancy             91.9
dtype: float64
Bagging D3QN (Duelling Double Deep Q Network) gives us the best performance and the most stable and consistent results we have seen so far. The downside is that more networks need to be trained, so this method is slower.
Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461
Bagging: Osband I, Blundell C, Pritzel A, et al. (2016) Deep Exploration via Bootstrapped DQN. arXiv:1602.04621 http://arxiv.org/abs/1602.04621