This is an example of a simple hospital bed model in which a reinforcement learning (RL) agent has to learn how to manage the bed stock.
- Default arrivals = 50/day
- Weekend arrival numbers are 50% of the average arrival numbers
- Weekday arrival numbers are 120% of the average arrival numbers
- Distribution of inter-arrival time is exponential
- Average length of stay is 7 days (default)
- Distribution of length of stay is exponential
- The RL agent may request a change in bed numbers once a day (default)
- The allowed bed change requests are -20, -10, 0, 10, 20
- Bed changes take 2 days to occur (default)
- The simulation is loaded with the average number of patients present
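The settings above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation's actual code: all names (`daily_arrival_rate`, `BedStock`, and so on) are hypothetical, the "inverse exponential" distributions are read as ordinary exponential distributions, days 5 and 6 of each week are assumed to be the weekend, and the starting patient count is assumed to be arrivals per day multiplied by length of stay.

```python
import random
from collections import deque

ARRIVALS_PER_DAY = 50            # default average arrivals per day
WEEKDAY_FACTOR = 1.2             # weekdays: 120% of the average
WEEKEND_FACTOR = 0.5             # weekends: 50% of the average
MEAN_LOS_DAYS = 7                # default average length of stay
ALLOWED_BED_CHANGES = (-20, -10, 0, 10, 20)
BED_CHANGE_DELAY_DAYS = 2        # default delay before a change takes effect


def daily_arrival_rate(day):
    """Mean arrivals on a day (assumption: days 5 and 6 of each week are the weekend)."""
    factor = WEEKEND_FACTOR if day % 7 >= 5 else WEEKDAY_FACTOR
    return ARRIVALS_PER_DAY * factor


def sample_inter_arrival_days(day, rng):
    """Time until the next arrival, in days (assumed exponential)."""
    return rng.expovariate(daily_arrival_rate(day))


def sample_length_of_stay_days(rng):
    """Length of stay, in days (assumed exponential with mean MEAN_LOS_DAYS)."""
    return rng.expovariate(1 / MEAN_LOS_DAYS)


class BedStock:
    """Bed count with a fixed delay between a request and its effect."""

    def __init__(self, beds):
        self.beds = beds
        # One queued change per day of delay; zeros mean "no change pending".
        self.pending = deque([0] * BED_CHANGE_DELAY_DAYS)

    def step(self, requested_change):
        """Advance one day: apply the change now due, then queue the new request."""
        if requested_change not in ALLOWED_BED_CHANGES:
            raise ValueError(f"bed change must be one of {ALLOWED_BED_CHANGES}")
        self.beds += self.pending.popleft()
        self.pending.append(requested_change)
        return self.beds


# Assumed starting load: average arrivals per day x average length of stay.
initial_patients = ARRIVALS_PER_DAY * MEAN_LOS_DAYS
```

Note that five weekdays at 120% and two weekend days at 50% average out to 100%, so the weekly mean stays at the default arrival rate.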
The RL agent must learn to maximise the long-term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss incurred for each unoccupied bed or each patient without a bed.
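One way to read this reward is as the negative mismatch between beds and patients. The sketch below assumes that an empty bed and an unbedded patient are penalised equally, one unit each; the text does not state the weighting, and the function name is hypothetical.

```python
def bed_reward(beds, patients):
    """Reward is 0 when beds match patients, otherwise negative.

    Assumption: each unoccupied bed and each patient without a bed
    costs the same, one unit.
    """
    return -abs(beds - patients)
```

With this definition, 360 beds for 350 patients and 340 beds for 350 patients both give a reward of -10.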
Reinforcement learning introduction
- Trial and error search
- Receiving and maximising reward (often delayed)
- Linking state -> action -> reward
- The agent must be able to sense something of its environment
- Involves uncertainty in sensing and linking action to reward
- Learning -> improved choice of actions over time
- All models find a way to balance best predicted action vs. exploration
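A common way to balance the best predicted action against exploration is an epsilon-greedy rule. The sketch below is one illustrative option rather than the method used in this experiment; the function name and arguments are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best predicted one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Epsilon is typically started high, so the agent explores widely, and reduced as the predictions improve.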
Elements of RL
- Environment: all observable and unobservable information relevant to us
- Observation: sensing the environment
- State: the perceived (or perceivable) environment
- Agent: senses environment, decides on action, receives and monitors rewards
- Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
- Policy: how to link state to action (often based on probabilities)
- Reward signal: aim is to accumulate maximum reward over time
- Value function of a state: prediction of likely/possible long-term reward
- Q: prediction of likely/possible long-term reward of taking an action in a given state
- Advantage: how much better or worse the Q of an action is than the value of the state it is taken in (weighted by the policy, advantages sum to zero over all actions)
- Model (optional): a simulation of the environment
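The relationship between state value, Q and advantage can be shown with a small worked example. It assumes the usual definitions: the value of a state is the policy-weighted average of its Q values, and the advantage of an action is its Q minus that value. The function name and numbers are illustrative only.

```python
def value_and_advantages(q_values, policy_probs):
    """Return the state value and the per-action advantages.

    Assumes V(s) = sum over actions of pi(a|s) * Q(s, a)
    and A(s, a) = Q(s, a) - V(s).
    """
    value = sum(p * q for p, q in zip(policy_probs, q_values))
    return value, [q - value for q in q_values]


q_values = [-8.0, -4.0, 0.0]      # hypothetical Q for three actions
policy_probs = [0.25, 0.5, 0.25]  # hypothetical policy over those actions
value, adv = value_and_advantages(q_values, policy_probs)
weighted_sum = sum(p * a for p, a in zip(policy_probs, adv))
```

Here the state value is -4, the advantages are -4, 0 and 4, and their policy-weighted sum is zero.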
Types of model
- Model-based: have model of environment (e.g. a board game)
- Model-free: used when environment not fully known
- Policy-based: identify best policy directly
- Value-based: estimate value of a decision
- Off-policy: can learn from historic data generated by another agent
- On-policy: requires active learning from current decisions
Deep Q Networks for Reinforcement Learning
Q = The expected future rewards discounted over time. This is what we are trying to maximise.
The aim is to teach a network to take the current state observations and recommend the action with greatest Q.
Q is learned through the Bellman equation: the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (the Q of the best action available there):
Q(s, a) = r + gamma * max_a' Q(s', a')
where gamma is the discount rate.
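As a worked example of the Bellman update, the target for Q is the immediate reward plus gamma times the largest Q predicted for the next state. The sketch below uses made-up numbers and a hypothetical function name; the `done` flag for terminal states is an added assumption.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Target Q = reward + gamma * max Q of the next state.

    Assumption: if the episode has ended there is no next state,
    so the target is just the immediate reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Reward of -10 now; the best next action is predicted to be worth -20.
target = bellman_target(-10.0, [-40.0, -20.0, -30.0], gamma=0.5)
```

With these numbers the target is -10 + 0.5 * -20 = -20.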
Key DQN components
General method for Q learning
The overall aim is to create a neural network that predicts Q. Improvement comes from greater accuracy in predicting the ‘current’ understood Q, and from revealing more about Q as knowledge is gained (some rewards are only discovered after time).
Target networks are used to stabilise training, and are only updated at intervals. Changing the Q value for one state may also change the predicted Q values of closely related states (i.e. states close to the one we are in at the time), and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. The target network, which is used to assess Q, is updated only infrequently (or gradually), so it does not have this instability problem.
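The two update styles mentioned here, infrequent and gradual, can be sketched with network weights represented as plain lists of floats. This is an illustration only: a real network would copy tensors or a state dictionary, and the function names, the `tau` mixing rate and the stand-in training step are all hypothetical.

```python
def hard_update(policy_weights):
    """Infrequent update: copy the policy weights into the target network."""
    return list(policy_weights)


def soft_update(target_weights, policy_weights, tau):
    """Gradual update: move the target a fraction tau towards the policy weights."""
    return [(1 - tau) * t + tau * p for t, p in zip(target_weights, policy_weights)]


SYNC_EVERY = 1000  # steps between hard updates (illustrative value)

policy = [0.0, 0.0]
target = hard_update(policy)
syncs = 0
for step in range(1, 3001):
    policy = [w + 0.001 for w in policy]  # stand-in for a gradient step
    if step % SYNC_EVERY == 0:
        target = hard_update(policy)
        syncs += 1
```

Between syncs the target stays fixed while the policy weights keep moving, which is what keeps the training targets stable.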
Double DQN contains two networks. This amendment to simple DQN decouples the training of Q for the current state from the target Q derived from the next state; the two are closely correlated because their input features are similar.
The policy network is used to select action (action with best predicted Q) when playing the game.
When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
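The Double DQN target described above can be sketched as follows: the policy network chooses the best next action, and the target network supplies the Q value for that action. The function name, the `done` flag and the numbers are illustrative assumptions, with each network's output for the next state given as a plain list.

```python
def double_dqn_target(reward, policy_next_q, target_next_q, gamma, done=False):
    """Target Q using the policy network to select and the target network to evaluate.

    Assumption: if the episode has ended, the target is just the reward.
    """
    if done:
        return reward
    # Action selection: best action according to the policy network.
    best_action = max(range(len(policy_next_q)), key=lambda a: policy_next_q[a])
    # Action evaluation: Q of that action according to the target network.
    return reward + gamma * target_next_q[best_action]


policy_next_q = [-3.0, -1.0, -2.0]  # policy network prefers action 1
target_next_q = [-4.0, -6.0, -2.0]  # target network values action 1 at -6
double_target = double_dqn_target(-5.0, policy_next_q, target_next_q, gamma=0.5)
simple_target = -5.0 + 0.5 * max(target_next_q)  # simple DQN, for comparison
```

In this example the Double DQN target is -8, while simple DQN, which takes the maximum of the target network's own values, gives -6.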
Deep Q Learning Experiment:
The simple hospital bed simulation that the Deep Q Learning experiment uses:
van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461