Duelling Double Deep Q Network (D3QN) controlling a simple hospital bed system

Previously, we looked at using a Double Deep Q Network to manage beds in a simple hospital simulation.

Here we look at a refinement, a Duelling Double Deep Q Network, that can sometimes improve performance (see Wang et al., 2016, https://arxiv.org/abs/1511.06581).

A Duelling network is very similar to Double DQN, except that the policy net splits into two components. One component reduces to a single value, which models the state value. The other component models the advantage, the difference in Q between different actions (the mean advantage is subtracted from each advantage, so that the advantages always sum to zero). The two are then aggregated to produce Q for each action.
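As a minimal sketch of that aggregation step, the function below combines a state value with per-action advantages using plain Python lists rather than network layers. The function name and the example numbers are hypothetical and chosen only for illustration; they are not taken from the experiment's code.

```python
def duelling_q(state_value, advantages):
    """Combine a scalar state value with per-action advantages into Q values."""
    mean_advantage = sum(advantages) / len(advantages)
    # Centre the advantages so they sum to zero, then add the state value
    return [state_value + (a - mean_advantage) for a in advantages]


# Hypothetical outputs of the two components for the five bed-change actions
actions = [-20, -10, 0, 10, 20]
q_values = duelling_q(state_value=-30.0, advantages=[1.0, 3.0, 6.0, 3.0, 2.0])

# The action with the highest Q is the one the agent would pick when acting greedily
best_action = actions[q_values.index(max(q_values))]
```

Because the centred advantages sum to zero, the mean of the resulting Q values equals the state value.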

This is an example of a simple hospital bed model where a reinforcement learning (RL) agent has to learn how to manage the bed stock:

• Default arrivals = 50/day
• Weekend arrival numbers are 50% average arrival numbers
• Weekday arrival numbers are 120% average arrival numbers
• Distribution of inter-arrival time is exponential
• Average length of stay is 7 days (default)
• Distribution of length of stay is exponential
• The RL agent may request a change in bed numbers once a day (default)
• The allowed bed change requests are -20, -10, 0, 10, 20
• Bed changes take 2 days to occur (default)
• The RL agent receives a reward at each action based on the number of free beds or number of patients without a bed
• The simulation is loaded with the average number of patients present
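The arrival settings above can be checked with a little arithmetic. The sketch below is not the simulation code itself: the constant and function names are hypothetical, days are assumed to be numbered 0-4 for weekdays and 5-6 for the weekend, and inter-arrival times are sampled with `random.expovariate` on the assumption that the intended distribution is the standard exponential one.

```python
import random

DEFAULT_ARRIVALS_PER_DAY = 50
LENGTH_OF_STAY_DAYS = 7
WEEKDAY_FACTOR = 1.2
WEEKEND_FACTOR = 0.5


def arrivals_per_day(day_of_week):
    """Mean arrivals for a day, where 0-4 are weekdays and 5-6 the weekend."""
    factor = WEEKEND_FACTOR if day_of_week >= 5 else WEEKDAY_FACTOR
    return DEFAULT_ARRIVALS_PER_DAY * factor


def sample_inter_arrival_time(day_of_week, rng):
    """Sample the time to the next arrival (in days) for the given day."""
    return rng.expovariate(arrivals_per_day(day_of_week))


# Five weekdays at 120% and two weekend days at 50% average out to 100%
weekly_mean = sum(arrivals_per_day(d) for d in range(7)) / 7

# Little's law: mean patients present = arrival rate x mean length of stay
average_patients = weekly_mean * LENGTH_OF_STAY_DAYS

rng = random.Random(42)
gap = sample_inter_arrival_time(0, rng)
```

With the defaults, the weekly mean works out at 50 arrivals per day and the average number of patients present at about 350, which is the kind of starting load the last bullet refers to.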

The RL agent must learn to maximise the long term reward (return). The maximum reward is 0, so the agent is learning to minimise the loss for each unoccupied bed or patient without a bed.
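One way such a reward could look is sketched below. It assumes a penalty of one unit per free bed and one unit per patient without a bed; the equal weighting and the function name are assumptions for illustration, and the actual simulation may weight the two cases differently.

```python
def reward(beds, patients):
    """Penalty of one unit per free bed or per patient without a bed."""
    free_beds = max(beds - patients, 0)
    patients_without_bed = max(patients - beds, 0)
    return -(free_beds + patients_without_bed)


perfect = reward(beds=350, patients=350)
too_many_beds = reward(beds=370, patients=350)
too_few_beds = reward(beds=340, patients=350)
```

The reward reaches its maximum of 0 only when the number of beds exactly matches the number of patients, and becomes more negative the further the two diverge in either direction.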

Reinforcement learning introduction

RL involves:

  • Trial and error search
  • Receiving and maximising reward (often delayed)
  • Linking state -> action -> reward
  • Agents must be able to sense something of their environment
  • Involves uncertainty in sensing and linking action to reward
  • Learning -> improved choice of actions over time
  • All models find a way to balance best predicted action vs. exploration
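A common way to balance the best predicted action against exploration is an epsilon-greedy rule, sketched below. Whether this experiment uses exactly this rule is an assumption; the function name, the example Q values and the epsilon settings are hypothetical.

```python
import random


def choose_action(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action with equal probability
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest predicted Q
    return q_values.index(max(q_values))


rng = random.Random(0)
q_values = [-32.0, -30.0, -27.0, -30.0, -31.0]  # hypothetical Q per action

greedy_choice = choose_action(q_values, epsilon=0.0, rng=rng)
explored = [choose_action(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
```

In practice epsilon usually starts high and is reduced over time, so that the agent explores widely at first and relies more on its predictions as they improve.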

Elements of RL

  • Environment: all observable and unobservable information relevant to us
  • Observation: sensing the environment
  • State: the perceived (or perceivable) environment
  • Agent: senses environment, decides on action, receives and monitors rewards
  • Action: may be discrete (e.g. turn left) or continuous (accelerator pedal)
  • Policy: how to link state to action (often based on probabilities)
  • Reward signal: aim is to accumulate maximum reward over time
  • Value function of a state: prediction of likely/possible long-term reward
  • Q: prediction of likely/possible long-term reward of an action
  • Advantage: The difference in Q between actions in a given state (sums to zero for all actions)
  • Model (optional): a simulation of the environment

Types of model

  • Model-based: have model of environment (e.g. a board game)
  • Model-free: used when environment not fully known
  • Policy-based: identify best policy directly
  • Value-based: estimate value of a decision
  • Off-policy: can learn from historic data from other agents
  • On-policy: requires active learning from current decisions

Q is learned through the Bellman equation, where the Q of any state and action is the immediate reward achieved plus the discounted maximum Q value of the next state (that is, the Q of the best action available there), where gamma is the discount rate.

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$
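As a worked example of this equation, the sketch below computes the target Q for a single transition. The function name and the numbers are hypothetical, and the `done` flag for terminal states is an added assumption that the text does not discuss.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Target Q(s, a) = r + gamma * max over a' of Q(s', a')."""
    if done:
        # Assumption: no future reward is added after a terminal state
        return reward
    return reward + gamma * max(next_q_values)


# Immediate reward of -10, best next-state Q of -25, discount rate of 0.9
target = bellman_target(reward=-10.0, next_q_values=[-40.0, -25.0, -30.0], gamma=0.9)
```

Here the target is -10 + 0.9 x -25 = -32.5, and the network is trained to move its prediction of Q(s,a) towards that value.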

Key DQN components (common to both Double Deep Q Networks and Duelling Deep Q Networks)

General method for Q learning:

Overall aim is to create a neural network that predicts Q. Improvement comes from improved accuracy in predicting ‘current’ understood Q, and in revealing more about Q as knowledge is gained (some rewards only discovered after time).

Target networks are used to stabilise models, and are only updated at intervals. Changes to Q values may lead to changes in closely related states (i.e. states close to the one we are in at the time) and as the network tries to correct for errors it can become unstable and suddenly lose significant performance. Target networks (e.g. to assess Q) are updated only infrequently (or gradually), so do not have this instability problem.
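The two update styles mentioned above, an infrequent full copy and a gradual update, can be sketched with network weights represented as plain lists of floats. The function names and the `tau` parameter are hypothetical; a real implementation would apply the same idea to the network's weight tensors.

```python
def hard_update(policy_weights, target_weights):
    """Overwrite the target weights with a copy of the policy weights."""
    target_weights[:] = policy_weights


def soft_update(policy_weights, target_weights, tau):
    """Move each target weight a fraction tau towards the policy weight."""
    for i in range(len(target_weights)):
        target_weights[i] = tau * policy_weights[i] + (1 - tau) * target_weights[i]


policy = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0]

# Gradual update: the target moves halfway towards the policy weights
soft_update(policy, target, tau=0.5)
soft_target = list(target)

# Infrequent update: the target becomes an exact copy of the policy weights
hard_update(policy, target)
```

A small `tau` keeps the target network changing slowly at every step, whereas the hard update leaves it fixed between copies; both keep the target steadier than the policy network.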

Training networks

Double DQN contains two networks. This amendment to simple DQN decouples training of Q for the current state from the target Q derived from the next state; the two are closely correlated when comparing input features.

The policy network is used to select action (action with best predicted Q) when playing the game.

When training, the predicted best action (best predicted Q) is taken from the policy network, but the policy network is updated using the predicted Q value of the next state from the target network (which is updated from the policy network less frequently). So, when training, the action is selected using Q values from the policy network, but the policy network is updated to better predict the Q value of that action from the target network. The policy network is copied across to the target network every n steps (e.g. 1000).
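The split of roles between the two networks can be sketched for a single transition as follows. The Q values are hypothetical stand-ins for network outputs, and the function name is chosen for illustration; this is the standard Double DQN target rather than a copy of the experiment's code.

```python
def double_dqn_target(reward, policy_next_q, target_next_q, gamma):
    """Select the next action with the policy net, value it with the target net."""
    best_next_action = policy_next_q.index(max(policy_next_q))
    return reward + gamma * target_next_q[best_next_action]


# Hypothetical next-state Q values from each network for three actions
policy_next_q = [-40.0, -25.0, -30.0]  # policy net prefers action 1
target_next_q = [-38.0, -28.0, -20.0]  # target net values action 1 at -28

target = double_dqn_target(-10.0, policy_next_q, target_next_q, gamma=0.5)
```

Note that the target uses -28 (the target network's value for the action the policy network chose) rather than -20 (the target network's own maximum), which is how Double DQN reduces the tendency to overestimate Q.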

Code structure


The code for this experiment may be found here:

The simple hospital bed simulation that the Deep Q Learning experiment uses:



Though performance is broadly similar to Double Deep Q Networks, I have found that the Duelling version is a little more consistent in performance from one training run to another.


Double DQN: van Hasselt H, Guez A, Silver D. (2015) Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 http://arxiv.org/abs/1509.06461

Duelling DDQN: Wang Z, Schaul T, Hessel M, et al. (2016) Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 http://arxiv.org/abs/1511.06581
