In England, emergency departments have a target that 95% of patients should be admitted or discharged from the ED within four hours. Patients waiting more than four hours are known as ‘breaches’.
This notebook explores predicting emergency department (ED) breaches (patients taking more than 4 hours to be discharged or admitted). The data is from a real mid-sized acute hospital in England.
The model receives data every 2 hours and predicts whether there will be a breach in the next 2 hours.
It uses some basic ED data alongside whole-hospital data (number of occupied beds and total beds) to try to predict whether there are likely to be breaches in the next two hours. It uses a simple logistic regression model to achieve 80% accuracy in predicting breaches. Sensitivity may be adjusted to balance accuracy between breaching and non-breaching episodes (80% accuracy may be achieved simultaneously in both).
Load and process data
Download the data file and save it in the same directory as the python code file you will build up.
This code can also be downloaded:
import pandas as pd

data = pd.read_csv('ed_1.csv')
Show data columns:
['snapshot_id', 'snapshot_date', 'snapshot_time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Number of Patients In department >= 4 Hours', 'Total Number of Patients in the Department', 'Number of Patients in Resus', 'Number of Patients Registered in Last 60 Minutes', 'Number of Patients Waiting Triage', 'Number of Patients Waiting to be Seen (ED)', 'Number of Patients Waiting to be Seen (Medical)', 'Number of Patients Waiting to be Seen (Surgery)', 'Number of Patients > 3 Hours', 'Number of Patients Waiting a Bed', 'Number of Patients Left Department in Last 60 Minutes', 'Free_beds', 'Breach_in_next_timeslot']
Separate data into features (X) and label (Y) to predict. Y is whether there are breaches in the following 2 hours.
X = data.loc[:,"Monday":"Free_beds"]
y = data['Breach_in_next_timeslot']
Let’s see what proportion of 2 hour epochs have a breach:
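One quick way to check this is to take the mean of the 0/1 breach label. The sketch below uses a hypothetical mini-series standing in for `data['Breach_in_next_timeslot']`; in the notebook itself you would simply call `data['Breach_in_next_timeslot'].mean()`.

```python
import pandas as pd

# Hypothetical stand-in for data['Breach_in_next_timeslot'] (1 = breach)
breach_flags = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# The mean of a 0/1 series is the proportion of 1s
breach_fraction = breach_flags.mean()
print('Proportion of 2-hour epochs with a breach:', breach_fraction)
```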
Split data in training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Normalise data with standard scaling
from sklearn.preprocessing import StandardScaler

# Initialise a new scaling object for normalising input data
sc = StandardScaler()

# Set up the scaler just on the training set
sc.fit(X_train)

# Apply the scaler to the training and test sets
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
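To see what the scaler does, here is a tiny illustration on made-up numbers (not the notebook's data): after fitting, each transformed column has mean ~0 and standard deviation ~1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up 4x2 sample: two features on very different scales
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = StandardScaler().fit(X_demo)
X_demo_std = scaler.transform(X_demo)

print(X_demo_std.mean(axis=0))  # approximately [0, 0]
print(X_demo_std.std(axis=0))   # approximately [1, 1]
```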
Build a logistic regression model
C=1000 sets low regularisation. If the accuracy on the training data is significantly higher than on the test data, C should be reduced in 10-fold or 3-fold steps to maximise the accuracy on the test data.
(Note: the ‘;’ at the end of the last line suppresses model description output in the Jupyter Notebook)
from sklearn.linear_model import LogisticRegression

ml = LogisticRegression(C=1000)
ml.fit(X_train_std, y_train);
Predict training and test set labels
Our model is now built. We can now predict breaches for the training and test sets. The results for the test set give the better description of accuracy, but it is useful to calculate both to look for ‘over-fitting’. If the training data has significantly better accuracy than the test data, then the model is likely ‘over-fitted’ to the training data, and the regularisation term (C) in the model fit above should be reduced step-wise. This will reduce accuracy on the training data but increase accuracy on the test data, though too much regularisation (too low a C) will reduce accuracy on both the training and test data.
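This tuning loop can be sketched as follows. Synthetic data is used here so the snippet runs on its own; in the notebook you would reuse the scaled training and test sets from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (the notebook's real data would be used instead)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

# Step C down from 1000 and watch the gap between training and test accuracy
for C in [1000, 100, 10, 1, 0.1]:
    model = LogisticRegression(C=C)
    model.fit(X_tr_std, y_tr)
    train_acc = model.score(X_tr_std, y_tr)
    test_acc = model.score(X_te_std, y_te)
    print(f'C={C}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}')
```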
# Predict training and test set labels
y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)
import numpy as np

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print('Accuracy of predicting training data =', accuracy_train)
print('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.8111326090191993
Accuracy of predicting test data = 0.8151785714285714
Display weights (coefficients) of model.
# Create table of weights
weights_table = pd.DataFrame()
weights_table['feature'] = list(X)
weights_table['weight'] = ml.coef_[0]  # coef_ is 2D (one row per class); take the first row

print(weights_table)
                                              feature    weight
0                                              Monday  0.038918
1                                             Tuesday -0.026935
2                                           Wednesday  0.001615
3                                            Thursday  0.001543
4                                              Friday -0.014975
5                                            Saturday  0.011287
6                                              Sunday -0.011401
7         Number of Patients In department >= 4 Hours  1.515722
8          Total Number of Patients in the Department  0.544407
9                         Number of Patients in Resus  0.307983
10   Number of Patients Registered in Last 60 Minutes -0.444304
11                  Number of Patients Waiting Triage  0.028371
12         Number of Patients Waiting to be Seen (ED)  0.138082
13    Number of Patients Waiting to be Seen (Medical) -0.036093
14    Number of Patients Waiting to be Seen (Surgery)  0.022757
15                       Number of Patients > 3 Hours  1.265580
16                   Number of Patients Waiting a Bed  0.013085
17  Number of Patients Left Department in Last 60 ... -0.001884
18                                          Free_beds -0.369558
Define a function for sensitivity and specificity
Sensitivity = proportion of breaching periods correctly identified
Specificity = proportion of non-breaching periods correctly identified
def calculate_sensitivity_specificity(y_test, y_pred_test):
    # Note: More values are calculated than necessary.
    # This would allow return of other measures besides sensitivity and specificity

    # Get true/false for whether a breach actually occurred
    actual_pos = y_test == 1
    actual_neg = y_test == 0

    # Get true and false tests (true tests match actual, false tests differ from actual)
    true_pos = (y_pred_test == 1) & (actual_pos)
    false_pos = (y_pred_test == 1) & (actual_neg)
    true_neg = (y_pred_test == 0) & (actual_neg)
    false_neg = (y_pred_test == 0) & (actual_pos)

    # Calculate accuracy
    accuracy = np.mean(y_pred_test == y_test)

    # Calculate sensitivity and specificity
    sensitivity = np.sum(true_pos) / np.sum(actual_pos)
    specificity = np.sum(true_neg) / np.sum(actual_neg)

    return sensitivity, specificity, accuracy
Show sensitivity and specificity:
sensitivity, specificity, accuracy = calculate_sensitivity_specificity(y_test, y_pred_test)

print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Accuracy:', accuracy)
Sensitivity: 0.8488529014844804 Specificity: 0.7493403693931399 Accuracy: 0.8151785714285714
So we are better at detecting breaches than non-breaches. This is likely because breaching periods occur more often in the data. Let’s adjust our model cut-off to balance the accuracy out. We’ll vary the cut-off we use and construct a sensitivity/specificity plot (very similar to a ‘Receiver Operating Characteristic’ or ‘ROC’ curve).
Balancing sensitivity and specificity
cutoff = np.arange(0.01, 1.01, 0.01)
sensitivity_results = []
specificity_results = []

for threshold in cutoff:
    # Logistic regression models have a .predict_proba method to return
    # the probability of each outcome. Some methods, such as SVC, use
    # .decision_function instead

    # Get test results
    y_pred_probability = ml.predict_proba(X_test_std)

    # Check whether probability of positive classification is >= threshold
    y_pred_test = (y_pred_probability[:, 1] >= threshold)

    # Convert Boolean to 0/1 (could also simply multiply by 1)
    y_pred_test = y_pred_test.astype(int)

    # Get sensitivity and specificity
    sensitivity, specificity, accuracy = \
        calculate_sensitivity_specificity(y_test, y_pred_test)

    # Add results to lists of results
    sensitivity_results.append(sensitivity)
    specificity_results.append(specificity)
Plotting specificity against sensitivity:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(5, 5))
ax1 = fig.add_subplot(111)

x = sensitivity_results
y = specificity_results

ax1.grid(True, which='both')
ax1.set_xlabel('Sensitivity (proportion of breaching\nperiods predicted correctly)')
ax1.set_ylabel('Specificity (proportion of non-breaching\nperiods predicted correctly)')

plt.plot(x, y)
plt.show()
Plotting specificity against sensitivity shows we can adjust our machine learning cut-off to simultaneously achieve about 80% accuracy in predicting both breaching and non-breaching periods in the next 2 hours.
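One way to pick that balanced cut-off is to find where the sensitivity and specificity curves cross, i.e. the threshold with the smallest gap between the two. The sketch below uses synthetic stand-in curves so it runs on its own; in the notebook you would pass in the cut-off array and the two result lists computed above.

```python
import numpy as np

demo_cutoff = np.arange(0.01, 1.01, 0.01)

# Synthetic stand-in curves: sensitivity falls and specificity rises
# as the cut-off increases (the typical shape of the real curves)
demo_sensitivity = np.linspace(1.0, 0.0, len(demo_cutoff))
demo_specificity = np.linspace(0.0, 1.0, len(demo_cutoff))

# Balanced cut-off = threshold with the smallest sensitivity/specificity gap
gap = np.abs(demo_sensitivity - demo_specificity)
best_index = int(np.argmin(gap))
print('Balanced cut-off:', demo_cutoff[best_index])
```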