97: Simple machine learning model to predict emergency department (ED) breaches of the four-hour target

In England, emergency departments have a target that 95% of patients should be admitted or discharged from ED within four hours. Patients waiting more than four hours are known as ‘breaches’.

This notebook explores predicting emergency department (ED) breaches (patients taking more than 4 hours to be discharged or admitted). The data is from a real mid-sized acute hospital in England.

The model receives data every 2 hours and predicts whether there will be a breach in the next 2 hours.

It uses some basic ED data alongside whole-hospital data (number of occupied beds and total beds) to try to predict whether there are likely to be breaches in the next two hours. It uses a simple logistic regression model to achieve 80% accuracy in predicting breaches. Sensitivity may be adjusted to balance accuracy in predicting breaching and non-breaching episodes (80% accuracy may be simultaneously achieved in both).

import pandas as pd
data = pd.read_csv('ed_1.csv')

Show data columns:

print (list(data))
['snapshot_id', 'snapshot_date', 'snapshot_time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Number of Patients In department >= 4 Hours', 'Total Number of Patients in the Department', 'Number of Patients in Resus', 'Number of Patients Registered in Last 60 Minutes', 'Number of Patients Waiting Triage', 'Number of Patients Waiting to be Seen (ED)', 'Number of Patients Waiting to be Seen (Medical)', 'Number of Patients Waiting to be Seen (Surgery)', 'Number of Patients > 3 Hours', 'Number of Patients Waiting a Bed', 'Number of Patients Left Department in Last 60 Minutes', 'Free_beds', 'Breach_in_next_timeslot']

Separate data into features (X) and label (Y) to predict. Y is whether there are breaches in the following 2 hours.

X = data.loc[:,"Monday":"Free_beds"]
y = data['Breach_in_next_timeslot']

Let’s see what proportion of 2 hour epochs have a breach:

print (data['Breach_in_next_timeslot'].mean())
0.6575510659671838

Split data into training and test sets

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

Normalise data with standard scaling

from sklearn.preprocessing import StandardScaler

# Initialise a new scaling object for normalising input data
sc=StandardScaler()

# Set up the scaler just on the training set
sc.fit(X_train)

# Apply the scaler to the training and test sets
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

Build a logistic regression model

C=1000 sets low regularisation. If accuracy of training data is significantly higher than accuracy of test data this should be reduced in 10-fold or 3-fold steps to maximise accuracy of test data.

(Note: the ‘;’ at the end of the last line suppresses model description output in the Jupyter Notebook)

from sklearn.linear_model import LogisticRegression

ml = LogisticRegression(C=1000)
ml.fit(X_train_std,y_train);

Predict training and test set labels

Our model is now built. We can now predict breaches for the training and test sets. The results for the test set give the better description of accuracy, but it is useful to calculate both to look for ‘over-fitting’. If the training data has significantly better accuracy than the test data then the model is likely ‘over-fitted’ to the training data, and the regularisation term (C) in the model fit above should be reduced step-wise. This will reduce the accuracy of predicting the training data but increase the accuracy of predicting the test data, though too much regularisation (too low a C) will reduce the accuracy of both.

# Predict training and test set labels
y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

Test accuracy

import numpy as np
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.8111326090191993
Accuracy of predicting test data = 0.8151785714285714

Display weights (coefficients) of model.

# Create table of weights
weights_table = pd.DataFrame()
weights_table['feature'] = list(X)
weights_table['weight'] = ml.coef_[0]
print(weights_table)
                                              feature    weight
0                                              Monday  0.038918
1                                             Tuesday -0.026935
2                                           Wednesday  0.001615
3                                            Thursday  0.001543
4                                              Friday -0.014975
5                                            Saturday  0.011287
6                                              Sunday -0.011401
7         Number of Patients In department >= 4 Hours  1.515722
8          Total Number of Patients in the Department  0.544407
9                         Number of Patients in Resus  0.307983
10   Number of Patients Registered in Last 60 Minutes -0.444304
11                  Number of Patients Waiting Triage  0.028371
12         Number of Patients Waiting to be Seen (ED)  0.138082
13    Number of Patients Waiting to be Seen (Medical) -0.036093
14    Number of Patients Waiting to be Seen (Surgery)  0.022757
15                       Number of Patients > 3 Hours  1.265580
16                   Number of Patients Waiting a Bed  0.013085
17  Number of Patients Left Department in Last 60 ... -0.001884
18                                          Free_beds -0.369558

 

Define a function for sensitivity and specificity

Sensitivity = proportion of breaching periods correctly identified
Specificity = proportion of non-breaching periods correctly identified

def calculate_sensitivity_specificity(y_test, y_pred_test):
    # Note: more values are calculated here than strictly necessary.
    # This would allow measures other than sensitivity and specificity to be returned
    
    # Get true/false for whether a breach actually occurred
    actual_pos = y_test == 1
    actual_neg = y_test == 0
    
    # Get true and false test (true test match actual, false tests differ from actual)
    true_pos = (y_pred_test == 1) & (actual_pos)
    false_pos = (y_pred_test == 1) & (actual_neg)
    true_neg = (y_pred_test == 0) & (actual_neg)
    false_neg = (y_pred_test == 0) & (actual_pos)
    
    # Calculate accuracy
    accuracy = np.mean(y_pred_test == y_test)
    
    # Calculate sensitivity and specificity
    sensitivity = np.sum(true_pos) / np.sum(actual_pos)
    specificity = np.sum(true_neg) / np.sum(actual_neg)
    
    return sensitivity, specificity, accuracy

Show sensitivity and specificity:

sensitivity, specificity, accuracy = calculate_sensitivity_specificity(y_test, y_pred_test)
print ('Sensitivity:', sensitivity)
print ('Specificity:', specificity)
print ('Accuracy:', accuracy)
Sensitivity: 0.8488529014844804
Specificity: 0.7493403693931399
Accuracy: 0.8151785714285714

So we are better at detecting breaches than non-breaches. This is likely because breaching periods occur more often. Let’s adjust our model cut-off to balance the accuracy out. We’ll vary the cut-off we use and construct a sensitivity/specificity plot (very similar to a ‘Receiver Operating Characteristic’ or ‘ROC’ curve).

Balancing sensitivity and specificity

cutoff = np.arange(0.01, 1.01, 0.01)
sensitivity_results = []
specificity_results = []

# Logistic regression models have a .predict_proba method to return
# the probability of each outcome. Some models, such as SVCs, use
# .decision_function in place of .predict_proba

# Get the predicted probabilities once (they do not change with threshold)
y_pred_probability = ml.predict_proba(X_test_std)

for threshold in cutoff:
    # Check whether probability of positive classification is >= threshold
    y_pred_test = (y_pred_probability[:,1] >= threshold)
    
    # Convert boolean to 0/1 (could also simply multiply by 1)
    y_pred_test = y_pred_test.astype(int)
    
    # Get sensitivity and specificity
    sensitivity, specificity, accuracy = \
        calculate_sensitivity_specificity(y_test, y_pred_test)
    
    # Add results to lists of results
    sensitivity_results.append(sensitivity)
    specificity_results.append(specificity)
    

Plotting specificity against sensitivity:

import matplotlib.pyplot as plt

%matplotlib inline

fig = plt.figure(figsize=(5,5))
ax1 = fig.add_subplot(111)

x = sensitivity_results
y = specificity_results

ax1.grid(True, which='both')
ax1.set_xlabel('Sensitivity (proportion of breaching\nperiods predicted correctly)')
ax1.set_ylabel('Specificity (proportion of non-breaching\nperiods predicted correctly)')


plt.plot(x,y)
plt.show()

Plotting specificity against sensitivity shows we can adjust our machine learning cut-off to simultaneously achieve about 80% sensitivity and specificity in predicting the likelihood of breaches in the next 2 hours.

90. Worked machine learning example (for HSMA course)

Logistic regression

Load data

This example uses a data set built into sklearn. It classifies biopsy samples from breast cancer patients as either malignant (cancer) or benign (no cancer).

from sklearn import datasets

data_set = datasets.load_breast_cancer()

# Set up features (X), labels (y) and feature names
X = data_set.data
y = data_set.target
feature_names = data_set.feature_names
label_names = data_set.target_names

Show feature names.

print (feature_names)
Out:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Show the feature data for the second record (index 1).

print (X[1])
Out:
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
 7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
 5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
 2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
 2.750e-01 8.902e-02]

Show label names.

print (label_names)
Out:
['malignant' 'benign']

Print first 25 labels.

print (y[:25])
Out:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0]

We are dealing with a binary outcome. There are just two possibilities: benign or malignant. The methods described below will also work with problems with more than two possible classifications, but we’ll keep things relatively simple here.

Split the data into training and test sets

Data will be split randomly with 75% of the data used to train the machine learning model, and 25% used to test the model.

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(
        X,y,test_size=0.25)

Scale the data using standardisation

We scale data so that all features share similar scales.

The X data will be transformed by standardisation. To standardise we subtract the mean and divide by the standard deviation. All data (training + test) will be standardised using the mean and standard deviation of the training set.

We will use a scaler from sklearn (but we could do this manually).

from sklearn.preprocessing import StandardScaler

# Initialise a new scaling object for normalising input data
sc=StandardScaler() 

# Set up the scaler just on the training set
sc.fit(X_train)

# Apply the scaler to the training and test sets
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)
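As noted above, the same standardisation could be done manually with a couple of lines of numpy. A minimal sketch, using made-up data in place of the real training and test sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the real training and test sets
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))
X_test = rng.normal(loc=50, scale=10, size=(25, 3))

# Manual standardisation: subtract the training-set mean and divide by
# the training-set standard deviation (applied to both data sets)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

# This matches a StandardScaler fitted on the training set
sc = StandardScaler()
sc.fit(X_train)
print(np.allclose(X_train_std, sc.transform(X_train)))
print(np.allclose(X_test_std, sc.transform(X_test)))
```

Both checks print True: StandardScaler stores the training-set mean and standard deviation and applies exactly this transformation.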

Look at the first row of the raw and scaled data.

print ('Raw data:')
print (X_train[0])
print ()
print ('Scaled data')
print (X_train_std[0])
Out:
Raw data:
[1.288e+01 1.822e+01 8.445e+01 4.931e+02 1.218e-01 1.661e-01 4.825e-02
 5.303e-02 1.709e-01 7.253e-02 4.426e-01 1.169e+00 3.176e+00 3.437e+01
 5.273e-03 2.329e-02 1.405e-02 1.244e-02 1.816e-02 3.299e-03 1.505e+01
 2.437e+01 9.931e+01 6.747e+02 1.456e-01 2.961e-01 1.246e-01 1.096e-01
 2.582e-01 8.893e-02]

Scaled data
[-0.36737431 -0.24219862 -0.32226491 -0.4700164   1.83425665  1.21709729
 -0.52651885  0.09871505 -0.35972527  1.39174201  0.15619514 -0.05861438
  0.17858324 -0.11883777 -0.57112622 -0.1052128  -0.61143715  0.10804629
 -0.24792232 -0.17943819 -0.26773858 -0.22328682 -0.25111534 -0.37512237
  0.55131326  0.25668614 -0.74440738 -0.10266883 -0.50914117  0.25431606]

Fit logistic regression model

Our first machine learning model is a logistic regression model.

https://en.wikipedia.org/wiki/Logistic_regression

from sklearn.linear_model import LogisticRegression

ml = LogisticRegression(C=1000)
ml.fit(X_train_std,y_train)
LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

That’s it! We can now use the model to predict malignant vs. benign classification of patients.

# Predict training and test set labels
y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

Check accuracy of model

import numpy as np

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.9929577464788732
Accuracy of predicting test data = 0.958041958041958

Notice that the accuracy of fitting training data is significantly higher than the test data. This is known as over-fitting. There are two potential problems with over-fitting:

1) If we test accuracy on the same data used to train the model we report a spuriously high accuracy. Our model is not actually as good as we report.
2) The model is too tightly built around our training data.

The solution to the first problem is to always report accuracy based on predicting the values of a test data set that was not used to train the model.

The solution to the second problem is to use ‘regularisation’.

Regularisation

Regularisation ‘loosens’ the fit to the training data. It effectively moves all predictions a little closer to the average for all values.

In our logistic regression model, the regularisation parameter is C. A C value of 1,000 means there is very little regularisation. Try reducing the value in factors of 10, re-run the model fitting, prediction and accuracy code above, and see what happens to the accuracy of the model. What value of C do you think is best?
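That experiment can also be written as a loop. A sketch (using a fresh train/test split with a fixed random_state, so the numbers will differ slightly from those above):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load and split the breast cancer data (fixed random_state for repeatability)
data_set = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data_set.data, data_set.target, test_size=0.25, random_state=0)

# Standardise using the training set only
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Step C down by factors of 10 and compare training and test accuracy
for C in [1000, 100, 10, 1, 0.1, 0.01]:
    ml = LogisticRegression(C=C, max_iter=1000)
    ml.fit(X_train_std, y_train)
    train_acc = np.mean(ml.predict(X_train_std) == y_train)
    test_acc = np.mean(ml.predict(X_test_std) == y_test)
    print('C = %7.2f  training accuracy = %.3f  test accuracy = %.3f'
          % (C, train_acc, test_acc))
```

As C falls, training accuracy drops while test accuracy typically holds steady or improves, until too much regularisation hurts both.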

Examining the probability of outcome, and changing the sensitivity of the model to predicting a positive

There may be cases where either:

1) We want to see the probability of a given classification, or

2) We want to adjust the sensitivity of predicting a certain outcome (e.g. for health screening we may choose to accept more false positives in order to minimise the number of false negatives).

For logistic regression we use the method ‘predict_proba’. This may also be used in other machine learning models such as random forests and neural networks (but for support vector machines the method ‘decision_function’ is used in place of predict_proba).

Let’s look at it in action.

For this section we’ll refit the model with regularisation of C=0.1.

# We'll start by retraining the model with C=0.1

ml = LogisticRegression(C=0.1)
ml.fit(X_train_std,y_train)
y_pred_test = ml.predict(X_test_std)

# Calculate the predicted probability of outcome 1:

y_pred_probability = ml.predict_proba(X_test_std)

# Print first 5 values and the predicted label:

print ('Predicted label probabilities:')
print (y_pred_probability[0:5])
print ()
print ('Predicted labels:')
print (y_pred_test[0:5])
print ()
print ('Actual labels:')
print (y_test[0:5])
Predicted label probabilities:
[[9.99999977e-01 2.27009500e-08]
 [2.31989267e-01 7.68010733e-01]
 [5.99718849e-02 9.40028115e-01]
 [1.58608410e-02 9.84139159e-01]
 [1.87433182e-03 9.98125668e-01]]

Predicted labels:
[0 1 1 1 1]

Actual labels:
[0 1 1 1 1]

Let’s calculate false positives and false negatives. In this data set, being clear of cancer has a label ‘1’, and having cancer has a label ‘0’. A false positive (a sample classed as cancer when it is not actually cancer) has a predicted test label of 0 and an actual label of 1. A false negative (predicted no cancer when cancer is actually present) has a predicted label of 1 and an actual label of 0.

fp = np.sum((y_pred_test == 0) & (y_test == 1))
fn = np.sum((y_pred_test == 1) & (y_test == 0))

print ('False positives:', fp)
print ('False negatives:', fn)
False positives: 0
False negatives: 2

Maybe we are more concerned about false negatives. Let’s adjust the probability cut-off to change the threshold for classification as having no cancer (predicted label 1).

cutoff = 0.75

# Now let's make a prediction based on that new cutoff.
# Column 1 contains the probability of no cancer

new_prediction = y_pred_probability[:,1] >= cutoff

And let’s calculate our false positives and negatives:

fp = np.sum((new_prediction == 0) & (y_test == 1))
fn = np.sum((new_prediction == 1) & (y_test == 0))

print ('False positives:', fp)
print ('False negatives:', fn)
False positives: 8
False negatives: 1

We have reduced the number of false negatives, but at the cost of more false positives. Try adjusting the cutoff value. What value do you think is best?

Model weights (coefficients)

We can obtain the model weights (coefficients) for each of the features. Values that are more strongly positive or negative are most important. A positive number means that this feature is linked to a classification label of 1. A negative number means that this feature is linked to a classification label of 0.
We can obtain the weights via the attribute .coef_ (be careful to add the trailing underscore).

print (ml.coef_)
[[-0.34349044 -0.33178015 -0.33844463 -0.36708645 -0.12187052  0.00765482
  -0.42255581 -0.40341256 -0.07540728  0.20606778 -0.44130835  0.01567654
  -0.33062184 -0.3657388  -0.0561784   0.19323076 -0.01636882 -0.01956193
   0.06890254  0.24517315 -0.53861132 -0.52077952 -0.50924417 -0.51384138
  -0.38831286 -0.11229711 -0.41778047 -0.43760388 -0.3899574  -0.09911312]]
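To make the weights easier to read, they may be paired with their feature names and sorted by absolute size, much as was done for the ED model earlier. A sketch (refitting the model so the block is self-contained; values will differ from those above because of the random split):

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Refit the model (fixed random_state for repeatability)
data_set = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data_set.data, data_set.target, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)

ml = LogisticRegression(C=0.1, max_iter=1000)
ml.fit(X_train_std, y_train)

# Pair each weight with its feature name and sort by absolute size
weights = pd.DataFrame({'feature': data_set.feature_names,
                        'weight': ml.coef_[0]})
weights = weights.reindex(
    weights['weight'].abs().sort_values(ascending=False).index)
print(weights.head(10))
```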

Random Forests

A second type of categorisation model is a Random Forest.

https://en.wikipedia.org/wiki/Random_forests

from sklearn.ensemble import RandomForestClassifier

ml = RandomForestClassifier(n_estimators = 10000,
                            n_jobs = -1)

# For random forests we don't need to use scaled data
ml.fit (X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
# Predict test set labels

y_pred_test = ml.predict(X_test)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting test data = 0.9440559440559441

Feature importance

Feature importances give us the relative importance of each feature – the higher the number the greater the influence on the decision (they add up to 1.0). Feature importances do not tell us which decision is more likely.

(Be careful to add the trailing underscore in ml.feature_importances_)

import pandas as pd

df = pd.DataFrame()
df['feature'] = feature_names
df['importance'] = ml.feature_importances_
df = df.sort_values('importance', ascending = False)
print (df)
                    feature  importance
22          worst perimeter    0.145793
20             worst radius    0.126224
23               worst area    0.120934
27     worst concave points    0.107566
7       mean concave points    0.086282
6            mean concavity    0.050089
2            mean perimeter    0.049308
3                 mean area    0.046792
0               mean radius    0.041615
26          worst concavity    0.036971
13               area error    0.035772
25        worst compactness    0.016429
21            worst texture    0.015007
10             radius error    0.014587
12          perimeter error    0.012707
1              mean texture    0.012507
28           worst symmetry    0.011385
24         worst smoothness    0.010911
5          mean compactness    0.010349
29  worst fractal dimension    0.006442
4           mean smoothness    0.005170
11            texture error    0.005052
16          concavity error    0.004991
17     concave points error    0.004303
15        compactness error    0.004159
19  fractal dimension error    0.004147
18           symmetry error    0.003951
14         smoothness error    0.003708
8             mean symmetry    0.003454
9    mean fractal dimension    0.003395

ADDITIONAL MATERIAL

Support Vector Machines

https://en.wikipedia.org/wiki/Support_vector_machine

Support Vector Machines are another common classification algorithm.
Regularisation (C) may be adjusted, and different ‘kernels’ may also be applied. The two most common kernels are ‘linear’ and ‘rbf’.

# Import data

from sklearn import datasets

data_set = datasets.load_breast_cancer()

# Set up features (X), labels (y) and feature names

X = data_set.data
y = data_set.target
feature_names = data_set.feature_names
label_names = data_set.target_names

# Split data into training and test sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(
        X,y,test_size=0.25)

# Scale data

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

# Fit model
# Note: a common test is to see whether linear or rbf kernel is best
# Try changing regularisation (C)

from sklearn.svm import SVC
ml = SVC(kernel='linear',C=100)
ml.fit(X_train_std,y_train)

# Predict training and test set labels

y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

# Check accuracy of model

import numpy as np
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)

# Show classification probabilities for first five samples
# Note that for SVMs we use decision_function, in place of predict_proba 

y_pred_probability = ml.decision_function(X_test_std)
print ()
print ('Predicted label probabilities:')
print (y_pred_probability[0:5])
Accuracy of predicting training data = 0.9953051643192489
Accuracy of predicting test data = 0.972027972027972

Predicted label probabilities:
[-0.7905539   6.09369495 16.11186543  6.55771422 21.89224697]

Neural Networks

Neural networks may be better for very large or complex data sets. A challenge is the number of parameters that need to be optimised.

After importing the MLPClassifier (a neural network is also known as a Multi-Layer Perceptron, hence the name of the classifier), type help(MLPClassifier) for more information.

# Import data

from sklearn import datasets

data_set = datasets.load_breast_cancer()

# Set up features (X), labels (y) and feature names

X = data_set.data
y = data_set.target
feature_names = data_set.feature_names
label_names = data_set.target_names

# Split data into training and test sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

# Scale data

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

# Fit model
# Note: try changing the hidden layer sizes and the regularisation
# term (alpha)

from sklearn.neural_network import MLPClassifier

ml = MLPClassifier(solver='lbfgs',
                   alpha=1e-8, 
                   hidden_layer_sizes=(50, 10),
                   max_iter=100000, 
                   shuffle=True, 
                   learning_rate_init=0.001,
                   activation='relu', 
                   learning_rate='constant', 
                   tol=1e-7,
                   random_state=0)

ml.fit(X_train_std, y_train) 

# Predict training and test set labels

y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

# Check accuracy of model

import numpy as np
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)

# Show classification probabilities for first five samples
# Neural networks may often produce spuriously high probabilities!

y_pred_probability = ml.predict_proba(X_test_std)
print ()
print ('Predicted label probabilities:')
print (y_pred_probability[0:5])
Accuracy of predicting training data = 1.0
Accuracy of predicting test data = 0.972027972027972

Predicted label probabilities:
[[0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [1.00000000e+00 2.25849723e-60]
 [1.00000000e+00 1.40199390e-70]
 [3.77475828e-15 1.00000000e+00]]

A Random Forest example with multiple categories

We will use another classic ‘toy’ data set to look at multiple label classification. This is the categorisation of iris plants. We have only four features but three different classification possibilities.

We will use a random forest here, but all the methods described above work on multiple label classification.

Note: for completeness of code we’ll import the required modules again, but this is not actually necessary if they have been imported previously.

Load data

from sklearn import datasets

data_set = datasets.load_iris()

X = data_set.data
y = data_set.target
feature_names = data_set.feature_names
label_names = data_set.target_names

print ('Label names:')
print (label_names)
print ()
print ('Feature names:')
print (feature_names)
print ()
print ('First sample feature values:')
print (X[0])
print ()
print ('Labels:')
print (y)
Label names:
['setosa' 'versicolor' 'virginica']

Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

First sample feature values:
[5.1 3.5 1.4 0.2]

Labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Split data into training and test data sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

Scale data

Here we’ll use a random forest model which does not need scaling. But for all other model types a scaling step would be added here.

Fit random forest model and show accuracy

from sklearn.ensemble import RandomForestClassifier

ml = RandomForestClassifier(n_estimators = 10000,
                            random_state = 0,
                            n_jobs = -1)

# For random forests we don't need to use scaled data
ml.fit (X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

Predict training and test set labels

y_pred_train = ml.predict(X_train)
y_pred_test = ml.predict(X_test)

Check accuracy of model

import numpy as np

accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 1.0
Accuracy of predicting test data = 0.9736842105263158

Show classification of first 10 samples

print ('Actual label:')
print (y_test[0:10])
print ()
print ('Predicted label:')
print (y_pred_test[0:10])
Actual label:
[2 0 2 1 2 2 0 0 1 1]

Predicted label:
[1 0 2 1 2 2 0 0 1 1]

Showing prediction probabilities and changing sensitivity to classification

As with a binary classification we may be interested in obtaining the probability of label classification, either to get an indicator of the certainty of classification, or to bias classification towards or against particular classes.

Changing sensitivity towards particular class labels is more complicated with multi-class problems, but the principle is the same as with a binary classification. We can access the calculated probabilities of classification for each label. Usually the one with the highest probability is taken, but rules could be defined to bias decisions more towards one class if that is beneficial.

Here we will just look at the probability outcomes for each class. The usual rule for prediction is to simply take the one that is highest.

y_pred_probability = ml.predict_proba(X_test)

print (y_pred_probability[0:5])
[[0.000e+00 8.058e-01 1.942e-01]
 [9.997e-01 1.000e-04 2.000e-04]
 [0.000e+00 1.100e-03 9.989e-01]
 [1.000e-04 9.988e-01 1.100e-03]
 [0.000e+00 2.200e-03 9.978e-01]]
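If we did want to bias decisions towards a particular class, one simple rule is to weight these probabilities before taking the highest. A sketch with made-up weights (here doubling the probability given to class 2, ‘virginica’):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Refit a random forest on the iris data (self-contained example)
data_set = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data_set.data, data_set.target, test_size=0.25, random_state=0)
ml = RandomForestClassifier(n_estimators=100, random_state=0)
ml.fit(X_train, y_train)

probs = ml.predict_proba(X_test)

# Usual rule: take the class with the highest probability
usual = np.argmax(probs, axis=1)

# Biased rule: scale up the probability of class 2 ('virginica') by a
# made-up factor of 2 before taking the highest
class_weights = np.array([1.0, 1.0, 2.0])
biased = np.argmax(probs * class_weights, axis=1)

print('Unweighted predictions:', usual[:10])
print('Weighted predictions:  ', biased[:10])
```

With these weights, a prediction can only stay the same or move to class 2; any other weighting scheme could be used, depending on which misclassifications are most costly.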

 

86. Linear regression and multiple linear regression

In linear regression we seek to predict the value of a continuous variable based on either a single variable, or a set of variables.

The example we will look at below seeks to predict life span based on weight, height, physical activity, BMI, gender, and whether the person has a history of smoking.

With linear regression we assume that the output variable (lifespan in this example) is linearly related to the features we have (we will look at non-linear models in the next module).

This example uses a synthetic data set.

Load data

We’ll load the data from the web…

import pandas as pd

filename = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress/raw/master/jupyter_notebooks/life_expectancy.csv'
df = pd.read_csv(filename)
df.head()
   weight  smoker  physical_activity_scale  BMI  height  male  life_expectancy
0      51       1                        6   22     152     1               57
1      83       1                        5   34     156     1               36
2      78       1                       10   18     208     0               78
3     106       1                        3   28     194     0               49
4      92       1                        7   23     200     0               67

Exploratory data analysis

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(style = 'whitegrid', context = 'notebook')
sns.pairplot(df, height = 2.5) # older seaborn versions used size instead of height
plt.show()

We can show the correlation matrix with np.corrcoef (np.cov would show the non-standardised covariance matrix; a correlation matrix has the same values as a covariance matrix on standardised data). Note that we need to transpose our data so that each feature is in a row rather than a column.

When building a linear regression model we are most interested in those features which have the strongest correlation with our outcome. If there are high degrees of covariance between features we may wish to consider using principal component analysis to reduce the data set.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

np.set_printoptions(precision=3)
corr_mat = np.corrcoef(df.values.T)
print ('Correlation matrix:\n')
print (corr_mat)
print ()

# Plot correlation matrix
plt.imshow(corr_mat, interpolation='nearest')
plt.colorbar()
plt.xlabel('Feature')
plt.ylabel('Feature')
plt.title('Correlation between life expectancy features')
plt.show()
Correlation matrix:

[[ 1.    -0.012 -0.03  -0.011  0.879 -0.004 -0.009]
 [-0.012  1.     0.034 -0.027  0.006  0.018 -0.518]
 [-0.03   0.034  1.    -0.028 -0.009 -0.007  0.366]
 [-0.011 -0.027 -0.028  1.    -0.477 -0.019 -0.619]
 [ 0.879  0.006 -0.009 -0.477  1.     0.006  0.278]
 [-0.004  0.018 -0.007 -0.019  0.006  1.    -0.299]
 [-0.009 -0.518  0.366 -0.619  0.278 -0.299  1.   ]]

[Figure: heatmap of the correlation matrix]

Fitting a linear regression model using a single feature

To illustrate linear regression, we’ll start with a single feature. We’ll pick BMI.

X = df['BMI'].values.reshape(-1, 1) 
X = X.astype('float')
y = df['life_expectancy'].values.reshape(-1, 1)
y = y.astype('float')

# Standardise X and y
# Though this may often not be necessary it may help when features are on
# very different scales. We won't use the standardised data here,
# but here is how it would be done
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_std = sc_X.fit_transform(X)
y_std = sc_y.fit_transform(y)

# Create linear regression model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X, y)

# Print model coefficients
print ('Slope =', slr.coef_[0])
print ('Intercept =', slr.intercept_)
Slope = [-1.739]
Intercept = [ 107.874]
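With a single feature the fitted model is just a straight line, so predict() is nothing more than intercept + slope × x. A small self-contained sketch on made-up data (not the life expectancy set) confirms this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative data: a BMI-like feature and a life-expectancy-like outcome
X = np.array([[18.0], [22.0], [28.0], [34.0]])
y = np.array([80.0, 75.0, 60.0, 50.0])

slr = LinearRegression().fit(X, y)

# predict() is just intercept + slope * x for a single-feature model
manual = slr.intercept_ + slr.coef_[0] * X[:, 0]
print(np.allclose(manual, slr.predict(X)))  # → True
```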

Predicting values

We can simply use the predict method of the linear regression model to predict values for any given X. (We use ‘flatten’ below to change from a column array to a row array).

y_pred = slr.predict(X)

print ('Actual = ', y[0:5].flatten())
print ('Predicted = ', y_pred[0:5].flatten())
Actual =  [ 57.  36.  78.  49.  67.]
Predicted =  [ 69.616  48.748  76.572  59.182  67.877]

Obtaining metrics of observed vs predicted

The metrics module from sklearn contains simple methods of reporting metrics given observed and predicted values.

from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred)))
print('R-square:',metrics.r2_score(y, y_pred))
Mean Absolute Error: 5.5394378529
Mean Squared Error: 45.9654230242
Root Mean Squared Error: 6.77978045545
R-square: 0.382648884752

Plotting observed values and fitted line

plt.scatter (X, y, c = 'blue')
plt.plot (X, slr.predict(X), color = 'red')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
[Figure: scatter of BMI vs. life expectancy with the fitted regression line]

Plotting observed vs. predicted values

Plotting observed vs. predicted can give a good sense of the accuracy of the model, and is also suitable when there are multiple X features.

plt.scatter (y, slr.predict(X), c = 'blue')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()
[Figure: observed vs. predicted life expectancy, single feature]

Fitting a model to multiple X features

The method described above works with any number of X features. Generally we may wish to pick those features with the highest correlation to the outcome value, but here we will use them all.

X = df.values[:, :-1]
y = df.values[:, -1]
# Create linear regression model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X, y)

# Print model coefficients
print ('Slope =', slr.coef_)
print ('Intercept =', slr.intercept_)
Slope = [  0.145 -10.12    1.104  -2.225  -0.135  -5.148]
Intercept = 132.191010159

Show metrics (notice the improvement over the single-feature model)

y_pred = slr.predict(X)
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred)))
print('R-square:',metrics.r2_score(y, y_pred))
Mean Absolute Error: 2.4055613151
Mean Squared Error: 7.96456424623
Root Mean Squared Error: 2.82215595711
R-square: 0.89302975375

Plot observed vs. predicted:

plt.scatter (y, slr.predict(X), c = 'blue')
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()
[Figure: observed vs. predicted life expectancy, all features]

Plotting residuals

Residuals are simply the difference between an observed value and its predicted value. We can plot the relationship between observed values and residuals. Ideally there should be no clear relationship between observed value and residual – residuals should be randomly distributed. A residual plot may also be used to look for outliers which might be distorting our model (in which case we may decide it is better to remove the outliers and re-fit).

residuals = slr.predict(X) - y # predicted - observed
plt.scatter (y, residuals, c = 'blue')
plt.xlabel('Observed')
plt.ylabel('Residual')
plt.show()
[Figure: residuals vs. observed values]

85. Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’ free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more words, in case the use of two or more words together helps to classify a patient.

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether people rated a film as ‘liked’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency) which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.
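A small sketch of the down-weighting effect, using scikit-learn's TfidfTransformer on a hand-made count matrix (illustrative numbers, not the IMDb data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy word-count matrix: 3 'reviews' x 3 words.
# Word 0 appears in every review; words 1 and 2 each appear in only one.
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [3, 0, 1]])

tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts).toarray()

# idf_ is larger for rarer words: the word used in every review gets
# the minimum weight (1.0), the rarer words get a higher weight
print(tfidf.idf_)
```

With scikit-learn's default smoothing the word in every review gets idf 1.0, while each word appearing in only one of the three reviews gets ln(4/2) + 1 ≈ 1.693.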

This code will take us through the following steps:

1) Load data from internet, and split into training and test sets.

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

4) Train a logistic regression model on the tf-idf transformed word vectors.

5) Apply the logistic regression model to our previously unseen test cases, and calculate accuracy of our model.

Load data

This will load the IMDb data from the web. It is loaded into a Pandas DataFrame.

import pandas as pd
import numpy as np

file_location = 'https://raw.githubusercontent.com/MichaelAllen1966/1805_nlp_basics/master/data/IMDb.csv'
data = pd.read_csv(file_location)

Show the size of the data set (rows, columns).

data.shape
Out:
(50000, 2)

Show the data fields.

list(data)
Out:
['review', 'sentiment']

Show the first record review and recorded sentiment (which will be 0 for not liked, or 1 for liked)

print ('Review:')
print (data['review'].iloc[0])
print ('\nSentiment (label):')
print (data['sentiment'].iloc[0])
  Out:
Review:
I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Sentiment (label):
1

Splitting the data into training and test sets

Split the data into 70% training data and 30% test data. The model will be trained using the training data, and accuracy will be tested using the independent test data.

from sklearn.model_selection import train_test_split
X = list(data['review'])
y = list(data['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.3, random_state = 0)

Defining a function to clean the text

This function will:

1) Remove any HTML tags in the text

2) Remove non-letters (e.g. punctuation)

3) Convert all words to lower case

4) Remove stop words (stop words are commonly used words such as ‘and’ and ‘the’ which have little value in a bag of words). If the stop words are not already installed then open a Python terminal and type the two following lines of code (these instructions will also be given when running this code if the stop words have not already been downloaded onto the computer running this code):

import nltk

nltk.download("stopwords")

5) Reduce words to their stems (e.g. ‘running’ and ‘runs’ will both be converted to ‘run’)

6) Join words back up into a single string

def clean_text(raw_review):
    # Function to convert a raw review to a string of words
    
    # Import modules
    from bs4 import BeautifulSoup
    import re

    # Remove HTML
    review_text = BeautifulSoup(raw_review, 'lxml').get_text() 

    # Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()   

    # Remove stop words (use of sets makes this faster)
    from nltk.corpus import stopwords
    stops = set(stopwords.words("english"))                  
    meaningful_words = [w for w in words if not w in stops]                             

    # Reduce word to stem of word
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    stemmed_words = [porter.stem(w) for w in meaningful_words]

    # Join the words back into one string separated by space
    joined_words = ( " ".join( stemmed_words ))
    return joined_words 

Now we will define a function that will apply the cleaning function to a series of records (the clean text function works on one string of text at a time).

def apply_cleaning_function_to_series(X):
    print('Cleaning data')
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    print ('Finished')
    return cleaned_X

We will call the cleaning functions to clean the text of both the training and the test data. This may take a little time.

X_train_clean = apply_cleaning_function_to_series(X_train)
X_test_clean = apply_cleaning_function_to_series(X_test)
  Out:
Cleaning data
Finished
Cleaning data
Finished

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review, where each position of the vector represents a word (returned in the ‘vocab’ list) and the value at that position represents the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency) transform, which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we have an ngram_range of (1,2) it means that the review is divided into single words and also pairs of consecutive words. This may be useful where pairs of words are informative, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams of words if an ngram may be more than one word).
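A quick sketch of what ngram_range=(1,2) produces, on a single toy sentence (illustrative, not the IMDb data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words AND pairs of consecutive words,
# so 'very good' survives as a feature in its own right
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['the film was very good'])

print(sorted(vectorizer.vocabulary_))
# → ['film', 'film was', 'good', 'the', 'the film',
#    'very', 'very good', 'was', 'was very']
```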

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    vectorizer = CountVectorizer(analyzer = "word",
                                 tokenizer = None,
                                 preprocessor = None,
                                 stop_words = None,
                                 ngram_range = (1,2),
                                 max_features = 10000)

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Take a look at the words in the vocabulary
    vocab = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

We will apply our bag_of_words function to our training set. Again this might take a little time.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = (
        create_bag_of_words(X_train_clean))
  Out:
Creating bag of words...

Let’s look at some items from the vocab list (positions 40-44). Some of the words may seem odd. That is because of the stemming.

vocab[40:45]
Out:
['accomplish', 'accord', 'account', 'accur', 'accuraci']

And we can see the raw word count represented in train_data_features.

train_data_features[0][40:45]
Out:
array([0, 0, 1, 0, 0], dtype=int64)

If we look at the tf-idf transform we can see the values reduced (words occurring in many documents will have their value reduced the most).

tfidf_features[0][40:45]
Out:
array([0.        , 0.        , 0.06988648, 0.        , 0.        ])

Training a machine learning model on the bag of words

Now we have transformed our free text reviews into vectors of numbers (representing words) we can apply many different machine learning techniques. Here we will use a relatively simple one, logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1 depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)
  Out:
Training the logistic regression model...
Finished

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test_clean)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the tf-idf transformer already fitted on the training data, so that the test word vectors are transformed in the same way as the training data set (using transform rather than fit_transform avoids re-fitting the idf weights to the test set).

test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)

Now we’ll compare the predicted sentiment to the actual sentiment, and show the overall accuracy of this model.

correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)
 Out:
Accuracy = 87%

87% accuracy. That’s not bad for a simple Natural Language Processing model, using logistic regression.

80. Grouping unlabelled data with k-means clustering

k-means clustering is a form of ‘unsupervised learning’. This is useful for grouping unlabelled data. For example, in the Wisconsin breast cancer data set, what if we did not know whether the patients had cancer or not at the time the data was collected? Could we find groups of similar patients? These groups may then form the basis of some further investigation.

k-means clustering groups data in such a way that the average distance of all points to their closest cluster centre is minimised.
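The assignment step at the heart of k-means can be sketched by hand with NumPy: each point goes to whichever centre it is closest to (illustrative points and centres, not the breast cancer data):

```python
import numpy as np

# Three points and two cluster centres (made-up coordinates)
points = np.array([[0.0, 0.0],
                   [0.2, 0.1],
                   [5.0, 5.0]])
centres = np.array([[0.0, 0.0],
                    [5.0, 5.0]])

# Distance from every point to every centre, shape (n_points, n_centres)
distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)

# Each point is assigned to its closest centre
labels = distances.argmin(axis=1)
print(labels)  # → [0 0 1]
```

k-means then moves each centre to the mean of its assigned points, and repeats these two steps until the assignments stop changing.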

Let’s start by importing the required modules, and loading our data. In order to better visualise what k-means is doing we will limit our data to two features (though k-means clustering works with any number of features).

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
  
# Load data
from sklearn import datasets
data_set = datasets.load_breast_cancer()
X=data_set.data[:,0:2] #  restrict patient data to two features.

We will normalise our data:

sc=StandardScaler() 
sc.fit(X)
X_std=sc.transform(X)

And now we will group patients into three clusters (k=3), and plot the centre of each cluster (each cluster centre has two co-ordinates, representing the normalised features we are using). If we used data with more than two features we would have a co-ordinate point for each feature.

kmeans = KMeans(n_clusters=3)  
kmeans.fit(X_std)  
print('\nCluster centres:')
print(kmeans.cluster_centers_)
Cluster centres:
[[-0.30445203  0.92354217]
 [ 1.52122912  0.58146379]
 [-0.50592385 -0.73299233]]
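The centres printed above are in standardised units. As a sketch of how to read them back in the original feature units, StandardScaler's inverse_transform can undo the normalisation (re-fitting the same pipeline here; exact values may differ slightly between scikit-learn versions and random initialisations):

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Re-fit the same pipeline: two features, standardise, cluster into 3
X = datasets.load_breast_cancer().data[:, 0:2]
sc = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sc.transform(X))

# Map the standardised centres back to the original feature scale
centres_original_units = sc.inverse_transform(kmeans.cluster_centers_)
print(centres_original_units)
```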

We can identify which cluster a sample has been put into with ‘kmeans.labels_’. Using that we may plot our clustering:

labels = kmeans.labels_

plt.scatter(X_std[:,0],X_std[:,1],
            c=labels, cmap=plt.cm.rainbow)
plt.xlabel('Normalised feature 1')
plt.ylabel('Normalised feature 2')
plt.show() 
[Figure: patients plotted on the two normalised features, coloured by assigned cluster]

How many clusters should be used?

Sometimes we may have prior knowledge that we want to group the data into a given number of clusters. Other times we may wish to investigate what may be a good number of clusters.

In the example below we look at changing the number of clusters between 1 and 100 and measure the average distance of points from their closest cluster centre (kmeans.transform gives us the distance from each sample to each cluster centre). Looking at the results we may decide that up to about 10 clusters may be useful, but after that there are diminishing returns from adding further clusters.

distance_to_closest_cluster_centre = []
for k in range(1,100):
    kmeans = KMeans(n_clusters=k)  
    kmeans.fit(X_std)
    distance = np.min(kmeans.transform(X_std),axis=1)
    average_distance = np.mean(distance)
    distance_to_closest_cluster_centre.append(average_distance)

clusters = np.arange(len(distance_to_closest_cluster_centre))+1
plt.plot(clusters, distance_to_closest_cluster_centre)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distance to closest cluster centroid')
plt.ylim(0,1.5)
plt.show()
[Figure: average distance to closest cluster centroid vs. number of clusters]
