127. Feature expansion

Here we use survival on the Titanic to demonstrate how features may be expanded by looking for interaction between features. This frequently needs to be followed by feature selection to reduce the number of features (while keeping the important interactions).

Simple models such as logistic regression do not incorporate complex interactions between features. If two features produce more than an additive effect, this will not be fitted in logistic regression. In order to allow for feature interaction we need to add terms that create new features, each formed from the product of a pair of original features.

When we use polynomial expansion of features, we create new features that are the product of two features. For example, if we had three features, A, B and C, a full polynomial expansion would produce the following extra features:

  • A.A, A.B, A.C
  • B.A, B.B, B.C
  • C.A, C.B, C.C

But we will reduce this in two ways (a short sketch after this list illustrates the result):

  • Remove duplicate terms (e.g. A.B and B.A are the same, so we only need A.B)
  • Use the interaction_only argument to remove powers of single features (e.g. A.A, B.B)
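
As a minimal sketch of what this produces (using a tiny made-up array with three features standing in for A, B and C), PolynomialFeatures with interaction_only=True and include_bias=False returns the original features plus each unique pairwise product:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of three hypothetical features (A, B, C)
X_demo = np.array([[1, 2, 3],
                   [4, 5, 6]])

poly_demo = PolynomialFeatures(2, interaction_only=True, include_bias=False)
print(poly_demo.fit_transform(X_demo))
# [[ 1.  2.  3.  2.  3.  6.]
#  [ 4.  5.  6. 20. 24. 30.]]
# Columns are A, B, C, A.B, A.C, B.C (no squared terms, no duplicate pairs)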

A danger of polynomial expansion is that the model may start to over-fit to the training data. This may be dealt with in one (or both) of two ways:

  • Increase the regularisation strength in the model (reduce the value of C in the logistic regression model)
  • Use feature selection to pick only the most important features (which now may include polynomial features)

The methods described here build on previous notebooks, notably on logistic regression, k-fold, sampling, regularisation, and model-based forward feature selection.

Load modules

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

Download data

Run the following code if data for Titanic survival has not been previously downloaded.

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data

data = pd.read_csv('data/processed_data.csv')
# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

# Split data into two DataFrames
X_df = data.drop('Survived',axis=1)
y_df = data['Survived']

# Convert to NumPy arrays
X = X_df.values
y = y_df.values
# Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_df)

Let’s look at the shape of our data sets (the first value is the number of samples, and the second value is the number of features):

print ('Shape of X:', X.shape)
print ('Shape of X_poly:', X_poly.shape)
Shape of X: (891, 24)
Shape of X_poly: (891, 300)

Woah – we’ve gone from 24 features to 300 (the 24 original features plus 24 × 23 / 2 = 276 pairwise interaction terms)! But are they any use?

Training and testing normal and expanded models with varying regularisation

The following code:

  • Defines a list of regularisation values (lower values lead to greater regularisation)
  • Sets up lists to hold results for each k-fold split
  • Starts a loop for each regularisation value, and loops through:
    • Prints the regularisation level (to show progress)
    • Sets up lists to record replicates from k-fold stratification
    • Sets up the k-fold splits using sklearn's StratifiedKFold method
    • Trains two logistic regression models (regular and polynomial), and tests them, for each k-fold split
    • Adds each k-fold training/test accuracy to the lists
  • Records the average accuracy from k-fold stratification (so each regularisation level has one accuracy result recorded for training and test sets)

We pass the regularisation strength to the model during fitting; it has the argument name C.

# Define function to standardise data

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std
# Training and testing normal and polynomial models

reg_values = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

# Set up lists to hold results
training_acc_results = []
test_acc_results = []
training_acc_results_poly = []
test_acc_results_poly = []

# Set up splits
skf = StratifiedKFold(n_splits = 5)
skf.get_n_splits(X, y)
skf.get_n_splits(X_poly, y)

# Set up model type

for reg in reg_values:
    # Show progress
    print(reg, end=' ')
    
    # Set up lists for results for each of k splits
    training_k_results = []
    test_k_results = []
    training_k_results_poly = []
    test_k_results_poly = []
    # Loop through the k-fold splits
    for train_index, test_index in skf.split(X, y):
        
        # Normal (non-polynomial model)
        
        # Get X and Y train/test
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Standardise X data
        X_train_std, X_test_std = standardise_data(X_train, X_test)
        # Fit model with regularisation (C)
        model = LogisticRegression(C=reg, solver='lbfgs', max_iter=1000)
        model.fit(X_train_std,y_train)
        # Predict training and test set labels
        y_pred_train = model.predict(X_train_std)
        y_pred_test = model.predict(X_test_std)
        # Calculate accuracy of training and test sets
        accuracy_train = np.mean(y_pred_train == y_train)
        accuracy_test = np.mean(y_pred_test == y_test)
        # Record accuracy for each k-fold split
        training_k_results.append(accuracy_train)
        test_k_results.append(accuracy_test)
        
        # Polynomial model (same as above except use X with polynomial features)
        
        # Get X and Y train/test
        X_train, X_test = X_poly[train_index], X_poly[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Standardise X data
        X_train_std, X_test_std = standardise_data(X_train, X_test)
        # Fit model with regularisation (C)
        model = LogisticRegression(C=reg, solver='lbfgs', max_iter=1000)
        model.fit(X_train_std,y_train)
        # Predict training and test set labels
        y_pred_train = model.predict(X_train_std)
        y_pred_test = model.predict(X_test_std)
        # Calculate accuracy of training and test sets
        accuracy_train = np.mean(y_pred_train == y_train)
        accuracy_test = np.mean(y_pred_test == y_test)
        # Record accuracy for each k-fold split
        training_k_results_poly.append(accuracy_train)
        test_k_results_poly.append(accuracy_test)
        
    # Record average accuracy for each k-fold split
    training_acc_results.append(np.mean(training_k_results))
    test_acc_results.append(np.mean(test_k_results))
    training_acc_results_poly.append(np.mean(training_k_results_poly))
    test_acc_results_poly.append(np.mean(test_k_results_poly))

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

# Define data for chart
x = reg_values
y1 = training_acc_results
y2 = test_acc_results
y3 = training_acc_results_poly
y4 = test_acc_results_poly

# Set up figure
fig = plt.figure(figsize=(5,5))
ax1 = fig.add_subplot(111)

# Plot training set accuracy
ax1.plot(x, y1,
        color = 'k',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='k',
        markeredgecolor='k',
        label  = 'Training set accuracy')

# Plot test set accuracy
ax1.plot(x, y2,
        color = 'r',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='r',
        markeredgecolor='r',
        label  = 'Test set accuracy')

# Plot training set accuracy (poly model)
ax1.plot(x, y3,
        color = 'g',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='g',
        markeredgecolor='g',
        label  = 'Training set accuracy (poly)')

# Plot test set accuracy (poly model)
ax1.plot(x, y4,
        color = 'b',
        linestyle = '-',
        markersize = 8,
        marker = 'o',
        markerfacecolor='b',
        markeredgecolor='b',
        label  = 'Test set accuracy (poly)')

# Customise axes
ax1.grid(True, which='both')
ax1.set_xlabel('Regularisation\n(lower value = greater regularisation)')
ax1.set_ylabel('Accuracy')
ax1.set_xscale('log')

# Add legend
ax1.legend()

# Show plot
plt.show()

Show the best test set accuracy measured.

best_test_non_poly = np.max(test_acc_results)
best_test_poly = np.max(test_acc_results_poly)
print ('Best accuracy for non-poly and poly were {0:0.3f} and {1:0.3f}'.format(
    best_test_non_poly, best_test_poly))
Best accuracy for non-poly and poly were 0.789 and 0.816

Note in the above figure that:

  • Polynomial expansion has increased the accuracy of both training and test sets. Test set accuracy was increased by over 2 percentage points
  • We do not know which polynomial terms are most useful (below we will use feature reduction to identify those)
  • The polynomial X data suffers from more over-fitting than the non-polynomial set (there is a larger difference between training and test set accuracies)

Feature reduction after feature expansion

We will revisit the code we have used previously to pick the best features (here).

In the previous method we ranked all features in their ability to improve model performance. Here, because there are many more features we will look at the influence of the top 20 (and if we see model performance is still increasing with additional features we could come back and change that limit).

We will also amend the previous code to use simple accuracy (rather than the ROC Area Under Curve used in our previous example).

# Transfer polynomial X into a pandas DataFrame (as the method uses Pandas)
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names())

# Create list to store accuracies and chosen features
accuracy_by_feature_number = []
chosen_features = []

# Initialise chosen features list and run tracker
available_features = list(poly.get_feature_names())
run = 0
number_of_features = len(list(X))

# Loop through feature list to select next feature
maximum_features_to_choose = 20

for i in range(maximum_features_to_choose):

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, maximum_features_to_choose))
    
    # Reset best feature and accuracy
    best_result = 0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = chosen_features.copy()
        # Create a list of features from features already chosen + 1 new feature
        features_to_use.append(feature)
        # Get data for features, and convert to NumPy array
        X_np = X_poly_df[features_to_use].values
        
        # Set up lists to hold results for each candidate feature set
        test_accuracy_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Get X and Y train/test
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
                        
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
            test_accuracy_results.append(accuracy_test)
          
        # Get average result from all k-fold splits
        feature_accuracy = np.mean(test_accuracy_results)
    
        # Update chosen feature and result if this feature is a new best
        if feature_accuracy > best_result:
            best_result = feature_accuracy
            best_feature = feature
    
    # k-fold splits are complete    
    # Add the best feature and its mean accuracy to the records
    accuracy_by_feature_number.append(best_result)
    chosen_features.append(best_feature)
    available_features.remove(best_feature)

# Put results in DataFrame
results = pd.DataFrame()
results['feature to add'] = chosen_features
results['accuracy'] = accuracy_by_feature_number
results
    feature to add  accuracy
0   x1 x10          0.792352
1   x0 x2           0.818170
2   x13 x19         0.824924
3   x3 x12          0.828320
4   x4 x16          0.830568
5   x0 x3           0.831716
6   x5 x7           0.835081
7   x5 x18          0.836211
8   x19             0.837335
9   x2 x16          0.838458
10  x8 x19          0.838464
11  x2 x19          0.840712
12  x11 x18         0.840718
13  x3 x16          0.841835
14  x6              0.841835
15  x14             0.841835
16  x22             0.841835
17  x0 x6           0.841835
18  x0 x14          0.841835
19  x0 x22          0.841835

Plot results:

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, maximum_features_to_choose+1))

plt.plot(chart_x, accuracy_by_feature_number,
        label = 'Accuracy')

plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.show()

Neat! By selecting our best features from our expanded model we now have a little over 84% accuracy! It looks like we need about 15 features for our optimum model, nearly all of which are polynomial terms. Note that sklearn’s polynomial method outputs feature names in relation to the original X index. Our ‘best’ feature is a product of X1 and X10. Let’s see what those are:

X_index_names = list(X_df)
print ('X1:',X_index_names[1])
print ('X10:',X_index_names[10])
X1: Age
X10: male

So looking at just the ages of male passengers is the best single predictor of survival.
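
If we wanted to translate all of the chosen polynomial feature names back into readable column names, a small sketch along the following lines should work (assuming the chosen_features list and X_df DataFrame from above; the name_lookup dictionary is introduced here purely for illustration):

# Map sklearn's generic names (x0, x1, ...) back to the original column names
name_lookup = {'x{}'.format(i): name for i, name in enumerate(list(X_df))}

# Translate each chosen feature (e.g. 'x1 x10' becomes 'Age * male')
readable_features = [' * '.join(name_lookup[part] for part in feature.split())
                     for feature in chosen_features]

print(readable_features)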

126. Feature selection 3 (model backward elimination)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally remove features that least reduce model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

And we will assess performance with a Receiver Operator Characteristic (ROC) curve. See here.
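
Load modules

For completeness, a sketch of the module imports this notebook relies on (the same modules as the neighbouring notebooks in this series):

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler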

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Backward feature elimination

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The backward elimination method:

  • Keeps a list of the features removed so far (in order of removal)
  • Keeps a list of features still in the model (available for removal)
  • Loops through the remaining features:
    • Calculates model performance with each feature removed in turn (using stratified k-fold validation)
    • Selects the feature whose removal causes the smallest drop in performance
    • Adds that feature to the removed list and takes it out of the list of remaining features

This method uses a while loop to keep removing features until only one remains (the last remaining feature is simply appended at the end). An alternative would be to use a for loop with a maximum number of features to remove.

# Create list to store accuracies and chosen features
roc_auc_by_feature_number = []
chosen_features = []

# Initialise chosen features list and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Create initial reference performance
reference_auc = 1.0 # used to compare reduction in AUC

# Loop through feature list to select next feature
while len(available_features)> 1:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features-1))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and accuracy
    best_result = 1.0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of available features to avoid the original being changed
        features_to_use = available_features.copy()
        # Create a list of features to use by removing 1 feature
        features_to_use.remove(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up lists to hold results for each candidate feature set
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Get X and Y train/test
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        # We are looking for the smallest drop in performance
        drop_in_performance = reference_auc - feature_auc
        if drop_in_performance < best_result:
            best_result = drop_in_performance
            best_feature = feature
            best_auc = feature_auc
                
    # k-fold splits are complete    
    # Add mean accuracy and AUC to record of accuracy by feature number
    roc_auc_by_feature_number.append(best_auc)
    chosen_features.append(best_feature)    
    available_features.remove(best_feature)
    reference_auc = best_auc

# Add last remaining feature
chosen_features += available_features
roc_auc_by_feature_number.append(0)
    
# Put results in DataFrame
# Reverse order of lists with [::-1] so best features first
results = pd.DataFrame()
results['feature removed'] = chosen_features[::-1]
results['ROC AUC'] = roc_auc_by_feature_number[::-1]

Show results

The table is now in the order of preferred features, though our code worked in the reverse direction, incrementally removing the feature that made least difference to the model.

results
    feature removed      ROC AUC
0   male                 0.000000
1   Pclass               0.766733
2   Age                  0.833036
3   SibSp                0.843055
4   Embarked_S           0.848686
5   CabinNumberImputed   0.853125
6   CabinLetter_C        0.855046
7   CabinLetter_missing  0.855504
8   CabinLetterImputed   0.855506
9   Embarked_missing     0.855533
10  EmbarkedImputed      0.855479
11  CabinLetter_T        0.855479
12  CabinLetter_F        0.855159
13  CabinLetter_B        0.855124
14  Embarked_Q           0.854781
15  Embarked_C           0.853826
16  Fare                 0.853760
17  CabinNumber          0.853030
18  CabinLetter_E        0.852737
19  CabinLetter_A        0.852083
20  AgeImputed           0.851654
21  CabinLetter_D        0.850486
22  CabinLetter_G        0.848673
23  Parch                0.847432

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features removed')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could eliminate all but 5-6 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.

125. Feature selection 2 (model forward selection)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally add features that most increase model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will go through the following steps:

  • Download and save pre-processed data
  • Split data into features (X) and label (y)
  • Loop through features to select the feature that most increases ROC AUC
  • Plot results

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

And we will assess performance with a Receiver Operator Characteristic (ROC) curve. See here.

Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb:

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Forward feature selection

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The forward selection method:

  • Keeps a list of selected features
  • Keeps a list of features still available for selection
  • Loops through available features:
    • Calculates added value for each feature (using stratified k-fold validation)
    • Selects feature that adds most value
    • Adds selected feature to selected features list and removes it from available features list

This method uses a while loop to keep exploring features until no more are available. An alternative would be to use a for loop with a maximum number of features to select.

# Create list to store accuracies and chosen features
roc_auc_by_feature_number = []
chosen_features = []

# Initialise chosen features list and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Loop through feature list to select next feature
while len(available_features)> 0:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and accuracy
    best_result = 0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = chosen_features.copy()
        # Create a list of features from features already chosen + 1 new feature
        features_to_use.append(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up lists to hold results for each candidate feature set
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Get X and Y train/test
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        if feature_auc > best_result:
            best_result = feature_auc
            best_feature = feature
    
    # k-fold splits are complete    
    # Add mean accuracy and AUC to record of accuracy by feature number
    roc_auc_by_feature_number.append(best_result)
    chosen_features.append(best_feature)
    available_features.remove(best_feature)

# Put results in DataFrame
results = pd.DataFrame()
results['feature to add'] = chosen_features
results['ROC AUC'] = roc_auc_by_feature_number
results
    feature to add       ROC AUC
0   male                 0.766733
1   Pclass               0.833036
2   Age                  0.843055
3   SibSp                0.848686
4   Embarked_S           0.853125
5   CabinLetter_E        0.855740
6   CabinNumberImputed   0.856158
7   CabinLetter_T        0.856160
8   CabinLetter_D        0.856167
9   CabinLetter_F        0.856307
10  EmbarkedImputed      0.856121
11  Embarked_missing     0.856094
12  CabinLetterImputed   0.855637
13  CabinLetter_missing  0.855744
14  Fare                 0.855302
15  CabinLetter_A        0.854307
16  CabinLetter_B        0.853215
17  CabinNumber          0.852896
18  Embarked_C           0.851377
19  Embarked_Q           0.851324
20  CabinLetter_C        0.849972
21  AgeImputed           0.848673
22  CabinLetter_G        0.847432
23  Parch                0.845842

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could use just 5-7 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.

Note that accuracy of the model appears to decline with a greater number of features.

124. Feature selection 1 (univariate statistical selection)

Here we use survival on the Titanic to demonstrate a simple statistical method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a simple statistical method for selecting features called univariate feature selection. We will examine the correlation between each feature and the target label value. This is called univariate statistics because we examine each feature independently.

Two key advantages of this method are:

  • It is simple
  • It is fast

Two key disadvantages of this method are:

  • It may miss features which have little effect alone, but which are influential when combined
  • It may include features which are highly correlated with each other, where choosing just one of the correlated features would be sufficient.

The machine learning model we will use to test the feature selection will be a simple logistic regression model.

We will go through the following steps:

  • Download and save pre-processed data
  • Split data into features (X) and label (y)
  • Calculate the correlation of each feature with the target label value
  • Sort by correlation (ignoring the +ve/-ve sign)
  • Test the features in our logistic regression model

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

https://pythonhealthcare.org/2018/04/20/75-machine-learning-choosing-between-models-with-stratified-k-fold-validation/

And we will assess performance with a Receiver Operator Characteristic (ROC) curve. See here.

Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data once downloaded:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Calculate correlation coefficients

Calculate the correlation between each feature and survival.

from scipy.stats import pearsonr
features = list(X)
correlation = []
significance = []
for feature in features:
    correl = pearsonr(X[feature].values, y.values)
    correlation.append(correl[0])
    significance.append(correl[1])
df = pd.DataFrame()
df['feature'] = features
df['correlation'] = correlation
df['abs_correlation'] = np.abs(correlation)
df['significance'] = significance
df['significant'] = df['significance'] < 0.05 # Label those P<0.05
df.sort_values(by='abs_correlation', ascending=False, inplace=True)

Show features in order of correlation with survival. Note: significance = probability that correlation is due to chance. Lower values show higher significance.

df
    feature              correlation  abs_correlation  significance  significant
10  male                 -0.543351    0.543351         1.406066e-69  True
0   Pclass               -0.338481    0.338481         2.537047e-25  True
9   CabinNumberImputed   -0.321842    0.321842         6.404266e-23  True
7   CabinLetterImputed   -0.316912    0.316912         3.090891e-22  True
23  CabinLetter_missing  -0.316912    0.316912         3.090891e-22  True
4   Fare                  0.257307    0.257307         6.120189e-15  True
8   CabinNumber           0.235409    0.235409         1.100977e-12  True
16  CabinLetter_B         0.175095    0.175095         1.441584e-07  True
11  Embarked_C            0.168240    0.168240         4.397151e-07  True
13  Embarked_S           -0.155660    0.155660         3.036111e-06  True
18  CabinLetter_D         0.150716    0.150716         6.233140e-06  True
19  CabinLetter_E         0.145321    0.145321         1.331654e-05  True
17  CabinLetter_C         0.114652    0.114652         6.061874e-04  True
5   AgeImputed           -0.092197    0.092197         5.886535e-03  True
3   Parch                 0.081629    0.081629         1.479925e-02  True
1   Age                  -0.064910    0.064910         5.276069e-02  False
14  Embarked_missing      0.060095    0.060095         7.298717e-02  False
6   EmbarkedImputed       0.060095    0.060095         7.298717e-02  False
20  CabinLetter_F         0.057935    0.057935         8.392361e-02  False
2   SibSp                -0.035322    0.035322         2.922439e-01  False
22  CabinLetter_T        -0.026456    0.026456         4.302610e-01  False
15  CabinLetter_A         0.022287    0.022287         5.064306e-01  False
21  CabinLetter_G         0.016040    0.016040         6.325420e-01  False
12  Embarked_Q            0.003650    0.003650         9.133532e-01  False

Get ordered feature list.



ordered_features = list(df['feature'])

ordered_features
['male',
 'Pclass',
 'CabinNumberImputed',
 'CabinLetterImputed',
 'CabinLetter_missing',
 'Fare',
 'CabinNumber',
 'CabinLetter_B',
 'Embarked_C',
 'Embarked_S',
 'CabinLetter_D',
 'CabinLetter_E',
 'CabinLetter_C',
 'AgeImputed',
 'Parch',
 'Age',
 'Embarked_missing',
 'EmbarkedImputed',
 'CabinLetter_F',
 'SibSp',
 'CabinLetter_T',
 'CabinLetter_A',
 'CabinLetter_G',
 'Embarked_Q']

Testing our selected features

After statistical selection we may simply choose the top k features, or we may choose those labelled as significant (P<0.05).

Here we will incrementally add features to the list of features to use (chosen in order of their correlation coefficients), and see the effect on model accuracy and Receiver Operator Characteristic (ROC) Area Under Curve (AUC)* as measured by k-fold stratification. Here we will use sklearn’s implementation of calculating the ROC AUC.

*A reminder that ROC AUC is a measure of the balance between true positive and false positives as the threshold to classify a case as a positive is changed.
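
As a minimal sketch of the two simple selection rules just mentioned (assuming the correlation table df built above, which is already sorted by absolute correlation; the variable names here are illustrative only):

# Option 1: take the top k features by absolute correlation
top_k = 10
top_k_features = list(df['feature'])[:top_k]

# Option 2: take all features flagged as statistically significant (P<0.05)
significant_features = list(df.loc[df['significant'], 'feature'])

print(top_k_features)
print(significant_features)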

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std
# Create list to store accuracies
accuracy_by_feature_number = []
roc_auc_by_feature_number = []

# Loop through feature list
number_of_features = len(ordered_features)
for i in range(number_of_features):
    # print ("{0} features of {1}".format(i, number_of_features))
    features_to_use = ordered_features[0:i+1]
    X_selected = X[features_to_use]
    
    # Convert to NumPy (needed for k-fold method)
    # Convert DataFrames to NumPy arrays
    X_np = X_selected.values
    y_np = y.values
    
    #%% Run k fold model

    # Set up lists to hold results for each k-fold run
    test_acc_results = []
    test_auc_results = []

    # Set up splits
    number_of_splits = 10
    skf = StratifiedKFold(n_splits = number_of_splits)
    skf.get_n_splits(X_np, y)

    # Loop through the k-fold splits
    for train_index, test_index in skf.split(X_np, y_np):
        # Get X and Y train/test
        X_train, X_test = X_np[train_index], X_np[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Get X and Y train/test
        X_train_std, X_test_std = standardise_data(X_train, X_test)

        # Set up and fit model
        model = LogisticRegression(solver='lbfgs')
        model.fit(X_train_std,y_train)

        # Predict test set labels
        y_pred_test = model.predict(X_test_std)
        
        # Calculate accuracy of test sets
        accuracy_test = np.mean(y_pred_test == y_test)
        test_acc_results.append(accuracy_test)
        
        # Get ROC AUC
        probabilities = model.predict_proba(X_test_std)
        probabilities = probabilities[:, 1] # Probability of 'survived' class
        fpr, tpr, thresholds = roc_curve(y_test, probabilities)
        roc_auc = auc(fpr, tpr)
        test_auc_results.append(roc_auc)      
        
    # Add mean accuracy and AUC to record of accuracy by feature number
    accuracy_by_feature_number.append(np.mean(test_acc_results))
    roc_auc_by_feature_number.append(np.mean(test_auc_results))

Plot results:

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features + 1))

plt.plot(chart_x, accuracy_by_feature_number,
        label = 'Accuracy')

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.show()

The results show that our ROC AUC* increases significantly between 1 and 2 features, and then climbs slowly up to about 20 features. Accuracy, interestingly, first declines with more features, and then climbs, to a plateau between 16 and 22 features.

Taking the top 20 features is likely to give us our best model (though we could reduce features more if computational time was critical).

123: A basic example of creating an interactive plot with HoloViews and Bokeh


This is a very simple example of producing an interactive visualisation using HoloViews (which calls on Bokeh). These visualisations can be viewed in Jupyter notebooks, or may be saved as a single html page which needs only a web browser to view. Here we show room temperature and humidity, with the plots allowing the choice of which room to show.

To create these interactive plots you will need to install pyviz, holoviews and bokeh as described below.

Install libraries if needed

From terminal run:

conda install -c pyviz holoviews bokeh

holoviews --install-examples

Import libraries

import numpy as np
import pandas as pd
import holoviews as hv
import panel as pn
from holoviews import opts
hv.extension('bokeh')

Create some dummy data

# Set length of data set to create
length = 25

# Build strings for location data
location1 = ['Window'] * length
location2 = ['Porch'] * length
location3 = ['Fridge'] * length

# Set temperature to normal distribution (mu, sigma, length)
temperature1 = np.random.normal(25 ,5, length)
temperature2 = np.random.normal(15 ,3, length)
temperature3 = np.random.normal(4, 0.5,length)

# Set humidity to uniform distribution (min, max, length)
humidity1 = np.random.uniform(30, 60, length)
humidity2 = np.random.uniform(60, 80, length)
humidity3 = np.random.uniform(80, 99, length)

# Record mean temperature/humidity (use np.repeat to repeat a single value)
mean_temp1 = np.repeat(np.mean(temperature1), length)
mean_temp2 = np.repeat(np.mean(temperature2), length)
mean_temp3 = np.repeat(np.mean(temperature3), length)

mean_humidity1 = np.repeat(np.mean(humidity1), length)
mean_humidity2 = np.repeat(np.mean(humidity2), length)
mean_humidity3 = np.repeat(np.mean(humidity3), length)

# Concatenate three sets of data into single list/arrays
location = location1 + location2 + location3
temperature = np.concatenate((temperature1, temperature2, temperature3))
mean_temperature = np.concatenate((mean_temp1, mean_temp2, mean_temp3))
humidity = np.concatenate((humidity1, humidity2, humidity3))
mean_humidity = np.concatenate((mean_humidity1, mean_humidity2, mean_humidity3))

# Create list of days
days = list(range(1,length + 1))
day = days * 3 # times 3 as there are three locations

# Transfer data to pandas DataFrame
data = pd.DataFrame()
data['day'] = day
data['location'] = location
data['temperature'] = temperature
data['humidity'] = humidity
data['mean_temperature'] = mean_temperature
data['mean_humidity'] = mean_humidity

data.head()

Out:

day	location	temperature	humidity	mean_temperature	mean_humidity
0	1	Window	26.081745	49.611333	25.222169	45.43133
1	2	Window	31.452276	39.027559	25.222169	45.43133
2	3	Window	19.031828	58.825912	25.222169	45.43133
3	4	Window	21.309825	52.741160	25.222169	45.43133
4	5	Window	13.529042	39.977335	25.222169	45.43133

Build bar chart

# Make holoviews data table
key_dimensions   = ['location']
value_dimensions = ['day', 'temperature', 'humidity', 'mean_temperature', 'mean_humidity']
hv_data = hv.Table(data, key_dimensions, value_dimensions)

# Build bar charts
bars1 = hv_data.to.bars(['day'], ['temperature'])
bars2 = hv_data.to.bars(['day'], ['humidity']).opts(color='Red')

# Compose plot
bar_plot = bars1 + bars2

# Show plot (only works in Jupyter notebook)
bar_plot

Build scatter chart

# Build scatter charts
scatter1 = hv_data.to.scatter(['day'], ['temperature'])
scatter2 = hv_data.to.scatter(['day'], ['humidity']).opts(color='Red')

# Compose plot
scatter_plot = scatter1 + scatter2

# Show plot
scatter_plot

Build line chart for mean temperature and humidity

# Build line charts
line1 = hv_data.to.curve(['day'], ['mean_temperature'])
line2 = hv_data.to.curve(['day'], ['mean_humidity']).opts(color='r')

# Compose plot
line_chart = line1 + line2

# Show plot
line_chart

Combine line and scatter charts

Here we combine the line and scatter charts. We viewed them individually before, though this is not actually necessary.

# Compose plot (* creates overlays of two or more plots)
combined_plot = line1 * scatter1 + line2 * scatter2

# Show plot
combined_plot

Save to html

The interactive plot may be saved as html which may be shared with, and viewed by, anyone (there is no need for anything other than a standard web browser to view the interactive plot).

hv.save(combined_plot, 'holoviews_example.html')

122: Oversampling to correct for imbalanced data using naive sampling or SMOTE

Machine learning can have poor performance for minority classes (where one or more classes represent only a small proportion of the overall data set compared with a dominant class). One method of improving performance is to balance out the number of examples between different classes. Here two methods are described:

  1. Resampling from the minority classes to give the same number of examples as the majority class.
  2. SMOTE (Synthetic Minority Over-sampling Technique): creating synthetic data points that lie mid-way between two near neighbours in any particular class.

SMOTE uses the imblearn package. See: https://imbalanced-learn.org

Install with: pip install -U imbalanced-learn, or conda install -c conda-forge imbalanced-learn

Reference

N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 16, 321-357, 2002

Create dummy data

First we will create some unbalanced dummy data: Classes 0, 1 and 2 will represent 1%, 5% and 94% of the data respectively.

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

Count instances of each class.

from collections import Counter
print(sorted(Counter(y).items()))

OUT:
[(0, 64), (1, 262), (2, 4674)]

Define function to plot data

import matplotlib.pyplot as plt

def plot_classes(X,y):
    colours = ['k','b','g']
    point_colours = [colours[val] for val in y]
    X1 = X[:,0]
    X2 = X[:,1]
    plt.scatter(X1, X2, facecolor = point_colours, edgecolor = 'k')
    plt.show()

Plot the data using the function

plot_classes(X,y)

Oversample with naive sampling to match numbers in each class

With naive resampling we repeatedly randomly sample from the minority classes and add the new samples to the existing data set, leading to multiple instances of the minority classes. This builds up the number of minority class samples.

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

Count instances of each class in the augmented data.

from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]

Plot augmented data (it looks the same as the original as points are overlaid).

plot_classes(X_resampled,y_resampled)

SMOTE with continuous variables

SMOTE (synthetic minority oversampling technique) works by finding two near neighbours in a minority class, producing a new point midway between the two existing points and adding that new point in to the sample. The example shown is in two dimensions, but SMOTE will work across multiple dimensions (features). SMOTE therefore helps to ‘fill in’ the feature space occupied by minority classes.

from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

# Count instances of each class
from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]

Plot augmented data (note minority class data points now exist in new spaces).
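
Presumably this re-uses the plot_classes helper defined above:

plot_classes(X_resampled, y_resampled)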

SMOTE with mixed continuous and binary/categorical values

It is not possible to calculate a ‘mid point’ between two points of binary or categorical data. An extension to the SMOTE method allows for use of binary or categorical data by taking the most common occurring category of nearest neighbours to a minority class point.

# create a synthetic data set with continuous and categorical features
import numpy as np
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)

Count instances of each class

print(sorted(Counter(y).items()))

OUT:
[(0, 20), (1, 30)]

Show last 10 original data points

print (X[-10:])

OUT:
[['A' 1.4689412854323924 2]
 ['C' -1.1238983345400366 0]
 ['C' 0.9500053955071801 2]
 ['A' 1.7265164685753638 1]
 ['A' 0.4578850770000152 0]
 ['C' -1.6842873783658814 0]
 ['B' 0.32684522397001387 0]
 ['A' -0.0811189541586873 2]
 ['B' 0.46779475326315173 1]
 ['B' 0.7361223506692577 0]]

Use SMOTENC to create new data points.

from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)

Count instances of each class

print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 30), (1, 30)]

Show last 10 values of X (SMOTE data points are added to the end of the original data set)

print (X_resampled[-10:])

[['C' -1.0600505672469849 1]
 ['C' -0.36965644259183145 1]
 ['A' 0.1453826708354494 2]
 ['C' -1.7442827953859052 2]
 ['C' -1.6278053447258838 2]
 ['A' 0.5246469549655818 2]
 ['B' -0.3657680728116921 2]
 ['A' 0.9344237230779993 2]
 ['B' 0.3710891618824609 2]
 ['B' 0.3327240726719727 2]]

121. More list comprehension examples

Example 1 – double the numbers

Standard loop approach:

foo = [1, 2, 3, 4]
bar = []

for x in foo:
    bar.append(x * 2)
    
print(bar)

Out:

[2, 4, 6, 8]

Using list comprehension:

foo = [1, 2, 3, 4]
bar = [x * 2 for x in foo]
print(bar)

Out:

[2, 4, 6, 8]

Example 2 – convert Celsius to Fahrenheit

This example calls a function from within the list comprehension.

Define the function:

def convert_celsius_to_fahrenheit(deg_celsius):
    """
    Convert degrees celsius to fahrenheit
    Returns float value - temp in fahrenheit
    Keyword arguments:
        deg_celsius -- temp in degrees celsius
    """
    return (9/5) * deg_celsius + 32

Standard loop approach:

#list of temps in degree celsius to convert to fahrenheit
celsius = [39.2, 36.5, 37.3, 41.0]

#standard for loop approach
fahrenheit = []
for x in celsius:
    fahrenheit.append(convert_celsius_to_fahrenheit(x))

print('Using standard for loop: {}'.format(fahrenheit))

Out:

Using standard for loop: [102.56, 97.7, 99.14, 105.8]

Using list comprehension

fahrenheit = [convert_celsius_to_fahrenheit(x) for x in celsius]
print('Using list comprehension: {}'.format(fahrenheit))

Out:

Using list comprehension: [102.56, 97.7, 99.14, 105.8]

Example 3 – convert the strings to different data types

This example also makes use of the zip function, which allows you to iterate through two lists at the same time.

inputs = ["1", "3.142", "True", "spam"]
converters = [int, float, bool, str]

values_with_correct_data_types = [t(s) for (s, t) in zip(inputs, converters)]
print(values_with_correct_data_types)

Out:

[1, 3.142, True, 'spam']

Example 4 – Using if statements within a list comprehension

The example filters a list of file names to the python files only

unfiltered_files = ['test.py', 'names.csv', 'fun_module.py', 'prog.config']

# Standard loop form
python_files = []
# filter the files using a standard for loop 
for file in unfiltered_files:
    if file[-2:] == 'py':
        python_files.append(file)
        
print('using standard for loop: {}'.format(python_files))

#list comprehension
python_files = [file for file in unfiltered_files if file[-2:] == 'py']

print('using list comprehension {}'.format(python_files))

Out:

using standard for loop: ['test.py', 'fun_module.py']
using list comprehension ['test.py', 'fun_module.py']

Example 5 – List comprehension to create a list of lists

list_of_lists = []

# Standard loop form
for i in range(5):
    sub_list = []
    for j in range(3):
        sub_list.append(i * j)
    list_of_lists.append(sub_list)

print(list_of_lists)


# List comprehension
list_of_lists = [[i * j for j in range(3)] for i in range(5)]

print(list_of_lists)

Out:

[[0, 0, 0], [0, 1, 2], [0, 2, 4], [0, 3, 6], [0, 4, 8]]
[[0, 0, 0], [0, 1, 2], [0, 2, 4], [0, 3, 6], [0, 4, 8]]

Example 6: Iterate over all items in a list of lists

The code converts a list of lists to a single list of items. We call this flattening the list.

list_of_lists = [[8, 2, 1], [9, 1, 2], [4, 5, 100]]

# Standard loop form
flat_list = []
for row in list_of_lists:
    for col in row:
        flat_list.append(col)

print(flat_list)

# List comprehension:
flat_list = [item for sublist in list_of_lists for item in sublist]
print(flat_list)

Out:

[8, 2, 1, 9, 1, 2, 4, 5, 100]
[8, 2, 1, 9, 1, 2, 4, 5, 100]

120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently a useful distribution for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew which may frequently occur when some clinical process step has some additional complexity to it compared to the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation that was calculated from the original non-logged population into the mu and sigma of the underlying log normal population.

(For maximum computational efficiency, when calling the function repeatedly with the same mean and standard deviation, you may wish to split this into two functions – one to calculate mu and sigma, which needs only calling once, and the other to sample from the log normal distribution given mu and sigma.)

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html
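
For reference, the conversion the function below implements (with $m$ and $s$ the arithmetic mean and standard deviation of the original, non-logged population) is:

$$\phi = \sqrt{s^2 + m^2}, \qquad \mu = \ln\!\left(\frac{m^2}{\phi}\right), \qquad \sigma = \sqrt{\ln\!\left(\frac{\phi^2}{m^2}\right)}$$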

import numpy as np

def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calculated from the original non-logged population.
    
    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations described at:
        https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-
        with-specified-mean-and-variance.html
        
    Returns a numpy array of floats if n > 1, otherwise returns a float
    """
    
    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    
    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma , n)
    
    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop
        
    return generated_pop

Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and test the resulting generated population has the same mean and standard deviation.

mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print ('Mean:', generated_pop.mean())
print ('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651

Plot a histogram of the generated population:

import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0,51,1)
plt.hist(generated_pop, bins=bins)
plt.show()

Generating a single sample

The function will return a single number if no n is given in the function call:

print (generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125
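
As suggested above, if the function is called repeatedly with the same mean and standard deviation, the work may be split into two functions: one to calculate mu and sigma (called once), and one to sample given mu and sigma. A minimal sketch, using the same calculations as above (function names are illustrative, and numpy is assumed imported as np):

def calculate_mu_sigma(mean, stdev):
    """Convert mean and standard deviation of the original non-logged
    population into mu and sigma of the underlying lognormal distribution"""
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    return mu, sigma

def sample_lognormal(mu, sigma, n=1):
    """Sample n values from a lognormal distribution with given mu and sigma"""
    samples = np.random.lognormal(mu, sigma, n)
    return samples[0] if n == 1 else samples

# Calculate mu and sigma once, then sample repeatedly
mu, sigma = calculate_mu_sigma(10, 10)
print (sample_lognormal(mu, sigma))
print (sample_lognormal(mu, sigma, 5))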

119: Optimising scikit-learn machine learning models with grid search or randomized search

Machine learning models have many hyper-parameters (parameters set before a model is fitted, and which remain constant throughout model fitting). Optimising model hyper-parameters may involve many model runs with alternative hyper-parameters. In SciKit-Learn, this may be performed in an automated fashion using Grid Search (which explores all combinations of provided hyper-parameters) or Randomized Search (which randomly selects combinations to test).

Grid search and randomized search will perform this optimisation using k-fold validation which avoids potential bias in training/test splits.

Here we will revisit a previous example of machine learning, using Random Forests to predict whether a person has breast cancer. We will then use Grid Search to optimise performance, using the ‘f1’ performance score (https://en.wikipedia.org/wiki/F1_score) as an accuracy score that balances the importance of false negatives and false positives.

First we will look at how we previously built the Random Forests model.

(See https://pythonhealthcare.org/2018/04/17/72-machine-learning-random-forests/ for previous post on Random Forest method)

# import required modules

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

def calculate_diagnostic_performance (actual_predicted):
    """ Calculate diagnostic performance.
    
    Takes a NumPy array of ones and zeros, with two columns: actual and predicted
    
    Note that some statistics are repeats with different names
    (precision = positive_predictive_value and recall = sensitivity).
    Both names are returned
    
    Returns a dictionary of results:
        
    1) accuracy: proportion of test results that are correct    
    2) sensitivity: proportion of true +ve identified
    3) specificity: proportion of true -ve identified
    4) positive likelihood: increased probability of true +ve if test +ve
    5) negative likelihood: reduced probability of true +ve if test -ve
    6) false positive rate: proportion of false +ves in true -ve patients
    7) false negative rate:  proportion of false -ves in true +ve patients
    8) positive predictive value: chance of true +ve if test +ve
    9) negative predictive value: chance of true -ve if test -ve
    10) precision = positive predictive value 
    11) recall = sensitivity
    12) f1 = (2 * precision * recall) / (precision + recall)
    13) positive rate = proportion of cases predicted +ve (not strictly a performance measure)
    """
    # Calculate results
    actual_positives = actual_predicted[:, 0] == 1
    actual_negatives = actual_predicted[:, 0] == 0
    test_positives = actual_predicted[:, 1] == 1
    test_negatives = actual_predicted[:, 1] == 0
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)
    true_positives = actual_positives & test_positives
    true_negatives = actual_negatives & test_negatives
    sensitivity = np.sum(true_positives) / np.sum(actual_positives)
    specificity = np.sum(true_negatives) / np.sum(actual_negatives)
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_predictive_value = np.sum(true_positives) / np.sum(test_positives)
    negative_predictive_value = np.sum(true_negatives) / np.sum(test_negatives)
    precision = positive_predictive_value
    recall = sensitivity
    f1 = (2 * precision * recall) / (precision + recall)
    positive_rate = np.mean(actual_predicted[:,1])
    
    # Add results to dictionary
    performance = {}
    performance['accuracy'] = accuracy
    performance['sensitivity'] = sensitivity
    performance['specificity'] = specificity
    performance['positive_likelihood'] = positive_likelihood
    performance['negative_likelihood'] = negative_likelihood
    performance['false_positive_rate'] = false_positive_rate
    performance['false_negative_rate'] = false_negative_rate
    performance['positive_predictive_value'] = positive_predictive_value
    performance['negative_predictive_value'] = negative_predictive_value
    performance['precision'] = precision
    performance['recall'] = recall
    performance['f1'] = f1
    performance['positive_rate'] = positive_rate

    return performance

def load_data ():
    """Load the data set. Here we load the Breast Cancer Wisconsin (Diagnostic)
    Data Set. Data could be loaded from other sources though the structure
    should be compatible with this data set, that is an object with the 
    following attributes:
        .data (holds feature data)
        .feature_names (holds feature titles)
        .target_names (holds outcome classification names)
        .target (holds classification as zero-based number)
        .DESCR (holds text-based description of data set)"""
    
    data_set = datasets.load_breast_cancer()
    return data_set

def normalise (X_train,X_test):
    """Normalise X data, so that training set has mean of zero and standard
    deviation of one"""
    
    # Initialise a new scaling object for normalising input data
    sc=StandardScaler() 
    # Set up the scaler just on the training set
    sc.fit(X_train)
    # Apply the scaler to the training and test sets
    X_train_std=sc.transform(X_train)
    X_test_std=sc.transform(X_test)
    return X_train_std, X_test_std


def print_diagnostic_results (performance):
    """Iterate through, and print, the performance metrics dictionary"""
    
    print('\nMachine learning diagnostic performance measures:')
    print('-------------------------------------------------')
    for key, value in performance.items():
        print (key,'= %0.3f' %value) # print 3 decimal places
    return

def print_feature_importances (model, features):
    """Print the feature importances of a fitted model, sorted high to low"""
    print ()
    print ('Feature importances:')
    print ('--------------------')
    df = pd.DataFrame()
    df['feature'] = features
    df['importance'] = model.feature_importances_
    df = df.sort_values('importance', ascending = False)
    print (df)
    return

def split_data (data_set, split=0.25):
    """Extract X and y data from data_set object, and split into training and
    test data. Split defaults to 75% training, 25% test if no other value is
    passed to the function"""
    
    X=data_set.data
    y=data_set.target
    X_train,X_test,y_train,y_test=train_test_split(
        X,y,test_size=split, random_state=0)
    return X_train,X_test,y_train,y_test

def test_model(model, X, y):
    """Return predicted y given X (attributes)"""
    
    y_pred = model.predict(X)
    test_results = np.vstack((y, y_pred)).T
    return test_results

def train_model (X, y):
    """Train the model. Note n_jobs=-1 uses all cores on a computer"""
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_jobs=-1)
    model.fit (X,y)
    return model

###### Main code #######

# Load data
data_set = load_data()

# Split data into training and test sets
X_train,X_test,y_train,y_test = split_data(data_set, 0.25)

# Normalise data (not needed for Random Forests)
# X_train_std, X_test_std = normalise(X_train,X_test)

# Train model
model = train_model(X_train, y_train)

# Produce results for test set
test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)
Out:

Machine learning diagnostic performance measures:

accuracy = 0.951
sensitivity = 0.944
specificity = 0.962
positive_likelihood = 25.028
negative_likelihood = 0.058
false_positive_rate = 0.038
false_negative_rate = 0.056
positive_predictive_value = 0.977
negative_predictive_value = 0.911
precision = 0.977
recall = 0.944
f1 = 0.960
positive_rate = 0.608
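
As a quick check, the reported f1 follows directly from the reported precision and recall:

precision, recall = 0.977, 0.944
f1 = (2 * precision * recall) / (precision + recall)
print ('f1 = %0.3f' % f1)

Out:

f1 = 0.960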

Optimise with grid search

NOTE: Grid search may take considerable time to run!

Grid search enables us to perform an exhaustive search of hyper-parameters (those model parameters that are constant in any one model). We define which hyper-parameters we wish to change, and what values we wish to try. All combinations are tested. Tests are performed using k-fold validation, which re-runs the model with different train/test splits (this avoids bias in our train/test split, but does increase the time required). You may wish to time a small grid search first (see the sketch below), so you have a better idea of how many parameter combinations you can realistically look at.

We pass four arguments to the grid search method:

1) The range of values for the hyper-parameters, defined in a dictionary
2) The machine learning model to use
3) The number of k-fold splits to use (cv); a value of 5 will give five 80:20 training/test splits, with each sample being present in the test set once
4) The accuracy score to use. In a classification model ‘accuracy’ is common. For a regression model using scoring='neg_mean_squared_error' is common (for grid search an accuracy score must be a ‘utility function’ rather than a ‘cost function’, that is, higher values are better)

If the best model uses a value at one extreme of the provided hyper-parameter ranges then it is best to expand the range of that hyper-parameter to be sure an optimum has been found.

More info on grid search: https://scikit-learn.org/stable/modules/grid_search.html

An alternative approach is randomised hyper-parameter searching. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
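
Before running the full grid search below, you may wish to time a small search first to estimate the cost per model fit (the small grid here is purely illustrative, and X_train and y_train are assumed to be those created above):

import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

small_grid = {'n_estimators': [10, 30]}
start = time.time()
GridSearchCV(RandomForestClassifier(), small_grid, cv=5,
             scoring='accuracy').fit(X_train, y_train)
elapsed = time.time() - start
# Approximately 2 parameter combinations x 5 folds = 10 fits
# (plus one final refit on the full training set)
print ('Seconds per fit: %0.2f' % (elapsed / 10))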

# Use Grid search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': [2, 4, 6, 8, 10],
              'n_jobs': [-1]}

# Grid search will use k-fold cross-validation (cv is the number of splits)
# Grid search also needs a utility function (higher is better) rather than
# a cost function (lower is better), so here we use 'accuracy'

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
grid_search = GridSearchCV(forest_grid, param_grid, cv=10,
                           scoring='accuracy')

grid_search.fit(X_train, y_train); #';' suppresses printed output

Show optimised model hyper-parameters:

# show best parameters
# If best parameters are at the extremes of the searches then extend the range

grid_search.best_params_

Out:

{'bootstrap': True, 'min_samples_split': 6, 'n_estimators': 30, 'n_jobs': -1}
# Or, show full description
grid_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=6,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Now we will use the optimised model. We could copy the text above (from the output of grid_search.best_estimator_), or we can use grid_search.best_estimator_ directly.

# Use optimised model
model = grid_search.best_estimator_
model.fit (X_train, y_train);

Test optimised model:

test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.972
sensitivity = 0.967
specificity = 0.981
positive_likelihood = 51.233
negative_likelihood = 0.034
false_positive_rate = 0.019
false_negative_rate = 0.033
positive_predictive_value = 0.989
negative_predictive_value = 0.945
precision = 0.989
recall = 0.967
f1 = 0.978
positive_rate = 0.615

Our accuracy has now increased from 95.1% to 97.2%.

When the number of parameter combinations becomes unreasonably large for grid search, an alternative is to use randomized search, which selects parameter combinations randomly from the ranges given. The number of combinations tried is given by the argument n_iter.

Below is an example where we expand the number of hyper-parameters varied (becoming too large for grid search to test exhaustively) and use randomized search to test 50 different combinations.

For more details see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

# Use randomized search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': range(2, 11),
              'max_depth': range(1, 30),
              'n_jobs': [-1]}

n_iter_search = 50

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
random_search = RandomizedSearchCV(forest_grid, param_grid, cv=10,
                           n_iter=n_iter_search, scoring='accuracy')

random_search.fit(X_train, y_train); #';' suppresses printed output
# show best parameters
# If best parameters are at the extremes of the searches then extend the range

random_search.best_params_

Out:

{'n_jobs': -1,
 'n_estimators': 100,
 'min_samples_split': 2,
 'max_depth': 29,
 'bootstrap': False}
# Or, show full description
random_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=29, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Now we will train a model with the optimised hyper-parameters, and test it against the test set.

# Use optimised model
model = random_search.best_estimator_
model.fit (X_train, y_train);

test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.986
sensitivity = 0.989
specificity = 0.981
positive_likelihood = 52.411
negative_likelihood = 0.011
false_positive_rate = 0.019
false_negative_rate = 0.011
positive_predictive_value = 0.989
negative_predictive_value = 0.981
precision = 0.989
recall = 0.989
f1 = 0.989
positive_rate = 0.629

So although randomized search does not explore all combinations, because we can increase the number of parameters explored compared with grid search, we have increased our accuracy to 98.6%.

118: Python basics – saving Python objects to disk with pickle

Sometimes we may wish to save Python objects to disk (for example if we have performed a lot of processing to get to a certain point). We can use Python’s pickle module to save and reload almost any Python object. Here we will save and reload a NumPy array, and then save and reload a collection of different objects.

Saving a single python object

Here we will use pickle to save a single object, a NumPy array.

import pickle 
import numpy as np

# Create array of random numbers:
my_array= np.random.rand(2,4)
print (my_array)

Out:

[[0.6383297  0.45250192 0.09882854 0.84896196]
 [0.97006917 0.29206495 0.92500062 0.52965801]]
# Save using pickle
filename = 'pickled_array.p'
with open(filename, 'wb') as filehandler:
    pickle.dump(my_array, filehandler)

Reload and print pickled array:

filename = 'pickled_array.p'
with open(filename, 'rb') as filehandler: 
    reloaded_array = pickle.load(filehandler)

print ('Reloaded array:')
print (reloaded_array)

Out:

Reloaded array:
[[0.6383297  0.45250192 0.09882854 0.84896196]
 [0.97006917 0.29206495 0.92500062 0.52965801]]

Using a tuple to save multiple objects

We can use pickle to save a collection of objects grouped together as a list, a dictionary, or a tuple. Here we will save a collection of objects as a tuple.

# Create an array, a list, and a dictionary
my_array = np.random.rand(2,4)
my_list =['A', 'B', 'C']
my_dictionary = {'name': 'Bob', 'Age': 42}
# Save all items in a tuple
items_to_save = (my_array, my_list, my_dictionary)
filename = 'pickled_tuple_of_objects.p'
with open(filename, 'wb') as filehandler:
    pickle.dump(items_to_save, filehandler)

Reload pickled tuple, unpack the objects, and print them.

filename = 'pickled_tuple_of_objects.p'
with open(filename, 'rb') as filehandler:
    reloaded_tuple = pickle.load(filehandler)


reloaded_array = reloaded_tuple[0]
reloaded_list = reloaded_tuple[1]
reloaded_dict = reloaded_tuple[2]

print ('Reloaded array:')
print (reloaded_array)
print ('\nReloaded list:')
print (reloaded_list)
print ('\n Reloaded dictionary')
print (reloaded_dict)

Out:

Reloaded array:
[[0.40193978 0.55173167 0.89411291 0.84625061]
 [0.86540981 0.27835353 0.43359222 0.31579122]]

Reloaded list:
['A', 'B', 'C']

 Reloaded dictionary
{'name': 'Bob', 'Age': 42}