126. Feature selection 3 (model backward elimination)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally remove features that least reduce model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here:

https://pythonhealthcare.org/2018/04/20/75-machine-learning-choosing-between-models-with-stratified-k-fold-validation/

And we will assess performance with a Receiver Operating Characteristic (ROC) curve. See here.
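
Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler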

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Backward feature elimination

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The backward elimination method:

  • Keeps a list of features still available for removal
  • Loops through the available features:
    • For each feature, measures model performance with that feature removed (using stratified k-fold validation)
    • Identifies the feature whose removal causes the smallest drop in performance
    • Removes that feature from the available features list, and records the model performance without it

This method uses a while loop to keep removing features until only one remains. An alternative would be to use a for loop with a maximum number of features to remove.

# Create lists to store AUC scores and the order in which features are removed
roc_auc_by_feature_number = []
chosen_features = []

# Initialise list of available features and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Create initial reference performance
reference_auc = 1.0 # used to compare reduction in AUC

# Loop through feature list to select the next feature to remove
while len(available_features) > 1:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features - 1))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and best result (the smallest drop in AUC found so far)
    best_result = 1.0 # maximum possible drop in AUC
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create a copy of the available features to avoid the original being changed
        features_to_use = available_features.copy()
        # Create a list of features to use by removing 1 feature
        features_to_use.remove(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up list to hold results for this feature set
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Standardise X data
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets (calculated but not used for selection)
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        # We are looking for the smallest drop in performance
        drop_in_performance = reference_auc - feature_auc
        if drop_in_performance < best_result:
            best_result = drop_in_performance
            best_feature = feature
            best_auc = feature_auc
                
    # The loop through available features is complete
    # Record the best AUC, and remove the chosen feature from the available list
    roc_auc_by_feature_number.append(best_auc)
    chosen_features.append(best_feature)    
    available_features.remove(best_feature)
    reference_auc = best_auc

# Add the last remaining feature (its single-feature AUC is not measured,
# so a placeholder of 0 is recorded)
chosen_features += available_features
roc_auc_by_feature_number.append(0)
    
# Put results in DataFrame
# Reverse order of lists with [::-1] so best features first
results = pd.DataFrame()
results['feature removed'] = chosen_features[::-1]
results['ROC AUC'] = roc_auc_by_feature_number[::-1]

Show results

The table is now in the order of preferred features, though our code worked in the reverse direction, incrementally removing the feature that made least difference to the model.

results
    feature removed       ROC AUC
0   male                 0.000000
1   Pclass               0.766733
2   Age                  0.833036
3   SibSp                0.843055
4   Embarked_S           0.848686
5   CabinNumberImputed   0.853125
6   CabinLetter_C        0.855046
7   CabinLetter_missing  0.855504
8   CabinLetterImputed   0.855506
9   Embarked_missing     0.855533
10  EmbarkedImputed      0.855479
11  CabinLetter_T        0.855479
12  CabinLetter_F        0.855159
13  CabinLetter_B        0.855124
14  Embarked_Q           0.854781
15  Embarked_C           0.853826
16  Fare                 0.853760
17  CabinNumber          0.853030
18  CabinLetter_E        0.852737
19  CabinLetter_A        0.852083
20  AgeImputed           0.851654
21  CabinLetter_D        0.850486
22  CabinLetter_G        0.848673
23  Parch                0.847432

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features removed')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could eliminate all but 5-6 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.
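
As a minimal sketch of that idea, the per-fold scoring could be pulled out into a small helper (a hypothetical score_fold function, not part of the notebook above), so that the metric can be swapped without changing the selection loop:

from sklearn.metrics import f1_score

def score_fold(model, X_test_std, y_test, metric='roc_auc'):
    # Hypothetical helper: score one k-fold test set with a chosen metric
    if metric == 'roc_auc':
        # Reproduces the ROC AUC calculation used above
        probabilities = model.predict_proba(X_test_std)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, probabilities)
        return auc(fpr, tpr)
    elif metric == 'f1':
        # f1 score is based on predicted labels rather than probabilities
        return f1_score(y_test, model.predict(X_test_std))
    else:
        # Simple accuracy
        return np.mean(model.predict(X_test_std) == y_test)

The roc_curve/auc lines inside the k-fold loop would then be replaced with test_auc_results.append(score_fold(model, X_test_std, y_test, metric='f1')), with everything else unchanged.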

125. Feature selection 2 (model forward selection)

Here we use survival on the Titanic to demonstrate a model-based method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a model-based approach whereby we incrementally add features that most increase model performance.

Two key advantages of this method are:

  • It is relatively simple.
  • It is tailored to the model in question.

Some key disadvantages of this method are:

  • It may be slow if there are many parameters (though the loop to select features could be limited in the number of features to select).
  • The selection of features may be dependent on model meta-parameters (such as level of regularisation).
  • The selection of features may not transfer between models (e.g. a model that does not allow for feature interactions may not detect features which do not add much value independently).

We will go through the following steps:

  • Download and save pre-processed data
  • Split data into features (X) and label (y)
  • Loop through features to select the feature that most increases ROC AUC
  • Plot results

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here:

https://pythonhealthcare.org/2018/04/20/75-machine-learning-choosing-between-models-with-stratified-k-fold-validation/

And we will assess performance with a Receiver Operating Characteristic (ROC) curve. See here.

Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at: https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb:

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Forward feature selection

Define data standardisation function.

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

The forward selection method:

  • Keeps a list of selected features
  • Keeps a list of features still available for selection
  • Loops through available features:
    • Calculates added value for each feature (using stratified k-fold validation)
    • Selects feature that adds most value
    • Adds selected feature to selected features list and removes it from available features list

This method uses a while loop to keep exploring features until no more are available. An alternative would be to use a for loop with a maximum number of features to select, as sketched below.
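
As a minimal sketch of that alternative (assuming a hypothetical cap, max_features = 10, which is not part of the original notebook):

# Sketch: select at most max_features features instead of all of them
max_features = 10 # hypothetical cap on the number of features to select

available_features = list(X) # as in the full loop below

for run in range(max_features):
    if len(available_features) == 0:
        break
    # ...the body of the while loop shown below goes here unchanged...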

# Create lists to store AUC scores and chosen features
roc_auc_by_feature_number = []
chosen_features = []

# Initialise chosen features list and run tracker
available_features = list(X)
run = 0
number_of_features = len(list(X))

# Loop through feature list to select the next feature
while len(available_features) > 0:

    # Track and print progress
    run += 1
    print ('Feature run {} of {}'.format(run, number_of_features))
    
    # Convert DataFrames to NumPy arrays
    y_np = y.values
    
    # Reset best feature and accuracy
    best_result = 0
    best_feature = ''

    # Loop through available features
    for feature in available_features:

        # Create copy of already chosen features to avoid original being changed
        features_to_use = chosen_features.copy()
        # Create a list of features from features already chosen + 1 new feature
        features_to_use.append(feature)
        # Get data for features, and convert to NumPy array
        X_np = X[features_to_use].values
        
        # Set up list to hold results for this feature combination
        test_auc_results = []
    
        # Set up k-fold training/test splits
        number_of_splits = 5
        skf = StratifiedKFold(n_splits = number_of_splits)
        skf.get_n_splits(X_np, y)
    
        # Loop through the k-fold splits
        for train_index, test_index in skf.split(X_np, y_np):
            
            # Get X and Y train/test
            X_train, X_test = X_np[train_index], X_np[test_index]
            y_train, y_test = y[train_index], y[test_index]
    
            # Standardise X data
            X_train_std, X_test_std = standardise_data(X_train, X_test)
    
            # Set up and fit model
            model = LogisticRegression(solver='lbfgs')
            model.fit(X_train_std,y_train)
    
            # Predict test set labels
            y_pred_test = model.predict(X_test_std)
            
            # Calculate accuracy of test sets (calculated but not used for selection)
            accuracy_test = np.mean(y_pred_test == y_test)
          
            # Get ROC AUC
            probabilities = model.predict_proba(X_test_std)
            probabilities = probabilities[:, 1] # Probability of 'survived'
            fpr, tpr, thresholds = roc_curve(y_test, probabilities)
            roc_auc = auc(fpr, tpr)
            test_auc_results.append(roc_auc)
        
        # Get average result from all k-fold splits
        feature_auc = np.mean(test_auc_results)
    
        # Update chosen feature and result if this feature is a new best
        if feature_auc > best_result:
            best_result = feature_auc
            best_feature = feature
    
    # The loop through available features is complete
    # Record the best AUC and update the feature lists
    roc_auc_by_feature_number.append(best_result)
    chosen_features.append(best_feature)
    available_features.remove(best_feature)

# Put results in DataFrame
results = pd.DataFrame()
results['feature to add'] = chosen_features
results['ROC AUC'] = roc_auc_by_feature_number
results
    feature to add        ROC AUC
0   male                 0.766733
1   Pclass               0.833036
2   Age                  0.843055
3   SibSp                0.848686
4   Embarked_S           0.853125
5   CabinLetter_E        0.855740
6   CabinNumberImputed   0.856158
7   CabinLetter_T        0.856160
8   CabinLetter_D        0.856167
9   CabinLetter_F        0.856307
10  EmbarkedImputed      0.856121
11  Embarked_missing     0.856094
12  CabinLetterImputed   0.855637
13  CabinLetter_missing  0.855744
14  Fare                 0.855302
15  CabinLetter_A        0.854307
16  CabinLetter_B        0.853215
17  CabinNumber          0.852896
18  Embarked_C           0.851377
19  Embarked_Q           0.851324
20  CabinLetter_C        0.849972
21  AgeImputed           0.848673
22  CabinLetter_G        0.847432
23  Parch                0.845842

Plot results

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features+1))

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features')
plt.ylabel('Accuracy (ROC AUC)')
plt.legend()
plt.grid(True)

plt.show()

From the above results it looks like we could use just 5-7 features in this model. It may also be worth examining the same method using other performance scores (such as simple accuracy, or f1) in place of ROC AUC.

Note that the ROC AUC of the model appears to decline slightly once more than about 10 features are included.

124. Feature selection 1 (univariate statistical selection)

Here we use survival on the Titanic to demonstrate a simple statistical method to select the most important features.

Reducing the number of features we use can have three benefits:

  • Simplifies model explanation
  • Model fit may be improved by the removal of features that add no value
  • Model will be faster to fit

In this notebook we will use a simple statistical method for selecting features called univariate feature selection. We will examine the correlation between each feature and the target label value. This is called univariate statistics because we examine each feature independently.

Two key advantages of this method are:

  • It is simple
  • It is fast

Two key disadvantages of this method are:

  • It may miss features which have little effect alone, but which are influential when combined
  • It may include several features which are highly correlated with each other, where choosing just one of them would suffice

The machine learning model we will use to test the feature selection will be a simple logistic regression model.

We will go through the following steps:

  • Download and save pre-processed data
  • Split data into features (X) and label (y)
  • Calculate the correlation of each feature with the target label value
  • Sort by correlation (ignoring the +ve/-ve sign)
  • Test the features in our logistic regression model

We will assess performance using k-fold stratification for better measurement of performance. If you are not familiar with k-fold stratification, have a look here.

https://pythonhealthcare.org/2018/04/20/75-machine-learning-choosing-between-models-with-stratified-k-fold-validation/

And we will assess performance with a Receiver Operating Characteristic (ROC) curve. See here.

Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

import numpy as np
import pandas as pd

# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Load data

The section below downloads pre-processed data, and saves it to a subfolder (from where this code is run). If data has already been downloaded that cell may be skipped.

download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

Load data once downloaded:

data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

# Drop Passengerid (axis=1 indicates we are removing a column rather than a row)
# We drop passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict). By convention our features are called X (usually upper case to denote multiple features), and the label (survive or not) y.

X = data.drop('Survived',axis=1) # X = all 'data' except the 'survived' column
y = data['Survived'] # y = 'survived' column from 'data'

Calculate correlation coefficients

Calculate the correlation between each feature and survival.

from scipy.stats import pearsonr
features = list(X)
correlation = []
significance = []
for feature in features:
    correl = pearsonr(X[feature].values, y.values)
    correlation.append(correl[0])
    significance.append(correl[1])
df = pd.DataFrame()
df['feature'] = features
df['correlation'] = correlation
df['abs_correlation'] = np.abs(correlation)
df['significance'] = significance
df['significant'] = df['significance'] < 0.05 # Label those with P < 0.05
df.sort_values(by='abs_correlation', ascending=False, inplace=True)

Show features in order of correlation with survival. Note: significance is the P value, the probability that a correlation this strong would arise by chance alone. Lower values show higher significance.

df
    feature              correlation  abs_correlation  significance  significant
10  male                   -0.543351         0.543351  1.406066e-69         True
0   Pclass                 -0.338481         0.338481  2.537047e-25         True
9   CabinNumberImputed     -0.321842         0.321842  6.404266e-23         True
7   CabinLetterImputed     -0.316912         0.316912  3.090891e-22         True
23  CabinLetter_missing    -0.316912         0.316912  3.090891e-22         True
4   Fare                    0.257307         0.257307  6.120189e-15         True
8   CabinNumber             0.235409         0.235409  1.100977e-12         True
16  CabinLetter_B           0.175095         0.175095  1.441584e-07         True
11  Embarked_C              0.168240         0.168240  4.397151e-07         True
13  Embarked_S             -0.155660         0.155660  3.036111e-06         True
18  CabinLetter_D           0.150716         0.150716  6.233140e-06         True
19  CabinLetter_E           0.145321         0.145321  1.331654e-05         True
17  CabinLetter_C           0.114652         0.114652  6.061874e-04         True
5   AgeImputed             -0.092197         0.092197  5.886535e-03         True
3   Parch                   0.081629         0.081629  1.479925e-02         True
1   Age                    -0.064910         0.064910  5.276069e-02        False
14  Embarked_missing        0.060095         0.060095  7.298717e-02        False
6   EmbarkedImputed         0.060095         0.060095  7.298717e-02        False
20  CabinLetter_F           0.057935         0.057935  8.392361e-02        False
2   SibSp                  -0.035322         0.035322  2.922439e-01        False
22  CabinLetter_T          -0.026456         0.026456  4.302610e-01        False
15  CabinLetter_A           0.022287         0.022287  5.064306e-01        False
21  CabinLetter_G           0.016040         0.016040  6.325420e-01        False
12  Embarked_Q              0.003650         0.003650  9.133532e-01        False

Get ordered feature list.

ordered_features = list(df['feature'])

ordered_features
['male',
 'Pclass',
 'CabinNumberImputed',
 'CabinLetterImputed',
 'CabinLetter_missing',
 'Fare',
 'CabinNumber',
 'CabinLetter_B',
 'Embarked_C',
 'Embarked_S',
 'CabinLetter_D',
 'CabinLetter_E',
 'CabinLetter_C',
 'AgeImputed',
 'Parch',
 'Age',
 'Embarked_missing',
 'EmbarkedImputed',
 'CabinLetter_F',
 'SibSp',
 'CabinLetter_T',
 'CabinLetter_A',
 'CabinLetter_G',
 'Embarked_Q']

Testing our selected features

After statistical selection we may simply choose the top k features, or we may choose those labelled as significant (P<0.05). Both can be read straight off the sorted DataFrame, as shown below.
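
For example (a small illustration using the df created above; k = 10 is an arbitrary choice):

# Option 1: take the top k features by absolute correlation
k = 10
top_k_features = list(df['feature'].iloc[0:k])

# Option 2: take all features flagged as statistically significant (P < 0.05)
significant_features = list(df.loc[df['significant'], 'feature'])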

Here we will incrementally add features to the list of features to use (chosen in order of their correlation coefficients), and see the effect on model accuracy and Receiver Operating Characteristic (ROC) Area Under Curve (AUC)* as measured by k-fold stratification. We will use sklearn’s implementation of the ROC AUC calculation.

*A reminder that ROC AUC measures the balance between the true positive rate and the false positive rate as the threshold for classifying a case as positive is varied.
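
As an aside, sklearn also provides roc_auc_score, which combines the roc_curve and auc steps used below into a single call. A toy illustration (the data here is made up purely to show the call):

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

print(roc_auc_score(y_true, y_prob)) # 0.75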

def standardise_data(X_train, X_test):
    
    # Initialise a new scaling object for normalising input data
    sc = StandardScaler() 

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std=sc.transform(X_train)
    test_std=sc.transform(X_test)
    
    return train_std, test_std

# Create list to store accuracies
accuracy_by_feature_number = []
roc_auc_by_feature_number = []

# Loop through feature list
number_of_features = len(ordered_features)
for i in range(number_of_features):
    # print ("{0} features of {1}".format(i, number_of_features))
    features_to_use = ordered_features[0:i+1]
    X_selected = X[features_to_use]
    
    # Convert to NumPy (needed for k-fold method)
    # Convert DataFrames to NumPy arrays
    X_np = X_selected.values
    y_np = y.values
    
    # Run k-fold model

    # Set up lists to hold results for each k-fold run
    test_acc_results = []
    test_auc_results = []

    # Set up splits
    number_of_splits = 10
    skf = StratifiedKFold(n_splits = number_of_splits)
    skf.get_n_splits(X_np, y)

    # Loop through the k-fold splits
    for train_index, test_index in skf.split(X_np, y_np):
        # Get X and Y train/test
        X_train, X_test = X_np[train_index], X_np[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Get X and Y train/test
        X_train_std, X_test_std = standardise_data(X_train, X_test)

        # Set up and fit model
        model = LogisticRegression(solver='lbfgs')
        model.fit(X_train_std,y_train)

        # Predict test set labels
        y_pred_test = model.predict(X_test_std)
        
        # Calculate accuracy of test sets
        accuracy_test = np.mean(y_pred_test == y_test)
        test_acc_results.append(accuracy_test)
        
        # Get ROC AUC
        probabilities = model.predict_proba(X_test_std)
        probabilities = probabilities[:, 1] # Probability of 'survived' class
        fpr, tpr, thresholds = roc_curve(y_test, probabilities)
        roc_auc = auc(fpr, tpr)
        test_auc_results.append(roc_auc)      
        
    # Add mean accuracy and AUC to record of accuracy by feature number
    accuracy_by_feature_number.append(np.mean(test_acc_results))
    roc_auc_by_feature_number.append(np.mean(test_auc_results))

Plot results:

import matplotlib.pyplot as plt
%matplotlib inline

chart_x = list(range(1, number_of_features + 1))

plt.plot(chart_x, accuracy_by_feature_number,
        label = 'Accuracy')

plt.plot(chart_x, roc_auc_by_feature_number,
        label = 'ROC AUC')

plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.show()

The results show that our ROC AUC* increases significantly between 1 and 2 features, and then climbs slowly up to about 20 features. Accuracy, interestingly, first declines with more features, and then climbs, to a plateau between 16 and 22 features.

Taking the top 20 features is likely to give us our best model (though we could reduce features more if computational time was critical), as sketched below.
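
As a minimal sketch (re-using X, y and ordered_features defined above), a final model could then be fitted on just those top 20 features:

# Fit a final logistic regression model using only the top 20 features
final_features = ordered_features[0:20]
X_final = X[final_features].values

# Standardise (for a final model, the scaler is fitted on all available data)
sc = StandardScaler()
X_final_std = sc.fit_transform(X_final)

final_model = LogisticRegression(solver='lbfgs')
final_model.fit(X_final_std, y.values)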