# 119: Optimising scikit-learn machine learning models with grid search or randomized search

Machine learning models have many hyper-parameters (parameters set before a model is fitted, and which remain constant throughout model fitting). Optimising model hyper-parameters may involve many model runs with alternative values. In scikit-learn this may be performed in an automated fashion using Grid Search (which tests all combinations of the provided hyper-parameters) or Randomized Search (which randomly selects combinations to test).

Grid search and randomized search perform this optimisation using k-fold cross-validation, which avoids potential bias from any single training/test split.
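As a quick illustration of k-fold cross-validation on its own, scikit-learn's `cross_val_score` scores a model across several train/test splits. A minimal sketch (the small forest size here is just to keep it quick):

```python
# Score a random forest with 5-fold cross-validation: the data are split
# into five parts, and each part is used once as the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data_set = load_breast_cancer()
model = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(model, data_set.data, data_set.target, cv=5)
print(scores.mean())  # mean accuracy across the five folds
```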

Here we will revisit a previous example of machine learning, using Random Forests to predict whether a person has breast cancer. We will then use Grid Search to optimise performance, using the ‘f1’ performance score (https://en.wikipedia.org/wiki/F1_score) as an accuracy score that balances the importance of false negatives and false positives.

First we will look at how we previously built the Random Forests model.

(See https://pythonhealthcare.org/2018/04/17/72-machine-learning-random-forests/ for previous post on Random Forest method)

``````# import required modules

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

def calculate_diagnostic_performance(actual_predicted):
    """Calculate diagnostic performance.

    Takes a NumPy array of ones and zeros in two columns: actual and predicted.

    Note that some statistics are repeats with different names
    (precision = positive_predictive_value and recall = sensitivity).
    Both names are returned.

    Returns a dictionary of results:

    1) accuracy: proportion of test results that are correct
    2) sensitivity: proportion of true +ve identified
    3) specificity: proportion of true -ve identified
    4) positive likelihood: increased probability of true +ve if test +ve
    5) negative likelihood: reduced probability of true +ve if test -ve
    6) false positive rate: proportion of false +ves in true -ve patients
    7) false negative rate: proportion of false -ves in true +ve patients
    8) positive predictive value: chance of true +ve if test +ve
    9) negative predictive value: chance of true -ve if test -ve
    10) precision = positive predictive value
    11) recall = sensitivity
    12) f1 = (2 * precision * recall) / (precision + recall)
    13) positive rate = rate of true +ve (not strictly a performance measure)
    """
    # Calculate results
    actual_positives = actual_predicted[:, 0] == 1
    actual_negatives = actual_predicted[:, 0] == 0
    test_positives = actual_predicted[:, 1] == 1
    test_negatives = actual_predicted[:, 1] == 0
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)
    true_positives = actual_positives & test_positives
    true_negatives = actual_negatives & test_negatives
    sensitivity = np.sum(true_positives) / np.sum(actual_positives)
    specificity = np.sum(true_negatives) / np.sum(actual_negatives)
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_predictive_value = np.sum(true_positives) / np.sum(test_positives)
    negative_predictive_value = np.sum(true_negatives) / np.sum(test_negatives)
    precision = positive_predictive_value
    recall = sensitivity
    f1 = (2 * precision * recall) / (precision + recall)
    positive_rate = np.mean(actual_predicted[:, 1])

    performance = {}
    performance['accuracy'] = accuracy
    performance['sensitivity'] = sensitivity
    performance['specificity'] = specificity
    performance['positive_likelihood'] = positive_likelihood
    performance['negative_likelihood'] = negative_likelihood
    performance['false_positive_rate'] = false_positive_rate
    performance['false_negative_rate'] = false_negative_rate
    performance['positive_predictive_value'] = positive_predictive_value
    performance['negative_predictive_value'] = negative_predictive_value
    performance['precision'] = precision
    performance['recall'] = recall
    performance['f1'] = f1
    performance['positive_rate'] = positive_rate

    return performance

def load_data():
    """Load the data set. Here we load the Breast Cancer Wisconsin (Diagnostic)
    Data Set. Data could be loaded from other sources though the structure
    should be compatible with this data set, that is an object with the
    following attributes:
        .data (holds feature data)
        .feature_names (holds feature titles)
        .target_names (holds outcome classification names)
        .target (holds classification as zero-based number)
        .DESCR (holds text-based description of data set)"""

    data_set = datasets.load_breast_cancer()
    return data_set

def normalise(X_train, X_test):
    """Normalise X data, so that training set has mean of zero and standard
    deviation of one"""

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()
    # Set up the scaler just on the training set
    sc.fit(X_train)
    # Apply the scaler to the training and test sets
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    return X_train_std, X_test_std

def print_diagnostic_results(performance):
    """Iterate through, and print, the performance metrics dictionary"""

    print('\nMachine learning diagnostic performance measures:')
    print('-------------------------------------------------')
    for key, value in performance.items():
        print(key, '= %0.3f' % value)  # print 3 decimal places
    return

def print_feature_importances(model, features):
    """Print a table of feature importances, most important first"""
    print()
    print('Feature importances:')
    print('--------------------')
    df = pd.DataFrame()
    df['feature'] = features
    df['importance'] = model.feature_importances_
    df = df.sort_values('importance', ascending=False)
    print(df)
    return

def split_data(data_set, split=0.25):
    """Extract X and y data from data_set object, and split into training and
    test data. Split defaults to 75% training, 25% test if no other value is
    passed to the function"""

    X = data_set.data
    y = data_set.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=split, random_state=0)
    return X_train, X_test, y_train, y_test

def test_model(model, X, y):
    """Return predicted y given X (attributes)"""

    y_pred = model.predict(X)
    test_results = np.vstack((y, y_pred)).T
    return test_results

def train_model(X, y):
    """Train the model. Note n_jobs=-1 uses all cores on a computer"""
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_jobs=-1)
    model.fit(X, y)
    return model

###### Main code #######

# Load data
data_set = load_data()

# Split data into training and test sets
X_train, X_test, y_train, y_test = split_data(data_set, 0.25)

# Normalise data (not needed for Random Forests)
# X_train_std, X_test_std = normalise(X_train,X_test)

# Train model
model = train_model(X_train, y_train)

# Produce results for test set
test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)``````
``````Out: Machine learning diagnostic performance measures:

accuracy = 0.951
sensitivity = 0.944
specificity = 0.962
positive_likelihood = 25.028
negative_likelihood = 0.058
false_positive_rate = 0.038
false_negative_rate = 0.056
positive_predictive_value = 0.977
negative_predictive_value = 0.911
precision = 0.977
recall = 0.944
f1 = 0.960
positive_rate = 0.608
``````

# Optimise with grid search

NOTE: Grid search may take considerable time to run!

Grid search enables us to perform an exhaustive search of hyper-parameters (those model parameters that are constant in any one model). We define which hyper-parameters we wish to change, and what values we wish to try; all combinations are then tested. Tests are performed using k-fold cross-validation, which re-runs the model with different train/test splits (this avoids bias from any one train/test split, but does increase the time required). You may wish to time a small grid search first, so you have a better idea of how many parameter combinations you can realistically explore.

We pass four arguments to the grid search method:

1) The range of values for the hyper-parameters, defined in a dictionary
2) The machine learning model to use
3) The number of k-fold splits to use (`cv`); a value of 5 will give five 80:20 training/test splits, with each sample appearing in the test set once
4) The accuracy score to use. In a classification model `'accuracy'` is common. For a regression model `scoring='neg_mean_squared_error'` is common (for grid search an accuracy score must be a 'utility function' rather than a 'cost function', that is, higher values are better)
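For a regression model the call is the same apart from the scoring argument; a minimal sketch (the diabetes data set and the tiny grid here are illustrative only):

```python
# Grid search for a regression model, scored with negated mean squared
# error so that higher values are still better (a utility function).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
param_grid = {'n_estimators': [10, 30]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                    cv=3, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_score_)  # a negated MSE: closer to zero is better
```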

If the best model uses a value at one extreme of a provided hyper-parameter range, it is best to expand the range of that hyper-parameter to be sure an optimum has been found.

An alternative approach is randomised hyper-parameter searching. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

``````# Use grid search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': [2, 4, 6, 8, 10],
              'n_jobs': [-1]}

# Grid search will use k-fold cross-validation (cv is the number of splits)
# Grid search also needs a utility function (higher is better) rather than
# a cost function (lower is better), so for classification we use 'accuracy'

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
grid_search = GridSearchCV(forest_grid, param_grid, cv=10,
                           scoring='accuracy')

grid_search.fit(X_train, y_train); # ';' suppresses printed output``````

Show optimised model hyper-parameters:

``````# show best parameters
# If best parameters are at the extremes of the searches then extend the range

grid_search.best_params_

Out:

{'bootstrap': True, 'min_samples_split': 6, 'n_estimators': 30, 'n_jobs': -1}``````
``````# Or, show full description
grid_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=6,
min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)``````
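Beyond `best_params_` and `best_estimator_`, the search object also exposes `best_score_` (the best mean cross-validated score) and `cv_results_` (a score for every combination tried). A standalone sketch with a deliberately small grid:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'n_estimators': [10, 30]}, cv=3, scoring='accuracy')
grid.fit(X, y)

# Mean cross-validated accuracy of the best combination
print(grid.best_score_)

# One row per combination tried, with its mean test-fold score
results = pd.DataFrame(grid.cv_results_)
print(results[['param_n_estimators', 'mean_test_score']])
```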

Now we will use the optimised model. We could copy the text above (from the output of `grid_search.best_estimator_`), or we can use `grid_search.best_estimator_` directly.

``````# Use optimised model
model = grid_search.best_estimator_
model.fit (X_train, y_train);``````

Test optimised model:

``````test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.972
sensitivity = 0.967
specificity = 0.981
positive_likelihood = 51.233
negative_likelihood = 0.034
false_positive_rate = 0.019
false_negative_rate = 0.033
positive_predictive_value = 0.989
negative_predictive_value = 0.945
precision = 0.989
recall = 0.967
f1 = 0.978
positive_rate = 0.615``````

Our accuracy has now increased from 95.1% to 97.2%.

When the number of parameter combinations becomes unreasonably large for grid search, an alternative is to use random search, which selects parameter combinations at random from the ranges given. The number of combinations tried is given by the argument `n_iter`.
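The number of combinations in a grid is the product of the lengths of the value lists, so it grows multiplicatively. For example, a grid similar to the expanded one used in the random search example would need thousands of fits per cross-validation fold if searched exhaustively:

```python
import math

# An expanded grid, similar to the random search example
param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': list(range(2, 11)),
              'max_depth': list(range(1, 30))}

# Total combinations = product of the number of values per parameter
n_combinations = math.prod(len(values) for values in param_grid.values())
print(n_combinations)  # 6 * 2 * 9 * 29 = 3132
```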

Below is an example where we expand the number of hyper-parameters varied (becoming too large for grid search) and use random search to test 50 different combinations.

``````# Use random search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': range(2, 11),
              'max_depth': range(1, 30),
              'n_jobs': [-1]}

n_iter_search = 50

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
random_search = RandomizedSearchCV(forest_grid, param_grid, cv=10,
                                   n_iter=n_iter_search, scoring='accuracy')

random_search.fit(X_train, y_train); # ';' suppresses printed output``````
``````# show best parameters
# If best parameters are at the extremes of the searches then extend the range

random_search.best_params_

Out:

{'n_jobs': -1,
'n_estimators': 100,
'min_samples_split': 2,
'max_depth': 29,
'bootstrap': False}``````
``````# Or, show full description
random_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=29, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)``````

Now we will train a model with the optimised hyper-parameters, and test it against the test set.

``````# Use optimised model
model = random_search.best_estimator_
model.fit (X_train, y_train);

test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.986
sensitivity = 0.989
specificity = 0.981
positive_likelihood = 52.411
negative_likelihood = 0.011
false_positive_rate = 0.019
false_negative_rate = 0.011
positive_predictive_value = 0.989
negative_predictive_value = 0.981
precision = 0.989
recall = 0.989
f1 = 0.989
positive_rate = 0.629``````

So although random search does not explore all combinations, it lets us expand the number of hyper-parameters explored compared with grid search, and here it has increased our accuracy to 98.6%.

# 80. Grouping unlabelled data with k-means clustering

k-means clustering is a form of ‘unsupervised learning’. This is useful for grouping unlabelled data. For example, in the Wisconsin breast cancer data set, what if we did not know whether the patients had cancer or not at the time the data was collected? Could we find a way of identifying groups of similar patients? These groups may then form the basis of some further investigation.

k-means clustering groups data in such a way that the average distance of all points to their closest cluster centre is minimised.
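Under the hood this is usually Lloyd's algorithm: alternately assign each point to its nearest centre, then move each centre to the mean of its assigned points. A minimal NumPy sketch (for illustration only; this is not the scikit-learn implementation, and the deterministic initialisation is a simplification):

```python
import numpy as np

def kmeans_sketch(X, k, n_iterations=10):
    """Minimal k-means (Lloyd's algorithm) for illustration."""
    # Initialise centres from evenly spaced data points; real
    # implementations use random or k-means++ initialisation
    centres = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(n_iterations):
        # Distance of every point to every centre; assign closest centre
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centre to the mean of its assigned points
        centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centres

# Two well-separated blobs should come back as two clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 20])
labels, centres = kmeans_sketch(X, k=2)
```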

Let’s start by importing the required modules, and loading our data. In order to better visualise what k-means is doing, we will limit our data to two features (though k-means clustering works with any number of features).

```import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from sklearn import datasets
data_set = datasets.load_breast_cancer()
X = data_set.data[:, 0:2]  # restrict patient data to two features
```

We will normalise our data:

```sc=StandardScaler()
sc.fit(X)
X_std=sc.transform(X)
```

And now we will group patients into three clusters (k=3), and plot the centre of each cluster (each cluster centre has two co-ordinates, representing the normalised features we are using). If we used data with more than two features we would have a co-ordinate point for each feature.

```kmeans = KMeans(n_clusters=3)
kmeans.fit(X_std)
print('\nCluster centres:')
print(kmeans.cluster_centers_)
```
```Cluster centres:
[[-0.30445203  0.92354217]
[ 1.52122912  0.58146379]
[-0.50592385 -0.73299233]]
```

We can identify which cluster a sample has been put into with ‘kmeans.labels_’. Using that we may plot our clustering:

```labels = kmeans.labels_

plt.scatter(X_std[:, 0], X_std[:, 1],
            c=labels, cmap=plt.cm.rainbow)
plt.xlabel('Normalised feature 1')
plt.ylabel('Normalised feature 2')
plt.show()
```

## How many clusters should be used?

Sometimes we may have prior knowledge that we want to group the data into a given number of clusters. Other times we may wish to investigate what may be a good number of clusters.

In the example below we vary the number of clusters from 1 to 99 and measure the average distance of points from their closest cluster centre (`kmeans.transform` gives the distance from each sample to each cluster centre). Looking at the results we may decide that up to about 10 clusters is useful, but beyond that there are diminishing returns from adding further clusters.

```distance_to_closest_cluster_centre = []
for k in range(1, 100):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_std)
    # Distance from each sample to its closest cluster centre
    distance = np.min(kmeans.transform(X_std), axis=1)
    average_distance = np.mean(distance)
    distance_to_closest_cluster_centre.append(average_distance)

clusters = np.arange(len(distance_to_closest_cluster_centre)) + 1
plt.plot(clusters, distance_to_closest_cluster_centre)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distance to closest cluster centroid')
plt.ylim(0, 1.5)
plt.show()
```

# 73. Machine learning: neural networks

The last of our machine learning methods that we will look at in this introduction is neural networks.

Neural networks power much of modern image and voice recognition. They can cope with highly complex data, but often need large amounts of data to train well. There are many parameters that can be changed, so fine-tuning a neural net can require extensive work. We will not go into all the ways they may be fine-tuned here, but just look at a simple example.
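As a taste of the full post, a neural network can be slotted into the same framework used throughout these examples; a minimal sketch using scikit-learn's `MLPClassifier` (the layer size and other parameter values here are illustrative, not the post's exact settings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks are sensitive to feature scale, so normalise first
sc = StandardScaler().fit(X_train)
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000,
                      random_state=0)
model.fit(sc.transform(X_train), y_train)
accuracy = model.score(sc.transform(X_test), y_test)
print(accuracy)
```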

# 72. Machine Learning: Random Forests

Random forest is a versatile machine learning method based on decision trees. One useful feature of random forests is that it is easy to obtain the relative importance of features. This may be used to help better understand what drives classification, and may also be used to reduce the feature set used with minimal reduction in accuracy.

Once again we will re-use our logistic regression model, and replace the model function with the following three lines:

```
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

model.fit(X, y)
```
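Once fitted, the relative importance of features mentioned above can be read straight from the model; a sketch (with a smaller forest than the 10,000 trees above, to keep the run quick):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data_set = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(data_set.data, data_set.target)

# Importances sum to one; sorting shows which features drive the forest
importances = pd.DataFrame({'feature': data_set.feature_names,
                            'importance': model.feature_importances_})
print(importances.sort_values('importance', ascending=False).head())
```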

# 71. Machine Learning: Support Vector Machines

So far we have used logistic regression to create our machine learning model. We will now cover some alternative methods, starting with Support Vector Machines. We can keep nearly all of our code; in fact we only have to change the three lines in the ‘train model’ function.

We’ll re-use the logistic regression code for finding an appropriate level of regularisation for the model (to avoid over-fitting to the training data). Our three lines of code which define the model are now:
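Those three lines might look like the sketch below (the kernel and `C` value are illustrative; `C` plays the same regularisation role as in the logistic regression posts, with lower values giving stronger regularisation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_model(X, y):
    # The three replacement lines: define, fit and return an SVM
    model = SVC(kernel='rbf', C=1.0)
    model.fit(X, y)
    return model

# SVMs, like logistic regression, benefit from normalised inputs
X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
model = train_model(X_std, y)
accuracy = model.score(X_std, y)  # accuracy on the data used for training
print(accuracy)
```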

# 70. Machine Learning: Working with ordinal and categorical data

Some data sets may have ordinal data: descriptions with a natural order, such as small, medium, large. There may also be categorical data, which has no obvious order, like green, blue, red. We’ll usually want to convert both of these into numbers for use by machine learning models.

Let’s look at an example:
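To preview the idea: ordinal values are mapped to integers that preserve their order, while categorical values get one dummy (one-hot) column per value; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small'],
                   'colour': ['green', 'blue', 'red', 'blue']})

# Ordinal data: encode the natural order explicitly
size_order = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_order)

# Categorical data: one column per value, with no implied order
df = pd.get_dummies(df, columns=['colour'])
print(df)
```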

# 69. Machine Learning: How do you know if you have gathered enough data? By using learning rates.

Do you have enough data samples to build a good machine learning model? Examining the learning rate of your model will help you decide whether it could be improved by having more data.

Here we repeat the logistic regression model on the Wisconsin Breast Cancer Diagnostic data set. We set our regularisation parameter (C) to 1, based on our previous experiment, and then look at the effect of restricting our training set to varying sizes (the test set remains the same size, at 25% of our data).

We can see that as we increase our training set size, the accuracy on the training set falls (it is easier to over-fit smaller data sets) while the accuracy on the test set rises. With all the data we have, the test set accuracy has not yet plateaued, and it remains noticeably poorer than the training set accuracy (at our optimum regularisation for this amount of data). These two observations suggest that the model would benefit from more data. If we did obtain more data we should repeat the experiment to find the optimum regularisation: generally, as data set size increases, the need for regularisation reduces.

Different types of machine learning model may have different learning rates. This may influence your choice of model.

``````# import required modules

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def calculate_diagnostic_performance(actual_predicted):
    """Here we truncate calculation of results to just the accuracy measure"""

    # Calculate results
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)

    performance = {}
    performance['accuracy'] = accuracy
    return performance

def chart_results(results):
    x = results['n']
    y1 = results['training_accuracy']
    y2 = results['test_accuracy']

    # Create figure
    fig = plt.figure(figsize=(5, 5))
    ax = fig.add_subplot(111)
    ax.plot(x, y1, color='k', linestyle='solid', label='Training set')
    ax.plot(x, y2, color='b', linestyle='dashed', label='Test set')
    ax.set_xlabel('Training set size (cases)')
    ax.set_ylabel('Accuracy')
    plt.title('Effect of training set size on model accuracy')
    plt.legend()
    plt.show()

def load_data():
    """Load the data set. Here we load the Breast Cancer Wisconsin (Diagnostic)
    Data Set. Data could be loaded from other sources though the structure
    should be compatible with this data set, that is an object with the
    following attributes:
        .data (holds feature data)
        .feature_names (holds feature titles)
        .target_names (holds outcome classification names)
        .target (holds classification as zero-based number)
        .DESCR (holds text-based description of data set)"""

    data_set = datasets.load_breast_cancer()
    return data_set

def normalise(X_train, X_test):
    """Normalise X data, so that training set has mean of zero and standard
    deviation of one"""

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()
    # Set up the scaler just on the training set
    sc.fit(X_train)
    # Apply the scaler to the training and test sets
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    return X_train_std, X_test_std

def print_diagnostic_results(performance):
    """Iterate through, and print, the performance metrics dictionary"""

    print('\nMachine learning diagnostic performance measures:')
    print('-------------------------------------------------')
    for key, value in performance.items():
        print(key, '= %0.3f' % value)  # print 3 decimal places
    return

def split_data(data_set, split, n):
    """Extract X and y data from data_set object, split into training and
    test data, and restrict the training set to the first n cases"""

    X = data_set.data
    y = data_set.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=split)
    X_train = X_train[0:n]
    y_train = y_train[0:n]
    return X_train, X_test, y_train, y_test

def test_model(model, X, y):
    """Return predicted y given X (attributes)"""

    y_pred = model.predict(X)
    test_results = np.vstack((y, y_pred)).T
    return test_results

def train_model(X, y):
    """Train the model"""

    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(C=1000)
    model.fit(X, y)
    return model

###### Main code #######

# Load data
data_set = load_data()

# List of training set sizes to test
number_of_training_points = range(25, 450, 25)

# Set up empty lists to record results
training_accuracy = []
test_accuracy = []
n_results = []

for n in number_of_training_points:
    # Repeat model fitting/prediction 1000 times for each training set size
    for i in range(1000):

        # Split data into training and test sets
        X_train, X_test, y_train, y_test = split_data(data_set, 0.25, n)

        # Normalise data
        X_train_std, X_test_std = normalise(X_train, X_test)

        # Record training set size for this run
        n_results.append(n)

        # Train model
        model = train_model(X_train_std, y_train)

        # Produce results for training set
        test_results = test_model(model, X_train_std, y_train)
        performance = calculate_diagnostic_performance(test_results)
        training_accuracy.append(performance['accuracy'])

        # Produce results for test set
        test_results = test_model(model, X_test_std, y_test)
        performance = calculate_diagnostic_performance(test_results)
        test_accuracy.append(performance['accuracy'])

results = pd.DataFrame()
results['n'] = n_results
results['training_accuracy'] = training_accuracy
results['test_accuracy'] = test_accuracy
summary = results.groupby('n').median()
summary['n'] = list(summary.index)

print(summary)
chart_results(summary)

OUT:

training_accuracy  test_accuracy    n
n
25                 1.0       0.930070   25
50                 1.0       0.944056   50
75                 1.0       0.951049   75
100                1.0       0.951049  100
125                1.0       0.958042  125
150                1.0       0.958042  150
175                1.0       0.958042  175
200                1.0       0.958042  200
225                1.0       0.958042  225
250                1.0       0.958042  250
275                1.0       0.958042  275
300                1.0       0.958042  300
325                1.0       0.958042  325
350                1.0       0.958042  350
375                1.0       0.958042  375
400                1.0       0.951049  400
425                1.0       0.958042  425``````

# 68. Machine learning: Using regularisation to improve accuracy

Many machine learning techniques include an option to fine-tune regularisation. Regularisation helps to avoid over-fitting of the model to the training set (over-fitting boosts apparent accuracy on the training data at the cost of accuracy of prediction for previously unseen samples in the test set). In the logistic regression method that we have been looking at, the regularisation term in the model fit is `C`; the lower the `C` value, the greater the regularisation. The previous code has been amended below to loop through a series of `C` values. For each value of `C` the model fit is run 100 times with different random train/test splits, and the average results are presented.
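In outline, the loop looks like the sketch below (with ten repeats per `C` value rather than the post's 100, to keep it quick; the `C` values chosen are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
mean_test_accuracy = {}
for c in [0.01, 0.1, 1, 10, 100]:  # lower C = stronger regularisation
    accuracies = []
    for i in range(10):  # repeat with different random train/test splits
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=i)
        sc = StandardScaler().fit(X_train)
        model = LogisticRegression(C=c)
        model.fit(sc.transform(X_train), y_train)
        accuracies.append(model.score(sc.transform(X_test), y_test))
    mean_test_accuracy[c] = np.mean(accuracies)
print(mean_test_accuracy)
```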

# 67. Machine learning: Adding standard diagnostic performance metrics to a ml diagnosis model

Machine learning diagnostic performance measures:
accuracy = 0.937
sensitivity = 0.933
specificity = 0.943
positive_likelihood = 16.489
negative_likelihood = 0.071
false_positive_rate = 0.057
false_negative_rate = 0.067
positive_predictive_value = 0.966
negative_predictive_value = 0.893
precision = 0.966
recall = 0.933
f1 = 0.949