# 120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently a useful distribution for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew, which frequently occurs when some instances of a clinical process step involve additional complexity compared with the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation calculated from the original non-logged population into the mu and sigma parameters of the underlying distribution (the mean and standard deviation of the logged values).

(For maximum computational efficiency, when calling the function repeatedly with the same mean and standard deviation, you may wish to split this into two functions: one to calculate mu and sigma, which needs calling only once, and the other to sample from the log normal distribution given mu and sigma.)

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html
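As suggested above, the mu/sigma conversion can be split from the sampling so the conversion is computed only once. Below is a minimal sketch of that two-function version (the function names here are illustrative, not part of the code that follows):

```python
import numpy as np

def lognormal_mu_sigma(mean, stdev):
    """Convert mean and standard deviation of the non-logged population
    into the mu and sigma of the lognormal distribution (call once)."""
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    return mu, sigma

def sample_lognormal(mu, sigma, n=1):
    """Draw n samples given pre-computed mu and sigma (call repeatedly)."""
    return np.random.lognormal(mu, sigma, n)

mu, sigma = lognormal_mu_sigma(10, 10)         # conversion done once
samples = sample_lognormal(mu, sigma, 100000)  # sampling called as needed
```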

``````
import numpy as np

def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calculated from the original non-logged population.

    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations described at:
    https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

    Returns a numpy array of floats if n > 1, otherwise returns a float
    """

    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5

    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma, n)

    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop

    return generated_pop
``````

## Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and test the resulting generated population has the same mean and standard deviation.

``````
mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print('Mean:', generated_pop.mean())
print('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651
``````

Plot a histogram of the generated population:

``````
import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0, 51, 1)
plt.hist(generated_pop, bins=bins)
plt.show()
``````

## Generating a single sample

The function will return a single number if no `n` is given in the function call:

``````
print(generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125
``````

# 119: Optimising scikit-learn machine learning models with grid search or randomized search

Machine learning models have many hyper-parameters (parameters set before a model is fitted, and which remain constant throughout model fitting). Optimising model hyper-parameters may involve many model runs with alternative hyper-parameters. In SciKit-Learn, this may be performed in an automated fashion using Grid Search (which explores all combinations of provided hyper-parameters) or Randomized Search (which randomly selects combinations to test).

Grid search and randomized search will perform this optimisation using k-fold validation which avoids potential bias in training/test splits.
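As a short illustration of k-fold validation on its own, here is a sketch using scikit-learn's `cross_val_score` helper (this helper is not used in the full example below, and the model settings here are illustrative):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the breast cancer data set used in the example below
data_set = datasets.load_breast_cancer()
X, y = data_set.data, data_set.target

# Score the model with 5-fold cross-validation: the data is split into
# five 80:20 train/test folds, and each sample appears in a test set once
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print('Mean f1 across folds:', scores.mean())
```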

Here we will revisit a previous example of machine learning, using Random Forests to predict whether a person has breast cancer. We will then use Grid Search to optimise performance, using the ‘f1’ performance score (https://en.wikipedia.org/wiki/F1_score) as an accuracy score that balances the importance of false negatives and false positives.

First we will look at how we previously built the Random Forests model.

(See https://pythonhealthcare.org/2018/04/17/72-machine-learning-random-forests/ for previous post on Random Forest method)

``````
# import required modules

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

def calculate_diagnostic_performance(actual_predicted):
    """Calculate diagnostic performance.

    Takes a Numpy array of ones and zeros, two columns: actual and predicted

    Note that some statistics are repeats with different names
    (precision = positive_predictive_value and recall = sensitivity).
    Both names are returned.

    Returns a dictionary of results:

    1) accuracy: proportion of test results that are correct
    2) sensitivity: proportion of true +ve identified
    3) specificity: proportion of true -ve identified
    4) positive likelihood: increased probability of true +ve if test +ve
    5) negative likelihood: reduced probability of true +ve if test -ve
    6) false positive rate: proportion of false +ves in true -ve patients
    7) false negative rate: proportion of false -ves in true +ve patients
    8) positive predictive value: chance of true +ve if test +ve
    9) negative predictive value: chance of true -ve if test -ve
    10) precision = positive predictive value
    11) recall = sensitivity
    12) f1 = (2 * precision * recall) / (precision + recall)
    13) positive rate = rate of true +ve (not strictly a performance measure)
    """
    # Calculate results
    actual_positives = actual_predicted[:, 0] == 1
    actual_negatives = actual_predicted[:, 0] == 0
    test_positives = actual_predicted[:, 1] == 1
    test_negatives = actual_predicted[:, 1] == 0
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)
    true_positives = actual_positives & test_positives
    true_negatives = actual_negatives & test_negatives
    sensitivity = np.sum(true_positives) / np.sum(actual_positives)
    specificity = np.sum(true_negatives) / np.sum(actual_negatives)
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_predictive_value = np.sum(true_positives) / np.sum(test_positives)
    negative_predictive_value = np.sum(true_negatives) / np.sum(test_negatives)
    precision = positive_predictive_value
    recall = sensitivity
    f1 = (2 * precision * recall) / (precision + recall)
    positive_rate = np.mean(actual_predicted[:, 1])

    performance = {}
    performance['accuracy'] = accuracy
    performance['sensitivity'] = sensitivity
    performance['specificity'] = specificity
    performance['positive_likelihood'] = positive_likelihood
    performance['negative_likelihood'] = negative_likelihood
    performance['false_positive_rate'] = false_positive_rate
    performance['false_negative_rate'] = false_negative_rate
    performance['positive_predictive_value'] = positive_predictive_value
    performance['negative_predictive_value'] = negative_predictive_value
    performance['precision'] = precision
    performance['recall'] = recall
    performance['f1'] = f1
    performance['positive_rate'] = positive_rate

    return performance

def load_data():
    """Load the data set. Here we load the Breast Cancer Wisconsin (Diagnostic)
    Data Set. Data could be loaded from other sources though the structure
    should be compatible with this data set, that is an object with the
    following attributes:
        .data (holds feature data)
        .feature_names (holds feature titles)
        .target_names (holds outcome classification names)
        .target (holds classification as zero-based number)
        .DESCR (holds text-based description of data set)"""

    data_set = datasets.load_breast_cancer()
    return data_set

def normalise(X_train, X_test):
    """Normalise X data, so that training set has mean of zero and standard
    deviation of one"""

    # Initialise a new scaling object for normalising input data
    sc = StandardScaler()
    # Set up the scaler just on the training set
    sc.fit(X_train)
    # Apply the scaler to the training and test sets
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    return X_train_std, X_test_std

def print_diagnostic_results(performance):
    """Iterate through, and print, the performance metrics dictionary"""

    print('\nMachine learning diagnostic performance measures:')
    print('-------------------------------------------------')
    for key, value in performance.items():
        print(key, '= %0.3f' % value)  # print 3 decimal places
    return

def print_feature_importances(model, features):
    print()
    print('Feature importances:')
    print('--------------------')
    df = pd.DataFrame()
    df['feature'] = features
    df['importance'] = model.feature_importances_
    df = df.sort_values('importance', ascending=False)
    print(df)
    return

def split_data(data_set, split=0.25):
    """Extract X and y data from data_set object, and split into training and
    test data. Split defaults to 75% training, 25% test if no other value
    passed to function"""

    X = data_set.data
    y = data_set.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=split, random_state=0)
    return X_train, X_test, y_train, y_test

def test_model(model, X, y):
    """Return predicted y given X (attributes)"""

    y_pred = model.predict(X)
    test_results = np.vstack((y, y_pred)).T
    return test_results

def train_model(X, y):
    """Train the model. Note n_jobs=-1 uses all cores on a computer"""
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_jobs=-1)
    model.fit(X, y)
    return model

###### Main code #######

# Load data
data_set = load_data()

# Split data into training and test sets
X_train, X_test, y_train, y_test = split_data(data_set, 0.25)

# Normalise data (not needed for Random Forests)
# X_train_std, X_test_std = normalise(X_train, X_test)

# Train model
model = train_model(X_train, y_train)

# Produce results for test set
test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)
``````
``````
Out:

Machine learning diagnostic performance measures:

accuracy = 0.951
sensitivity = 0.944
specificity = 0.962
positive_likelihood = 25.028
negative_likelihood = 0.058
false_positive_rate = 0.038
false_negative_rate = 0.056
positive_predictive_value = 0.977
negative_predictive_value = 0.911
precision = 0.977
recall = 0.944
f1 = 0.960
positive_rate = 0.608
``````

# Optimise with grid search

NOTE: Grid search may take considerable time to run!

Grid search enables us to perform an exhaustive search of hyper-parameters (those model parameters that are constant in any one model). We define which hyper-parameters we wish to change, and what values we wish to try; all combinations are tested. Tests are performed using k-fold validation, which re-runs the model with different train/test splits (this avoids bias in our train/test split, but does increase the time required). You may wish to time a small grid search first, so you have a better idea of how many parameter combinations you can realistically explore.

We pass four arguments to the grid search method:

1) The range of values for the hyper-parameters, defined in a dictionary

2) The machine learning model to use

3) The number of k-fold splits to use (`cv`); a value of 5 will give five 80:20 training/test splits, with each sample being present in the test set once

4) The accuracy score to use. In a classification model `'accuracy'` is common. For a regression model `scoring='neg_mean_squared_error'` is common (for grid search an accuracy score must be a ‘utility function’ rather than a ‘cost function’; that is, higher values are better)

If the best model uses a value at one extreme of the provided hyper-parameter ranges then it is best to expand the range of that hyper-parameter to be sure an optimum has been found.
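One way to sketch that check in code is with a small hypothetical helper (the helper name is illustrative; `param_grid` and `best_params_` follow the scikit-learn names used elsewhere in this post):

```python
# Hypothetical helper: report any numeric best parameter sitting at the
# edge of its searched range, suggesting the range should be extended.
def params_at_search_edge(best_params, param_grid):
    at_edge = []
    for name, values in param_grid.items():
        # Only numeric (non-boolean) ranges can meaningfully be extended
        numeric = sorted(v for v in values
                         if isinstance(v, (int, float))
                         and not isinstance(v, bool))
        if len(numeric) < 2:
            continue
        if best_params[name] in (numeric[0], numeric[-1]):
            at_edge.append(name)
    return at_edge

param_grid = {'n_estimators': [10, 30, 100, 300],
              'min_samples_split': [2, 4, 6, 8, 10]}
best_params = {'n_estimators': 300, 'min_samples_split': 6}
print(params_at_search_edge(best_params, param_grid))  # ['n_estimators']
```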

An alternative approach is randomised hyper-parameter searching. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

``````
# Use Grid search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': [2, 4, 6, 8, 10],
              'n_jobs': [-1]}

# Grid search will use k-fold cross-validation (cv is the number of splits)
# Grid search also needs a utility function (higher is better) rather than
# a cost function (lower is better), so here we use 'accuracy'

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
grid_search = GridSearchCV(forest_grid, param_grid, cv=10,
                           scoring='accuracy')

grid_search.fit(X_train, y_train); # ';' suppresses printed output
``````

Show optimised model hyper-parameters:

``````
# show best parameters
# If best parameters are at the extremes of the searches then extend the range

grid_search.best_params_

Out:

{'bootstrap': True, 'min_samples_split': 6, 'n_estimators': 30, 'n_jobs': -1}
``````
``````
# Or, show full description
grid_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=6,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
``````

Now we will use the optimised model. We could retype the hyper-parameters shown in the output of `grid_search.best_estimator_` above, or we can use `grid_search.best_estimator_` directly.

``````
# Use optimised model
model = grid_search.best_estimator_
model.fit(X_train, y_train);
``````

Test optimised model:

``````
test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.972
sensitivity = 0.967
specificity = 0.981
positive_likelihood = 51.233
negative_likelihood = 0.034
false_positive_rate = 0.019
false_negative_rate = 0.033
positive_predictive_value = 0.989
negative_predictive_value = 0.945
precision = 0.989
recall = 0.967
f1 = 0.978
positive_rate = 0.615
``````

Our accuracy has now increased from 95.1% to 97.2%.

When the number of parameter combinations becomes unreasonably large for grid search, an alternative is to use randomized search, which will select parameter combinations randomly from the ranges given. The number of combinations tried is given by the argument `n_iter`.

Below is an example where we expand the number of hyper-parameters varied (becoming too large for grid search) and use randomized search to test 50 randomly selected combinations.

``````
# Use randomized search to optimise
# n_jobs is set to -1 to use all cores on the CPU

from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': [10, 30, 100, 300, 1000, 3000],
              'bootstrap': [True, False],
              'min_samples_split': range(2, 11),
              'max_depth': range(1, 30),
              'n_jobs': [-1]}

n_iter_search = 50

from sklearn.ensemble import RandomForestClassifier
forest_grid = RandomForestClassifier()
random_search = RandomizedSearchCV(forest_grid, param_grid, cv=10,
                                   n_iter=n_iter_search, scoring='accuracy')

random_search.fit(X_train, y_train); # ';' suppresses printed output
``````
``````
# show best parameters
# If best parameters are at the extremes of the searches then extend the range

random_search.best_params_

Out:

{'n_jobs': -1,
 'n_estimators': 100,
 'min_samples_split': 2,
 'max_depth': 29,
 'bootstrap': False}
``````
``````
# Or, show full description
random_search.best_estimator_

Out:

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=29, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
``````

Now we will train a model with the optimized model hyper-parameters, and test against the test set.

``````
# Use optimised model
model = random_search.best_estimator_
model.fit(X_train, y_train);

test_results = test_model(model, X_test, y_test)

# Measure performance of test set predictions
performance = calculate_diagnostic_performance(test_results)

# Print performance metrics
print_diagnostic_results(performance)

Out:

Machine learning diagnostic performance measures:

accuracy = 0.986
sensitivity = 0.989
specificity = 0.981
positive_likelihood = 52.411
negative_likelihood = 0.011
false_positive_rate = 0.019
false_negative_rate = 0.011
positive_predictive_value = 0.989
negative_predictive_value = 0.981
precision = 0.989
recall = 0.989
f1 = 0.989
positive_rate = 0.629
``````

So, although random search does not explore all combinations, because it allows us to expand the number of hyper-parameters explored compared with grid search, we have increased our accuracy to 98.6%.

# 117. Genetic Algorithms 2 – a multiple objective genetic algorithm (NSGA-II)

Note: As there is quite a substantial amount of code in this post, you may also copy the code as a single block from here.

If you have not looked at our description of a simpler genetic algorithm, with a single objective, then we advise you to look at that first (here). The example here will assume some basic understanding of genetic algorithms.

In our previous example of a genetic algorithm, we looked at a genetic algorithm that optimised a single parameter. But what if we have two or more objectives to optimise?

An example in healthcare modelling is deciding which hospitals should provide a specialist service in order to 1) minimise travel time for patients who need an emergency hospital admission, while 2) ensuring each hospital has sufficient admissions to maintain expertise and a 24/7 service, and 3) ensuring no hospital is overloaded with too many admissions.

One option when considering multiple objectives is to combine different objectives into a single number. This requires standardising the values in some way and giving them weights for their importance. This can simplify the optimisation problem but may be sensitive to how individual objectives are weighted (if following this approach it will probably be sensible to re-run the algorithm using alternative weighting schemes).
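A minimal sketch of that weighted-sum approach (the function and weights are illustrative assumptions; it presumes all objectives are to be maximised, each objective varies across the population, and scores are held in a NumPy array with one row per solution and one column per objective):

```python
import numpy as np

def weighted_sum_scores(scores, weights):
    """Normalise each objective to the range 0-1 across the population,
    then combine the objectives into one number using supplied weights."""
    score_range = scores.max(axis=0) - scores.min(axis=0)
    normalised = (scores - scores.min(axis=0)) / score_range
    return normalised @ np.array(weights)

# Three solutions scored on two objectives (higher is better for both)
scores = np.array([[10.0, 2.0],
                   [6.0, 6.0],
                   [2.0, 10.0]])

# Weighting objective 1 more heavily favours the first solution; a
# different weighting scheme could pick a different 'best' solution
combined = weighted_sum_scores(scores, weights=[0.7, 0.3])
print(combined)  # [0.7, 0.5, 0.3]
```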

## Pareto fronts

The approach we will describe here takes an alternative approach and seeks to explicitly investigate the trade-off between different objectives. When one objective cannot be improved without the worsening of another objective we are on what is known as the ‘Pareto front’. That is easy to visualise with two objectives, but Pareto fronts exist across any number of objectives – we are on the Pareto front when we cannot improve one objective without necessarily worsening at least one other objective (as shown by the red line in the figure below).

## NSGA-II

The algorithm implemented here is based on an algorithm called ‘NSGA-II’. This was published by Deb et al.

Deb et al. (2002) A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation. 6, 182-197.

With NSGA-II we maintain a population of a given size (in the original paper that is fixed; in our implementation we define a range – the population must be larger than a minimum size, but not exceed a given maximum size).

The key process steps of NSGA-II are:

1) Start with a random population of solutions (P), encoded in binary form (‘chromosomes’)

2) Create a child population (Q) through crossover and mutation

3) Combine P & Q and score for all objectives

4) Identify the first Pareto front (F1); that is, all solutions where there are no other solutions that are at least equally good in all objectives and better in at least one objective

5) If F1 is larger than the maximum permitted population size then reduce the size of F1 by ‘crowding selection’ (see below)

6) If F1 is smaller than the required population size then repeat Pareto selection (after removal of the solutions already selected). This new set of solutions is F2. If the total number of selected solutions is greater than the permitted maximum population size then reduce the size of just the latest selection (in this case F2) so that the number of all selected solutions equals the maximum permitted population size

7) Repeat Pareto selection until the required population size is reached (reducing the last selected Pareto front by ‘crowding selection’ as required to avoid exceeding the maximum permitted population size)

8) The selected Pareto fronts form the new population, P

9) Repeat from (2) for the required number of generations, or until some other ‘stop’ criterion is reached

10) Perform a final Pareto selection so that the final reported population is just those solutions on the first Pareto front

A minimum population size is required to maintain diversity of solutions. If we only selected the first Pareto front we may have a very small population which would lack the diversity required to reach good final solutions.

This process is shown diagrammatically below (from Deb et al., 2002):

## Crowding distances/selection

When we are using a Pareto front to select solutions (as described here), all solutions are on the optimal front; that is, for each solution there is no other solution that is at least as good in all scores and better in at least one score. We therefore cannot rank solutions by performance. If we need to control the number of solutions we are generating, we can instead select using ‘crowding distances’. Crowding distances give a measure of closeness in performance to other solutions; the crowding distance of a solution is the average distance to its two neighbouring solutions. We bias the selection of solutions towards those with greater distances to neighbouring solutions.

We will use a ‘tournament’ method to select between two individual solutions. The crowding distances are calculated for all solutions. Two solutions are picked at random and the one with the greater crowding distance is selected (with the unselected solution being returned to the pool). This process is repeated until the required number of solutions has been selected.

An example is shown below – by using crowding distances we have reduced the number of points in densely packed regions by more than in the sparsely populated areas.

# The Code

We will break the code down into sections and functions.

To enable easier visualisation we will look to identify a Pareto population based on two competing objectives. But all the code will work on as many objectives as you wish (though it is not recommended to try to optimise more than five objectives).

## Import required libraries

``````
import random as rn
import numpy as np
import matplotlib.pyplot as plt
# For use in Jupyter notebooks only:
%matplotlib inline
``````

## Create reference solutions

In this example we will pick the easiest example possible. We will create two reference chromosomes as ‘ideal solutions’. This function may be used to create more references to mimic optimising against more than two objectives.

We will pick a chromosome length of 25 (this mimics solutions that would be coded as 25 binary ‘genes’).

In real life this binary representation would lead to better or worse solutions. We might, for example, use the binary notation to denote open/closed hospitals in a location problem. Or any number of binary digits might be used to encode the value of a variable.

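For illustration, here is a hypothetical decoding of part of a chromosome into a numeric parameter value (this encoding scheme is an assumption for the sketch; it is not used in the toy example below):

```python
import numpy as np

# Hypothetical decoder: interpret a run of genes as an unsigned binary
# integer and map it linearly onto a parameter range [low, high]
def decode_genes_to_value(chromosome, start, length, low, high):
    bits = chromosome[start:start + length].astype(int)
    integer = int(''.join(bits.astype(str)), 2)  # binary digits to integer
    max_integer = 2 ** length - 1
    return low + (high - low) * integer / max_integer

chromosome = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
# First 8 genes are 10110010 (binary) = 178, scaled onto 0-10
value = decode_genes_to_value(chromosome, 0, 8, 0.0, 10.0)
print(value)
```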

``````
def create_reference_solutions(chromosome_length, solutions):
    """
    Function to create reference chromosomes that will mimic an ideal solution
    """
    references = np.zeros((solutions, chromosome_length))
    number_of_ones = int(chromosome_length / 2)

    for solution in range(solutions):
        # Build an array with an equal mix of zero and ones
        reference = np.zeros(chromosome_length)
        reference[0: number_of_ones] = 1

        # Shuffle the array to mix the zeros and ones
        np.random.shuffle(reference)
        references[solution, :] = reference

    return references
``````

Show an example set of reference solutions:

``````
print(create_reference_solutions(25, 2))

Out:

[[0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1.
  0.]
 [0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1.
  0.]]
``````

## Evaluating solutions

Normally this section of the code would be more complex – it decodes the chromosome and applies it to a problem, and assesses the performance. For example we might use the chromosome to store whether hospitals are open/closed and then a piece of code will test travel distances for all patients to the open hospitals, and will also calculate the number of admissions to each hospital.

In our ‘toy’ example here we will simply calculate how many binary digits in each solution are the same as our two reference solutions (so we will have two scores for each solution).

We have two functions. The first will calculate the score (‘fitness’) against a single reference. The second will loop through all references (equivalent to all objectives) and call for the score/fitness to be calculated.

``````
def calculate_fitness(reference, population):
    """
    Calculate how many binary digits in each solution are the same as our
    reference solution.
    """
    # Create an array of True/False compared to reference
    identical_to_reference = population == reference
    # Sum number of genes that are identical to the reference
    fitness_scores = identical_to_reference.sum(axis=1)

    return fitness_scores
``````
``````
def score_population(population, references):
    """
    Loop through all reference solutions and request score/fitness of
    population against that reference solution.
    """
    scores = np.zeros((population.shape[0], references.shape[0]))
    for i, reference in enumerate(references):
        scores[:, i] = calculate_fitness(reference, population)

    return scores
``````

## Calculate crowding and select a population based on crowding scores

We have two functions here. The first to calculate the crowding of a population, based on similarity of scores. The second is a Tournament selection method that uses those crowding scores to pick a given number of solutions from a population.

``````
def calculate_crowding(scores):
    """
    Crowding is based on a vector for each individual.
    All scores are normalised between low and high. For any one score, all
    solutions are sorted in order low to high. Crowding for chromosome x
    for that score is the difference between the next highest and next
    lowest score. Total crowding value sums all crowding for all scores.
    """

    population_size = len(scores[:, 0])
    number_of_scores = len(scores[0, :])

    # Create crowding matrix of population (row) and score (column)
    crowding_matrix = np.zeros((population_size, number_of_scores))

    # Normalise scores (ptp is max-min)
    normed_scores = (scores - scores.min(0)) / scores.ptp(0)

    # Calculate crowding distance for each score in turn
    for col in range(number_of_scores):
        crowding = np.zeros(population_size)

        # End points have maximum crowding
        crowding[0] = 1
        crowding[population_size - 1] = 1

        # Sort each score (to calculate crowding between adjacent scores)
        sorted_scores = np.sort(normed_scores[:, col])

        sorted_scores_index = np.argsort(normed_scores[:, col])

        # Calculate crowding distance for each individual
        crowding[1:population_size - 1] = \
            (sorted_scores[2:population_size] -
             sorted_scores[0:population_size - 2])

        # Resort to original order (two steps)
        re_sort_order = np.argsort(sorted_scores_index)
        sorted_crowding = crowding[re_sort_order]

        # Record crowding distances
        crowding_matrix[:, col] = sorted_crowding

    # Sum crowding distances of each score
    crowding_distances = np.sum(crowding_matrix, axis=1)

    return crowding_distances
``````
``````
def reduce_by_crowding(scores, number_to_select):
    """
    This function selects a number of solutions based on tournament of
    crowding distances. Two members of the population are picked at
    random. The one with the higher crowding distance is always picked.
    """
    population_ids = np.arange(scores.shape[0])

    crowding_distances = calculate_crowding(scores)

    picked_population_ids = np.zeros((number_to_select))

    picked_scores = np.zeros((number_to_select, len(scores[0, :])))

    for i in range(number_to_select):

        population_size = population_ids.shape[0]

        fighter1ID = rn.randint(0, population_size - 1)

        fighter2ID = rn.randint(0, population_size - 1)

        # If fighter # 1 is better
        if crowding_distances[fighter1ID] >= crowding_distances[fighter2ID]:

            # Add solution to picked solutions array
            picked_population_ids[i] = population_ids[fighter1ID]

            # Add score to picked scores array
            picked_scores[i, :] = scores[fighter1ID, :]

            # Remove selected solution from available solutions
            population_ids = np.delete(population_ids, (fighter1ID),
                                       axis=0)

            scores = np.delete(scores, (fighter1ID), axis=0)

            crowding_distances = np.delete(crowding_distances, (fighter1ID),
                                           axis=0)
        else:
            picked_population_ids[i] = population_ids[fighter2ID]

            picked_scores[i, :] = scores[fighter2ID, :]

            population_ids = np.delete(population_ids, (fighter2ID), axis=0)

            scores = np.delete(scores, (fighter2ID), axis=0)

            crowding_distances = np.delete(
                crowding_distances, (fighter2ID), axis=0)

    # Convert to integer
    picked_population_ids = np.asarray(picked_population_ids, dtype=int)

    return picked_population_ids
``````

## Pareto selection

Pareto selection involves two functions. The first will select a single Pareto front. The second will, as necessary, repeat Pareto front selection to build a population within defined size limits, and, as necessary, will reduce a Pareto front by applying crowding selection.

``````
def identify_pareto(scores, population_ids):
    """
    Identifies a single Pareto front, and returns the population IDs of
    the selected solutions.
    """
    population_size = scores.shape[0]
    # Create a starting list of items on the Pareto front
    # All items start off as being labelled as on the Pareto front
    pareto_front = np.ones(population_size, dtype=bool)
    # Loop through each item. This will then be compared with all other items
    for i in range(population_size):
        # Loop through all other items
        for j in range(population_size):
            # Check if our 'i' point is dominated by our 'j' point
            if all(scores[j] >= scores[i]) and any(scores[j] > scores[i]):
                # j dominates i. Label 'i' point as not on Pareto front
                pareto_front[i] = 0
                # Stop further comparisons with 'i' (no more comparisons needed)
                break
    # Return ids of scenarios on Pareto front
    return population_ids[pareto_front]
``````
``````
def build_pareto_population(
        population, scores, minimum_population_size, maximum_population_size):
    """
    As necessary, repeats Pareto front selection to build a population within
    defined size limits. Will reduce a Pareto front by applying crowding
    selection as necessary.
    """
    unselected_population_ids = np.arange(population.shape[0])
    all_population_ids = np.arange(population.shape[0])
    pareto_front = []
    while len(pareto_front) < minimum_population_size:
        temp_pareto_front = identify_pareto(
            scores[unselected_population_ids, :], unselected_population_ids)

        # Check size of total Pareto front.
        # If larger than maximum size reduce new Pareto front by crowding
        combined_pareto_size = len(pareto_front) + len(temp_pareto_front)
        if combined_pareto_size > maximum_population_size:
            # Number of solutions to keep from the latest front, so that the
            # total selected population equals the maximum permitted size
            number_to_select = maximum_population_size - len(pareto_front)
            selected_individuals = (reduce_by_crowding(
                scores[temp_pareto_front], number_to_select))
            temp_pareto_front = temp_pareto_front[selected_individuals]

        # Add latest Pareto front to full Pareto front
        pareto_front = np.hstack((pareto_front, temp_pareto_front))

        # Update unselected population IDs by using sets to find IDs in all
        # ids that are not in the selected front
        unselected_set = set(all_population_ids) - set(pareto_front)
        unselected_population_ids = np.array(list(unselected_set))

    population = population[pareto_front.astype(int)]
    return population
``````

## Population functions

There are four functions concerning the population:

1) Create a random population

2) Breed by crossover – two children produced from two parents

3) Randomly mutate population (small probability that any given gene in the population will switch between 1/0 – applied to the child population)

4) Breed population – create a child population by repeatedly calling the breeding function (two parents producing two children), applying genetic mutation to the child population, combining parent and child populations, and removing duplicate chromosomes

``````def create_population(individuals, chromosome_length):
    """
    Create random population with given number of individuals and chromosome
    length.
    """
    # Set up an initial array of all zeros
    population = np.zeros((individuals, chromosome_length))
    # Loop through each row (individual)
    for i in range(individuals):
        # Choose a random number of ones to create
        ones = rn.randint(0, chromosome_length)
        # Change the required number of zeros to ones
        population[i, 0:ones] = 1
        # Shuffle row
        np.random.shuffle(population[i])

    return population``````
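A quick standalone sanity check of the population generator (definition repeated so the snippet runs on its own; as elsewhere in this example, Python's `random` module is imported as `rn`):

```python
import random as rn
import numpy as np

def create_population(individuals, chromosome_length):
    # Random binary population: each row gets a random number of ones, shuffled
    population = np.zeros((individuals, chromosome_length))
    for i in range(individuals):
        ones = rn.randint(0, chromosome_length)
        population[i, 0:ones] = 1
        np.random.shuffle(population[i])
    return population

population = create_population(4, 8)
print(population.shape)  # (4, 8)
print(set(np.unique(population)) <= {0.0, 1.0})  # True — all genes are binary
```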
``````def breed_by_crossover(parent_1, parent_2):
    """
    Combine two parent chromosomes by crossover to produce two children.
    """
    # Get length of chromosome
    chromosome_length = len(parent_1)

    # Pick crossover point, avoiding ends of chromosome
    crossover_point = rn.randint(1, chromosome_length - 1)

    # Create children. np.hstack joins two arrays
    child_1 = np.hstack((parent_1[0:crossover_point],
                         parent_2[crossover_point:]))

    child_2 = np.hstack((parent_2[0:crossover_point],
                         parent_1[crossover_point:]))

    # Return children
    return child_1, child_2``````
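The crossover function can be verified with two extreme parents (definition repeated so the snippet runs on its own): every gene one child takes from the all-ones parent, the other child takes from the all-zeros parent, so the ones split exactly between the two children:

```python
import random as rn
import numpy as np

def breed_by_crossover(parent_1, parent_2):
    # Single-point crossover, avoiding the ends of the chromosome
    chromosome_length = len(parent_1)
    crossover_point = rn.randint(1, chromosome_length - 1)
    child_1 = np.hstack((parent_1[0:crossover_point],
                         parent_2[crossover_point:]))
    child_2 = np.hstack((parent_2[0:crossover_point],
                         parent_1[crossover_point:]))
    return child_1, child_2

parent_1 = np.zeros(10)
parent_2 = np.ones(10)
child_1, child_2 = breed_by_crossover(parent_1, parent_2)
print(child_1.sum() + child_2.sum())  # 10.0 — the ones are shared between children
```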
``````def randomly_mutate_population(population, mutation_probability):
    """
    Randomly mutate population with a given individual gene mutation
    probability. Individual gene may switch between 0/1.
    """
    # Apply random mutation
    random_mutation_array = np.random.random(size=(population.shape))

    random_mutation_boolean = \
        random_mutation_array <= mutation_probability

    population[random_mutation_boolean] = \
        np.logical_not(population[random_mutation_boolean])

    # Return mutated population
    return population``````
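The mutation function is easiest to check at the extreme mutation probability of 1.0, where every gene must flip (definition repeated so the snippet runs on its own):

```python
import numpy as np

def randomly_mutate_population(population, mutation_probability):
    # Flip each gene (0 <-> 1) independently with the given probability
    random_mutation_array = np.random.random(size=population.shape)
    random_mutation_boolean = random_mutation_array <= mutation_probability
    population[random_mutation_boolean] = \
        np.logical_not(population[random_mutation_boolean])
    return population

# With mutation probability 1.0 every gene must flip
population = np.zeros((3, 5))
mutated = randomly_mutate_population(population, 1.0)
print(mutated.sum())  # 15.0 — all 15 zeros became ones
```

Note that the function mutates the array in place as well as returning it.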
``````def breed_population(population):
    """
    Create child population by repeatedly calling breeding function (two
    parents producing two children), applying genetic mutation to the child
    population, combining parent and child population, and removing
    duplicate chromosomes.
    """
    # Create an empty list for new population
    new_population = []
    population_size = population.shape[0]
    # Create new population generating two children at a time
    for i in range(int(population_size / 2)):
        parent_1 = population[rn.randint(0, population_size - 1)]
        parent_2 = population[rn.randint(0, population_size - 1)]
        child_1, child_2 = breed_by_crossover(parent_1, parent_2)
        new_population.append(child_1)
        new_population.append(child_2)

    # Add the child population to the parent population
    # In this method we allow parents and children to compete to be kept
    population = np.vstack((population, np.array(new_population)))
    population = np.unique(population, axis=0)

    return population``````

## Main algorithm code

Now let’s put it all together!

``````# Set general parameters
chromosome_length = 50
starting_population_size = 5000
maximum_generation = 250
minimum_population_size = 500
maximum_population_size = 1000

# Create two reference solutions
# (this is used just to illustrate GAs)
references = create_reference_solutions(chromosome_length, 2)

# Create starting population
population = create_population(
    starting_population_size, chromosome_length)

# Loop through the generations of genetic algorithm

for generation in range(maximum_generation):
    if generation % 10 == 0:
        print('Generation (out of %i): %i ' % (maximum_generation, generation))

    # Breed
    population = breed_population(population)

    # Score population
    scores = score_population(population, references)

    # Build Pareto front
    population = build_pareto_population(
        population, scores, minimum_population_size, maximum_population_size)

# Get final Pareto front
scores = score_population(population, references)
population_ids = np.arange(population.shape[0]).astype(int)
pareto_front = identify_pareto(scores, population_ids)
population = population[pareto_front, :]
scores = scores[pareto_front]``````

## Plot final Pareto front

This is the only part of the code that works only for two objectives.

You will see in this example that we have a well-described trade-off between the two objectives. With the population sizes and number of generations used, we achieved 96% of the maximum score on each objective. A larger population and/or more generations would get us to 100% on this quite simple example, though Pareto algorithms should not be expected to find the perfect solution for all objectives; rather, they should identify good solutions.

``````# Plot Pareto front (for two scores only)
x = scores[:, 0]/chromosome_length*100
y = scores[:, 1]/chromosome_length*100
plt.xlabel('Objective A - % maximum obtainable')
plt.ylabel('Objective B - % maximum obtainable')

plt.scatter(x,y)
plt.savefig('pareto.png')
plt.show()``````

# 116. Random Forests regression

This tutorial provides an alternative regression method to the linear/multiple regression previously described at:

https://pythonhealthcare.org/2018/06/14/86-linear-regression-and-multiple-linear-regression/

Random Forests regression may provide a better predictor than multiple linear regression when the relationship between features (X) and dependent variable (y) is complex.

In regression we seek to predict the value of a continuous variable based on either a single variable, or a set of variables.

The example we will look at below seeks to predict life span based on weight, height, physical activity, BMI, gender, and whether the person has a history of smoking.
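Before working with the real file, the overall workflow can be sketched on synthetic data. Note that, unlike the tutorial below (which reports in-sample error), this sketch holds out a test set so the error estimate is out-of-sample. The feature array, target function and all names here are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a target driven by a non-linear mix of three features
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 3))                        # three made-up features
y = 50 + 20 * X[:, 0] * X[:, 1] - 10 * X[:, 2] ** 2   # non-linear target

# Hold out a test set so the error estimate is out-of-sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predicted))
print(round(rmse, 2))
```

Because the relationship is non-linear, a linear regression would struggle here, while the forest captures the feature interaction directly.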

## Load common libraries and data

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filename = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress/raw/master/jupyter_notebooks/life_expectancy.csv'
df = pd.read_csv(filename)
print(df.head())

Out:

   weight  smoker  physical_activity_scale  BMI  height  male  life_expectancy
0      51       1                        6   22     152     1               57
1      83       1                        5   34     156     1               36
2      78       1                       10   18     208     0               78
3     106       1                        3   28     194     0               49
4      92       1                        7   23     200     0               67```

## Fit model

```# Extract features (X) and target life expectancy (y)

X = df.values[:, :-1]
y = df.values[:, -1]

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, y)

Out:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)```

## Predict values, calculate error, and show predicted vs. actual

```# Predict values

predicted = model.predict(X)

# Show mean squared error

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, predicted)
rmse = np.sqrt(mse)
print (rmse)

Out:
1.4628948576409964

# Plot actual vs predicted

plt.scatter(y, predicted, alpha=0.5)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()```

# 113: Regression analysis with TensorFlow

This code comes from the TensorFlow tutorial here, with minor modifications (such as the addition of regularization to avoid over-fitting).

In a regression problem, we aim to predict the output of a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to predict a discrete label (for example, whether a picture contains an apple or an orange).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, we’ll provide the model with a description of many models from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.

``````
# If needed install seaborn (conda install seaborn or pip install seaborn)

import pathlib
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers

###############################################################################
###############################################################################

# Load data from web and save locally
dataset_path = keras.utils.get_file("auto-mpg.data",
    "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values="?", comment='\t',
                          sep=" ", skipinitialspace=True)

raw_dataset.to_csv('mpg.csv', index=False)

###############################################################################
############################## CLEAN DATA #####################################
###############################################################################

# Work from a copy of the raw data
data = raw_dataset.copy()

# Dataset contains some missing data (see by using print(data.isna().sum()))
# Drop rows with missing data
data = data.dropna()

# The "Origin" column is really categorical, not numeric.
# So convert that to a one-hot:

origin = data.pop('Origin')
data['USA'] = (origin == 1)*1.0
data['Europe'] = (origin == 2)*1.0
data['Japan'] = (origin == 3)*1.0

###############################################################################
########################## SPLIT INTO TRAIN / TEST ###########################
###############################################################################

train_dataset = data.sample(frac=0.8,random_state=0)
test_dataset = data.drop(train_dataset.index)

###############################################################################
############################# EXAMINE DATA ####################################
###############################################################################

# Have a quick look at the joint distribution of a few pairs of columns from
# the training set.

g = sns.pairplot(
    train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]],
    diag_kind="kde")

fig = g.fig  # access the underlying matplotlib figure
fig.show()

# Look at overall stats
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
print (train_stats)

###############################################################################
###################### SPLIT FEATURES FROM LABELS #############################
###############################################################################

train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

###############################################################################
########################### NORMALISE THE DATA ################################
###############################################################################

# Normalise using the mean and standard deviation from the training set

def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

###############################################################################
############################### BUILD MODEL ###################################
###############################################################################

# Here, we'll use a Sequential model with two densely connected hidden layers,
# and an output layer that returns a single, continuous value. Regularisation
# helps prevent over-fitting (try adjusting the values; higher numbers = more
# regularisation. Regularisation may be type l1 or l2.)

def build_model():
    model = keras.Sequential([
        layers.Dense(64, kernel_regularizer=keras.regularizers.l1(0.01),
                     activation=tf.nn.relu,
                     input_shape=[len(train_dataset.keys())]),

        keras.layers.Dense(64, kernel_regularizer=keras.regularizers.l1(0.01),
                           activation=tf.nn.relu),

        keras.layers.Dense(1)])

    optimizer = tf.train.RMSPropOptimizer(0.001)

    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model

model = build_model()

# Print a summary of the model

print (model.summary())

###############################################################################
############################### TRAIN MODEL ###################################
###############################################################################

# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0: print('')
        print('.', end='')

EPOCHS = 1000

history = model.fit(
    normed_train_data, train_labels,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[PrintDot()])

# Show last few epochs in history
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
print(hist.tail())

###############################################################################
############################### PLOT TRAINING #################################
###############################################################################

def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [MPG]')
    plt.plot(hist['epoch'], hist['mean_absolute_error'],
             label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
             label='Val Error')
    plt.legend()
    plt.ylim([0, 5])

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Square Error [$MPG^2$]')
    plt.plot(hist['epoch'], hist['mean_squared_error'],
             label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_squared_error'],
             label='Val Error')
    plt.legend()
    plt.ylim([0, 20])
    plt.show()

plot_history(history)

###############################################################################
############################# MAKE PREDICTIONS ################################
###############################################################################

# Make predictions from test-set

test_predictions = model.predict(normed_test_data).flatten()

# Scatter plot of predicted vs. true values
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100])
plt.show()

# Error plot
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
plt.show()

# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
``````

# 111. Using ‘pop’ to remove a Pandas DataFrame column and transfer to a new variable

Sometimes we may want to remove a column from a DataFrame, but at the same time transfer that column to a new variable to perform some work on it. An example is re-coding a column as shown below, where we convert a text male/female column into a numeric 0/1 ‘male’ column.

## Create a DataFrame

``````import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height

print(people)

Out:

name  age gender  height
0   Sam   22      f    1.64
1  Bill   34      m    1.85
2   Bob   18      m    1.70
3   Ian   34      m    1.75
4    Jo   76      f    1.63
5  Anne   54      f    1.79
6  Carl   21      m    1.70
7  Toni    8      f    1.68``````

## Pop a column (to code differently)

``````# Pop column
people_gender = people.pop('gender')  # extracts and removes gender

# Recode (== gives True/False, and in Python True/False also have the
# numerical values 1/0)
male = (people_gender == 'm') * 1  # 'm' gives True, which converts to 1

# Put new column into DataFrame and print
people['male'] = male
print(people)

Out:

name  age  height  male
0   Sam   22    1.64     0
1  Bill   34    1.85     1
2   Bob   18    1.70     1
3   Ian   34    1.75     1
4    Jo   76    1.63     0
5  Anne   54    1.79     0
6  Carl   21    1.70     1
7  Toni    8    1.68     0``````
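The same pop-and-recode pattern extends to columns with more than two categories by using `map` rather than a boolean comparison (the DataFrame and categories below are illustrative, not from the example above):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Sam', 'Bill', 'Bob'],
                   'smoker': ['never', 'current', 'former']})

# Pop the text column out of the DataFrame, then map each category to a number
smoker = df.pop('smoker')
df['smoker_code'] = smoker.map({'never': 0, 'former': 1, 'current': 2})

print(df)
```

After the `pop`, the original `smoker` column is gone from the DataFrame and only the numeric coding remains.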

# 107. Image recognition with TensorFlow

This code is based on TensorFlow’s own introductory example here, but with the addition of a ‘confusion matrix’ to better understand where mis-classification occurs.

This example will download the ‘MNIST fashion’ data set of images, a collection of 70,000 images (60,000 training and 10,000 test) of 10 different types of fashion item.

``````"""
Code, apart from the confusion matrix, taken from:
https://www.tensorflow.org/tutorials/keras/basic_classification#
"""

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import itertools``````

## Load MNIST fashion data set

If this is the first time you have used the data set it will automatically be downloaded from the internet. The data set loads as training and test images and labels.

``````# Load MNIST fashion data set
"""MNIST fashion data set is a collection of 70k images of 10 different
fashion items. It is loaded as training and test images and labels (60k
training images and 10k test images).

0 	T-shirt/top
1 	Trouser
2 	Pullover
3 	Dress
4 	Coat
5 	Sandal
6 	Shirt
7 	Sneaker
8 	Bag
9 	Ankle boot
"""
(train_images, train_labels), (test_images, test_labels) = \
    keras.datasets.fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']``````

Show an example image.

``````# Show an example image (the first training image, an ankle boot)
plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()``````

The image pixels currently range from 0 to 255. We will normalise to 0-1

``````# Scale images to values 0-1 (currently 0-255)
train_images = train_images / 255.0
test_images = test_images / 255.0``````

Plot the first 25 images with labels.

``````# Plot first 25 images with labels
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.show()``````

## Build the model

``````# Set up neural network layers
"""The first layer flattens the 28x28 image to a 1D array (784 pixels).
The second layer is a fully connected (dense) layer of 128 nodes/neurones.
The last layer is a 10 node softmax layer, giving probability of each class.
Softmax adjusts probabilities so that they total 1."""

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)])

# Compile the model
"""Optimizer: how model corrects itself and learns.
Loss function: how accurate the model is.
Metrics: how to monitor performance of model."""

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
``````

## Train the model

Change the number of passes to balance accuracy vs. speed.

``````# Train the model (epochs is the number of times the training data is applied)
model.fit(train_images, train_labels, epochs=5)``````

## Evaluate accuracy

``````# Evaluate accuracy
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

Out: Test accuracy: 0.8732

``````

## Make predictions and show examples

``````# Make predictions
predictions = model.predict(test_images)
print('\nClass probabilities for test image 0')
print(predictions[0])
print('\nPredicted class for test image 0:', np.argmax(predictions[0]))
print('Actual classification for test image 0:', test_labels[0])

# Plot image and predictions
def plot_image(i, predictions_array, true_label, img):
    predictions_array, true_label, img = \
        predictions_array[i], true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                         100 * np.max(predictions_array),
                                         class_names[true_label]),
               color=color)

def plot_value_array(i, predictions_array, true_label):
    predictions_array, true_label = predictions_array[i], true_label[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')

# Plot images and value graphs for selected images
# The blue bar shows the actual classification
# A red bar shows an incorrect classification
num_rows = 6
num_cols = 3
num_images = num_rows * num_cols
plt.figure(figsize=(2 * 2 * num_cols, 2 * num_rows))
for i in range(num_images):
    plt.subplot(num_rows, 2 * num_cols, 2 * i + 1)
    plot_image(i, predictions, test_labels, test_images)
    plt.subplot(num_rows, 2 * num_cols, 2 * i + 2)
    plot_value_array(i, predictions, test_labels)
plt.show()``````

## Calculate and show confusion matrix

You can see that misclassification is usually between similar types of object, such as t-shirts and shirts.

``````# SHOW CONFUSION MATRIX

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = cm * 100
        print("\nNormalized confusion matrix")
    else:
        print('\nConfusion matrix, without normalization')
    print(cm)
    print()

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.0f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

# Compute confusion matrix
y_pred = np.argmax(predictions, axis=1)
cnf_matrix = confusion_matrix(test_labels, y_pred)
np.set_printoptions(precision=2)  # set NumPy to 2 decimal places

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

Out:

Confusion matrix, without normalization
[[810   2  17  20   4   2 135   0  10   0]
[  1 975   1  17   3   0   2   0   1   0]
[ 10   0 855  12  67   0  56   0   0   0]
[ 16  19  16 858  52   1  34   0   4   0]
[  0   1 172  21 770   0  36   0   0   0]
[  0   0   0   0   0 951   0  33   0  16]
[112   3 142  27  83   0 624   0   9   0]
[  0   0   0   0   0  12   0 966   0  22]
[  2   0   7   3   6   4   7   3 968   0]
[  0   0   0   0   0   5   1  39   0 955]]

Normalized confusion matrix
[[81.   0.2  1.7  2.   0.4  0.2 13.5  0.   1.   0. ]
[ 0.1 97.5  0.1  1.7  0.3  0.   0.2  0.   0.1  0. ]
[ 1.   0.  85.5  1.2  6.7  0.   5.6  0.   0.   0. ]
[ 1.6  1.9  1.6 85.8  5.2  0.1  3.4  0.   0.4  0. ]
[ 0.   0.1 17.2  2.1 77.   0.   3.6  0.   0.   0. ]
[ 0.   0.   0.   0.   0.  95.1  0.   3.3  0.   1.6]
[11.2  0.3 14.2  2.7  8.3  0.  62.4  0.   0.9  0. ]
[ 0.   0.   0.   0.   0.   1.2  0.  96.6  0.   2.2]
[ 0.2  0.   0.7  0.3  0.6  0.4  0.7  0.3 96.8  0. ]
[ 0.   0.   0.   0.   0.   0.5  0.1  3.9  0.  95.5]]
``````
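The row normalisation used above (each row of percentages summing to 100) can be reproduced on a toy example: `sklearn` computes the raw counts and NumPy broadcasting does the scaling:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [0 0 1]]

# Normalise each row to percentages (each row then sums to 100)
cm_pct = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
print(np.round(cm_pct[0], 1))  # [66.7 33.3  0. ]
```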

## Making a prediction from a single image

``````# Making a prediction of a single image
"""tf.keras models are optimized to make predictions on a batch, or collection,
of examples at once. So even though we're using a single image, we need to add
it to a list:"""

# Grab an example image
img = test_images[0]
# Add the image to a batch where it's the only member.
img = (np.expand_dims(img, 0))
# Make prediction
predictions_single = model.predict(img)
# Plot results
plot_value_array(0, predictions_single, test_labels)
_ = plt.xticks(range(10), class_names, rotation=45)
plt.show()
``````

## MIT license for TensorFlow code

``````# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.``````

# 105: Topic modelling (dividing documents into topic groups) with Gensim

Gensim is a library that can sort documents into groups. It is an ‘unsupervised’ method, meaning that documents do not need to be pre-labelled.

Here we will use gensim to group titles or keywords from PubMed scientific paper references.

Gensim is not part of the standard Anaconda Python installation, but it may be installed from the command line with:

conda install gensim

If you are not using an Anaconda installation of Python then you can install with pip:

pip install gensim

## Import libraries

``````import pandas as pd
import gensim
import nltk
from nltk.corpus import stopwords``````

Now we will load our data (the script below downloads the data from the internet and saves a local copy; a comment shows how to reload the local copy).

In this example we will use a portion of a large dataset of PubMed medical paper titles and keywords. The full data set may be downloaded from the link below (1.2GB download).

https://www.kaggle.com/hsrobo/titlebased-semantic-subject-indexing#pubmed.csv

``````## LOAD 50k DATA SET FROM INTERNET

file_location = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress' + \
                '/raw/master/jupyter_notebooks/pubmed_50k.csv'
data = pd.read_csv(file_location)

# Save to current directory
data.to_csv('pubmed_50k.csv', index=False)

# If you already have the data locally, you may load with:
# data = pd.read_csv('pubmed_50k.csv')``````

## Clean data
## Clean data

We will clean the data by applying the following steps:

• convert all text to lower case
• divide strings/sentences into individual words (‘tokenize’)
• remove non-text words
• remove ‘stop words’ (commonly occurring words that have little value in the model)

In the example here we will take the keywords (called ‘labels’) for each paper.

``````stops = set(stopwords.words("english"))

# Define function to clean text
def pre_process_text(X):
    cleaned_X = []
    for raw_text in X:
        # Convert to lower case
        text = raw_text.lower()

        # Tokenize
        tokens = nltk.word_tokenize(text)

        # Keep only words (removes punctuation + numbers)
        token_words = [w for w in tokens if w.isalpha()]

        # Remove stop words
        meaningful_words = [w for w in token_words if w not in stops]

        cleaned_X.append(meaningful_words)
    return cleaned_X

# Clean text
raw_text = list(data['labels'])
processed_text = pre_process_text(raw_text)``````

## Create topic model

The following will create our topic model. We will divide the references into 50 different topic areas. This may take a few minutes.

``````dictionary = gensim.corpora.Dictionary(processed_text)
corpus = [dictionary.doc2bow(text) for text in processed_text]
model = gensim.models.LdaModel(corpus=corpus,
                               id2word=dictionary,
                               num_topics=50,
                               passes=10)``````

## Show topics

``top_topics = model.top_topics(corpus)``

When we look at the first topic, we see that keywords are largely related to molecular biology.

``````# Print the keywords for the first topic
from pprint import pprint  # makes the output easier to read
pprint(top_topics[0])

Out:

([(0.14238602, 'sequence'),
(0.07095096, 'molecular'),
(0.068985716, 'acid'),
(0.06632707, 'dna'),
(0.052034967, 'amino'),
(0.045958135, 'data'),
(0.03165856, 'proteins'),
(0.03090581, 'base'),
(0.02128076, 'rna'),
(0.018465547, 'genetic'),
(0.01786064, 'bacterial'),
(0.017836036, 'viral'),
(0.014751778, 'animals'),
(0.013531377, 'nucleic'),
(0.013407393, 'genes'),
(0.013192138, 'analysis'),
(0.012144415, 'humans'),
(0.011643986, 'cloning'),
(0.011388958, 'phylogeny'),
(0.011024448, 'protein')],
-1.8900328727623419)``````

If we look at another topic (topic 10) we see keywords that are associated with cardiac surgical procedures.

``````pprint(top_topics[10])

Out:

([(0.056376483, 'outcome'),
(0.0558772, 'treatment'),
(0.05120605, 'humans'),
(0.03075589, 'surgical'),
(0.030704997, 'complications'),
(0.028222634, 'postoperative'),
(0.026755776, 'coronary'),
(0.02651835, 'tomography'),
(0.026010627, 'heart'),
(0.023422625, 'male'),
(0.022875749, 'computed'),
(0.021913974, 'studies'),
(0.02180667, 'procedures'),
(0.019811377, 'myocardial'),
(0.01757028, 'cardiac'),
(0.015987962, 'artery'),
(0.015796969, 'female'),
(0.013579166, 'prosthesis'),
(0.012545248, 'valve'),
(0.01180034, 'history')],
-3.1752669709698886)``````

## Show topics present in each document

Each document may contain one or more topics. The first paper is highlighted as containing topics 4, 8, 19 and 40.

``````model[corpus[0]]

Out:

[(4, 0.0967338), (8, 0.27275458), (19, 0.4578644), (40, 0.09598056)]``````

# 100: Sorting and sub-grouping dictionary items with itemgetter and groupby

Often in data science we might use Pandas to store mixed text and numbers (Pandas then allows easy sorting and grouping), but sometimes you may want to stick to pure Python and use lists of dictionary items. Sorting and sub-grouping of lists of dictionary items may be performed in Python using itemgetter (to help sort) and groupby to sub-group.

## Sorting lists of dictionary items with itemgetter

```# Set up a list of dictionary items and add content

people = []
people.append({'born':1966, 'gender':'male', 'name':'Bob'})
people.append({'born':1966, 'gender':'female', 'name':'Anne'})
people.append({'born':1970, 'gender':'male', 'name':'John'})
people.append({'born':1970, 'gender':'female', 'name':'Daisy'})
people.append({'born': 1968, 'gender':'male', 'name':'Steve'})
people.append({'born': 1966, 'gender':'male', 'name':'Adam'})

# import methods

from operator import itemgetter

# Sort by 'born' and 'name'
people.sort(key=itemgetter('born','name'))

# Print out sorted list
for item in people:
    print(item)```

Output:

```{'born': 1966, 'gender': 'male', 'name': 'Adam'}
{'born': 1966, 'gender': 'female', 'name': 'Anne'}
{'born': 1966, 'gender': 'male', 'name': 'Bob'}
{'born': 1968, 'gender': 'male', 'name': 'Steve'}
{'born': 1970, 'gender': 'female', 'name': 'Daisy'}
{'born': 1970, 'gender': 'male', 'name': 'John'}```
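
itemgetter can also be combined with sort's `reverse=True` argument for descending sorts. A minimal sketch, using a shortened version of the list above:

```python
from operator import itemgetter

people = [{'born': 1966, 'name': 'Bob'},
          {'born': 1970, 'name': 'John'},
          {'born': 1968, 'name': 'Steve'}]

# Sort descending by year of birth
people.sort(key=itemgetter('born'), reverse=True)

print([person['name'] for person in people])
```

This prints `['John', 'Steve', 'Bob']`.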

## Sub-grouping the sorted list of dictionary items with groupby

Note: We must always sort our list of dictionary items in the required order before sub-grouping them.

```# Set up a list of dictionary items and add content
people = []
people.append({'born':1966, 'gender':'male', 'name':'Bob'})
people.append({'born':1966, 'gender':'female', 'name':'Anne'})
people.append({'born':1970, 'gender':'male', 'name':'John'})
people.append({'born':1970, 'gender':'female', 'name':'Daisy'})
people.append({'born': 1968, 'gender':'male', 'name':'Steve'})
people.append({'born': 1966, 'gender':'male', 'name':'Adam'})

# import methods

from operator import itemgetter
from itertools import groupby

# First sort by required field
# Groupby only finds groups that are collected consecutively
people.sort(key=itemgetter('born'))

# Now iterate through groups (here we will group by the year born)
for born, items in groupby(people, key=itemgetter('born')):
    print(born)
    for i in items:
        print(' ', i)```

Output:

```1966
{'born': 1966, 'gender': 'male', 'name': 'Bob'}
{'born': 1966, 'gender': 'female', 'name': 'Anne'}
{'born': 1966, 'gender': 'male', 'name': 'Adam'}
1968
{'born': 1968, 'gender': 'male', 'name': 'Steve'}
1970
{'born': 1970, 'gender': 'male', 'name': 'John'}
{'born': 1970, 'gender': 'female', 'name': 'Daisy'}```
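
If the groups are needed for later lookup rather than just printing, each group may be collected into a dictionary. One point to be careful of: each `items` iterator from groupby must be consumed before moving on to the next group. A minimal sketch, using a shortened version of the list above:

```python
from operator import itemgetter
from itertools import groupby

people = [{'born': 1966, 'name': 'Bob'},
          {'born': 1970, 'name': 'John'},
          {'born': 1966, 'name': 'Anne'},
          {'born': 1968, 'name': 'Steve'}]

# groupby only groups consecutive items, so sort first
people.sort(key=itemgetter('born'))

# The list comprehension consumes each `items` iterator immediately,
# before groupby moves on to the next group
groups = {born: [person['name'] for person in items]
          for born, items in groupby(people, key=itemgetter('born'))}

print(groups)
```

This prints `{1966: ['Bob', 'Anne'], 1968: ['Steve'], 1970: ['John']}` (the sort is stable, so names within a year keep their original order).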

# 95: Crowding distances: selecting solutions when too many multi-objective solutions exist

Sometimes in multi-objective algorithms we need to thin out the number of solutions we have.

When we are using a Pareto front to select solutions (as described here), all solutions are on the optimal front: for each solution there is no other solution that is at least as good on all scores and better on at least one. We therefore cannot rank solutions by performance. If we need to control the number of solutions we are generating, we can instead select using ‘crowding distances’. Crowding distances give a measure of how close a solution's performance is to that of other solutions: the crowding distance is the average distance to a solution's two neighbouring solutions.
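
A minimal sketch of calculating crowding distances with NumPy, under the assumption that solution scores are held in an array of shape (population, objectives). Each objective is normalised to a 0–1 range, the end-point solutions on each objective are assigned the maximum distance of 1, and each solution's per-objective distances are summed:

```python
import numpy as np

def crowding_distances(scores):
    """Calculate a crowding distance for each solution.
    scores: NumPy array of shape (population, objectives).
    Returns a 1D array of crowding distances summed across objectives."""
    population_size, number_of_scores = scores.shape
    crowding_matrix = np.zeros((population_size, number_of_scores))
    # Normalise each objective to a 0-1 range
    normed = (scores - scores.min(0)) / (scores.max(0) - scores.min(0))
    for col in range(number_of_scores):
        crowding = np.zeros(population_size)
        # End-point solutions get the maximum distance of 1
        crowding[0] = 1
        crowding[-1] = 1
        sorted_scores = np.sort(normed[:, col])
        sorted_index = np.argsort(normed[:, col])
        # Distance between each solution's two neighbours on this objective
        crowding[1:-1] = sorted_scores[2:] - sorted_scores[:-2]
        # Map the distances back to the original solution order
        crowding_matrix[:, col] = crowding[np.argsort(sorted_index)]
    # Sum distances across objectives
    return crowding_matrix.sum(axis=1)

# Example: five solutions spread along a simple two-objective front
scores = np.array([[1, 5], [2, 4], [3, 3], [4, 2], [5, 1]], dtype=float)
print(crowding_distances(scores))
```

In this example the two end-point solutions receive the largest crowding distances, so they would be the last to be thinned out; solutions in crowded regions of the front receive the smallest distances and are removed first.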