98: Brief examples of applying lambda functions to lists, and filtering lists with list comprehensions, map and filter

These examples are intended as a reminder of how to use list comprehensions, map and filter. They are not intended to be an exhaustive tutorial.

Let’s start with a list of numbers. We wish to:

1) Square all numbers
2) Find even numbers
3) Square all even numbers

That is where list comprehensions, map and filter come in. There is significant overlap between these methodologies, but here they are.

Define my list of numbers

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Square numbers with list comprehension

answer = [x**2 for x in my_list]
print (answer)
Out:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Map a named lambda function to square the numbers

We use a one-line lambda function here, but a full function defined with def and return could also be used.

sq = lambda x: x**2
answer = list(map(sq, my_list))
print (answer)
Out:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Map a lambda function directly to square numbers

answer = list(map(lambda x: x**2, my_list))
print (answer)
Out:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Use a lambda function in a list comprehension

sq = lambda x: x**2
answer = [sq(x) for x in my_list]
print (answer)
Out:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Filter a list with list comprehension to find even numbers

For even numbers x%2 will equal zero:

answer = [x for x in my_list if x%2 == 0]
print (answer)
Out:
[2, 4, 6, 8, 10]

Filter a list with a named lambda function

Full functions can also be used. The function (full or lambda) must return True or False.

is_even = lambda x: x%2 == 0
answer = list(filter(is_even, my_list))
print (answer)
Out:
[2, 4, 6, 8, 10]

Filter a list with a lambda function applied directly

answer = list(filter(lambda x: x%2 == 0, my_list))
print (answer)
Out:
[2, 4, 6, 8, 10]

Combine squaring and filtering with a list comprehension

List comprehensions may filter and apply a function in one line. To get the square of all even numbers in our list:

answer = [x**2 for x in my_list if x%2 == 0]
print (answer)
Out:
[4, 16, 36, 64, 100]

Or we may apply lambda (or full) functions in both the operation and the filter in a list comprehension.

sq = lambda x: x**2
is_even = lambda x: x%2 == 0
answer = [sq(x) for x in my_list if is_even(x)]
print (answer)
Out:
[4, 16, 36, 64, 100]

Combine squaring and filtering with map and filter

filter and map need to be used in two statements (or nested, as shown below) to achieve the same end. Here we will use named lambda functions, but they could be applied directly as above.

sq = lambda x: x**2
is_even = lambda x: x%2 == 0

filtered = list(filter(is_even, my_list))
answer = list(map(sq, filtered))
print (answer)
Out:
[4, 16, 36, 64, 100]
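
The two steps may also be nested in a single expression, using the named lambda functions defined above:

answer = list(map(sq, filter(is_even, my_list)))
print (answer)
Out:
[4, 16, 36, 64, 100]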

Open data travel distances and times from all England lower super output areas (LSOA) to all acute hospitals

Data here: https://gitlab.com/michaelallen1966/1811_lsoa_to_acute_hospital_travel

This data set includes travel times and distances for all 32,844 Lower Super Output Areas in England (excluding Isles of Scilly) to 185 acute hospitals in England. Tables for inter-hospital travel distances and times are also given.

Data comes from OpenStreetMap (https://www.openstreetmap.org), and fastest routes were identified using Routino (www.routino.org).

Data files represent raw output from OSM/Routino and also an adjusted travel time based on a calibration calculated by comparing 200 OSM/Routino route times with Google maps (2am travel times). OSM/Routino tended to under-estimate shorter travel times compared with Google.

FILES
=====

correction.txt
--------------
Description of the calibration equation for travel times, obtained by comparison of 200 randomly selected routes between OSM/Routino data and Google maps (travel times at 2am Wednesday morning; clear road conditions).

hospitals.csv
-------------
A list of 185 English acute hospitals with region.

inter_hospital_results.csv
--------------------------
Raw Routino output for inter-hospital travel times and distances.

lsoa2011_postcodes.csv
----------------------
Information on lower super output areas and the postcode (closest to the population-weighted centroid) used for travel distance/time estimation.

pivoted_distance_km.csv
-----------------------
A pivoted table of LSOA to hospital travel distances (km, fastest road route).

pivoted_inter_hospital_distance_km.csv
--------------------------------------
A pivoted table of inter-hospital travel distances (km, fastest road route).

pivoted_inter_hospital_time_adjusted.csv
----------------------------------------
A pivoted table of inter-hospital travel times (adjusted by calibration against Google maps).

pivoted_inter_hospital_time.csv
-------------------------------
A pivoted table of inter-hospital travel times (not adjusted).

pivoted_time_adjusted.csv
-------------------------
A pivoted table of travel times (fastest road travel time, minutes) from all LSOAs to all hospitals (adjusted by calibration against Google maps).

pivoted_time.csv
----------------
A pivoted table of travel times (fastest road travel time, minutes) from all LSOAs to all hospitals (not adjusted).

results.csv
-----------
Raw Routino output for travel times and distances from all LSOAs to all hospitals.
Note: 2,555 out of 6,075,955 (0.04%) routes did not have a travel time/distance reported by Routino. In these cases travel distances were imputed by adjusting straight-line distances by the average ratio of road distance to straight-line distance across all other routes. Travel times were then imputed using the average travel speed of all other routes.
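
As a brief illustration of how the pivoted tables might be used with pandas (a hedged sketch: it assumes each pivoted file has one row per LSOA, with the LSOA code in the first column, and one column per hospital – check this against the actual files):

import pandas as pd

# Load a pivoted travel time table (assumed layout: LSOA rows, hospital columns)
travel_times = pd.read_csv('pivoted_time_adjusted.csv', index_col=0)

# Closest hospital (by adjusted fastest travel time) for each LSOA
nearest_time = travel_times.min(axis=1)
nearest_hospital = travel_times.idxmin(axis=1)

print (nearest_hospital.head())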


97: Simple machine learning model to predict emergency department (ED) breaches of the four-hour target

In England emergency departments have a target that 95% of patients should be admitted or discharged from ED within four hours. Patients waiting more than four hours are known as ‘breaches’.

This notebook explores predicting emergency department (ED) breaches (patients taking more than 4 hours to be discharged or admitted). The data is from a real mid-sized acute hospital in England.

The model receives data every 2 hours and predicts whether there will be a breach in the next 2 hours.

It uses some basic ED data alongside whole-hospital data (number of occupied beds and total beds) to try to predict whether there are likely to be breaches in the next two hours. A simple logistic regression model achieves 80% accuracy in predicting breaches. The sensitivity may be adjusted to balance accuracy in predicting breaching and non-breaching episodes (80% accuracy may be achieved simultaneously in both).

import pandas as pd
data = pd.read_csv('ed_1.csv')

Show data columns:

print (list(data))
['snapshot_id', 'snapshot_date', 'snapshot_time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Number of Patients In department >= 4 Hours', 'Total Number of Patients in the Department', 'Number of Patients in Resus', 'Number of Patients Registered in Last 60 Minutes', 'Number of Patients Waiting Triage', 'Number of Patients Waiting to be Seen (ED)', 'Number of Patients Waiting to be Seen (Medical)', 'Number of Patients Waiting to be Seen (Surgery)', 'Number of Patients > 3 Hours', 'Number of Patients Waiting a Bed', 'Number of Patients Left Department in Last 60 Minutes', 'Free_beds', 'Breach_in_next_timeslot']

Separate data into features (X) and the label (y) to predict. y indicates whether there are breaches in the following 2 hours.

X = data.loc[:,"Monday":"Free_beds"]
y = data['Breach_in_next_timeslot']

Let’s see what proportion of 2 hour epochs have a breach:

print (data['Breach_in_next_timeslot'].mean())
0.6575510659671838

Split data into training and test sets

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

Normalise data with standard scaling

from sklearn.preprocessing import StandardScaler

# Initialise a new scaling object for normalising input data
sc=StandardScaler()

# Set up the scaler just on the training set
sc.fit(X_train)

# Apply the scaler to the training and test sets
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

Build a logistic regression model

C=1000 sets low regularisation. If the accuracy on the training data is significantly higher than the accuracy on the test data, C should be reduced in 10-fold or 3-fold steps to maximise the accuracy on the test data.

(Note: the ‘;’ at the end of the last line suppresses the model description output in a Jupyter Notebook.)

from sklearn.linear_model import LogisticRegression

ml = LogisticRegression(C=1000)
ml.fit(X_train_std,y_train);

Predict training and test set labels

Our model is now built. We can now predict breaches for the training and test sets. The results for the test set give the better description of accuracy, but it is useful to calculate both to look for ‘over-fitting’. If the training data has significantly better accuracy than the test data then the model is likely ‘over-fitted’ to the training data, and the regularisation term (C) in the model fit above should be reduced step-wise. This will reduce the accuracy of predicting the training data but increase the accuracy of predicting the test data, though too much regularisation (too low a C) will reduce the accuracy on both training and test data.

# Predict training and test set labels
y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

Test accuracy

import numpy as np
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.8111326090191993
Accuracy of predicting test data = 0.8151785714285714
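
If the training accuracy were significantly higher than the test accuracy, C could be reduced step-wise as described above. A minimal sketch (reusing the scaled data from above; the range of C values is illustrative):

from sklearn.linear_model import LogisticRegression
import numpy as np

for C in [1000, 100, 10, 1, 0.1]:
    model = LogisticRegression(C=C)
    model.fit(X_train_std, y_train)
    accuracy_train = np.mean(model.predict(X_train_std) == y_train)
    accuracy_test = np.mean(model.predict(X_test_std) == y_test)
    print ('C = %6.1f, train accuracy = %.3f, test accuracy = %.3f'
           %(C, accuracy_train, accuracy_test))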

Display weights (coefficients) of model.

# Create table of weights
weights_table = pd.DataFrame()
weights_table['feature'] = list(X)
weights_table['weight'] = ml.coef_[0]
print(weights_table)
                                              feature    weight
0                                              Monday  0.038918
1                                             Tuesday -0.026935
2                                           Wednesday  0.001615
3                                            Thursday  0.001543
4                                              Friday -0.014975
5                                            Saturday  0.011287
6                                              Sunday -0.011401
7         Number of Patients In department >= 4 Hours  1.515722
8          Total Number of Patients in the Department  0.544407
9                         Number of Patients in Resus  0.307983
10   Number of Patients Registered in Last 60 Minutes -0.444304
11                  Number of Patients Waiting Triage  0.028371
12         Number of Patients Waiting to be Seen (ED)  0.138082
13    Number of Patients Waiting to be Seen (Medical) -0.036093
14    Number of Patients Waiting to be Seen (Surgery)  0.022757
15                       Number of Patients > 3 Hours  1.265580
16                   Number of Patients Waiting a Bed  0.013085
17  Number of Patients Left Department in Last 60 ... -0.001884
18                                          Free_beds -0.369558


Define a function for sensitivity and specificity

Sensitivity = proportion of breaching periods correctly identified
Specificity = proportion of non-breaching periods correctly identified

def calculate_sensitivity_specificity(y_test, y_pred_test):
    # Note: more values are calculated here than are returned.
    # This would allow measures other than sensitivity and
    # specificity to be returned if required
    
    # Get true/false for whether a breach actually occurred
    actual_pos = y_test == 1
    actual_neg = y_test == 0
    
    # Get true and false tests (true tests match actual; false tests differ from actual)
    true_pos = (y_pred_test == 1) & (actual_pos)
    false_pos = (y_pred_test == 1) & (actual_neg)
    true_neg = (y_pred_test == 0) & (actual_neg)
    false_neg = (y_pred_test == 0) & (actual_pos)
    
    # Calculate accuracy
    accuracy = np.mean(y_pred_test == y_test)
    
    # Calculate sensitivity and specificity
    sensitivity = np.sum(true_pos) / np.sum(actual_pos)
    specificity = np.sum(true_neg) / np.sum(actual_neg)
    
    return sensitivity, specificity, accuracy

Show sensitivity and specificity:

sensitivity, specificity, accuracy = calculate_sensitivity_specificity(y_test, y_pred_test)
print ('Sensitivity:', sensitivity)
print ('Specificity:', specificity)
print ('Accuracy:', accuracy)
Sensitivity: 0.8488529014844804
Specificity: 0.7493403693931399
Accuracy: 0.8151785714285714

So we are better at detecting breaches than non-breaches. This is likely because breaching periods occur more often. Let’s adjust our model cut-off to balance the accuracy out. We’ll vary the cut-off we use and construct a sensitivity/specificity plot (very similar to a ‘Receiver Operating Characteristic’, or ROC, curve).

Balancing sensitivity and specificity

cutoff = np.arange(0.01, 1.01, 0.01)
sensitivity_results = []
specificity_results = []

# Get probability of each outcome (calculated once, before the loop).
# Logistic regression has a .predict_proba method to return the
# probability of each class; some methods, such as SVCs, use
# .decision_function instead.
y_pred_probability = ml.predict_proba(X_test_std)

for threshold in cutoff:
    # Check whether probability of positive classification is >= threshold
    y_pred_test = (y_pred_probability[:,1] >= threshold)
    
    # Convert boolean to 0/1 (could also simply multiply by 1)
    y_pred_test = y_pred_test.astype(int)
    
    # Get sensitivity and specificity
    sensitivity, specificity, accuracy = \
        calculate_sensitivity_specificity(y_test, y_pred_test)
    
    # Add results to list of results
    sensitivity_results.append(sensitivity)
    specificity_results.append(specificity)  
    

Plotting specificity against sensitivity:

import matplotlib.pyplot as plt

%matplotlib inline

fig = plt.figure(figsize=(5,5))
ax1 = fig.add_subplot(111)

x = sensitivity_results
y = specificity_results

ax1.grid(True, which='both')
ax1.set_xlabel('Sensitivity (proportion of breaching\nperiods predicted correctly)')
ax1.set_ylabel('Specificity (proportion of non-breaching\nperiods predicted correctly)')


plt.plot(x,y)
plt.show()
[Figure: plot of specificity against sensitivity as the cut-off is varied]

Plotting specificity against sensitivity shows that we can adjust our machine learning cut-off to simultaneously achieve about 80% sensitivity and 80% specificity in predicting the likelihood of breaches in the next 2 hours.
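
For example, a balanced cut-off may be chosen programmatically as the threshold where sensitivity and specificity are closest to each other (a small sketch using the cutoff, sensitivity_results and specificity_results arrays built above):

sens = np.array(sensitivity_results)
spec = np.array(specificity_results)
balanced = np.argmin(np.abs(sens - spec))
print ('Balanced cut-off: %.2f' %cutoff[balanced])
print ('Sensitivity: %.3f' %sens[balanced])
print ('Specificity: %.3f' %spec[balanced])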

96. Passing arguments to Python from the command line (or other programs)

Many Python programs have all their variables defined within the Python code (or the Python code reads input files). It may be useful at times, however, to be able to pass one or more arguments to Python when calling the program from the command line or another program. This is simple in Python. Below is an example (saved as ‘mycode.py’) that takes two arguments and multiplies them together:

import sys

x,y = int(sys.argv[1]), int(sys.argv[2])

print (x * y)

Now we can call that from the command line, passing our variables:

python mycode.py 3 4

>12

This type of code allows us to write generic Python code and call it as part of an automated sequence from another piece of code, which passes the required variables to Python.
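
For more than a couple of arguments, the standard-library argparse module provides named arguments, type conversion and automatic help. A minimal sketch of the same multiplication example (the argument names are illustrative):

import argparse

parser = argparse.ArgumentParser(description='Multiply two integers.')
parser.add_argument('x', type=int, help='first number')
parser.add_argument('y', type=int, help='second number')
args = parser.parse_args()

print (args.x * args.y)

Calling python mycode.py 3 4 prints 12 as before, and python mycode.py --help prints a usage message.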

95: Crowding distances: selecting solutions when too many multi-objective solutions exist

Sometimes in multi-objective algorithms we need to thin out the number of solutions we have.

When we are using a Pareto front to select solutions (as described here), all solutions are on the optimal front: that is, for each solution there is no other solution that is at least as good in all scores and better in at least one score. We therefore cannot rank solutions by performance. If we need to control the number of solutions we are generating, we can instead use ‘crowding distances’ to select solutions. Crowding distances give a measure of closeness in performance to other solutions: a solution’s crowding distance is the average distance to its two neighbouring solutions.
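
A minimal sketch of one common formulation (the NSGA-II style, where each solution’s crowding distance is built from the normalised gaps between its two neighbouring solutions in each objective; the full post may differ in detail):

import numpy as np

def crowding_distances(scores):
    # scores: one row per solution, one column per objective
    n_solutions, n_objectives = scores.shape
    distances = np.zeros(n_solutions)
    for objective in range(n_objectives):
        # Sort solutions by this objective
        order = np.argsort(scores[:, objective])
        sorted_scores = scores[order, objective]
        score_range = sorted_scores[-1] - sorted_scores[0]
        if score_range == 0:
            continue # all solutions are identical in this objective
        d = np.zeros(n_solutions)
        d[0] = d[-1] = np.inf # always keep the extreme (boundary) solutions
        # Gap between each solution's two neighbours, normalised
        d[1:-1] = (sorted_scores[2:] - sorted_scores[:-2]) / score_range
        distances[order] += d
    return distances

When thinning a Pareto front, the solutions with the largest crowding distances (those in the least crowded regions) are retained.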

94: Genetic algorithms 1. A simple genetic algorithm

Note: For core code only, without explanation or test code sections see this link: code_only

For more discussion of the general concepts of genetic algorithms, which are only presented briefly here (as we will focus on how to code a simple example in Python), see the Wikipedia article.

In this example we will look at a basic genetic algorithm (GA). We will set up the GA to try to match a pre-defined ‘optimal’ solution. Often we use GAs to find solutions to problems which 1) cannot be solved with ‘exact’ methods (methods that are guaranteed to find the best solution), and 2) where we cannot recognise when we have found the optimal solution. GAs therefore fall into a collection of algorithms called heuristic (from the Greek for ‘find’ or ‘discover’) algorithms that hunt down good solutions, without us knowing how far off the theoretical optimal solution they are.

In this case however we will test the algorithms against a solution we know we are looking for.

Principles of genetic algorithms (GAs)

GAs are iterative algorithms; that is, they repeatedly loop through a process until a target is reached or a maximum number of iterations (called ‘generations’ in GAs) is reached.

The basic steps of a genetic algorithm are:

1) Create a population of randomly generated solutions, coded as binary arrays, and score population for performance (or ‘fitness’) of each individual.

2) Loop (until target performance is reached or a maximum number of generations is reached):

  • Select two parents to ‘breed’. Selection is based on performance (or ‘fitness’) – better performing parents are more likely to be selected.
  • Generate two child solutions from two parents by mixing the genes from each parent and by applying a chance of random mutation.
  • Repeat child generation until a required new population size is reached.
  • Score the new population.

The solution to ‘hunt down’

We code most GAs to work with binary strings (or a binary NumPy array, as we will use here), so that a solution may be represented as a series of 0s and 1s.

A real-life example in healthcare is that the array of 0s and 1s may represent the choice of closed or open hospital units providing a given service. We then evaluate each solution against predetermined criteria.

Here we will define a known solution based on a string of 70 0s and 1s. The number of possible combinations for this is 2^70, or about 1.2 x 10^21 – that is a 1 followed by twenty-one zeros. Or, to put it another way (as these large numbers are difficult to imagine)…… the universe is about 15 billion (15 x 10^9) years old, or 5 x 10^17 seconds old. If we could evaluate 1,000 solutions per second, then a computer would need to run for more than twice the current age of the universe in order to evaluate all possible combinations. Let’s see how close to the perfect solution we can get in reasonable time!
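
We can check this arithmetic directly in Python:

print (2**70)
Out:
1180591620717411303424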

In GAs we will call each potential solution to be evaluated a ‘chromosome’. Each element (a 0 or 1) in that chromosome is a ‘gene’.

import random
import numpy as np

def create_reference_solution(chromosome_length):

    number_of_ones = int(chromosome_length / 2)

    # Build an array with an equal mix of zeros and ones
    reference = np.zeros(chromosome_length)
    reference[0: number_of_ones] = 1

    # Shuffle the array to mix the zeros and ones
    np.random.shuffle(reference)
    
    return reference

Let’s test the function and show an example 70-gene reference chromosome.

# Print an example target array
print (create_reference_solution(70))
OUT:

[1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0.
 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1.
 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1.]

Create a starting random population

We will use NumPy to store our population. Each row will represent one solution, which will contain a random sequence of zeros and ones. The number of rows will represent the number of ‘individuals’ in our population.

def create_starting_population(individuals, chromosome_length):
    # Set up an initial array of all zeros
    population = np.zeros((individuals, chromosome_length))
    # Loop through each row (individual)
    for i in range(individuals):
        # Choose a random number of ones to create
        ones = random.randint(0, chromosome_length)
        # Change the required number of zeros to ones
        population[i, 0:ones] = 1
        # Shuffle row
        np.random.shuffle(population[i])
    
    return population
    

Let’s test by showing a random population of 4 individuals with a gene length of 10.

print (create_starting_population(4, 10))
OUT:

[[0. 1. 1. 1. 1. 0. 0. 1. 1. 0.]
 [1. 1. 1. 1. 0. 1. 1. 0. 1. 1.]
 [1. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Calculate fitness of population

In GAs we refer to how good each individual in the population is as its ‘fitness’. The calculate_fitness function will contain the evaluation procedure you wish to apply in your algorithm. In this example we are going to return the number of genes (elements) in a potential solution (chromosome) that match our reference standard.

def calculate_fitness(reference, population):
    # Create an array of True/False compared to reference
    identical_to_reference = population == reference
    # Sum number of genes that are identical to the reference
    fitness_scores = identical_to_reference.sum(axis=1)
    
    return fitness_scores

Let’s test what we have so far:

reference = create_reference_solution(10)
print ('Reference solution: \n', reference)
population = create_starting_population(6, 10)
print ('\nStarting population: \n', population)
scores = calculate_fitness(reference, population)
print('\nScores: \n', scores)
OUT: 

Reference solution: 
 [1. 1. 0. 0. 0. 0. 1. 0. 1. 1.]

Starting population: 
 [[0. 0. 1. 1. 1. 0. 0. 0. 0. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 1. 1. 1. 0. 0. 1. 1. 0. 1.]
 [1. 1. 0. 0. 0. 0. 0. 1. 1. 1.]]

Scores: 
 [3 5 6 6 6 8]

Choosing individuals to breed with tournament selection

Genetic algorithms mimic biology in that the individuals with the best fitness scores are most likely to breed and pass on their genes. But we do not simply take all the best individuals from our population to breed, as this might risk ‘in-breeding’. Rather, we use a method that means better individuals are more likely to breed, but low-fitness individuals may at times also be chosen to breed.

In tournament selection we first choose two individuals at random from our population (it is possible that two low fitness individuals may be chosen). We then pass those individuals to a ‘tournament’ where the individual with the highest fitness will be chosen.

It is possible to further modify this so that the highest fitness individual wins only with a given probability, but we will keep it simple here and have the highest fitness individual always win.
It is also possible to have more than two individuals in a tournament (see the sketch after the selection test below). The more individuals in a tournament, the more the picked population will be biased towards the highest fitness individuals.

def select_individual_by_tournament(population, scores):
    # Get population size
    population_size = len(scores)
    
    # Pick individuals for tournament
    fighter_1 = random.randint(0, population_size-1)
    fighter_2 = random.randint(0, population_size-1)
    
    # Get fitness score for each
    fighter_1_fitness = scores[fighter_1]
    fighter_2_fitness = scores[fighter_2]
    
    # Identify individual with highest fitness
    # Fighter 1 will win if scores are equal
    if fighter_1_fitness >= fighter_2_fitness:
        winner = fighter_1
    else:
        winner = fighter_2
    
    # Return the chromosome of the winner
    return population[winner, :]

Let’s test selection of parents:

# Set up and score population
reference = create_reference_solution(10)
population = create_starting_population(6, 10)
scores = calculate_fitness(reference, population)

# Pick two parents and display
parent_1 = select_individual_by_tournament(population, scores)
parent_2 = select_individual_by_tournament(population, scores)
print (parent_1)
print (parent_2)
OUT:

[1. 0. 1. 1. 0. 1. 0. 1. 1. 0.]
[1. 0. 1. 1. 0. 1. 0. 1. 1. 0.]
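
As noted above, a tournament may contain more than two individuals. A minimal sketch (not part of the original post) of a generalised tournament, in which the fittest of the sampled individuals always wins:

import random

def select_individual_by_tournament_n(population, scores, tournament_size=2):
    # Sample competitor indices at random (repeats are possible)
    fighters = [random.randint(0, len(scores) - 1)
                for _ in range(tournament_size)]
    # The sampled fighter with the highest fitness score wins
    winner = max(fighters, key=lambda i: scores[i])
    return population[winner, :]

Increasing tournament_size increases the selection pressure towards the fittest individuals.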

Producing children from parents – crossover

When two individuals are chosen, the next step is to produce ‘children’ from them by a ‘crossover’ mix of their two chromosomes. We choose a random point within the chromosome; one ‘child’ will take the left portion (up to, but not including, the crossover point) from parent 1 and the corresponding right portion from parent 2. The result is a mix of genes from each parent. The second ‘child’ will be the opposite of this: the left portion (up to, but not including, the crossover point) from parent 2 and the corresponding right portion from parent 1.

It is possible to have more than one crossover point (a two-point version is sketched after the example below), but we will keep it simple and have a single crossover point.

def breed_by_crossover(parent_1, parent_2):
    # Get length of chromosome
    chromosome_length = len(parent_1)
    
    # Pick crossover point, avoiding the ends of the chromosome
    crossover_point = random.randint(1,chromosome_length-1)
    
    # Create children. np.hstack joins two arrays
    child_1 = np.hstack((parent_1[0:crossover_point],
                        parent_2[crossover_point:]))
    
    child_2 = np.hstack((parent_2[0:crossover_point],
                        parent_1[crossover_point:]))
    
    # Return children
    return child_1, child_2

And let’s test it so far, creating a population, scoring it, picking two ‘parents’ and producing two ‘children’.

# Set up and score population
reference = create_reference_solution(15)
population = create_starting_population(100, 15)
scores = calculate_fitness(reference, population)

# Pick two parents and display
parent_1 = select_individual_by_tournament(population, scores)
parent_2 = select_individual_by_tournament(population, scores)

# Get children
child_1, child_2 = breed_by_crossover(parent_1, parent_2)

# Show output
print ('Parents')
print (parent_1)
print (parent_2)
print ('Children')
print (child_1)
print (child_2)
OUT:

Parents
[1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0.]
Children
[1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
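
As mentioned above, more than one crossover point may be used. A minimal sketch (not part of the original post) of a two-point crossover, in which the segment between two randomly chosen points is swapped between the parents:

import random
import numpy as np

def breed_by_two_point_crossover(parent_1, parent_2):
    chromosome_length = len(parent_1)
    # Pick two distinct crossover points, avoiding the ends of the chromosome
    point_1, point_2 = sorted(random.sample(range(1, chromosome_length), 2))
    # Each child swaps the middle segment between the two points
    child_1 = np.hstack((parent_1[:point_1],
                         parent_2[point_1:point_2],
                         parent_1[point_2:]))
    child_2 = np.hstack((parent_2[:point_1],
                         parent_1[point_1:point_2],
                         parent_2[point_2:]))
    return child_1, child_2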

Random mutation of genes

In evolution genes are sometimes copied incorrectly. This change may be harmful or beneficial. We mimic this by having a certain probability that a gene (which is either a 0 or a 1) is flipped.

Typically this probability is low (e.g. 0.005), though it can be made flexible (e.g. increasing the mutation rate if progress has stalled).

def randomly_mutate_population(population, mutation_probability):
    
    # Apply random mutation
    random_mutation_array = np.random.random(
        size=(population.shape))
    
    random_mutation_boolean = \
        random_mutation_array <= mutation_probability

    population[random_mutation_boolean] = \
        np.logical_not(population[random_mutation_boolean])
    
    # Return mutated population
    return population

Let’s test our function with a high mutation rate (0.25) to see the effects. You can change the mutation rate and see what happens (a mutation rate of 1.0 will invert all genes).

# Set up and score population
reference = create_reference_solution(15)
population = create_starting_population(100, 15)
scores = calculate_fitness(reference, population)

# Pick two parents and display
parent_1 = select_individual_by_tournament(population, scores)
parent_2 = select_individual_by_tournament(population, scores)

# Get children and make new population 
child_1, child_2 = breed_by_crossover(parent_1, parent_2)
population = np.stack((child_1, child_2))

# Mutate population
mutation_probability = 0.25
print ("Population before mutation")
print (population)
population = randomly_mutate_population(population, mutation_probability)
print ("Population after mutation")
print (population)
OUT:

Population before mutation
[[1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.]]

Population after mutation
[[1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1.]
 [0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]]

Putting it all together

We’ve defined all the functions we need. Now let’s put it all together.

# Set general parameters
chromosome_length = 75
population_size = 500
maximum_generation = 200
best_score_progress = [] # Tracks progress

# Create reference solution 
# (this is used just to illustrate GAs)
reference = create_reference_solution(chromosome_length)

# Create starting population
population = create_starting_population(population_size, chromosome_length)

# Display best score in starting population
scores = calculate_fitness(reference, population)
best_score = np.max(scores)/chromosome_length * 100
print ('Starting best score, percent target: %.1f' %best_score)

# Add starting best score to progress tracker
best_score_progress.append(best_score)

# Now we'll go through the generations of genetic algorithm
for generation in range(maximum_generation):
    # Create an empty list for new population
    new_population = []
    
    # Create new population generating two children at a time
    for i in range(int(population_size/2)):
        parent_1 = select_individual_by_tournament(population, scores)
        parent_2 = select_individual_by_tournament(population, scores)
        child_1, child_2 = breed_by_crossover(parent_1, parent_2)
        new_population.append(child_1)
        new_population.append(child_2)
    
    # Replace the old population with the new one
    population = np.array(new_population)
    
    # Apply random mutation to the new population (a low rate, as discussed above)
    mutation_probability = 0.002
    population = randomly_mutate_population(population, mutation_probability)
    
    # Score best solution, and add to tracker
    scores = calculate_fitness(reference, population)
    best_score = np.max(scores)/chromosome_length * 100
    best_score_progress.append(best_score)

# GA has completed all generations
print ('End best score, percent target: %.1f' %best_score)

# Plot progress
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(best_score_progress)
plt.xlabel('Generation')
plt.ylabel('Best score (% target)')
plt.show()
OUT:

Starting best score, percent target: 64.0
End best score, percent target: 100.0
[Figure: best score (% target) by generation]

94. Genetic Algorithms 1. A simple genetic algorithm (code only)

For explanation see: 94: Genetic algorithms 1. A simple genetic algorithm

import random
import numpy as np

def create_reference_solution(chromosome_length):

    number_of_ones = int(chromosome_length / 2)

    # Build an array with an equal mix of zeros and ones
    reference = np.zeros(chromosome_length)
    reference[0: number_of_ones] = 1

    # Shuffle the array to mix the zeros and ones
    np.random.shuffle(reference)
    
    return reference


def create_starting_population(individuals, chromosome_length):
    # Set up an initial array of all zeros
    population = np.zeros((individuals, chromosome_length))
    # Loop through each row (individual)
    for i in range(individuals):
        # Choose a random number of ones to create
        ones = random.randint(0, chromosome_length)
        # Change the required number of zeros to ones
        population[i, 0:ones] = 1
        # Shuffle row
        np.random.shuffle(population[i])
    
    return population


def calculate_fitness(reference, population):
    # Create an array of True/False compared to reference
    identical_to_reference = population == reference
    # Sum number of genes that are identical to the reference
    fitness_scores = identical_to_reference.sum(axis=1)
    
    return fitness_scores


def select_individual_by_tournament(population, scores):
    # Get population size
    population_size = len(scores)
    
    # Pick individuals for tournament
    fighter_1 = random.randint(0, population_size-1)
    fighter_2 = random.randint(0, population_size-1)
    
    # Get fitness score for each
    fighter_1_fitness = scores[fighter_1]
    fighter_2_fitness = scores[fighter_2]
    
    # Identify individual with highest fitness
    # Fighter 1 will win if scores are equal
    if fighter_1_fitness >= fighter_2_fitness:
        winner = fighter_1
    else:
        winner = fighter_2
    
    # Return the chromosome of the winner
    return population[winner, :]


def breed_by_crossover(parent_1, parent_2):
    # Get length of chromosome
    chromosome_length = len(parent_1)
    
    # Pick crossover point, avoiding the ends of the chromosome
    crossover_point = random.randint(1,chromosome_length-1)
    
    # Create children. np.hstack joins two arrays
    child_1 = np.hstack((parent_1[0:crossover_point],
                        parent_2[crossover_point:]))
    
    child_2 = np.hstack((parent_2[0:crossover_point],
                        parent_1[crossover_point:]))
    
    # Return children
    return child_1, child_2
    

def randomly_mutate_population(population, mutation_probability):
    
    # Apply random mutation
    random_mutation_array = np.random.random(
        size=(population.shape))
    
    random_mutation_boolean = \
        random_mutation_array <= mutation_probability

    population[random_mutation_boolean] = \
        np.logical_not(population[random_mutation_boolean])
    
    # Return mutated population
    return population
    

# *************************************
# ******** MAIN ALGORITHM CODE ********
# *************************************

# Set general parameters
chromosome_length = 75
population_size = 500
maximum_generation = 200
best_score_progress = [] # Tracks progress

# Create reference solution 
# (this is used just to illustrate GAs)
reference = create_reference_solution(chromosome_length)

# Create starting population
population = create_starting_population(population_size, chromosome_length)

# Display best score in starting population
scores = calculate_fitness(reference, population)
best_score = np.max(scores)/chromosome_length * 100
print ('Starting best score, % target: ',best_score)

# Add starting best score to progress tracker
best_score_progress.append(best_score)

# Now we'll go through the generations of genetic algorithm
for generation in range(maximum_generation):
    # Create an empty list for new population
    new_population = []
    
    # Create new population generating two children at a time
    for i in range(int(population_size/2)):
        parent_1 = select_individual_by_tournament(population, scores)
        parent_2 = select_individual_by_tournament(population, scores)
        child_1, child_2 = breed_by_crossover(parent_1, parent_2)
        new_population.append(child_1)
        new_population.append(child_2)
    
    # Replace the old population with the new one
    population = np.array(new_population)
    
    # Apply random mutation to the new population (a low rate, as discussed above)
    mutation_probability = 0.002
    population = randomly_mutate_population(population, mutation_probability)
    
    # Score best solution, and add to tracker
    scores = calculate_fitness(reference, population)
    best_score = np.max(scores)/chromosome_length * 100
    best_score_progress.append(best_score)

# GA has completed all generations
print ('End best score, % target: ', best_score)

# Plot progress
import matplotlib.pyplot as plt
plt.plot(best_score_progress)
plt.xlabel('Generation')
plt.ylabel('Best score (% target)')
plt.show()