# 120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently useful for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew, which frequently occurs when a clinical process step has some additional complexity compared to the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation that were calculated from the original non-logged population into mu and sigma, the mean and standard deviation of the logged values (the underlying normal distribution).
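Written out, the conversion looks as follows. The symbols $m$ and $s$ (arithmetic mean and standard deviation of the non-logged data) and the intermediate $\phi$ are names chosen here for illustration.

```latex
\phi = \sqrt{s^2 + m^2}, \qquad
\mu = \ln\!\left(\frac{m^2}{\phi}\right), \qquad
\sigma = \sqrt{\ln\!\left(\frac{\phi^2}{m^2}\right)}
       = \sqrt{\ln\!\left(1 + \frac{s^2}{m^2}\right)}
```

For example, with $m = s = 10$ this gives $\phi = \sqrt{200} \approx 14.14$, $\mu = \ln(100 / 14.14) \approx 1.956$ and $\sigma = \sqrt{\ln 2} \approx 0.833$.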

(For maximum computational efficiency, when calling the function repeatedly with the same mean and standard deviation, you may wish to split this into two functions: one to calculate mu and sigma, which needs calling only once, and the other to sample from the log normal distribution given mu and sigma.)
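A minimal sketch of that two-function split is shown here. The function names `lognormal_params` and `sample_lognormal`, and the optional `rng` argument, are hypothetical names introduced for illustration; the sketch also uses NumPy's `default_rng` generator rather than `np.random.lognormal`.

```python
import numpy as np


def lognormal_params(mean, stdev):
    """Convert arithmetic mean/stdev to mu and sigma of the logged values."""
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    return mu, sigma


def sample_lognormal(mu, sigma, n=1, rng=None):
    """Sample n values from a lognormal distribution with given mu, sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.lognormal(mu, sigma, n)


# Calculate mu and sigma once, then reuse them for repeated sampling
mu, sigma = lognormal_params(mean=10, stdev=10)
rng = np.random.default_rng(42)
batches = [sample_lognormal(mu, sigma, 1000, rng) for _ in range(5)]
print(mu, sigma)
```

Passing the generator in explicitly keeps repeated calls reproducible when a seed is set.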

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

```
import numpy as np


def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calculated from the original non-logged population.

    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations described at:
    https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

    Returns a numpy array of floats if n > 1, otherwise returns a float
    """

    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5

    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma, n)

    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop

    return generated_pop
```

## Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and check that the generated population has approximately the same mean and standard deviation.

```
mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print('Mean:', generated_pop.mean())
print('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651
```
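A complementary check, sketched here under the assumption of the same mean and standard deviation of 10, is to log the generated samples and confirm that their mean and standard deviation are close to the calculated mu and sigma. The seeded generator is an illustrative choice for reproducibility.

```python
import numpy as np

mean, stdev = 10, 10
phi = (stdev ** 2 + mean ** 2) ** 0.5
mu = np.log(mean ** 2 / phi)
sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5

# Sample on the original scale, then move back to the log scale
rng = np.random.default_rng(0)
samples = rng.lognormal(mu, sigma, 100_000)
log_samples = np.log(samples)

print('mu (expected, observed):', mu, log_samples.mean())
print('sigma (expected, observed):', sigma, log_samples.std())
```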

Plot a histogram of the generated population:

```
import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0, 51, 1)
plt.hist(generated_pop, bins=bins)
plt.show()
```

## Generating a single sample

The function will return a single number if no `n` is given in the function call:

```
print(generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125
```

# 89. An emergency department model in SimPy, with patient prioritisation and capacity limited by doctor availability (object based)

This model mimics the random arrival of patients at an emergency department (ED). The patients are assessed as being low, medium or high priority. Higher priority patients are always selected next (but do not displace lower priority patients already being seen by a doctor).

# 88. A hospital bed occupancy model with queuing for a limited number of beds (object based)

Before looking at this model it is best to look at the simplest bed model (here), and the object-based version of that simpler model (here).

In this model we extend those previous models to include beds as a limited resource (please see SimPy tutorials, here, for detailed guidance on use of capacity-limited resources in SimPy).

The key extensions to the models without constrained beds are:

1. A new class has been added, `Resources`.
2. Within that new class the attribute `beds` is stored, which is set with the capacity of the bed/hospital/ward.
3. In the spell process/function, the stay in hospital is preceded by the command `with self.resources.beds.request()`. This is the simplest way in SimPy of waiting for resources to be available to perform a task (more flexible, but more verbose, methods are given in the SimPy documentation).
4. The hospital object now tracks the number of people waiting for a bed, and the number of beds occupied. These are recorded at each audit (set at once daily).
```
"""
This model requires simpy to run (pip install simpy). All other modules
should be present in an Anaconda installation of Python 3. The model was
written in Python 3.6.

This model simulates the random arrival of patients at a hospital. Patients
stay for a random time (given an average length of stay). Both inter-arrival
time and length of stay are sampled from an inverse exponential distribution.
Average inter-arrival time and length of stay may be changed by changing their
values in class g (global variables). At intervals the number of beds occupied
is counted in an audit, and a chart of bed occupancy is plotted at the end of
the run.

The number of beds in the hospital is limited. As part of the audit, the
number of people waiting for a bed is counted.

There are five classes:

g (global variables)
--------------------

This class stores global variables. No individual object instance is used;
global variables are stored as class variables.

Hospital
--------
Hospital class (one instance created):
1) Dictionary of patients present
2) List of audit times
3) List of beds occupied at each audit time
4) Current total beds occupied

The Hospital class contains methods for audit of beds occupied, summarising
audit (at end of run), and plotting bed occupancy over time (at end of run).

Model
-----
The model class contains the model environment. The modelling environment is
set up, and patient arrival and audit processes initiated. Patient arrival
triggers a spell for that patient in hospital. Arrivals and audit continue
for the duration of the model run. The audit is then summarised and bed
occupancy, and number of people waiting for beds (with 5th, 50th and 95th
percentiles) plotted.

Patient
-------
The patient class is the template for all patients generated (each new patient
arrival creates a new patient object). The patient object contains patient id
and length of stay.

Resources
---------
This class holds the beds resource (it could also hold other resources, such as
doctors)

Main
----
Code entry point after: if __name__ == '__main__'
Creates model object, and runs model

"""

# Import required modules

import simpy
import random
import pandas as pd
import matplotlib.pyplot as plt


class g:
    """g holds Global variables. No individual instance is required"""

    inter_arrival_time = 1  # Average time (days) between arrivals
    los = 10  # Average length of stay in hospital (days)
    sim_duration = 500  # Duration of simulation (days)
    audit_interval = 1  # Interval between audits (days)
    beds = 15  # bed capacity of hospital

class Hospital:
    """
    Hospital class holds:
    1) Dictionary of patients present
    2) List of audit times
    3) List of beds occupied at each audit time
    4) Current total beds occupied

    Methods:

    __init__: Set up hospital instance

    audit: records number of beds occupied

    build_audit_report: builds audit report at end of run (calculate 5th, 50th
    and 95th percentile bed occupancy).

    chart: plot beds occupied over time (at end of run)
    """

    def __init__(self):
        """
        Constructor method for hospital class.
        Initialise object with attributes.
        """

        self.patients = {}  # Dictionary of patients present
        self.patients_in_queue = {}  # Dictionary of patients waiting for bed
        self.patients_in_beds = {}  # Dictionary of patients in beds
        self.audit_time = []  # List of audit times
        self.audit_beds = []  # List of beds occupied at each audit time
        self.audit_queue = []  # List of queue length at each audit time
        self.bed_count = 0  # Current total beds occupied
        self.queue_count = 0  # Current number of patients waiting for a bed
        return

    def audit(self, time):
        """
        Audit method. When called appends current simulation time to audit_time
        list, and appends current bed count and queue count to audit_beds and
        audit_queue.
        """

        self.audit_time.append(time)
        self.audit_beds.append(self.bed_count)
        self.audit_queue.append(self.queue_count)
        return

    def build_audit_report(self):
        """
        This method is called at end of run. It creates a pandas DataFrame,
        transfers audit times, bed counts and queue counts to the DataFrame,
        and calculates/stores 5th, 50th and 95th percentiles.
        """
        self.audit_report = pd.DataFrame()

        self.audit_report['Time'] = self.audit_time

        self.audit_report['Occupied_beds'] = self.audit_beds

        self.audit_report['Median_beds'] = \
            self.audit_report['Occupied_beds'].quantile(0.5)

        self.audit_report['Beds_5_percent'] = \
            self.audit_report['Occupied_beds'].quantile(0.05)

        self.audit_report['Beds_95_percent'] = \
            self.audit_report['Occupied_beds'].quantile(0.95)

        self.audit_report['Queue'] = self.audit_queue

        self.audit_report['Median_queue'] = \
            self.audit_report['Queue'].quantile(0.5)

        self.audit_report['Queue_5_percent'] = \
            self.audit_report['Queue'].quantile(0.05)

        self.audit_report['Queue_95_percent'] = \
            self.audit_report['Queue'].quantile(0.95)

        return

    def chart(self):
        """
        This method is called at end of run. It plots bed occupancy and the
        queue for beds over the model run, with 5th, 50th and 95th percentiles.
        """

        # Plot occupied beds

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Occupied_beds'],
                 color='k',
                 marker='o',
                 linestyle='solid',
                 markevery=1,
                 label='Occupied beds')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Beds_5_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 markevery=1,
                 label='5th percentile')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Median_beds'],
                 color='0.5',
                 linestyle='dashed',
                 label='Median')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Beds_95_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 label='95th percentile')

        plt.xlabel('Day')
        plt.ylabel('Occupied beds')
        plt.title(
            'Occupied beds (individual days with 5th, 50th and 95th ' +
            'percentiles)')
        plt.legend()
        plt.show()

        # Plot queue for beds

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Queue'],
                 color='k',
                 marker='o',
                 linestyle='solid',
                 markevery=1,
                 label='Queue for beds')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Queue_5_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 markevery=1,
                 label='5th percentile')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Median_queue'],
                 color='0.5',
                 linestyle='dashed',
                 label='Median')

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Queue_95_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 label='95th percentile')

        plt.xlabel('Day')
        plt.ylabel('Queue for beds')
        plt.title('Queue for beds (individual days with 5th, 50th and 95th' +
                  ' percentiles)')
        plt.legend()
        plt.show()

        return

class Model:
    """
    The main model class.

    The model class contains the model environment. The modelling environment
    is set up, and patient arrival and audit processes initiated. Patient
    arrival triggers a spell for that patient in hospital. Arrivals and audit
    continue for the duration of the model run. The audit is then
    summarised and bed occupancy (with 5th, 50th and 95th percentiles) plotted.

    Methods are:

    __init__: Set up model instance

    audit_beds: call for bed audit at regular intervals (after initial delay
    for model warm-up)

    new_admission: call for patient generation with patient id and length of
    stay, then call for patient spell in hospital.

    run: Controls the main model run. Initialises model and patient arrival and
    audit processes. Instigates the run. At end of run calls for an audit
    summary and bed occupancy plot.

    spell_gen: stores patient in hospital patient list and bed queue
    dictionaries, waits for bed resource to become available, then removes
    patient from bed queue dictionary and adds patient to hospital bed
    dictionary and increments beds occupied. Waits for the patient length of
    stay in the hospital and then decrements beds occupied and removes patient
    from hospital patient dictionary and beds occupied dictionary.
    """

    def __init__(self):
        """
        Constructor method for new model.
        """
        self.env = simpy.Environment()

        return

    def audit_beds(self, delay):
        """
        Bed audit process. Begins by applying delay, then calls for audit at
        intervals set in g.audit_interval

        :param delay: delay (days) at start of model run for model warm-up.
        """

        # Delay first audit
        yield self.env.timeout(delay)

        # Continually generate audit requests until end of model run
        while True:
            # Call audit (pass simulation time to hospital.audit)
            self.hospital.audit(self.env.now)
            # Delay until next call
            yield self.env.timeout(g.audit_interval)

        return

    def new_admission(self, interarrival_time, los):
        """
        New admissions to hospital.

        :param interarrival_time: average time (days) between arrivals
        :param los: average length of stay (days)
        """
        # Patient id counter (assumed: the counter lines are missing from
        # the listing)
        patient_id = 0

        while True:
            patient_id += 1

            # Generate new patient object (from Patient class). Give patient id
            # and set length of stay from inverse exponential distribution).
            p = Patient(patient_id=patient_id,
                        los=random.expovariate(1 / los))

            # Add patient to hospital patient dictionary
            self.hospital.patients[p.id] = p

            # Generate a patient spell in hospital (by calling spell method).
            # This triggers a patient admission and allows the next arrival to
            # be set before the patient spell is finished
            self.spell = self.spell_gen(p)
            self.env.process(self.spell)

            # Set and call delay before looping back to new patient admission
            # (assumed: sampled in the same way as length of stay)
            next_admission = random.expovariate(1 / interarrival_time)
            yield self.env.timeout(next_admission)

        return

    def run(self):
        """
        Controls the main model run. Initialises model and patient arrival and
        audit processes. Instigates the run. At end of run calls for an audit
        summary and bed occupancy plot
        """

        # Set up hospital (calling Hospital class)
        self.hospital = Hospital()

        # Set up resources (beds)
        self.resources = Resources(self.env, g.beds)

        # Set up starting processes: new admissions and bed audit (with delay)
        # (the new_admission call is assumed; it is missing from the listing)
        self.env.process(self.new_admission(g.inter_arrival_time, g.los))
        self.env.process(self.audit_beds(delay=20))

        # Start model run
        self.env.run(until=g.sim_duration)

        # At end of run call for bed audit summary and bed occupancy plot
        self.hospital.build_audit_report()
        self.hospital.chart()

        return

    def spell_gen(self, p):
        """
        Patient hospital stay generator. Wait for a bed, increment bed count,
        wait for patient length of stay to complete, then decrement bed count
        and remove patient from hospital patient dictionary

        :param p: patient object (contains length of stay for patient)
        """
        # The following 'with' defines the required resources and automatically
        # releases resources when no longer required

        with self.resources.beds.request() as req:
            # Increment queue count
            self.hospital.queue_count += 1

            # Add patient to dictionary of queuing patients. This is not used
            # further in this model.
            self.hospital.patients_in_queue[p.id] = p

            # Yield resource request. Sim continues after yield when resources
            # are available (so there is no delay if resources are immediately
            # available)
            yield req

            # Resource now available. Remove from queue count and dictionary of
            # queued objects
            self.hospital.queue_count -= 1
            del self.hospital.patients_in_queue[p.id]

            # Add to count of patients in beds and to dictionary of patients in
            # beds
            self.hospital.patients_in_beds[p.id] = p
            self.hospital.bed_count += 1

            # Trigger length of stay delay
            yield self.env.timeout(p.los)

            # Length of stay complete. Remove patient from counts and
            # dictionaries
            self.hospital.bed_count -= 1
            del self.hospital.patients_in_beds[p.id]
            del self.hospital.patients[p.id]

        return

class Patient:
    """
    Patient class. Contains patient id and length of stay (it could contain
    other info about patient, such as priority or clinical group).

    The only method is __init__ for creating a patient (with assignment of
    patient id and length of stay).
    """

    def __init__(self, patient_id, los):
        """
        Constructor for new patient.

        :param patient_id: id of patient (set in self.new_admission)
        :param los: length of stay (days, set in self.new_admission)
        """
        self.id = patient_id
        self.los = los

        return


class Resources:
    """
    Holds beds resources
    """

    def __init__(self, env, number_of_beds):
        """Constructor method to initialise beds resource"""
        self.beds = simpy.Resource(env, capacity=number_of_beds)

        return


def main():
    """
    Code entry point after: if __name__ == '__main__'
    Creates model object, and runs model
    """

    model = Model()
    model.run()

    return


# Code entry point. Calls main method.
if __name__ == '__main__':
    main()
```
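The model samples length of stay with `random.expovariate`, which takes a rate (one divided by the mean) rather than the mean itself. The short check below, with an illustrative seed and sample size, shows that passing `1 / los` gives samples whose average is close to `los`.

```python
import random

random.seed(1)
los = 10  # average length of stay (days)

# expovariate takes the rate (1 / mean), not the mean
samples = [random.expovariate(1 / los) for _ in range(100_000)]
mean_los = sum(samples) / len(samples)
print('Mean sampled length of stay:', mean_los)
```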

# 87. SimPy: A simple hospital bed occupancy model (object-based)

Previously, I have posted code for a simple hospital bed occupancy model using SimPy. Here, the same model is presented but written using an object-based method. This approach would allow for a more sophisticated model to be produced, as each patient exists as an object in the model. Here the patient object has only id and length-of-stay attributes, but these could be expanded to include, for example, clinical group.

```
"""
This model requires simpy to run (pip install simpy). All other modules
should be present in an Anaconda installation of Python 3. The model was
written in Python 3.6.

This model simulates the random arrival of patients at a hospital. Patients
stay for a random time (given an average length of stay). Both inter-arrival
time and length of stay are sampled from an inverse exponential distribution.
Average inter-arrival time and length of stay may be changed by changing their
values in class g (global variables). At intervals the number of beds occupied
is counted in an audit, and a chart of bed occupancy is plotted at the end of
the run.

There are four classes:

g (global variables)
--------------------

This class stores global variables. No individual object instance is used;
global variables are stored as class variables.

Hospital
--------
Hospital class (one instance created):
1) Dictionary of patients present
2) List of audit times
3) List of beds occupied at each audit time
4) Current total beds occupied

The Hospital class contains methods for audit of beds occupied, summarising
audit (at end of run), and plotting bed occupancy over time (at end of run).

Model
-----
The model class contains the model environment. The modelling environment is
set up, and patient arrival and audit processes initiated. Patient arrival
triggers a spell for that patient in hospital. Arrivals and audit continue
for the duration of the model run. The audit is then summarised and bed
occupancy (with 5th, 50th and 95th percentiles) plotted.

Patient
-------
The patient class is the template for all patients generated (each new patient
arrival creates a new patient object). The patient object contains patient id
and length of stay.

Main
----
Code entry point after: if __name__ == '__main__'
Creates model object, and runs model

"""

# Import required modules

import simpy
import random
import pandas as pd
import matplotlib.pyplot as plt


class g:
    """g holds Global variables. No individual instance is required"""

    inter_arrival_time = 1  # Average time (days) between arrivals
    los = 10  # Average length of stay in hospital (days)
    sim_duration = 500  # Duration of simulation (days)
    audit_interval = 1  # Interval between audits (days)

class Hospital:
    """
    Hospital class holds:
    1) Dictionary of patients present
    2) List of audit times
    3) List of beds occupied at each audit time
    4) Current total beds occupied

    Methods:

    __init__: Set up hospital instance

    audit: records number of beds occupied

    build_audit_report: builds audit report at end of run (calculate 5th, 50th
    and 95th percentile bed occupancy).

    chart: plot beds occupied over time (at end of run)
    """

    def __init__(self):
        """
        Constructor method for hospital class.
        Initialise object with attributes.
        """

        self.patients = {}  # Dictionary of patients present
        self.audit_time = []  # List of audit times
        self.audit_beds = []  # List of beds occupied at each audit time
        self.bed_count = 0  # Current total beds occupied
        return

    def audit(self, time):
        """
        Audit method. When called appends current simulation time to audit_time
        list, and appends current bed count to audit_beds.
        """

        self.audit_time.append(time)
        self.audit_beds.append(self.bed_count)
        return

    def build_audit_report(self):
        """
        This method is called at end of run. It creates a pandas DataFrame,
        transfers audit times and bed counts to the DataFrame, and
        calculates/stores 5th, 50th and 95th percentiles.
        """
        self.audit_report = pd.DataFrame()
        self.audit_report['Time'] = self.audit_time
        self.audit_report['Occupied_beds'] = self.audit_beds
        self.audit_report['Median_beds'] = \
            self.audit_report['Occupied_beds'].quantile(0.5)
        self.audit_report['Beds_5_percent'] = \
            self.audit_report['Occupied_beds'].quantile(0.05)
        self.audit_report['Beds_95_percent'] = \
            self.audit_report['Occupied_beds'].quantile(0.95)
        return

    def chart(self):
        """
        This method is called at end of run. It plots bed occupancy over the
        model run, with 5th, 50th and 95th percentiles.
        """

        plt.plot(self.audit_report['Time'],
                 self.audit_report['Occupied_beds'],
                 color='k',
                 marker='o',
                 linestyle='solid',
                 markevery=1,
                 label='Occupied beds')
        plt.plot(self.audit_report['Time'],
                 self.audit_report['Beds_5_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 markevery=1,
                 label='5th percentile')
        plt.plot(self.audit_report['Time'],
                 self.audit_report['Median_beds'],
                 color='0.5',
                 linestyle='dashed',
                 label='Median')
        plt.plot(self.audit_report['Time'],
                 self.audit_report['Beds_95_percent'],
                 color='0.5',
                 linestyle='dashdot',
                 label='95th percentile')
        plt.xlabel('Day')
        plt.ylabel('Occupied beds')
        plt.title(
            'Occupied beds (individual days with 5th, 50th and 95th ' +
            'percentiles)')
        # plt.legend()
        plt.show()
        return

class Model:
    """
    The main model class.

    The model class contains the model environment. The modelling environment is
    set up, and patient arrival and audit processes initiated. Patient arrival
    triggers a spell for that patient in hospital. Arrivals and audit continue
    for the duration of the model run. The audit is then summarised and bed
    occupancy (with 5th, 50th and 95th percentiles) plotted.

    Methods are:

    __init__: Set up model instance

    audit_beds: call for bed audit at regular intervals (after initial delay for
    model warm-up)

    new_admission: call for patient generation with patient id and length of
    stay, then call for patient spell in hospital.

    run: Controls the main model run. Initialises model and patient arrival and
    audit processes. Instigates the run. At end of run calls for an audit
    summary and bed occupancy plot.

    spell_gen: increments beds occupied, waits for patient length of stay and
    then decrements beds occupied and removes patient from hospital patient
    dictionary.
    """

    def __init__(self):
        """
        Constructor method for new model.
        """
        self.env = simpy.Environment()

        return

    def audit_beds(self, delay):
        """
        Bed audit process. Begins by applying delay, then calls for audit at
        intervals set in g.audit_interval

        :param delay: delay (days) at start of model run for model warm-up.
        """

        # Delay first audit
        yield self.env.timeout(delay)

        # Continually generate audit requests until end of model run
        while True:
            # Call audit (pass simulation time to hospital.audit)
            self.hospital.audit(self.env.now)
            # Delay until next call
            yield self.env.timeout(g.audit_interval)

        return

    def new_admission(self, interarrival_time, los):
        """
        New admissions to hospital.

        :param interarrival_time: average time (days) between arrivals
        :param los: average length of stay (days)
        """
        # Patient id counter (assumed: the counter lines are missing from
        # the listing)
        patient_id = 0

        while True:
            patient_id += 1

            # Generate new patient object (from Patient class). Give patient id
            # and set length of stay from inverse exponential distribution).
            p = Patient(patient_id=patient_id,
                        los=random.expovariate(1 / los))

            # Add patient to hospital patient dictionary
            self.hospital.patients[p.id] = p

            # Generate a patient spell in hospital (by calling spell method).
            # This triggers a patient admission and allows the next arrival to
            # be set before the patient spell is finished
            self.spell = self.spell_gen(p)
            self.env.process(self.spell)

            # Set and call delay before looping back to new patient admission
            # (assumed: sampled in the same way as length of stay)
            next_admission = random.expovariate(1 / interarrival_time)
            yield self.env.timeout(next_admission)

        return

    def run(self):
        """
        Controls the main model run. Initialises model and patient arrival and
        audit processes. Instigates the run. At end of run calls for an audit
        summary and bed occupancy plot
        """

        # Set up hospital (calling Hospital class)
        self.hospital = Hospital()

        # Set up starting processes: new admissions and bed audit (with delay)
        # (the new_admission call is assumed; it is missing from the listing)
        self.env.process(self.new_admission(g.inter_arrival_time, g.los))
        self.env.process(self.audit_beds(delay=20))

        # Start model run
        self.env.run(until=g.sim_duration)

        # At end of run call for bed audit summary and bed occupancy plot
        self.hospital.build_audit_report()
        self.hospital.chart()

        return

    def spell_gen(self, p):
        """
        Patient hospital stay generator. Increment bed count, wait for patient
        length of stay to complete, then decrement bed count and remove patient
        from hospital patient dictionary

        :param p: patient object (contains length of stay for patient)
        """

        # Increment bed count
        self.hospital.bed_count += 1

        # Wait for patient length of stay to complete
        yield self.env.timeout(p.los)

        # Decrement bed count and remove patient from hospital patient
        # dictionary
        self.hospital.bed_count -= 1
        del self.hospital.patients[p.id]
        return

class Patient:
    """
    Patient class. Contains patient id and length of stay (it could contain
    other info about patient, such as priority or clinical group).

    The only method is __init__ for creating a patient (with assignment of
    patient id and length of stay).
    """

    def __init__(self, patient_id, los):
        """
        Constructor for new patient.

        :param patient_id: id of patient (set in self.new_admission)
        :param los: length of stay (days, set in self.new_admission)
        """
        self.id = patient_id
        self.los = los

        return


def main():
    """
    Code entry point after: if __name__ == '__main__'
    Creates model object, and runs model
    """

    model = Model()
    model.run()

    return


# Code entry point. Calls main method.
if __name__ == '__main__':
    main()

```
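The audit report in the model stores percentiles using the pandas `quantile` method. The sketch below shows what that call returns for a small set of made-up bed counts (the numbers are illustrative and not output from the model).

```python
import pandas as pd

# Illustrative daily bed counts (not model output)
audit = pd.DataFrame({'Occupied_beds': [8, 10, 12, 9, 11]})

median_beds = audit['Occupied_beds'].quantile(0.5)
beds_5_percent = audit['Occupied_beds'].quantile(0.05)
beds_95_percent = audit['Occupied_beds'].quantile(0.95)
print(median_beds, beds_5_percent, beds_95_percent)
```

Assigning one of these scalar values to a DataFrame column, as the model does, repeats the value on every row, which is why the percentiles plot as horizontal lines.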

# 81. Distribution fitting to data

SciPy has over 80 distributions that may be used to either generate data or test for fitting of existing data. In this example we will test for fit against ten distributions and plot the best three fits. For a full list of distributions see:

https://docs.scipy.org/doc/scipy/reference/stats.html
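As a quick illustration of both uses, the sketch below generates data from a gamma distribution and then fits the same family back to it. The shape and scale values, the sample size and the choice to fix the location at zero (`floc=0`) are assumptions made for this example only.

```python
import scipy.stats

# Generate data from a gamma distribution with illustrative parameters
true_shape, true_scale = 2.0, 3.0
data = scipy.stats.gamma.rvs(true_shape, loc=0, scale=true_scale,
                             size=5000, random_state=0)

# Fit the same family back to the data (location fixed at zero)
shape, loc, scale = scipy.stats.gamma.fit(data, floc=0)
print(shape, loc, scale)
```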

In this example we’ll take the first feature (column) from the Wisconsin Breast Cancer data set and identify a statistical distribution that can approximate the observed distribution.

As usual we will start by loading general modules used, and load our data (selecting the first column for our ‘y’, the data to be fitted).

```
import pandas as pd
import numpy as np
import scipy
from sklearn.preprocessing import StandardScaler
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

# Load data and select first column
# (the load_breast_cancer call is assumed; it is missing from the listing)
from sklearn import datasets
data_set = datasets.load_breast_cancer()
y = data_set.data[:, 0]

# Create an index array (x) for data

x = np.arange(len(y))
size = len(y)
```

## Visualise the data, and show descriptive summary

Before we do any fitting of distributions, it’s always good to do a simple visualisation of the data, and show descriptive statistics. We may, for example, decide to perform some outlier removal if we think that necessary (if there appear to be data points that don’t belong to the rest of the population).

```
plt.hist(y)
plt.show()
```

If we put the data in a Pandas DataFrame we can use the Pandas describe method to show a summary.

```
y_df = pd.DataFrame(y, columns=['Data'])
y_df.describe()
```

Out:

| | Data |
|---|---|
| count | 569.000000 |
| mean | 14.127292 |
| std | 3.524049 |
| min | 6.981000 |
| 25% | 11.700000 |
| 50% | 13.370000 |
| 75% | 15.780000 |
| max | 28.110000 |

## Fitting a range of distributions and testing for goodness of fit

This method will fit a number of distributions to our data, compare goodness of fit with a chi-squared value, and test for significant difference between observed and fitted distribution with a Kolmogorov-Smirnov test.

The chi-squared value bins data into 50 bins (this could be reduced for smaller data sets) based on percentiles so that each bin contains approximately an equal number of values. For each fitted distribution the expected count of values in each bin is predicted from the distribution. The chi-squared value is the sum of the relative squared error for each bin, such that:

chi-squared = sum(((observed - predicted) ** 2) / predicted)

For the observed and predicted we will use the cumulative sum of observed and predicted frequency across the bin range used. Note that the code below divides by the cumulative observed frequency rather than the predicted frequency.

The lower the chi-squared value the better the fit.
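A small worked example of the formula, using made-up counts for five bins, is sketched below. It shows both the per-bin version and the cumulative version described above; the counts are illustrative only.

```python
import numpy as np

observed = np.array([12, 9, 11, 8, 10])      # illustrative bin counts
predicted = np.array([10, 10, 10, 10, 10])   # counts expected from a fit

# Per-bin chi-squared
chi_squared = np.sum((observed - predicted) ** 2 / predicted)

# Cumulative version: compare running totals across the bins
cum_observed = np.cumsum(observed)
cum_predicted = np.cumsum(predicted)
chi_squared_cum = np.sum((cum_observed - cum_predicted) ** 2 / cum_predicted)

print(chi_squared, chi_squared_cum)
```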

The Kolmogorov-Smirnov test assumes that data has been standardised: that is, the mean is subtracted from all data (so the data becomes centred around zero), and the resulting values are divided by the standard deviation (so all data becomes expressed as the number of standard deviations above or below the mean). A P-value greater than 0.05 means that the fitted distribution is not significantly different to the observed distribution of the data.
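The two steps can be sketched on synthetic data as follows. The normally distributed sample, its parameters and the seed are assumptions for illustration; the standardisation here is done directly with NumPy.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=500)  # illustrative data

# Standardise: subtract the mean, divide by the standard deviation
data_std = (data - data.mean()) / data.std()

# Compare the standardised data with a standard normal distribution
statistic, p_value = scipy.stats.kstest(data_std, 'norm')
print(statistic, p_value)
```

`kstest` returns the test statistic first and the P-value second, so it is the second item that is compared with 0.05.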

It is worth noting that statistical distributions are theoretical models of real-world data. Statistical distributions offer a good way of approximating data (and simplifying huge amounts of data into a few parameters). But when you have a large set of real-world data it is not surprising to find that no theoretical distribution fits the data perfectly. Having the Kolmogorov-Smirnov tests for all distributions produce results of P<0.05 (fitted distribution is statistically different to the observed data distribution) is not unusual for large data sets. In that case, in modelling we are generally happy to continue with a fit that looks ‘reasonable’, being aware this is one of the simplifications present in any model.

Let’s first standardise the data using sklearn’s StandardScaler:

```
sc = StandardScaler()
yy = y.reshape(-1, 1)
sc.fit(yy)
y_std = sc.transform(yy)
y_std = y_std.flatten()
y_std
del yy
```

Now we will fit 10 different distributions, rank them by the approximate chi-squared goodness of fit, and report the Kolmogorov-Smirnov (KS) P value results. Remember that we want chi-squared to be as low as possible, and ideally we want the KS P-value to be >0.05.

Python may report warnings while running the distributions.

```
# Turn off code warnings (this is not recommended for routine use)
import warnings
warnings.filterwarnings("ignore")

# Set up list of candidate distributions to use
# See https://docs.scipy.org/doc/scipy/reference/stats.html for more

dist_names = ['beta',
              'expon',
              'gamma',
              'lognorm',
              'norm',
              'pearson3',
              'triang',
              'uniform',
              'weibull_min',
              'weibull_max']

# Set up empty lists to store results
chi_square = []
p_values = []

# Set up 50 bins for chi-square test
# Observed data will be approximately evenly distributed across all bins
percentile_bins = np.linspace(0, 100, 51)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = np.histogram(y_std, bins=percentile_cutoffs)
cum_observed_frequency = np.cumsum(observed_frequency)

# Loop through candidate distributions

for distribution in dist_names:
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)

    # Obtain the KS test P-value (the second item returned by kstest) and
    # round it to 5 decimal places
    p = scipy.stats.kstest(y_std, distribution, args=param)[1]
    p = np.around(p, 5)
    p_values.append(p)

    # Get expected counts in percentile bins
    # This is based on a 'cumulative distribution function' (cdf)
    cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2],
                          scale=param[-1])
    expected_frequency = []
    for i in range(len(percentile_bins) - 1):
        expected_cdf_area = cdf_fitted[i + 1] - cdf_fitted[i]
        expected_frequency.append(expected_cdf_area)

    # Calculate chi-squared
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) /
             cum_observed_frequency)
    chi_square.append(ss)

# Collate results and sort by goodness of fit (best at top)

results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square
results['p_value'] = p_values
results.sort_values(['chi_square'], inplace=True)

# Report results

print('\nDistributions sorted by goodness of fit:')
print('----------------------------------------')
print(results)
```
```Distributions sorted by goodness of fit:
----------------------------------------
Distribution    chi_square  p_value
3      lognorm     30.426685  0.17957
2        gamma     44.960532  0.06151
5     pearson3     44.961716  0.06152
0         beta     48.102181  0.06558
4         norm    292.430764  0.00000
6       triang    532.742597  0.00000
7      uniform   2150.560693  0.00000
1        expon   5701.366858  0.00000
9  weibull_max  10452.188968  0.00000
8  weibull_min  12002.386769  0.00000
```
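The ranking above uses a sum of squared differences between cumulative expected and cumulative observed counts. For comparison, here is a minimal sketch of a textbook Pearson chi-squared statistic calculated on per-bin (non-cumulative) counts. It is not the method used in this post: the synthetic `sample`, the choice of a normal distribution, and all variable names are assumptions made for illustration only.

```python
import numpy as np
import scipy.stats

# Synthetic data, purely for illustration (not the data set used in this post)
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)

# 50 percentile bins, so observed counts are roughly equal in every bin
cutoffs = np.percentile(sample, np.linspace(0, 100, 51))
observed, _ = np.histogram(sample, bins=cutoffs)

# Fit a candidate distribution and get the expected count in each bin
dist = scipy.stats.norm
params = dist.fit(sample)
expected = np.diff(dist.cdf(cutoffs, *params)) * len(sample)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
pearson_chi_square = np.sum((observed - expected) ** 2 / expected)
print('Pearson chi-squared:', pearson_chi_square)
```

As with the cumulative version, a lower value suggests a closer fit, but the two statistics are on different scales and should not be compared with each other.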

We will now take the top three fits, plot each fit, and return the fitted SciPy distribution parameters. This time we will fit to the raw data rather than the standardised data.

```# Divide the observed data into 100 bins for plotting (this can be changed)
number_of_bins = 100
bin_cutoffs = np.linspace(np.percentile(y,0), np.percentile(y,99),number_of_bins)

# Create the plot
h = plt.hist(y, bins = bin_cutoffs, color='0.75')

# Get the top three distributions from the previous phase
number_distributions_to_plot = 3
dist_names = results['Distribution'].iloc[0:number_distributions_to_plot]

# Create an empty list to store fitted distribution parameters
parameters = []

# Loop through the distributions to get line fit and parameters

for dist_name in dist_names:
    # Set up distribution and store distribution parameters
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    parameters.append(param)

    # Get line for each distribution (and scale to match observed data)
    # h[0] holds the histogram counts and h[1] the bin edges
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    scale_pdf = np.trapz(h[0], h[1][:-1]) / np.trapz(pdf_fitted, x)
    pdf_fitted *= scale_pdf

    # Add the line to the plot
    plt.plot(pdf_fitted, label=dist_name)

# Set the plot x axis to contain 99% of the data
# This can be removed, but sometimes outlier data makes the plot less clear
plt.xlim(0,np.percentile(y,99))

# Add legend and display plot

plt.legend()
plt.show()

# Store distribution parameters in a dataframe (this could also be saved)
dist_parameters = pd.DataFrame()
dist_parameters['Distribution'] = (
results['Distribution'].iloc[0:number_distributions_to_plot])
dist_parameters['Distribution parameters'] = parameters

# Print parameter results
print ('\nDistribution parameters:')
print ('------------------------')

for index, row in dist_parameters.iterrows():
    print ('\nDistribution:', row['Distribution'])
    print ('Parameters:', row['Distribution parameters'])
```
```Distribution parameters:
------------------------

Distribution: lognorm
Parameters: (0.3411670333611477, 4.067737189292493, 9.490709944326486)

Distribution: gamma
Parameters: (5.252232022713325, 6.175162625863668, 1.5140473798580563)

Distribution: pearson3
Parameters: (0.8726704680754525, 14.127306804909308, 3.4698385545042782)
```
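Once the parameters are known, they can be passed back to the same SciPy distribution to generate new samples. The following is a minimal sketch using the lognorm parameters printed above, read as (shape, loc, scale); the variable names, the sample size, and the fixed `random_state` are assumptions made for illustration.

```python
import scipy.stats

# Fitted lognorm parameters copied from the output above: (shape, loc, scale)
shape, loc, scale = (0.3411670333611477, 4.067737189292493, 9.490709944326486)

# 'Freeze' the distribution with those parameters
fitted_dist = scipy.stats.lognorm(shape, loc=loc, scale=scale)

# Draw samples (random_state is fixed only to make this sketch repeatable)
samples = fitted_dist.rvs(size=10000, random_state=42)

print('Sample mean:', samples.mean())
print('Distribution mean:', fitted_dist.mean())
```

The same pattern should work for the other fitted distributions, as each parameter tuple ends with loc and scale, preceded by any shape parameters.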

## qq and pp plots

qq and pp plots are two ways of showing how well a distribution fits data, other than plotting the distribution on top of a histogram of values (as used above).

Our intention here is not to describe the basis of the plots, but to show how to plot them in Python. In very basic terms they both describe how observed data is distributed compared with a theoretical distribution. If the fit is perfect then the data will appear as a perfect diagonal line where the x and y values are the same (for standardised data, as we use here).
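As a rough illustration of the idea behind a qq plot, the sketch below uses `scipy.stats.probplot` to pair theoretical quantiles with ordered observed values without drawing anything. This is an alternative to the hand-built plots in this post, not the method used here; the synthetic `sample`, the lognormal choice, and fixing the location at zero with `floc=0` are all assumptions made for the example.

```python
import numpy as np
import scipy.stats

# Synthetic right-skewed data, purely for illustration
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Fit a lognormal distribution (location fixed at zero to keep the fit simple)
shape, loc, scale = scipy.stats.lognorm.fit(sample, floc=0)

# probplot returns theoretical quantiles, ordered observed values, and a
# least-squares line through them (slope, intercept, correlation r)
(theoretical_q, observed_q), (slope, intercept, r) = scipy.stats.probplot(
    sample, sparams=(shape, loc, scale), dist='lognorm')

print('Correlation between theoretical and observed quantiles:', r)
```

A correlation close to 1, with slope near 1 and intercept near 0, corresponds to points lying close to the diagonal line described above.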

Here is the code for the plots.

```## qq and pp plots

data = y_std.copy()
data.sort()

# Loop through selected distributions (as previously selected)

for distribution in dist_names:
    # Set up distribution
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)

    # Get random numbers from distribution
    norm = dist.rvs(*param[0:-2],loc=param[-2], scale=param[-1],size = size)
    norm.sort()

    # Create figure
    fig = plt.figure(figsize=(8,5))

    # qq plot
    ax1 = fig.add_subplot(121) # Grid of 1 row x 2 columns, this is subplot 1
    ax1.plot(norm,data,"o")
    min_value = np.floor(min(min(norm),min(data)))
    max_value = np.ceil(max(max(norm),max(data)))
    ax1.plot([min_value,max_value],[min_value,max_value],'r--')
    ax1.set_xlim(min_value,max_value)
    ax1.set_xlabel('Theoretical quantiles')
    ax1.set_ylabel('Observed quantiles')
    title = 'qq plot for ' + distribution +' distribution'
    ax1.set_title(title)

    # pp plot

    # Calculate cumulative distributions
    bins = np.percentile(norm,range(0,101))
    data_counts, bins = np.histogram(data,bins)
    norm_counts, bins = np.histogram(norm,bins)
    cum_data = np.cumsum(data_counts)
    cum_norm = np.cumsum(norm_counts)
    cum_data = cum_data / max(cum_data)
    cum_norm = cum_norm / max(cum_norm)

    # plot
    ax2 = fig.add_subplot(122) # Grid of 1 row x 2 columns, this is subplot 2
    ax2.plot(cum_norm,cum_data,"o")
    min_value = np.floor(min(min(cum_norm),min(cum_data)))
    max_value = np.ceil(max(max(cum_norm),max(cum_data)))
    ax2.plot([min_value,max_value],[min_value,max_value],'r--')
    ax2.set_xlim(min_value,max_value)
    ax2.set_xlabel('Theoretical cumulative distribution')
    ax2.set_ylabel('Observed cumulative distribution')
    title = 'pp plot for ' + distribution +' distribution'
    ax2.set_title(title)

# Display plot    