# 81. Distribution fitting to data

SciPy has over 80 distributions that may be used to either generate data or test for fitting of existing data. In this example we will test for fit against ten distributions and plot the best three fits. For a full list of distributions see:

https://docs.scipy.org/doc/scipy/reference/stats.html

In this example we’ll take the first feature (column) from the Wisconsin Breast Cancer data set and identify a statistical distribution that can approximate the observed distribution.

As usual, we will start by loading the general modules used, and then load our data (selecting the first column as our ‘y’, the data to be fitted).

```
import pandas as pd
import numpy as np
import scipy
import scipy.stats
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

# Load data and select first column
from sklearn import datasets
data_set = datasets.load_breast_cancer()
y = data_set.data[:, 0]

# Create an index array (x) for data
x = np.arange(len(y))
size = len(y)
```

## Visualise the data and show a descriptive summary

Before we do any fitting of distributions, it’s always good to do a simple visualisation of the data and show descriptive statistics. We may, for example, decide to perform some outlier removal if there appear to be data points that don’t belong with the rest of the population (a simple percentile-based sketch of this is shown after the summary below).

```
plt.hist(y)
plt.show()
```

If we put the data in a Pandas DataFrame we can use the describe() method to show a summary.

```
y_df = pd.DataFrame(y, columns=['Data'])
y_df.describe()
```
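
If the histogram or summary statistics did suggest values that do not belong with the rest of the population, one simple option (a minimal sketch only, not applied in the rest of this example) would be to trim the extreme percentiles before fitting:

```
# Hypothetical percentile-based trimming (not applied in this example)
lower, upper = np.percentile(y, [1, 99])
y_trimmed = y[(y >= lower) & (y <= upper)]
print('Original size:', len(y), ' Trimmed size:', len(y_trimmed))
```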

## Fitting a range of distributions and testing for goodness of fit

This method will fit a number of distributions to our data, compare goodness of fit with a chi-squared value, and test for significant difference between observed and fitted distribution with a Kolmogorov-Smirnov test.

For the chi-squared value, the data is binned into 50 bins (this could be reduced for smaller data sets), based on percentiles so that each bin contains approximately an equal number of values. For each fitted distribution the expected count of values in each bin is predicted from the distribution. The chi-squared value is then the sum of the relative squared errors across bins, such that:

chi-squared = sum((observed - predicted) ** 2 / predicted)

For the observed and predicted values we use the cumulative sums of the observed and predicted frequencies across the bins.

The lower the chi-squared value the better the fit.
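
As a minimal toy illustration of the formula above (the cumulative counts below are made up purely to show the arithmetic):

```
# Toy example of the chi-squared calculation (made-up cumulative counts)
import numpy as np

cum_observed = np.array([12, 23, 37, 50])   # cumulative observed counts per bin
cum_expected = np.array([10, 25, 40, 50])   # cumulative expected counts per bin
chi_squared = np.sum((cum_observed - cum_expected) ** 2 / cum_expected)
print(chi_squared)
```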

The Kolmogorov-Smirnov test assumes that the data has been standardised: that is, the mean is subtracted from all data (so the data becomes centred around zero), and the resulting values are divided by the standard deviation (so all data is expressed as the number of standard deviations above or below the mean). A P-value greater than 0.05 means that the fitted distribution is not significantly different from the observed distribution of the data.
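
As a minimal, self-contained sketch of this interpretation (using randomly generated normal data purely for illustration, so the P-value should usually be well above 0.05):

```
import numpy as np
import scipy.stats

# Random standard normal data is, by construction, already standardised
sample = np.random.normal(size=500)
p_value = scipy.stats.kstest(sample, 'norm')[1]
print(p_value)   # > 0.05: not significantly different from a normal distribution
```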

It is worth noting that statistical distributions are theoretical models of real-world data. They offer a good way of approximating data (and of simplifying huge amounts of data into a few parameters). But when you have a large set of real-world data it is not surprising to find that no theoretical distribution fits it perfectly. Having the Kolmogorov-Smirnov tests for all distributions return P<0.05 (the fitted distribution is statistically different from the observed data distribution) is not unusual for large data sets. In that case, in modelling we are generally happy to continue with a fit that looks ‘reasonable’, being aware that this is one of the simplifications present in any model.

Let’s first standardise the data using sklearn’s StandardScaler:

```
sc = StandardScaler()
yy = y.reshape(-1, 1)
sc.fit(yy)
y_std = sc.transform(yy)
y_std = y_std.flatten()
y_std
del yy
```

Now we will fit 10 different distributions, rank them by the approximate chi-squared goodness of fit, and report the Kolmogorov-Smirnov (KS) P value results. Remember that we want chi-squared to be as low as possible, and ideally we want the KS P-value to be >0.05.

Python may report warnings while running the distributions.

```
# Set list of distributions to test
# See https://docs.scipy.org/doc/scipy/reference/stats.html for more

# Turn off code warnings (this is not recommended for routine use)
import warnings
warnings.filterwarnings("ignore")

# Set up list of candidate distributions to use
# See https://docs.scipy.org/doc/scipy/reference/stats.html for more
dist_names = ['beta',
              'expon',
              'gamma',
              'lognorm',
              'norm',
              'pearson3',
              'triang',
              'uniform',
              'weibull_min',
              'weibull_max']

# Set up empty lists to store results
chi_square = []
p_values = []

# Set up 50 bins for chi-square test
# Observed data will be approximately evenly distributed across all bins
percentile_bins = np.linspace(0, 100, 51)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = np.histogram(y_std, bins=percentile_cutoffs)
cum_observed_frequency = np.cumsum(observed_frequency)

# Loop through candidate distributions
for distribution in dist_names:
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)

    # Obtain the KS test P statistic, round it to 5 decimal places
    p = scipy.stats.kstest(y_std, distribution, args=param)[1]
    p = np.around(p, 5)
    p_values.append(p)

    # Get expected counts in percentile bins
    # This is based on the 'cumulative distribution function' (cdf)
    cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2],
                          scale=param[-1])
    expected_frequency = []
    for bin in range(len(percentile_bins) - 1):
        expected_cdf_area = cdf_fitted[bin + 1] - cdf_fitted[bin]
        expected_frequency.append(expected_cdf_area)

    # Calculate chi-squared
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) /
             cum_observed_frequency)
    chi_square.append(ss)

# Collate results and sort by goodness of fit (best at top)
results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square
results['p_value'] = p_values
results.sort_values(['chi_square'], inplace=True)

# Report results
print('\nDistributions sorted by goodness of fit:')
print('----------------------------------------')
print(results)
```

We will now take the top three fits, plot each fit, and return the SciPy distribution parameters. This time we will fit to the raw data rather than the standardised data.

```
# Divide the observed data into 100 bins for plotting (this can be changed)
number_of_bins = 100
bin_cutoffs = np.linspace(np.percentile(y, 0), np.percentile(y, 99), number_of_bins)

# Create the plot
h = plt.hist(y, bins=bin_cutoffs, color='0.75')

# Get the top three distributions from the previous phase
number_distributions_to_plot = 3
dist_names = results['Distribution'].iloc[0:number_distributions_to_plot]

# Create an empty list to store fitted distribution parameters
parameters = []

# Loop through the distributions to get line fit and parameters
for dist_name in dist_names:
    # Set up distribution and store distribution parameters
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    parameters.append(param)

    # Get line for each distribution (and scale to match observed data)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    scale_pdf = np.trapz(h[0], h[1][:-1]) / np.trapz(pdf_fitted, x)
    pdf_fitted *= scale_pdf

    # Add the line to the plot
    plt.plot(pdf_fitted, label=dist_name)

# Set the plot x axis to contain 99% of the data
# This can be removed, but sometimes outlier data makes the plot less clear
plt.xlim(0, np.percentile(y, 99))

# Add legend and display plot
plt.legend()
plt.show()

# Store distribution parameters in a dataframe (this could also be saved)
dist_parameters = pd.DataFrame()
dist_parameters['Distribution'] = (
    results['Distribution'].iloc[0:number_distributions_to_plot])
dist_parameters['Distribution parameters'] = parameters

# Print parameter results
print('\nDistribution parameters:')
print('------------------------')
for index, row in dist_parameters.iterrows():
    print('\nDistribution:', row[0])
    print('Parameters:', row[1])
```

## qq and pp plots

qq and pp plots are two ways of showing how well a distribution fits data, other than plotting the distribution on top of a histogram of values (as used above).

Our intention here is not to describe the basis of the plots, but to show how to plot them in Python. In very basic terms they both describe how observed data is distributed compared with a theoretical distribution. If the fit is perfect then the data will appear as a perfect diagonal line where the x and y values are the same (for standardised data, as we use here).

For more info on the plots see:

https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

https://en.wikipedia.org/wiki/P%E2%80%93P_plot
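
As an aside, if all you need is a quick qq plot against a single named distribution, SciPy’s `scipy.stats.probplot` does the quantile calculation and plotting in one call. A minimal sketch (reusing `y_std` from above, and shown here for the normal distribution; this is an alternative to, not part of, the manual approach below):

```
from scipy import stats
import matplotlib.pyplot as plt

# Quick qq plot of the standardised data against a normal distribution
stats.probplot(y_std, dist='norm', plot=plt)
plt.show()
```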

Here is the code for the plots.

```
## qq and pp plots

data = y_std.copy()
data.sort()

# Loop through selected distributions (as previously selected)
for distribution in dist_names:
    # Set up distribution
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)

    # Get random numbers from distribution
    norm = dist.rvs(*param[0:-2], loc=param[-2], scale=param[-1], size=size)
    norm.sort()

    # Create figure
    fig = plt.figure(figsize=(8, 5))

    # qq plot
    ax1 = fig.add_subplot(121)  # Grid of 1x2, this is subplot 1
    ax1.plot(norm, data, "o")
    min_value = np.floor(min(min(norm), min(data)))
    max_value = np.ceil(max(max(norm), max(data)))
    ax1.plot([min_value, max_value], [min_value, max_value], 'r--')
    ax1.set_xlim(min_value, max_value)
    ax1.set_xlabel('Theoretical quantiles')
    ax1.set_ylabel('Observed quantiles')
    title = 'qq plot for ' + distribution + ' distribution'
    ax1.set_title(title)

    # pp plot
    ax2 = fig.add_subplot(122)

    # Calculate cumulative distributions
    bins = np.percentile(norm, range(0, 101))
    data_counts, bins = np.histogram(data, bins)
    norm_counts, bins = np.histogram(norm, bins)
    cum_data = np.cumsum(data_counts)
    cum_norm = np.cumsum(norm_counts)
    cum_data = cum_data / max(cum_data)
    cum_norm = cum_norm / max(cum_norm)

    # plot
    ax2.plot(cum_norm, cum_data, "o")
    min_value = np.floor(min(min(cum_norm), min(cum_data)))
    max_value = np.ceil(max(max(cum_norm), max(cum_data)))
    ax2.plot([min_value, max_value], [min_value, max_value], 'r--')
    ax2.set_xlim(min_value, max_value)
    ax2.set_xlabel('Theoretical cumulative distribution')
    ax2.set_ylabel('Observed cumulative distribution')
    title = 'pp plot for ' + distribution + ' distribution'
    ax2.set_title(title)

    # Display plot
    plt.tight_layout(pad=4)
    plt.show()
```

# 80. Grouping unlabelled data with k-means clustering

k-means clustering is a form of ‘unsupervised learning’. This is useful for grouping unlabelled data. For example, in the Wisconsin breast cancer data set, what if we did not know whether the patients had cancer or not at the time the data was collected? Could we find groups of similar patients? These groups may then form the basis of some further investigation.

k-means clustering groups data in such a way that the total squared distance of all points to their closest cluster centre (the within-cluster sum of squares) is minimised.

Let’s start by importing the required modules and loading our data. In order to better visualise what k-means is doing we will limit our data to two features (though k-means clustering works with any number of features).

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load data
from sklearn import datasets
data_set = datasets.load_breast_cancer()
X=data_set.data[:,0:2] # restrict patient data to two features.
```

We will normalise our data:

```
sc=StandardScaler()
sc.fit(X)
X_std=sc.transform(X)
```

And now we will group patients into three clusters (k=3), and plot the centre of each cluster (each cluster centre has two co-ordinates, representing the normalised features we are using). If we used data with more than two features we would have a co-ordinate point for each feature.

```
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_std)
print('\nCluster centres:')
print(kmeans.cluster_centers_)
```

We can identify which cluster a sample has been put into with ‘kmeans.labels_’. Using that we may plot our clustering:

```
labels = kmeans.labels_
plt.scatter(X_std[:, 0], X_std[:, 1],
            c=labels, cmap=plt.cm.rainbow)
plt.xlabel('Normalised feature 1')
plt.ylabel('Normalised feature 2')
plt.show()
```

## How many clusters should be used?

Sometimes we may have prior knowledge that we want to group the data into a given number of clusters. Other times we may wish to investigate what may be a good number of clusters.

In the example below we look at changing the number of clusters between 1 and 100 and measure the average distance points are from their closest cluster centre (kmeans.transform gives us the distance for each sample to each cluster centre). Looking at the results we may decide that up to about 10 clusters may be useful, but after that there are diminishing returns of adding further clusters.

```
distance_to_closest_cluster_centre = []
for k in range(1, 100):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_std)
    # kmeans.transform gives the distance of each sample to each cluster centre
    distance = np.min(kmeans.transform(X_std), axis=1)
    average_distance = np.mean(distance)
    distance_to_closest_cluster_centre.append(average_distance)

clusters = np.arange(len(distance_to_closest_cluster_centre)) + 1
plt.plot(clusters, distance_to_closest_cluster_centre)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distance to closest cluster centroid')
plt.ylim(0, 1.5)
plt.show()
```
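
As a side note, scikit-learn also exposes the within-cluster sum of squared distances directly as `kmeans.inertia_` after fitting, which is a common alternative measure for this kind of ‘elbow’ plot. A minimal sketch, reusing `X_std` and the imports from above:

```
# Elbow plot using inertia (within-cluster sum of squared distances)
inertia = []
k_values = range(1, 20)
for k in k_values:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_std)
    inertia.append(kmeans.inertia_)
plt.plot(k_values, inertia)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```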


# 79. Reducing data complexity, and eliminating covariance, with principal component analysis

Principal component analysis (PCA) may be used for two purposes: reducing the complexity (dimensionality) of data, and eliminating covariance between features.

# 78. Changing sensitivity of machine learning algorithms and performing a receiver-operator characteristic curve

# 77. Machine learning: Visualising accuracy and error in a classification model with a confusion matrix

# 76. Matplotlib: Adding shaded areas to charts

*plt.fill_between* may be used to add shaded areas to charts. By using the alpha (transparency) argument shaded areas may overlap.

```
import matplotlib.pyplot as plt

# x values for the five points in each series (any five increasing values would do)
x = [1, 2, 3, 4, 5]

y1 = [60, 65, 65, 70, 75]
y1_max = [70, 73, 70, 78, 82]
y1_min = [55, 58, 61, 62, 68]

y2 = [45, 50, 55, 55, 55]
y2_max = [53, 55, 63, 60, 62]
y2_min = [35, 45, 52, 52, 50]

plt.plot(x, y1)
plt.plot(x, y2, linestyle='dashed')

# alpha adjusts transparency, higher alpha --> darker grey
# Or color could be set to, for example, '0.2', but using transparency allows
# overlapping shaded areas
plt.fill_between(x, y1_min, y1_max, color='k', alpha=0.1)
plt.fill_between(x, y2_min, y2_max, color='k', alpha=0.1)
plt.show()
```