120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently a useful distribution for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew which may frequently occur when some clinical process step has some additional complexity to it compared to the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation that was calculated from the original non-logged population into the mu and sigma of the underlying log normal population.

(For maximum computation effuiciency, when calling the function repeatedly using the same mean and standard deviation, you may wish to split this into two functions – one to calculate mu and sigma which needs only calling once, and the other to sample from the log normal distribution given mu and sigma).

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

import numpy as np

def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calaculated from the original non-logged population.
    
    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations desribed at:
        https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-
        with-specified-mean-and-variance.html
        
    Returns a numpy array of floats if n > 1, otherwise return a float
    """
    
    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    
    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma , n)
    
    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop
        
    return generated_pop

Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and test the resulting generated population has the same mean and standard deviation.

mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print ('Mean:', generated_pop.mean())
print ('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651

Plot a histogram of the generated population:

import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0,51,1)
plt.hist(generated_pop, bins=bins)
plt.show()

Generating a single sample

The function will return a single number if no n is given in the function call:

print (generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125

115. A short function to replace (impute) missing numerical data in Pandas DataFrames with median of column values

When we import data into NumPy or Pandas, any empty cells of numerical data will be labelled np.NaN on import. In techniques such as machine learning we may wish to either 1) remove rows with any missing data, or 2) fill in the missing data with a set value, often the median of all other values in that data column. The latter has an advantage that the technique can be used both in training the machine learning model, and in predicting output when we are given examples with some missing data.

Here we define a function that goes through data columns in a Pandas DataFrame, looks to see if there is any missing data and, of there is, replaces np.NaN with the median of all other values in that data column.

import pandas as pd
import numpy as np

def impute_with_median (df):
    """Iterate through columns of Pandas DataFrame.
    Where NaNs exist replace with median"""
    
    # Get list of DataFrame column names
    cols = list(df)
    # Loop through columns
    for column in cols:
        # Transfer column to independent series
        col_data = df[column]
        # Look to see if there is any missing numerical data
        missing_data = sum(col_data.isna())
        if missing_data > 0:
            # Get median and replace missing numerical data with median
            col_median = col_data.median()
            col_data.fillna(col_median, inplace=True)
            df[column] = col_data
    return df   

We will mimic importing data with missing numerical data.

name = ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom']
colour = ['red', 'red', 'red', 'blue', 'red', 'blue']
age = [23, 45, np.NaN, 21, 18, 20]
height = [1.80, np.NaN, 1.65, 1.71, 1.61, 1.76] 

data =pd.DataFrame()
data['name'] = name
data['colour'] = colour
data['age'] = age
data['height'] = height

View the data with missing values.

print (data)

Out:

 	name 	colour 	age 	height
0 	Bob 	red 	23.0 	1.80
1 	Jim 	red 	45.0 	NaN
2 	Anne 	red 	NaN 	1.65
3 	Rosie 	blue 	21.0 	1.71
4 	Ben 	red 	18.0 	1.61
5 	Tom 	blue 	20.0 	1.76

Call the function to replace missing data with the median, and re-examine data.

data = impute_with_median(data)
print (data)

Out:

 	name 	colour 	age 	height
0 	Bob 	red 	23.0 	1.80
1 	Jim 	red 	45.0 	1.71
2 	Anne 	red 	21.0 	1.65
3 	Rosie 	blue 	21.0 	1.71
4 	Ben 	red 	18.0 	1.61
5 	Tom 	blue 	20.0 	1.76

112. Splitting data set into training and test sets using Pandas DataFrames methods

Note: this may also be performed using SciKit-Learn train_test_split method, but here we will use native Pandas methods.

Create a DataFrame

# Create pandas data frame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]
passed_physical = [0, 1, 1, 1, 0, 1, 1, 0]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height
people['passed'] = passed_physical

print(people)

Out:

   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Split training and test sets

Here we take a random sample (25%) of rows and remove them from the original data by dropping index values.

# Create a copy of the DataFrame to work from
# Omit random state to have different random split each run

people_copy = people.copy()
train_set = people_copy.sample(frac=0.75, random_state=0)
test_set = people_copy.drop(train_set.index)

print ('Training set')
print (train_set)
print ('\nTest set')
print (test_set)
print ('\nOriginal DataFrame')
print (people)

Out:

Training set
   name  age gender  height  passed
6  Carl   21      m    1.70       1
2   Bob   18      m    1.70       1
1  Bill   34      m    1.85       1
7  Toni    8      f    1.68       0
3   Ian   34      m    1.75       1
0   Sam   22      f    1.64       0

Test set
   name  age gender  height  passed
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1

Original DataFrame
   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Use ‘pop’ to extract the labels

‘Pop’ will remove a column from the DataFrame, and transfer it to a new variable.

train_set_labels = train_set.pop('passed')
test_set_labels = test_set.pop('passed')

Training set
   name  age gender  height
6  Carl   21      m    1.70
2   Bob   18      m    1.70
1  Bill   34      m    1.85
7  Toni    8      f    1.68
3   Ian   34      m    1.75
0   Sam   22      f    1.64

Out:

Training set
   name  age gender  height
6  Carl   21      m    1.70
2   Bob   18      m    1.70
1  Bill   34      m    1.85
7  Toni    8      f    1.68
3   Ian   34      m    1.75
0   Sam   22      f    1.64

Training set label (y)
6    1
2    1
1    1
7    0
3    1
0    0
Name: passed, dtype: int64

111. Using ‘pop’ to remove a Pandas DataFrame column and transfer to a new variable

Sometimes we may want to remove a column from a DataFrame, but at the same time transfer that column to a new variable to perform some work on it. An example is re-coding a column as shown below where we will convert a text male/female column into a number 0/1 male column.

Create a DataFrame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height

print(people)

Out:

   name  age gender  height
0   Sam   22      f    1.64
1  Bill   34      m    1.85
2   Bob   18      m    1.70
3   Ian   34      m    1.75
4    Jo   76      f    1.63
5  Anne   54      f    1.79
6  Carl   21      m    1.70
7  Toni    8      f    1.68

Pop a column (to code differently)

# Pop column
people_gender = people.pop('gender') # extracts and removes gender

# Recode (using == gves True/False, but in Python that also has numerical values of 1/0)

male = (people_gender == 'm') * 1 # 'm' is true is converted to number

# Put new column into DataFrame and print
people['male'] = male
print (people)

Out:

   name  age  height  male
0   Sam   22    1.64     0
1  Bill   34    1.85     1
2   Bob   18    1.70     1
3   Ian   34    1.75     1
4    Jo   76    1.63     0
5  Anne   54    1.79     0
6  Carl   21    1.70     1
7  Toni    8    1.68     0

109. Saving intact Pandas DataFrames using ‘pickle’

Sometimes a DataFrame may have content in it that will not save well in text (e.g. csv) format. For example a DataFrame may contain lists, and these will be saved as a text string in a text format.

Python has a library (pickle) for saving Python objects intact so that they may saved and loaded without having to generate them again.

Pandas has built in ‘pckling’ capability which makes it very easy to save and load intact dataframes.

Let’s first generate a dataframe that contains lists.

import pandas as pd

my_df = pd.DataFrame()

names = ['Bob', 'Sam', 'Jo', 'Bill']

favourite_sports = [['Tennis', 'Motorsports'],
                   ['Football', 'Rugby', 'Hockey'],
                   ['Table tennis', 'Swimming', 'Athletics'],
                   ['Eating cheese']]

my_df['name'] = names
my_df['favourite_sport'] = favourite_sports

print(my_df)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load DataFrame using Pandas built in pickle methods (recommended)

# Save DataFrame to pickle object
my_df.to_pickle('test_df.p')

# Load DataFrame with pickle object
test_df_load_1 = pd.read_pickle('test_df.p')

print (test_df_load_1)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load DataFrame using standard Python pickle library

With DataFrames you will probably always want to use the df.ti_pickle and pd.read_pickle methods for ease. But below is an example of using the Python pickle library – this method can be used with other types of complex Python objects (such as trained machine learning models) as well.

import pickle

# Save using pickle 
# (the b in rb denotes binary mode which is required for more complex obecjtes)
filename = 'pickled_df.p'
with open(filename, 'wb') as filehandler:
    pickle.dump(my_df, filehandler)

# Load using pickle
filename = 'pickled_df.p'
with open(filename, 'rb') as filehandler: 
    reloaded_df = pickle.load(filehandler)

print (reloaded_df)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]