115. A short function to replace (impute) missing numerical data in Pandas DataFrames with median of column values

When we import data into NumPy or Pandas, any empty cells of numerical data will be labelled np.NaN on import. In techniques such as machine learning we may wish to either 1) remove rows with any missing data, or 2) fill in the missing data with a set value, often the median of all other values in that data column. The latter has an advantage that the technique can be used both in training the machine learning model, and in predicting output when we are given examples with some missing data.

Here we define a function that goes through data columns in a Pandas DataFrame, looks to see if there is any missing data and, of there is, replaces np.NaN with the median of all other values in that data column.

import pandas as pd
import numpy as np

def impute_with_median (df):
    """Iterate through columns of Pandas DataFrame.
    Where NaNs exist replace with median"""
    
    # Get list of DataFrame column names
    cols = list(df)
    # Loop through columns
    for column in cols:
        # Transfer column to independent series
        col_data = df[column]
        # Look to see if there is any missing numerical data
        missing_data = sum(col_data.isna())
        if missing_data > 0:
            # Get median and replace missing numerical data with median
            col_median = col_data.median()
            col_data.fillna(col_median, inplace=True)
            df[column] = col_data
    return df   

We will mimic importing data with missing numerical data.

name = ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom']
colour = ['red', 'red', 'red', 'blue', 'red', 'blue']
age = [23, 45, np.NaN, 21, 18, 20]
height = [1.80, np.NaN, 1.65, 1.71, 1.61, 1.76] 

data =pd.DataFrame()
data['name'] = name
data['colour'] = colour
data['age'] = age
data['height'] = height

View the data with missing values.

print (data)

Out:

 	name 	colour 	age 	height
0 	Bob 	red 	23.0 	1.80
1 	Jim 	red 	45.0 	NaN
2 	Anne 	red 	NaN 	1.65
3 	Rosie 	blue 	21.0 	1.71
4 	Ben 	red 	18.0 	1.61
5 	Tom 	blue 	20.0 	1.76

Call the function to replace missing data with the median, and re-examine data.

data = impute_with_median(data)
print (data)

Out:

 	name 	colour 	age 	height
0 	Bob 	red 	23.0 	1.80
1 	Jim 	red 	45.0 	1.71
2 	Anne 	red 	21.0 	1.65
3 	Rosie 	blue 	21.0 	1.71
4 	Ben 	red 	18.0 	1.61
5 	Tom 	blue 	20.0 	1.76

112. Splitting data set into training and test sets using Pandas DataFrames methods

Note: this may also be performed using SciKit-Learn train_test_split method, but here we will use native Pandas methods.

Create a DataFrame

# Create pandas data frame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]
passed_physical = [0, 1, 1, 1, 0, 1, 1, 0]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height
people['passed'] = passed_physical

print(people)

Out:

   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Split training and test sets

Here we take a random sample (25%) of rows and remove them from the original data by dropping index values.

# Create a copy of the DataFrame to work from
# Omit random state to have different random split each run

people_copy = people.copy()
train_set = people_copy.sample(frac=0.75, random_state=0)
test_set = people_copy.drop(train_set.index)

print ('Training set')
print (train_set)
print ('\nTest set')
print (test_set)
print ('\nOriginal DataFrame')
print (people)

Out:

Training set
   name  age gender  height  passed
6  Carl   21      m    1.70       1
2   Bob   18      m    1.70       1
1  Bill   34      m    1.85       1
7  Toni    8      f    1.68       0
3   Ian   34      m    1.75       1
0   Sam   22      f    1.64       0

Test set
   name  age gender  height  passed
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1

Original DataFrame
   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Use ‘pop’ to extract the labels

‘Pop’ will remove a column from the DataFrame, and transfer it to a new variable.

train_set_labels = train_set.pop('passed')
test_set_labels = test_set.pop('passed')

Training set
   name  age gender  height
6  Carl   21      m    1.70
2   Bob   18      m    1.70
1  Bill   34      m    1.85
7  Toni    8      f    1.68
3   Ian   34      m    1.75
0   Sam   22      f    1.64

Out:

Training set
   name  age gender  height
6  Carl   21      m    1.70
2   Bob   18      m    1.70
1  Bill   34      m    1.85
7  Toni    8      f    1.68
3   Ian   34      m    1.75
0   Sam   22      f    1.64

Training set label (y)
6    1
2    1
1    1
7    0
3    1
0    0
Name: passed, dtype: int64

111. Using ‘pop’ to remove a Pandas DataFrame column and transfer to a new variable

Sometimes we may want to remove a column from a DataFrame, but at the same time transfer that column to a new variable to perform some work on it. An example is re-coding a column as shown below where we will convert a text male/female column into a number 0/1 male column.

Create a DataFrame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height

print(people)

Out:

   name  age gender  height
0   Sam   22      f    1.64
1  Bill   34      m    1.85
2   Bob   18      m    1.70
3   Ian   34      m    1.75
4    Jo   76      f    1.63
5  Anne   54      f    1.79
6  Carl   21      m    1.70
7  Toni    8      f    1.68

Pop a column (to code differently)

# Pop column
people_gender = people.pop('gender') # extracts and removes gender

# Recode (using == gves True/False, but in Python that also has numerical values of 1/0)

male = (people_gender == 'm') * 1 # 'm' is true is converted to number

# Put new column into DataFrame and print
people['male'] = male
print (people)

Out:

   name  age  height  male
0   Sam   22    1.64     0
1  Bill   34    1.85     1
2   Bob   18    1.70     1
3   Ian   34    1.75     1
4    Jo   76    1.63     0
5  Anne   54    1.79     0
6  Carl   21    1.70     1
7  Toni    8    1.68     0

109. Saving intact Pandas DataFrames using ‘pickle’

Sometimes a DataFrame may have content in it that will not save well in text (e.g. csv) format. For example a DataFrame may contain lists, and these will be saved as a text string in a text format.

Python has a library (pickle) for saving Python objects intact so that they may saved and loaded without having to generate them again.

Pandas has built in ‘pckling’ capability which makes it very easy to save and load intact dataframes.

Let’s first generate a dataframe that contains lists.

import pandas as pd

my_df = pd.DataFrame()

names = ['Bob', 'Sam', 'Jo', 'Bill']

favourite_sports = [['Tennis', 'Motorsports'],
                   ['Football', 'Rugby', 'Hockey'],
                   ['Table tennis', 'Swimming', 'Athletics'],
                   ['Eating cheese']]

my_df['name'] = names
my_df['favourite_sport'] = favourite_sports

print(my_df)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load DataFrame using Pandas built in pickle methods (recommended)

# Save DataFrame to pickle object
my_df.to_pickle('test_df.p')

# Load DataFrame with pickle object
test_df_load_1 = pd.read_pickle('test_df.p')

print (test_df_load_1)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load DataFrame using standard Python pickle library

With DataFrames you will probably always want to use the df.ti_pickle and pd.read_pickle methods for ease. But below is an example of using the Python pickle library – this method can be used with other types of complex Python objects (such as trained machine learning models) as well.

import pickle

# Save using pickle 
# (the b in rb denotes binary mode which is required for more complex obecjtes)
filehandler = open('test_df_2.p', 'wb')
pickle.dump(my_df, filehandler)

# Load using pickle
filehandler = open('test_df_2.p','rb')
test_df_load_2 = pickle.load(filehandler)

print (test_df_load_2)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]