When we import data into NumPy or Pandas, any empty cells of numerical data are labelled np.nan on import. In techniques such as machine learning we may wish to either 1) remove rows with any missing data, or 2) fill in the missing data with a set value, often the median of all other values in that data column. The latter has the advantage that the technique can be used both in training the machine learning model and in predicting output when we are given examples with some missing data.
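The two options can be sketched with standard Pandas calls. This is only a minimal illustration: the DataFrame name toy and its values are made up here and do not come from a real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({'age': [23, 45, np.nan, 21],
                    'height': [1.80, np.nan, 1.65, 1.71]})

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill each missing value with the median of its own column
filled = toy.fillna(toy.median())
```

Option 1 keeps only the two complete rows, whereas option 2 keeps all four rows and fills the gaps.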
Here we define a function that goes through the data columns in a Pandas DataFrame, looks to see if there is any missing data and, if there is, replaces np.nan with the median of all other values in that data column.
import pandas as pd
import numpy as np

def impute_with_median(df):
    """Iterate through columns of Pandas DataFrame.
    Where NaNs exist in a numeric column, replace with the column median"""

    # Get list of DataFrame column names
    cols = list(df)
    # Loop through columns
    for column in cols:
        # Transfer column to independent series
        col_data = df[column]
        # Look to see if there is any missing numerical data
        # (text columns are skipped because they have no median)
        missing_data = col_data.isna().sum()
        if missing_data > 0 and pd.api.types.is_numeric_dtype(col_data):
            # Get median and replace missing numerical data with median
            col_median = col_data.median()
            # Assign the filled column back rather than filling in place
            df[column] = col_data.fillna(col_median)
    return df
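For comparison, the same idea can be written without an explicit loop. The sketch below is not the function used in the rest of this post; the name impute_with_median_vectorised and the small example DataFrame are hypothetical. It relies on fillna accepting a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

def impute_with_median_vectorised(df):
    """Return a copy of df with NaNs in numeric columns replaced by column medians."""
    # numeric_only=True skips text columns, which have no median
    medians = df.median(numeric_only=True)
    # Passing a Series to fillna matches each median to its column by name
    return df.fillna(medians)

# Hypothetical example data, made up for illustration only
example = pd.DataFrame({'label': ['a', 'b', 'c'], 'value': [1.0, np.nan, 3.0]})
result = impute_with_median_vectorised(example)
```

Unlike the loop version, this sketch returns a new DataFrame and leaves the one passed in unchanged.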
We will mimic importing a data set in which some numerical values are missing.
name = ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom']
colour = ['red', 'red', 'red', 'blue', 'red', 'blue']
# np.nan marks the missing values (the np.NaN alias was removed in NumPy 2.0)
age = [23, 45, np.nan, 21, 18, 20]
height = [1.80, np.nan, 1.65, 1.71, 1.61, 1.76]
data = pd.DataFrame()
data['name'] = name
data['colour'] = colour
data['age'] = age
data['height'] = height
View the data with missing values.
print(data)
Out:
    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0     NaN
2   Anne    red   NaN    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76
Call the function to replace missing data with the median, and re-examine data.
data = impute_with_median(data)
print(data)
Out:
    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0    1.71
2   Anne    red  21.0    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76
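To use median imputation at prediction time as well as in training, one possible approach is to store the medians calculated from the training data and reuse them on new examples. The sketch below assumes this approach; the names train, new and train_medians, and all values in them, are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training data and new examples, made up for illustration only
train = pd.DataFrame({'age': [23, 45, 21, 18, 20],
                      'height': [1.80, 1.65, 1.71, 1.61, 1.76]})
new = pd.DataFrame({'age': [np.nan, 30], 'height': [1.70, np.nan]})

# Learn the medians once, from the training data only
train_medians = train.median(numeric_only=True)

# Reuse the stored medians to fill gaps in examples seen at prediction time
new_filled = new.fillna(train_medians)
```

Filling new examples with the training medians means a single incoming row with a missing value can still be completed, which would not be possible if the median had to be calculated from that row alone.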