When we import data into NumPy or Pandas, any empty cells of
numerical data are labelled np.nan (NaN, 'not a number') on import. In
techniques such as machine learning we may wish either to 1) remove rows
with any missing data, or 2) fill in the missing data with a set value,
often the median of all other values in that data column. The latter has
the advantage that the technique can be used both in training the machine
learning model and in predicting output when we are given examples with
some missing data.
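
As a quick illustration of the two options, here is a minimal sketch using a small made-up table (the values and the names df, dropped and filled are assumptions for this example only). It uses the standard Pandas methods dropna and fillna.

```python
import numpy as np
import pandas as pd

# Small hypothetical table with one missing value in each numeric column
df = pd.DataFrame({
    'age': [23, 45, np.nan, 21],
    'height': [1.80, np.nan, 1.65, 1.71],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: fill each missing value with the median of its own column
filled = df.fillna(df.median())

print(dropped)
print(filled)
```

Option 1 loses two of the four rows here, which is why filling in values is often preferred when data are scarce.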

Here we define a function that goes through the data columns in a Pandas
DataFrame, looks to see if there is any missing data and, if there is,
replaces each NaN with the median of all other values in that data column.

import pandas as pd
import numpy as np

def impute_with_median(df):
    """Iterate through columns of Pandas DataFrame.
    Where NaNs exist in numerical columns replace with median"""
    # Get list of DataFrame column names
    cols = list(df)
    # Loop through columns
    for column in cols:
        # Transfer column to independent series
        col_data = df[column]
        # Skip non-numerical columns (e.g. text), which have no median
        if not pd.api.types.is_numeric_dtype(col_data):
            continue
        # Look to see if there is any missing numerical data
        missing_data = col_data.isna().sum()
        if missing_data > 0:
            # Get median and replace missing numerical data with median
            col_median = col_data.median()
            # Assign the filled column back (avoids in-place edits on a copy)
            df[column] = col_data.fillna(col_median)
    return df
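
The same result can usually be obtained without an explicit loop. The sketch below is an alternative, not a replacement for the function above; the name impute_with_median_vectorised and the example data are hypothetical. It relies on DataFrame.median(numeric_only=True) and on fillna accepting a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

def impute_with_median_vectorised(df):
    """Return a copy of df with NaNs in numeric columns
    replaced by the median of each column."""
    # numeric_only=True skips text columns such as names
    medians = df.median(numeric_only=True)
    # fillna with a Series fills each column using the value
    # stored under the matching column label
    return df.fillna(medians)

# Hypothetical example data
example = pd.DataFrame({
    'name': ['Bob', 'Jim', 'Anne'],
    'age': [23, np.nan, 21],
})
result = impute_with_median_vectorised(example)
print(result)
```

Unlike the loop version, this sketch returns a new DataFrame and leaves the one passed in unchanged.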

We will mimic importing data with missing numerical data.

name = ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom']
colour = ['red', 'red', 'red', 'blue', 'red', 'blue']
# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
age = [23, 45, np.nan, 21, 18, 20]
height = [1.80, np.nan, 1.65, 1.71, 1.61, 1.76]
data = pd.DataFrame()
data['name'] = name
data['colour'] = colour
data['age'] = age
data['height'] = height

View the data with missing values.

print(data)

Out:
    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0     NaN
2   Anne    red   NaN    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76
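
With larger tables it is hard to spot missing values by eye. A minimal sketch of counting them per column follows; it rebuilds a table with the same age and height values as above so that it runs on its own, and the name missing_per_column is an assumption for this example.

```python
import numpy as np
import pandas as pd

# Same age and height values as in the example table above
data = pd.DataFrame({
    'name': ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom'],
    'age': [23, 45, np.nan, 21, 18, 20],
    'height': [1.80, np.nan, 1.65, 1.71, 1.61, 1.76],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```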

Call the function to replace missing data with the median, and re-examine data.

data = impute_with_median(data)
print(data)

Out:
    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0    1.71
2   Anne    red  21.0    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76
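
When the same approach is used at prediction time, one common choice (an assumption here, not something the function above does) is to store the medians from the training data and reuse them to fill gaps in new examples. A minimal sketch, with hypothetical names train, train_medians and new_examples:

```python
import numpy as np
import pandas as pd

# Training data: same age and height values as the example above
train = pd.DataFrame({
    'age': [23, 45, np.nan, 21, 18, 20],
    'height': [1.80, np.nan, 1.65, 1.71, 1.61, 1.76],
})

# Medians learned from the training data only
train_medians = train.median()

# New examples arriving at prediction time, with gaps (made-up values)
new_examples = pd.DataFrame({
    'age': [30, np.nan],
    'height': [np.nan, 1.70],
})

# Fill gaps in the new examples using the stored training medians
new_filled = new_examples.fillna(train_medians)
print(new_filled)
```

This keeps the fill values fixed, so a new example is treated the same way regardless of what other examples arrive with it.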

Interests are use of simulation and machine learning in healthcare, currently working for the NHS and the University of Exeter. Committed to all work being performed in Free and Open Source Software (FOSS), and as much source data being made available as possible.
https://gitlab.com/michaelallen1966