Some data sets contain ordinal data: descriptions with a natural order, such as small, medium, large. Others contain categorical data with no obvious order, such as green, blue, red. We’ll usually want to convert both kinds into numbers before passing them to machine learning models.
Let’s look at an example:
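import pandas as pd
# Example data: one unordered column (colour) and one ordered column (size)
colour = ['green', 'green', 'red', 'blue', 'green', 'red', 'red']
size = ['small', 'small', 'large', 'medium', 'medium', 'x large', 'x small']
# Build the dataframe from the two lists
df = pd.DataFrame()
df['colour'] = colour
df['size'] = size
print(df)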
OUT:
   colour     size
0   green    small
1   green    small
2     red    large
3    blue   medium
4   green   medium
5     red  x large
6     red  x small
Working with ordinal data
One of our columns is obviously ordinal data: size has a natural order to it. We can convert this text to a number by mapping a dictionary to the column. We will create a new column (size_number) which replaces the text with a number.
# Define mapping dictionary:
size_classes = {'x small': 1,
                'small': 2,
                'medium': 3,
                'large': 4,
                'x large': 5}
# Map to dataframe and put results in a new column:
df['size_number'] = df['size'].map(size_classes)
# Display the new dataframe:
print(df)
OUT:
   colour     size  size_number
0   green    small            2
1   green    small            2
2     red    large            4
3    blue   medium            3
4   green   medium            3
5     red  x large            5
6     red  x small            1
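The mapping step relies on every label having an entry in the dictionary. When a label is missing, map returns NaN for that row rather than raising an error. The snippet below is a minimal sketch of how missed labels might be listed; the sizes series and the 'xx large' label are hypothetical and only there for illustration.

```python
import pandas as pd

# Same mapping dictionary as in the example above
size_classes = {'x small': 1, 'small': 2, 'medium': 3, 'large': 4, 'x large': 5}

# Hypothetical data with one label ('xx large') that is not in the dictionary
sizes = pd.Series(['small', 'large', 'xx large'])
size_number = sizes.map(size_classes)

# Labels without a dictionary entry come back as NaN, so collect them
unmapped = sorted(set(sizes[size_number.isna()]))
print(unmapped)
```

Running a check like this before modelling helps catch misspelled or unexpected labels early.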
Working with categorical data
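There is no obvious sensible mapping of colour to a number. So in this case we create an extra column for each colour and put a one in the relevant column (often called one-hot encoding). For this we use the pandas get_dummies function. We pass dtype=int so that the new columns hold 0 and 1; recent versions of pandas otherwise return True and False.
colours_df = pd.get_dummies(df['colour'], dtype=int)
print(colours_df)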
OUT:
   blue  green  red
0     0      1    0
1     0      1    0
2     0      0    1
3     1      0    0
4     0      1    0
5     0      0    1
6     0      0    1
We then combine the new dataframe with the original one, and we can delete the temporary one we made:
# Join the new columns to the original dataframe, side by side:
df = pd.concat([df, colours_df], axis=1, join='inner')
# Remove the temporary dataframe:
del colours_df
print(df)
OUT:
   colour     size  size_number  blue  green  red
0   green    small            2     0      1    0
1   green    small            2     0      1    0
2     red    large            4     0      0    1
3    blue   medium            3     1      0    0
4   green   medium            3     0      1    0
5     red  x large            5     0      0    1
6     red  x small            1     0      0    1
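As an alternative to making a temporary dataframe and concatenating it, get_dummies can be applied to the whole dataframe with its columns argument. The sketch below uses a small hypothetical dataframe (df_demo) rather than the one above. Note that this form replaces the colour column instead of keeping it, and the prefix argument labels the new columns with where they came from.

```python
import pandas as pd

# Hypothetical dataframe for illustration only
df_demo = pd.DataFrame({'colour': ['green', 'red', 'blue'],
                        'size_number': [2, 4, 3]})

# Encode 'colour' in one step; the text column is replaced by the new columns.
# dtype=int gives 0/1 rather than True/False.
encoded = pd.get_dummies(df_demo, columns=['colour'], prefix='colour', dtype=int)
print(encoded)
```

The prefixed names (colour_blue, colour_green, colour_red) may be easier to trace back to the source column when several columns are encoded at once.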
Selecting just our new columns
At the moment we have both the original data and the transformed data. For use in the model we would keep just the new columns. Here we’ll use the pandas loc indexer to slice the columns by label, from size_number onwards:
df1 = df.loc[:, 'size_number':]
print(df1)
OUT:
   size_number  blue  green  red
0            2     0      1    0
1            2     0      1    0
2            4     0      0    1
3            3     1      0    0
4            3     0      1    0
5            5     0      0    1
6            1     0      0    1
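Slicing by label depends on the new columns all sitting to the right of the original ones. Where the column order is less certain, one option is to drop the original text columns by name. This is a minimal sketch using a small hypothetical dataframe (df_demo) with the same kinds of columns as above.

```python
import pandas as pd

# Hypothetical dataframe holding both original and transformed columns
df_demo = pd.DataFrame({'colour': ['green', 'red'],
                        'size': ['small', 'large'],
                        'size_number': [2, 4],
                        'green': [1, 0],
                        'red': [0, 1]})

# Drop the original text columns by name, keeping only the numeric ones
model_df = df_demo.drop(columns=['colour', 'size'])
print(model_df)
```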