122. Oversampling to correct for imbalanced data using naive sampling or SMOTE

Machine learning models can perform poorly on minority classes (where one or more classes represent only a small proportion of the overall data set compared with a dominant class). One method of improving performance is to balance the number of examples between classes. Two methods are described here:

  1. Resampling from the minority classes to give the same number of examples as the majority class.
  2. SMOTE (Synthetic Minority Over-sampling Technique): creating synthetic data points that lie between two near neighbours in any particular class.

The examples here use imblearn (the imbalanced-learn package). See: https://imbalanced-learn.org

Install with: pip install -U imbalanced-learn, or conda install -c conda-forge imbalanced-learn

Reference

N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 16, 321-357, 2002

Create dummy data

First we will create some unbalanced dummy data: Classes 0, 1 and 2 will represent 1%, 5% and 94% of the data respectively.

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

Count instances of each class.

from collections import Counter
print(sorted(Counter(y).items()))

OUT:
[(0, 64), (1, 262), (2, 4674)]

Define function to plot data

import matplotlib.pyplot as plt

def plot_classes(X,y):
    colours = ['k','b','g']
    point_colours = [colours[val] for val in y]
    X1 = X[:,0]
    X2 = X[:,1]
    plt.scatter(X1, X2, facecolor = point_colours, edgecolor = 'k')
    plt.show()

Plot data using function

plot_classes(X,y)

Oversample with naive sampling to match numbers in each class

With naive resampling we repeatedly sample at random from the minority classes and add each new sample to the existing data set, leading to multiple instances of the same minority class examples. This builds up the number of minority class samples.
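The idea can be sketched by hand with NumPy. This is a minimal illustration only, not the imblearn implementation; the function name naive_oversample and the small demonstration arrays are assumptions made for this example.

```python
import numpy as np

def naive_oversample(X, y, seed=0):
    """Hypothetical helper: resample every class (with replacement) up to
    the size of the largest class. Not the imblearn implementation."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_extra = target - count
        if n_extra == 0:
            continue
        # Indices of the existing examples of this class
        idx = np.flatnonzero(y == cls)
        # Draw the extra examples at random, with replacement
        chosen = rng.choice(idx, size=n_extra, replace=True)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.vstack(X_parts), np.concatenate(y_parts)

# Small illustrative data set: 2 examples of class 0, 6 of class 1
X_demo = np.arange(16, dtype=float).reshape(8, 2)
y_demo = np.array([0, 0, 1, 1, 1, 1, 1, 1])

X_over, y_over = naive_oversample(X_demo, y_demo)
print(np.bincount(y_over))
```

Every added row is an exact copy of an existing minority class row, which is why a scatter plot of the augmented data looks the same as the original.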

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

Count instances of each class in the augmented data.

from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]

Plot augmented data (it looks the same as the original as points are overlaid).

plot_classes(X_resampled,y_resampled)

SMOTE with continuous variables

SMOTE (synthetic minority oversampling technique) works by finding two near neighbours in a minority class, producing a new point at a random position on the line between the two existing points, and adding that new point to the sample. The example shown is in two dimensions, but SMOTE will work across multiple dimensions (features). SMOTE therefore helps to ‘fill in’ the feature space occupied by minority classes.
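The interpolation step can be sketched with NumPy as follows. This is a simplified illustration rather than the imblearn code: the helper name synthetic_point, the example points and the choice of k=2 neighbours are all assumptions. In the SMOTE paper cited above, the new point is placed at a random position on the line between a point and one of its neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative minority class points (two features)
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0],
                     [3.0, 2.5]])

def synthetic_point(points, k=2):
    """Hypothetical helper: make one synthetic point between a randomly
    chosen point and one of its k nearest neighbours."""
    base = points[rng.integers(len(points))]
    # Euclidean distance from the base point to every point
    dist = np.linalg.norm(points - base, axis=1)
    # Nearest neighbours, skipping the base point itself (distance 0)
    neighbours = np.argsort(dist)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random position along the line from base to neighbour
    gap = rng.random()
    return base + gap * (neighbour - base), base, neighbour

new_point, base, neighbour = synthetic_point(minority)
print('Base point:     ', base)
print('Neighbour:      ', neighbour)
print('Synthetic point:', new_point)
```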

from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

# Count instances of each class
from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]

Plot augmented data (note minority class data points now exist in new spaces).

plot_classes(X_resampled,y_resampled)

SMOTE with mixed continuous and binary/categorical values

It is not possible to calculate a ‘mid point’ between two points of binary or categorical data. An extension to the SMOTE method allows for use of binary or categorical data by taking the most common occurring category of nearest neighbours to a minority class point.
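The choice of category for a synthetic point can be illustrated with a Counter. This is only a sketch of the 'most common category' idea; the list of neighbour categories is made up for the example, and the real SMOTENC method also has to account for categorical features when measuring distance between points.

```python
from collections import Counter

# Categories of the nearest neighbours of one minority class point
# (illustrative values only)
neighbour_categories = ['A', 'C', 'A', 'B', 'A']

# The synthetic point takes the most frequently occurring category
most_common_category = Counter(neighbour_categories).most_common(1)[0][0]
print(most_common_category)
```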

# create a synthetic data set with continuous and categorical features
import numpy as np
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)

Count instances of each class

print(sorted(Counter(y).items()))

OUT:
[(0, 20), (1, 30)]

Show last 10 original data points

print (X[-10:])

OUT:
[['A' 1.4689412854323924 2]
 ['C' -1.1238983345400366 0]
 ['C' 0.9500053955071801 2]
 ['A' 1.7265164685753638 1]
 ['A' 0.4578850770000152 0]
 ['C' -1.6842873783658814 0]
 ['B' 0.32684522397001387 0]
 ['A' -0.0811189541586873 2]
 ['B' 0.46779475326315173 1]
 ['B' 0.7361223506692577 0]]

Use SMOTENC to create new data points.

from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)

Count instances of each class

print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 30), (1, 30)]

Show last 10 values of X (SMOTE data points are added to the end of the original data set)

print (X_resampled[-10:])

OUT:
[['C' -1.0600505672469849 1]
 ['C' -0.36965644259183145 1]
 ['A' 0.1453826708354494 2]
 ['C' -1.7442827953859052 2]
 ['C' -1.6278053447258838 2]
 ['A' 0.5246469549655818 2]
 ['B' -0.3657680728116921 2]
 ['A' 0.9344237230779993 2]
 ['B' 0.3710891618824609 2]
 ['B' 0.3327240726719727 2]]

120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently a useful distribution for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew which may frequently occur when some clinical process step has some additional complexity to it compared to the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation that were calculated from the original non-logged population into the mu and sigma of the underlying normal distribution (the distribution of the logged values).
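The conversion used in the function below can be written out as follows. The symbols m (arithmetic mean) and s (standard deviation) of the non-logged population are notation chosen for this note.

```latex
\phi = \sqrt{s^2 + m^2}, \qquad
\mu = \ln\!\left(\frac{m^2}{\phi}\right), \qquad
\sigma = \sqrt{\ln\!\left(\frac{\phi^2}{m^2}\right)}
      = \sqrt{\ln\!\left(1 + \frac{s^2}{m^2}\right)}
```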

(For maximum computational efficiency, when calling the function repeatedly using the same mean and standard deviation, you may wish to split this into two functions: one to calculate mu and sigma, which needs calling only once, and the other to sample from the log normal distribution given mu and sigma).

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

import numpy as np

def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calculated from the original non-logged population.
    
    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations described at:
        https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-
        with-specified-mean-and-variance.html
        
    Returns a numpy array of floats if n > 1, otherwise return a float
    """
    
    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    
    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma , n)
    
    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop
        
    return generated_pop
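If the function is to be called many times with the same mean and standard deviation, the split into two functions suggested earlier might look like the sketch below. The function names calculate_mu_sigma and sample_lognormal are hypothetical, and the use of a NumPy Generator (np.random.default_rng) instead of np.random.lognormal is a choice made for this example.

```python
import numpy as np

def calculate_mu_sigma(mean, stdev):
    """Hypothetical helper: convert the non-logged mean and standard
    deviation to mu and sigma. Needs calling only once."""
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    return mu, sigma

def sample_lognormal(mu, sigma, n=1, rng=None):
    """Hypothetical helper: return an array of n lognormal samples."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.lognormal(mu, sigma, n)

# Calculate mu and sigma once ...
mu, sigma = calculate_mu_sigma(10, 10)

# ... then sample as often as needed
rng = np.random.default_rng(42)
samples = sample_lognormal(mu, sigma, 5, rng)
print(samples)
```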

Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and test the resulting generated population has the same mean and standard deviation.

mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print ('Mean:', generated_pop.mean())
print ('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651

Plot a histogram of the generated population:

import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0,51,1)
plt.hist(generated_pop, bins=bins)
plt.show()

Generating a single sample

The function will return a single number if no n is given in the function call:

print (generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125

40. Removing duplicate data in NumPy and Pandas

Both NumPy and Pandas offer easy ways of removing duplicate rows. Pandas offers a more powerful approach if you wish to remove rows that are partly duplicated.
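As a brief sketch of the Pandas approach, drop_duplicates removes fully duplicated rows by default, and its subset argument restricts the comparison to chosen columns. The dataframe contents here are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ann', 'Bob', 'Bob'],
                   'visit': [1, 1, 1, 2],
                   'score': [5, 5, 7, 8]})

# Remove rows that are duplicated in every column
fully_unique = df.drop_duplicates()

# Remove rows that share the same 'name', keeping the first occurrence
unique_names = df.drop_duplicates(subset=['name'], keep='first')

print(fully_unique)
print(unique_names)
```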

NumPy

With numpy we use np.unique() to remove duplicate rows or columns (use the argument axis=0 for unique rows or axis=1 for unique columns).
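A minimal example of removing duplicate rows is shown below (the array values are illustrative). Note that np.unique returns the unique rows in sorted order, not in their original order.

```python
import numpy as np

array = np.array([[4, 5, 6],
                  [1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [1, 2, 3]])

# axis=0 compares whole rows
unique_rows = np.unique(array, axis=0)
print(unique_rows)
```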

34. Iterating through columns and rows in NumPy and Pandas

Using apply_along_axis (NumPy) or apply (Pandas) is a more Pythonic way of iterating through data in NumPy and Pandas (see related tutorial here). But there may be occasions you wish to simply work your way through rows or columns in NumPy and Pandas. Here is how it is done.
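A short sketch of simple iteration is given below, using a small made-up array. Iterating over a 2D NumPy array yields rows, and iterating over its transpose yields columns; in Pandas, iterrows and items yield rows and columns respectively.

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2, 3],
                  [4, 5, 6]])

# NumPy: iterating over a 2D array yields one row at a time
row_sums = [int(row.sum()) for row in array]

# NumPy: iterate over columns by iterating over the transposed array
col_sums = [int(col.sum()) for col in array.T]

df = pd.DataFrame(array, columns=['a', 'b', 'c'])

# Pandas: iterrows yields (index, row as a Series)
row_totals = [int(row.sum()) for index, row in df.iterrows()]

# Pandas: items yields (column name, column as a Series)
col_totals = {name: int(col.sum()) for name, col in df.items()}

print(row_sums, col_sums)
print(row_totals, col_totals)
```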

30. Using masks to filter data, and perform search and replace, in NumPy and Pandas

In both NumPy and Pandas we can create masks to filter data. Masks are ‘Boolean’ arrays (that is, arrays of true and false values) and provide a powerful and flexible method of selecting data.

NumPy

creating a mask

Let’s begin by creating an array of 4 rows of 10 columns of uniform random numbers between 0 and 100.
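A minimal sketch of creating and using such a mask is given below. The threshold of 50, the replacement value and the variable names are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 rows of 10 columns of uniform random numbers between 0 and 100
array = rng.uniform(0, 100, size=(4, 10))

# The mask is a Boolean array with the same shape as the data
mask = array > 50

# Filter: select only the values where the mask is True (gives a 1D array)
high_values = array[mask]

# Search and replace: cap all values above 50 at 50, working on a copy
array_capped = array.copy()
array_capped[mask] = 50

print(mask)
print(high_values)
```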

27. Adding more data to NumPy arrays and Pandas dataframes

Adding data to NumPy and Pandas

NumPy

Adding more rows

To add more rows to an existing numpy array use the vstack method which can add multiple or single rows. New data may be in the form of a numpy array or a list. All combined data must have the same number of columns.

import numpy as np

# Starting with a NumPy array
array1 = np.array([[1,2,3,4,5],
         [6,7,8,9,10],
         [11,12,13,14,15]])

# An additional 2d list
array2 = [[16,17,18,19,20],
         [21,22,23,24,25]]

# An additional single row Numpy array
array3 = np.array([26,27,28,29,30])

# We will combine all data into existing array, array1
# But a new name could be given
array1 = np.vstack([array1, array2, array3])

print (array1)
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]

Adding more columns of data

To add more columns to an existing numpy array use the hstack method which can add multiple or single columns. All combined data must have the same number of rows.

import numpy as np

# Start with a numpy array
array1 = np.array([[1,2],
         [6,7],
         [11,12]])

# an additional multi-row numpy array
array2 = np.array([[3,4],
         [8,9],
         [13,14]])
# an additional single column list
# Note: the vertical appearance is for ease of reading only
# The square bracketed values within a wider set of square brackets will set this as a column
array3 = [[5],
         [10],
         [15]]

array1 = np.hstack([array1, array2, array3])

print (array1)
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Pandas

Adding more rows of data

Here we will use the concat method to add more rows. Note that we have to define column names for the rows we will be adding.

Notice what happens to the index column on the left, and the order of the columns

import pandas as pd

df1 =pd.DataFrame()

# Building an initial dataframe from individual lists:

names = ['Gandolf','Gimli']
types = ['Wizard','Dwarf']
magic = [10, 1]
aggression = [7, 10,]
stealth = [8, 2]

df1['names'] = names
df1['type'] = types
df1['magic_power'] = magic
df1['aggression'] = aggression
df1['stealth'] = stealth

# We can also define a dataframe with lists of all data for each row,
# but we need to remember to pass column names, as a list, to the dataframe

col_names = ['names','type','magic_power','aggression','stealth']

df2 = pd.DataFrame(
    [['Frodo','Hobbit',4,2,5],
     ['Legolas','Elf',6,5,10]],
        columns = col_names)

df1 = pd.concat([df1,df2])
print (df1)
     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
0    Frodo  Hobbit            4           2        5
1  Legolas     Elf            6           5       10

Each dataframe had indexes starting with zero, and those numbers are kept when combining the dataframes. This may be appropriate if the index values are unique identifiers, but with a numbered index we may prefer to ignore the index of the appended dataframe and let the index continue in its original order. We do this by passing ignore_index = True to the concat method.

import pandas as pd

df1 =pd.DataFrame()

# Building an initial dataframe from individual lists:

names = ['Gandolf','Gimli']
types = ['Wizard','Dwarf']
magic = [10, 1]
aggression = [7, 10,]
stealth = [8, 2]

df1['names'] = names
df1['type'] = types
df1['magic_power'] = magic
df1['aggression'] = aggression
df1['stealth'] = stealth

# We can also define a dataframe with lists of all data for each row,
# but we need to remember to pass column names, as a list, to the dataframe

col_names = ['names','type','magic_power','aggression','stealth']

df2 = pd.DataFrame(
    [['Frodo','Hobbit',4,2,5],
     ['Legolas','Elf',6,5,10]],
        columns = col_names)

df1 = pd.concat([df1,df2],ignore_index = True)
print (df1)
     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10

Depending on the version of pandas, the concat method may reorder columns (older versions also offered an append method, which did not reorder columns but was less efficient for combining larger dataframes; append was removed in pandas 2.0). To re-order columns we can pass the column order to the new dataframe. This could be done by appending [col_names] to the end of the concat statement, or may be performed as a separate step:

col_names = ['names','type','magic_power','aggression','stealth']
df1 = df1[col_names]
print(df1)
     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10
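The single-statement alternative mentioned above might look like the following sketch. The two small dataframes, df_a and df_b, are made up for this example, with the second deliberately supplying its columns in a different order.

```python
import pandas as pd

col_names = ['names', 'type', 'magic_power']

df_a = pd.DataFrame([['Gimli', 'Dwarf', 1]], columns=col_names)

# Same columns, deliberately supplied in a different order
df_b = pd.DataFrame([['Hobbit', 'Frodo', 4]],
                    columns=['type', 'names', 'magic_power'])

# Concatenate and set the column order in a single statement
combined = pd.concat([df_a, df_b], ignore_index=True)[col_names]
print(combined)
```

Rows are matched to columns by name, so the values from df_b end up in the correct columns even though they were supplied in a different order.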

Adding more columns of data

Individual columns of data may be added to a dataframe simply by defining a new column and passing a list of values to it.

df1 = pd.DataFrame()
names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']

df1['names'] = names
df1['type'] = types

print (df1)

# Add another column
magic = [10, 1, 4, 6, 4]
df1['magic'] = magic

print ('\n Added column:\n',df1)
     names    type
0  Gandolf  Wizard
1    Gimli   Dwarf
2    Frodo  Hobbit
3  Legolas     Elf
4    Bilbo  Hobbit

 Added column:
      names    type  magic
0  Gandolf  Wizard     10
1    Gimli   Dwarf      1
2    Frodo  Hobbit      4
3  Legolas     Elf      6
4    Bilbo  Hobbit      4

We can also use concat to add multiple columns (in the form of another dataframe), in which case the data will be combined based on the index column. We pass the argument axis=1 to the concat statement to instruct the method to combine by column (it defaults to axis=0, or row concatenation).

df1 = pd.DataFrame()
names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']

df1['names'] = names
df1['type'] = types

print (df1)

df2 = pd.DataFrame()

magic = [10, 1, 4, 6, 4]
aggression = [7, 10, 2, 5, 1]
stealth = [8, 2, 5, 10, 5]

df2['magic_power'] = magic
df2['aggression'] = aggression
df2['stealth'] = stealth
     names    type
0  Gandolf  Wizard
1    Gimli   Dwarf
2    Frodo  Hobbit
3  Legolas     Elf
4    Bilbo  Hobbit
Combine the two dataframes by column:
df1 = pd.concat([df1,df2], axis=1)
print(df1)
     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10
4    Bilbo  Hobbit            4           1        5