# 122: Oversampling to correct for imbalanced data using naive sampling or SMOTE

Machine learning can have poor performance for minority classes (where one or more classes represent only a small proportion of the overall data set compared with a dominant class). One method of improving performance is to balance out the number of examples between different classes. Here two methods are described:

1. Resampling from the minority classes to give the same number of examples as the majority class.
2. SMOTE (Synthetic Minority Over-sampling Technique): creating synthetic data points that lie on the line between two near neighbours in any particular class.

SMOTE uses the imblearn package (imbalanced-learn). See: https://imbalanced-learn.org

Install with: `pip install -U imbalanced-learn`, or `conda install -c conda-forge imbalanced-learn`

Reference

N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 16, 321-357, 2002

# Create dummy data

First we will create some unbalanced dummy data: Classes 0, 1 and 2 will represent 1%, 5% and 94% of the data respectively.

```
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
```

Count instances of each class.

``````from collections import Counter
print(sorted(Counter(y).items()))

OUT:
[(0, 64), (1, 262), (2, 4674)]``````

### Define function to plot data

```
import matplotlib.pyplot as plt

def plot_classes(X, y):
    """Scatter plot of the two features, with points coloured by class."""
    colours = ['k', 'b', 'g']
    point_colours = [colours[val] for val in y]
    X1 = X[:, 0]
    X2 = X[:, 1]
    plt.scatter(X1, X2, facecolor=point_colours, edgecolor='k')
    plt.show()
```

Plot the data using the function:

``plot_classes(X,y)``

## Oversample with naive sampling to match numbers in each class

With naive resampling we repeatedly sample at random (with replacement) from the minority classes and add each new sample to the existing data set, leading to multiple instances of the same minority class points. This builds up the number of minority class samples.

``````from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)``````

Count instances of each class in the augmented data.

``````from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]``````
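
To make the idea concrete, here is a minimal NumPy sketch of naive oversampling done by hand. It is an illustration only, not how `RandomOverSampler` is implemented; the function name `naive_oversample` and the small demo arrays are made up for this example.

```python
import numpy as np

def naive_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each class until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing number of rows, with replacement, from this class
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Illustrative data: 2 examples of class 0 and 6 examples of class 1
X_demo = np.arange(16, dtype=float).reshape(8, 2)
y_demo = np.array([0, 0, 1, 1, 1, 1, 1, 1])

X_over, y_over = naive_oversample(X_demo, y_demo)
print(np.bincount(y_over))  # [6 6]
```

Every added row is an exact copy of an existing row, which is why a scatter plot of the augmented data looks the same as the original.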

Plot augmented data (it looks the same as the original as points are overlaid).

``plot_classes(X_resampled,y_resampled)``

# SMOTE with continuous variables

SMOTE (synthetic minority oversampling technique) works by taking a point in a minority class, choosing one of its near neighbours in the same class, producing a new point at a random position on the line between the two existing points, and adding that new point to the sample. The example shown is in two dimensions, but SMOTE will work across multiple dimensions (features). SMOTE therefore helps to ‘fill in’ the feature space occupied by minority classes.

``````from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

# Count instances of each class
from collections import Counter
print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 4674), (1, 4674), (2, 4674)]``````
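
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imblearn implementation; the function name `smote_like_point`, the choice of `k` and the demo points are assumptions made for this example, and it assumes the minority class contains no duplicate points.

```python
import numpy as np

def smote_like_point(X_minority, k=3, rng=None):
    """Create one synthetic point between a random minority point and
    one of its k nearest minority-class neighbours (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    point = X_minority[rng.integers(len(X_minority))]
    # Euclidean distance from the chosen point to every minority point
    dist = np.linalg.norm(X_minority - point, axis=1)
    # Skip position 0 of the sorted order: that is the point itself
    neighbours = np.argsort(dist)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbours)]
    # Random position along the line joining the two points
    gap = rng.random()
    return point + gap * (neighbour - point)

# Four minority points at the corners of a unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_like_point(X_min, k=2, rng=np.random.default_rng(0))
print(new_point)
```

With `k=2` each corner's nearest neighbours are the two adjacent corners, so the new point always falls on an edge of the square rather than on an existing point.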

Plot augmented data (note minority class data points now exist in new spaces).

# SMOTE with mixed continuous and binary/categorical values

It is not possible to calculate a ‘mid point’ between two points of binary or categorical data. An extension to the SMOTE method (SMOTENC) allows for use of binary or categorical data by taking the most commonly occurring category among the nearest neighbours of a minority class point.
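
As a rough sketch of the ‘most common category’ rule, the snippet below picks the category that occurs most often among a set of neighbours. The helper name `most_common_category` and the neighbour values are hypothetical, and the real library handles distances and tie-breaking in its own way.

```python
from collections import Counter

def most_common_category(neighbour_categories):
    """Return the category seen most often among the neighbours."""
    return Counter(neighbour_categories).most_common(1)[0][0]

# Categories of the nearest neighbours of one minority point (made-up values)
neighbour_categories = ['A', 'C', 'A', 'B', 'A']
chosen = most_common_category(neighbour_categories)
print(chosen)  # A
```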

``````# create a synthetic data set with continuous and categorical features
import numpy as np
rng = np.random.RandomState(42)
n_samples = 50
X = np.empty((n_samples, 3), dtype=object)
X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
X[:, 1] = rng.randn(n_samples)
X[:, 2] = rng.randint(3, size=n_samples)
y = np.array([0] * 20 + [1] * 30)``````

Count instances of each class

``````print(sorted(Counter(y).items()))

OUT:
[(0, 20), (1, 30)]``````

Show last 10 original data points

``````print (X[-10:])

OUT:
[['A' 1.4689412854323924 2]
['C' -1.1238983345400366 0]
['C' 0.9500053955071801 2]
['A' 1.7265164685753638 1]
['A' 0.4578850770000152 0]
['C' -1.6842873783658814 0]
['B' 0.32684522397001387 0]
['A' -0.0811189541586873 2]
['B' 0.46779475326315173 1]
['B' 0.7361223506692577 0]]``````

Use SMOTENC to create new data points.

``````from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)``````

Count instances of each class

``````print(sorted(Counter(y_resampled).items()))

OUT:
[(0, 30), (1, 30)]``````

Show last 10 values of X (SMOTE data points are added to the end of the original data set)

``````print (X_resampled[-10:])

OUT:
[['C' -1.0600505672469849 1]
['C' -0.36965644259183145 1]
['A' 0.1453826708354494 2]
['C' -1.7442827953859052 2]
['C' -1.6278053447258838 2]
['A' 0.5246469549655818 2]
['B' -0.3657680728116921 2]
['A' 0.9344237230779993 2]
['B' 0.3710891618824609 2]
['B' 0.3327240726719727 2]]``````

# 120. Generating log normal samples from provided arithmetic mean and standard deviation of original population

The log normal distribution is frequently a useful distribution for mimicking process times in healthcare pathways (or many other non-automated processes). The distribution has a right skew which may frequently occur when some clinical process step has some additional complexity to it compared to the ‘usual’ case.

To sample from a log normal distribution we need to convert the mean and standard deviation that were calculated from the original non-logged population into the mu and sigma of the log normal distribution (that is, the mean and standard deviation of the logged values).
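
As a sketch of that conversion, matching the calculation in the function below, let $m$ and $s$ be the arithmetic mean and standard deviation of the original non-logged population (these symbols are chosen here for illustration):

$$
\phi = \sqrt{s^2 + m^2}, \qquad \mu = \ln\!\left(\frac{m^2}{\phi}\right), \qquad \sigma = \sqrt{\ln\!\left(\frac{\phi^2}{m^2}\right)}
$$

A quick check: the mean of a log normal distribution is $e^{\mu + \sigma^2/2}$. Substituting gives $\mu + \sigma^2/2 = \ln(m^2/\phi) + \ln(\phi/m) = \ln m$, so the mean is recovered as $m$.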

(For maximum computational efficiency, when calling the function repeatedly with the same mean and standard deviation, you may wish to split this into two functions: one to calculate mu and sigma, which needs calling only once, and the other to sample from the log normal distribution given mu and sigma.)

For more on the maths see:

https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-with-specified-mean-and-variance.html

```
import numpy as np

def generate_lognormal_samples(mean, stdev, n=1):
    """
    Returns n samples taken from a lognormal distribution, based on mean and
    standard deviation calculated from the original non-logged population.

    Converts mean and standard deviation to underlying lognormal distribution
    mu and sigma based on calculations described at:
    https://blogs.sas.com/content/iml/2014/06/04/simulate-lognormal-data-
    with-specified-mean-and-variance.html

    Returns a numpy array of floats if n > 1, otherwise return a float
    """

    # Calculate mu and sigma of underlying lognormal distribution
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5

    # Generate lognormal population
    generated_pop = np.random.lognormal(mu, sigma, n)

    # Convert single sample (if n=1) to a float, otherwise leave as array
    generated_pop = \
        generated_pop[0] if len(generated_pop) == 1 else generated_pop

    return generated_pop
```

## Test the function

We will generate a population of 100,000 samples with a given mean and standard deviation (these would be calculated on the non-logged population), and test the resulting generated population has the same mean and standard deviation.

``````mean = 10
stdev = 10
generated_pop = generate_lognormal_samples(mean, stdev, 100000)
print ('Mean:', generated_pop.mean())
print ('Standard deviation:', generated_pop.std())

Out:

Mean: 10.043105926813356
Standard deviation: 9.99527575740651``````

Plot a histogram of the generated population:

``````import matplotlib.pyplot as plt
%matplotlib inline
bins = np.arange(0,51,1)
plt.hist(generated_pop, bins=bins)
plt.show()``````

## Generating a single sample

The function will return a single number if no `n` is given in the function call:

``````print (generate_lognormal_samples(mean, stdev))

Out: 6.999376449335125``````
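
The two-function split suggested earlier might look like the sketch below. The function names `lognormal_params` and `sample_lognormal` are made up for this example, and it uses NumPy's `default_rng` generator rather than the `np.random.lognormal` call used above, which is a design choice of this sketch rather than part of the original function.

```python
import numpy as np

def lognormal_params(mean, stdev):
    """Convert the arithmetic mean and standard deviation of the
    non-logged population to mu and sigma (call once per distribution)."""
    phi = (stdev ** 2 + mean ** 2) ** 0.5
    mu = np.log(mean ** 2 / phi)
    sigma = (np.log(phi ** 2 / mean ** 2)) ** 0.5
    return mu, sigma

def sample_lognormal(mu, sigma, n=1, rng=None):
    """Sample from the log normal distribution given mu and sigma.
    Returns a single float if n == 1, otherwise a NumPy array."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.lognormal(mu, sigma, n)
    return samples[0] if n == 1 else samples

# Calculate mu and sigma once, then sample as often as needed
mu, sigma = lognormal_params(10, 10)
rng = np.random.default_rng(42)
first_batch = sample_lognormal(mu, sigma, 5, rng)
single = sample_lognormal(mu, sigma, rng=rng)
print(first_batch)
print(single)
```

Passing the generator in as an argument keeps the sampling reproducible when a seed is set, and avoids repeating the logarithm calculations on every call.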

# 40. Removing duplicate data in NumPy and Pandas

Both NumPy and Pandas offer easy ways of removing duplicate rows. Pandas offers a more powerful approach if you wish to remove rows that are partly duplicated.

## NumPy

With numpy we use np.unique() to remove duplicate rows or columns (use the argument axis=0 for unique rows or axis=1 for unique columns).
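
A short sketch of removing duplicate rows, using a small made-up array. Note that `np.unique` returns the unique rows in sorted order rather than in their original order.

```python
import numpy as np

# Rows 0 and 2 are duplicates of each other
array = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [1, 2, 3],
                  [7, 8, 9]])

unique_rows = np.unique(array, axis=0)
print(unique_rows)
```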

# 35. Array maths in NumPy

NumPy allows easy standard mathematics to be performed on arrays, as well as more complex linear algebra such as array multiplication.

Let’s begin by building a couple of arrays. We’ll use the np.arange method to create an array of numbers in range 1 to 12, and then reshape the array into a 3 x 4 array.
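
A minimal sketch of that first step, followed by one element-wise operation and one matrix multiplication. The variable names are chosen for this example only.

```python
import numpy as np

# Numbers 1 to 12 reshaped into 3 rows of 4 columns
x = np.arange(1, 13).reshape(3, 4)
print(x)

# Standard maths is applied element by element
doubled = x * 2
print(doubled)

# Matrix multiplication: (3 x 4) @ (4 x 3) gives a 3 x 3 array
product = x @ x.T
print(product)
```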

# 30. Using masks to filter data, and perform search and replace, in NumPy and Pandas

In both NumPy and Pandas we can create masks to filter data. Masks are ‘Boolean’ arrays (that is, arrays of true and false values) and provide a powerful and flexible method of selecting data.

## NumPy

Let’s begin by creating an array of 4 rows of 10 columns of uniform random numbers between 0 and 100.
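
A small sketch of how such a mask might be built and used, assuming a threshold of 50 and a replacement value of 100 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 rows of 10 columns of uniform random numbers between 0 and 100
array = rng.uniform(0, 100, size=(4, 10))

# Boolean mask: True wherever the value is greater than 50
mask = array > 50

# Filter: keep only the values where the mask is True (returns a 1D array)
selected = array[mask]

# Search and replace: set masked values to 100, leave the rest unchanged
replaced = np.where(mask, 100.0, array)

print(mask.sum(), 'values are greater than 50')
```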

# Adding data to NumPy and Pandas

## NumPy

### Adding more rows of data

To add more rows to an existing numpy array use the vstack method which can add multiple or single rows. New data may be in the form of a numpy array or a list. All combined data must have the same number of columns.

```import numpy as np

# Starting with a NumPy array
array1 = np.array([[1,2,3,4,5],
[6,7,8,9,10],
[11,12,13,14,15]])

array2 = [[16,17,18,19,20],
[21,22,23,24,25]]

# An additional single row Numpy array
array3 = np.array([26,27,28,29,30])

# We will combine all data into existing array, array1
# But a new name could be given
array1 = np.vstack([array1, array2, array3])

print (array1)
```
```[[ 1  2  3  4  5]
[ 6  7  8  9 10]
[11 12 13 14 15]
[16 17 18 19 20]
[21 22 23 24 25]
[26 27 28 29 30]]
```

### Adding more columns of data

To add more columns to an existing numpy array use the hstack method which can add multiple or single columns. All combined data must have the same number of rows.

```import numpy as np

array1 = np.array([[1,2],
[6,7],
[11,12]])

# an additional multi-column numpy array
array2 = np.array([[3,4],
                   [8,9],
                   [13,14]])
# an additional single column list
# Note: the vertical appearance is for ease of reading only
# The square bracketed values within a wider set of square brackets will set this as a column
array3 = [[5],
[10],
[15]]

array1 = np.hstack([array1, array2, array3])

print (array1)
```
```[[ 1  2  3  4  5]
[ 6  7  8  9 10]
[11 12 13 14 15]]
```

## Pandas

### Adding more rows of data

Here we will use the concat method to add more rows. Note that we have to define column names for the rows we will be adding.

Notice what happens to the index column on the left, and the order of the columns.

```import pandas as pd

df1 =pd.DataFrame()

# Building an initial dataframe from individual lists:

names = ['Gandolf','Gimli']
types = ['Wizard','Dwarf']
magic = [10, 1]
aggression = [7, 10,]
stealth = [8, 2]

df1['names'] = names
df1['type'] = types
df1['magic_power'] = magic
df1['aggression'] = aggression
df1['stealth'] = stealth

# We can also define a dataframe with lists of all data for each row,
# but we need to remember to pass column names, as a list, to the dataframe

col_names = ['names','type','magic_power','aggression','stealth']

df2 = pd.DataFrame(
[['Frodo','Hobbit',4,2,5],
['Legolas','Elf',6,5,10]],
columns = col_names)

df1 = pd.concat([df1,df2])
print (df1)
```
```     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
0    Frodo  Hobbit            4           2        5
1  Legolas     Elf            6           5       10
```

Each dataframe had indexes starting with zero, and those numbers are kept when combining the dataframes. This may be appropriate if the index values are unique identifiers, but with a numbered index we may prefer to let the index of the appended dataframe be ignored, and the index allowed to continue its original order. We do this by passing ignore_index = True to the concat method.

```import pandas as pd

df1 =pd.DataFrame()

# Building an initial dataframe from individual lists:

names = ['Gandolf','Gimli']
types = ['Wizard','Dwarf']
magic = [10, 1]
aggression = [7, 10,]
stealth = [8, 2]

df1['names'] = names
df1['type'] = types
df1['magic_power'] = magic
df1['aggression'] = aggression
df1['stealth'] = stealth

# We can also define a dataframe with lists of all data for each row,
# but we need to remember to pass column names, as a list, to the dataframe

col_names = ['names','type','magic_power','aggression','stealth']

df2 = pd.DataFrame(
[['Frodo','Hobbit',4,2,5],
['Legolas','Elf',6,5,10]],
columns = col_names)

df1 = pd.concat([df1,df2],ignore_index = True)
print (df1)
```
```     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10
```

Depending on the pandas version, the concat method may reorder columns (this did not happen in the outputs above). There was another method, append, which did not reorder columns, but append was less efficient for combining larger dataframes and was removed in pandas 2.0. To set the column order we can pass the column order to the new dataframe. This could be done by appending [col_names] to the end of the concat statement, or may be performed as a separate step:

```col_names = ['names','type','magic_power','aggression','stealth']
df1 = df1[col_names]
print(df1)
```
```     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10
```
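
For comparison, the single-statement form mentioned above could look like the following sketch, which rebuilds the two small dataframes so that it runs on its own. The variable name `combined` is chosen for this example.

```python
import pandas as pd

col_names = ['names', 'type', 'magic_power', 'aggression', 'stealth']

df1 = pd.DataFrame(
    [['Gandolf', 'Wizard', 10, 7, 8],
     ['Gimli', 'Dwarf', 1, 10, 2]],
    columns=col_names)

df2 = pd.DataFrame(
    [['Frodo', 'Hobbit', 4, 2, 5],
     ['Legolas', 'Elf', 6, 5, 10]],
    columns=col_names)

# Concatenate rows, renumber the index, and set the column order in one step
combined = pd.concat([df1, df2], ignore_index=True)[col_names]
print(combined)
```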

### Adding more columns of data

Individual columns of data may be added to a dataframe simply by defining a new column and passing a list of values to it.

```df1 = pd.DataFrame()
names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']

df1['names'] = names
df1['type'] = types

print (df1)

magic = [10, 1, 4, 6, 4]
df1['magic'] = magic

print (df1)
```
```     names    type
0  Gandolf  Wizard
1    Gimli   Dwarf
2    Frodo  Hobbit
3  Legolas     Elf
4    Bilbo  Hobbit

     names    type  magic
0  Gandolf  Wizard     10
1    Gimli   Dwarf      1
2    Frodo  Hobbit      4
3  Legolas     Elf      6
4    Bilbo  Hobbit      4
```

We can use concat also to add multiple columns (in the form of another dataframe), in which case the data will be combined based on the index column. We pass the argument axis=1 to the concat statement to instruct the method to combine by column (it defaults to axis=0, or row concatenation).

```df1 = pd.DataFrame()
names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']

df1['names'] = names
df1['type'] = types

print (df1)

df2 = pd.DataFrame()

magic = [10, 1, 4, 6, 4]
aggression = [7, 10, 2, 5, 1]
stealth = [8, 2, 5, 10, 5]

df2['magic_power'] = magic
df2['aggression'] = aggression
df2['stealth'] = stealth
```
```     names    type
0  Gandolf  Wizard
1    Gimli   Dwarf
2    Frodo  Hobbit
3  Legolas     Elf
4    Bilbo  Hobbit
```
```df1 = pd.concat([df1,df2], axis=1)
print(df1)
```
```     names    type  magic_power  aggression  stealth
0  Gandolf  Wizard           10           7        8
1    Gimli   Dwarf            1          10        2
2    Frodo  Hobbit            4           2        5
3  Legolas     Elf            6           5       10
4    Bilbo  Hobbit            4           1        5
```
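
Because column-wise concat combines data based on the index, rows are matched by index label rather than by position. The sketch below uses two small made-up dataframes with only partly overlapping indexes to show the effect. With the default settings, labels missing from one dataframe are filled with NaN.

```python
import pandas as pd

# Index labels 0, 1, 2
left = pd.DataFrame({'names': ['Gandolf', 'Gimli', 'Frodo']},
                    index=[0, 1, 2])

# Index labels 1, 2, 3 (no value for label 0, an extra value for label 3)
right = pd.DataFrame({'magic_power': [1, 4, 6]},
                     index=[1, 2, 3])

combined = pd.concat([left, right], axis=1)
print(combined)
```

If the two dataframes are meant to be matched by position instead, resetting the index on both (for example with `reset_index(drop=True)`) before concatenating avoids this kind of mismatch.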