116. Random Forests regression

This tutorial presents an alternative to the linear and multiple linear regression methods described previously at:

https://pythonhealthcare.org/2018/06/14/86-linear-regression-and-multiple-linear-regression/

Random Forests regression may provide a better predictor than multiple linear regression when the relationship between the features (X) and the dependent variable (y) is complex, for example when it is non-linear or depends on interactions between features.
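As a rough illustration of that point, the sketch below compares the two methods on a deliberately non-linear toy problem. The data here is hypothetical (a noisy sine curve generated on the spot, not the life expectancy data used later), and names such as X_demo, y_demo and results are assumptions made for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data: y follows a sine curve in x, plus random noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 10 * np.sin(X_demo[:, 0]) + rng.normal(0, 1, size=500)

# Hold back a quarter of the data so both models are scored on unseen points
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Fit each model and record its root mean squared error on the held-back data
results = {}
for name, estimator in [
        ('linear', LinearRegression()),
        ('forest', RandomForestRegressor(n_estimators=100, random_state=42))]:
    estimator.fit(X_train, y_train)
    predicted_demo = estimator.predict(X_test)
    results[name] = np.sqrt(mean_squared_error(y_test, predicted_demo))

print(results)
```

On data like this a straight line cannot follow the curve, so the forest should show the lower error. On data that really is close to linear, the ranking may well reverse.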

In regression we seek to predict the value of a continuous variable based on either a single variable, or a set of variables.

The example we will look at below seeks to predict life span based on weight, height, physical activity, BMI, gender, and whether the person has a history of smoking.

This example uses a synthetic data set, which will be downloaded.

Load common libraries and data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filename = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress/raw/master/jupyter_notebooks/life_expectancy.csv'
df = pd.read_csv(filename)
df.head()

Out:

weight 	smoker 	physical_activity_scale 	BMI 	height 	male 	life_expectancy
0 	51 	1 	6 	22 	152 	1 	57
1 	83 	1 	5 	34 	156 	1 	36
2 	78 	1 	10 	18 	208 	0 	78
3 	106 	1 	3 	28 	194 	0 	49
4 	92 	1 	7 	23 	200 	0 	67

Fit model

# Extract features (X) and target life expectancy (y)

X = df.values[:, :-1]
y = df.values[:, -1]

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, y)

Out:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
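The output above shows random_state=None, so each fit builds a slightly different forest and the results will vary a little from run to run. The following is a minimal sketch of fixing the seed to make a fit repeatable. It uses hypothetical stand-in data with six feature columns rather than the downloaded file, and the choice of n_estimators=100 and random_state=42 is an assumption for illustration, not a recommendation from the original tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data with six feature columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

# Two models with the same settings and the same seed
model_a = RandomForestRegressor(n_estimators=100, random_state=42)
model_b = RandomForestRegressor(n_estimators=100, random_state=42)

# fit() returns the fitted model, so predict() can be chained onto it
pred_a = model_a.fit(X_demo, y_demo).predict(X_demo)
pred_b = model_b.fit(X_demo, y_demo).predict(X_demo)

# Check whether the two sets of predictions match
print(np.allclose(pred_a, pred_b))
```

More trees generally give steadier predictions at the cost of a longer fit time.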

Predict values, calculate error, and show predicted vs. actual

# Predict values

predicted = model.predict(X)

# Show root mean squared error (RMSE)
# Note: this is calculated on the same data that was used to fit the model

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, predicted)
rmse = np.sqrt(mse)
print(rmse)

Out:
1.4628948576409964
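The error above is measured on the same rows the model was trained on, so it is likely to flatter the model. One common way to get a more realistic estimate is to hold back part of the data for testing. The sketch below shows the idea on hypothetical stand-in data (X_demo and y_demo are made up here so that the block runs on its own). With the tutorial's data, X and y from the earlier cell would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: six features and a noisy, non-linear target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 6))
y_demo = (60 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(0, 2, size=400))

# Keep 25% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

model_demo = RandomForestRegressor(n_estimators=100, random_state=42)
model_demo.fit(X_train, y_train)

# RMSE on the training rows and on the held-back rows
rmse_train = np.sqrt(mean_squared_error(y_train, model_demo.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model_demo.predict(X_test)))
print(rmse_train, rmse_test)
```

The gap between the two numbers gives a rough sense of how much the training-set error understates the error on new data.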

# Plot actual vs predicted

plt.scatter(y, predicted, alpha=0.5)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

[Figure: forest_regression, scatter plot of predicted vs. actual life expectancy]
