This tutorial presents Random Forests regression as an alternative to the linear/multiple regression method described previously.
Random Forests regression may provide a better predictor than multiple linear regression when the relationship between the features (X) and the dependent variable (y) is complex, for example when it is non-linear or involves interactions between features.
In regression we seek to predict the value of a continuous variable based on either a single variable, or a set of variables.
The example we will look at below seeks to predict life span based on weight, height, physical activity, BMI, gender, and whether the person has a history of smoking.
This example uses a synthetic data set, which will be downloaded.
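To illustrate the point about complex relationships, here is a minimal sketch that compares the two methods. It does not use the life expectancy data: the toy data, the names `X_toy`, `y_toy` and `holdout_rmse`, and the settings `n_estimators=100` and `random_state=0` are all assumptions made for illustration. The target is built from a sine term and a squared term, which a straight-line model cannot follow well.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the life expectancy data set):
# y depends on the two features through a sine term and a squared term
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 2))
y_toy = 3 * np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(0, 0.3, size=500)

# Hold back a quarter of the rows so both models are scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

def holdout_rmse(model):
    """Fit on the training rows and return RMSE on the held-back rows."""
    model.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

lin_rmse = holdout_rmse(LinearRegression())
rf_rmse = holdout_rmse(RandomForestRegressor(n_estimators=100, random_state=0))

print('Linear regression RMSE:', lin_rmse)
print('Random forest RMSE:', rf_rmse)
```

On data like this the random forest should show the lower error. When the true relationship is close to linear, the ordering may well reverse, so it is worth trying both methods.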
Load common libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filename = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress/raw/master/jupyter_notebooks/life_expectancy.csv'
df = pd.read_csv(filename)
df.head()

Out:

   weight  smoker  physical_activity_scale  BMI  height  male  life_expectancy
0      51       1                        6   22     152     1               57
1      83       1                        5   34     156     1               36
2      78       1                       10   18     208     0               78
3     106       1                        3   28     194     0               49
4      92       1                        7   23     200     0               67
# Extract features (X) and target life expectancy (y)
X = df.values[:, :-1]
y = df.values[:, -1]

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y)

Out:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
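The settings printed above include n_estimators=10 and random_state=None. These are the defaults of the scikit-learn version used here; more recent versions default to 100 trees. With random_state unset, each fit draws different bootstrap samples, so results can vary a little between runs. The sketch below shows one way to make a fit repeatable. The stand-in data (`X_demo`, `y_demo`) and the chosen values are assumptions for illustration, not part of the original example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data with six features, like the table above
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = 70 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(0, 2, size=200)

# Same settings and same seed: the two forests are built identically
model_a = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)
model_b = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

same = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print('Identical predictions:', same)
```

With the tutorial data, the same arguments could be passed when creating `model`. The error values reported in this post came from the default settings, so they would not be reproduced exactly.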
Predict values, calculate error, and show predicted vs. actual
# Predict values
predicted = model.predict(X)

# Show root mean squared error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, predicted)
rmse = np.sqrt(mse)
print(rmse)

Out:

1.4628948576409964

# Plot actual vs predicted
plt.scatter(y, predicted, alpha=0.5)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
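The error above is measured on the same rows that were used to fit the model, so it probably understates the error to expect on new data. A minimal sketch of one way to check this, assuming 5-fold cross-validation, is shown below. The stand-in data (`X_demo`, `y_demo`), the name `forest` and the settings are assumptions for illustration. With the tutorial data, `X` and `y` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; with the tutorial data, use X and y instead
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 6))
y_demo = 70 + 5 * X_demo[:, 0] - 3 * X_demo[:, 1] ** 2 + rng.normal(0, 2, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0)

# RMSE on the rows the model was fitted to (as in the tutorial)
forest.fit(X_demo, y_demo)
train_rmse = np.sqrt(mean_squared_error(y_demo, forest.predict(X_demo)))

# RMSE on held-out folds: each fold is predicted by a model that never saw it
neg_mse = cross_val_score(forest, X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-neg_mse).mean()

print('Training RMSE:', train_rmse)
print('Cross-validated RMSE:', cv_rmse)
```

The cross-validated figure is usually the larger of the two, and it is the fairer one to compare against other models such as multiple linear regression.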