63. Machine learning: Splitting data into training and test sets

To estimate how accurate a model will be on new data, we test it on data that it has not seen during training. We divide the available data into two sets: a training set that the model learns from, and a test set that is used to measure the accuracy of the model on unseen data. A convenient way to split the data is to use scikit-learn’s train_test_split function, which randomly divides the data between training and test sets. We may specify what proportion to keep for the test set with the test_size argument (0.2 to 0.3 is common).

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

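# Load the iris data
iris = datasets.load_iris()

# Extract the feature data (data) and the classification (target)
X = iris.data
y = iris.target

# Split the data, keeping 30% for the test set.
# random_state is an integer seed (0 here is an arbitrary choice).
# If it is omitted then a different seed will be used each time
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)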
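
The effect of the seed can be checked with a short sketch. It assumes the iris data and a test_size of 0.3; the seed values 42 and 7 and the variable names are arbitrary choices made for illustration, not part of the original example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris features and labels
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Two splits with the same seed (42 is an arbitrary choice)
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)

# Each split is a list: [X_train, X_test, y_train, y_test]
same_seed_identical = all(
    np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A split with a different seed (7 is also arbitrary)
split_c = train_test_split(X, y, test_size=0.3, random_state=7)
different_seed_identical = np.array_equal(split_a[1], split_c[1])

print('Same seed gives identical split:', same_seed_identical)
print('Different seed gives identical test set:', different_seed_identical)
```

Fixing the seed is useful when results need to be reproduced exactly, for example when comparing two models on the same split.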

Let’s look at the size of the data sets:

print('Shape of X:', X.shape)
print('Shape of y:', y.shape)
print('Shape of X_train:', X_train.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_test:', y_test.shape)


Shape of X: (150, 4)
Shape of y: (150,)
Shape of X_train: (105, 4)
Shape of y_train: (105,)
Shape of X_test: (45, 4)
Shape of y_test: (45,)

The 150 samples have been split randomly: 105 (70%) into the training set and 45 (30%) into the test set.
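
To make the idea of a random 70/30 split concrete, the same kind of division can be sketched by hand with NumPy: shuffle the row indices, then slice off the first 30% for the test set. This is only an illustration of the principle and is not claimed to be how scikit-learn implements train_test_split; the seed of 0 and the `_manual` variable names are assumptions made for this sketch.

```python
import numpy as np
from sklearn import datasets

# Load the iris features and labels
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Shuffle the row indices (seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))

# Keep 30% of the rows for the test set
n_test = int(round(0.3 * len(X)))
test_idx = indices[:n_test]
train_idx = indices[n_test:]

# Index features and labels with the same indices so rows stay paired
X_train_manual, X_test_manual = X[train_idx], X[test_idx]
y_train_manual, y_test_manual = y[train_idx], y[test_idx]

print('Shape of X_train_manual:', X_train_manual.shape)
print('Shape of X_test_manual:', X_test_manual.shape)
```

Because the training and test indices come from one shuffled list, no sample can appear in both sets.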

