Creating training and test sets in Python
Updated: Aug 25, 2022
When building models, it is conventional not to use the entire dataset for training. Instead, it is considered good practice to use about 70% or 80% of the data to build the model. But why is that so?
In this tutorial we cover the concepts and logic behind training and test sets.
Importance of training and test set - why are they needed?
Imagine a situation where you have built a model on 100% of your available data. Now you propose this model to the business, but how can you be sure that it will also perform well on future data, which is currently unseen?
Now imagine another situation where you have built 3 or 4 different models using 100% of your data. How can you be sure which of them is the best? It may be possible that the model which currently performs best turns out to be the worst performing model on future data.
A learner is good if it is able to accurately predict unseen data. To check this, we divide the data into training and test sets.
The training set consists of the features and the dependent variable (for supervised learning algorithms, where the values of the dependent variable are already known). Models are trained on the training set and are then used to predict the instances in the test set. The actual labels of the test set are also known, so we can compare them with the predictions to measure the performance of our model.
Several metrics can be used to measure model performance. For classification problems, common choices are accuracy (% of correct classifications), precision, recall, F1 score and area under the curve (AUC). For a numeric dependent variable, mean absolute percentage error (MAPE), root mean square error (RMSE) and mean square error (MSE) are the most common metrics.
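As a rough illustration of the metrics listed above, the following sketch computes several of them with sklearn.metrics. The labels and predictions are made-up values chosen only for the demonstration, and the variable names are assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_percentage_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Hypothetical actual labels and predictions of a binary classifier
y_true_cls = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_cls = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true_cls, y_pred_cls)    # share of correct predictions
precision = precision_score(y_true_cls, y_pred_cls)  # TP / (TP + FP)
recall = recall_score(y_true_cls, y_pred_cls)        # TP / (TP + FN)
f1 = f1_score(y_true_cls, y_pred_cls)                # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)

# Hypothetical actual values and predictions of a regression model
y_true_reg = [100, 200, 300, 400]
y_pred_reg = [110, 190, 330, 380]

mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mape = mean_absolute_percentage_error(y_true_reg, y_pred_reg)  # returned as a fraction, not a %
print(mse, rmse, mape)
```

Note that mean_absolute_percentage_error returns a fraction, so it needs to be multiplied by 100 to read it as a percentage.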
A model may fit the training data well (having low error) but perform poorly on new data. Such a situation is referred to in the literature as overfitting or high variance. When the model has high error even on the training data, it is referred to as underfitting or high bias. A good model strikes a balance between the two: it has reasonably low error on both the training and test sets.
For example, the table below presents the accuracies of 3 different models on the training and test sets. Model 1 has high accuracy on both the training and test sets, so it can be considered a good fit. Model 2, however, does extremely well on the training set while performing poorly on the test set, so we can say that it is overfitting the training data. For model 3, both training and test set accuracies are low. Since the model is unable to perform well even on the training set, it is a good example of underfitting.
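The pattern described above can be explored with a small experiment. The sketch below is only an illustration under assumed settings: it uses synthetic, deliberately noisy data from make_classification and decision trees of different depths, neither of which comes from the original tutorial. An unrestricted tree will typically memorise the training set, while a depth-1 tree is usually too simple to fit even the training data well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a random class to about 20% of the samples
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Compare training and test accuracy for trees of different complexity
results = {}
for name, depth in [("unrestricted tree", None), ("depth-3 tree", 3), ("depth-1 tree", 1)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train accuracy = {results[name][0]:.2f}, "
          f"test accuracy = {results[name][1]:.2f}")
```

A large gap between training and test accuracy points towards overfitting, while low accuracy on both points towards underfitting.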
Thus, to gain confidence in the performance of our models and to decide which model performs best, we use about 70% or 80% of our data to build the model and the remaining 30% or 20% to test it.
The 70% or 80% of the data on which our model is trained is called the training set, while the remaining data is the test set. Note that the training and test sets do not overlap, and observations are generally assigned to them at random.
For time series data, we do not divide the observations randomly but chronologically, i.e., the first 70% or 80% of the observations go into the training set and the remaining observations form the test set.
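A chronological split can be sketched as follows. The daily series here is made up for illustration, and the names ts, train_ts and test_ts are assumptions. The second approach relies on the shuffle=False option of sklearn's train_test_split, which keeps the observations in their original order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily time series with 100 observations
dates = pd.date_range("2022-01-01", periods=100, freq="D")
ts = pd.DataFrame({"value": np.arange(100)}, index=dates)

# Make sure the observations are in chronological order before splitting
ts = ts.sort_index()

# Option 1: slice by position, first 80% for training and the rest for testing
split_point = int(len(ts) * 0.8)
train_ts = ts.iloc[:split_point]
test_ts = ts.iloc[split_point:]

# Option 2: train_test_split without shuffling preserves the order
train_alt, test_alt = train_test_split(ts, test_size=0.2, shuffle=False)

print(train_ts.index.max(), test_ts.index.min())
print(len(train_alt), len(test_alt))
```

In both cases every observation in the training set is older than every observation in the test set.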
We do not use any information from the test set for building the model, scaling the data, etc. That is, variables are standardised and the model is built entirely on the basis of the training set. If any information from the test set is used for building the model, it is known as data leakage.
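One common way to avoid this kind of leakage when standardising is to learn the scaling parameters on the training set only and then apply them to the test set. The sketch below assumes sklearn's StandardScaler and uses randomly generated data; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 observations of 3 numeric features
rng = np.random.default_rng(0)
X_all = rng.normal(loc=50, scale=10, size=(100, 3))

X_train_raw, X_test_raw = train_test_split(X_all, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Means and standard deviations are computed from the training set only
X_train_scaled = scaler.fit_transform(X_train_raw)
# The test set is transformed with the training-set statistics, never refitted
X_test_scaled = scaler.transform(X_test_raw)

print(scaler.mean_)                 # equals the column means of the training set
print(X_test_scaled.mean(axis=0))   # generally close to, but not exactly, zero
```

Calling fit or fit_transform on the full data before splitting would let test-set statistics influence the training data, which is the leakage described above.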
Creating training and test sets in Python
In Python, we can split our data into training and test sets using the train_test_split function from sklearn's model_selection module.
The following is the description of train_test_split taken from sklearn's website:
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
*arrays : the datasets which we want to split into training and test sets. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : the proportion of observations (a float between 0 and 1) which will go into the test set. An integer is interpreted as the absolute number of test observations.
train_size : the proportion of observations (a float between 0 and 1) which will go into the training set. An integer is interpreted as the absolute number of training observations.
Note: Usually we specify only one of train_size or test_size, and the other is set to its complement so that train_size + test_size = 1. If neither is specified, test_size defaults to 0.25.
random_state : an integer seed which you can specify for the reproducibility of the training and test sets, i.e., if you run the same code again in future you get the same training and test sets and hence your results do not change.
shuffle : By default shuffle = True, which means that the observations are shuffled before they are allocated to the training and test sets.
stratify : If stratify is not None, the observations are split in a stratified manner, using the array passed here as the class labels, so that each class has roughly the same proportion in the training and test sets.
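To make the stratify and random_state parameters more concrete, here is a small sketch on made-up, imbalanced labels (90 observations of class 0 and 10 of class 1). The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 observations of class 0 and 10 of class 1
X_imb = np.arange(100).reshape(-1, 1)
y_imb = np.array([0] * 90 + [1] * 10)

# stratify=y_imb keeps the 90:10 class ratio in both resulting sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)
print(np.bincount(y_tr))  # class counts in the training set: [72  8]
print(np.bincount(y_te))  # class counts in the test set: [18  2]

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)  # True
```

Without stratify, a random split of such data could by chance leave very few observations of the rare class in the test set.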
Learning with an example
Let us use the iris dataset that ships with sklearn to create training and test sets:
Here we load the iris dataset from the sklearn library
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
The variable names of the iris dataset are stored in iris.feature_names
Output: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Now we store the independent variables of the iris dataset in X and the dependent variable in y
X = iris.data
y = iris.target
We can see from the shape that X has 150 rows and 4 columns
Output: (150, 4)
Now we split the data into a training set and a test set. Note that we will build our model using the training set and use the test set to check the performance of the algorithm.
By specifying test_size = 0.2, we split our data into an 80% training set and a 20% test set. We specify random_state = 42 for reproducibility, so that the next time we run our code the observations in the training and test sets remain the same, i.e., we get the same results.
Since we have passed both X and y, the function returns four datasets. The first two are the training set and test set for X, which we save as X_train and X_test respectively.
The next two are the training set and test set for y, which we save as y_train and y_test respectively.
We can see that the training set has 120 rows and the test set has 30 rows, which is in the ratio 80:20
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
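As a possible next step, the sketch below trains a model on the training set and checks it on the test set. The choice of LogisticRegression and the max_iter value are assumptions made for illustration; the tutorial itself does not prescribe a particular model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Recreate the same split as in the tutorial
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training set only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict the unseen test set and compare with the actual labels
y_pred = clf.predict(X_test)
train_accuracy = accuracy_score(y_train, clf.predict(X_train))
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Comparing the two accuracies gives a first indication of whether the model generalises or is overfitting the training data.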