When building models, it is conventional not to use the entire dataset. Instead, it is considered good practice to use about 70% or 80% of the data for creating the model. But why is that?

In this tutorial we will cover the concepts and logic behind training and test sets.

To learn how to create training and test sets in Python, you can read our tutorial: __Creating Training and Test sets in Python__

**Importance of training and test sets - why are they needed?**

Imagine a situation where you have built a model on 100% of your available data. Now you propose this model to the business, but how can you be sure that it will also perform well on future data, which is currently unseen?

Now imagine another situation where you have built 3 or 4 different models using 100% of your data. How can you be sure which of them is the best? It may be that the model which currently performs best turns out to be the worst performer on future data.

A learner is good if it can accurately predict unseen data. To assess this, we divide the data into training and test sets.

The training set comprises the features and the dependent variable (for supervised learning algorithms, where the values of the dependent variable are already present). Models are trained on the training set and are then used to predict the instances in the test set. For the test set, the actual labels are available, so we can measure the performance of our model by comparing them with its predictions.

Several metrics can be used to measure model performance. For classification problems these include accuracy (% of correct classifications), precision, recall, F1 score and area under the curve (AUC). For a numeric dependent variable, mean absolute percentage error (MAPE), root mean square error (RMSE) and mean square error (MSE) are the most common metrics.
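
As an illustration, the metrics above can be computed by hand from the actual and predicted values. The following is a minimal sketch using only the Python standard library. The function names `classification_metrics` and `regression_metrics` are hypothetical, the classification part assumes binary labels, and MAPE assumes that no actual value is zero. AUC is left out because it needs predicted scores rather than predicted labels.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MSE, RMSE and MAPE (in %) for numeric targets; assumes no actual value is 0."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / len(errors)
    return {"mse": mse, "rmse": rmse, "mape": mape}


# Made-up labels and predictions, purely for illustration.
print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
print(regression_metrics([100.0, 200.0], [110.0, 180.0]))
```

In practice a library such as scikit-learn provides these metrics ready-made; the hand-written versions are only meant to show what each number measures.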

It is possible for a model to fit the training data well (having low error) but perform poorly on new data. Such a situation is referred to as **overfitting**, or *high variance* in the literature. When the model has high error even on the training data, the situation is referred to as **underfitting** or **high bias**. A good model strikes a balance between the two: it has reasonable or low error on both the training and test sets.

For example, the table below presents the accuracies of 3 different models on the training and test sets. Model 1 has high accuracy on both the training and test sets, so it can be considered a good fit. Model 2, however, does extremely well on the training set while performing poorly on the test set, so we can say that it is overfitting the training data. For model 3, both training and test set accuracies are low. Since the model is unable to perform well even on the training set, it is a good example of underfitting.
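
The reasoning applied to the three models can be sketched as a small rule of thumb. This is only an illustration: the function name `diagnose_fit`, the thresholds `min_acc` and `max_gap`, and the accuracy values passed in are all assumptions made for this sketch, not figures taken from the table.

```python
def diagnose_fit(train_acc, test_acc, min_acc=0.8, max_gap=0.1):
    """Label a model from its training and test accuracy.

    min_acc and max_gap are hypothetical thresholds; sensible values
    depend on the problem at hand.
    """
    if train_acc < min_acc:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if train_acc - test_acc > max_gap:
        # Good on the training set, much worse on unseen data.
        return "overfitting"
    return "good fit"


# Illustrative accuracies only.
print(diagnose_fit(0.92, 0.90))
print(diagnose_fit(0.99, 0.65))
print(diagnose_fit(0.60, 0.58))
```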

Thus, to gain confidence in the performance of our models and to decide which model performs best, we use about 70% or 80% of our data to build the model and the remaining 30% or 20% to test it.

This 70% or 80% of the data on which our model is trained is called the **training set**, while the remaining data is the **test set**.
Note that the training and test sets do not overlap, and observations are generally assigned to them at random.
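
A random, non-overlapping split can be sketched in a few lines of plain Python. The function name `random_split`, the 80% default and the fixed seed are assumptions made for this example. The seed only makes the shuffle reproducible.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Randomly split rows into non-overlapping training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # shuffle positions, not the data itself
    cut = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(100))  # stand-in for 100 observations
train, test = random_split(rows)
print(len(train), len(test))
```

Because every index is used exactly once, no observation can land in both sets.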

For time series data, we do not divide the observations randomly but chronologically, i.e., the first 70% or 80% goes into the training set and the remaining observations form the test set.
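
A chronological split might look like the sketch below. It assumes each observation is a `(timestamp, value)` tuple, and the function name `chronological_split` is hypothetical. The data is sorted by time first, so the test set always contains the most recent observations.

```python
def chronological_split(observations, train_fraction=0.8):
    """Split time-ordered data: earliest part for training, latest for testing.

    Assumes each observation is a (timestamp, value) tuple.
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten made-up daily observations, deliberately given in reverse order.
observations = [(day, day * 2) for day in range(10, 0, -1)]
train, test = chronological_split(observations, train_fraction=0.7)
print([day for day, _ in train])
print([day for day, _ in test])
```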

**Data Leakage**

When building the model, scaling the data and so on, we do not use any information from the test set, i.e., variables are standardised and the model is built entirely on the basis of the training set. If any information from the test set is used for building the model, it is known as **data leakage**.
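
To make this concrete, here is a minimal sketch of standardising a single numeric variable without leakage. The mean and standard deviation are computed from the training set only and then reused, unchanged, on the test set. The function names `fit_standardiser` and `standardise` and the sample values are assumptions made for this example.

```python
import statistics


def fit_standardiser(train_values):
    """Learn the scaling parameters from the training set only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population standard deviation
    return mean, std


def standardise(values, mean, std):
    """Apply parameters learned elsewhere; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


train_values = [2.0, 4.0, 6.0, 8.0]  # made-up training data
test_values = [10.0]                 # made-up test data

mean, std = fit_standardiser(train_values)           # the test set is never seen here
scaled_train = standardise(train_values, mean, std)
scaled_test = standardise(test_values, mean, std)    # reuses the training statistics
print(scaled_train, scaled_test)
```

Computing the mean and standard deviation on the combined data instead would let the test set influence the scaling, which is exactly the leakage described above.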

