Ekta Aggarwal

What are training and test sets?

To make data-driven predictions, many models can be built on the same data. A learner is good if it can accurately predict unseen data. To assess this, we divide the data into training and test sets.

The training set comprises the features and the dependent variable (for supervised learning algorithms, where the values of the dependent variable are already available). Models are trained on the training set and then used to predict the instances in the test set. The actual labels for the test set are also available, so we can use them to measure the efficiency of our model.
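The split described above can be sketched with scikit-learn's `train_test_split`; the Iris dataset and the 80/20 split ratio here are illustrative assumptions, not part of the original text:

```python
# A minimal sketch of dividing data into training and test sets.
# Dataset (Iris) and 80/20 ratio are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows of features and labels

# Hold out 20% of the rows as the test set; the model never
# sees these rows during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 120 rows for training, 30 for testing
```

Fixing `random_state` makes the split reproducible, which is useful when comparing models on the same held-out data.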

Several metrics can be used to measure model performance. For classification problems, common choices are accuracy (the percentage of correct classifications), precision, recall, F1 score, and area under the curve (AUC). For a numeric dependent variable, mean absolute percentage error (MAPE), root mean square error (RMSE) and mean square error (MSE) are the most common metrics.
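A few of the metrics above can be computed with `sklearn.metrics`; the labels and predictions below are made-up values purely for illustration:

```python
# Sketch of computing classification and regression metrics.
# The y_true / y_pred values are illustrative only.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification: true vs predicted class labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))  # fraction of correct labels
print("F1 score:", f1_score(y_true, y_pred))        # harmonic mean of precision and recall

# Regression: true vs predicted numeric values
y_true_num = np.array([3.0, 5.0, 2.5])
y_pred_num = np.array([2.5, 5.0, 3.0])
mse = mean_squared_error(y_true_num, y_pred_num)
print("MSE:", mse, "RMSE:", np.sqrt(mse))  # RMSE is the square root of MSE
```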

It is possible for a model to fit the training data well (with low error) yet perform poorly on new data. This situation is referred to in the literature as overfitting, or high variance. When the model has high error even on the training data, the situation is referred to as underfitting, or high bias. A good model balances the two, achieving reasonably low error on both the training and test sets.
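One common way to see this trade-off in practice is to vary a model's capacity and compare its training and test scores. The sketch below uses a decision tree of varying depth on a synthetic dataset; the dataset, depths, and split ratio are all illustrative assumptions:

```python
# Sketch of underfitting vs overfitting by varying decision-tree depth.
# Dataset, depths, and 70/30 split are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for depth in (1, 5, None):  # very shallow, moderate, fully grown
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```

A shallow tree scores poorly on both sets (underfitting), while a fully grown tree memorizes the training set perfectly but shows a gap between training and test accuracy (overfitting).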

For example, the table below summarizes the training and test accuracies of three different models. For model 1, both training and test accuracy are high, so it can be considered a good fit. Model 2 does extremely well on the training set but performs poorly on the test set, so we can say it is overfitting the training data. For model 3, both accuracies are low; since the model cannot perform well even on the training set, it is a clear example of underfitting.

Model     Training accuracy   Test accuracy   Diagnosis
Model 1   High                High            Good fit
Model 2   High                Low             Overfitting
Model 3   Low                 Low             Underfitting