What are training and test sets?
Many different models can be built on the same data to make data-driven predictions, so we need a way to compare them. A learner is good if it can accurately predict unseen data. To estimate this ability, we divide the data into a training set and a test set.
The training set comprises the features and the dependent variable (for supervised learning algorithms, where the values of the dependent variable are known). Models are trained on the training set and then used to predict the instances in the test set. Because the actual labels of the test set are also known, we can use them to measure how well the model performs.
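A minimal sketch of such a split using scikit-learn's `train_test_split`; the Iris dataset, the 80/20 split ratio, and the `random_state` value are illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features X and dependent variable y (labels are known: supervised learning)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the instances as the test set; the model never sees them
# during training, so they stand in for "unseen data".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Fixing `random_state` makes the split reproducible, which matters when comparing several models on the same held-out data.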
Several metrics can be used to measure model performance. For classification problems, common choices are accuracy (the percentage of correct classifications), precision, recall, F1 score, and area under the curve (AUC). For a numeric dependent variable, the most common metrics are mean absolute percentage error (MAPE), root mean square error (RMSE), and mean square error (MSE).
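The metrics above can be computed with `sklearn.metrics`; the small label vectors below are made-up examples for illustration only.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error,
                             mean_absolute_percentage_error)

# Classification: actual vs. predicted labels on the test set
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # fraction of correct classifications
print(precision_score(y_true, y_pred))  # of predicted positives, how many are right
print(recall_score(y_true, y_pred))     # of actual positives, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

# Regression: numeric dependent variable
r_true = np.array([100.0, 200.0, 300.0])
r_pred = np.array([110.0, 190.0, 330.0])
mse = mean_squared_error(r_true, r_pred)
rmse = np.sqrt(mse)                                    # RMSE is the square root of MSE
mape = mean_absolute_percentage_error(r_true, r_pred)  # mean of |error| / |actual|
print(mse, rmse, mape)
```

Which metric to prefer depends on the problem; for example, MAPE is scale-free, while RMSE penalizes large errors more heavily than small ones.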
It is possible for a model to fit the training data well (with low error) yet perform poorly on new data. This situation is referred to in the literature as overfitting, or high variance. When the model has high error even on the training data, the situation is referred to as underfitting, or high bias. A good model avoids both extremes, achieving reasonably low error on both the training and the test set.
For example, the table below summarizes the training and test set accuracies of three different models. Model 1 has high accuracy on both the training and the test set, so it can be considered a good fit. Model 2 does extremely well on the training set while performing poorly on the test set, so we can say it is overfitting the training data. For model 3, both training and test set accuracies are low; since the model cannot perform well even on the training set, it is a good example of underfitting.

Model     Training accuracy   Test accuracy   Diagnosis
Model 1   High                High            Good fit
Model 2   High                Low             Overfitting
Model 3   Low                 Low             Underfitting
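The three regimes can be reproduced by comparing training and test accuracy of models with different capacity. A sketch using decision trees of varying depth on synthetic, deliberately noisy data (all dataset parameters here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data; flip_y injects label noise so that
# a model memorizing the training set cannot generalize perfectly.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for depth in (1, 4, None):  # shallow, moderate, unrestricted capacity
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(depth,
          model.score(X_train, y_train),  # training accuracy
          model.score(X_test, y_test))    # test accuracy
```

Typically the depth-1 stump scores modestly on both sets (underfitting), while the unrestricted tree reaches 100% training accuracy but noticeably lower test accuracy (overfitting); an intermediate depth tends to balance the two.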