Performance Metrics for Classification Problems

Ekta Aggarwal
Aug 11, 2022

This tutorial explains in detail the performance metrics that can be used to evaluate a model.

Its scope is limited to performance metrics for classification algorithms.

In this tutorial we will learn about the following concepts and performance metrics:

Confusion Matrix
Accuracy
Misclassification Rate
Precision
Recall or Sensitivity
Specificity
F1 Score
Area Under the Curve (AUC)

Confusion Matrix

A confusion matrix is a 2-D representation of our predictions and actuals. It shows the number of instances where we have predicted a negative (0) or a positive (1), broken down by the actual class of those instances.

In this confusion matrix the rows represent the actual values, while the columns represent the predicted values.

True Positive (TP): It is the number of instances where we have predicted a 'positive' and our prediction was correct (i.e., true)

In the above confusion matrix TP = 48

True Negative (TN): It is the number of instances where we have predicted a 'negative' and our prediction was correct (i.e., true)

In the above confusion matrix TN = 49

False Positive (FP): It is the number of instances where we have predicted a 'positive' and our prediction was incorrect (i.e., false)

In the above confusion matrix FP = 2

False Negative (FN): It is the number of instances where we have predicted a 'negative' and our prediction was incorrect (i.e., false)

In the above confusion matrix FN = 1
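The four counts can also be tallied directly from the labels. The snippet below is a minimal sketch in plain Python; the function name `confusion_counts` and the label lists are hypothetical, with the lists constructed only so that they reproduce the counts quoted above.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, prediction correct
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, prediction correct
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, prediction incorrect
        else:
            fn += 1  # predicted negative, prediction incorrect
    return tp, tn, fp, fn


# Hypothetical labels: 49 actual positives followed by 51 actual negatives.
y_true = [1] * 49 + [0] * 51
y_pred = [1] * 48 + [0] * 1 + [1] * 2 + [0] * 49

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 48 49 2 1
```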

Trick: How to remember these definitions of TP, FP, TN, FN?

When I was a student I was always overwhelmed by these definitions and got confused. The best way to remember the definitions of TP, FP, TN and FN is:

The second letter represents what we predicted.

e.g. 'P' in 'TP' represents a positive prediction, while 'N' in 'TN' represents a negative prediction

The first letter represents whether our prediction was correct or not.

eg. 'T' in 'TP' represents the correct (true) prediction, while 'F' in 'FP' represents the incorrect (false) prediction

Thus, TP means we are predicting a positive, and our prediction is true.

FP means we are predicting a positive, and our prediction is false.

TN means we are predicting a negative, and our prediction is true.

FN means we are predicting a negative, and our prediction is false.

Accuracy

Accuracy is defined as the proportion of observations that our model has classified / predicted correctly. Thus for accuracy we refer to the True Positives (TP) and True Negatives (TN). It ranges between 0 and 1, and we would like our model's accuracy to be as high as possible.

Accuracy = (TP + TN) / (TP+FP+TN+FN)

For the above confusion matrix accuracy = (48 + 49) / (48+ 2 + 49 + 1) = 0.97

Misclassification Rate

Misclassification Rate is defined as the proportion of observations that our model has classified / predicted incorrectly. Thus for misclassification rate we refer to the False Positives (FP) and False Negatives (FN). It ranges between 0 and 1, and we would like our model's misclassification rate to be as low as possible.

Misclassification Rate = (FP + FN) / (TP+FP+TN+FN) = 1 - Accuracy

For the above confusion matrix misclassification rate = (2 + 1) / (48 + 2 + 49 + 1) = 0.03
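As a quick check of the two formulas, here is a small sketch in plain Python. The function names are illustrative and not taken from any library; the counts are the ones from the confusion matrix above.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all observations that were predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def misclassification_rate(tp, tn, fp, fn):
    """Share of all observations that were predicted incorrectly."""
    return (fp + fn) / (tp + fp + tn + fn)


acc = accuracy(tp=48, tn=49, fp=2, fn=1)
err = misclassification_rate(tp=48, tn=49, fp=2, fn=1)
print(round(acc, 2), round(err, 2))  # 0.97 0.03
```

The two values always add up to 1, which is why the misclassification rate can equally be computed as 1 - Accuracy.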

Precision

Precision is defined as the proportion of predicted positives that are actually positive, i.e., how many positives are predicted correctly out of the total number of positives predicted.

Thus for precision we refer to the True Positives (TP) and False Positives (FP).

Note that TP + FP represents the total number of positives predicted by our model.

Precision= (TP ) / (TP+FP)

For the above confusion matrix precision = (48) / (48+ 2) = 0.96

Recall (or Sensitivity)

Recall (or Sensitivity) is defined as the proportion of actual positives in our data that are predicted correctly.

Thus for recall we refer to the True Positives (TP) and False Negatives (FN).

Note that TP + FN represents the actual number of positives in our data

Recall (or Sensitivity)= (TP ) / (TP+FN)

For the above confusion matrix recall or sensitivity = (48) / (48+ 1) = 0.98

Specificity

Specificity is defined as the proportion of actual negatives in our data that are predicted correctly.

Thus for specificity we refer to the True Negatives (TN) and False Positives (FP).

Note that TN + FP represents the actual number of negatives in our data

Specificity = (TN ) / (TN + FP)

For the above confusion matrix specificity = (49) / (49 + 2) = 0.96
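Precision, recall and specificity differ only in which counts go into the denominator, which the following sketch tries to make explicit. The helper `safe_ratio` and its choice to return 0.0 when the denominator is zero are assumptions made for this illustration, not part of the definitions above.

```python
def safe_ratio(numerator, denominator):
    # Assumption: report 0.0 when the metric is undefined (denominator is 0).
    return numerator / denominator if denominator else 0.0


def precision(tp, fp):
    # Denominator: everything the model predicted as positive.
    return safe_ratio(tp, tp + fp)


def recall(tp, fn):
    # Denominator: everything that is actually positive.
    return safe_ratio(tp, tp + fn)


def specificity(tn, fp):
    # Denominator: everything that is actually negative.
    return safe_ratio(tn, tn + fp)


tp, tn, fp, fn = 48, 49, 2, 1
print(round(precision(tp, fp), 2))    # 0.96
print(round(recall(tp, fn), 2))       # 0.98
print(round(specificity(tn, fp), 2))  # 0.96
```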

F1 Score

To find the best model sometimes accuracy is not enough.

Let us consider a model where we are trying to predict whether a person has cancer (Positive: 1) or does not have cancer (Negative: 0) and our confusion matrix looks as follows:

For the above confusion matrix (TP = 1, TN = 85, FP = 0, FN = 14) accuracy = (1 + 85) / (1 + 0 + 85 + 14) = 0.86, which seems to be a great accuracy.

Specificity = 85 / (85 + 0) = 1, i.e., we are predicting 100% of non-cancerous people correctly.

But if we look closely, out of the 15 people who have cancer we are predicting only one of them as having cancer. Thus, our recall or sensitivity is only 1/15 (about 0.07), which is very dangerous.

Similarly, a model with low precision or specificity can also be dangerous. Thus sometimes accuracy alone might not be enough to judge the model performance.
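The gap between accuracy and recall in this cancer example can be reproduced in a few lines. This is only a sketch; the counts are read off the numbers quoted in the text (15 people with cancer, 85 without).

```python
# Counts from the cancer example: 1 of 15 cancer cases caught, no false alarms.
tp, fn = 1, 14
tn, fp = 85, 0

accuracy = (tp + tn) / (tp + fp + tn + fn)
specificity = tn / (tn + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} specificity={specificity:.2f} recall={recall:.2f}")
# accuracy=0.86 specificity=1.00 recall=0.07
```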

Suppose we have three models:

Model 1 with precision = 0.5 and recall = 0.5

Model 2 with precision = 0.7 and recall = 0.1

Model 3 with precision = 0.02 and recall = 1

Which of these models would you choose?

As a rule of thumb, we avoid models where either precision or recall is close to 0. Thus, we will opt for model 1.

Mathematically speaking, we use the F1 score, which is the harmonic mean of precision and recall, to find an ideal model. It is calculated as

F1 score = 2 * precision * recall / (precision + recall)

Why can't we simply take an average?

If we take a simple average of precision and recall, then in the below image we can see that model 3 would be considered the best (averages of 0.5, 0.4 and 0.51 for models 1, 2 and 3). However, its precision is close to 0, i.e., only 2% of the patients we predict as having cancer actually have it.

With the F1 score (0.5, 0.175 and about 0.04 for models 1, 2 and 3), model 1 is ranked as the best while model 3 is ranked as the worst.
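The comparison between a simple average and the F1 score for the three models can be sketched as follows. The function `f1_score` here is a hand-written helper for illustration, and returning 0.0 when both precision and recall are 0 is an assumption of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: treat the undefined case as the worst score
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for the three models discussed in the text.
models = {
    "model 1": (0.5, 0.5),
    "model 2": (0.7, 0.1),
    "model 3": (0.02, 1.0),
}

for name, (p, r) in models.items():
    print(name, "average =", round((p + r) / 2, 3), "F1 =", round(f1_score(p, r), 3))

best_by_average = max(models, key=lambda n: sum(models[n]) / 2)
best_by_f1 = max(models, key=lambda n: f1_score(*models[n]))
worst_by_f1 = min(models, key=lambda n: f1_score(*models[n]))
print(best_by_average, best_by_f1, worst_by_f1)
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot score well on F1 by maximising only one of precision or recall.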

Area Under the Curve (AUC)

Another way to check the performance of an algorithm is by plotting the Receiver Operating Characteristic curve (ROC curve).

In an ROC curve, the x-axis represents the False Positive Rate (FPR) = 1 - specificity = FP / (TN + FP), and the y-axis represents the True Positive Rate (TPR), i.e., sensitivity = TP / (TP + FN).

We obtain the curve by computing the FPR and TPR at different classification thresholds. The area under this curve (AUC) is then calculated.

An AUC of 0.5 corresponds to random guessing, so if AUC < 0.5 the model is considered to be bad. However, if we have 2 models with AUC > 0.5 then the model with higher AUC is preferred.
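To make the construction concrete, here is a minimal sketch in plain Python that sweeps the threshold over the predicted scores, collects the (FPR, TPR) points, and estimates the area with the trapezoidal rule. The function names, the toy labels and the scores are all made up for illustration, and the sketch assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: actual labels and the model's predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(auc)  # 0.75
```

In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of hand-written code like this.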

The image below is courtesy of AnalyticsVidhya.