Part of the Machine Learning Intro series:

  1. How to Evaluate the Performance of ML Models (this post!)
  2. Entropy, KL Divergence and Cross Entropy

How to Evaluate the Performance of ML Models

Machine learning has two basic tasks: regression and classification. This post introduces how to evaluate the performance of ML models for each of these tasks.


Task 1: Regression

Preliminary concepts:

In regression, model predictions are continuous values. Usually, $y\in\mathbb{R}$.

Notation: $y_i$ is the actual value of the $i$th sample, $\hat{y_i}$ is the predicted value of the $i$th sample, $n$ is the number of samples.

1. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)

MSE is the arithmetic mean of the squares of the errors. $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2 $$ RMSE is the square root version of MSE. It is the standard deviation of the residuals (prediction errors). $$ RMSE = \sqrt{MSE} $$
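
A minimal NumPy sketch of both metrics (the function names and sample arrays here are just for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Square root of MSE, expressed in the same units as y
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))   # 0.375
print(rmse(y_true, y_pred))  # ~0.612
```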

2. Mean Absolute Error (MAE)

MAE is the average of the absolute differences between predictions and actual values. $$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y_i}| $$
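
A corresponding sketch, reusing the same illustrative arrays as above:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute residuals; penalizes large errors less than MSE
    return np.mean(np.abs(y_true - y_pred))

print(mae(np.array([3.0, -0.5, 2.0, 7.0]),
          np.array([2.5, 0.0, 2.0, 8.0])))  # 0.5
```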

3. Mean Absolute Percentage Error (MAPE)

MAPE is the average of the absolute percentage difference between predictions and actual values. $$ MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i-\hat{y_i}|}{|y_i|} $$ It can be sensitive to outliers; in particular, it blows up when $y_i$ is close to zero and is undefined when $y_i = 0$.
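
A sketch under the assumption that no actual value is zero:

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute error relative to the actual value;
    # assumes no element of y_true is zero
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mape(np.array([100.0, 200.0, 50.0]),
           np.array([110.0, 180.0, 55.0])))  # 0.1
```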


Task 2: Classification

Preliminary concepts:

The test set is assumed to have a limited, known set of labels, say $y_i\in\mathbb{C}$, where $\mathbb{C}$ is the set of all possible labels.

Notation: $y_i$ is the actual label of the $i$th sample, $\hat{y_i}$ is the predicted label of the $i$th sample, $n$ is the number of samples.

1. TP, TN, FP and FN

TP, TN, FP and FN divide the predictions of a binary classification problem into four categories.

  • TP: True Positive, the number of positive samples that are correctly predicted as positive. $$ TP = N(\text{Pred Positive}\cap\text{Actual Positive}) $$
  • TN: True Negative, the number of negative samples that are correctly predicted as negative. $$ TN = N(\text{Pred Negative}\cap\text{Actual Negative}) $$
  • FP: False Positive, the number of negative samples that are incorrectly predicted as positive. $$ FP = N(\text{Pred Positive}\cap\text{Actual Negative}) $$
  • FN: False Negative, the number of positive samples that are incorrectly predicted as negative. $$ FN = N(\text{Pred Negative}\cap\text{Actual Positive}) $$

The relationship between TP, TN, FP and FN is shown in the following table.

|               | Actual Positive | Actual Negative |
| ------------- | --------------- | --------------- |
| Pred Positive | TP              | FP              |
| Pred Negative | FN              | TN              |
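
A sketch of these four counts for 0/1-encoded labels (the label encoding and the helper name `confusion_counts` are assumptions for illustration):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    # Binary labels: 1 = positive, 0 = negative
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```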

2. Accuracy

Accuracy is the proportion of correct predictions among all predictions. $$ Accuracy = \begin{cases} \frac{TP+TN}{TP+TN+FP+FN} & \text{Binary Classification}\\\\ \frac{\sum_{i=1}^{n}I(y_i=\hat{y_i})}{n} & \text{Multi-class Classification} \end{cases} $$
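
A sketch that works for both binary and multi-class labels:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label equals the true label
    return np.mean(y_true == y_pred)

print(accuracy(np.array([0, 1, 2, 2]),
               np.array([0, 2, 2, 2])))  # 0.75
```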

Top-k Accuracy

In classification, the model is supposed to predict probabilities, or scores, for each class. Top-k accuracy counts a prediction as correct if the true label is among the k classes with the highest predicted scores.
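
A sketch assuming a score matrix of shape (n_samples, n_classes):

```python
import numpy as np

def top_k_accuracy(y_true, scores, k):
    # scores[i, c] is the model's score for class c on sample i;
    # a sample counts as correct if its true label is among the
    # k highest-scoring classes
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return np.mean(hits)

scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3]])
print(top_k_accuracy(np.array([2, 0]), scores, k=2))  # 1.0
```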

3. Precision

Precision is the proportion of correct positive predictions among all positive predictions. $$ Precision = \frac{TP}{TP+FP} $$ In multi-class classification, precision is calculated per class (treating that class as positive and the rest as negative), and the average over all classes gives the precision of the model.
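
A binary-case sketch built on the counts from above:

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive;
    # returns 0 when there are no positive predictions
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(precision(tp=2, fp=1))  # ~0.667
```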

4. Recall

Recall is the proportion of correct positive predictions among all actual positive samples. $$ Recall = \frac{TP}{TP+FN} $$ In multi-class classification, recall is likewise computed per class and averaged.
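
And the matching sketch:

```python
def recall(tp, fn):
    # Fraction of actual positives that the model finds;
    # returns 0 when there are no actual positives
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(recall(tp=2, fn=1))  # ~0.667
```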

5. F1 Score

F1 score is the harmonic mean of precision and recall. $$ F1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = \frac{2\times Precision\times Recall}{Precision+Recall} $$ F1 score considers both precision and recall. It is a better metric than accuracy when the dataset is imbalanced.
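
A sketch that combines the two previous metrics:

```python
def f1_score(p, r):
    # Harmonic mean of precision p and recall r; 0 if both are 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f1_score(0.8, 0.5))  # ~0.615
```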


Task 3: To be continued…