Part of the Machine Learning Intro series:

  1. How to Evaluate the Performance of ML Models (this post!)
  2. Entropy, KL Divergence and Cross Entropy

How to Evaluate the Performance of ML Models

Machine learning has two basic tasks: regression and classification. This post introduces how to evaluate the performance of ML models for each of these tasks.


Task 1: Regression

Preliminary concepts:

In regression, model predictions are continuous values. Usually, $y\in\mathbb{R}$.

Notation: $y_i$ is the actual value of the $i$th sample, $\hat{y_i}$ is the predicted value of the $i$th sample, $n$ is the number of samples.

1. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)

MSE is the arithmetic mean of the squares of the errors. $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2 $$ RMSE is the square root version of MSE. It is the standard deviation of the residuals (prediction errors). $$ RMSE = \sqrt{MSE} $$
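
A minimal NumPy sketch of both metrics (the function names and sample arrays here are just for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Square root of MSE, expressed in the same units as y
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))   # 0.375
print(rmse(y_true, y_pred))  # ~0.612
```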

2. Mean Absolute Error (MAE)

MAE is the average of the absolute differences between predictions and actual values. $$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y_i}| $$
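
A corresponding sketch, reusing the same illustrative arrays as above:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute residuals; penalizes large errors less than MSE
    return np.mean(np.abs(y_true - y_pred))

print(mae(np.array([3.0, -0.5, 2.0, 7.0]),
          np.array([2.5, 0.0, 2.0, 8.0])))  # 0.5
```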

3. Mean Absolute Percentage Error (MAPE)

MAPE is the average of the absolute percentage difference between predictions and actual values. $$ MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i-\hat{y_i}|}{|y_i|} $$ It can be sensitive to outliers; in particular, it blows up when $y_i$ is close to zero and is undefined when $y_i = 0$.
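
A sketch under the assumption that no actual value is zero:

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute error relative to the actual value;
    # assumes no element of y_true is zero
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mape(np.array([100.0, 200.0, 50.0]),
           np.array([110.0, 180.0, 55.0])))  # 0.1
```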


Task 2: Classification

Preliminary concepts:

The test set is assumed to have a limited, known set of labels, say $y_i\in\mathbb{C}$, where $\mathbb{C}$ is the set of all possible labels.

Notation: $y_i$ is the actual label of the $i$th sample, $\hat{y_i}$ is the predicted label of the $i$th sample, $n$ is the number of samples.

1. TP, TN, FP and FN

TP, TN, FP and FN divide the predictions of a binary classification problem into four categories.

  • TP: True Positive, the number of positive samples that are correctly predicted as positive. $$ TP = N(\text{Pred Positive}\cap\text{Actual Positive}) $$
  • TN: True Negative, the number of negative samples that are correctly predicted as negative. $$ TN = N(\text{Pred Negative}\cap\text{Actual Negative}) $$
  • FP: False Positive, the number of negative samples that are incorrectly predicted as positive. $$ FP = N(\text{Pred Positive}\cap\text{Actual Negative}) $$
  • FN: False Negative, the number of positive samples that are incorrectly predicted as negative. $$ FN = N(\text{Pred Negative}\cap\text{Actual Positive}) $$

The relationship between TP, TN, FP and FN is shown in the following table.

|               | Actual Positive | Actual Negative |
| ------------- | --------------- | --------------- |
| Pred Positive | TP              | FP              |
| Pred Negative | FN              | TN              |
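
A sketch of these four counts for 0/1-encoded labels (the label encoding and the helper name `confusion_counts` are assumptions for illustration):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    # Binary labels: 1 = positive, 0 = negative
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```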

2. Accuracy

Accuracy is the proportion of correct predictions among all predictions. $$ Accuracy = \begin{cases} \frac{TP+TN}{TP+TN+FP+FN} & \text{Binary Classification}\\\\ \frac{\sum_{i=1}^{n}I(y_i=\hat{y_i})}{n} & \text{Multi-class Classification} \end{cases} $$
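
A sketch that works for both binary and multi-class labels:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label equals the true label
    return np.mean(y_true == y_pred)

print(accuracy(np.array([0, 1, 2, 2]),
               np.array([0, 2, 2, 2])))  # 0.75
```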

Top-k Accuracy

In classification, the model is supposed to predict probabilities, or scores, for each class. Top-k accuracy counts a prediction as correct if the true label is among the k classes with the highest predicted scores.
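
A sketch assuming a score matrix of shape (n_samples, n_classes):

```python
import numpy as np

def top_k_accuracy(y_true, scores, k):
    # scores[i, c] is the model's score for class c on sample i;
    # a sample counts as correct if its true label is among the
    # k highest-scoring classes
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return np.mean(hits)

scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3]])
print(top_k_accuracy(np.array([2, 0]), scores, k=2))  # 1.0
```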

3. Precision

Precision is the proportion of correct positive predictions among all positive predictions. $$ Precision = \frac{TP}{TP+FP} $$ In multi-class classification, precision is calculated per class (treating that class as positive and the rest as negative), and the average over all classes gives the precision of the model.
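
A binary-case sketch built on the counts from above:

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive;
    # returns 0 when there are no positive predictions
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(precision(tp=2, fp=1))  # ~0.667
```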

4. Recall

Recall is the proportion of correct positive predictions among all actual positive samples. $$ Recall = \frac{TP}{TP+FN} $$ In multi-class classification, recall is likewise computed per class and averaged.
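
And the matching sketch:

```python
def recall(tp, fn):
    # Fraction of actual positives that the model finds;
    # returns 0 when there are no actual positives
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(recall(tp=2, fn=1))  # ~0.667
```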

5. F1 Score

F1 score is the harmonic mean of precision and recall. $$ F1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = \frac{2\times Precision\times Recall}{Precision+Recall} $$ F1 score considers both precision and recall. It is a better metric than accuracy when the dataset is imbalanced.
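
A sketch that combines the two previous metrics:

```python
def f1_score(p, r):
    # Harmonic mean of precision p and recall r; 0 if both are 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f1_score(0.8, 0.5))  # ~0.615
```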


Task 3: To be continued…