Part of the Machine Learning Intro series:
- How to Evaluate the Performance of ML Models (this post!)
- Entropy, KL Divergence and Cross Entropy
How to Evaluate the Performance of ML Models
Machine learning has two basic tasks: regression and classification. This post introduces common metrics for evaluating the performance of ML models on each of these tasks.
Task 1: Regression
Preliminary concepts:
In regression, model predictions are continuous values. Usually, $y\in\mathbb{R}$.
Notation: $y_i$ is the actual value of the $i$th sample, $\hat{y_i}$ is the predicted value of the $i$th sample, $n$ is the number of samples.
1. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
MSE is the arithmetic mean of the squared errors. $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2 $$ RMSE is the square root of MSE. It has the same units as the target and is often interpreted as the standard deviation of the residuals (prediction errors). $$ RMSE = \sqrt{MSE} $$
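As a quick check, here is a minimal numpy sketch of both formulas (the sample arrays are made up for illustration):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])     # actual values y_i (made up)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # predicted values

mse = np.mean((y - y_hat) ** 2)  # mean of squared errors
rmse = np.sqrt(mse)              # same units as y
print(mse, rmse)  # 0.375 0.612...
```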
2. Mean Absolute Error (MAE)
MAE is the average of the absolute differences between predictions and actual values. $$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y_i}| $$
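With the same toy arrays, MAE follows directly from the definition:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])     # actual values (made up)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # predicted values

mae = np.mean(np.abs(y - y_hat))  # mean of absolute errors
print(mae)  # 0.5
```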
3. Mean Absolute Percentage Error (MAPE)
MAPE is the average of the absolute percentage differences between predictions and actual values. $$ MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i-\hat{y_i}|}{|y_i|} $$ It is undefined when any $y_i = 0$, blows up when actual values are close to zero, and can be sensitive to outliers.
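A minimal sketch, assuming all actual values are nonzero (the arrays are made up):

```python
import numpy as np

y = np.array([100.0, 50.0, 200.0])      # actual values, all nonzero (made up)
y_hat = np.array([110.0, 45.0, 180.0])  # predictions

mape = np.mean(np.abs(y - y_hat) / np.abs(y))
print(mape)  # 0.1, i.e. a 10% average relative error
```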
Task 2: Classification
Preliminary concepts:
In classification, the test set is assumed to have a finite, known label set: $y_i\in\mathbb{C}$, where $\mathbb{C}$ is the set of all possible labels.
Notation: $y_i$ is the actual label of the $i$th sample, $\hat{y_i}$ is the predicted label of the $i$th sample, $n$ is the number of samples.
1. TP, TN, FP and FN
TP, TN, FP and FN classify the predictions into 4 categories for a binary classification problem.
- TP: True Positive, the number of positive samples that are correctly predicted as positive. $$ TP = N(\text{Pred Positive}\cap\text{Actual Positive}) $$
- TN: True Negative, the number of negative samples that are correctly predicted as negative. $$ TN = N(\text{Pred Negative}\cap\text{Actual Negative}) $$
- FP: False Positive, the number of negative samples that are incorrectly predicted as positive. $$ FP = N(\text{Pred Positive}\cap\text{Actual Negative}) $$
- FN: False Negative, the number of positive samples that are incorrectly predicted as negative. $$ FN = N(\text{Pred Negative}\cap\text{Actual Positive}) $$
The relationship between TP, TN, FP and FN is shown in the following table.
|  | Actual Positive | Actual Negative |
| --- | --- | --- |
| Pred Positive | TP | FP |
| Pred Negative | FN | TN |
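Counting the four quantities from two label arrays is straightforward in numpy; a minimal sketch with made-up binary labels (1 = positive, 0 = negative):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])      # actual labels (made up)
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # predicted labels

tp = np.sum((y_hat == 1) & (y == 1))  # 3: predicted positive, actually positive
tn = np.sum((y_hat == 0) & (y == 0))  # 3: predicted negative, actually negative
fp = np.sum((y_hat == 1) & (y == 0))  # 1: predicted positive, actually negative
fn = np.sum((y_hat == 0) & (y == 1))  # 1: predicted negative, actually positive
```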
2. Accuracy
Accuracy is the proportion of correct predictions among all predictions. $$ Accuracy = \begin{cases} \frac{TP+TN}{TP+TN+FP+FN} & \text{Binary Classification}\\\\ \frac{\sum_{i=1}^{n}I(y_i=\hat{y_i})}{n} & \text{Multi-class Classification} \end{cases} $$
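The multi-class formula is a one-liner in numpy, and it covers the binary case too (labels are made up):

```python
import numpy as np

y = np.array([0, 2, 1, 2, 0])      # actual labels (made up)
y_hat = np.array([0, 1, 1, 2, 0])  # predicted labels

accuracy = np.mean(y == y_hat)     # fraction of exact matches
print(accuracy)  # 0.8
```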
Top-k Accuracy
In classification, the model typically outputs a probability, or score, for each class. Top-k accuracy counts a prediction as correct if the true label is among the k classes with the highest predicted scores.
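A minimal sketch using np.argsort, with a made-up score matrix over 4 classes:

```python
import numpy as np

# Each row holds the model's scores for classes 0..3 (made-up values).
scores = np.array([[0.6, 0.2, 0.1, 0.1],
                   [0.3, 0.1, 0.5, 0.1],
                   [0.4, 0.3, 0.2, 0.1]])
y = np.array([2, 2, 0])  # true labels

k = 2
topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k highest scores per row
topk_acc = np.mean([y[i] in topk[i] for i in range(len(y))])
print(topk_acc)  # 2/3: rows 1 and 2 have the true label in their top 2
```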
3. Precision
Precision is the proportion of correct positive predictions among all positive predictions. $$ Precision = \frac{TP}{TP+FP} $$ In multi-class classification, precision is computed per class and then averaged over classes (macro-averaging); a sketch follows the recall section below.
4. Recall
Recall is the proportion of correct positive predictions among all actual positive samples. $$ Recall = \frac{TP}{TP+FN} $$ In multi-class classification, recall is macro-averaged over classes in the same way.
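A minimal sketch of macro-averaged precision and recall for the multi-class case, with made-up labels (it treats each class in turn as "positive"):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 2, 2])      # actual labels (made up)
y_hat = np.array([0, 1, 1, 1, 2, 0])  # predicted labels

precisions, recalls = [], []
for c in np.unique(y):
    tp = np.sum((y_hat == c) & (y == c))
    fp = np.sum((y_hat == c) & (y != c))
    fn = np.sum((y_hat != c) & (y == c))
    precisions.append(tp / (tp + fp))  # assumes class c was predicted at least once
    recalls.append(tp / (tp + fn))

print(np.mean(precisions), np.mean(recalls))  # macro-averaged precision and recall
```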
5. F1 Score
F1 score is the harmonic mean of precision and recall. $$ F1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = \frac{2\times Precision\times Recall}{Precision+Recall} $$ F1 score considers both precision and recall. It is a better metric than accuracy when the dataset is imbalanced.
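A tiny sketch of why the harmonic mean matters: unlike the arithmetic mean, it is dragged toward the smaller of the two values, so a model cannot score well by maximizing one metric at the expense of the other (the input values are made up):

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # 0.9  -- balanced precision and recall
print(f1(0.9, 0.1))  # 0.18 -- a large gap is punished heavily
```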