How do you evaluate the performance of a machine learning model? What are the common metrics?
Okay, no problem. Evaluating a machine learning model can seem complicated, but the core idea is simple. Think of it like grading a student's exam: you can't just look at the total score; you also want to know whether they're weak in certain subjects and whether they can handle the hard problems, right?
Below, I'll explain in plain language what "scores" we look at when evaluating models.
Evaluation is like a physical exam; different tests check for different issues.
Imagine your model is a freshly graduated doctor taking their medical licensing exam. How do we determine if they're a good doctor?
It mainly falls into two broad categories: Classification Problems and Regression Problems.
I. Classification Problems
Classification problems are like a doctor diagnosing a patient as "sick" or "not sick," or identifying emails as "spam" or "legitimate."
1. Accuracy
- What it is: This is the most intuitive metric: the proportion of predictions the model gets right. If it makes 100 predictions and 95 are correct, the accuracy is 95%.
- Real-life example: You predict whether it will "rain" or "not rain" tomorrow. If you predict correctly on 27 out of 30 days in a month, your accuracy is 27 / 30 = 90%.
- Pros: Very easy to understand.
- Cons: It can be misleading in certain situations. For example, imagine a disease that 99% of people don't get. A "lazy" model that diagnoses everyone as "not sick" will still achieve 99% accuracy! But it wouldn't have identified a single sick person, making it practically useless.
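To make the "lazy model" trap concrete, here's a minimal sketch assuming scikit-learn is installed; the patient counts are made up to mirror the 99%-healthy example:

```python
from sklearn.metrics import accuracy_score

# 100 patients: only 1 is actually sick (label 1), 99 are healthy (label 0).
y_true = [1] + [0] * 99

# A "lazy" model that predicts "not sick" for everyone.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great, yet it finds no sick patients
```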
Therefore, we need more refined metrics. This leads us to two very important concepts: Precision and Recall.
To understand these two, let's set up a scenario: a model is tasked with picking out all the "apples" from a pile of fruit.
2. Precision - "Quality over Quantity"
- What it is: Among all the fruits you picked out (which the model identified as "apples"), how many are actual apples?
- Formula: True Positives / (True Positives + False Positives), or Actual Apples Picked / All Fruits Picked.
- Real-life example: Your model picked 10 fruits, confidently claiming they were all apples. Upon inspection, 8 of them were indeed apples, but 2 were red potatoes. So the precision is 8 / 10 = 80%.
- Significance: High precision means your model is "reliable" and doesn't often make false positive predictions. In scenarios like recommendation systems, recommending the wrong items can annoy users, so precision is crucial.
3. Recall - "Don't Miss Anything"
- What it is: Among all the actual apples, how many did your model successfully pick out?
- Formula:
True Positives / (True Positives + False Negatives)
(orActual Apples Picked / All Actual Apples
) - Real-life example: There were a total of 15 actual apples in the pile. Your model only found 8 of them. So, the recall is
8 / 15 ≈ 53%
. - Significance: High recall means your model "sees everything" and misses few instances. In scenarios like "cancer diagnosis," it's often better to over-diagnose (false positive) than to miss a case (false negative). The consequences of a missed diagnosis are very severe, so recall is critical.
The Trade-off Between Precision and Recall: These two metrics are often in tension; you usually can't maximize both at once (the sketch after this list makes the trade-off concrete).
- A very cautious model (that only picks what it's very confident about) will have high precision but might miss some ambiguous cases, leading to low recall.
- A very aggressive model (that picks anything that even slightly resembles the target) will have high recall but will include many irrelevant items, leading to low precision.
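Here's a rough sketch of that trade-off using a handful of hypothetical confidence scores and scikit-learn's metrics; raising the threshold makes the model more cautious, so precision goes up while recall goes down:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical confidence scores the model assigned to 8 fruits (1 = apple).
y_true = [1, 1, 1, 0, 1, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.40, 0.20]

for threshold in (0.5, 0.75):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# threshold=0.5:  precision=0.67, recall=0.80  (aggressive)
# threshold=0.75: precision=1.00, recall=0.60  (cautious)
```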
4. F1-Score
- What it is: Since precision and recall often conflict, we need a referee to balance them. The F1-Score is their "harmonic mean," a composite metric that considers both.
- Significance: A higher F1-Score indicates that your model has achieved a better balance between precision and recall. When you're unsure which to prioritize, the F1-Score is a good choice.
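Concretely, the harmonic mean is F1 = 2 × Precision × Recall / (Precision + Recall). A quick sketch with the apple numbers, assuming scikit-learn:

```python
from sklearn.metrics import f1_score

# Apple scenario again: precision = 0.8, recall = 8/15.
y_true = [1]*8 + [0]*2 + [1]*7 + [0]*3
y_pred = [1]*10 + [0]*10

precision, recall = 0.8, 8 / 15
harmonic_mean = 2 * precision * recall / (precision + recall)

print(harmonic_mean)             # 0.64
print(f1_score(y_true, y_pred))  # same value, computed by scikit-learn
```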
5. ROC Curve and AUC Value
- What it is: This is a more advanced evaluation method. You can imagine that when a model outputs "yes" or "no," it actually has a "confidence level" internally (e.g., 80% sure it's an apple). We can set a "threshold," and only if the confidence is above this threshold do we classify it as "yes."
- ROC Curve: It's a plot showing the relationship between the "True Positive Rate" (which is recall) and the "False Positive Rate" (the proportion of non-targets incorrectly classified as targets) by continuously adjusting this "threshold."
- AUC Value: It's the area under the ROC curve. The larger this area (closer to 1), the stronger the model's ability to distinguish between "yes" and "no," indicating better performance. A model that guesses randomly will have an AUC value of 0.5.
- Significance: AUC is a more comprehensive measure of model performance that is not affected by the "threshold."
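A minimal sketch, reusing the hypothetical confidence scores from the trade-off example and assuming scikit-learn; note that AUC is computed from the raw scores, so no threshold needs to be chosen:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 1, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.40, 0.20]

print(roc_auc_score(y_true, scores))  # 0.8 -- well above the 0.5 of random guessing
```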
II. Regression Problems
Regression problems are not like multiple-choice questions; they're like fill-in-the-blank questions. For example, predicting house prices, stock prices, or tomorrow's temperature.
1. Mean Absolute Error (MAE)
- What it is: It's the average of the "differences" between your predicted values and the actual values for each prediction. Regardless of whether you predicted too high or too low, we only take the absolute value of the difference.
- Real-life example:
- House A's actual price is $100, you predicted $105, difference $5.
- House B's actual price is $80, you predicted $78, difference $2.
- MAE is ($5 + $2) / 2 = $3.5.
- Significance: Very intuitive; it directly tells you how large your model's average prediction error is.
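A tiny sketch of that calculation, assuming scikit-learn:

```python
from sklearn.metrics import mean_absolute_error

# House-price example: actual prices vs. predicted prices (in dollars).
actual    = [100, 80]
predicted = [105, 78]

print(mean_absolute_error(actual, predicted))  # (5 + 2) / 2 = 3.5
```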
2. Mean Squared Error (MSE)
- What it is: Similar to MAE, but when calculating the difference, we first "square" it, and then take the average.
- Real-life example: Using the example above, the MSE is (5² + 2²) / 2 = (25 + 4) / 2 = 14.5.
- Significance: MSE "penalizes" larger errors more heavily. For instance, an error of 10 units becomes 100 in the MSE (10²), while an error of 1 unit becomes only 1 (1²). If large prediction errors are unacceptable in your business scenario, then MSE is a good metric.
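And the same two houses run through MSE (again assuming scikit-learn):

```python
from sklearn.metrics import mean_squared_error

# Same two houses as in the MAE example.
actual    = [100, 80]
predicted = [105, 78]

print(mean_squared_error(actual, predicted))  # (25 + 4) / 2 = 14.5
```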
3. R-squared
- What it is: This metric can be a bit tricky but is very useful. It measures to what extent your model can "explain" the variance in the data.
- Range of values: Typically between 0 and 1 (it can even go negative when a model fits worse than simply predicting the mean).
- Real-life example: If the R-squared for house prices is 0.8, you can roughly interpret it as: 80% of the variation in house prices (driven by factors like size, location, etc.) has been captured and successfully explained by your model. The remaining 20% comes from factors the model hasn't captured.
- Significance: A higher R-squared indicates a better fit of the model to the data. A model with an R-squared of 0 performs no better than simply predicting the average of all house prices.
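A small sketch with made-up prices, assuming scikit-learn; the second print shows the "just predict the average" baseline scoring exactly 0:

```python
from sklearn.metrics import r2_score

# A few house prices and the model's predictions (hypothetical numbers).
actual    = [100, 80, 120, 90, 110]
predicted = [102, 78, 115, 95, 108]

print(r2_score(actual, predicted))  # ≈ 0.94 -- most of the variation is explained

# Predicting the average price for every house gives an R-squared of 0.
mean_price = sum(actual) / len(actual)
print(r2_score(actual, [mean_price] * len(actual)))  # 0.0
```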
In summary
- There's no single best metric, only the most suitable one.
- For classification tasks (like multiple-choice questions), first look at Accuracy, but always be wary of imbalanced data traps. Then, based on your business needs, weigh Precision (avoid false positives) against Recall (avoid false negatives), or directly look at their combined score, the F1-Score. For a comprehensive evaluation, use AUC.
- For regression tasks (like fill-in-the-blank questions), if you want to know the average error, use MAE. If you want to penalize large errors, use MSE. If you want to know how much explanatory power your model has, use R-squared.
I hope this explanation gives you a clear understanding of how to "examine" a machine learning model!