What are Overfitting and Underfitting? How to avoid them?

秀梅 蒋

Okay, no problem. Let's talk about these two very common concepts in machine learning in plain language.


Imagine you're teaching a robot student (our "model") how to identify cats. You show it a large collection of cat photos (this is the "training data"), hoping that after it learns, it can accurately determine if a new photo ("test data") is also a cat.

What is Underfitting?

Underfitting, in plain terms, means "learning too poorly, not even grasping the basics".

This robot student is too "dumb", or you taught it too superficially. It looked at photos for a long time but didn't even grasp the most basic patterns. For example, it might only learn "if it has fur, it's a cat", and as a result, it mistakes dogs, rabbits, or even fur coats for cats.

  • Performance: Not only does it make terrible mistakes when encountering new photos (test set), it also performs poorly on the old photos (training set) you showed it before. You could say it's a "double failure".
  • Reason: This usually happens because the model is too simple to capture the complex patterns within the data. It's like asking an elementary school student to solve calculus problems – the wrong tool for the job.

(In the image above, the model (blue line) is too simple and completely fails to capture the data's trend)
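The "double failure" is easy to reproduce in a few lines of NumPy. Below is a minimal sketch (the toy quadratic data and variable names are invented for illustration): a straight-line model is fit to curved data, and its error stays large on both the training and the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the true pattern is a curve (y = x^2) plus a little noise
x_train = np.linspace(-3, 3, 50)
y_train = x_train**2 + rng.normal(0, 0.3, size=x_train.shape)
x_test = np.linspace(-2.9, 2.9, 25)
y_test = x_test**2

# A degree-1 polynomial (a straight line) is too simple for this data
line = np.polyfit(x_train, y_train, deg=1)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_err = mse(line, x_train, y_train)
test_err = mse(line, x_test, y_test)
# Both errors remain large -- the model fails on old AND new data alike
```

A well-fit model would drive the training error down first; here it never gets low, which is the tell-tale sign of underfitting rather than overfitting.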

How to avoid underfitting?

Just like teaching a student, you need to "step up your efforts":

  1. Increase model complexity: Get a "smarter" student. For example, switch from a simple linear model to a complex neural network.
  2. Add more features: Tell the student more details. Don't just look at "whether it has fur", but also "eye shape", "ear style", "whiskers", etc.
  3. Reduce regularization: (This is a bit more technical) This can be understood as "loosening the reins". Sometimes, to prevent the student from "overthinking" (overfitting), we impose some restrictions (regularization). If you find it's too "dumb", you need to relax these restrictions a bit.
  4. Train longer: Give it more study time, let it review and learn repeatedly.
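Of these remedies, the first is the easiest to demonstrate. Continuing the toy sketch (same invented noisy quadratic data), swapping the straight line for a degree-2 polynomial — a slightly "smarter" student — collapses the training error.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 50)
y_train = x_train**2 + rng.normal(0, 0.3, size=x_train.shape)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# "Dumb" student: a straight line. "Smarter" student: a parabola.
simple = np.polyfit(x_train, y_train, deg=1)
richer = np.polyfit(x_train, y_train, deg=2)

err_simple = mse(simple, x_train, y_train)
err_richer = mse(richer, x_train, y_train)
# The richer model's training error drops sharply because it can
# finally represent the curve the data actually follows
```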

What is Overfitting?

Overfitting, on the contrary, means "learning too meticulously, becoming obsessed".

This robot student is an "ace student", but a bit bookish. Instead of learning the general characteristics of cats, it memorized every single detail of every photo you gave it.

For example, it memorized "the black cat with a pixel in the top-left corner in photo A is a cat", and "the white cat with a vase in the bottom-right corner in photo B is a cat".

  • Performance: On the old photos (training set) you gave it, it can achieve nearly 100% accuracy because it has memorized all the answers. But as soon as you give it a new photo, like a cat in the grass, it's stumped because it hasn't "memorized" the background or pixels of that specific photo. This is what's called "poor generalization ability".
  • Reason: This usually happens because the model is too complex, or the training data is too small. The model is powerful enough to memorize all the noise and accidental features, rather than learning the essential patterns.

(In the image above, the model (green line) is too complex, becoming unusually distorted to accommodate every data point)
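"Memorizing the answers" can be reproduced in the same toy setting (again, the data and names are invented for illustration). With only 12 noisy samples and a degree-11 polynomial — one coefficient per sample — the model can pass through every training point exactly, yet does much worse on new points in between.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small dataset: only 12 noisy samples of y = x^2
x_train = np.linspace(-1, 1, 12)
y_train = x_train**2 + rng.normal(0, 0.1, size=x_train.shape)
x_test = np.linspace(-0.9, 0.9, 40)   # new points between the old ones
y_test = x_test**2

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# 12 points, 12 coefficients: the model can memorize every sample,
# noise included
wiggly = np.polyfit(x_train, y_train, deg=11)

train_err = mse(wiggly, x_train, y_train)
test_err = mse(wiggly, x_test, y_test)
# train_err is essentially zero; test_err is far larger --
# the textbook signature of poor generalization
```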

How to avoid overfitting?

To cure a "bookworm", you need to broaden its horizons and prevent it from getting stuck in trivial details:

  1. Increase data volume: This is the most effective method! Show it countless, diverse cat photos. The more data it sees, the less likely it is to memorize every detail, forcing it to learn the common characteristics of cats.
  2. Data Augmentation: If you don't have many new photos, you can "create" some. Rotate, scale, crop, or change the colors of existing photos to generate new data that "looks different but is essentially the same".
  3. Reduce model complexity: Don't use such a "smart" student; switch to a slightly more "average" one, so it doesn't have the capacity to memorize so many details.
  4. Use Regularization: Add some "penalty" to the student's learning process. If it tries to engage in fancy, complex memorization, deduct points. This forces it to choose simpler, more general patterns to learn.
  5. Use Dropout: (Commonly used in neural networks) This can be understood as randomly letting some students "zone out" during class. During training, a random subset of neurons is temporarily deactivated, forcing the model not to over-rely on a few specific neurons, thus making the entire network learn more "robustly".
  6. Early Stopping: During training, we teach it while simultaneously quizzing it with photos it hasn't studied (a held-out validation set). As soon as we notice its score on those photos starting to drop (a sign it has begun memorizing instead of learning), we immediately stop training.
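Of these remedies, regularization (point 4) is the easiest to sketch with plain NumPy. The snippet below is an illustrative ridge-regression toy (an L2 "penalty for fancy memorization"), not a production recipe: the same 12-point, degree-11 polynomial fit as before, with and without the penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 12)
y_train = x_train**2 + rng.normal(0, 0.1, size=x_train.shape)

DEG = 11  # flexible enough to memorize all 12 points

def design(x):
    # Vandermonde matrix: columns x^0, x^1, ..., x^DEG
    return np.vander(x, DEG + 1, increasing=True)

def ridge_fit(x, y, alpha):
    # Ridge regression: minimize ||Xw - y||^2 + alpha * ||w||^2,
    # solved via the normal equations (X^T X + alpha*I) w = X^T y
    X = design(x)
    return np.linalg.solve(X.T @ X + alpha * np.eye(DEG + 1), X.T @ y)

def mse(w, x, y):
    return np.mean((design(x) @ w - y) ** 2)

w_free = ridge_fit(x_train, y_train, alpha=0.0)  # no penalty: memorizes
w_reg = ridge_fit(x_train, y_train, alpha=0.1)   # penalized: stays tame

# The penalty trades a little training accuracy for smaller, smoother
# coefficients -- "deduct points" whenever the model gets too fancy
```

The penalized fit has a higher training error but much smaller coefficients; on fresh points it typically tracks the true curve far better than the unpenalized interpolant, which is exactly the trade regularization is designed to make.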

Summary

| Characteristic | Underfitting | Overfitting | Ideal Model (Good Fit) |
| --- | --- | --- | --- |
| Analogy | Poor learner, didn't grasp it | Bookworm, memorizes everything | Ace student, applies knowledge flexibly |
| Training Set Performance | Poor | Excellent | Good |
| Test Set Performance | Poor | Poor | Good |

Our ultimate goal is to train an ideal model with strong generalization ability – one that is neither a poor learner nor a bookworm, but a good student capable of applying what it learns to new situations and performing excellently on new problems. In practice, we are always looking for that perfect balance between underfitting and overfitting.