What are training data and test data, and what are their respective purposes?
Training data and test data are easy to mix up. If we use our school days as an analogy, the difference becomes clear right away.
Training Data vs. Test Data: Imagine It as 'Practice Problems' and 'The Big Test' from Your School Days
If you want a machine (which we call a "model") to learn a new skill, such as recognizing images of cats and dogs, you can't expect it to know how from the start. You have to "teach" it, and this process is called training.
What is Training Data? Textbooks and Exercise Books
Training data is like the textbooks and vast amounts of exercise books used by this "student" (model).
What does it contain?
- Problems: Tens of thousands of images.
- Standard Answers: Each image is labeled, e.g., this one is "cat", that one is "dog".
What is its purpose? Its purpose is for the model to "learn" and "practice problems". The model will look at these images one by one, then try to guess: "Hmm... this one has whiskers and pointy ears, I guess it's a cat?" After guessing, it immediately checks against the standard answer.
- Guessed correctly? Great, it reinforces its impression of "cat" features.
- Guessed incorrectly (e.g., mistook a Chihuahua for a cat)? It receives a "penalty" and adjusts its internal parameters (think of this as "correcting its problem-solving approach") so that it is more likely to get it right next time.
Through this long, repetitive cycle of "practice, check the answer, correct", the model gradually extracts a set of rules from the training data, such as which features distinguish "cats" from "dogs".
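The "practice, check the answer, correct" cycle can be sketched with a very small model. The code below is only an illustration, not how real image classifiers are built: it uses a simple perceptron, and the two features (ear pointiness and snout length) and all the numbers are made up for the example.

```python
# Toy "practice, check, correct" loop using a perceptron.
# The features and labels are invented for illustration only.

def predict(weights, bias, features):
    """Guess 1 ("cat") if the weighted sum is positive, otherwise 0 ("dog")."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(examples, epochs=20, learning_rate=0.1):
    """Guess, compare with the standard answer, and adjust parameters on mistakes."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            guess = predict(weights, bias, features)
            error = label - guess  # 0 if correct, +1 or -1 if wrong
            if error != 0:
                # The "penalty": nudge the parameters toward the right answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias

# Hypothetical features: (ear pointiness, snout length), each scaled to 0..1.
training_data = [
    ((0.9, 0.2), 1),  # cat
    ((0.8, 0.1), 1),  # cat
    ((0.7, 0.3), 1),  # cat
    ((0.2, 0.9), 0),  # dog
    ((0.3, 0.8), 0),  # dog
    ((0.1, 0.7), 0),  # dog
]

weights, bias = train(training_data)
```

After training, `predict(weights, bias, features)` gives the model's guess for any new pair of feature values.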
What is Test Data? Mock Exams and Final Exams
When you feel this student has learned enough, how do you know if it has truly understood, or if it's just "memorizing" the original problems from the exercise books?
At this point, a formal examination is needed, and test data is that brand-new exam paper it has never seen before.
What does it contain?
- Problems: A batch of brand-new cat and dog images. These images must not have appeared in previous exercise books (training data).
- Standard Answers: Of course, there are answers, but this time they cannot be shown to the model beforehand. It has to take a "closed-book exam" first.
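One common way to get an "unseen exam paper" is to set aside part of the labeled data before training starts. The sketch below assumes a hypothetical dataset of (file name, label) pairs and a helper named `split_train_test`; the 20% test fraction is just an example value, not a rule.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the labeled examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    test_size = int(len(shuffled) * test_fraction)
    # Returns (training examples, test examples).
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical dataset: 100 image file names, each with its standard answer.
dataset = [(f"img_{i:03d}.jpg", "cat" if i % 2 == 0 else "dog") for i in range(100)]

train_set, test_set = split_train_test(dataset)

# Check that no image ended up in both the exercise book and the exam paper.
train_names = {name for name, _ in train_set}
test_names = {name for name, _ in test_set}
no_overlap = train_names.isdisjoint(test_names)
```

The model is then trained only on `train_set`, and `test_set` stays locked away until grading time.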
What is its purpose? Its core purpose is to evaluate the model's true capability.
The model needs to classify the images on this new exam paper using the "knowledge" it learned from the training data. Once it has made all its guesses and submitted the "paper", we then take out the standard answers to "grade" it.
- "This exam paper has 100 images in total, and you guessed 95 correctly." — Then the model's accuracy is 95%.
This score, not the score on the training data, shows whether the model has genuinely mastered recognizing cats and dogs and can apply what it has learned to new situations (professionally known as generalization ability), rather than only being able to solve the original problems from the exercise books.
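The "grading" step is simple arithmetic: count the correct guesses and divide by the number of questions. Here is a minimal sketch of the 95-out-of-100 case; the lists of labels and predictions are made up to match that example.

```python
def accuracy(predictions, true_labels):
    """Fraction of predictions that match the standard answers."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

# Invented exam: 50 cat images followed by 50 dog images.
true_labels = ["cat"] * 50 + ["dog"] * 50
# Invented model output: every cat is right, 5 dogs are mistaken for cats.
predictions = ["cat"] * 50 + ["dog"] * 45 + ["cat"] * 5

score = accuracy(predictions, true_labels)  # 95 correct out of 100
```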
In Simple Summary
| Type | Analogy | Answers shown to the model? | Purpose |
|---|---|---|---|
| Training Data | Textbooks, exercise books | Yes, the standard answers are used as feedback while learning | Used to "teach" and "train" the model, allowing it to learn patterns |
| Test Data | Mock exams, final exams | No, the answers exist but are held back and used only for grading | Used to "assess" the model, evaluating its true performance and generalization ability |
Why Must They Be Separated?
To put it plainly, if you use the exact problems from the exercise books for an exam, a student scoring 100% doesn't mean much, because they might have just memorized the answers. Only by using brand-new problems can you know if they truly understand. For machine learning, the principle is exactly the same. Keeping training data and test data separate prevents the model from "cheating" and makes sure the score we measure reflects how the model will do on data it has never seen.
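To see why a perfect score on the exercise book proves little, consider an extreme "student" that only memorizes. The `MemorizingModel` class below is a made-up illustration, and the image names and labels are invented: it stores every training answer and falls back to a fixed guess for anything it has not seen.

```python
class MemorizingModel:
    """A 'student' that memorizes question-answer pairs and learns no rules."""

    def __init__(self, default_guess="cat"):
        self.memory = {}
        self.default_guess = default_guess

    def fit(self, examples):
        # "Learning" here is just copying the answer key.
        for question, answer in examples:
            self.memory[question] = answer

    def predict(self, question):
        # Unseen questions get the same fixed guess every time.
        return self.memory.get(question, self.default_guess)

def accuracy(model, examples):
    correct = sum(model.predict(q) == a for q, a in examples)
    return correct / len(examples)

# Invented data: the test images never appear in the training set.
train_set = [("img_001", "cat"), ("img_002", "dog"), ("img_003", "dog"), ("img_004", "cat")]
test_set = [("img_101", "dog"), ("img_102", "cat"), ("img_103", "dog"), ("img_104", "dog")]

model = MemorizingModel()
model.fit(train_set)

train_accuracy = accuracy(model, train_set)  # perfect on the memorized problems
test_accuracy = accuracy(model, test_set)    # only the single test cat is right
```

Here the training score is 100% while the test score is 25%, which is exactly the gap that grading on the exercise book would hide.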