What is data? Why does AI require a large amount of data?

Kelly Pollard
Lead AI researcher with 15 years of experience.

Okay, regarding the relationship between data and AI, I'll try to explain it to you in plain language.


First, what exactly is "data"?

Don't think about complex code and databases for now; let's consider something simpler.

Data, in essence, is "recorded information." It's like the various notes and materials we have in our lives.

You can think of it as the raw ingredients used for cooking:

  • Text: Your chat history with friends, a novel you read, an online article – these are data.
  • Images: Photos in your phone gallery, a movie poster, an emoji – these are data.
  • Audio: A voice message you recorded, a song, the sound of rain – these are also data.
  • Numbers: Your height and weight, today's temperature, stock prices – these are still data.

In short, any information that can be recorded, whether visible, audible, or quantified into numbers, can be called data. It is the "raw material" for AI's learning and work.
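To make the list above a little more concrete, here is a minimal Python sketch of what each kind of data might look like once it is recorded as numbers. All of the sample values (the word "cat", the tiny 2x2 image, the audio samples, the measurements) are made up for illustration and do not come from a real dataset.

```python
# Illustrative sketch only: every value below is invented for the example.

# Text: each character is stored as one or more numbers (UTF-8 byte values here).
text = "cat"
text_as_numbers = list(text.encode("utf-8"))

# Image: a tiny 2x2 grayscale picture, one brightness value (0-255) per pixel.
image = [
    [0, 255],
    [255, 0],
]

# Audio: loudness of the sound wave measured at regular moments in time.
audio = [0.0, 0.5, 1.0, 0.5, 0.0]

# Numbers: measurements are already numeric.
measurements = {"height_cm": 170, "weight_kg": 65, "temperature_c": 21.5}

print("text  ->", text_as_numbers)
print("image ->", image)
print("audio ->", audio)
print("nums  ->", measurements)
```

Whatever the original form, the computer ends up with lists of numbers, and those lists are the "raw ingredients" an AI works with.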

So, why does AI need "a lot" of data?

This question is key. Why can't we just give AI a small amount of data?

Because AI's learning method is essentially about "finding patterns." Unlike humans who can "get it in an instant," AI is more like a "slow learner" who needs massive practice to master a skill.

Let's use an example to understand:

1. AI is like a detective who has "seen countless faces"

Suppose you want to train an AI to recognize a "cat" just by looking at a photo.

  • If you only show it 10 photos of cats, it might conclude that "anything with fur and two ears is a cat." Then, if you show it a photo of a dog, it might misidentify the dog as a cat.
  • If you instead show it 10,000 photos of various cats (lying down, jumping, different breeds, different colors) and another 10,000 photos of non-cats (dogs, tigers, chairs, cars), the AI will continuously compare and summarize, discovering deeper patterns from this massive amount of data: "Oh, so a cat's pupils are like this, its whiskers are like this, its face shape is like this, its walking posture is like this..."

The more "experiences" it has (the larger the data volume), the more accurate the patterns it summarizes will be, and the higher the probability it will correctly identify a new cat photo next time.
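The idea that more examples lead to better patterns can be tried out with a toy experiment. The sketch below is only an illustration under simplifying assumptions: each "photo" is reduced to a single made-up number between 0 and 1, the hidden rule in `true_label` is invented, and the learner is a very simple nearest-neighbour method (it copies the label of the most similar example it has seen). Real image recognition systems are far more complex.

```python
# Toy experiment: how accuracy changes as the learner sees more examples.
# The feature, the hidden rule, and all function names are hypothetical.

def true_label(x):
    """Hidden pattern to discover: 'cat' in two separate ranges of x."""
    return "cat" if int(x * 4) % 2 == 0 else "not cat"

def make_examples(n):
    """Create n evenly spaced, correctly labelled examples between 0 and 1."""
    return [((i + 0.5) / n, true_label((i + 0.5) / n)) for i in range(n)]

def predict(x, training):
    """Nearest neighbour: copy the label of the closest training example."""
    nearest = min(training, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(training, test):
    """Fraction of test examples that are labelled correctly."""
    correct = sum(predict(x, training) == label for x, label in test)
    return correct / len(test)

test_set = make_examples(1000)

for n in (2, 10, 100, 1000):
    print(f"{n:>5} training examples -> accuracy {accuracy(make_examples(n), test_set):.2f}")
```

With only 2 examples the learner is right about half the time, which is no better than guessing. By 1,000 examples it labels every test point correctly, because it has seen enough cases to trace where the "cat" ranges begin and end.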

2. Data Volume Determines AI's "IQ" and "EQ"

  • Insufficient data makes AI "stubborn": If you only train an AI with photos of black cats, it might later assume all cats are black. Show it a white cat, and it won't recognize it. The diversity and quantity of data determine AI's "perspective," preventing it from making mistakes due to limited exposure.

  • Sufficient data allows AI to "generalize": The smart AIs we use today, like those that can converse with you or draw pictures, are so powerful because they have been trained on enormous amounts of publicly available text and images from the internet. Their "knowledge base" is massive, so they can chat about almost anything you ask, and they can generate a new picture of almost anything you describe by drawing on the patterns they learned from those images.
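The black-cat case from the first bullet above can also be sketched in a few lines. This is a hedged illustration, not how a real system works: each photo is boiled down to two made-up numbers (`fur_brightness` and `ear_pointiness`, both on a 0-to-1 scale), all values are invented, and the learner again just copies the label of the most similar training photo.

```python
import math

# Hypothetical features per photo: (fur_brightness, ear_pointiness), each 0 to 1.
black_cats = [((0.05, 0.90), "cat"), ((0.10, 0.85), "cat")]
other_cats = [((0.90, 0.90), "cat"), ((0.50, 0.88), "cat")]         # white cat, ginger cat
non_cats = [((0.90, 0.60), "not cat"), ((0.20, 0.30), "not cat")]   # white dog, black dog

def predict(photo, training):
    """Copy the label of the training photo closest to the new photo."""
    nearest = min(training, key=lambda example: math.dist(example[0], photo))
    return nearest[1]

new_white_cat = (0.95, 0.90)

narrow_training = black_cats + non_cats                 # the only cats seen are black
diverse_training = black_cats + other_cats + non_cats   # cats of several colors

print("Trained on black cats only:", predict(new_white_cat, narrow_training))
print("Trained on diverse cats:   ", predict(new_white_cat, diverse_training))
```

Trained only on black cats, the learner calls the white cat "not cat" because the closest thing it has ever seen is a white dog. Once white and ginger cats are added to the training data, the same learner labels it "cat."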

In summary

  • Data is AI's "food" and "textbook." Without data, AI can't learn anything; it's just an empty shell.
  • AI needs "large" and "diverse" data so it can learn enough patterns from it, becoming smarter, more accurate, and better able to handle various complex situations in our real world.

Therefore, when we talk about AI, we are also talking about the massive data that supports it. This is why the term "big data" so often appears alongside "artificial intelligence."