How are Large Language Models (LLMs) trained? Taking the GPT series as an example, what are their core principles?

Kelly Pollard
Lead AI researcher with 15 years of experience.

Okay, no problem. I'll do my best to explain how LLMs (especially the GPT series) are "forged" in plain language.


Demystifying the "Making" of Large Language Models (LLMs) – Taking GPT as an Example

Imagine we want to cultivate a "super brain" that not only reads countless books but can also think and converse like a human. Training an LLM is quite similar to this process, mainly divided into two major steps: "Massive Reading" for General Education and "Tailored Guidance" for Specialized Training.

Phase One: General Education (Pre-training)

This is the most "brutal" and expensive phase. You can imagine it as throwing a newborn, blank "brain" (which is a massive neural network model) into a library containing almost all human knowledge, letting it read day and night.

  • Reading Material: How big is this library? Basically, it's a copy of the entire internet (hundreds of billions of words), plus countless books, papers, code, and so on.
  • Learning Method: It doesn't simply memorize content. Instead, it learns by playing a super-difficult "cloze test" or "next word prediction" game (a small code sketch of this objective follows this list).
    • We randomly remove a word from a sentence, for example: "The weather is great today, let's go to the park ____." Then we ask the model to guess what word is most likely to fill this blank (e.g., "for a walk," "for a picnic").
    • Or, we give it the first half of a sentence, for example: "Give a man a fish and you feed him for a day; teach a man to fish and you ____," and ask it to predict the latter half.
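
To make this concrete, below is a minimal sketch (in PyTorch) of that "next word prediction" game as a training objective. The tiny vocabulary, toy model, and example sentence are made up purely for illustration; a real GPT uses tens of thousands of sub-word tokens and a deep stack of Transformer layers in place of the toy model here.

```python
# Minimal sketch of the "next word prediction" objective (illustrative only).
import torch
import torch.nn as nn

vocab = ["the", "weather", "is", "great", "today", "let's", "go", "to", "park", "for", "a", "walk"]
stoi = {w: i for i, w in enumerate(vocab)}

# A toy "language model": embedding -> linear layer that scores every word in the
# vocabulary as the possible next word. A real GPT puts a deep stack of
# Transformer blocks between these two layers so it can use the whole context.
model = nn.Sequential(
    nn.Embedding(len(vocab), 32),
    nn.Linear(32, len(vocab)),
)

sentence = ["the", "weather", "is", "great", "today"]
ids = torch.tensor([stoi[w] for w in sentence])

inputs, targets = ids[:-1], ids[1:]           # predict word t+1 from word t
logits = model(inputs)                        # shape: (len(sentence) - 1, vocab size)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()                               # gradients nudge the model toward better guesses
print(float(loss))
```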

By playing billions of such games, the model gradually "comprehends" many things, far beyond just memorizing word combinations. It learns:

  1. Grammatical Rules: Knowing what kind of sentences are coherent.
  2. Factual Knowledge: For example, "The capital of France is Paris."
  3. Contextual Logic: Understanding the different meanings of a sentence in different contexts.
  4. A Certain Level of Reasoning Ability: For example, inferring "Xiao Ming is taller than Xiao Gang" from "Xiao Ming is taller than Xiao Hong, and Xiao Hong is taller than Xiao Gang."

After this phase, we get a "generalist" model. It's knowledgeable, but a bit like a bookworm – it knows everything, yet it isn't very good at interacting with people: it might ramble off topic or give answers that aren't what you were looking for.

This is the meaning of "Pre-trained" in GPT (Generative Pre-trained Transformer).

Phase Two: Specialized Tutoring (Fine-Tuning)

The bookworm needs some coaching to become a good conversational partner. The goal of this phase is to teach the model to "speak like a human," and to do so "pleasantly" and "usefully." This is further divided into two sub-steps.

1. Supervised Fine-Tuning (SFT)

  • How it's done: We hire a group of people (AI annotators) to play the roles of users and AI assistants, writing thousands of high-quality dialogue examples.
    • For example, if the "user" asks: "Write me a poem about spring," the annotator playing the AI assistant writes a genuinely good poem as the reference answer.
  • What it teaches: These "standard answers" are fed to the already pre-trained model, telling it: "Look, if someone asks a similar question in the future, you should answer like this." (A sketch of how such an example might be packaged for training appears after this list.)
  • Effect: The model begins to learn to follow instructions and answer questions in a conversational format, rather than just continuing sentences as before. It starts to develop a "persona," like a real assistant.
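
As a rough illustration, here is how one such dialogue example might be packaged for training. The template, field names, and masking convention below are assumptions made for this sketch, not the actual GPT data format; the key idea is that the model keeps playing the same next-token game as in pre-training, but the loss is typically computed only on the assistant's reply, so it learns how to answer rather than how to echo the question.

```python
# Illustrative sketch of a supervised fine-tuning (SFT) example.
# The prompt template and the -100 masking convention are assumptions for this sketch.

sft_example = {
    "prompt": "User: Write me a poem about spring\nAssistant: ",
    "response": "Buds unfold in morning light, ...",   # written by a human annotator
}

def build_training_pair(example, tokenize):
    """Concatenate prompt + response into one token sequence. The loss is
    computed only on the response tokens (prompt positions are masked out),
    so the model learns to produce the answer, not to repeat the question."""
    prompt_ids = tokenize(example["prompt"])
    response_ids = tokenize(example["response"])
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids   # -100 = "ignore in the loss"
    return input_ids, labels

# Toy whitespace "tokenizer", just for demonstration.
toy_vocab = {}
def toy_tokenize(text):
    return [toy_vocab.setdefault(w, len(toy_vocab)) for w in text.split()]

input_ids, labels = build_training_pair(sft_example, toy_tokenize)
print(input_ids)
print(labels)   # the prompt part is all -100; only the reply is scored
```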

2. Reinforcement Learning from Human Feedback (RLHF)

This is the crucial step behind the leap in capability of GPT-3.5 (the foundation of ChatGPT), and it is the real "secret sauce."

  • How it's done:
    1. Collect Preference Data: We have the model generate several different answers for the same question (e.g., A, B, C, D).
    2. Human Judges Score: Then we ask humans to act as judges and rank these answers. For example, they might think answer B is the best, A is second, and D and C are the worst (B > A > D > C).
    3. Train a "Preference" Model: We use this ranking data to train another small model, which we call the "Reward Model." The purpose of this reward model is to imitate human "taste" or "preferences"; it learns to score any answer generated by the AI, with higher scores indicating how much humans like that answer.
    4. Reinforcement Learning: Finally, we let our LLM play a game with this reward model. The LLM continuously generates new answers with only one goal: to make the score given by the reward model as high as possible. This process is like training a puppy: when it performs the correct action (generates an answer humans like), it gets a treat (a high-score reward).
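
Below is a minimal sketch of step 3, training the reward model on human rankings. The pairwise ranking loss follows the commonly described InstructGPT-style recipe (push the preferred answer's score above the other's); the toy feature vectors and the tiny stand-in model are made up for illustration.

```python
# Minimal sketch of reward-model training on human preference data (illustrative).
# In practice the reward model is itself a large Transformer that turns a
# (question, answer) pair into a single score; a toy linear layer stands in here.
import torch
import torch.nn as nn

reward_model = nn.Linear(16, 1)   # stand-in: maps an answer's features to a score

def pairwise_ranking_loss(score_better, score_worse):
    """Push the score of the human-preferred answer above the other one."""
    return -nn.functional.logsigmoid(score_better - score_worse).mean()

# Fake feature vectors for two answers to the same question; the human judges
# preferred answer B over answer C (B > C in the ranking).
features_B = torch.randn(1, 16)
features_C = torch.randn(1, 16)

loss = pairwise_ranking_loss(reward_model(features_B), reward_model(features_C))
loss.backward()   # the reward model's "taste" shifts toward the human ranking
```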

Through this process, the LLM is "shaped" to align more and more with human values and preferences. It becomes more useful (it can directly solve problems), more honest (it will say "I don't know" when it doesn't know), and more harmless (it refuses to answer malicious or dangerous questions).
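
As a rough sketch of what "making the score as high as possible" means in practice: InstructGPT-style RLHF typically optimizes the reward model's score minus a penalty for drifting too far from the SFT model, so the LLM can't "cheat" by producing degenerate text that happens to fool the reward model. The function below only illustrates that trade-off; the names, numbers, and beta value are placeholders.

```python
# Conceptual sketch of the quantity optimized during the RL step (illustrative).
def rlhf_objective(reward_model_score, logprob_policy, logprob_sft_model, beta=0.02):
    """Reward model score, minus a penalty (a simple per-answer drift estimate)
    for answers whose probability strays too far from the original SFT model."""
    drift_penalty = logprob_policy - logprob_sft_model
    return reward_model_score - beta * drift_penalty

# Toy numbers: the reward model likes this answer (score 2.1), and the new policy
# assigns it a somewhat higher log-probability than the original SFT model did.
print(rlhf_objective(reward_model_score=2.1, logprob_policy=-35.0, logprob_sft_model=-40.0))
```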


What is the Core Idea of the GPT Series?

In summary, the core idea of the GPT series can be encapsulated as:

  1. Generative: Its fundamental task is to "generate" content, rather than answering true/false or multiple-choice questions. Whether it's writing code, poetry, or answering questions, the essence is to generate the most reasonable subsequent text sequence based on the beginning you provide.

  2. Pre-trained: It embodies the idea that "scale works miracles": big data yields big results. The model first undergoes pre-training on massive amounts of unlabeled text, building its own basic understanding of the world and of the underlying rules of language. This is the foundation for all of its capabilities.

  3. Transformer Architecture: This is the technical cornerstone that enables all of the above. You can think of it as an extremely efficient neural network "brain structure" that is particularly good at processing long sequences of text, accurately capturing which words in a sentence matter most and how the words relate to one another (this is known as the "attention mechanism"; a small sketch of it appears after this list).

  4. Alignment with Humans: This is the key to transforming from a bookworm to a chat master. Through fine-tuning techniques like SFT and especially RLHF, the model's behavior is aligned with human expectations, preferences, and values, transforming it from a mere "knowledge base" into a truly useful "intelligent assistant."
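
For the curious, here is a minimal sketch of the "attention mechanism" mentioned in point 3: scaled dot-product self-attention, the core operation inside every Transformer layer. The toy dimensions are for illustration; a real GPT adds multiple attention heads, a causal mask (so a word can only look at earlier words), and many stacked layers.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative dimensions).
import torch

def attention(queries, keys, values):
    """Each position scores every other position (which words matter to me?),
    turns the scores into weights, and takes a weighted mix of the values."""
    d_k = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / d_k ** 0.5   # (seq, seq) relevance scores
    weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ values                                  # blend information across words

seq_len, d_model = 5, 8            # e.g. a 5-word sentence, 8 numbers per word
x = torch.randn(seq_len, d_model)
out = attention(x, x, x)           # self-attention: the sentence attends to itself
print(out.shape)                   # torch.Size([5, 8])
```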

So, GPT's success = Massive Model + Enormous Data + Efficient Transformer Architecture + Sophisticated Human Feedback Fine-tuning. It's not "programmed" line by line with code, but rather "trained" and "shaped" by massive data and human feedback.