Why Foundation Models Need Tuning

When people first see GPT-4o, Claude 3.5 Sonnet, or Qwen 3, it feels a bit like magic. You ask a question about quantum physics or pop culture and it answers; you ask it to write code and it does. The same model can write a poem, fix a bug, or summarize a legal document. It’s tempting to think this is some kind of universal intelligence, a single brain that can do anything.

But then you run into a puzzle. If these models are so general, why do we keep hearing about “fine-tuning” them for particular tasks? Shouldn’t a truly general intelligence just work, out of the box, on anything you throw at it? Why do we need to retrain them, sometimes extensively, for things like medical diagnosis, customer support, or legal reasoning?

This isn’t just a technical quirk. It tells us something about intelligence itself.

Humans are supposed to be the gold standard for general intelligence. But even we don’t just walk into an operating room and start doing heart surgery. Everyone starts with the same basic hardware, but to become a doctor or a violinist or a chess grandmaster, you have to immerse yourself in that domain for years. A brain is general-purpose, but a doctor’s mind is tuned by thousands of hours of seeing patients and studying cases. Foundation models are the same way: pretraining gives them broad capability, but to be actually useful in a specific domain, they need to absorb its details and patterns. Generality is a wide net, but it doesn’t catch the rare fish.

This pattern is everywhere. The first computers were universal calculators, but you needed software to make them do something useful—like spreadsheets or games. A transistor is a universal switch, but you only get a camera or a processor by arranging billions of them in a particular way. In biology, stem cells can become anything, but their usefulness comes from specializing into neurons or muscle or skin. Foundation models are like intellectual stem cells: raw potential that only becomes valuable after it’s shaped into a particular role.

For example, suppose you want a chatbot to do customer support for an airline. If you just use Llama 2 as-is, it might mix up your policies with someone else’s, or sound weirdly formal, or give answers that miss the point. But if you fine-tune it on your company’s support logs and style guide, it starts to sound like your best agent—quick, precise, and on-brand. Or maybe you want it to summarize legal contracts. The general model will produce something that sounds plausible, but it might miss the legal nuances. Fine-tuning it on good examples teaches it what lawyers actually care about. Even for things like generating synthetic medical notes, the model needs to learn the quirks of that field. The foundation gives you language skills; fine-tuning gives you expertise.
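To make that airline example concrete, here is a minimal sketch of what such a fine-tune might look like using LoRA on the Hugging Face stack (transformers, peft, datasets). The specifics are assumptions for illustration: the support_logs.jsonl file, its prompt and response fields, the base model choice, and the hyperparameters are all hypothetical stand-ins for your own data and tuning decisions.

```python
# A minimal LoRA fine-tuning sketch on hypothetical airline support logs.
# Assumes: pip install transformers peft datasets, plus a support_logs.jsonl
# file of {"prompt": ..., "response": ...} records (illustrative, not real).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-2-7b-hf"  # any causal LM on the Hub works here
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small adapter matrices instead of all of the base weights,
# which keeps the cost of specializing low.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

def to_features(example):
    # Join each support exchange into one sequence for causal LM training.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

train_set = (
    load_dataset("json", data_files="support_logs.jsonl")["train"]
    .map(to_features, remove_columns=["prompt", "response"])
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="airline-support-lora",  # hypothetical adapter checkpoint
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
    train_dataset=train_set,
).train()
model.save_pretrained("airline-support-lora")
```

LoRA is just one common recipe among several; the point is that a few thousand good in-domain examples, not a from-scratch training run, are what turn the generalist into your agent.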

Sometimes people ask: if fine-tuning is so important, why not just train a smaller model from scratch on your own data? The answer is that big foundation models come with a huge amount of world knowledge and linguistic skill built in, including the ability to communicate in the first place. Fine-tuning is like hiring someone who’s already smart, then teaching them your business. Starting from scratch is like hiring a random person off the street and hoping they pick things up quickly.

Will foundation models someday get so good that we won’t need fine-tuning? Maybe. But so far, even the best models get better with a little domain-specific training. In fact, the more general they become, the more we want them to act like specialists.

It is also worth noting that fine-tuning can make models cheaper and faster to run. If we can bake knowledge into the model weights, we save that space in the context window later on, and we might even obviate the need for a retrieval-augmented generation (RAG) system to search over documents in the first place.
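As a rough illustration of that point, here is what inference could look like with the adapter from the sketch above. The airline-support-lora path is the hypothetical checkpoint from that sketch; the detail to notice is that the prompt carries no retrieved policy documents, because the fine-tuned weights are assumed to hold that knowledge already.

```python
# A sketch of inference with the hypothetical adapter trained above.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base),
    "airline-support-lora",  # hypothetical LoRA checkpoint from the earlier sketch
)

# No retrieved documents pasted into the context: the weights carry the policy.
prompt = "Customer: Can I change my flight for free within 24 hours?\nAgent:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```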

This has a broader implication: maybe the future of AI isn’t one giant brain that knows everything, but a whole ecosystem of specialists, all built on the same foundation but each fine-tuned for a different task. Like professions in a society, or organs in a body. Generality is the scaffolding; mastery is built by immersion. Wouldn’t it be strange if Hugging Face, which began as a home for foundation models and experimentation, eventually became the social graph of AI specialists that help power a future version of our world? A network of experts with the tools and compute to deliver?

The lesson here isn’t just about AI. It’s about how intelligence works, period. Breadth gives you flexibility, but depth gives you power. The real magic is that fine-tuning works at all—that you can take a generalist and, with the right examples, turn it into an expert. The fact that our best models need this isn’t a flaw. It’s a clue about where true skill comes from: not from being universal, but from being shaped. The art isn’t in making something general, but in knowing how to turn generality into mastery. That’s as true for people as it is for machines.
