Why Small Models Punch Above Their Weight

For the last few years, the AI world has been obsessed with size. Models keep getting bigger: first tens of billions of parameters, then hundreds of billions, now well over a trillion. The implicit belief is that if you just keep piling on more data and more compute, intelligence will magically emerge. But then something odd happens: a much smaller model like Microsoft’s Phi-3, whose smallest variant has only 3.8 billion parameters, appears and keeps up with the giants on many standard benchmarks. If intelligence is just about scale, why does a model a fraction of the size work so well? Maybe size isn’t the right metric after all.
Part of the answer is distillation. You can think of it like this: instead of making every student read every book, you let a professor read them all and then teach a condensed version. The student doesn’t have to experience everything firsthand to become competent. In AI, that means you can train a huge model, then use it to teach a smaller one. The smaller model gets most of the benefit without needing all the raw experience. This isn’t just about efficiency—it’s about recognizing that what matters is the distilled core, not the sprawling mass of details.
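Concretely, the most common form of this is to train the student to match the teacher’s full output distribution rather than hard labels. Here is a minimal sketch of that soft-label recipe (Hinton et al., 2015) in PyTorch; the tiny linear “models”, batch shapes, and temperature are illustrative placeholders, not anything specific to Phi-3’s actual training setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation loss.

    Both logit tensors have shape (batch, num_classes). The temperature
    softens the teacher's distribution so the student learns from the
    relative probabilities of wrong answers, not just the top pick.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# One training step: the frozen teacher labels a batch, and the student
# learns to imitate those soft labels.
teacher = torch.nn.Linear(16, 10)   # stand-ins for real large/small models
student = torch.nn.Linear(16, 10)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 16)             # a batch of inputs
with torch.no_grad():
    teacher_logits = teacher(x)     # the professor "reads the books"
optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits)
loss.backward()
optimizer.step()
```

The temperature is the interesting knob: at T=1 the student only sees the teacher’s confident answer, while higher values expose the “dark knowledge” in how the teacher ranks the alternatives, which is much of what makes the condensed lesson worth learning.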
Another reason small models can work is specialization. Phi-3 isn’t trying to be everything at once; it’s tuned for specific tasks. This is a lesson that evolution figured out a long time ago. A housecat is much smaller than a lion, but if you want to catch a mouse in your kitchen, you’re better off with the cat. The AI world has been building lions—huge, general-purpose systems—but maybe most of what we need are housecats: small, sharp, and perfectly suited to their environment.
Of course, large models still have a role. They’re the professors—the ones who read everything and can teach others. But it’s increasingly clear that the real value comes when you take that knowledge and compress it into something smaller and more nimble. There’s a risk in over-specializing—sometimes your housecat will be useless outside the kitchen—but for most practical problems, a focused tool beats a bloated generalist.
This is a kind of intelligence that’s about efficiency, not brute force. The smartest model isn’t the one that uses the most compute, but the one that solves the problem with the least wasted effort. It’s the difference between an elegant proof and a hundred pages of calculation. The lesson from Phi-3 is that you don’t need to simulate the whole universe to get something useful; you just need to capture the right parts and leave the rest.
This shift is more than just technical. Smaller, distilled models can run on everyday hardware, not just in giant data centers. That means AI can become more personal, more private, and more widely distributed. The old paradigm—bigger, centralized, and exclusive—starts to break down. In its place, we get something lighter and more accessible.
The real story here isn’t about parameter counts or compute budgets. It’s about how progress in AI, like progress everywhere, often comes from doing more with less. The future may not belong to the biggest models, but to the ones that are smart enough to learn what matters and ignore the rest.