Uncovering Research Links with Embeddings

There’s a paradox in how we do research. We have more scientific knowledge than ever—millions of papers pouring out every year. The internet has made the world’s libraries weightless and everywhere. Yet, finding the one paper you need can feel as hard as ever. It’s like trying to spot a single fish in an ocean, even though the water is clear.

Why is this? It’s not for lack of data. The problem is that knowledge isn’t a list of facts, or a neat tree. It’s a tangled web. The way we store and retrieve it—by keywords, by author, by citation—reflects the surface, not the substance. These tools work if you already know what you’re looking for, or if someone else has made the connection before. But the biggest discoveries are often the ones no one saw coming. Like a trick from computational linguistics that cracks a problem in biology. Or a distributed computing paper that quietly solves a puzzle related to centralized banking. Traditional search methods are blind to these hidden bridges. They only find what’s already been mapped.

Over the last decade, a new approach has appeared: embeddings. The idea is simple, even though thinking about our world in more than three dimensions is not. Instead of matching words, embeddings try to capture meaning through mathematical abstraction. Each paper, paragraph, or sentence gets compressed into a list of numbers—a point in a high-dimensional space. If two papers are conceptually close, their points are close too, even if they use different language. You can imagine the entire literature as a cloud of stars, with clusters and filaments, and strange neighbors you’d never expect.
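The geometry here can be made concrete with a toy sketch. Real systems use trained models producing vectors with hundreds of dimensions; the hand-picked four-dimensional vectors and paper labels below are purely illustrative, but the measure of "closeness" (cosine similarity, the angle between vectors) is the one most embedding systems actually use:

```python
import math

def cosine_similarity(a, b):
    # Closeness in embedding space is the angle between vectors,
    # not word overlap: 1.0 means same direction, 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings (made-up numbers for illustration).
paper_nlp  = [0.9, 0.1, 0.8, 0.2]    # sequence models for language
paper_bio  = [0.85, 0.15, 0.75, 0.3] # motif detection in protein sequences
paper_econ = [0.1, 0.9, 0.2, 0.8]    # auction design under uncertainty

print(cosine_similarity(paper_nlp, paper_bio))   # high: conceptually close
print(cosine_similarity(paper_nlp, paper_econ))  # low: different neighborhood
```

The point of the sketch: the NLP and biology papers score as near neighbors despite sharing no vocabulary, because their vectors point the same way.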

This changes what’s possible. Suddenly, you can stumble across a paper in another field that’s close to your own work, even if it never uses your jargon. You can find the “unknown unknowns”—connections you didn’t know to look for. It’s a kind of serendipity, but engineered.
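In practice, that engineered serendipity is a nearest-neighbor search: embed a query, rank every paper in the corpus by similarity, and return the top few. A minimal sketch, with an invented mini-corpus of hand-written three-dimensional vectors standing in for a real embedding index:

```python
import numpy as np

# Hypothetical mini-corpus: each row is one paper's embedding (made-up numbers).
corpus = np.array([
    [0.9, 0.1, 0.8],    # 0: NLP paper
    [0.85, 0.2, 0.75],  # 1: genomics paper using sequence models
    [0.1, 0.9, 0.2],    # 2: auction-theory paper
    [0.2, 0.8, 0.1],    # 3: mechanism-design paper
])

def nearest(query, corpus, k=2):
    # Cosine similarity of the query against every row, then take the top k.
    q = np.asarray(query, dtype=float)
    sims = (corpus @ q) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]

# A query embedded near the NLP paper also surfaces the genomics paper,
# even though the two fields never share a keyword.
print(nearest([0.88, 0.12, 0.78], corpus))
```

Real systems swap the brute-force scan for an approximate index so the search scales to millions of papers, but the logic is the same: the "unknown unknowns" are simply the unexpected rows near the top of the ranking.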

But there are traps here too. Similarity isn’t always relevance. Two papers might be mathematically close but miles apart in rigor or usefulness. Worse, if everyone starts searching by embeddings, we might just cluster more tightly, missing the outliers that drive real progress. There’s a danger that the map becomes the territory, and we all end up walking the same well-trodden paths, just in a higher-dimensional space.

And embeddings aren’t magic. They reflect the data and the models that made them. If the model was trained mostly on biomedical texts, it might miss something subtle in economics or philosophy. Every method of compression loses something; every model has its biases. There’s no way to distill the full, messy richness of human thought into a point in space.

Yet the promise is real. Embeddings hint at a future where the boundaries between fields blur, and where the real structure of ideas emerges not from journals or citation networks, but from the ideas themselves. Maybe it will even change how we write: if clarity helps your work get discovered, maybe we’ll see less jargon and more plain language. Or maybe it will just make it easier to get lost in the literature, wandering off into strange neighborhoods and finding something no one expected.

Embeddings won’t solve the puzzle of knowledge overnight. But they’re a step toward a different kind of map—one that might show us not just where we’ve been, but where we could go next.
