Machine Grading Is Lousy

This is my brief commentary on automated essay grading. At first, the idea seems reasonable. We’ve built machines that can beat grandmasters at chess, spot faces in crowds, even write poems that aren’t terrible (the more primitive the language model, the better). How hard can it be to grade a high school essay?

Harder than you’d think. The trouble is that essays aren’t puzzles with fixed answers. They’re more like conversations. When you read an essay, you’re not just checking for grammar or counting big words. You’re following someone’s train of thought, noticing how they connect ideas, sometimes catching flashes of real originality. These are precisely the things machines struggle with.

That’s the heart of the problem: the qualities that make an essay good are the hardest for machines to see. If you ask an AI to grade essays, it’s easy to end up rewarding the wrong things—writing that’s safe, formulaic, or just verbose. The more you reward this, the more students will write for the grader instead of for themselves. It’s Goodhart’s law, the age-old problem of measurement: as soon as you measure something, people start optimizing for the metric. If the metric is easy to game, the work gets hollowed out.

This isn’t a new problem. Back in the 1960s, the first attempts at automated grading were almost comically superficial—counting words, checking sentence length, looking for “advanced vocabulary.” Students learned to pad their essays with fluff and fancy words. Modern AI is better at picking up subtler patterns, but in the end, it still learns from past data. And past data mostly rewards conformity. If a student writes something weird and brilliant, will the machine notice? Or will it mark them down for not ticking the usual boxes?
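
To see how gameable that kind of scoring is, here’s a toy version of a surface-feature grader, roughly in the spirit of those early systems. The feature weights and the “advanced vocabulary” list are invented for illustration, not drawn from any real grader; the point is just that every signal it looks at can be inflated by padding.

```python
# A toy scorer in the spirit of 1960s surface-feature graders.
# The weights and word list below are invented for illustration only.

FANCY_WORDS = {"plethora", "myriad", "juxtaposition", "paradigm", "quintessential"}

def surface_score(essay: str) -> float:
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    word_count = len(words)
    avg_sentence_len = word_count / max(len(sentences), 1)
    fancy_hits = sum(1 for w in words if w.strip(".,;:").lower() in FANCY_WORDS)
    # Longer essays, longer sentences, and "advanced vocabulary" all push
    # the score up, which is exactly why padding works.
    return 0.5 * word_count + 2.0 * avg_sentence_len + 5.0 * fancy_hits

short_original = "Cities grow because jobs cluster. That clustering feeds on itself."
padded_copy = short_original + " " + " ".join(
    ["Indeed, a veritable plethora of myriad factors, in essence,"] * 5
)

print(surface_score(short_original))  # modest score
print(surface_score(padded_copy))     # much higher, despite saying nothing new
```

Run on a short, substantive paragraph versus the same paragraph stuffed with filler, the padded copy scores far higher. That is exactly the failure mode students learned to exploit.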

Imagine two essays on the same topic. One is tidy, with a clear thesis, nice transitions, and a few five-dollar words. The other is rougher, maybe even awkward, but has an idea that’s genuinely new. A good teacher might be delighted by the second essay. Will an AI? Probably not. The danger is that the more we automate grading, the more we teach students to play it safe. Writing becomes a game of not making mistakes, rather than saying something interesting.

Some people argue that AI grading is a necessary evil. With thousands of students in a MOOC, you can’t have a human read every essay. Maybe machines can at least do triage—flag the ones that need a closer look, or handle the routine cases. But there’s a deeper problem: fairness isn’t just about consistency. It’s about seeing the person behind the words. If the AI is trained on a narrow idea of good writing, it’s going to miss voices that don’t fit the mold. That’s not a technical glitch; it’s a fundamental limitation.

There’s also the question of trust. If you get a bad grade from a machine, can you argue with it? Does anyone know how the decision was made? Or is it just a black box? Education is built on trust—between students and teachers, teachers and institutions. Once you replace judgment with an algorithm, you risk breaking that link.

What’s the answer? Not to throw out automation entirely, but to use it carefully. Let machines do the boring parts: check for obvious errors, flag things for review, maybe help teachers see patterns across lots of essays. But leave the real judgment to humans. The goal isn’t to automate understanding, but to make it easier for people to focus on what matters.
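
As a rough sketch of that division of labour, here’s what a triage pass might look like. The checks, thresholds, and names are hypothetical; the design point is that the machine only sorts and flags, and every grade still comes from a person.

```python
# A minimal sketch of machine-assisted triage, not a real grading pipeline.
# The specific checks and thresholds here are hypothetical.

def mechanical_flags(text: str, min_words: int = 150) -> list[str]:
    """Cheap, objective checks a machine can run without judging quality."""
    flags = []
    words = text.split()
    if len(words) < min_words:
        flags.append(f"short: {len(words)} words")
    if "." not in text and "?" not in text and "!" not in text:
        flags.append("no sentence punctuation")
    if len({w.lower() for w in words}) < len(words) * 0.4:
        flags.append("highly repetitive vocabulary")
    return flags

def triage(essays: dict[str, str]) -> list[tuple[str, list[str]]]:
    """Order essays so the most-flagged ones reach a teacher first.
    Every essay still ends up in front of a human; nothing is auto-graded."""
    flagged = [(essay_id, mechanical_flags(text)) for essay_id, text in essays.items()]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)

# Example: a small batch, one essay trips the word-count check,
# the other trips the repetition check.
batch = {
    "essay-001": "Cities grow because jobs cluster. That clustering feeds on itself.",
    "essay-002": " ".join(["This essay develops a longer argument about urban growth."] * 30),
}
for essay_id, flags in triage(batch):
    print(essay_id, flags)
```

Note what’s missing: there is no quality score anywhere in the sketch. The machine’s job ends at sorting the pile; the judgment starts where the teacher picks it up.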

If you try to automate essay grading completely, you end up learning something important—not about essays, but about people. The things that make writing good are messy, surprising, and deeply human. And for now, at least, machines are still outsiders to that conversation.
