The Ghost In The Glass – Tim Atkin – Master of Wine

March 18, 2026

Anyone following wine discourse remembers what happened in early 2023. That was when ChatGPT, released at the end of November 2022, seemed to threaten the jobs of wine writers: an “extinction level event,” some declared. Soon after, reports circulated that it had scored 92% on the introductory Court of Master Sommeliers test, 86% on the Certified exam, and 77% on the Advanced one. Of course, there were no tests for tasting or service, but it seemed as though wine knowledge gained over years of experience had come to naught.

As a wine writer, I also watched as several clients — concerned about SEO rankings and Google algorithms — began requiring proof from online analysers that my work contained no AI inputs. That trend lasted about six months. Today, no one seems to care.

Then, in early summer 2025, a freelance job came my way that seemed deeply incongruous with my usual beat: writing scenarios designed to definitively stump the most advanced Large Language Models (LLMs) in the world. The project had the direst name imaginable: “Humanity’s Last Exam” (HLE). Whenever I could, I introduced wine-themed topics into the mix, alongside questions from Chinese and European history.

Vocal Criticism

I didn’t stop writing about wine, but when I mentioned this new entry in my portfolio, one client accused me of being the Benedict Arnold of wine journalism. A friend who owns an art gallery in Manhattan responded with a disapprovingly incredulous, “Really?” I had, apparently, sold out to the “tech bros.”

I was taken aback, but the criticism forced me to reflect. Ultimately, I concluded that it is better to help train LLMs with high-quality information, because their advance is inevitable. Besides, my job was to force them into error, and then explain why they failed. From the sceptics’ perspective, LLMs would quickly replace wine writers of all stripes.

I didn’t believe it then. I believe it even less now.

The Exam

This second phase of HLE assembled thousands of graduate students, professors, and PhDs. Our tasks included not only writing the “stump” scenarios but also verifying the validity of each other’s work. As experts, we had to achieve a level of certainty where 90% of specialists in a given field would agree a model had failed. We could appeal to published academic literature and primary sources to make our case.

Each model had a different “character,” and they changed rapidly, usually becoming more agile in their reasoning. Within a couple of months, colleagues in group chats were complaining that “Model 1” or “Model 2” had become impossible to stump. Model 1 had killer reasoning, while Model 2 had access to better information; Model 3 usually displayed inferior knowledge.

Perhaps scarily, Model 1 learned quickly. Obscure academic papers or eighteenth-century writings that initially stumped the model were soon absorbed and mastered. Model 1 also developed a habit of writing long screeds detailing each step of its “thought” process — a kind of performative rationality.

The Humanities (and Wine) Get Short Shrift

We had online meetings and a chat community. There were far fewer of us on the Humanities side than among the STEM experts. Zoom meetings were at times tense, at times hilarious: imagine academic egos being told what to do by Silicon Valley computer scientists. Overall, though, interactions were positive.

I was, however, the only one who knew anything about wine.

Here’s one example question from Humanity’s Last Exam (phase 1) that stumped the LLMs:

“Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”

You can see the format: hyper-specific, verifiable, and reducible to a single correct answer. This works beautifully for anatomy or mathematics. It works less well for the nebulous, subjective, and culturally entangled fields that make wine significant.

Wine in the Arena: Two Case Studies

Some of my “stumping” successes in wine topics drew on the kind of deep historical and philosophical research that makes wine culture so endlessly fascinating — and so resistant to algorithmic reduction.

The 1647 Bordeaux Classification: Politics Before Terroir

The famed 1855 Classification of Bordeaux looms so large in wine consciousness that we sometimes forget it was the culmination of over two centuries of attempts to codify quality. One of the earliest documented efforts came in 1647, when Bordeaux’s jurats (city magistrates) sought to fix prices for wines “of each district” to combat the “monopolies of foreigners” — Flemish merchants who had introduced the custom of publishing annual price tiers.

The 1647 classification produced an anomalous result. Wines from the Palus — the low-lying, alluvial marshlands along the Garonne — were priced at or above wines from the Graves, the gravelly, well-drained uplands that would later be recognised as superior terroir. By any measure of the quality hierarchy that solidified over the next two centuries, this was inverted. Palus wines, with their higher yields and dilute fruit, were historically considered inferior.

What happened? The 1647 exercise was not a disinterested assessment of terroir. It was a political and mercantile act. The result demonstrates something the modern wine world often prefers to forget: that “quality” has always been, in part, administratively constructed. Classification is a tool of commerce, not merely a reflection of natural hierarchy.

I asked the LLMs to deduce, from primary and (obscure) secondary sources available online, which characteristic of the 1647 classification appeared anomalous. They struggled. The models could retrieve facts about the 1855 classification with ease, but the interpretive work (synthesising obscure historical documents, understanding the political economy of 17th-century Bordeaux, and recognising an inversion of expected outcomes) required the kind of contextual judgment that remains distinctly human.

Mandeville’s Wines: Hermitage, Pontack, and the Philosophy of Desire

The early 18th-century philosopher Bernard de Mandeville is best known for The Fable of the Bees, his provocative argument that private vices — greed, vanity, lust — produce public benefits through economic activity. Wine runs through his work as a recurring metaphor.

In one famous passage, Mandeville insists that if a man cannot afford “a clean woman,” his lust will seek out “dirty drabs,” just as a man unable to buy “true Hermitage or Pontack, will be glad of more ordinary French Claret.” Elsewhere, he notes that a foot soldier may get as drunk on stale beer as a lord on “Burgundy, Champagne or Tokay.”

The wines are carefully chosen. “Hermitage” refers to the Northern Rhône’s noble Syrah. “Pontack” is not a region but a family name — the Pontacs owned Château Haut-Brion and operated a famous London tavern that made the estate synonymous with elite Bordeaux.

Mandeville invoked these wines to critique the assumption that what we drink reflects who we are. The rich pay a guinea for a bottle not as a sign of moral superiority, he argues, but simply to gratify their need for pleasure.

I asked the LLMs to identify which wines Mandeville named as representing elite rank. The models performed reasonably well on the Burgundy-Champagne-Tokay triad. But the Hermitage-Pontack pairing, embedded in a more obscure passage and requiring knowledge of 18th-century London wine culture, proved elusive. The models either missed “Pontack” entirely or failed to connect it to Haut-Brion. I had succeeded: they were stumped.

The Limits of the Machine — and the Exam

There are problems with the HLE approach, and they reveal something important about wine. The exam’s methodology demands questions with a single, certain, verifiable answer. This works for chemistry, mathematics, and anatomy. But as one critic of HLE observed, “the exam’s questions are heavily skewed toward certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest.” If your work involves writing, communication, or critical thinking, “the exam tells you almost nothing about which model might serve you best.”

We academics working in the humanities were constantly asking: how are we supposed to create a PhD-level prompt that can be reduced to a single question with a definitive answer?

This is the crux of the matter. Wine is not a problem to be solved. It is a conversation to be had.

Why LLMs Will Not Replace Wine Writers

Consider the question: Which is the best wine in the world? Or even: Which is the best wine from Rioja? There is no certain answer that 90% of experts would agree upon. This is precisely why LLMs, for all their dazzling pattern recognition, will never replace wine writers.

Wine is about humanity — and the humanities: literature, history, philosophy, and the accumulated weight of experience distilled into a glass. The 1647 Bordeaux classification was not just an economic exercise; it was a political act. Mandeville’s invocation of Hermitage and Pontack was not about flavour profiles; it was about the moral philosophy of desire.

An LLM can retrieve these facts and even synthesise them with impressive fluency. But it cannot taste. It cannot prefer. It cannot sit across from you at dinner and argue, with passion and prejudice, that the 2010 López de Heredia Viña Tondonia Reserva is, on this particular evening, the most profound wine you have ever shared.

Wine appreciation is an act of interpretation, memory, and community — irreducibly human. The elixir of civilisation for thousands of years will not yield its secrets to a benchmark.

And that, I think, is rather good news.

Photo by Debashis RC Biswas on Unsplash
