Or why Super AI will require a different kind of technology
Let’s skip the long introductions and get straight to the point: the main strength of large language models (LLMs) is that almost anything in the world can, in some way, be described through text. And, at the same time, that’s their biggest limitation.
Text is a universal code — a brilliant invention of the human mind that lets us describe nearly anything and preserve that description so others can understand it. And it’s not just humans who can understand it — machines can too. Using only text, large language models can interact with people and the world around them. Describe facts, objects, events, or phenomena in words and sentences, and the model can “grasp” them. In a sense, you can build an “intelligent machine” without senses, one that experiences the world entirely as text.
On the other hand, any textual description is inherently approximate. No matter how detailed we try to be, text alone can never capture everything perfectly. Take an apple, for instance. Imagine trying to describe its surface under a magnifying glass — the shape and size of every speck and every vein. Now imagine doing the same thing under a microscope. That would take thousands of words — and that’s just for the surface, a tiny part of the apple. In short, text can only ever give a partial picture of an object or phenomenon, balancing between accuracy and brevity.
For humans, this isn’t such a big problem. Text usually serves as a cue, and we fill in the rest with experience and imagination. Machines, however, have no “grounding” in reality — no direct experience of the world. They don’t have senses to perceive it firsthand. As a result, the world knowledge of LLM-based models is limited: it simply lacks fine-grained detail.
On top of that, most of their training and retraining relies on internet data, so much of the information they learn can hardly be called authoritative, precise, or detailed. And it’s worth noting that an increasing share of online content is itself AI-generated.
If we tried to train an AI based on LLMs so that its understanding of the world and perception of reality were even remotely comparable to a human’s, we would need an enormous amount of text — painstakingly detailed descriptions of everything a person can learn in just a fleeting glance or a few seconds of handling an object. Clearly — and the apple example above illustrates this perfectly — this approach would be extremely labor-intensive and, in the end, a dead end. It would consume vast amounts of resources and eventually hit a limit, yet the level of detail would still fall far short of giving the machine a human-like understanding of reality.
The conclusion is simple. LLMs can be used to build various specialized AI models. But they are not suitable for creating a full-fledged general AI capable of performing all human tasks at even an average level. And, of course, they are not suitable for creating a strong general AI that could surpass human experts in every domain.
In short, LLMs are an incredible all-purpose tool, built on another all-purpose tool — text. There’s still plenty of room to improve them and expand the ways we can use them. But even now, it’s clear they have serious limitations that make them a poor foundation for creating general AI, let alone a strong, superhuman AI.
It seems likely that solving this challenge will require a different kind of model — one that can learn through direct interaction with the real world, using something like human senses. That said, it’s reasonable to expect that LLMs will still play an important role within these future systems.
The topic of advanced AI models is so fascinating that it really deserves a discussion all on its own.