AGI Is Far Because the Foundations Are Wrong
We're trying to build human-level intelligence on a prediction engine. That's not how intelligence works, and no amount of scale will fix it.
We want AI to think like humans. But we're training it to do something fundamentally different: predict the next most probable token. Humans never learned to think that way.
Every large language model you interact with today (GPT, Claude, Gemini, DeepSeek) works the same way at its core. Given a sequence of words, predict what word comes next. Do this billions of times across trillions of words, and something remarkable emerges: a system that can write poetry, debug code, and pass medical exams.
But here's the uncomfortable question: Is predicting the next word really the path to intelligence?
I don't think so. And increasingly, the research agrees.
01. THE FOUNDATION: What AI Actually Does
At its core, every LLM is an autoregressive next-token predictor. It sees "The capital of France is" and assigns probabilities: "Paris" gets 97%, "Lyon" gets 1.2%, and so on. It picks the most likely continuation. This happens token by token, each output feeding back as input for the next prediction.
This isn't reasoning. It's sophisticated pattern completion: a statistical engine operating over the compressed patterns of human-written text.
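The mechanics can be sketched in a few lines. In this illustrative sketch, `toy_model` is a hypothetical stand-in for a real network: it hard-codes one probability table for the "capital of France" example, but the decoding loop around it mirrors what every LLM does at inference time.

```python
def toy_model(context):
    """Hypothetical stand-in for an LLM: returns next-token probabilities."""
    if context[-1] == "is":
        return {"Paris": 0.97, "Lyon": 0.012, ".": 0.018}
    return {".": 1.0}

def generate(prompt, max_tokens=5):
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = toy_model(tokens)
        next_token = max(probs, key=probs.get)  # greedy: take the most probable
        tokens.append(next_token)               # output feeds back as input
        if next_token == ".":
            break
    return " ".join(tokens)

print(generate("The capital of France is"))
# "Paris" wins with 97%, then the sentence terminates.
```

Nothing in this loop inspects meaning; the continuation is whatever the probability table favours, which is exactly why a small change in context can flip the answer.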
Prompt A: "Who is the best football player in the world?" → Lionel Messi
Prompt B: "I have a friend from Lisbon. Who is the best football player in the world?" → Cristiano Ronaldo
The word "Lisbon" shifted the statistical landscape toward Portugal-related completions. The model didn't reason about nationality; it followed a probability gradient. A human wouldn't change their answer based on where a friend happens to live.
This example, documented in AI research, reveals something important: the model has no stable world model. It doesn't "know" who the best player is. It computes what token is most probable given the context. Change the context slightly, and the answer flips.
02. THE GAP: How Humans Actually Learn
Human cognition is fundamentally different. From infancy, humans don't just learn correlations; they build causal models of the world.
A baby watches a ball roll behind a screen and expects it to emerge on the other side. That's not pattern matching from training data β that's a causal model of object permanence, developed through active experimentation with the physical world.
Judea Pearl, the Turing Award-winning computer scientist, proposed a three-level framework for causal reasoning in The Book of Why (2018):

- Level 1, Observation: "Customers who buy X also buy Y." This is where current AI lives.
- Level 2, Intervention: "What happens if I change the price of X?" Partially reachable via RL.
- Level 3, Counterfactual: "Would the customer have bought X even without the discount?" This is where humans naturally operate.

Most AI systems operate only at Level 1, observing correlations. Humans naturally operate at all three levels, including counterfactual reasoning: imagining worlds that never happened to understand why things are the way they are.
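The gap between observation and intervention can be made concrete in code. The scenario below is entirely invented (a hidden "loyalty" trait confounds discounts and purchases): reading the correlation off the data overstates the discount's effect, while simulating Pearl's do-operator, forcing the discount on or off, recovers the much smaller true causal effect.

```python
import random

random.seed(0)

def simulate(do_discount=None):
    """One customer. Pass do_discount=True/False to intervene (the do-operator)."""
    loyal = random.random() < 0.5                 # hidden confounder
    # Loyal customers are more often offered a discount...
    discount = (random.random() < (0.8 if loyal else 0.2)
                if do_discount is None else do_discount)
    # ...and loyal customers buy more, regardless of the discount.
    buy_prob = 0.7 if loyal else 0.2
    if discount:
        buy_prob += 0.1                           # the true causal effect is small
    return discount, random.random() < buy_prob

# Level 1 (observation): the correlation, inflated by the confounder.
obs = [simulate() for _ in range(100_000)]
p_buy_given_disc = sum(b for d, b in obs if d) / sum(d for d, _ in obs)
p_buy_given_no = sum(b for d, b in obs if not d) / sum(not d for d, _ in obs)

# Level 2 (intervention): force the discount on or off for everyone.
p_do_disc = sum(simulate(True)[1] for _ in range(100_000)) / 100_000
p_do_no = sum(simulate(False)[1] for _ in range(100_000)) / 100_000

print(f"observed lift:       {p_buy_given_disc - p_buy_given_no:.2f}")
print(f"interventional lift: {p_do_disc - p_do_no:.2f}")
```

A system that only sees logged data can never run the second half of this script; it is stuck on rung one.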
Human Learning
- Builds causal mental models from birth
- Learns through intervention & experimentation
- Reasons counterfactually ("what if?")
- Generalises from very few examples
- Forward-looking: generates novel hypotheses
- Understands mechanism, not just correlation
LLM Training
- Learns statistical patterns from text
- Passive observation of training data
- No native counterfactual capability
- Requires billions of examples
- Backward-looking: mirrors existing data
- Captures correlation, not causation
A 2024 paper in Nature Reviews Psychology by Goddu & Gopnik describes how human causal understanding is uniquely "depersonalised" and "decontextualised": we form abstract, general rules that transfer across contexts. A child who learns that fire is hot doesn't need to touch every fire. An LLM trained on fire-related text has no such understanding; it has statistics about which words appear near "fire."
03. THE EVIDENCE: Where Prediction Breaks Down
The cracks in next-token prediction (NTP) aren't theoretical; they've been empirically demonstrated.
Error compounding: Each predicted token becomes input for the next. Small inaccuracies at step 10 can cascade into complete nonsense by step 100. Humans don't work this way; we maintain a stable internal model that self-corrects.
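A back-of-the-envelope calculation shows how fast this compounds. Assuming, as a simplification, that each token is correct with independent probability p, the chance of an error-free n-token generation decays exponentially:

```python
# If each token is right with independent probability p, the chance that an
# n-token generation contains no error is p**n. Independence is a
# simplification; the exponential trend is the point.
p = 0.99  # 99% per-token accuracy

for n in (10, 100, 1000):
    print(f"P(all {n} tokens correct) = {p ** n:.4f}")
```

Even at 99% per-token accuracy, a 100-token output is error-free barely a third of the time, and a 1000-token output almost never.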
Teacher-forcing failure: During training, the model always sees the correct preceding tokens (from training data). At inference, it sees its own outputs, which may be wrong. Bachmann & Nagarajan (2024) showed that in planning tasks, this mismatch means the model never actually learns to plan; it learns to copy from the correct answer.
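The mismatch can be sketched as two loops whose only difference is where the context comes from. `DummyModel` here is a hypothetical placeholder (it just parrots the last token); the point is the provenance of the inputs, not the predictor itself.

```python
class DummyModel:
    """Hypothetical stand-in predictor: it simply repeats the last token."""
    def neg_log_prob(self, token, context):
        return 0.0 if token == context[-1] else 1.0
    def sample(self, context):
        return context[-1]

def train_step(model, target_tokens):
    """Teacher forcing: the model ALWAYS conditions on ground-truth prefixes."""
    loss = 0.0
    for t in range(1, len(target_tokens)):
        context = target_tokens[:t]           # correct tokens from the dataset
        loss += model.neg_log_prob(target_tokens[t], context)
    return loss

def generate(model, prompt, n):
    """Inference: the model conditions on its OWN, possibly wrong, outputs."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model.sample(tokens))   # self-generated context feeds back
    return tokens

model = DummyModel()
print(train_step(model, ["go", "left", "then", "right"]))
print(generate(model, ["go", "left"], 3))
```

During training the model is never punished for the situations its own mistakes create, because it never sees them; at inference, those situations are all it sees.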
No causal understanding: Felin & Holweg argue in Strategy Science that LLMs have no forward-looking mechanism or causal logic. They can't generate genuinely new knowledge; they can only recombine what exists in their training data.
Consider the data gap: a human child is exposed to roughly 150,000 times less linguistic data than GPT-4 was trained on, yet develops the ability to reason counterfactually, invent novel tools, and understand the feelings of others. The difference isn't quantity. It's the nature of the learning process itself.
04. THE PROOF: Three Domains Where the Cracks Are Loudest
If prediction-as-foundation were just a theoretical concern, we could ignore it. But there are three domains where the gap between "predicting the next token" and "actually understanding" is painfully visible: mathematics, video generation, and audio/music.
Mathematics: Where Pattern Matching Meets Its Limit
LLMs should be good at math; they've seen trillions of equations. But arithmetic is rule-based, not statistical. Predicting the most probable next digit is fundamentally different from executing a carry operation.
The root cause is architectural. Transformer-based models optimise for the most likely next token, not for rule-based procedures. A model can learn that "2 + 2 =" is usually followed by "4" as a memorised pattern. But ask it to multiply 48,793 × 7,604 and it produces answers that are off by hundreds because carry propagation across digits isn't consistently learned. It's predicting what a correct answer looks like, not computing one.
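Contrast that with the procedure itself. Grade-school long multiplication is a handful of fixed rules with explicit carry propagation, and those rules generalise to any operands with zero training data:

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school multiplication with explicit, rule-based carry handling."""
    a_digits = [int(d) for d in str(a)][::-1]   # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    result = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10          # the digit stays in place
            carry = total // 10                 # the carry propagates left
        result[i + len(b_digits)] += carry
    return int("".join(map(str, result[::-1])))

print(long_multiply(48793, 7604))  # matches 48793 * 7604 exactly
```

The algorithm is exact because it executes a mechanism rather than imitating the surface form of worked examples.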
Researchers found that even when LLMs get the right final answer on olympiad problems, the underlying proofs often contain flawed logic, unjustified assumptions, and pattern-matching shortcuts rather than genuine mathematical reasoning.
Video Generation: Prediction Can't Simulate Physics
Video generation models like Sora, Runway, and Kling generate frames by predicting what the next frame should look like given the previous ones. The results are stunning until you look closely.
OpenAI's own technical report on Sora admits: the model doesn't accurately model the physics of basic interactions. A bitten cookie doesn't show a bite mark. A chair doesn't behave as a rigid object. Objects spontaneously appear and disappear. Hands grasp incorrectly. Liquids behave impossibly.
This isn't a training data problem; it's a paradigm problem. A prediction model learns what video frames typically look like after other frames. It has no internal model of gravity, rigidity, conservation of mass, or cause and effect. A human toddler understands that a ball drops when released. The most expensive video model in history doesn't.
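For comparison, an explicit causal model of a falling ball is tiny. In the sketch below (units and timestep chosen arbitrarily for illustration), gravity is a mechanism in the state update, so the output cannot violate it, no matter what any training data looked like:

```python
G = 9.81   # gravitational acceleration, m/s^2
DT = 0.1   # timestep, seconds

def drop(height, steps):
    """Simulate a released ball: gravity is an explicit mechanism, not a pattern."""
    velocity = 0.0
    trajectory = [height]
    for _ in range(steps):
        velocity -= G * DT                         # cause: gravity acts on velocity
        height = max(0.0, height + velocity * DT)  # effect: velocity moves the ball
        trajectory.append(round(height, 2))
    return trajectory

print(drop(10.0, 15))  # monotonically falling, clamped at the ground
```

A frame predictor has no variables like `velocity` or `G` to be consistent with; it can only reproduce what falling typically looks like.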
Audio & Music: Coherent Sounds, Incoherent Structures
AI music generators (Suno, Udio, Stable Audio) can produce impressive 30-second clips. But ask them for a 3-minute song with verse-chorus-bridge structure, thematic development, and an emotional arc, and the prediction paradigm falls apart.
The problem is the same: predicting what audio sample comes next is not the same as understanding musical structure. A human composer knows that tension built in a verse must resolve in a chorus, understands that a key change signals an emotional shift, and plans a narrative arc across minutes of music. AI music models produce locally coherent sounds (any given 100-millisecond window sounds right), but the global structure drifts, loops awkwardly, or loses thematic identity.
Research in Scientific Reports (2025) confirms that long-term structural coherence and emotional nuance remain the two hardest challenges in AI music; both are symptoms of the prediction-vs-understanding gap.
In all three domains (math, video, and audio), the failure mode is identical: local coherence without global understanding. The next token, frame, or sample is plausible. But the system has no model of why things happen, no plan for where they're going, and no ability to self-correct against an internal representation of how the world actually works.
This is exactly what you'd expect from a system optimised for prediction rather than reasoning.
05. THE PATCHES: What We're Trying (And Why It's Not Enough)
The AI research community knows about these problems, and several alternatives to vanilla NTP are being explored: multi-token prediction, pretraining on compressed "future summaries" of longer continuations, and selective losses that weight the tokens that matter most (see Mahajan et al., Lin et al., and Wyatt et al. in the references).
All of these approaches share a common trait: they're improvements within the prediction paradigm. They make prediction better, faster, or more long-range. But they don't change the fundamental game.
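One widely explored patch, multi-token prediction (echoed in the references below), can be sketched to show what "improvement within the paradigm" means. The numbers here are random stand-ins for real model outputs; only the loss structure is the point: instead of scoring just the next token, the model is also penalised on tokens further ahead.

```python
import math
import random

random.seed(0)
VOCAB, HORIZON = 50, 4   # vocabulary size, lookahead depth (toy values)

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# One training position: HORIZON logit vectors (for offsets +1 ... +HORIZON)
# standing in for hypothetical model heads, plus the future ground-truth tokens.
logits_per_offset = [[random.gauss(0, 1) for _ in range(VOCAB)]
                     for _ in range(HORIZON)]
future_targets = [random.randrange(VOCAB) for _ in range(HORIZON)]

# The multi-token objective averages the per-offset losses.
loss = sum(cross_entropy(logits_per_offset[k], future_targets[k])
           for k in range(HORIZON)) / HORIZON
print(f"averaged multi-token loss: {loss:.3f}")
```

Note what hasn't changed: the training signal is still "guess upcoming tokens", just over a longer horizon. The objective is broader prediction, not causal understanding.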
"Human cognition is forward-looking, driven by theory-based causal logic. AI is backward-looking and imitative. This is not a critique; it's a description of structural limits." (Felin & Holweg, Strategy Science, 2024)
06. THE ARGUMENT: We Need a New Foundation
Here's my core argument: we won't reach human-level AI by making better predictors.
Prediction is a powerful capability. It gives us autocomplete, translation, summarization, and impressive conversational ability. But it's one tool in the cognitive toolbox, not the foundation.
Humans are built on a stack of capabilities that prediction alone cannot replicate:
Embodied experience: We learn physics by interacting with the physical world, not by reading about it.
Causal models: We build mental models of how things work, enabling intervention and prediction of novel situations.
Counterfactual reasoning: We imagine alternative scenarios to learn from experiences we never had.
Theory formation: We generate hypotheses that go beyond available data, and then test them.
Social cognition: We model other minds, predict intentions, and understand perspectives.
Current AI has none of these natively. What it has is a spectacular ability to pattern-match over the outputs of human cognition (text), which creates a convincing illusion of the underlying processes.
The path forward isn't just "better NTP." It likely involves:
- Training on causal structures, not just correlations
- Embodied learning environments where agents discover physics
- Architectures that separate "knowing what comes next" from "knowing why"
- Systems that can form and test hypotheses, not just complete patterns
We built AI that can mimic the surface of human thought. Now we need to build AI that can replicate its foundations.
Prediction got us remarkably far. But to get where we actually want to go, to machines that truly reason, plan, and understand, we need to stop optimising for the next token and start asking: what's the right training signal for intelligence itself?
References

- Pearl, J. & Mackenzie, D. The Book of Why (2018)
- Bachmann & Nagarajan, "The Pitfalls of Next-Token Prediction", arXiv:2403.06963 (2024)
- Felin & Holweg, "Theory Is All You Need: AI, Human Cognition, and Causal Reasoning", Strategy Science (2024)
- Goddu & Gopnik, "The Development of Human Causal Learning and Reasoning", Nature Reviews Psychology (2024)
- Mahajan et al., "Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries", OpenReview (2025)
- Lin et al., "RHO-1: Not All Tokens Are What You Need", arXiv (2025)
- Wyatt et al., "Alternatives to NTP: A Comprehensive Taxonomy" (2025)