AGI Is Far Because the Foundations Are Wrong
We're trying to build human-level intelligence on a prediction engine. That's not how intelligence works, and no amount of scale will fix it.
We want AI to think like humans. But we're training it to do something fundamentally different: predict the next most probable token. Humans never learned to think that way.
Every large language model you interact with today (GPT, Claude, Gemini, DeepSeek) works the same way at its core. Given a sequence of words, predict what word comes next. Do this billions of times across trillions of words, and something remarkable emerges: a system that can write poetry, debug code, and pass medical exams.
But here's the uncomfortable question: Is predicting the next word really the path to intelligence?
I don't think so. And increasingly, the research agrees.
01. THE FOUNDATION: What AI Actually Does
At its core, every LLM is an autoregressive next-token predictor. It sees "The capital of France is" and assigns probabilities: "Paris" gets 97%, "Lyon" gets 1.2%, and so on. It picks the most likely continuation. This happens token by token, each output feeding back as input for the next prediction.
This isn't reasoning. It's sophisticated pattern completion: a statistical engine operating over the compressed patterns of human-written text.
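The mechanics can be sketched in a few lines. In this illustrative sketch, `toy_model` is a hypothetical stand-in for a real network: it hard-codes one probability table for the "capital of France" example, but the decoding loop around it mirrors what every LLM does at inference time.

```python
def toy_model(context):
    """Hypothetical stand-in for an LLM: returns next-token probabilities."""
    if context[-1] == "is":
        return {"Paris": 0.97, "Lyon": 0.012, ".": 0.018}
    return {".": 1.0}

def generate(prompt, max_tokens=5):
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = toy_model(tokens)
        next_token = max(probs, key=probs.get)  # greedy: take the most probable
        tokens.append(next_token)               # output feeds back as input
        if next_token == ".":
            break
    return " ".join(tokens)

print(generate("The capital of France is"))
# "Paris" wins with 97%, then the sentence terminates.
```

Nothing in this loop inspects meaning; the continuation is whatever the probability table favours, which is exactly why a small change in context can flip the answer.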
Prompt A: "Who is the best football player in the world?" → Lionel Messi
Prompt B: "I have a friend from Lisbon. Who is the best football player in the world?" → Cristiano Ronaldo
The word "Lisbon" shifted the statistical landscape toward Portugal-related completions. The model didn't reason about nationality; it followed a probability gradient. A human wouldn't change their answer based on where a friend happens to live.
This example, documented in AI research, reveals something important: the model has no stable world model. It doesn't "know" who the best player is. It computes what token is most probable given the context. Change the context slightly, and the answer flips.
02. THE GAP: How Humans Actually Learn
Human cognition is fundamentally different. From infancy, humans don't just learn correlations; they build causal models of the world.
A baby watches a ball roll behind a screen and expects it to emerge on the other side. That's not pattern matching from training data β that's a causal model of object permanence, developed through active experimentation with the physical world.
Judea Pearl, the Turing Award-winning computer scientist, proposed a three-level framework for causal reasoning in The Book of Why (2018):

- Level 1, Observation: "Customers who buy X also buy Y." This is where current AI lives.
- Level 2, Intervention: "What happens if I change the price of X?" Partially reachable via RL.
- Level 3, Counterfactual: "Would the customer have bought X even without the discount?" This is where humans naturally operate.

Most AI systems operate only at Level 1, observing correlations. Humans naturally operate at all three levels, including counterfactual reasoning: imagining worlds that never happened to understand why things are the way they are.
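The gap between observation and intervention can be made concrete in code. The scenario below is entirely invented (a hidden "loyalty" trait confounds discounts and purchases): reading the correlation off the data overstates the discount's effect, while simulating Pearl's do-operator, forcing the discount on or off, recovers the much smaller true causal effect.

```python
import random

random.seed(0)

def simulate(do_discount=None):
    """One customer. Pass do_discount=True/False to intervene (the do-operator)."""
    loyal = random.random() < 0.5                 # hidden confounder
    # Loyal customers are more often offered a discount...
    discount = (random.random() < (0.8 if loyal else 0.2)
                if do_discount is None else do_discount)
    # ...and loyal customers buy more, regardless of the discount.
    buy_prob = 0.7 if loyal else 0.2
    if discount:
        buy_prob += 0.1                           # the true causal effect is small
    return discount, random.random() < buy_prob

# Level 1 (observation): the correlation, inflated by the confounder.
obs = [simulate() for _ in range(100_000)]
p_buy_given_disc = sum(b for d, b in obs if d) / sum(d for d, _ in obs)
p_buy_given_no = sum(b for d, b in obs if not d) / sum(not d for d, _ in obs)

# Level 2 (intervention): force the discount on or off for everyone.
p_do_disc = sum(simulate(True)[1] for _ in range(100_000)) / 100_000
p_do_no = sum(simulate(False)[1] for _ in range(100_000)) / 100_000

print(f"observed lift:       {p_buy_given_disc - p_buy_given_no:.2f}")
print(f"interventional lift: {p_do_disc - p_do_no:.2f}")
```

A system that only sees logged data can never run the second half of this script; it is stuck on rung one.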
Human Learning
- Builds causal mental models from birth
- Learns through intervention & experimentation
- Reasons counterfactually ("what if?")
- Generalises from very few examples
- Forward-looking: generates novel hypotheses
- Understands mechanism, not just correlation
LLM Training
- Learns statistical patterns from text
- Passive observation of training data
- No native counterfactual capability
- Requires billions of examples
- Backward-looking: mirrors existing data
- Captures correlation, not causation
A 2024 paper in Nature Reviews Psychology by Goddu & Gopnik describes how human causal understanding is uniquely "depersonalised" and "decontextualised": we form abstract, general rules that transfer across contexts. A child who learns that fire is hot doesn't need to touch every fire. An LLM trained on fire-related text has no such understanding; it has statistics about which words appear near "fire."
03. THE EVIDENCE: Where Prediction Breaks Down
The cracks in next-token prediction (NTP) aren't theoretical; they've been empirically demonstrated.
Error compounding: Each predicted token becomes input for the next. Small inaccuracies at step 10 can cascade into complete nonsense by step 100. Humans don't work this way; we maintain a stable internal model that self-corrects.
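A back-of-the-envelope calculation shows how fast this compounds. Assuming, as a simplification, that each token is correct with independent probability p, the chance of an error-free n-token generation decays exponentially:

```python
# If each token is right with independent probability p, the chance that an
# n-token generation contains no error is p**n. Independence is a
# simplification; the exponential trend is the point.
p = 0.99  # 99% per-token accuracy

for n in (10, 100, 1000):
    print(f"P(all {n} tokens correct) = {p ** n:.4f}")
```

Even at 99% per-token accuracy, a 100-token output is error-free barely a third of the time, and a 1000-token output almost never.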
Teacher-forcing failure: During training, the model always sees the correct preceding tokens (from training data). At inference, it sees its own outputs, which may be wrong. Bachmann & Nagarajan (2024) showed that in planning tasks, this mismatch means the model never actually learns to plan; it learns to copy from the correct answer.
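The mismatch can be sketched as two loops whose only difference is where the context comes from. `DummyModel` here is a hypothetical placeholder (it just parrots the last token); the point is the provenance of the inputs, not the predictor itself.

```python
class DummyModel:
    """Hypothetical stand-in predictor: it simply repeats the last token."""
    def neg_log_prob(self, token, context):
        return 0.0 if token == context[-1] else 1.0
    def sample(self, context):
        return context[-1]

def train_step(model, target_tokens):
    """Teacher forcing: the model ALWAYS conditions on ground-truth prefixes."""
    loss = 0.0
    for t in range(1, len(target_tokens)):
        context = target_tokens[:t]           # correct tokens from the dataset
        loss += model.neg_log_prob(target_tokens[t], context)
    return loss

def generate(model, prompt, n):
    """Inference: the model conditions on its OWN, possibly wrong, outputs."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model.sample(tokens))   # self-generated context feeds back
    return tokens

model = DummyModel()
print(train_step(model, ["go", "left", "then", "right"]))
print(generate(model, ["go", "left"], 3))
```

During training the model is never punished for the situations its own mistakes create, because it never sees them; at inference, those situations are all it sees.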
No causal understanding: Felin & Holweg argue in Strategy Science that LLMs have no forward-looking mechanism or causal logic. They can't generate genuinely new knowledge; they can only recombine what exists in their training data.
Consider the data gap: a human child is exposed to roughly 150,000 times less linguistic data than GPT-4 was trained on, yet develops the ability to reason counterfactually, invent novel tools, and understand the feelings of others. The difference isn't quantity. It's the nature of the learning process itself.
04. THE PROOF: Three Domains Where the Cracks Are Loudest
If prediction-as-foundation were just a theoretical concern, we could ignore it. But there are three domains where the gap between "predicting the next token" and "actually understanding" is painfully visible: mathematics, video generation, and audio/music.
Mathematics: Where Pattern Matching Meets Its Limit
LLMs should be good at math; they've seen trillions of equations. But arithmetic is rule-based, not statistical. Predicting the most probable next digit is fundamentally different from executing a carry operation.
The root cause is architectural. Transformer-based models optimise for the most likely next token, not for rule-based procedures. A model can learn that "2 + 2 =" is usually followed by "4" as a memorised pattern. But ask it to multiply 48,793 × 7,604 and it produces answers that are off by hundreds because carry propagation across digits isn't consistently learned. It's predicting what a correct answer looks like, not computing one.
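Contrast that with the procedure itself. Grade-school long multiplication is a handful of fixed rules with explicit carry propagation, and those rules generalise to any operands with zero training data:

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school multiplication with explicit, rule-based carry handling."""
    a_digits = [int(d) for d in str(a)][::-1]   # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    result = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10          # the digit stays in place
            carry = total // 10                 # the carry propagates left
        result[i + len(b_digits)] += carry
    return int("".join(map(str, result[::-1])))

print(long_multiply(48793, 7604))  # matches 48793 * 7604 exactly
```

The algorithm is exact because it executes a mechanism rather than imitating the surface form of worked examples.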
Researchers found that even when LLMs get the right final answer on olympiad problems, the underlying proofs often contain flawed logic, unjustified assumptions, and pattern-matching shortcuts rather than genuine mathematical reasoning.
Video Generation: Prediction Can't Simulate Physics
Video generation models like Sora, Runway, and Kling generate frames by predicting what the next frame should look like given the previous ones. The results are stunning until you look closely.
OpenAI's own technical report on Sora admits: the model doesn't accurately model the physics of basic interactions. A bitten cookie doesn't show a bite mark. A chair doesn't behave as a rigid object. Objects spontaneously appear and disappear. Hands grasp incorrectly. Liquids behave impossibly.
This isn't a training data problem; it's a paradigm problem. A prediction model learns what video frames typically look like after other frames. It has no internal model of gravity, rigidity, conservation of mass, or cause and effect. A human toddler understands that a ball drops when released. The most expensive video model in history doesn't.
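For comparison, an explicit causal model of a falling ball is tiny. In the sketch below (units and timestep chosen arbitrarily for illustration), gravity is a mechanism in the state update, so the output cannot violate it, no matter what any training data looked like:

```python
G = 9.81   # gravitational acceleration, m/s^2
DT = 0.1   # timestep, seconds

def drop(height, steps):
    """Simulate a released ball: gravity is an explicit mechanism, not a pattern."""
    velocity = 0.0
    trajectory = [height]
    for _ in range(steps):
        velocity -= G * DT                         # cause: gravity acts on velocity
        height = max(0.0, height + velocity * DT)  # effect: velocity moves the ball
        trajectory.append(round(height, 2))
    return trajectory

print(drop(10.0, 15))  # monotonically falling, clamped at the ground
```

A frame predictor has no variables like `velocity` or `G` to be consistent with; it can only reproduce what falling typically looks like.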
Audio & Music: Coherent Sounds, Incoherent Structures
AI music generators (Suno, Udio, Stable Audio) can produce impressive 30-second clips. But ask them for a 3-minute song with verse-chorus-bridge structure, thematic development, and an emotional arc, and the prediction paradigm falls apart.
The problem is the same: predicting what audio sample comes next is not the same as understanding musical structure. A human composer knows that tension built in a verse must resolve in a chorus, understands that a key change signals an emotional shift, and plans a narrative arc across minutes of music. AI music models produce locally coherent sounds (any given 100-millisecond window sounds right), but the global structure drifts, loops awkwardly, or loses thematic identity.
Research in Scientific Reports (2025) confirms that long-term structural coherence and emotional nuance remain the two hardest challenges in AI music; both are symptoms of the prediction-vs-understanding gap.
In all three domains (math, video, and audio), the failure mode is identical: local coherence without global understanding. The next token, frame, or sample is plausible. But the system has no model of why things happen, no plan for where they're going, and no ability to self-correct against an internal representation of how the world actually works.
This is exactly what you'd expect from a system optimised for prediction rather than reasoning.
05. THE PATCHES: What We're Trying (And Why It's Not Enough)
The AI research community knows about these problems, and several alternatives to vanilla NTP are being explored: multi-token prediction, pretraining on compressed "future summaries" of longer continuations, and selective losses that weight the tokens that matter most (see Mahajan et al., Lin et al., and Wyatt et al. in the references).
All of these approaches share a common trait: they're improvements within the prediction paradigm. They make prediction better, faster, or more long-range. But they don't change the fundamental game.
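One widely explored patch, multi-token prediction (echoed in the references below), can be sketched to show what "improvement within the paradigm" means. The numbers here are random stand-ins for real model outputs; only the loss structure is the point: instead of scoring just the next token, the model is also penalised on tokens further ahead.

```python
import math
import random

random.seed(0)
VOCAB, HORIZON = 50, 4   # vocabulary size, lookahead depth (toy values)

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# One training position: HORIZON logit vectors (for offsets +1 ... +HORIZON)
# standing in for hypothetical model heads, plus the future ground-truth tokens.
logits_per_offset = [[random.gauss(0, 1) for _ in range(VOCAB)]
                     for _ in range(HORIZON)]
future_targets = [random.randrange(VOCAB) for _ in range(HORIZON)]

# The multi-token objective averages the per-offset losses.
loss = sum(cross_entropy(logits_per_offset[k], future_targets[k])
           for k in range(HORIZON)) / HORIZON
print(f"averaged multi-token loss: {loss:.3f}")
```

Note what hasn't changed: the training signal is still "guess upcoming tokens", just over a longer horizon. The objective is broader prediction, not causal understanding.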
"Human cognition is forward-looking, driven by theory-based causal logic. AI is backward-looking and imitative. This is not a critique; it's a description of structural limits." (Felin & Holweg, Strategy Science, 2024)
06. THE ARGUMENT: We Need a New Foundation
Here's my core argument: we won't reach human-level AI by making better predictors.
Prediction is a powerful capability. It gives us autocomplete, translation, summarization, and impressive conversational ability. But it's one tool in the cognitive toolbox, not the foundation.
Humans are built on a stack of capabilities that prediction alone cannot replicate:
Embodied experience: We learn physics by interacting with the physical world, not by reading about it.
Causal models: We build mental models of how things work, enabling intervention and prediction of novel situations.
Counterfactual reasoning: We imagine alternative scenarios to learn from experiences we never had.
Theory formation: We generate hypotheses that go beyond available data, and then test them.
Social cognition: We model other minds, predict intentions, and understand perspectives.
Current AI has none of these natively. What it has is a spectacular ability to pattern-match over the outputs of human cognition (text), which creates a convincing illusion of the underlying processes.
The path forward isn't just "better NTP." It likely involves:
- Training on causal structures, not just correlations
- Embodied learning environments where agents discover physics
- Architectures that separate "knowing what comes next" from "knowing why"
- Systems that can form and test hypotheses, not just complete patterns
We built AI that can mimic the surface of human thought. Now we need to build AI that can replicate its foundations.
Prediction got us remarkably far. But to get where we actually want to go, to machines that truly reason, plan, and understand, we need to stop optimising for the next token and start asking: what's the right training signal for intelligence itself?
References

- Pearl, J. & Mackenzie, D. The Book of Why (2018)
- Bachmann & Nagarajan, "The Pitfalls of Next-Token Prediction", arXiv:2403.06963 (2024)
- Felin & Holweg, "Theory Is All You Need: AI, Human Cognition, and Causal Reasoning", Strategy Science (2024)
- Goddu & Gopnik, "The Development of Human Causal Learning and Reasoning", Nature Reviews Psychology (2024)
- Mahajan et al., "Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries", OpenReview (2025)
- Lin et al., "RHO-1: Not All Tokens Are What You Need", arXiv (2025)
- Wyatt et al., "Alternatives to NTP: A Comprehensive Taxonomy" (2025)