GPT-5: Have We Finally Hit The AI Scaling Wall?

Sabine Hossenfelder
21 Aug 2025, 07:22

TLDR: The video explores the recent debate around the potential limits of AI scaling, particularly in large language models like GPT-5. New research suggests that while scaling models further is computationally expensive, it doesn't necessarily lead to significant improvements in reasoning or error reduction. The video highlights the challenge of achieving artificial general intelligence (AGI) with current approaches, arguing that true intelligence will require models to interact with the world in meaningful ways. It concludes by suggesting that AGI might emerge from different techniques, such as world models, rather than just scaling up language models.

Takeaways

  • 😀 GPT-5's release was underwhelming, reigniting debates on whether AI scaling has hit a wall.
  • 🧠 New research suggests that scaling laws for large language models (LLMs) might not account for the long computational tail needed to reduce errors.
  • 💻 To achieve just one order of magnitude reduction in error rates, the required computing power might increase by up to 10^20 times.
  • 🔍 Some experts argue that improving model reliability for scientific standards is practically intractable due to the immense computational cost.
  • 📉 The scaling issue doesn't represent a true 'wall,' but rather a practical limitation due to the enormous compute requirements for further optimization.
  • 🔗 Another study showed that chains of thought in LLMs fail to generalize to out-of-distribution tasks, revealing the brittleness of their reasoning abilities.
  • ⚠️ LLMs do not truly reason but simulate reasoning patterns learned from their training data, which makes them more sophisticated mimics than true reasoners.
  • 💡 Companies expecting LLMs to suddenly develop AGI might face increasing disappointment as these models are unlikely to evolve into true logical thinkers.
  • 💼 Alina, a company offering paid work to help train AI systems, seeks people with a range of expertise to teach AI more practical problem-solving skills.
  • 📚 The debate around AGI is split, with some believing LLMs are the key, while others, like the author, think AGI will come from different paths.
  • 🧑‍🔬 The author believes that intelligence must begin from physics, and that AGI will require models capable of interacting with and learning from real or virtual environments, rather than just processing language.

Q & A

  • What is the primary reason GPT-5 was considered underwhelming?

    -Many felt that GPT-5 fell short of the lofty expectations surrounding it, which sparked discussions about whether AI scaling has reached a wall.

  • What do the authors of the first paper argue about the scaling laws for large language models?

    -The authors argue that the existing scaling laws for large language models fail to account for the long computational tail involved in reducing error rates, which makes it impractical to achieve significant improvements in model reliability.

  • How much more computing power would be required to reduce error rates by one order of magnitude in large language models?

    -The authors estimate that reducing error rates by just one order of magnitude would require approximately 10^20 times more computing power, which is not feasible with current resources.

  • What is the main issue with current reasoning capabilities in large language models?

    -The main issue is that large language models fail to generalize well outside their training data, and their stated reasoning often does not align with their final answers, making them unreliable and largely a simulation of reasoning rather than true understanding.

  • What does the second paper suggest about chain-of-thought reasoning in large language models?

    -The second paper suggests that chain-of-thought reasoning in large language models does not improve their ability to generalize out of distribution and is essentially a pretense of reasoning rather than a demonstration of true logical reasoning.

  • Why is the scaling of large language models considered problematic for achieving AGI?

    -The scaling of large language models is problematic because, despite increasing size and data, they fail to develop the deeper understanding required for AGI. The models' limitations in reasoning and logic suggest that the path to AGI lies elsewhere.

  • What is the role of human input in the development of next-generation AI systems, according to the video?

    -Human input is crucial in training the next generation of AI systems by providing expertise, judgment, and problem-solving skills that go beyond what AI has already learned from text, images, and videos available on the internet.

  • How can people get involved in training AI systems, and what are the potential benefits?

    -People can get involved in training AI systems by participating in projects related to science, math, coding, business, law, and more. They can earn up to $150 an hour, depending on the difficulty and field, and work remotely on flexible, paid projects.

  • What is the debate surrounding artificial general intelligence (AGI) in the video?

    -The debate centers around whether AGI will be achieved through large language models or if a different approach is required. The speaker argues that while large language models won't lead to AGI, the path to AGI will come through other means, particularly through models that can interact with the world.

  • What is the significance of DeepMind's Genie 3 release in the context of AGI development?

    -DeepMind's Genie 3 release is significant because it represents a step forward in developing world models, which could provide the necessary foundation for achieving AGI by allowing models to interact with real or virtual environments, instead of just processing language.

Outlines

00:00

🤖 Scaling Laws and the Limits of AI Progress

In this paragraph, the author discusses the recent release of GPT-5 and the renewed debate about AI scaling. They highlight a paper that critiques the scaling laws used for large language models, suggesting that while these models have made significant progress, the computational cost to further reduce errors becomes astronomically high, making it impractical. The authors of the paper argue that achieving the reliability needed for scientific inquiry with large language models is infeasible. Despite claims that AI scaling is fine, the author suggests that humans are more sensitive to errors, leading to a perception of stagnation. The author concludes that there isn't a true 'wall' in AI scaling, but the practical limitations are so vast that it appears like one.

05:02

🔍 Limits of Reasoning and the Fragility of AI Logic

This paragraph introduces another paper that investigates the reasoning capabilities of small language models using reasoning chains. The study finds that the reasoning these models perform often fails to generalize beyond the data they were trained on, which undermines the argument that AI can develop true logical reasoning. The paper suggests that large language models act as simulators of reasoning, mimicking patterns learned during training rather than demonstrating true understanding. This finding implies that the path to artificial general intelligence (AGI) through scaling large language models is unlikely to succeed, and companies hoping that these models will reach AGI might be in for a disappointing realization.

💡 Training AI: A New Opportunity for Human Experts

Here, the author introduces a new opportunity for individuals to train AI systems. The company Alina is hiring people to help teach AI systems real-world expertise, judgment, and problem-solving skills that go beyond what can be learned from internet-based text and data. The author praises the concept, suggesting that it's a smart move to involve human expertise in improving AI. Jobs are flexible, remote, and well-paid, with rates up to $150 per hour, depending on the task's difficulty. The paragraph encourages viewers to participate in this initiative and links to further details.

💭 The Future of AGI: A Balanced Perspective

In this final paragraph, the author addresses the polarized opinions in the comments about AI and AGI. They argue that both sides have valid points and express their own view that large language models alone will not lead to AGI. The author contrasts their perspective with that of Gary Marcus, who relies on cognitive science, while the author brings a physics background into the discussion. The author believes that true intelligence will emerge from models that can interact with and test the world around them, a step closer to AGI. The paragraph ends with excitement about world models and DeepMind's Genie 3 release, which represents a significant advancement towards AGI.

Keywords

AI Scaling

AI scaling refers to the process of increasing the size and complexity of artificial intelligence systems, such as large language models, in order to improve their performance. In the video, scaling is discussed in terms of how increasing computational resources and model sizes might lead to better results, but also introduces diminishing returns due to the high computational burden required to reduce errors, as discussed in the first paper.

GPT-5

GPT-5 is the next generation of the GPT (Generative Pre-trained Transformer) series, a family of language models developed by OpenAI. The video discusses how the release of GPT-5 was underwhelming, as it failed to meet the high expectations people had, reigniting the debate about whether AI has hit a scaling wall. GPT-5 is used to illustrate the challenges faced in AI progress despite increased scaling.

Scaling Laws

Scaling laws refer to mathematical models that describe how performance improves as a model or system is scaled up (e.g., in terms of size, data, or computation). The video refers to a paper that challenges the validity of scaling laws for large language models, particularly when it comes to reducing error rates, suggesting that these models face severe computational challenges when trying to improve their reliability.

Computational Tail

A computational tail refers to the increasing amount of computational resources required to achieve marginal improvements in performance. The video highlights that large language models require exponentially more computing power to reduce errors by just a small amount, especially when trying to meet high standards like those of scientific inquiry. This phenomenon is a key part of the argument that scaling AI systems may have reached a practical limit.
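The implied trade-off can be made concrete with a back-of-the-envelope sketch. Assuming, purely as an illustration and not as the paper's actual fit, a power-law relation error ∝ compute^(−α), the video's figure of roughly 10^20 times more compute for a tenfold error reduction corresponds to an exponent of α = 0.05:

```python
def compute_multiplier(error_reduction_factor: float, alpha: float) -> float:
    """Compute factor needed to cut error by a given factor,
    assuming error scales as compute**(-alpha)."""
    return error_reduction_factor ** (1.0 / alpha)

# An exponent of 0.05 reproduces the video's headline number:
# a 10x error reduction demands 10**(1/0.05) = 10**20 times more compute.
alpha = 0.05
print(f"{compute_multiplier(10, alpha):.1e}")  # ~1.0e+20
```

Even a far more forgiving exponent like α = 0.2 would still demand a 10^5-fold compute increase per order of magnitude of error, which is why the paper frames the limit as practical rather than fundamental.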

Chain of Thought Reasoning

Chain of thought reasoning is the process by which a model like GPT generates a series of logical steps to arrive at a conclusion. However, the video points out that large language models often fail at this, as they might simulate reasoning steps but not actually arrive at correct conclusions. The reasoning in these models is described as a 'mirage,' highlighting that the models lack true understanding and instead rely on patterns learned from training data.
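The brittleness described above can be illustrated with a deliberately crude toy, an analogy for this summary rather than the paper's experimental setup: a pattern-matcher that answers questions only when they match the exact phrasing it was "trained" on, with nothing to fall back on otherwise.

```python
import re

def pattern_solver(question: str):
    """Answers addition questions only in the one phrasing it has 'seen'."""
    m = re.fullmatch(r"What is (\d+) \+ (\d+)\?", question)
    if m:
        a, b = map(int, m.groups())
        return a + b
    return None  # out-of-distribution phrasing: no understanding to fall back on

print(pattern_solver("What is 2 + 3?"))               # 5 (in-distribution)
print(pattern_solver("What is the sum of 2 and 3?"))  # None (out-of-distribution)
```

The analogy is loose, but it captures the paper's point: success on familiar surface forms says little about whether the underlying reasoning generalizes.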

AGI (Artificial General Intelligence)

AGI refers to a type of AI that possesses the ability to understand, learn, and apply intelligence across a wide range of tasks, similar to human cognition. The video suggests that large language models will not lead to AGI, because these models cannot achieve true understanding of logic or reasoning. The speaker proposes that AGI will likely emerge through different methods, such as through interaction with real or virtual environments, rather than through scaling language models.

Error Rate

Error rate in AI models refers to the frequency with which the model produces incorrect or inaccurate results. The video discusses how reducing error rates in large language models requires a disproportionate increase in computational resources, which makes improving the models increasingly difficult. This high computational cost is one reason why the progress of AI seems slow despite scaling up models.

DeepMind's Genie 3

DeepMind's Genie 3 is a world model, a type of AI system that can simulate interactions with environments to learn and make decisions. The video mentions Genie 3 as an important step towards achieving AGI, suggesting that world models like this may be a more promising path to true intelligence than scaling language models like GPT-5.

Simulation of Reasoning

Simulation of reasoning refers to AI systems that mimic logical processes without necessarily understanding or following the underlying principles of reasoning. The video describes how large language models simulate reasoning through chains of thought, but these models often fail to apply correct reasoning to arrive at accurate conclusions, illustrating their lack of true cognitive ability.

Leopold Aschenbrenner

Leopold Aschenbrenner is cited in the video as a researcher who predicted an intelligence explosion by 2027 based on scaling laws for large language models. However, the paper discussed in the video challenges this prediction, suggesting that these scaling laws are not as reliable as previously thought and that true advancements in AI will require different approaches.

Highlights

GPT-5’s launch was seen as underwhelming, reigniting debate about whether AI scaling has reached its limits.

A new paper reexamines scaling laws and suggests eliminating errors has an extremely long computational tail.

Reducing model error by even one order of magnitude may require ~10^20 more compute, making progress appear stalled.

Researchers argue that making LLMs scientifically reliable is computationally intractable using current scaling trends.

Humans naturally notice and amplify the remaining rare errors, creating a perception of limited progress.

According to the paper, there isn’t a hard wall—just compute demands so large they feel like a wall in practice.

A second study tested small language models using chain-of-thought on logic puzzles requiring out-of-distribution reasoning.

It found that chain-of-thought does not generalize well beyond training data patterns.

Reasoning steps often do not correlate with final correctness—models can simulate reasoning without true understanding.

The results suggest LLMs are sophisticated text simulators rather than principled reasoners.

Companies relying solely on scaling LLMs toward AGI may begin to panic as limitations become clearer.

The belief that emergent logical reasoning will simply appear with scale is increasingly challenged.

Human feedback remains necessary to correct reasoning errors and provide real-world expertise.

The speaker argues AGI won’t come from LLMs alone because language poorly describes the physical world.

True intelligence requires models that interact with and test themselves in real or simulated environments.

World models—like DeepMind’s Genie 3—represent a promising path toward AGI by enabling experiential learning.

If AGI emerges, it may question why its training data includes flawed online sources like Reddit.