What If A.I. Doesn’t Get Much Better Than This?
For this week’s Open Questions column, Cal Newport is filling in for Joshua Rothman.
Much of the euphoria and dread swirling around today’s artificial-intelligence technologies can be traced back to January, 2020, when a team of researchers at OpenAI published a thirty-page report titled “Scaling Laws for Neural Language Models.” The team was led by the A.I. researcher Jared Kaplan, and included Dario Amodei, who is now the C.E.O. of Anthropic. They investigated a fairly nerdy question: What happens to the performance of language models when you increase their size and the intensity of their training?
Back then, many machine-learning experts thought that, after they had reached a certain size, language models would effectively start memorizing the answers to their training questions, which would make them less useful once deployed. But the OpenAI paper argued that these models would only get better as they grew, and indeed that such improvements might follow a power law—an aggressive curve that resembles a hockey stick. The implication: if you keep building larger language models, and you train them on larger data sets, they’ll start to get shockingly good. A few months after the paper, OpenAI seemed to validate the scaling law by releasing GPT-3, which was more than a hundred times larger—and leaps and bounds better—than its predecessor, GPT-2.
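In rough terms, the paper’s central claim can be written as an equation. A simplified version (omitting the separate terms the paper gives for data-set size and computing power) looks like this:

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} \]

Here L is the model’s “loss,” a measure of how badly it predicts text; N is the number of parameters it contains; and N_c and α_N are constants fit to experimental data. Read literally, the formula says that every time you multiply a model’s size by some factor, its errors shrink by a predictable fraction, and in this simplified form there is no ceiling in sight.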
Suddenly, the theoretical idea of artificial general intelligence, a system that performs as well as or better than humans on a wide variety of tasks, seemed tantalizingly close. If the scaling law held, A.I. companies might achieve A.G.I. by pouring more money and computing power into language models. Within a year, Sam Altman, the chief executive of OpenAI, published a blog post titled “Moore’s Law for Everything,” which argued that A.I. would take over “more and more of the work that people now do” and create unimaginable wealth for the owners of capital. “This technological revolution is unstoppable,” he wrote. “The world will change so rapidly and drastically that an equally drastic change in policy will be needed to distribute this wealth and enable more people to pursue the life they want.”
It’s hard to overstate how completely the A.I. community came to believe that it would inevitably scale its way to A.G.I. In 2022, Gary Marcus, an A.I. entrepreneur and an emeritus professor of psychology and neural science at N.Y.U., pushed back on Kaplan’s paper, noting that “the so-called scaling laws aren’t universal laws like gravity but rather mere observations that might not hold forever.” The negative response was fierce and swift. “No other essay I have ever written has been ridiculed by as many people, or as many famous people, from Sam Altman and Greg Brockman to Yann LeCun and Elon Musk,” Marcus later reflected. He recently told me that his remarks essentially “excommunicated” him from the world of machine learning. Soon, ChatGPT would reach a hundred million users faster than any digital service in history; in March, 2023, OpenAI’s next release, GPT-4, vaulted so far up the scaling curve that it inspired a Microsoft research paper titled “Sparks of Artificial General Intelligence.” Over the following year, venture-capital spending on A.I. jumped by eighty per cent.
After that, however, progress seemed to slow. OpenAI did not unveil a new blockbuster model for more than two years, instead focussing on specialized releases that became hard for the general public to follow. Some voices within the industry began to wonder if the A.I. scaling law was starting to falter. “The 2010s were the age of scaling, now we’re back in the age of wonder and discovery once again,” Ilya Sutskever, one of the company’s founders, told Reuters in November. “Everyone is looking for the next thing.” A contemporaneous TechCrunch article summarized the general mood: “Everyone now seems to be admitting you can’t just use more compute and more data while pretraining large language models and expect them to turn into some sort of all-knowing digital god.” But such observations were largely drowned out by the headline-generating rhetoric of other A.I. leaders. “A.I. is starting to get better than humans at almost all intellectual tasks,” Amodei recently told Anderson Cooper. In an interview with Axios, he predicted that half of entry-level white-collar jobs might be “wiped out” in the next one to five years. This summer, both Altman and Mark Zuckerberg, of Meta, claimed that their companies were close to developing superintelligence.
Then, last week, OpenAI finally released GPT-5, which many had hoped would usher in the next significant leap in A.I. capabilities. Early reviewers found some features to like. When a popular tech YouTuber, Mrwhosetheboss, asked it to create a chess game that used Pokémon as pieces, he got a significantly better result than when he used o4-mini-high, an industry-leading coding model; he also discovered that GPT-5 could write a more effective script for his YouTube channel than GPT-4o. Mrwhosetheboss was particularly enthusiastic that GPT-5 automatically routes queries to a model suited to the task, instead of requiring users to manually pick the model they want to try. Yet he also learned that GPT-4o was clearly more successful at generating a YouTube thumbnail and a birthday-party invitation—and he had no trouble inducing GPT-5 to make up fake facts. Within hours, users began expressing disappointment with the new model on the r/ChatGPT subreddit. One post called it the “biggest piece of garbage even as a paid user.” In an Ask Me Anything (A.M.A.) session, Altman and OpenAI engineers found themselves on the defensive, addressing complaints. Marcus summarized the release as “overdue, overhyped and underwhelming.”
In the aftermath of GPT-5’s launch, it has become more difficult to take bombastic predictions about A.I. at face value, and the views of critics like Marcus seem increasingly moderate. Such voices argue that this technology is important, but not poised to drastically transform our lives. They challenge us to consider a different vision for the near future—one in which A.I. might not get much better than this.
OpenAI didn’t want to wait nearly two and a half years to release GPT-5. According to The Information, by the spring of 2024, Altman was telling employees that their next major model, code-named Orion, would be significantly better than GPT-4. By the fall, however, it became clear that the results were disappointing. “While Orion’s performance ended up exceeding that of prior models,” The Information reported in November, “the increase in quality was far smaller compared with the jump between GPT-3 and GPT-4.”
Orion’s failure helped cement the creeping fear within the industry that the A.I. scaling law wasn’t a law after all. If building ever-bigger models was yielding diminishing returns, the tech companies would need a new strategy to strengthen their A.I. products. They soon settled on what could be described as “post-training improvements.” The leading large language models all go through a process called pre-training in which they essentially digest the entire internet to become smart. But it is also possible to refine models later, to help them better make use of the knowledge and abilities they have absorbed. One post-training technique is to apply a machine-learning tool, reinforcement learning, to teach a pre-trained model to behave better on specific types of tasks. Another enables a model to spend more computing time generating responses to demanding queries.
A useful metaphor here is a car. Pre-training can be said to produce the vehicle; post-training soups it up. In the scaling-law paper, Kaplan and his co-authors predicted that, as you expand the pre-training process, you increase the power of the cars you produce; if GPT-3 was a sedan, GPT-4 was a sports car. Once this progression faltered, however, the industry turned its attention to making the cars it had already built perform better. Post-training techniques turned engineers into mechanics.
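To make the mechanics concrete, here is a deliberately toy sketch in Python, not any lab’s actual method: a miniature language model that merely counts word-to-word transitions stands in for pre-training; a crude reward-driven loop stands in for reinforcement-style refinement; and a best-of-several sampling step stands in for spending more computing time on a hard query. The tiny corpus, the reward function, and every function name below are invented for illustration.

```python
# Toy illustration (not any lab's actual method): a bigram "language model"
# pre-trained on a tiny corpus, then improved with two post-training ideas:
# reward-driven refinement and extra computing time at inference.
import random
from collections import defaultdict

random.seed(0)

CORPUS = "the cat sat on the mat . the dog sat on the rug . the cat ate ."
REWARD_WORD = "cat"  # hypothetical preference: outputs that mention cats score higher


def pretrain(text):
    """Count word-to-word transitions: the 'digest the internet' step, in miniature."""
    counts = defaultdict(lambda: defaultdict(float))
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1.0
    return counts


def sample(model, start="the", length=6):
    """Generate a short continuation by following weighted transitions."""
    out, word = [start], start
    for _ in range(length):
        nxt = model.get(word)
        if not nxt:
            break
        choices, weights = zip(*nxt.items())
        word = random.choices(choices, weights=weights)[0]
        out.append(word)
    return " ".join(out)


def reward(text):
    """A stand-in reward signal; real systems use far richer feedback."""
    return text.split().count(REWARD_WORD)


def post_train(model, rounds=200):
    """Crude reinforcement-style refinement: up-weight transitions that appear
    in high-reward samples, leaving the pre-trained counts in place."""
    for _ in range(rounds):
        text = sample(model)
        r = reward(text)
        if r > 0:
            words = text.split()
            for a, b in zip(words, words[1:]):
                model[a][b] += 0.1 * r
    return model


def best_of_n(model, n=8):
    """'Thinking longer' at inference: spend more compute, keep the best answer."""
    return max((sample(model) for _ in range(n)), key=reward)


model = pretrain(CORPUS)
print("pre-trained sample: ", sample(model))
model = post_train(model)
print("post-trained sample:", sample(model))
print("best-of-8 sample:   ", best_of_n(model))
```

The division of labor is the point of the toy: the later steps don’t replace the pre-trained counts; they only nudge and re-rank what the model has already absorbed, which is roughly the bet the industry has made since pure scaling stalled.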
Tech leaders were quick to express a hope that a post-training approach would improve their products as quickly as traditional scaling had. “We are seeing the emergence of a new scaling law,” Satya Nadella, the C.E.O. of Microsoft, said at a conference last fall. The venture capitalist Anjney Midha similarly spoke of a “second era of scaling laws.” In December, OpenAI released o1, which used post-training techniques to make the model better at step-by-step reasoning and at writing computer code. Soon the company had unveiled o3-mini, o3-mini-high, o4-mini, o4-mini-high, and o3-pro, each of which was souped up with a bespoke combination of post-training techniques.
Other A.I. companies pursued a similar pivot. Anthropic experimented with post-training improvements in a February release of Claude 3.7 Sonnet, and then made them central to its Claude 4 family of models. Elon Musk’s xAI continued to chase a scaling strategy until its wintertime launch of Grok 3, which was pre-trained on an astonishing 100,000 H100 G.P.U. chips—many times the computational power that was reportedly used to train GPT-4. When Grok 3 failed to outperform its competitors significantly, the company embraced post-training approaches to develop Grok 4. GPT-5 fits neatly into this trajectory. It’s less a brand-new model than an attempt to refine recent post-trained products and integrate them into a single package.