Gwern Branwen on OpenAI’s bet in the scaling hypothesis
As far as I can tell, this is what is going on: they do not have any such thing, because Google Brain (GB) and DeepMind (DM) do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OpenAI (OA) do.
GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc’s group (1, 2) occasionally surprises you. They’ll dabble in something like GShard, but mostly because they expect to be likely to be able to deploy it or something like it to production in Google Translate.
DM (particularly Hassabis, I’m not sure about Legg’s current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then DMLab is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it’ll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM ‘bet the company’ on any moonshot; Google’s cashflow isn’t going anywhere, and slow and steady wins the race.
OA, lacking anything like DM’s long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: “the scaling hypothesis is true” and so simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. And if OA is wrong to trust in the God of Straight Lines On Graphs, well, they never could compete with DM directly using DM’s favored approach, and were always going to be an also-ran footnote.
While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary amounts of compute budgets are still trivial in terms of Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the very most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it’s too late. This might seem absurd, but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3… (When faced with the choice between having to admit all their fancy hard work is a dead-end, swallow the bitter lesson, and start budgeting tens of millions of compute, or instead writing a tweet explaining how, “actually, GPT-3 shows that scaling is a dead end and it’s just imitation intelligence” — most people will get busy on the tweet!)
The original article at LessWrong has some data point estimates worth checking out as well.