
TAI #117: Do OpenAI’s o1 Models Unlock a Full “Moore’s Law” Feedback Loop for LLM Inference Tokens?

Author(s): Towards AI Editorial Team
Originally published on Towards AI.

What happened this week in AI by Louie

OpenAI’s new o1 series of “reasoning” models took clear center stage this week. These models use an advanced form of search and reasoning during inference: the system performs multiple steps of thought before arriving at an answer, drawing on reinforcement learning (RL) to refine its reasoning process. These models have been highly anticipated since press leaks of OpenAI’s “Q*” breakthrough over a year ago. Reception has ranged from claims of unlocking AGI to dismissive claims that OpenAI is just applying “Chain of Thought” prompting. We think this model is a huge breakthrough for some tasks, but it is not a plug-and-play upgrade to existing models; you cannot simply reuse an existing LLM pipeline and prompt and expect better results.

While technical details of the model were scarce, it is clearly more than just prompting: OpenAI made a significant investment in a new “large-scale reinforcement learning algorithm” that “teaches the model how to think productively using its chain of thought.” Perhaps the model started as GPT-4o, but we think the post-training compute investment has likely led to substantially different model weights. Some architectural adaptations were also likely needed to achieve this reasoning search process, and there was likely substantial investment in compiling new post-training data, for which we expect experienced scientists and coders were asked to write out the full details of their internal reasoning for solving challenging problems.

The final model is capable of performing some tasks completely out of reach of existing LLMs, albeit at a substantially higher cost. The performance jumps on some benchmarks (science-, math-, code-, and reasoning-focused ones) are substantial; for example, on PhD-Level Science Questions (GPQA diamond), GPT-4o achieved 53.6%, o1-mini 60.0%, o1-preview 73.3%, and the still-unreleased o1 77.3%. The downside is, of course, cost and latency: the models often spend 10–60 seconds “thinking” using hidden reasoning tokens, which are also billed. While the per-token price of o1-preview is 6x higher than GPT-4o, factoring in these thinking tokens means the price can often reach as much as 30x higher per visible output token (a small worked example follows below). o1-mini offers 5x lower pricing than o1-preview and is even more tailored to math and coding problems, where it can actually get better results.

The biggest highlight from OpenAI’s o1 report for us was its disclosure of non-plateauing scaling laws for capability relative to “test-time” (or “inference-time”) compute. While this still scales logarithmically (and hence gets expensive), the fact that you can simply spend more money on inference to achieve greater performance, instead of needing to train a more capable model, is very significant. It speaks to the success of OpenAI’s RL search approach that the reasoning steps do not simply get lost or stuck after heading in the wrong direction but can keep progressing toward the correct answer with more inference compute. While much refinement is still needed here, it opens the possibility of simply leaving o1 models to work for a day or a week on the hardest problems. Of course, this is also all very convenient for OpenAI’s business model!
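The pricing arithmetic above can be made concrete with a minimal sketch. The 6x per-token multiple is the figure quoted in this newsletter; the split of roughly 4,000 hidden reasoning tokens to 1,000 visible output tokens is purely an illustrative assumption chosen to reproduce the ~30x figure, and actual token counts vary widely per request.

```python
def effective_output_multiplier(per_token_multiple: float,
                                reasoning_tokens: int,
                                visible_tokens: int) -> float:
    """Cost multiple per *visible* output token, once hidden reasoning tokens
    (which are billed as output tokens too) are factored in."""
    billed_tokens = reasoning_tokens + visible_tokens
    return per_token_multiple * billed_tokens / visible_tokens


# Illustrative assumption: ~6x per-token price vs GPT-4o, and a request that
# "thinks" with ~4,000 hidden reasoning tokens before emitting ~1,000 visible tokens.
print(effective_output_multiplier(6.0, reasoning_tokens=4_000, visible_tokens=1_000))
# -> 30.0, i.e. roughly the "up to 30x higher per output token" figure above.
```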
Why should you care?

Besides the out-of-the-box capability unlock on some tasks (we have found it particularly valuable for brainstorming so far, and agent pipelines have also become much easier), we think the real story here is the beginning of a new paradigm: integrating RL-based “reasoning step” search with LLMs and scaling inference-time compute to reach greater capability.

Many people argue that LLMs alone will never be able to truly reason and generalize; they only memorize statistical features of their training data distribution. This may or may not be true, but we think a key reason LLMs perform poorly on reasoning-like tasks is that there is very little reasoning data on the internet. Humans skip to the key points when writing up their ideas and don’t write down their full inner monologue with every thinking step, so LLMs learned they were supposed to simply guess at these leaps from token to token. To some extent, I think we have so far actively been training LLMs NOT to reason: they were punished during training for attempting the necessary intermediate calculations and thinking steps rather than skipping straight to mimicking the next word as it appears in their internet training data. For this reason, I think we will find a lot of easy wins with models developed in the direction of o1.

For some time, we have been highlighting the rapid price reduction in LLM inference tokens. For example, cached input tokens with DeepSeek V2 are priced 4,000x lower than Da-Vinci 002 (GPT-3) token costs were two years before. At the peak of Moore’s law, the dollar cost per transistor fell by roughly 4,000x over the first 14 years, up to 1982 (a small worked comparison of these two rates appears at the end of this section). We think “Moore’s Law” is an increasingly apt analogy. Despite the huge reduction in LLM inference token prices, until now one key element of the feedback loop was still missing. While Moore’s law itself was just a prophecy, I think it actually consists of three very real components that led to sustained feedback loops and the progress we have seen:

1) Learning Rates/Wright’s Law: More cumulative production of a product leads to lower costs, due to A) R&D scaling with revenue, B) companies learning from cost of goods sold and staff experience driving process improvements, and C) economies of scale.

2) Volume unlocked by price: Lower costs lead to a wider set of applications becoming economically viable, which in turn leads to more cumulative production and lower costs.

3) Volume unlocked by capability: More transistors used together lead to higher capability, which in turn leads to more applications becoming possible, more production, lower costs, and so on.

Until now, cumulative growth in generated LLM inference tokens has been leading to rapid breakthroughs in reducing cost and unlocking […]
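As referenced above, here is a minimal sketch turning the two headline reductions into implied annual rates of cost decline. The 4,000x-over-roughly-two-years figure for LLM tokens and the 4,000x-over-14-years figure for transistors are the numbers quoted in this section, treated as rough approximations rather than precise measurements.

```python
def annual_cost_decline_factor(total_reduction: float, years: float) -> float:
    """Constant yearly factor by which cost falls, implied by a total
    reduction spread geometrically over a number of years."""
    return total_reduction ** (1.0 / years)


# Figures quoted above, treated as rough approximations.
llm_tokens = annual_cost_decline_factor(4_000, years=2)     # ~63x cheaper per year
transistors = annual_cost_decline_factor(4_000, years=14)   # ~1.8x cheaper per year

print(f"LLM inference tokens: ~{llm_tokens:.0f}x cheaper per year")
print(f"Transistors (first 14 years of Moore's law): ~{transistors:.1f}x cheaper per year")
```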
