Author(s): Towards AI Editorial Team

Originally published on Towards AI.

What happened this week in AI by Louie

This week, our eyes were again on the rapid progress in LLM inference, in particular the possibility of significantly reducing the cost of reused input tokens with context caching. We may be laboring this point, but the progress in inference compute prices for LLMs is truly unprecedented. Reused input token inference with DeepSeek v2 is now 4,300x cheaper than GPT-3 (davinci-002) was just 24 months ago. At the same time, the MMLU benchmark score is up to 79% from 60%, and the maximum context window size is up ~60x. At the peak of Moore's law, the cost per transistor fell roughly 4,000x over the first 14 years, up to 1982, but transistors were not getting fundamentally more capable at the same time! At this stage, it is hard to imagine progress at this pace not soon having a global impact. The innovations in context caching this week tie into a great new paper investigating how LLM performance can benefit from repeated inference steps, or "inference-time scaling laws". Together, we think these provide a very powerful new avenue for unlocking economically useful LLM capabilities.

DeepMind followed Meta's week in the spotlight with a flurry of activity. Gemini released a new experimental 1.5 Pro model, which, for the first time, put DeepMind at the top of the LMSYS arena, suggesting it has finally caught up in the LLM capability race on some measures (though it still lags on the LiveBench and ZeroEval benchmarks). DeepMind also announced that the Flash model will see a 5x price reduction next week (taking it to half the cost of GPT-4o-mini), a move we think partly reflects progress in distillation but is also likely driven by competitive pressure from Llama 3.1. It also released an impressive (for its size) new small 2B Gemma model that benefits from model distillation (a technique we expect to join the LLM builder toolkit post Llama 3.1, as we discussed last week).

Less than 24 hours after the Gemini Flash price announcement, inference compute pricing was taken a whole level lower, with China-based DeepSeek announcing Context Caching on Disk via its API. This automatically reduces the cost of handling reused input tokens by 90%, down to $0.014 per million tokens, making it 10x cheaper than GPT-4o-mini. The caching mechanism works by storing input content that it expects to be reused in a distributed disk array. When the same input is detected again, it is retrieved from the cache, bypassing recomputation. This not only slashes API costs but also cuts latency (from 13 seconds to just 500 milliseconds for large 128k prompts). This cost reduction opens up new avenues for using LLMs in scenarios where repeated querying of the same input tokens is essential, such as multi-step analysis of a large dataset, repeated questioning of a full code base, and multi-turn conversations. Google's Gemini is the only other model family that offers context caching so far, but its price reduction is not nearly as large, and it is not applied automatically. However, the imminent 5x reduction in Gemini Flash pricing will help here, too!
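To make the caching mechanics concrete, here is a minimal sketch of the reused-prefix pattern against DeepSeek's OpenAI-compatible API. This is an illustration under stated assumptions, not official sample code: the base URL and the `deepseek-chat` model name follow DeepSeek's public docs, while the `docs_dump.txt` file and the cache-hit usage fields read in the comments are placeholders and assumptions to show how you might verify that the cache is being hit.

```python
# Minimal sketch: structure prompts around a large, stable prefix so DeepSeek's
# automatic disk cache can reuse it. Assumptions: the OpenAI-compatible endpoint
# at https://api.deepseek.com, the "deepseek-chat" model name, and usage fields
# named prompt_cache_hit_tokens / prompt_cache_miss_tokens for reporting hits.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# A large, stable prefix (e.g., full documentation or a code base dump) repeated
# verbatim on every request. Caching is keyed on identical leading tokens, so keep
# this prefix byte-for-byte the same and append the varying question at the end.
SHARED_PREFIX = open("docs_dump.txt").read()  # placeholder file

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": SHARED_PREFIX},  # reused tokens: ~90% cheaper on a cache hit
            {"role": "user", "content": question},         # only this part changes per call
        ],
    )
    usage = response.usage
    # Assumed usage field names; check the response your API version actually returns.
    hit = getattr(usage, "prompt_cache_hit_tokens", "n/a")
    miss = getattr(usage, "prompt_cache_miss_tokens", "n/a")
    print(f"cached input tokens: {hit}, uncached input tokens: {miss}")
    return response.choices[0].message.content

ask("Summarize the authentication flow described in these docs.")
ask("Which endpoints are rate limited, and what are the limits?")  # should now mostly hit the cache
```

Because the cache is applied automatically, the main design decision is prompt layout: anything that changes per request should come after the stable prefix, or the shared leading tokens will no longer match.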
In parallel to these developments on inference price, new research on inference-time scaling laws suggests that we can significantly improve LLM performance by increasing the number of inference steps. This approach, known as repeated sampling, allows weaker models to outperform stronger ones on certain tasks. For example, DeepSeek-Coder-V2-Instruct, given 250 attempts, achieves a 56% success rate on SWE-bench Lite, surpassing the 43% success rate of a single attempt from more capable models such as GPT-4o. The effectiveness of this method depends on two factors: coverage (the number of problems solved across all attempts) and precision (the ability to identify the correct solution among the many samples).
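The core loop behind these results is straightforward to prototype. Below is a minimal, model-agnostic sketch of repeated sampling with a verifier; `generate_patch` and `run_unit_tests` are hypothetical placeholders (any LLM call at a non-zero temperature, and the task's test suite), not the paper's actual implementation.

```python
# Minimal sketch of repeated sampling with a verifier; illustrative only.
# `generate_patch` and `run_unit_tests` are hypothetical placeholders: the first
# would call any chat model at a non-zero temperature, the second would execute
# the task's test suite (the SWE-bench-style check that a candidate patch works).
from typing import Callable, Optional

def repeated_sampling(
    problem: str,
    generate_patch: Callable[[str, float], str],  # model call: (problem, temperature) -> candidate
    run_unit_tests: Callable[[str], bool],        # verifier: does the candidate pass?
    k: int = 250,                                 # number of attempts; coverage grows with k
    temperature: float = 0.8,                     # sampling diversity between attempts
) -> Optional[str]:
    """Return the first candidate that passes verification, or None if all k attempts fail."""
    for attempt in range(1, k + 1):
        candidate = generate_patch(problem, temperature)
        if run_unit_tests(candidate):             # the precision step: pick the correct sample
            print(f"solved on attempt {attempt}")
            return candidate
    return None
```

Note that every attempt repeats the same problem description and repository context as its prefix, which is exactly the reused-input-token pattern that context caching makes cheap, which is why we see the two developments as synergistic.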
Why should you care?

These advancements are synergistic. The cost savings on reused input tokens make it economically feasible to employ repeated sampling extensively, even with complex and large datasets, and we expect this to make some agentic LLM systems far more feasible in terms of both cost and latency. Far cheaper reused input tokens also allow for affordable multi-shot learning: leveraging vast numbers of examples within input prompts as a potential alternative to fine-tuning. This can also sometimes be more efficient for inference, since the same model can be used with different input prompts and large batch sizes on the same GPU cluster, rather than running multiple fine-tuned models. It can also make long context windows with cached context a more viable alternative to RAG in some scenarios, for example, chatting with entire documentation held in cached context. The reduced latency from context caching can also improve the user experience for very long multi-turn chat applications or for repeatedly questioning a code base or dataset.

We think context caching is in its early days, and DeepSeek's API still struggles with access to compute (likely due to the US ban on AI chip exports to China); however, we expect leading AI labs to focus on replicating these capabilities. Before long, we think LLM builders can expect 5–10x price reductions for reused input context and should begin to consider what this can unlock for their use cases. What would you do differently if system prompts were 10x cheaper than other tokens and up to 20x faster to process?

— Louie Peters — Towards AI Co-founder and CEO

The evolution of LLM compression methods: from QuIP to AQLM with PV-Tuning

While large language models open up a world of possibilities, they can also be incredibly expensive and complex to deploy. That's why researchers are racing to find ways to compress these models without sacrificing performance. Learn about the evolution of extreme LLM compression methods in our latest article. Spoiler alert: a brief "rivalry" between two research teams led to scientific breakthroughs. Read the complete article on Extreme LLM Compression here!

Hottest News

1. Musical Chairs at the Leading AI Labs? OpenAI has continued a recent streak of setbacks on […]