
Serving AI models faster with speculative decoding

IBM’s Granite code model can output text twice as fast while serving four times as many users, a feat that both improves AI inferencing for users and lowers operating costs for enterprises.

Interacting with customer care chatbots may soon get a lot more seamless.

In recent years, large language models (LLMs) have given chatbots a better grasp of customers’ questions and improved their ability to find correct answers. But the high cost and slow speed of serving LLMs has been one of the main barriers to wider AI adoption.

Speculative decoding has emerged as a promising optimization technique for speeding up AI inferencing. It can help LLMs generate tokens faster, cutting latency by a factor of two to three and giving customers a much better experience.

But there’s a hitch. Reducing latency also typically cuts throughput, or the number of people who can use a model at once, increasing costs for companies hosting the model. In a recent breakthrough, IBM Research managed to cut the latency of serving its open-source Granite 20B code model in half while quadrupling its throughput.

The challenge for the IBM Research team that came up with the solution lay in figuring out how to get speculative decoding to play nice with a memory optimization technique called paged attention (more on that later). The results are part of IBM’s focus on improving the cost performance of LLM inferencing.

“These are significant improvements with direct benefits for enterprises and the people interacting with these models,” said Priya Nagpurkar, vice president of hybrid cloud and AI platform at IBM Research.

Speculative decoding: two or three tokens for the price of one

LLMs are great at imitating how humans write and code, but their transformer architecture makes them inefficient at outputting text. Before generating a new token (which is essentially a word, or part of a word), LLMs process each token they’ve previously generated. This is known as a forward pass.
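
To see why that is expensive, here is a toy, self-contained Python sketch of a forward-pass loop. The toy_forward stand-in and the greedy token selection are purely illustrative assumptions, not IBM’s code; the point is that each new token costs one full pass over everything generated so far.

```python
# Toy stand-in for an LLM: returns one row of "logits" per input token.
def toy_forward(tokens, vocab=50):
    return [[(t * 7 + v) % vocab for v in range(vocab)] for t in tokens]

def generate(forward, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(tokens)               # one full forward pass...
        last = logits[-1]
        tokens.append(last.index(max(last)))   # ...buys exactly one new token
    return tokens

print(generate(toy_forward, [3, 14, 15], max_new_tokens=5))
```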

In speculative decoding, the forward pass is modified so that the LLM evaluates several prospective tokens that come after the one it’s about to generate. If the “speculated” tokens are verified, one forward pass can produce two or more tokens for the price of one. Once the LLM hits an incorrect token, it stops and generates its next token based on all tokens it has validated to that point.

The speculation can be done by a smaller, more efficient model (called a “draft” model), or part of the main model itself. By processing tokens in parallel, speculative decoding gets more work out of each GPU on the backend for a fixed amount of memory operations. An LLM can go from generating one token per forward pass to several tokens, doubling or tripling its inferencing speed.
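
Here is a simplified sketch of that draft-and-verify loop. The tiny random “models” and the exact greedy acceptance rule are illustrative assumptions; production systems use real draft models (or extra heads on the main model) and a probabilistic acceptance test rather than strict argmax matching.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def make_toy_model(weights):
    # Stand-in for a transformer: maps a token sequence to one row of
    # logits per position. Only the shape of the interface matters here.
    def forward(tokens):
        return np.stack([weights[t % VOCAB] for t in tokens])
    return forward

target = make_toy_model(rng.normal(size=(VOCAB, VOCAB)))  # "main" model
draft = make_toy_model(rng.normal(size=(VOCAB, VOCAB)))   # cheap draft model

def speculative_step(target, draft, tokens, k=4):
    # 1. The draft model cheaply proposes k candidate tokens.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        tok = int(draft(ctx)[-1].argmax())
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model scores prompt + proposal in a single forward pass.
    logits = target(tokens + proposal)

    # 3. Accept proposals until the first mismatch with the target's own
    #    greedy choice, then substitute the target's token and stop.
    accepted = []
    for i, tok in enumerate(proposal):
        target_tok = int(logits[len(tokens) + i - 1].argmax())
        if tok != target_tok:
            accepted.append(target_tok)
            break
        accepted.append(tok)
    else:
        # All k proposals accepted: the same pass also yields a bonus token.
        accepted.append(int(logits[-1].argmax()))
    return tokens + accepted

print(speculative_step(target, draft, [1, 2, 3]))
```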

When speculative decoding was introduced last year, in back-to-back papers by researchers at DeepMind and Google, a tiny draft model carried out the guesswork by harnessing information in the embedding of the main model’s next predicted token.

Earlier this year, a team of academic researchers did away with the draft model. Their open-source speculator, Medusa, is added to the last layer of the base model, eliminating the need to train a second model.

“It leverages the richness and knowledge of the embedding vector in the base model,” said Mudhakar Srivatsa, an expert on AI optimization at IBM Research. “Predictability is already built into the embedding vector of these tokens.”

IBM researchers adapted the Medusa speculator. Instead of conditioning future tokens on the model’s next predicted token, they conditioned the speculated tokens on each other. For example, if “happy” is the first speculated token after “I am…,” the next three tokens will depend on what’s statistically most likely to come after “happy,” not “I am.”
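
A toy sketch of that difference, with hypothetical speculator heads standing in for Medusa-style prediction heads; the scoring function and token ids are made up for illustration only.

```python
# Hypothetical speculator heads: each "head" scores the context it is shown
# and returns a single speculated token id.
def make_toy_head(seed):
    def predict(hidden_state, context):
        return (sum(context) * seed + hidden_state) % 100  # made-up scoring
    return predict

heads = [make_toy_head(s) for s in (3, 5, 7)]
hidden = 42            # stand-in for the base model's embedding of "I am..."
prompt_next = 17       # stand-in for the base model's next predicted token

# Medusa-style: every head conditions only on the base model's state and its
# next predicted token, so later guesses ignore earlier guesses.
independent = [head(hidden, [prompt_next]) for head in heads]

# Adaptation described above: each speculated token also conditions on the
# tokens speculated before it, so if the first guess is "happy", the later
# guesses follow "happy" rather than "I am".
chained, context = [], [prompt_next]
for head in heads:
    tok = head(hidden, context)
    chained.append(tok)
    context.append(tok)

print(independent, chained)
```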

They also came up with a more efficient way to fine-tune Medusa by using small batches of text generated by the LLM, followed by larger batches. In the first half of training, the speculator learns on standard training datasets. In the second half, it learns from training data generated by the base model itself, ensuring that the speculator and LLM’s responses are aligned. The speculator is essentially an understudy that learns the larger model’s lines and behaviors.
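
In rough outline, that two-stage schedule might look like the sketch below. The toy speculator, the batch sizes, and the generation stub are assumptions for illustration, not IBM’s actual training recipe.

```python
# Hypothetical stand-ins: a speculator that just counts updates, a text
# corpus, and a base model whose generate() appends a fake continuation.
class ToySpeculator:
    def __init__(self):
        self.steps = 0
    def train_step(self, batch):
        self.steps += 1

corpus = [f"standard sample {i}" for i in range(1000)]

def base_model_generate(prompts):
    return [p + " ... (model continuation)" for p in prompts]

def tune_speculator(speculator, corpus, total_steps):
    # Stage 1: small batches drawn from standard training data.
    for step in range(total_steps // 2):
        batch = corpus[(step * 8) % len(corpus):][:8]
        speculator.train_step(batch)
    # Stage 2: larger batches generated by the base model itself, so the
    # speculator learns to mirror the model it will draft for.
    for step in range(total_steps // 2):
        prompts = corpus[(step * 32) % len(corpus):][:32]
        speculator.train_step(base_model_generate(prompts))

spec = ToySpeculator()
tune_speculator(spec, corpus, total_steps=100)
print(spec.steps)
```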

IBM trained and implemented speculators in several open-source models and saw inferencing speeds increase by two to three times. The code models showed the best results. Because code is highly structured, tokens like {, }, \n, \t, and ; are easier to predict than natural-language tokens.

Paged attention to free up memory

Reducing LLM latency tends to lower throughput because of the added strain on GPU memory. Dynamic batching can increase throughput by four to five times, but not if speculative decoding is competing for memory.

To free up memory so that both optimization techniques could work together, researchers turned to paged attention, an optimization technique developed at UC Berkeley and Stanford that borrows the concepts of virtual memory and paging from operating systems.

To minimize redundant computation, LLMs store previously generated words in what’s known as the key-value (KV) cache. Large models with long-winded responses eat up a large chunk of this space, which acts like RAM for the model.

Traditional attention algorithms store KV sequences in contiguous memory, resulting in memory fragmentation. Paged attention, by contrast, divides them into smaller memory blocks, or pages, that can be summoned when needed. This allows the speculator to generate multiple candidates for each predicted word without having to duplicate the full KV cache for each one.
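
Here is a minimal sketch of the block-table idea, assuming a small fixed block size and simple bookkeeping; it illustrates the concept rather than the actual vLLM or IBM implementation.

```python
BLOCK_SIZE = 4   # tokens per physical KV block (tiny, for illustration)

class PagedKVCache:
    """Each sequence keeps a block table mapping its logical token slots
    onto non-contiguous physical blocks, instead of one big buffer."""
    def __init__(self, num_blocks):
        self.storage = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free = list(range(num_blocks))
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens stored so far

    def append(self, seq_id, kv_entry):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:            # last block full: grab a free one
            table.append(self.free.pop())
        self.storage[table[-1]][n % BLOCK_SIZE] = kv_entry
        self.lengths[seq_id] = n + 1

    def fork(self, parent, child):
        # A speculative candidate reuses the parent's blocks by reference
        # rather than copying KV entries. (Real systems copy a partially
        # filled last block on first write; omitted here for brevity.)
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]

cache = PagedKVCache(num_blocks=8)
for t in range(6):
    cache.append("seq0", kv_entry=("key", "value", t))
cache.fork("seq0", "candidate_a")
print(cache.tables)   # candidate_a points at seq0's blocks, no duplication
```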

What’s next

Speculative decoding and paged attention have been added to IBM’s Granite 20B code model, and the IBM speculator has been open-sourced on Hugging Face for others to adapt to their own LLMs. IBM will soon implement both optimization techniques in all models on its watsonx platform for enterprise AI.
