What’s an LLM context window and why is it getting larger?
Larger context windows give language models more background to consider as they generate a response, leading to more coherent and relevant answers. IBM just open sourced its new Granite 3B and 8B models with extended context.
The power of large language models used to be measured in parameters. That was until a class of smaller, more efficient models showed size wasn’t everything — especially for narrower, business-focused tasks.
Today, a new LLM arms race is on, and it centers on the context window, which is the maximum amount of text the model can consider as it chats with a customer, reviews a contract, or fixes a line of code. It includes both the text in the user’s prompt and text the model has generated.
A larger context window allows the model to hold more text in a kind of working memory, helping it to keep track of key moments and details in a drawn-out chat, or a lengthy document or codebase. It’s what allows an LLM-based chatbot to generate responses that make sense in the immediate moment, but also over a longer context.
The context window is measured in tokens, which for LLMs are machine-readable representations of words, parts of words, or even punctuation. The context window is crucial to an LLM's ability to craft coherent, accurate, and relevant responses.
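To get a feel for how text maps to tokens, you can run a tokenizer yourself. The sketch below uses the Hugging Face transformers library; the model ID is illustrative, and exact token counts vary from tokenizer to tokenizer.

```python
# A minimal sketch of counting tokens against a context window.
# Assumes the Hugging Face transformers library is installed; the model ID
# below is illustrative -- any tokenizer hosted on Hugging Face will do.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-8b-code-instruct-128k")

prompt = "Summarize the attached annual report in three bullet points."
token_ids = tokenizer.encode(prompt)

print(f"{len(prompt.split())} words -> {len(token_ids)} tokens")

CONTEXT_WINDOW = 128_000  # maximum tokens the model can consider at once
print(f"Room left in the window: {CONTEXT_WINDOW - len(token_ids)} tokens")
```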
When ChatGPT made its debut nearly two years ago, its window maxed out at 4,000 tokens, or roughly 3,000 words. If your conversation ran past that limit, the chatbot was likely to hallucinate and veer off-topic. Today, the standard is 32,000 tokens, and the industry is shifting to 128,000 tokens, which is about the length of a 250-page book. IBM just open-sourced two Granite models with a 128,000-token window on Hugging Face, and more are on their way.
A larger context window lets you add more information to your prompt at inference time. But ‘prompt stuffing,’ as this technique is known, doesn’t come free. More computational resources are required to process the text, slowing down inferencing and driving up costs. For companies that pay by the token, summarizing a long annual report or meeting transcript can quickly get expensive.
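As a rough illustration of how those costs add up, here is a back-of-the-envelope calculation. The per-token price below is a made-up placeholder, not any vendor's actual rate.

```python
# Back-of-the-envelope cost of 'prompt stuffing' -- the price is a
# hypothetical placeholder, not an actual rate from any provider.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # assumed, in dollars

report_tokens = 120_000   # e.g., a long annual report stuffed into every prompt
queries_per_day = 500

daily_cost = report_tokens * queries_per_day * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
print(f"Daily input-token cost: ${daily_cost:,.2f}")  # -> $180.00, before output tokens
```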
“You’re passing each token through the model,” said Matthew Stallone, an IBM researcher focused on extending the context length of IBM’s Granite models. “You’re wasting computation to basically do a ‘Command+F’ to find the relevant information to answer your question.”
Larger windows can improve results — up to a point. Like people, LLMs are susceptible to information overload. Throw too much detail at them, and they may miss the key takeaways. Research has shown LLMs are more apt to pick up on important information appearing at the start or end of a long prompt rather than buried in the middle.
IBM researchers have further shown that the more closely the examples in a prompt resemble the target task, the better the model tends to do. “We proved that the quality of the examples matters,” said Xiaodong Cui, an IBM researcher who studies the theoretical underpinnings of foundation models. In other words, making context windows infinitely longer may be counterproductive at a certain point.
A large part of IBM’s AI strategy hinges on cost performance, whether it’s speeding up LLM inferencing through speculative decoding, streamlining LLM customization through synthetic data generation, or creating tiny benchmarks for faster LLM evaluation and innovation. IBM’s Granite models are not the biggest, but they are among the best in class on tasks like coding in SQL or running external applications through function calling.
IBM has taken the same approach on context windows. Researchers recently extended the windows of IBM’s Granite 3B and 8B code and instruct models to 128,000 tokens. Larger windows can improve LLM performance on coding tasks, in particular, by allowing them to ingest more software documentation.
IBM is in the process of extending the context windows of its other Granite models, which will be added to existing products, including IBM’s generative AI code modernization solution, watsonx Code Assistant for Z (WCA for Z).
LLMs are built on the transformer architecture, which can take in raw text at scale and, through its attention mechanism, learn how words relate to each other to form a statistical representation of language. Transformers underpin today’s foundation models and have brought about stunning progress in AI, but the longer the sequence they must attend to, the more number-crunching they have to do.
When a text sequence doubles in length, an LLM requires four times as much memory and compute to process it. This quadratic scaling rule limits LLMs to shorter sequences during training and, effectively, shorter context windows during inferencing.
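The quadratic cost comes from the attention score matrix, which compares every token with every other token. A minimal numpy sketch of plain scaled dot-product attention (single head, no batching or masking) makes the scaling visible:

```python
# Minimal scaled dot-product attention to show where quadratic scaling comes from.
# Plain numpy, single head, no masking -- an illustration, not a production kernel.
import numpy as np

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (seq_len, seq_len) -- grows quadratically
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

for seq_len in (1_000, 2_000, 4_000):
    x = np.random.randn(seq_len, 64).astype(np.float32)
    out = attention(x, x, x)
    # The score matrix alone holds seq_len**2 float32 values:
    print(f"{seq_len} tokens -> {seq_len**2 * 4 / 1e6:.0f} MB of attention scores")
```

Each doubling of the sequence length quadruples the size of that score matrix, which is the memory and compute wall the Granite work had to get around.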
To scale Granite's context window, IBM researchers reduced the amount of memory and computation needed to process long streams of text.
Ring attention was added to the base model to improve its computational efficiency. The researchers also changed how the model encodes token positions: rather than encoding each token’s absolute position in the sequence, they adopted a more efficient method that encodes tokens by their relative position.
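The article doesn’t name the specific encoding scheme, but rotary position embeddings (RoPE) are a widely used relative scheme and make the idea concrete: each pair of channels in a query or key vector is rotated by an angle proportional to the token’s position, so the attention score between two tokens depends on how far apart they are rather than on their absolute positions. A minimal sketch, assuming RoPE:

```python
# A minimal rotary position embedding (RoPE) sketch -- one common way to encode
# relative positions. Shown for illustration; not necessarily the exact scheme
# used for the Granite models.
import numpy as np

def rope(x, base=10000.0):
    """Rotate a (seq_len, dim) array of queries or keys by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)   # per-pair rotation speed
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = rope(np.random.randn(8, 64))
k = rope(np.random.randn(8, 64))
# The rotation contributes only a relative phase: q[i] @ k[j] reflects the
# offset i - j, not the absolute positions i and j.
```

Long-context extensions often stretch schemes like this further, for instance by rescaling the rotation frequencies so positions beyond the original training length still map onto familiar angles.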
They also revised their training approach, after IBM's Rameswar Panda and colleagues showed that context modeling improved when LLMs were pre-trained on 500 million tokens with a good mix of long-form documents.
IBM's Granite 3B and 8B models were pre-trained using these techniques, and then fine-tuned on multi-turn, multilingual conversations that were expanded, or generated from scratch, with the help of an LLM.
Compressing input prompts into a shorter, compact form is another way to enlarge context windows. IBM researchers recently came up with a method for an LLM to both generate its own synthetic longform instruction data and compress it at different ratios for later use. At inference time, the ratio that best matches the size of the input prompt is selected, allowing the model to interpret the longer sequence.
With larger windows, copying and pasting examples or the relevant facts you want the LLM to analyze becomes easier. Essentially, you can feed the LLM the same details that a retrieval-augmented generation (RAG) workflow would otherwise supply through an API call.
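To make the contrast concrete, here is a small sketch of the two ways of assembling a prompt. The retrieval step uses a toy keyword-overlap score purely for illustration; real RAG pipelines rely on vector search over embeddings.

```python
# Two ways to build a prompt: stuff everything into a long context window,
# or retrieve only the most relevant chunks first (a toy stand-in for RAG).
documents = {
    "hr_policy.txt": "Employees accrue 1.5 vacation days per month ...",
    "travel_policy.txt": "Economy class is required for flights under 6 hours ...",
    "expense_policy.txt": "Meals are reimbursed up to a daily limit ...",
}

question = "How many vacation days do employees accrue per month?"

# Option 1: long-context 'prompt stuffing' -- hand the model everything.
stuffed_prompt = "\n\n".join(documents.values()) + f"\n\nQuestion: {question}"

# Option 2: RAG-style -- score each chunk and keep only the best match.
def overlap_score(chunk: str, query: str) -> int:
    return len(set(chunk.lower().split()) & set(query.lower().split()))

top_chunks = sorted(documents.values(),
                    key=lambda c: overlap_score(c, question),
                    reverse=True)[:1]
rag_prompt = "\n\n".join(top_chunks) + f"\n\nQuestion: {question}"

print(len(stuffed_prompt), "characters vs.", len(rag_prompt), "characters")
```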
Pin-Yu Chen, an IBM researcher who has studied why transformers excel at in-context learning, predicts that RAG will eventually go away. “With a larger window you can throw in all the books and enterprise documents you want the model to process,” he said. “RAG, by contrast, comes with information loss. No one wants to use it if you can fit everything in the context window.”
RAG is still relevant in many use cases, however, said Marina Danilevsky, an IBM researcher specializing in the technique. LLMs need RAG when tasked with current-events questions. RAG also allows them to evaluate contradictory information like policy updates, deprecated software functionalities, or program name changes, to craft more accurate responses.
Scanning thousands of documents for each user query is also cost inefficient. “It would be much better to save up-to-date responses for frequently asked questions, much as we do in traditional search,” she said. “Larger windows are best reserved for less common queries, after the extraneous details have been filtered out.”