
Tiny benchmarks for large language models

Why put your model through the SAT when a quiz will do?

Training and running inference on very large language models can be slow and expensive. Benchmarking doesn’t get as much attention, but it, too, has become a major drag on resources.

Benchmarks measure how well AI models do at a standardized set of tasks, from summarizing and translating documents to reasoning through complex questions. When AI models are released to the world, their worth is measured by their rank on leading performance benchmarks.

As the capabilities of large language models (LLMs) continue to grow, benchmarks have also grown progressively more rigorous and wide-ranging. LLMs today are typically grilled on their general knowledge, mathematical skills, common-sense reasoning, and more.

Not only does all this testing take time, but it eats up significant computing resources. Putting a model through the paces of Stanford’s popular “HELM” benchmark can take a day or more and cost upward of $10,000.

Benchmarks provide a yardstick by which to compare model performance, and measure AI progress generally. But they are also an indispensable part of the training process itself, allowing developers to iteratively test and evaluate new algorithms at different tasks. Not surprisingly, the cost of evaluating a model during its development life cycle can now exceed the cost of pre-training, as EleutherAI noted in a paper introducing its Pythia family of LLMs.

The high cost of benchmarking hit home for IBM in developing its Granite family of LLMs. A year ago, IBM Research — Israel was handed the challenge of trying to bring benchmarking costs down.

Leshem Choshen was fresh out of graduate school when he joined the lab as an engineer. What if the benchmark of the moment could be pared down, he wondered. Not by half, or even a quarter — but by 99%. “Think of how much less energy and compute you’d need,” he said. “Instead of evaluating a model in a day you could do it in 10 minutes.”

Flash evaluation

Choshen and his colleagues recently released tiny versions of some of the most popular measures of chatbot competency: the Open LLM Leaderboard, the MMLU benchmark for multi-task language understanding, and Stanford’s HELM and AlpacaEval 2.0 benchmarks. IBM has also open-sourced its streamlined benchmarks on Hugging Face.

With just 100 questions, IBM’s tiny MMLU is about 1% the size of the real MMLU, but it’s nearly as effective at measuring aptitude. The tiny benchmark could estimate the performance of newly released models to within 98% of their score on the full-sized MMLU, the researchers found. They reported similar results for their other miniaturized benchmarks.
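In rough terms, a tiny benchmark works by scoring a model on its small pool of representative questions and then projecting that result onto the full benchmark, using per-question weights fit ahead of time on many reference models. The sketch below only illustrates the idea; the function name, uniform weights, and toy data are assumptions for illustration, not IBM’s released code.

```python
# Minimal sketch of the tiny-benchmark idea (illustrative only, not IBM's code):
# score a model on ~100 representative questions, then use per-question weights
# to estimate what its accuracy on the full benchmark would have been.
import numpy as np

def estimate_full_score(correct: np.ndarray, weights: np.ndarray) -> float:
    """correct: 0/1 results on the ~100 tiny-benchmark questions.
    weights: how much each question counts toward the full-benchmark estimate
    (uniform here as a placeholder; real weights would be fit on reference models)."""
    return float(np.average(correct, weights=weights))

# Toy example: a model answers roughly 73 of 100 representative questions correctly.
rng = np.random.default_rng(0)
correct = (rng.random(100) < 0.73).astype(float)
weights = np.ones(100)
print(f"Estimated full-benchmark accuracy: {estimate_full_score(correct, weights):.3f}")
```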

The notion that a small but well-crafted test can be nearly as effective as one that’s much longer comes from psychometrics, the discipline that introduced standardized tests like the SAT to the world. Pick a broad enough set of examples, the thinking goes, and you can cover the important topics with a fraction of the material.

But, of course, you must ask the right questions. To ensure that the most representative ones were selected, the team turned to AI. They built a model to analyze how top-scoring LLMs fared on the full-sized benchmark to understand which questions best predicted success. Those are the ones they hand-picked for the tiny benchmarks.
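One simple way to make that kind of selection, sketched below with scikit-learn’s k-means standing in for the team’s actual model, is to describe each question by the pattern of right and wrong answers it draws from a pool of reference models, cluster those patterns, and keep one question per cluster. The data and function names here are hypothetical.

```python
# Rough illustration (not the team's code) of picking representative questions:
# cluster questions by how a set of reference models answered them, then keep
# the question closest to each cluster's center.
import numpy as np
from sklearn.cluster import KMeans

def pick_representative_questions(results: np.ndarray, n_keep: int) -> np.ndarray:
    """results: (n_models, n_questions) matrix of 0/1 correctness scores.
    Returns indices of n_keep questions, one per cluster."""
    patterns = results.T  # one row per question: its answer pattern across models
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(patterns)
    chosen = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(patterns[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(sorted(chosen))

# Toy usage: 50 reference models, 2,000 questions, keep 20 (about 1%).
rng = np.random.default_rng(1)
results = (rng.random((50, 2_000)) < 0.6).astype(float)
print(pick_representative_questions(results, n_keep=20)[:10])
```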

They realized that quality, not quantity, matters most in AI evaluation after scrutinizing Stanford’s HELM benchmark. A surprising number of questions, they found, were redundant or irrelevant and could easily be cut. After they eliminated 99% of the questions within a HELM task scenario, the scores of HELM’s top performers barely budged.

Working closely with Yotam Perlitz, an IBM researcher who specializes in LLM evaluation, Choshen first introduced efficient benchmarking in a pre-print paper on arXiv last spring. It caught the eye of AI researchers who were organizing an efficient LLM contest at NeurIPS 2023.

The models under consideration had been trained on desktop computers, but organizers realized they didn’t have enough computing resources to evaluate them. “We had 16 GPUs and 225 submissions,” said Mark Saroufim, a software engineer at Meta who co-organized the workshop. “It could have taken a day to run each model, but we had just two weeks to go before having to announce a winner.”

IBM researchers worked with Saroufim to create an abbreviated benchmark they named Flash HELM, which allowed them to eliminate the lowest performers after a few hundred questions. The finalists were then tested on the full benchmark for the most reliable results with the remaining compute budget.
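The elimination scheme works like a tournament: every submission answers a small batch of questions, the weakest half is dropped, and the cycle repeats until only a handful of finalists remain for a full run. The sketch below is a simplified illustration under those assumptions, with made-up names, not the actual Flash HELM implementation.

```python
# Simplified tournament-style elimination (assumed details, not Flash HELM itself):
# score every model on a small batch, cut the bottom half, repeat.
import random
from typing import Callable, List

def flash_eliminate(models: List[str],
                    run_batch: Callable[[str, int], float],
                    batch_size: int = 200,
                    finalists: int = 4) -> List[str]:
    """run_batch(model, n) returns the model's accuracy on n sampled questions."""
    survivors = list(models)
    while len(survivors) > finalists:
        scores = {m: run_batch(m, batch_size) for m in survivors}
        survivors.sort(key=lambda m: scores[m], reverse=True)
        survivors = survivors[: max(finalists, len(survivors) // 2)]  # keep the top half
    return survivors

# Toy usage with a fake scorer standing in for real benchmark runs.
def fake_scorer(model: str, n: int) -> float:
    true_skill = (hash(model) % 100) / 100  # pretend each model has a fixed skill level
    return sum(random.random() < true_skill for _ in range(n)) / n

models = [f"submission-{i}" for i in range(225)]
print(flash_eliminate(models, fake_scorer))
```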

In the end, a winner was declared on time. “Flash HELM saved the day,” said Saroufim.

Efficient benchmarking goes mainstream

IBM’s efficient benchmarking team is now responsible for evaluating all the LLMs on IBM’s watsonx platform for enterprise AI, including IBM’s Granite family of code and language models.

Testing a Granite 13B model on a benchmark like HELM can consume as many as 1,000 GPU hours; IBM typically evaluates at least one model each day. “If you don’t do benchmarking efficiently, it quickly gets very expensive,” said Michal Shmueli-Scheuer, an IBM researcher who leads foundation model evaluation. “These methods have allowed us to significantly cut our evaluation costs, savings we can then pass on to customers.”

Efficient benchmarking can also speed up innovation. “It can take up to two days to tell someone the model doesn’t work,” said Perlitz. “Efficient benchmarking lets you go back and quickly make revisions without the wait.”

Tiny benchmarks have become popular within IBM for just this reason. Youssef Mroueh, a principal research scientist at IBM, uses them to quickly test whether new algorithms can improve model performance. “It helps us understand whether one algorithm is better than another without spending as much money,” he said.

Flash Holmes, a streamlined version of IBM’s new Holmes benchmark for linguistic competency, is the latest addition.

The idea appears to be catching on elsewhere. Stanford’s Efficient-HELM, implemented by Choshen and his team, is a condensed version of HELM that allows developers to choose the number of examples they want to run, and how much compute they want to save. Stanford has separately released HELM Lite, HELM’s broader but lightweight cousin.

“Large benchmarks don’t necessarily add value by being larger,” said Choshen. “This was our insight, and we hope it can lead to faster, more affordable ways of measuring LLM performance.”