To get an AI adapter to work like a function, however, researchers had to figure out how to run it without the task-aware embeddings representing the user’s request. Without the benefit of embeddings tailored to the user’s goal, their first few activated-LoRA prototypes failed to match the accuracy of regular LoRAs.
But they eventually found a way to compensate: increasing the rank of the adapter. With that extra network capacity, the adapter could extract more contextual clues from the model's general embeddings. In a series of tests, the researchers confirmed that their “aLoRA” could now perform on par with a traditional LoRA.
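The knob in question is the adapter's rank. The sketch below shows that difference using the Hugging Face PEFT library's standard configuration; the specific rank values are illustrative assumptions, and aLoRA itself currently runs on IBM's experimental code rather than stock PEFT.

```python
# A minimal sketch of the rank knob, using Hugging Face PEFT's LoraConfig.
# The rank values below are illustrative assumptions, not IBM's published settings.
from peft import LoraConfig

# A typical LoRA adapter might use a small rank such as 8.
standard_lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
)

# Without embeddings tailored to the user's request, an activated LoRA needs
# more capacity, so the rank is raised to let it pull contextual cues from
# the base model's general embeddings.
higher_rank_alora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
)
```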
“Across a variety of applications, we saw that aLoRA-customized models could now generate text as well as those customized with standard LoRAs,” said Greenewald. “We could get their runtime benefits without the accuracy loss.”
IBM Research is releasing a library of new aLoRA adapters for its Granite 3.2 LLMs, aimed at improving the accuracy and reliability of RAG applications. Experimental code to execute the adapters is also available as researchers work on implementing them in vLLM, the open-source platform for serving AI models efficiently. IBM is separately releasing a set of standard Granite 3.2 adapters for immediate use in vLLM. Some of the task-specific LoRAs are updates of the ones IBM released last year through Granite Experiments.
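For the standard adapters, the plug-and-play path runs through vLLM's existing LoRA support, roughly as sketched below. The adapter name and local path are placeholders rather than official release artifacts, and the aLoRA variants still require IBM's experimental runtime rather than stock vLLM.

```python
# Sketch: serving a standard Granite 3.2 LoRA adapter with vLLM's LoRA support.
# The adapter name and path are placeholders, not official release artifacts.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="ibm-granite/granite-3.2-8b-instruct", enable_lora=True)

outputs = llm.generate(
    ["Rewrite the user's last question as a standalone search query: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
    lora_request=LoRARequest("query-rewrite", 1, "/path/to/granite-3.2-lora-query-rewrite"),
)
print(outputs[0].outputs[0].text)
```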
One of the new aLoRAs can rewrite queries in a conversation to make it easier to search for and retrieve key passages. Another can determine whether a query can be answered based on the retrieved documents, reducing the risk that the model might hallucinate an answer. A third can estimate how confident the model is in the accuracy of its answer, signaling to users when they should double-check the facts.
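Taken together, those three adapters suggest a retrieval loop along the lines of the sketch below. It is only an outline: `run_adapter` is a hypothetical stand-in for whichever call actually invokes each adapter, and the adapter names and confidence threshold are illustrative.

```python
# Outline of a RAG loop using the three adapters described above.
# `run_adapter` is a hypothetical helper; the adapter names and the
# threshold are illustrative, not IBM's published interface.
def answer_with_rag(conversation, retriever, run_adapter):
    # 1. Query rewrite: turn the latest turn into a standalone search query.
    query = run_adapter("query-rewrite", conversation)
    documents = retriever.search(query)

    # 2. Answerability: check whether the retrieved passages support an answer.
    if run_adapter("answerability", conversation, documents) == "unanswerable":
        return "I couldn't find enough information in the documents to answer that."

    # 3. Answer with the base model, then score how confident it is in the result.
    answer = run_adapter("base-model", conversation, documents)
    confidence = float(run_adapter("answer-confidence", conversation, documents, answer))
    if confidence < 0.5:  # illustrative threshold
        answer += "\n\n(Low confidence: please double-check these facts.)"
    return answer
```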
Beyond RAG, IBM Research is releasing exploratory adapters that can flag attempts to jailbreak, or bypass, an LLM’s safety controls, as well as check whether LLM outputs meet a set of user-defined standards.
LLM performance has been shown to improve dramatically when more compute is spent at runtime to evaluate and improve the model's initial responses. IBM Research recently improved the reasoning capabilities of its Granite 3.2 models by introducing several methods to review LLM candidate responses under the hood at test time and select the best one to output.
IBM Research is exploring whether aLoRAs can provide a similar performance boost in what has been variously called “test-time” or “inference-time” scaling. An adapter could be designed, for example, to generate multiple answers to a query and select the one that combines a low score for hallucination risk with a high confidence score for accuracy.
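In code, that selection step could look something like the sketch below, where `generate_candidate`, `hallucination_risk`, and `answer_confidence` are hypothetical hooks for the corresponding adapters and the scoring rule is an illustrative assumption.

```python
# Sketch of best-of-N selection at inference time. The three callables are
# hypothetical hooks for adapter calls; the scoring rule is illustrative.
def best_of_n(query, documents, generate_candidate, hallucination_risk,
              answer_confidence, n=4):
    candidates = [generate_candidate(query, documents) for _ in range(n)]

    def score(answer):
        # Higher is better: reward estimated accuracy, penalize hallucination risk.
        return (answer_confidence(query, documents, answer)
                - hallucination_risk(query, documents, answer))

    return max(candidates, key=score)
```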
The next frontier in AI involves agents, and researchers want to see if inference-friendly adapters can have an impact here, too. AI agents have been shown to do well at mimicking human reasoning when a complex task is broken into discrete steps for the LLM agent to tackle one by one.
Each of these steps may require specialized models, both to carry out the step and to evaluate the result, whether by the model itself or by another. This is where lightweight aLoRAs could really shine, said Luis Lastras, director of language technologies at IBM Research.
“Thanks to their unique architecture, we could potentially see huge improvements in runtime performance,” he said.