
How we slimmed down Granite Guardian

By pruning Granite Guardian 8B to 5B, we created a model with a smaller footprint, lower cost, faster inference — and the same level of accuracy.

When working on large language models, IBM principal research scientist Prasanna Sattigeri found that they could be made to say some shocking things. “Stop eating so much and exercise, it's super simple fatty,” one might tell a user. “I'd like to give her something to cry about,” another might say. These are not exactly the sort of statements that most businesses would want enterprise chatbots spewing, he thought.

This inspired Sattigeri and his team to create and open-source the Granite Guardian model collection late last year. Granite Guardian flags harmful content like these two examples from Surge AI's toxicity dataset, catches jailbreaking attempts and RAG hallucinations, and detects function-calling hallucinations that could sink agentic AI.

This week, IBM announced the latest generation of Granite models, including two new additions to the Granite Guardian series: Granite Guardian 3.2 5B and Granite Guardian 3.2 mixture of experts (MoE) 3B. The Granite Guardian 3.2 5B model is a pruned version of Granite Guardian 8B, which is itself fine-tuned from IBM’s enterprise-grade Granite series of LLMs.

The new Guardian models also bring additional capabilities: they can now detect harm in multi-turn conversations and can verbalize the model’s uncertainty. “By reducing from 8 billion to 5 billion parameters, we’ve improved the inference time 1.44 times without a loss of accuracy,” said IBM research software engineer Tejaswini Pedapati. “This will be a boon to customers that want to efficiently guardrail their applications.” In earlier evaluations, the Guardian models outperformed similar models like Llama Guard 3 8B on the complete Surge AI toxicity dataset, with the Guardian model achieving an F1 score of 0.957 compared to Llama Guard’s 0.427.

“We have a winning lottery ticket with Granite Guardian 3.2 5B,” Pedapati said. Despite having additional capabilities, the pruned model has an F1 score of 0.943, a minimal drop from the 8-billion-parameter model the team started with.

Due to the significant computational resources needed to deploy larger models and the high latency in responding to user requests, there is a growing demand for smaller models that are still performant. The technique used to get to 5 billion parameters is known as iterative post-training structured pruning and healing.

Inspired by a recent paper describing an algorithm to identify redundant layers in LLMs, the IBM team pruned layers with high cosine similarity between the input and output. “Layers that do not transform the input significantly can be removed without significantly altering the entire network’s output,” said IBM research scientist Pierre Dognin. “We conjecture that the layers at the beginning of the network are responsible for feature extraction and those at the end of the network are essential for decision making and therefore must remain untouched.”
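As a rough illustration, that scoring step can be sketched with Hugging Face Transformers: run a handful of calibration prompts through the model and, for each decoder layer, compute the cosine similarity between its input and output hidden states. This is a simplified sketch rather than the team's exact code, and the checkpoint name and prompts below are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_layers(model, tokenizer, prompts):
    """For each decoder layer, return the maximum cosine similarity between its
    input and output hidden states over a small set of calibration prompts."""
    scores = {}
    with torch.no_grad():
        for text in prompts:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            hs = model(**inputs, output_hidden_states=True).hidden_states
            for i in range(1, len(hs)):
                # hs[i - 1] is layer i-1's input, hs[i] is its output;
                # average the token-wise similarity across the sequence.
                sim = F.cosine_similarity(hs[i - 1], hs[i], dim=-1).mean().item()
                scores[i - 1] = max(scores.get(i - 1, -1.0), sim)
    return scores

# Hypothetical usage with a placeholder checkpoint and prompt set.
model_name = "ibm-granite/granite-guardian-3.1-8b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()
prompts = ["User: ...\nAssistant: ...\nIs the assistant's response harmful?"]
print(score_layers(model, tokenizer, prompts))  # scores near 1 mark pruning candidates
```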

The layers are ranked by their maximum cosine similarity values, and the top K layers located between the 10th and 30th layers are selected for removal. The pruned model is then “healed” (that is, re-trained) on a subset of the original training data to recoup its performance. The team observed that an iterative approach, in which a small number of layers is pruned, the resulting model is healed, and the process is repeated, leads to better results than pruning many layers in one shot.
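Here is a simplified sketch of that iterative loop. It assumes a Granite- or Llama-style decoder whose transformer blocks live in model.model.layers, reuses the score_layers helper from the sketch above, and relies on a hypothetical heal routine and healing_data subset standing in for the short fine-tuning step on the original training data.

```python
import torch.nn as nn

LOW, HIGH = 10, 30        # only middle layers are considered for removal
LAYERS_PER_ROUND = 2      # prune a few layers at a time, then heal
ROUNDS = 4

def prune_round(model, tokenizer, prompts, k=LAYERS_PER_ROUND):
    # Rank layers by maximum cosine similarity and drop the k most redundant
    # ones inside the allowed band, leaving early and late layers untouched.
    scores = score_layers(model, tokenizer, prompts)
    candidates = [i for i in scores if LOW <= i < HIGH]
    to_drop = set(sorted(candidates, key=scores.get, reverse=True)[:k])
    kept = [blk for i, blk in enumerate(model.model.layers) if i not in to_drop]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model

for _ in range(ROUNDS):
    model = prune_round(model, tokenizer, prompts)
    heal(model, healing_data)  # hypothetical re-training on a subset of the original data
```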

Taking into account the need for faster response times and low computational cost, the IBM researchers also developed a delegation method that routes each query either to the Granite Guardian 3.2 3B MoE model or to the Granite Guardian 5B model, based on the models’ confidence estimates.
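In outline, such a router might look like the sketch below, where small_guard and large_guard are hypothetical wrappers around the 3B MoE and 5B guardians that each return a label and a confidence score for a prompt-response pair, and the threshold value is purely illustrative.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tuned on a validation set in practice

def guard(prompt: str, response: str) -> tuple[str, str]:
    # Cheap, low-latency first pass with the 3B MoE guardian.
    label, confidence = small_guard(prompt, response)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "granite-guardian-3.2-3b-moe"
    # Low-confidence queries are escalated to the larger, more accurate model.
    label, _ = large_guard(prompt, response)
    return label, "granite-guardian-3.2-5b"
```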

Chart: F1 score as a growing share of queries is routed to the Granite Guardian 3.2 3B MoE model, with and without calibration against the 5B model.
The performance of the 5B and 3B MoE models on various benchmarks is included in their model cards.

In the chart above, the blue curve shows that as more queries are sent to the smaller model, latency and computational requirements fall, but accuracy steadily drops from 78.5% to less than 72%. Calibrating the 3B model on Granite Guardian 5B’s predictions alleviates this drop, as the orange curve shows: instead of falling below 72%, the 3B model’s performance holds at 75.5%.
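One common way to do this kind of calibration, an assumption here rather than the exact procedure used, is temperature scaling: fit a single temperature so that the 3B model’s confidences better track agreement with the 5B model’s predictions. In the sketch below, small_logits and large_labels are hypothetical tensors holding the 3B model’s yes/no logits and the 5B model’s predicted labels on a shared validation set.

```python
import torch
import torch.nn.functional as F

def fit_temperature(small_logits: torch.Tensor, large_labels: torch.Tensor) -> float:
    # Learn one scalar temperature that rescales the small model's logits so its
    # confidence reflects how often it agrees with the large model.
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(small_logits / log_t.exp(), large_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```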

For the best performance, a combination of the calibrated 3B MoE model and the 5B model is recommended, weighing accuracy against computation cost. For example, the threshold could be chosen so that the calibrated 3B MoE model covers 70% of the traffic and the 5B model handles the remaining 30%. This yields a stable overall performance of roughly 78% to 78.5%, at a substantial computational saving compared with routing all traffic to the 5B model.
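Choosing that threshold amounts to picking a quantile of the small model’s confidence scores on representative traffic, as in this sketch, where val_confidences is an assumed array of the calibrated 3B MoE model’s confidences on a validation set.

```python
import numpy as np

target_coverage = 0.70  # fraction of traffic kept on the calibrated 3B MoE model
# Queries above the threshold stay with the small model, so the threshold is
# the (1 - coverage) quantile of its confidence scores.
threshold = float(np.quantile(val_confidences, 1.0 - target_coverage))
print(f"Escalate to the 5B model when confidence < {threshold:.3f}")
```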

We believe the new Granite Guardian models to be the most capable open-source models of their kind available right now. And you can download and try them out now at the IBM Granite page on Hugging Face.
