The technology behind InstructLab, a low-cost way to customize LLMs

IBM and Red Hat’s new open-source project is designed to lower the cost of fine-tuning large language models by allowing people to collaboratively add new knowledge and skills to any model.

Large language models are flourishing in the open, but most are still built in silos. Communities can form around a model, but their contributions can take months or years to be merged back into the base model — if they make their way back at all.

“There’s no good way to combine all of that innovation into a coherent whole,” said David Cox, vice president for AI models at IBM Research.

InstructLab, an open-source project launched by IBM and Red Hat in May, is designed to change that. It gives communities the tools to create and merge changes to LLMs without having to retrain the model from scratch. By making LLMs more like any other open-source software project, IBM and Red Hat hope to democratize access to generative AI.

InstructLab works by augmenting human-curated data with high-quality examples generated by an LLM, lowering the cost of data creation. InstructLab-generated data can then be used to customize or improve the base model without having to retrain it, creating additional savings. IBM Research has used InstructLab to generate synthetic data to improve its open-source Granite models for language and code.
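
The data-amplification step can be sketched in a few lines of Python. Everything here is illustrative: `query_teacher_model` is a hypothetical stand-in for a call to whatever teacher LLM you have serving, and the prompt format and parsing are simplifications, not InstructLab's actual pipeline.

```python
# A minimal sketch of the seed-to-synthetic idea: a handful of human-written
# Q&A pairs is turned into a few-shot prompt, and a "teacher" LLM amplifies
# them into many more. `query_teacher_model` is a hypothetical stand-in; the
# prompt and parsing are illustrative, not InstructLab's actual pipeline.

def make_prompt(seed_examples: list[dict]) -> str:
    """Turn seed Q&A pairs into a few-shot generation prompt."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in seed_examples
    )
    return (
        "Below are examples of a skill. Write five new, distinct "
        "question-and-answer pairs in the same style.\n\n" + shots
    )

def parse_qa_pairs(text: str) -> list[dict]:
    """Naive parser for 'Q: ... A: ...' formatted model output."""
    pairs = []
    for block in text.split("Q:")[1:]:
        if "A:" in block:
            q, a = block.split("A:", 1)
            pairs.append({"question": q.strip(), "answer": a.strip()})
    return pairs

def generate_synthetic(seed_examples, query_teacher_model, rounds=10):
    """Amplify a small seed set into a larger synthetic training set."""
    synthetic = []
    for _ in range(rounds):
        reply = query_teacher_model(make_prompt(seed_examples))
        # A real pipeline would also filter low-quality or duplicate pairs.
        synthetic.extend(parse_qa_pairs(reply))
    return synthetic
```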

Researchers also recently used InstructLab to turn an IBM 20B Granite code model into an expert at modernizing software written for IBM Z mainframes. Its speed and effectiveness ultimately helped convince IBM executives to team up with Red Hat and accelerate the technology.

IBM’s current solution for mainframe modernization, watsonx Code Assistant for Z (WCA for Z), released last fall, was fine-tuned on paired COBOL-Java programs written by humans and amplified by a traditional rules-based synthetic data generator.

Researchers used some of this information as seed data and amplified it through the InstructLab pipeline. In addition, InstructLab was used to convert a manual for IBM Z and a stack of programming textbooks into additional pairs of synthetic, functionally equivalent COBOL-Java programs.

When this new data was fed to a pre-trained IBM Granite code model, the results took Ruchir Puri, chief scientist at IBM Research and architect of AI for Code, by surprise. In one week, the InstructLab-tuned model achieved a code generation score of 97 percent — 20 percentage points better than the production model in WCA for Z at the time.

“The most exciting part of InstructLab is its ability to generate new data from traditional knowledge sources,” he said. IBM expects to soon release an updated version of WCA for Z.

How InstructLab works

InstructLab features a command-line interface (CLI) that allows you to add and merge new alignment data to your target model through a GitHub workflow on your laptop. Think of the CLI as a test kitchen for trying out and submitting new “recipes” for generating synthetic data to teach an LLM new knowledge and skills.
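
As a rough sketch of that workflow, the steps below drive the `ilab` CLI from Python. It assumes the CLI is installed locally; the command names reflect the project's early releases and may differ in later versions.

```python
# A rough sketch of the local InstructLab loop, driven with subprocess.
# Assumes the `ilab` CLI is installed; command names reflect the project's
# early releases and may differ in later versions.
import subprocess

steps = [
    ["ilab", "init"],      # create a local config and clone the taxonomy
    ["ilab", "download"],  # fetch a quantized model to experiment with
    ["ilab", "generate"],  # synthesize training data from your recipes
    ["ilab", "train"],     # tune the local model on the generated data
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop if any step fails
```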

InstructLab’s backend is powered by IBM Research’s new synthetic data generation and phased-training method, Large-Scale Alignment for ChatBots, or LAB. Using a taxonomy-driven approach, LAB can create high-quality data corresponding to the tasks you want to add to your model. The taxonomy is a hierarchical map of what LLMs tuned on InstructLab data have learned to date, making it easy to identify and fill in holes.
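
Conceptually, the taxonomy is a tree whose leaves hold skill recipes, which is what makes gaps easy to spot. Here is a toy sketch; the branch names are illustrative, so see the InstructLab taxonomy repository on GitHub for the real layout.

```python
# A toy model of the taxonomy: a tree whose leaves hold skill recipes.
# The paths are illustrative; see the InstructLab taxonomy repo on GitHub
# for the real layout.
taxonomy = {
    "compositional_skills": {
        "writing": {
            "summarization": {"qna.yaml": "..."},   # covered skill
            "poetry": {},                           # gap: no recipe yet
        },
    },
}

def find_gaps(tree, path=()):
    """Walk the tree and report branches with no recipe attached."""
    if not tree:
        yield "/".join(path)
    for name, child in tree.items():
        if name.endswith(".yaml"):
            continue  # a recipe file, not a sub-branch
        yield from find_gaps(child, path + (name,))

print(list(find_gaps(taxonomy)))
# ['compositional_skills/writing/poetry']
```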

LAB’s unique training regimen allows new information to be assimilated into the model during alignment, without causing the model to overwrite what it previously learned. Traditionally, foundation models have been infused with core knowledge and capabilities during the drawn-out pre-training phase. If substantial improvements were needed, the pre-trained base model had to be retrained.

“Instead of having a large company decide what your model knows, and what it can do, InstructLab lets you dictate through its taxonomy what knowledge and skills your model should have,” said Akash Srivastava, the IBM researcher who led the team that developed LAB and is now principal AI product advisor at Red Hat.

How to contribute

You can start by using the InstructLab CLI to experiment with local, quantized versions of two state-of-the-art models: IBM’s open-source Granite-7B model and its Merlinite-7B model (a Mistral-7B base model improved with IBM’s LAB method).

If you find gaps in the quantized models’ performance, you can craft skill recipes to fill them in. A recipe contains at least five examples of the target skill, expressed as question-and-answer pairs known as instructions.
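
A recipe is small enough to sketch directly. The snippet below builds an illustrative recipe in roughly the shape of the taxonomy's qna.yaml files; the field names are approximate, so check the project's contribution guide for the exact schema. It assumes the PyYAML package.

```python
# An illustrative skill recipe in roughly the shape of the taxonomy's
# qna.yaml files. Field names are approximate; check the project's
# contribution guide for the exact schema. Requires the PyYAML package.
import yaml

recipe = {
    "task_description": "Teach the model to write limericks.",
    "created_by": "your-github-username",
    "seed_examples": [  # at least five question-and-answer pairs
        {"question": f"Write a limerick about topic {i}.",
         "answer": f"(example limerick {i})"}
        for i in range(1, 6)
    ],
}

assert len(recipe["seed_examples"]) >= 5, "recipes need five or more examples"
print(yaml.safe_dump(recipe, sort_keys=False))
```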

Using a local version of InstructLab’s synthetic data generator, you can create instructions to align your own models, experimenting until they perform the target task. Once a recipe has been perfected, you can submit it as a pull request to the InstructLab taxonomy on GitHub, just like any other open-source contribution.
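
Before opening that pull request, a quick local sanity check can save a review round-trip. This hypothetical script follows the illustrative recipe format sketched above and verifies the five-example minimum:

```python
# A small pre-submission check: load a recipe file and verify it has the
# minimum five seed examples before opening a pull request. The filename
# and field names follow the illustrative recipe sketched above.
import sys
import yaml

def check_recipe(path: str) -> None:
    with open(path) as f:
        recipe = yaml.safe_load(f)
    examples = recipe.get("seed_examples", [])
    if len(examples) < 5:
        sys.exit(f"{path}: only {len(examples)} seed examples; need at least 5")
    for i, ex in enumerate(examples):
        if not ex.get("question") or not ex.get("answer"):
            sys.exit(f"{path}: example {i} is missing a question or answer")
    print(f"{path}: looks ready to submit")

if __name__ == "__main__":
    check_recipe(sys.argv[1])
```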

Project maintainers review the proposed skill, and if it meets community guidelines, the data is generated and used to fine-tune the base model. Updated versions of the models are then released back to the community on Hugging Face. IBM and Red Hat’s goal is to release new versions each week.

As InstructLab gets off the ground, maintainers at IBM and Red Hat will review and approve community submissions. Eventually, contributors who have earned maintainer status through their participation, per criteria laid out in the guidelines, will be able to approve submissions. All submitted skill recipes, and the data generated from them, will be posted to the InstructLab project.

IBM has dedicated Vela, its AI supercomputer, to updating the InstructLab models each week. As the project scales, other public models may be added. The Apache 2.0 license covers all data and code generated by the project, along with IBM’s Granite-7B model.

The power of open

Much of the internet relies on open-source software, including Linux and the Apache web server. Today, open-source software also powers smartphones through the Android operating system, and open-source implementations of the SSL cryptographic protocol secure billions of financial transactions each day.

Transparent, open-source software lends itself to systems that are more stable and secure. It also typically leads to faster, more predictable release cycles, and software with fewer safety and security flaws.

Open source encourages the kind of healthy competition that prevents one or two companies from monopolizing the industry. When everyone is allowed to participate, innovation thrives and costs to consumers typically drop. Generative language models that are collaboratively developed can bring some of the same benefits.

InstructLab provides the tooling for everyone to innovate, test, refine, and shape the future of AI. Each stage of the InstructLab pipeline has been designed for transparency. This is essential for creating trust among the people contributing to the project, and ultimately, the people who will be using the technology.

IBM and Red Hat have invested heavily in open-source software beyond Linux. Projects include PyTorch, Kubernetes, and the Red Hat OpenShift platform, which allows AI models to run quickly in any cloud environment. The push to make generative language models truly open source is just the latest instance of this tradition.

"This breakthrough innovation unlocks something that was next to impossible before — the ability for communities to contribute to models and improve them together.” said Máirín Duffy, software engineering manager of the Red Hat Enterprise Linux AI team.