InstructLab: An Open-Source Framework For Adding New Skills To LLMs
This article was created and provided for free through a partnership with IBM.
Why Is This Important?
I’ll build on IBM’s framing to explain why InstructLab is such a significant leap for foundation models. You don’t hire a generalist technical worker when you need a Java developer to work on backend Java. The requirements are specific to the work they’ll be doing.
Most foundational language models are trained as generalists and benchmarked against generalist evaluations. However, they aren’t reliable generalists. They make mistakes or fail altogether at specialist tasks. Foundational language models serve a wide breadth of intents and use cases with very shallow depth.
Fine-tuning was proposed as a solution, but most approaches haven’t proven effective. RAG and prompt engineering are more effective but have limits. Both are workarounds for improving specialist performance on specific tasks without fine-tuning. Unfortunately, they are expensive and complex to implement in a way that delivers efficient, reliable results.
InstructLab takes a generalist-to-specialist approach to foundational language model training, which solves a lot of problems.
What Is Possible That Wasn’t Before?
We implement LLMs like we hire employees but in a much more limited way. We need to understand what tasks the LLM can reliably do, and current benchmarks are too broad to be used in that way. An 89 on a coding benchmark tells me how well the LLM does on average compared to other LLMs.
It doesn’t help me decide which LLM will work best for my Java development tasks. This is where InstructLab adds something we’ve never had before. The approach works with most models. IBM has implemented the LAB enhancements on Granite (IBM’s foundational language model), Llama 2, and Mistral. All three are available on Hugging Face.
Models are retrained with task-specific data, and I’ll explain the mechanism in a minute. Each task is added to a list so we know where performance has been enhanced beyond generalist capabilities.
LAB formalizes a standard process for enhancing foundational language models with task-specific capabilities. Prior capabilities aren’t lost in the process, so the model can be incrementally enhanced over time. A large business with multiple data teams, a developer community, or an open-source project can align its efforts with the standards to improve LLMs collectively.
The task or skill list and baseline model are consistent starting points. With each iteration, a skill is improved or added to the list. As I’ll explain in a minute, data is the only requirement for creating a new skill.
How InstructLab Works Under The Hood
MIT and IBM Research partnered to publish the LAB (Large-scale Alignment for chatBots) method. It is not model-specific and has been applied successfully to multiple models. Even though IBM and Red Hat are the primary drivers of this open-source project, it isn’t constrained by either company. I’ll explain that more after the functional overview.
LAB solves a few problems with fine-tuning foundational models:
Scalability
High Level Of Effort To Build Human-Annotated Datasets
Shortcomings Of Current Fine-Tuning, Like Catastrophic Forgetting
LLMs are trained in two phases: self-supervised and supervised. The self-supervised phase is expensive due to the compute and dataset size requirements. The supervised phase is expensive due to task-specific data curation and labeling. The knowledge required for both phases is also a significant barrier for most companies.
LAB was built to address all three barriers and make instruction tuning more accessible. The LAB method has three main components.
Component 1: Taxonomy-Driven Data Curation
Taxonomies are tree structures that represent hierarchical relationships. A dog falls under the canine family. Dalmatians, Labrador retrievers, and poodles fall under the dog node. Taxonomies look like org charts but can represent a broad range of domains and relationships.
For skills, the top element is something like content creation. Marketing, engineering, and research are subcategories, or children, of content creation. Each requires different task-specific skills or capabilities while sharing many common skills.
Experts manually curate the skills taxonomy. When a contributor adds training data to the project, it’s classified by task or skill. New branches are created with one to three examples (instruction-response pairs), and datasets associated with that skill branch can be classified under it.
Models are pretrained with foundational skills: math, coding, language, and reasoning. These, along with general knowledge, sit at the taxonomy’s top level. Compositional skills form the taxonomy’s other layers and are incrementally added to the model through the fine-tuning datasets.
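To make the structure concrete, here is a minimal sketch of what a skill contribution could look like. The field names and directory path are illustrative assumptions, not the exact InstructLab schema; in the real project, contributions are YAML files placed at a leaf of the taxonomy tree.

```python
# Illustrative sketch only: field and path names are assumptions, not the
# exact InstructLab taxonomy schema. The real project stores contributions
# as YAML files at a leaf of the taxonomy tree.

# A compositional skill lives at a leaf of the tree, e.g.:
#   compositional_skills -> writing -> marketing -> product_announcements
skill_contribution = {
    "taxonomy_path": "compositional_skills/writing/marketing/product_announcements",
    "task_description": "Write short product announcement blurbs.",
    "seed_examples": [  # a handful of instruction-response pairs seed the branch
        {
            "question": "Announce a new API rate-limit increase in two sentences.",
            "answer": "We have doubled API rate limits for all paid plans. "
                      "No action is needed; the change is live today.",
        },
        {
            "question": "Announce a maintenance window for Saturday night.",
            "answer": "Scheduled maintenance runs Saturday 11 p.m. to 1 a.m. UTC. "
                      "Expect brief read-only periods during that window.",
        },
    ],
}

# The path doubles as a classification label: everything generated from
# these seeds is tagged with the same branch of the taxonomy.
print(skill_contribution["taxonomy_path"])
```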
IBM retrains the base model with publicly contributed skill enhancements every week. The open-source license (Apache 2.0) also allows businesses to keep their enhancements private and use the enhanced model for products or internal use.
Component 2: Large-Scale Synthetic Data Generation
Two concerns arise from this approach. The first is training data leaking from the enhanced model. No business will contribute to the project if there’s a possibility of that happening. To avoid this, the curated training dataset is used as a seed to create a much larger synthetic dataset.
That leads to another concern: data quality. Low synthetic data diversity and inaccurate seed examples can cause the enhanced model to perform poorly in the task domain. For LAB, the taxonomy replaces random sampling, which the researchers show improves synthetic dataset diversity. Four prompt templates are used to generate and validate the synthetic dataset.
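As a rough illustration of how the seeds drive generation, here is a minimal sketch of taxonomy-guided synthetic data generation. The `teacher_generate` and `teacher_judge` functions are hypothetical placeholders for teacher-model calls, and the prompts are simplified stand-ins for LAB’s actual templates.

```python
# Minimal sketch of taxonomy-guided synthetic data generation.
# teacher_generate and teacher_judge are hypothetical placeholders for
# calls to a teacher model; the real LAB prompt templates differ.

def teacher_generate(prompt: str) -> str:
    """Placeholder: call a teacher LLM and return its completion."""
    raise NotImplementedError

def teacher_judge(prompt: str) -> bool:
    """Placeholder: ask the teacher LLM to validate a candidate pair."""
    raise NotImplementedError

def synthesize(leaf_node: dict, num_samples: int = 100) -> list:
    """Expand a few seed pairs at a taxonomy leaf into a larger dataset."""
    synthetic = []
    seeds = leaf_node["seed_examples"]
    while len(synthetic) < num_samples:
        # Generation prompt: show the seeds, ask for a new, similar pair.
        gen_prompt = (
            "Here are example instruction-response pairs for the task "
            f"'{leaf_node['task_description']}':\n"
            + "\n".join(f"Q: {s['question']}\nA: {s['answer']}" for s in seeds)
            + "\nWrite one new, different pair in the same format."
        )
        candidate = teacher_generate(gen_prompt)

        # Validation prompt: a second pass filters low-quality or unsafe pairs.
        if teacher_judge(f"Is this pair accurate, safe, and on-task?\n{candidate}"):
            synthetic.append({"taxonomy_path": leaf_node["taxonomy_path"],
                              "pair": candidate})
    return synthetic
```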
The paper provides a deeper dive into the specifics of creating synthetic training datasets and how they are validated for safety and ground truth.
Component 3: Iterative, Large-Scale Alignment Tuning
The last component retrains the foundational language model with the synthetic dataset. This process has two phases. In the first phase, the model is trained on data classified into the taxonomy’s knowledge and foundational skills categories. The model is first trained on samples with short responses and then on samples with long responses.
The second phase trains the knowledge-tuned model on data classified into the compositional skills branches. There’s more granularity to these steps, and I will, again, leave that to the original paper.
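Still, a minimal sketch of the schedule helps show the ordering. The `fine_tune` function below is a hypothetical placeholder for a supervised fine-tuning run, and the short/long split by word count is a simplification of how the data is staged.

```python
# Minimal sketch of the two-phase alignment schedule described above.
# fine_tune is a hypothetical placeholder for a supervised fine-tuning run;
# the real LAB recipe has more steps (see the paper).

def fine_tune(model, samples, label):
    """Placeholder: run supervised fine-tuning and return the updated model."""
    print(f"tuning on {len(samples)} samples: {label}")
    return model

def lab_alignment(model, dataset, short_max_words=200):
    # Split the synthetic dataset by top-level taxonomy branch.
    phase1 = [s for s in dataset
              if s["branch"] in ("knowledge", "foundational_skills")]
    phase2 = [s for s in dataset if s["branch"] == "compositional_skills"]

    # Phase 1: knowledge and foundational skills, short responses before long.
    short_responses = [s for s in phase1
                       if len(s["answer"].split()) <= short_max_words]
    long_responses = [s for s in phase1
                      if len(s["answer"].split()) > short_max_words]
    model = fine_tune(model, short_responses, "phase 1a: short responses")
    model = fine_tune(model, long_responses, "phase 1b: long responses")

    # Phase 2: compositional skills, trained on top of the knowledge-tuned model.
    model = fine_tune(model, phase2, "phase 2: compositional skills")
    return model
```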
What About RAG And The Path To Production?
LAB doesn’t replace RAG, and you could achieve similar results for granular tasks with either method. The challenge with RAG is that it doesn’t support an open-source paradigm where multiple contributors incrementally enhance foundational language models. LAB can also be combined with RAG. Using RAG on a skill, knowledge, or task domain that’s been enhanced with LAB “supercharges” RAG’s impact on reliability.
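As a rough sketch of how the two combine, the retrieval layer supplies the business- or customer-specific context while the LAB-enhanced model supplies the task-domain competence. The `retrieve` and `enhanced_model_generate` functions below are hypothetical placeholders for any vector store and any LAB-tuned checkpoint.

```python
# Minimal sketch of layering RAG on top of a LAB-enhanced model.
# retrieve and enhanced_model_generate are hypothetical placeholders;
# any vector store and any LAB-tuned checkpoint could stand in for them.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the top-k passages from a business knowledge base."""
    raise NotImplementedError

def enhanced_model_generate(prompt: str) -> str:
    """Placeholder: call a model already LAB-tuned for the task domain."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # Retrieval supplies the business- or customer-specific context ...
    context = "\n".join(retrieve(question))
    # ... while the LAB-enhanced model supplies the task-domain competence.
    prompt = f"Use the context to answer.\nContext:\n{context}\n\nQuestion: {question}"
    return enhanced_model_generate(prompt)
```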
That’s the more significant opportunity for businesses. Once a model is enhanced for a domain that’s connected to a use case, the business can add RAG for business or customer-centric personalization. According to Gartner, only 10% of businesses have taken GenAI prototypes into the release phase. InstructLab simplifies the process, so we should see that number grow significantly.
IBM has integrated InstructLab into watsonx.ai. The interface is designed to accelerate AI application development and simplify app deployment to production. Businesses need platforms that function as acceleration layers. Companies like IBM are jockeying for position and working hard to explain how their platforms accelerate maturity and AI product delivery.
Will watsonx.ai And InstructLab Move The Needle On Enterprise AI Maturity?
At Think 2024, IBM positioned InstructLab and watsonx.ai as centerpieces of its open AI strategy. Both are central to the company’s longer-term ambitions for supporting multi-agent implementations. Did IBM’s customers walk away with an appreciation for all the problems both pieces solve?
Technical leaders did. They got it and were engaged. Line-of-business leaders are still learning. They walked out of many keynotes on the implementation layer and still have a ways to go. That’s why selling enterprise platforms is a three-body problem.
Platforms target users and explain how the features make their roles easier. Technical users and leaders got the message loud and clear. That’s a mission accomplished on the first body.
Senior leaders understood Generative AI’s potential for productivity gains. However, they don’t understand how the platform enables that or why these new components accelerate maturity. That’s where the second body tuned out, and they write the checks.
Upstream and downstream consumers of GenAI apps don’t understand how these components will deliver value to them. The third body sits at two ends of the spectrum: GenAI is all hype, or GenAI is about to replace me. Most use copilots and content summary tools that don’t require InstructLab or watsonx.ai. Yet the third body holds the trusted-advisor role on platform purchase decisions.
Like most tech companies, IBM has won over users and technical leadership. It has sold the C-suite on outcomes but now must educate them on why those outcomes require InstructLab and watsonx.ai.
IBM has the best platform approach to accelerating AI maturity and helping customers navigate the multi-technology landscape. However, the gap between technology and value creation remains unfilled in most customers’ minds. IBM has positioned its consulting arm as one solution, and it’s effective for customers who choose to take advantage of it.
InstructLab presents a viable path forward for them. It’s too late for businesses in early maturity phases to navigate the journey using trial and error or endless proofs of concept. They need accelerators to survive. The question is, can companies like IBM educate enterprise customers who don’t realize how quickly they are falling behind?