Self-improvement, also known as self-healing, is the most exciting, promising, and dangerous direction for the future of large language models. The emerging multimodal side of these models opens up even greater potential for constructs like self-healing. What will surprise many of you is that this type of machine learning isn't new. It's one of the oldest parts of the field, and as early as 2009 it was one of the most exciting areas of reinforcement learning research.
In this article, I will explain self-healing, how it works, its applications, and why some of those applications are dangerous.
Understanding Actors And Critics
OpenAI showed an example of response healing during the initial GPT-4 demo. Through the playground environment, the model was prompted to write code to interact with Discord users. There were mistakes in the initial response, and when the presenter attempted to run the Python script, it threw an error. In the next prompt, the presenter gave GPT-4 the error message and asked the model to fix the code.
The subsequent response resolved the initial error. However, the output code still had an issue and hit a new error caused by a library change made after 2021. GPT-4's training dataset ended in 2021, so the model didn't know about changes to Python libraries made after that date. The presenter included the documentation for the updated class along with the error message in the next prompt and asked GPT-4 to fix the error. The final output ran without issue.
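That manual loop is easy to sketch in code. Below is a minimal illustration using the OpenAI Python client; the prompts, the example error message, and the documentation placeholder are all invented for illustration, since the actual demo ran through the playground UI rather than the API:

```python
# A sketch of the demo's repair loop: feed the runtime error (and any
# relevant documentation) back to the model as the next conversation turn.
# Assumes the `openai` package (v1+); prompts and error text are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user",
             "content": "Write a Python script that replies to Discord users."}]
reply = client.chat.completions.create(model="gpt-4", messages=messages)
code = reply.choices[0].message.content

# ...we run the generated script by hand, and it throws an error...
error_text = "AttributeError: 'Intents' object has no attribute ..."  # hypothetical
updated_docs = "<paste the post-2021 documentation for the changed class here>"

# The critic's feedback becomes the next prompt.
messages += [
    {"role": "assistant", "content": code},
    {"role": "user",
     "content": (f"Running that script raised:\n{error_text}\n\n"
                 f"Here is the current documentation:\n{updated_docs}\n\n"
                 "Please fix the code.")},
]
reply = client.chat.completions.create(model="gpt-4", messages=messages)
fixed_code = reply.choices[0].message.content
```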
This cycle demonstrates the concept of prompt healing, where an initially faulty response is repaired through subsequent prompts. We used to call these actor-critic systems. In this case, the generative model is the actor. The presenter was the critic. When the actor makes a mistake, the critic points out the problem. The actor learns from that feedback and works to fix it. It's a recipe for rapid improvement.
In reinforcement learning, the goal is to optimize a policy. Models exist in an environment, and GPT-4’s environment is exceptionally simple. The environment begins at a baseline state defined by the user’s first prompt. GPT-4 uses its pretrained policy to predict a distribution of possible responses based on the prompt, or initial state. It selects the most likely tokens one at a time and strings them together into a complete response.
The user is the critic and provides a value function that evaluates the policy’s loss. The loss is the difference in value between the perfect response and what the actor actually produced. The actor takes that feedback and improves its policy to do better on the next trial.
The easiest way to visualize this is a small RC car learning to navigate a racetrack. The goal is to run the course as quickly as possible. When the actor runs its first trial, there’s no telling what it might do. The model could hit the gas and run directly into a wall. The critic would give this policy a score of 0 because the actor crashed and failed to navigate the course. Over tens of thousands of trials, the actor eventually learns the course, and the car navigates it successfully.
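Here is a minimal sketch of that trial-and-error loop: a toy one-dimensional "track," a tabular actor that keeps action preferences, and a critic that scores states. The track layout, rewards, and learning rates are all invented for illustration, and the actor update is a simplified stand-in for the exact policy-gradient rule:

```python
import math
import random

# Toy "racetrack": each cell is a straight ('S') or a curve ('C').
# Driving FAST on a curve crashes the car; SLOW is always safe but costs time.
TRACK = list("SSCSCS")
SLOW, FAST = 0, 1

prefs = [[0.0, 0.0] for _ in TRACK]    # actor: action preferences per state
values = [0.0 for _ in TRACK]          # critic: estimated value per state
ALPHA_ACTOR, ALPHA_CRITIC, GAMMA = 0.1, 0.2, 0.95

def choose(state):
    """Sample an action from a softmax over the actor's preferences."""
    m = max(prefs[state])
    exps = [math.exp(p - m) for p in prefs[state]]
    return FAST if random.random() < exps[FAST] / sum(exps) else SLOW

for trial in range(20_000):
    state = 0
    while True:
        action = choose(state)
        if action == FAST and TRACK[state] == "C":
            reward, next_value, done = -10.0, 0.0, True   # crashed into the wall
        elif state + 1 == len(TRACK):
            reward, next_value, done = 10.0, 0.0, True    # finished the course
        else:
            reward = -0.5 if action == FAST else -1.0     # speed saves time
            next_value, done = values[state + 1], False

        # TD error: the critic's feedback on how the actor's choice worked out.
        td_error = reward + GAMMA * next_value - values[state]
        values[state] += ALPHA_CRITIC * td_error        # critic refines its estimate
        prefs[state][action] += ALPHA_ACTOR * td_error  # actor shifts its policy

        if done:
            break
        state += 1

# Expect SLOW on the two curves and FAST on the straights (the final
# cell is a tie, since both actions finish the course from there).
print(["FAST" if p[FAST] > p[SLOW] else "SLOW" for p in prefs])
```

Early trials crash almost immediately; over many trials the critic's feedback pushes the actor toward slowing down on the curves and speeding up everywhere else.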
GPT-4 Creates A Literal Actor And Shows Off Its Capabilities
GPT-4 is remarkably good at learning to improve based on a critic’s feedback. It can do this across a broad range of tasks. I prompted the model to describe a character who is similar to Arya Stark from the show Game of Thrones but lives 100 years in the future. The response was exceptionally detailed.
The model described someone who lived in a post-apocalyptic future world with humanity on the brink of extinction. The character description included physical appearance, clothing style, personality, backstory, skills, and a brief story arc describing her journey. In my next prompt, I requested that GPT-4 generate a new response that did not include the post-apocalyptic world theme.
The subsequent response had a similar character and the same developed categories. This time, the future world was an advanced technological society. The character’s details and journey adapted to the new world description. The journey was twice as detailed as in the original response.
I had slightly changed the original environment, and GPT-4 successfully adapted on the first try. Next, I told the model that its response wasn’t very entertaining and asked GPT-4 to develop a more interesting character. The third response was a significant improvement.
This time the backstory included high-tech espionage and the reappearance of the lost city of Atlantis. The language used in the response was richer, with more descriptive adjectives. The journey had a sense of urgency, and new characters were introduced. The model adapted its policy well based on the new information.
The first prompt did not contain everything I wanted. The initial response would have been better if I had specified the entertainment and no post-apocalyptic world requirements upfront. However, at that point, I didn’t know I needed to. The model can’t read my mind and relies on the environment I create with my prompt to respond appropriately.
Through each prompt, I taught GPT-4 more about my value function. GPT-4 didn’t really improve its policy. The model improved its understanding of the environment. In the first prompt, GPT-4 deployed a policy with incomplete information and selected new policies as more information was provided. However, we know OpenAI is improving GPT-4 based on interactions like these. The policy will improve with more information about the environmental variations GPT-4 encounters.
Model Heal Thy Self
Self-healing is when this process is done without human intervention. The first implementation I ran into is a GitHub repository called Wolverine. It's a basic script that connects to GPT-4 with the goal of automatically fixing errors in Python code. Like in the OpenAI demo, errors from running the Python code are fed back to GPT-4 to be fixed. Wolverine automates the feedback loop.
In the demonstration video, the creator of Wolverine built a calculator application and intentionally included errors. Each time the code ran and returned an error, Wolverine sent that error in a subsequent prompt for the model to resolve. The updated code was placed into the project and rerun. The process repeated each time a new error surfaced, until none occurred.
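The loop itself is only a few lines. The sketch below is not Wolverine's actual source, just an illustration of the same pattern; the prompt wording and model name are assumptions, and a real version would also strip markdown fences from the model's reply before saving it:

```python
# A sketch of the Wolverine-style loop: run the script, capture the
# traceback, ask the model for a fix, save it, and rerun until clean.
import subprocess
import sys
from pathlib import Path

from openai import OpenAI

client = OpenAI()

def heal(script: str, max_attempts: int = 5) -> bool:
    path = Path(script)
    for _ in range(max_attempts):
        result = subprocess.run([sys.executable, str(path)],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the critic (the Python interpreter) is satisfied

        # Feed the source plus the traceback back to the actor.
        prompt = ("This Python script raised an error. Return only the "
                  "corrected source code, with no commentary.\n\n"
                  f"--- script ---\n{path.read_text()}\n"
                  f"--- traceback ---\n{result.stderr}")
        reply = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}])
        path.write_text(reply.choices[0].message.content)
    return False

if __name__ == "__main__":
    print("fixed" if heal(sys.argv[1]) else "gave up")
```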
You can see the actor-critic pattern playing out again. I write an original piece of code and attempt to run it. The critic, in this case, is the compiler or runtime environment: it assigns a value to my initial response to the task I was given, which was to write code that accomplishes a goal. GPT-4 recommends fixes, again filling the actor’s role. Each time it submits a response, the critic tells it what’s still wrong.
Software developers do this for a living. However, doing this manually takes a whole lot longer. It’s not just software engineers. Workers get paid to perform continuous improvement cycles on their work products throughout their professional careers. For many of us in the engineering world, this is precisely what we get paid to do. Nothing we create is ever perfect the first time. We iterate, sometimes alone, sometimes in teams, until we develop a finished product that meets quality, reliability, and functional standards.
If that iterative process is no longer necessary, that changes things. One of the biggest criticisms of GPT-4 centers on hallucination. The model’s policy for many environments, or prompts, is initially flawed. As a result, some of the code it spits out isn’t exactly functional, and the model can’t work without a human in the loop closely supervising the outputs. With the potential to automate both the actor and critic roles, that may change.
The potential here is for me to prompt GPT-4, and it handles not only the initial response but also the continuous improvement cycles. If it hallucinates, the critic provides feedback that catches the hallucination and works to improve the actor’s policy. As a result, self-healing large language models don't need to get things right the first time. All they need is a critic to help them fix the problems before they respond.
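A minimal sketch of what that fully automated cycle could look like, with the same model playing both roles. The prompts, the "OK" convention, and the round limit are all assumptions for illustration:

```python
# A sketch of a self-healing loop where the model is both actor and critic.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return reply.choices[0].message.content

def self_heal(task: str, rounds: int = 3) -> str:
    answer = ask(task)                      # actor: first attempt
    for _ in range(rounds):
        critique = ask(                     # critic: the same model grades it
            f"Task: {task}\n\nDraft answer:\n{answer}\n\n"
            "List any factual errors, bugs, or unsupported claims. "
            "If there are none, reply with exactly: OK")
        if critique.strip() == "OK":
            break                           # the critic found nothing to fix
        answer = ask(                       # actor: revise using the feedback
            f"Task: {task}\n\nDraft answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\n"
            "Rewrite the answer, fixing every issue the critique raises.")
    return answer
```

The obvious caveat in this design is that the critic is the same model that produced the mistakes, so it can miss the very hallucinations it is supposed to catch, which is part of why this direction is as dangerous as it is promising.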