NVIDIA’s CEO recently said that Generative AI enables everyone to be a programmer. It’s a great sentiment, but how close are we to that reality? We are finally getting more insights into LLM functionality through structured testing and validation.
This paper deserves more attention than it's getting. I posted an overview on LinkedIn, but here, I want to explain why this work is so insightful. Data scientists from Ericsson evaluated ChatGPT 3.5 on 10 common software engineering tasks. While this is an essential review of LLM capabilities, there's more to take away from this paper about deploying high-reliability models.
The methodology and applications for model validation are the bigger story. Most data science teams struggle with model reliability and with ensuring implementations meet user needs. LLMs have significant potential and can serve multiple enterprise use cases. The challenge is identifying and validating that the model's performance meets user and customer reliability requirements.
While this article covers LLMs, the content generalizes to any complex model-supported product. In this article, I will explain:
How to create reliability requirements for use cases.
How to develop test plans and reduce the level of effort required to execute them.
I will provide some guidance on automation. This study was done manually, and the authors call out how labor-intensive the process was. Without automation, these validations are not cost-effective and add too much overhead to projects.
I will begin by covering the authors' test areas and provide some insights along the way. This post is lengthy, running to the length of a book chapter, but it's worth reading because so little has been published on model QA vs. statistical validation. The two concepts are entirely different.
Method Name Suggestion
ChatGPT suggested the correct method name for 9/10 methods. The test covers the model’s ability to implement best practices for writing maintainable code. For this to be a more comprehensive assessment, more test cases should be developed to cover multiple best practices.
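To make the task concrete, here's a hypothetical example of the kind of case such a test uses. The function and candidate names below are my own illustration, not examples from the paper.

```python
from dataclasses import dataclass

@dataclass
class Order:
    total: float
    status: str

# The model sees the body with the real name hidden and must propose one.
def placeholder(orders: list[Order]) -> float:
    return sum(o.total for o in orders if o.status == "completed")

# A pass for maintainability is a descriptive name such as
# "sum_completed_order_totals"; a vague name like "calc" would fail.
```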
Log Summarization
ChatGPT generated highly informative summaries without being overly verbose in 10/10 tests. This is an area where ChatGPT is known to thrive; previous work has demonstrated the model's capabilities with this task category. As a result, tasks in this category don't require rigorous validation, and data teams can reduce total testing time by taking advantage of opportunities like these.
Commit Message Generation
ChatGPT generated the correct commit message 7/10 times. In the other 3, ChatGPT generated an accurate but overly verbose message. Here’s where testing gets tricky. Failures are not necessarily signs of low reliability. In cases like these, the cost of minor imperfections is much lower than the cost of manually completing the tasks.
When I define tests for models, I implement a scoring rubric for each test category. The tiers I use are exceeds human level, human level, acceptable performance, and fails to meet reliability or stability requirements. This process is time-consuming because we must either create multiple labels for each test or manually review each result.
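Here's a minimal sketch of how that rubric might be encoded in a test harness. The tier names come from the process above; the thresholds and function are my own assumptions, not a standard.

```python
from enum import Enum

class ReliabilityTier(Enum):
    EXCEEDS_HUMAN = 4  # outperforms expert reviewers
    HUMAN_LEVEL = 3    # matches expert reviewers
    ACCEPTABLE = 2     # usable with human oversight
    FAILS = 1          # misses reliability or stability requirements

def score_category(pass_rate: float, human_baseline: float,
                   acceptable_floor: float = 0.6,
                   tolerance: float = 0.05) -> ReliabilityTier:
    """Map a test category's pass rate to a tier. Each use case should
    supply its own baseline and floor from its reliability requirements."""
    if pass_rate > human_baseline + tolerance:
        return ReliabilityTier.EXCEEDS_HUMAN
    if pass_rate >= human_baseline - tolerance:
        return ReliabilityTier.HUMAN_LEVEL
    if pass_rate >= acceptable_floor:
        return ReliabilityTier.ACCEPTABLE
    return ReliabilityTier.FAILS

# Example: 7/10 against a 90% human baseline lands in ACCEPTABLE.
print(score_category(0.7, human_baseline=0.9))
```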
Duplicate Bug Report Detection
ChatGPT correctly detected 6/10 report pairs. Good, but not reliable enough to run independently. However, this use case can be met if a human remains in the loop. There are two levels of model-supported products. The first fits the human-machine teaming paradigm, and this is an example. The model augments the software engineer or QA engineer but does not replace them. ROI comes from improved productivity. Most use cases will fit this paradigm.
Merge Conflict Resolution
ChatGPT was only successful 6/10 times. Let’s be honest, most of us are only marginally better. Complex merge conflicts typically require both developers to work together. If it’s possible to put a validation step in place, this use case fits the human-machine teaming paradigm and might be a candidate for implementation.
Validation steps are an excellent way to get value from performance levels like these. A model that handles 60% of merge conflict resolution can deliver time savings for software engineering teams. However, those gains might be offset if a software engineer must manually review each one. If there’s a way to flag failure cases successfully, productivity gains can be realized.
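As a sketch of what that flagging step could look like: auto-apply a model-proposed resolution only when automated signals clear it, and queue everything else for review. The confidence score and compile check here are placeholder signals I'm assuming, not anything from the paper.

```python
def route_merge_resolution(confidence: float, compiles: bool,
                           threshold: float = 0.8) -> str:
    """Gate a model-proposed merge resolution.

    `confidence` is assumed to come from a scoring step (a second model,
    heuristics, etc.) and `compiles` from running the build; swap in
    whatever signals your pipeline actually produces.
    """
    if compiles and confidence >= threshold:
        return "auto-apply"    # productivity gain captured
    return "human-review"      # failure cases flagged, not shipped
```

The gains survive only if this step catches most of the failure cases; proving that is exactly what the validation work has to do.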
Anaphora Resolution (AKA Resolving ambiguous requirements language)
Well, folks, we’re in trouble. ChatGPT resolved 10/10 with human-level accuracy. While this sounds impressive, it’s an example of functionality that’s disconnected from a practical application.
For this to be a valid test, it would need another step. Once the ambiguity is resolved, can the model implement the clarified requirement? Workflows must be mapped to support full validation. I’ll get into workflows later in this article.
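A minimal sketch of what a full-workflow test might look like, assuming a hypothetical `ask_model` wrapper around whatever LLM client you use:

```python
def ask_model(prompt: str) -> str:
    """Placeholder for your LLM client call; swap in the real API."""
    raise NotImplementedError

def validate_requirement_workflow(ambiguous_requirement: str,
                                  acceptance_test) -> bool:
    """Step 1: resolve the ambiguity. Step 2: implement the clarified
    requirement. Only the end-to-end result counts as a pass."""
    clarified = ask_model(
        f"Rewrite this requirement without ambiguity:\n{ambiguous_requirement}"
    )
    code = ask_model(f"Implement this requirement in Python:\n{clarified}")
    return acceptance_test(code)  # e.g., run the code against unit tests
```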
Code Reviews
Never mind, we’re still needed. ChatGPT only provided a quality code review in 4/10 trials. While there is some value and potential for productivity gains, I don’t trust a model that cannot meet the coin flip reliability standard. If it cannot deliver at least 50% accuracy, I immediately question how well the 40% will play out in a production environment.
There’s a fundamental limitation on test case fidelity. Test cases are always more controlled and stable than real-world conditions. Expect production performance to fall below test levels.
Type Inference Suggestion
ChatGPT succeeded in 7/10 cases. Not bad, but it still requires significant human oversight and review. This test is another example of a disconnect from workflows. It’s interesting but irrelevant to a product that delivers value. Test cases like these can quickly bloat validation time.
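For context, type inference tests typically show the model untyped code and ask for annotations. Here's a hypothetical before-and-after of my own, not a case from the paper:

```python
from typing import Sequence

# Untyped code shown to the model:
def average(values):
    return sum(values) / len(values)

# A passing suggestion annotates it along these lines:
def average_typed(values: Sequence[float]) -> float:
    return sum(values) / len(values)
```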
Code Generation from Natural Language for Data Frame Analysis
They provided a tabular data set to ChatGPT and asked it to generate Python code for 10 common data analysis and exploration tasks. It generated the correct code for 8/10.
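To make this concrete, here's the kind of pandas code such prompts tend to yield. The data and tasks are my reconstruction of typical EDA requests, not the paper's actual prompts or outputs.

```python
import pandas as pd

# Stand-in for the tabular data set given to the model.
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales": [120.0, 90.5, 210.0, 75.25],
})

# Typical natural-language asks: "summarize the numeric columns",
# "count rows per region", "find the top region by total sales".
print(df.describe())
print(df["region"].value_counts())
print(df.groupby("region")["sales"].sum().idxmax())
```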
This test case is an excellent example of a complete workflow validation that’s connected to a use case. Non-developers can use ChatGPT to support basic exploratory data analysis. Multiple platforms (IBM’s Watsonx and Microsoft’s Fabric) have already implemented this use case.
Defect or Vulnerability Detection
ChatGPT caught 4/10 vulnerabilities. All 10 examples were in C with pointer access and indirection. I'm not sure how well most developers would perform in a similar scenario, and I'm not volunteering to be the test subject.
Test cases should include multiple difficulty levels but remain true to the difficulty of the business's current applications. If people don't succeed at high rates, the use case might not be a good candidate for automation. At the same time, these use cases are the highest-value opportunities.
The purpose of advanced models is to bring new domain knowledge into the business, and targeting use cases where people perform poorly fits that purpose. Products that do this achieve the highest adoption levels and the greatest productivity impacts. However, they also carry the highest burden of proof and validation.
Duplicate Code Detection
ChatGPT identified 6/10 clones, which is odd because IntelliJ goes ballistic when I duplicate code. The authors dug into the failures and discovered their labeling was wrong in all 4 of the instances initially considered failures. So ChatGPT understood the code better than the human labelers did.
Label consistency and quality are significant challenges to overcome. Here’s where testing can become especially challenging. The more labeling that’s required, the more expensive testing becomes. I’ll cover some approaches that reduce costs in the next section.
Returning to the purpose of models, this is another high-value use case because model quality exceeds human performance. It can catch mistakes that experts make and will provide significant productivity boosts.
Unit Testing
I think we all hope this is a 10/10, but unfortunately, ChatGPT only developed an appropriate unit test in 6/10 cases. It looks like we’re still unable to offload building our own unit tests entirely. I, for one, cried a little bit when I read the results and hope that ChatGPT 4.0 performs better.
However, this is another example of performance that’s sufficient to recommend unit tests for the software engineer’s evaluation. It will deliver productivity gains in this use case.
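The flow in practice: the model drafts the test and the engineer reviews it before merging. A hypothetical model-drafted pytest case might look like this (my illustration, not an artifact from the paper):

```python
# Function under test:
def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))

# Model-drafted tests the engineer reviews, fixes, or discards:
def test_clamp_within_bounds():
    assert clamp(5, 0, 10) == 5

def test_clamp_outside_bounds():
    assert clamp(-3, 0, 10) == 0
    assert clamp(42, 0, 10) == 10
```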
Extract Method Refactoring
They chose a subjective task here, and ChatGPT’s refactors did not match the human versions. However, upon inspection, the authors believe the refactors met their standards for code quality and showed improvement over the original version.
This is another high-value use case and an expensive test to run. Subjective tasks are high-value and require complex validation.
Natural Language Code Search
The authors provided ChatGPT with a text description and code snippet. They asked ChatGPT to tell them if the code satisfied the description. Essentially, the model evaluated what would normally be part of the documentation or comments (ChatGPT gets comments!? Why didn’t I get those when I inherited someone’s leftover spaghetti?) to see if it matched the corresponding code.
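A hedged sketch of how this yes/no check can be scored against labels, again assuming a hypothetical `ask_model` wrapper for your LLM client:

```python
def ask_model(prompt: str) -> str:
    """Placeholder LLM call; swap in your real client."""
    raise NotImplementedError

def matches_description(description: str, snippet: str) -> bool:
    answer = ask_model(
        "Does this code satisfy the description? Answer yes or no.\n"
        f"Description: {description}\nCode:\n{snippet}"
    )
    return answer.strip().lower().startswith("yes")

def accuracy(labeled_pairs) -> float:
    """labeled_pairs: iterable of (description, snippet, expected_bool)."""
    results = [matches_description(d, s) == expected
               for d, s, expected in labeled_pairs]
    return sum(results) / len(results)
```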
It initially scored a 7/10, but the authors again found issues with the questions and labels. After further review, they revised the score to 8 or 9/10. A pattern is emerging. The human-machine teaming paradigm will make software engineers more accurate and reduce errors.
Test Case Prioritization
Watch out, Project Managers and QA Engineers, ChatGPT is coming for your jobs too, but not for a while. ChatGPT could only handle simple prioritization based on past defect discovery, and even then, its reasoning was inaccurate, according to the authors.
LLMs are not very good at decision-making. These models can create task lists, but making decisions with multiple alternatives is currently beyond LLMs’ capabilities. The use case requires a model that can evaluate, predict, and prescribe. Successfully meeting this use case category typically requires multiple models and large training data sets.
Code and Algorithmic Efficiency (AKA LeetCode)
ChatGPT successfully solved all the problems the authors gave it. This use case is another novelty with low value. It plays into ChatGPT’s strengths, and performance will not generalize beyond questions the model has seen before.