The dirty little secret behind the Generative AI boom is that LLMs are a giant pain to work with. They work remarkably well in prototypes, and developing one doesn’t take much work. Getting a production-facing LLM to service an enterprise use case is much more challenging. Nothing you’ll learn in the prototype phase will prepare you to build for production.
What makes this even more challenging is the lack of quality documentation. Hugging Face, OpenAI, Microsoft, and Amazon have documentation explaining how to use their LLMs and platforms. However, the documentation has canyon-sized gaps when it comes to the customization necessary to support enterprise use cases.
Prompt Engineering Is 95% BS, But There Are Quality Resources
Prompt engineering is the worst offender. Most prompt engineering content doesn’t work except in the authors’ highly controlled conditions. Often, it’s impossible to reproduce their results entirely. Over the last 3 months, it’s become apparent that most people developing content about prompt engineering have no product experience.
Anthropic is documenting best practices and developing a library of prompts for its customers. This job description gives you a bit of insight into that effort.
There are some reliable sources for prompt engineering tips.
A Hugging Face contributor has a chatbot that lets you ask questions about prompt engineering. I haven’t explored it deeply yet, so for now I’d treat it as a novelty rather than a fully vetted source.
What I’ve Learned Over The Last 8 Months
Here’s my experience. Baseline, default, or seed prompts only work for a tiny user base with high domain knowledge. That’s why many apps scale to around 10K users so quickly and then stall. I’ve talked with several legitimate LLM app creators, and they all see the same challenge. Products start to experience significant churn in months 3 or 4.
When they dig into the issue, the culprit is inconsistent prompt performance. While their die-hard core users don’t care, the segments that adopt later are less tolerant of inconsistency and poor user experience. This is an extension of the explainability challenge many model-supported apps encounter. Users want stability and only accept the model’s instability if they understand why it’s happening.
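To make “inconsistent prompt performance” concrete, here’s a minimal sketch of one crude way to catch it before users do: run the same prompt repeatedly and measure how often the answers agree. The generate callable is hypothetical, a stand-in for whatever model backend an app wraps, and exact-string matching is only a rough proxy for real consistency.

```python
from collections import Counter

def consistency_report(generate, prompt: str, runs: int = 20) -> float:
    """Return the share of runs that produced the single most common response.

    `generate` is a hypothetical callable wrapping the model backend.
    Exact-string matching is crude; a semantic-similarity check would be fairer,
    but this catches the worst session-to-session drift.
    """
    responses = [generate(prompt) for _ in range(runs)]
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / runs

# Hypothetical usage: flag prompts whose answers agree less than 80% of the time.
# if consistency_report(generate, "How much PTO do I have accrued after 2 years?") < 0.8:
#     print("This prompt needs work before it ships.")
```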
Reproducibility of initial results and behaviors is vital because, in enterprise settings, we need consistency for some use cases and creativity for others. Creativity fits one category of human-machine tooling use cases, and consistency serves another.
“How much PTO do I have accrued after 2 years?” That’s a question we need consistency to serve.
“Recommend 5 email subject lines to catch {customer} attention and get them to read more.” That prompt requires more creativity.
Getting LLMs to display creativity vs. consistency involves finetuning and retraining, which I will cover in the next article. Either way, the LLM’s behavior, whether consistent or creative, must hold steady across users and sessions. Users abandon black-box solutions like LLMs after a month or less if they don’t feel in control. Inconsistency, low reliability, or a lack of understanding leads to churn.
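Finetuning and retraining are the durable fix, but a quick sketch shows what the two categories demand from an app. Assuming the OpenAI Python SDK and a hypothetical model name, the same client can route each prompt type with different decoding settings: a low temperature and a fixed seed when the answer must be repeatable, a higher temperature when variety is the point. Treat this as an illustration of the distinction, not a substitute for the finetuning work.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, deterministic: bool) -> str:
    """Route a prompt with decoding settings chosen for consistency or creativity."""
    extra = {"seed": 42} if deterministic else {}  # seed gives best-effort reproducibility
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice; any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0 if deterministic else 0.9,  # low = repeatable, high = varied
        **extra,
    )
    return response.choices[0].message.content

# The PTO question needs the same answer in every session.
print(ask("How much PTO do I have accrued after 2 years?", deterministic=True))

# The subject-line prompt benefits from variety.
print(ask("Recommend 5 email subject lines to catch {customer} attention and get them to read more.",
          deterministic=False))
```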
Most prompt engineering frameworks can’t deliver the control users seek and don’t address the human-machine interaction side of the use case. The default prompts running many LLM prototypes fail over time or as usage scales. What worked well in early implementations began to show cracks as more users adopted the tools. The tools didn’t fall apart or fail outright; the inconsistencies hurt usability enough that nontechnical users abandoned them.
I encountered several hidden issues earlier this year while developing my GPT Monetization Strategy course. I had planned to release the course in January but had to update several lessons based on what I was learning. I didn’t release the near-term opportunities section until April because there was so much volatility and uncertainty.
Training Users To Be Prompt Engineers Works Well
Prompt engineering and baseline, default, or seed prompts are oversold unless we train users to be mid- to expert-level prompt engineers. Then prompt engineering becomes a remarkably powerful approach. People who work with LLM-supported tools daily become adept at getting what they need from them.
People are creative and adaptive, which allows them to work around inconsistencies from session to session. Putting prompt engineering behind the scenes takes that autonomy away from users. Baselines, defaults, and seeds aren’t reliable enough to meet human-machine collaboration standards. For now, LLM apps must operate at the lower maturity level.