Benchmarking And Evaluating The Business Value Of Machine Learning Models
I often talk about connecting business metrics to model metrics, and benchmarks are a good framework for achieving that. However, you have to avoid the gaps that academic benchmarks suffer from.
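As a toy illustration of that connection, the sketch below converts a classifier's confusion-matrix counts into an estimated dollar impact, then compares a candidate model against the current production model in those terms. All the counts, costs, and the business_value helper are hypothetical placeholders, not a prescribed method.

```python
# A minimal sketch: translate confusion-matrix counts into a net dollar
# impact so a model comparison can be reported in business terms.
# All cost/benefit figures and counts below are hypothetical.

def business_value(tp, fp, fn, tn,
                   value_per_tp=50.0,   # e.g. revenue kept by a correct catch
                   cost_per_fp=5.0,     # e.g. support cost of a wrongly flagged customer
                   cost_per_fn=200.0,   # e.g. loss from a missed case
                   cost_per_tn=0.0):
    """Return the net dollar impact implied by one evaluation run."""
    return (tp * value_per_tp
            - fp * cost_per_fp
            - fn * cost_per_fn
            - tn * cost_per_tn)

# Evaluate the current production model and a candidate on the same
# held-out data, then report the delta in dollars rather than only F1.
baseline = business_value(tp=400, fp=120, fn=80, tn=9400)
candidate = business_value(tp=430, fp=150, fn=50, tn=9370)
print(f"Estimated net impact change: ${candidate - baseline:,.2f}")
```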
Greatly inspired by this paper, I have been thinking about business benchmarks for Machine Learning model performance. When a new model is built, how do we explain its impact to the rest of the business?
‘AI and the Everything in the Whole Wide World Benchmark’ talks honestly about the flaws in academic model benchmarks. GLUE, ImageNet, and many others are widely used to support claims of improved performance, and state-of-the-art results are thrown around a lot. The paper identifies several gaps in the current suite of benchmarks:
- Limited Task Design
- Arbitrarily Selected Tasks and Collections
- Critical Misunderstandings of Domain Knowledge and Applications Problem Space
- De-contextualized Data and Performance Reporting
- Limited Scope
- Benchmark Subjectivity
- Inappropriate Community Use
- Limits of Competitive Testing
- Redirection of Focus for the Field
- Justification for Practical out of Context or Unsafe Applications
Most of these bullet points have parallel shortcomings in the way we benchmark models for the business.