
Disciplining AI, Part I: Evaluation Through Industry Lenses — Opening Statement

Posted August 15, 2025 | Technology | Amplify
Making AI Accountable, Part I: Metrics Across Industry & Disciplines

AMPLIFY  VOL. 38, NO. 5

For better or for worse, AI is set to have a major impact on business and society. Making AI technologies accountable through the disciplined and systematic evaluation of their effects is thus becoming both a matter of public safety and organizations’ ROI.1

Many would argue that the AI industry is not accountable enough, particularly around intellectual property, privacy, bias, and social ramifications. AI-model benchmarking, however, is a prominent and influential aspect of the industry. Model developers and researchers have devised numerous sets of standardized tests that measure performance in areas like coding, math, reasoning, factual accuracy, and visual problem-solving, as well as aspects like safety or jailbreak vulnerability. The results can be compiled into leaderboards that purport to identify the best models for a given use case. 
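To make these mechanics concrete, the sketch below shows, in deliberately simplified form, how a benchmark harness scores models against a fixed answer key and compiles the results into a leaderboard. The model names, the ask_model() stub, and the three test items are purely illustrative assumptions, not a real benchmark suite.

```python
# Minimal sketch of a benchmark harness: score models on standardized test
# items, then rank them. The models, ask_model() stub, and test items are
# hypothetical placeholders, not a real benchmark.

TEST_SET = [  # each item: a prompt plus the single accepted answer
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Capital of Australia?", "answer": "Canberra"},
    {"prompt": "Is 97 prime? (yes/no)", "answer": "yes"},
]

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to a real model API."""
    canned = {
        "What is 17 * 24?": "408",
        "Capital of Australia?": "Canberra",
        "Is 97 prime? (yes/no)": "yes",
    }
    return canned[prompt] if model_name == "model-a" else "unsure"

def score(model_name: str) -> float:
    """Fraction of test items the model answers exactly correctly."""
    hits = sum(
        ask_model(model_name, item["prompt"]).strip().lower() == item["answer"].lower()
        for item in TEST_SET
    )
    return hits / len(TEST_SET)

if __name__ == "__main__":
    leaderboard = sorted(((score(m), m) for m in ("model-a", "model-b")), reverse=True)
    for rank, (accuracy, name) in enumerate(leaderboard, start=1):
        print(f"{rank}. {name}: {accuracy:.0%}")
```

Even a toy harness like this embeds contestable choices, such as exact-match scoring and a hand-picked answer key, and those choices are precisely where the criticisms below begin.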

Nevertheless, AI benchmarking is increasingly seen as falling short of satisfactory evaluation of AI systems.2 Benchmarking’s inadequacies include underscrutinized test data, model developers teaching to the test, and even the possibility that models themselves “know” when they’re being tested and feign the required responses.

Equally concerning, results from testing models on academic tasks taken out of context often have limited relevance to real-world applications, no matter how advanced the tasks may be. As frontier large language models show signs of plateauing in performance, attention is increasingly shifting toward how these models are applied in practice.

The discipline of AI evaluation aims to quantify the quality of the responses of entire AI systems, concretely and in context, as well as of their individual components. Modeling even just the relevant aspects of that context and tracking the AI's nondeterministic outputs is, of course, hard, but it is increasingly recognized by tech leaders and investors as critical.3 Another outstanding question is whether, and by what means, one can evaluate how AI systems arrive at their output, with a view to verifying and explaining that output.
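As a minimal illustration of what such system-level evaluation can involve, the sketch below runs the same in-context case several times (because outputs are nondeterministic) and reports how often each task-specific check passes. The run_system() stub, the example case, and the checks are hypothetical placeholders for a real AI system and a real rubric.

```python
import random

def run_system(case: dict) -> str:
    """Stand-in for invoking the full AI system (retrieval, prompts, model calls)."""
    return random.choice([
        f"Refund approved: order {case['order_id']} arrived damaged.",
        f"Refund approved for order {case['order_id']}.",
        "Sorry, I cannot help with that.",
    ])

# Task-specific checks applied to each sampled output (illustrative only).
CHECKS = {
    "mentions_order_id": lambda output, case: case["order_id"] in output,
    "gives_a_decision": lambda output, case: "refund" in output.lower(),
}

def evaluate(case: dict, samples: int = 5) -> dict:
    """Run the system several times and report each check's pass rate."""
    outputs = [run_system(case) for _ in range(samples)]
    return {
        name: sum(check(output, case) for output in outputs) / samples
        for name, check in CHECKS.items()
    }

if __name__ == "__main__":
    case = {"order_id": "1042", "issue": "item arrived damaged"}
    print(evaluate(case))  # e.g. {'mentions_order_id': 0.8, 'gives_a_decision': 0.8}
```

In practice, the checks are usually far more elaborate, sometimes involving human review or judge models, which is part of what makes the discipline hard.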

Testing, however elaborate, has proved imperfect at reliably evaluating AI's usefulness once it is integrated into human workflows across organizations and societies. In a recent study, access to market-leading agentic AI coding tools increased the time experienced software developers took to complete real tasks on mature projects by 19%.4 Even the technology's most ardent supporters acknowledge this to be a major frontier.5

Here in Part I of this two-part Amplify series on AI evaluation, we explore the impetus toward AI accountability that arises from tackling real problems in real-world settings. Understanding how AI can contribute, at what cost, and with what nth-order effects in a given context requires rigorous socio-technical systems thinking.

In This Issue

In pursuit of such thinking, drawing on experience across industries and disciplines, this issue of Amplify offers insights into the criteria that determine AI success.

First up, Marcus Evans, Rosie Nance, Lisa Fitzgerald, and Lily Hands remind us that AI explainability is a legal requirement as well as a scientific challenge. Despite the EU's and the UK's differing approaches to other aspects of AI regulation, both the EU and UK GDPR continue to uphold individuals' right to an explanation of automated or semiautomated decisions that significantly affect them. The EU AI Act also provides individuals and organizations with a right to an explanation.

Organizations using AI-powered tools to make impactful decisions with an EU or UK connection must be able to explain how those decisions are made, and the explanation must be intelligible to a lay citizen. In the face of these legal and social obligations, a system's explainability and evaluability should be among the key factors considered when architecting or procuring AI. The authors conclude with practical advice on how to promote explainability through AI governance.

Next, Daniel Flatt contends that an evaluation framework that promotes accuracy and objectivity is a commercial necessity. He points to B2B publishing as a sector of particular note: an industry built on credibility and accountability that AI could undermine, even as it offers opportunities to accelerate time to output. In response, new tools for detecting inaccuracy and bias in AI-generated copy are emerging, alongside collaborations across the publishing workflow and the wider industry.

The relationship between AI and publishing is increasingly bidirectional. AI model developers seek partnerships with reputable publishers to access both content and brand credibility, while publishers must weigh such collaborations carefully, ensuring that models deliver quality output. With its long tradition of fact-checking, journalism brings valuable expertise to this challenge. Flatt calls for an approach that safeguards and advances journalism’s core ideals while setting a broader standard for the responsible use of AI.

Likewise, Kitty Yeung urges us to elevate our thinking and consider what we are trying to achieve — via AI or otherwise. She argues that the fashion industry has long failed to appreciate the imaginative journeys consumers are taking, journeys that weave together self, situations, social circles, and eclectic wearables. Destructive practices like fast fashion represent flawed attempts to address human complexity with incomplete information, cumbersome supply chains, and a narrow anthropology that undervalues consumers’ creative agency.

In contrast, AI provides digital try-on tools that allow users to experiment with items in any combination or context — or even design new ones themselves. AI-enabled analysis can then surface the trends emerging from these creative interactions, helping to shape a smarter, leaner supply chain. For Yeung, realizing this potential requires AI evaluation to move beyond mere compliance with the status quo and instead align with the higher ideals of freedom, truth, and sustainability.

Joseph Farrington also emphasizes the importance of evaluating AI systems against their end goals. In healthcare, where developing and deploying AI models is especially challenging, he argues for first modeling the business context and processes the AI will interact with — before moving ahead with development or deployment. This approach can be used to assess, in advance, whether a plausible AI model will provide the intended benefit. It can also be used to run alternative scenarios to identify what else might need to change for the AI to really work or what else might work better if the AI were in place.

As a secondary benefit, context modeling brings engineers, stakeholders, and domain experts together much earlier than would typically occur in such projects. In this way, AI evaluation is not merely a corrective exercise after deployment; it should also anticipate, contextualize, and define — in concrete terms — what the AI is meant to achieve.
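To give a flavor of what modeling the context in advance can look like (an illustrative sketch, not Farrington's method), the snippet below asks whether a hypothetical triage-support model of assumed accuracy would actually reduce clinician time per referral in a simple process model. Every number in it (prevalence, accuracies, task times) is an assumption to be replaced by domain knowledge.

```python
# Minimal sketch of modeling a care process in advance to ask whether a
# plausible AI model would actually help. All numbers are illustrative
# assumptions, not data, and the process model is deliberately simple.

def clinician_minutes_per_case(use_ai: bool,
                               prevalence: float = 0.10,   # share of urgent referrals
                               sensitivity: float = 0.90,  # assumed AI recall on urgent cases
                               specificity: float = 0.70,  # assumed AI accuracy on routine cases
                               review_min: float = 12.0,   # manual review time per referral
                               rework_min: float = 45.0    # cost of a missed urgent case
                               ) -> float:
    """Expected clinician minutes per referral under a simple process model."""
    if not use_ai:
        return review_min  # baseline: every referral is reviewed manually
    flagged = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    missed_urgent = prevalence * (1 - sensitivity)
    # Only flagged referrals are reviewed; missed urgent cases surface later
    # and trigger costlier rework.
    return flagged * review_min + missed_urgent * rework_min

if __name__ == "__main__":
    base = clinician_minutes_per_case(use_ai=False)
    with_ai = clinician_minutes_per_case(use_ai=True)
    print(f"baseline: {base:.1f} min/case, with AI: {with_ai:.1f} min/case")
```

Rerunning the model under different assumptions is the "alternative scenarios" step: it shows, before any development work, what else would have to change for the AI to deliver the intended benefit.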

Closing the issue, Joseph Byrum introduces a framework to help organizations plan rationally and prudently for AI adoption. One element is defining performance thresholds beyond which emerging technologies become economically viable. Another is assessing how AI and humans should interact across different business functions: some tasks can be commoditized and handled by AI, while others remain critical differentiators under human responsibility, with hybrid possibilities in between.
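A minimal sketch of the first element, the economic-viability threshold, might look like the following. The costs are illustrative assumptions, and the calculation is ours, not Byrum's framework itself.

```python
# Illustrative performance-threshold calculation: at what error rate does
# handing a task to AI become cheaper than human handling? All costs are
# assumptions for the sake of the example.

HUMAN_COST_PER_CASE = 8.00   # fully loaded cost of a human handling one case
AI_COST_PER_CASE = 0.40      # inference plus oversight cost per AI-handled case
COST_PER_AI_ERROR = 60.00    # expected cost of correcting one AI mistake

def ai_cost(error_rate: float) -> float:
    """Expected cost per case if the task is handed to AI."""
    return AI_COST_PER_CASE + error_rate * COST_PER_AI_ERROR

# AI is viable while ai_cost(e) < HUMAN_COST_PER_CASE.
break_even = (HUMAN_COST_PER_CASE - AI_COST_PER_CASE) / COST_PER_AI_ERROR

if __name__ == "__main__":
    print(f"AI viable below an error rate of {break_even:.1%}")  # ~12.7%
    for e in (0.05, 0.15):
        verdict = "automate" if ai_cost(e) < HUMAN_COST_PER_CASE else "keep human"
        print(f"error rate {e:.0%}: expected cost ${ai_cost(e):.2f}/case -> {verdict}")
```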

This analysis is not straightforward. Byrum points to cases like UPS and its ORION route optimization system, where organizations had to undergo radical iterations to find the right balance between AI and human input. Complicating matters further are the rapid pace of technological development and the shifting nature of market differentiators, which make any framework less a static blueprint and more a matter of dynamic “adaptive sensing.”

Key Themes

The contributors to this issue of Amplify all agree that true AI evaluation must go beyond assessing models or outputs alone. Instead, it should be grounded in an organization’s strategies, end goals, and sources of differentiation.6

AI is on a trajectory to become too fundamental and impactful to be treated as just another tool evaluated only on its outputs. Its influence extends beyond organizational success into society at large, and society has expectations around human dignity and empowerment. Any evaluation framework must therefore consider the extent to which AI advances — or at the very least does not undermine — these core values.

It is also a mistake to imagine that business processes and structures will remain constant, providing an immutable viewpoint from which AI can be comfortably evaluated. As several contributors note, the rollout of the technology is already driving new ways of working, requiring more interdisciplinary and cross-organizational collaboration and the formalization of long-tacit knowledge. Looking ahead, it has the potential to transform supply chains and upend established economies of expertise. The status quo is not a valid yardstick.

Several contributors propose forward-looking approaches to AI evaluation — identifying where adoption will be most effective, estimating its likely benefits, and addressing additional requirements like explainability once deployment occurs. Synthesizing deep context and the interplay of action and reaction is central to systems thinking,7 which provides a valuable framework for interpreting the contributions in this issue.

This issue also highlights profound challenges: the culture-specific nature of human imagination and self-perception, the difficulty of explaining systems that remain subjects of frontier research, and the organizational self-understanding required to model all the factors — including human expertise — that shape a process.

Indeed, the very notion of truth is in play — both in the journalistic sense and in the realm of authentic human expression. While no one suggests there are easy solutions, the contributions in this issue offer grounded and thought-provoking approaches.

In Part II of this Amplify series, we’ll take a closer look at some of the engineering and conceptual challenges of AI evaluation.

Acknowledgment

Alongside the contributors and the ADL Cutter team, I would like to thank Alaa Alfakara, Michael Bateman, Maureen Kerr, Brian Lever, Olivier Pilot, and Greg Smith for their support.

References

1 Jones, Elliot, Mahi Hardalupas, and William Agnew. “Under the Radar? Examining the Evaluation of Foundation Models.” Ada Lovelace Institute, 25 July 2024.

2 Eriksson, Maria, et al. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation.” arXiv preprint, 25 May 2025.

3 Gupta, Aakash. “Why AI Evals Are the New Unit Tests: The Quality Assurance Revolution in GenAI.” Medium, 11 June 2025.

4 Becker, Joel, et al. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” arXiv preprint, 25 July 2025.

5 Mollick, Ethan. “The Bitter Lesson Versus the Garbage Can.” One Useful Thing, 28 July 2025.

6 For more on this line of thinking, see: Kolk, Michael, et al. “Innovation Productivity Reloaded: Achieving a 40% Boost Using a People-Centric AI Approach.” Arthur D. Little, July 2025.

7 Bansal, Tima, and Julian Birkinshaw. “Why You Need Systems Thinking Now.” Harvard Business Review, September–October 2025.

  

About The Author
Eystein Thanisch
Eystein Thanisch is a Senior Technologist with ADL Catalyst. He enjoys ambitious projects that involve connecting heterogeneous data sets to yield insights into complex, real-world problems and believes in uniting depth of knowledge with technical excellence to build things of real value. Dr. Thanisch is also interested in techniques from natural language processing and beyond for extracting structured data from texts.