
Beyond the Benchmark: Developing Better AI with Evaluations

Posted October 13, 2025 | Technology | Amplify

AMPLIFY, VOL. 38, NO. 6
ABSTRACT
ADL’s Dan North examines the engineering discipline of AI evaluations, arguing that evals are where an LLM’s task-agnostic capabilities, as measured by benchmarks, are translated into setting-specific technology that is ready to deliver success. North emphasizes the ultimately human nature of this discipline: an organization must define the kinds of outputs it is looking for through an inclusive process involving customers and stakeholders alongside engineers.

Evaluation criteria are a core part of deriving value from AI, unifying low-level code tests with high-level customer needs.1 This article explains how to select evaluation criteria (a central yet underdiscussed step) and how the design of an AI workflow informs them. These processes are likely to change quickly as the AI development ecosystem matures, but human values ultimately remain central, so close collaboration between end user and developer is key.

Unlike classical software, which is deterministic, large language models (LLMs) are stochastic — the same input could produce a range of possible outputs. That makes LLMs generative; it also makes them difficult to control.

Applying LLMs in a product means controlling their outputs, which is why evals are required: evaluating the range of outputs you get from an input allows you to adjust the input to get the ones you want. Much of AI product development is now driven by evals — how you adapt the off-the-shelf technology to the specifics of your use case.
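
To make that loop concrete, here is a minimal sketch. Everything in it is illustrative: call_llm and meets_criterion are placeholders for whatever model client and pass/fail check a real workflow would use.

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for a real model client; each call returns one sampled output."""
    return random.choice(["output within the acceptable range", "off-target output"])

def meets_criterion(output: str) -> bool:
    """Placeholder pass/fail check; in practice a string assertion or an LLM judge."""
    return "acceptable" in output

def pass_rate(prompt: str, n_samples: int = 20) -> float:
    """Sample the same input repeatedly and measure how often the output passes."""
    passes = sum(meets_criterion(call_llm(prompt)) for _ in range(n_samples))
    return passes / n_samples

# A low pass rate is a signal to adjust the input (prompt, context, examples)
# and re-run the eval, rather than to change the model itself.
print(pass_rate("Summarize this clinical note for invoicing."))
```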

Crucially, data about these specifics is idiosyncratic to a given workflow and its end users. In AI, data quality is usually discussed in the context of LLM pretraining or fine-tuning, and evals are still often conflated with LLM benchmarking. Although these are important and could be included here, this article targets the newer sense of evals: assessment of LLM outputs in actual production workflows against criteria specific to those applications. (I exclude agent evals from this scope, since they’re considerably different.2)

Evals in this sense are different from the benchmarking used to train or fine-tune LLMs. Primarily, formal benchmarks are for generic, task-agnostic properties you would want the model to have in any setting; evals are for the task-specific properties you want the model to have in your setting.

Furthermore, most enterprise LLM workflows require evals, but fewer of them will require fine-tuning; fine-tuning can be more expensive and incurs sunk cost that slows your ability to switch models and adjust your workflow. So eval results will typically be used for in-context adjustments to LLM inference — that is, changing the information provided to an LLM at inference instead of pretraining, allowing businesses to iterate and respond to the market more quickly. That means the results will be applied to qualitative natural language inputs of the model, providing an ideal way to inject customers’ success criteria directly into the generation request.
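
As a rough sketch of what that injection can look like (the criteria strings and prompt structure below are hypothetical, not drawn from any particular product), eval findings end up as natural language added to the generation request:

```python
# Customer success criteria, distilled from eval results and conversations,
# are injected into the request at inference time. Updating this list changes
# model behavior without any retraining or fine-tuning.
SUCCESS_CRITERIA = [
    "Use the client's standard item names, never free-text descriptions.",
    "If a quantity is illegible, flag it for human review instead of guessing.",
]

def build_request(task_input: str) -> str:
    """Assemble the generation request with the success criteria in context."""
    criteria_block = "\n".join(f"- {c}" for c in SUCCESS_CRITERIA)
    return (
        "You are mapping extracted form text to invoice items.\n"
        f"Follow these success criteria:\n{criteria_block}\n\n"
        f"Form text:\n{task_input}"
    )

print(build_request("amoxicillin 250mg x14"))
```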

As any consultant or solutions architect knows, deriving a customer’s success criteria is not simply a matter of reading off what they tell you. Much of our knowledge is tacit and enmeshed in a background of presuppositions. This difficulty is underscored when designing good evaluation criteria, because they require you to make explicit (in text) all the implicit “vibes” that divide success from failure. At the same time, since every significant LLM call made in your application requires evaluation, the criteria also resemble unit tests, yet their assertions must incorporate preferences far beyond the scope of classical testing.

In short, authoring criteria requires a unification of the broad, customer-defined goals of a use case with the technical, narrow functionality of the LLM call being evaluated.

Good evaluations bring the tech to your use case. More broadly, AI promises a world where intelligence (sound, valid, and relevant inference from given data) is commoditized and cheap. If every business has access to graduate student–level intelligence for $0.30 per million tokens, differentiation happens at the application layer, and, for reasons we have seen, user feedback is essential for developing AI-based applications.3,4

Unfortunately, the available guides and documentation say little about how to use customer feedback to select and write the criteria you want to evaluate for.

Where to Find the Criteria

Manual annotation and authorship for evaluation criteria will always be required. However, it is best used on high-level outputs of the workflow to adduce the criteria for automated evaluations, which offer scale over low-level outputs. In addition, design principles of AI workflows inform exactly what these criteria should include.

A common theme in the extant documentation on implementing evaluation systems for AI applications (such as that from model providers like OpenAI and Anthropic) is the importance of manual human analysis at some stage of the evaluation pipeline.5

The goal is to transfer human intuitions into the LLM, during either inference or pretraining, so that its range of outputs aligns with use cases and preferences. As mentioned, most uses of evaluation results involve context enrichment, so we must map the results into text. This involves performing error analysis to identify failure modes and adduce context refinements.

Performing this task manually for each evaluation is not scalable, so most solutions architects use an LLM as a judge (LaaJ) to do most of the annotation. This requires the judge LLM to be prompted carefully and even evaluated itself. To do this, some AI experts suggest having a domain expert hand-annotate a sample set of original LLM responses, explaining each success or failure, and another LLM summarize these annotations.6 Others recommend EvalGen, a sophisticated suite for iterating this process, and identify the now well-known phenomenon of criteria drift, in which end users refine their own self-reported success criteria as they judge sequences of examples.7
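
A minimal sketch of such a judge call is below, reusing the hypothetical call_llm placeholder from the earlier sketch; the annotation examples and prompt wording are purely illustrative and are not the method prescribed by the cited sources.

```python
import json

# Hand-written annotations from a domain expert: a few successes and failures,
# each with an explanation. These anchor the judge's notion of pass and fail.
EXPERT_ANNOTATIONS = """\
PASS: "amoxicillin 250mg" mapped to the standard item "Amoxicillin tablets".
FAIL: "post-op check" mapped to "Surgery"; it should map to "Follow-up consult".
"""

def judge(original_input: str, model_output: str) -> dict:
    """Ask a judge LLM for a binary verdict plus an explanation, returned as JSON."""
    prompt = (
        "You are evaluating an LLM output against a domain expert's standards.\n"
        f"Annotated examples:\n{EXPERT_ANNOTATIONS}\n"
        f"Input:\n{original_input}\n\nOutput to judge:\n{model_output}\n\n"
        'Reply only with JSON: {"pass": true or false, "explanation": "..."}'
    )
    return json.loads(call_llm(prompt))  # assumes the judge returns valid JSON
```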

Aptly named (from Greek krites, “judge”), criteria drift shows what psychologists and philosophers learned from Socrates: we don’t really know our own definitions until they are challenged with examples.8 Restricting answers to true or false, but requiring an explanation, forces us to say why an example does or doesn’t meet the criteria.9 This has the effect of making users’ implicit assumptions explicit, and EvalGen capitalizes on LLMs’ noted talent for detecting implicature.

Although an important tool in evaluation design, this method has limitations. The implicatures detected by LLMs in composing evaluation criteria are necessarily read from their text input; this limits the detection to presuppositions deducible from text. But implicature operates on many levels of information beyond text. In particular, much of the context that fills in the ellipses of our definitions is drawn from world knowledge, recent events, speaker relationships, conversation history, physical gesture, the surrounding environment, current time, and more.10

This means different information is communicated in a verbal conversation, especially in person, versus text. Humans evolved by encountering and interpreting each other in person, and our capacity for language use developed around this fact.11 Fundamentally different types of information are available to and inferred by a developer in actual conversation with end users versus LLMs reading those users’ text annotations.

There are also practical reasons for using insights gained in conversation with the customer when composing evaluation criteria. One engineering problem (which EvalGen proposes to address) is that LLM responses are often embedded deeply within a multistep workflow and thus not intelligible to a nontechnical user or anyone unfamiliar with the project code. One business problem is that, anecdotally, the hours of manual annotation required by automated processes are a significant demand to place on enterprise clients, most of whom do not have dedicated AI teams and are in completely unrelated industries.

It is also beneficial for the design process to include regular touchpoints and transparency between developer and user. In general, involving clients in the development process strengthens the developer relationship and increases their buy-in and their likelihood of using the application.12

This approach is central to the forward-deployed engineer role popularized by Palantir Technologies. In its model, the engineer has direct, frequent meetings with customers, initially to discover the details of their use case, and subsequently to demonstrate the iterative work-in-progress application and receive their immediate feedback.

The key is close collaboration between the developer and the end user: not to give the customer more work, but to shape the development of the application around the user’s direct comments. In my own experience, conversations with the customer were crucial to capturing the implicit, unwritten heuristics at the core of their business.

The regression of LLM judges all the way down can only terminate with humans in the loop, and eliciting customer feedback in live conversation is a useful way to do this, for both semantic and practical reasons. The manual step in which developers synthesize the insights gained from customer interactions and incorporate them into evaluation criteria is how extra-textual, implicit information is passed to their workflow’s AI.

Evaluating LLM Workflows

The human evaluation we have been discussing should target the final output of the application (the presentation, report, and so on) in order to elicit users’ expectations of it. However, most value-adding AI apps do not produce this final output with a single LLM call.

Even relatively simple minimum viable products (MVPs) usually require a scaffold of multiple calls, each used for a specific and idiosyncratic purpose within the execution flow of the application and wrapped by various helper and parsing functions to integrate with that flow. This is because LLM outputs tend to be more successful the more predictable the desired I/O (input/output) pattern is (a corollary of the fact that artificial general intelligence is hard).

As we have seen, these are tedious to evaluate manually, so while LaaJ can and should be run on the final product, the way it adds value is by automating evaluation of these low-level LLM calls. Implementing LaaJ for these calls requires understanding how they’re embedded in your workflow.

In fact, LLM workflows tend to decompose into individual calls that perform one of a few standard natural language processing (NLP) tasks that LLMs are good at. This is by design, since the transformer architecture used in current LLMs was originally created for machine translation, a core subfield of NLP.13 These tasks are usually one of:

  • Classification. What is this?

  • Extraction. Where is this?

  • Summarization. What does this mean?

  • Generation. What’s the most relevant response? (the bit that makes LLMs “intelligent”)

In reality, the boundaries between these tasks are blurry. Nevertheless, when building an LLM workflow, it is helpful to think of them as building blocks to achieve the final output.

For example, earlier this year, I built an MVP for a veterinary hospital client that tracked patient data on 100% handwritten forms. The use case was to apply OCR (optical character recognition) to the forms and map the extracted text to invoice items for the client’s CRM (customer relationship management) system, saving nurses the dozens of hours per week they spent doing this manually.

Because the inputs were in handwriting of varying legibility and included medical terms and abbreviations, the architecture required was more complex than simply passing an LLM the form and asking it what to invoice. The document image was first split into separate sections corresponding to semantically disparate parts of the form. Each section was then sent to an initial multimodal LLM call A to extract the actual text (text recognition/extraction).

Because extraction alone achieved only 70% accuracy, even with lightly fine-tuned models, a further LLM call B mapped the results to a predefined list of common form items (classification). After combining the separate streams of extracted text, a final LLM call C mapped them to line items used in the client’s invoicing system (classification). Although elaborate, this design was necessary to achieve the accuracy and reliability required by such a business-critical use case.
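
Structurally, the scaffold looked something like the following sketch. Every name and helper here is a simplified placeholder rather than the client implementation; the point is that each LLM call has its own narrow I/O pattern, which is what makes it separately evaluable.

```python
# Placeholder lists standing in for the client's real catalogs.
COMMON_FORM_ITEMS = ["Consultation", "Amoxicillin tablets", "Follow-up consult"]
INVOICE_LINE_ITEMS = ["CONSULT-STD", "MED-AMOX-250", "CONSULT-FU"]

def split_into_sections(form_image):
    """Non-LLM preprocessing: split the scanned form into semantic sections."""
    return [form_image]

def extract_text(section) -> str:
    """Call A: multimodal LLM pass that reads handwriting into raw text."""
    return "amoxicillin 250mg x14"

def classify_to_common_items(raw_text: str, items: list[str]) -> dict:
    """Call B: map each raw text fragment to a predefined common form item."""
    return {raw_text: "Amoxicillin tablets"}

def map_to_invoice_items(mappings: list[dict], items: list[str]) -> list[str]:
    """Call C: map the combined common items to the CRM's invoice line items."""
    return ["MED-AMOX-250"]

def process_form(form_image) -> list[str]:
    sections = split_into_sections(form_image)
    extracted = [extract_text(s) for s in sections]
    classified = [classify_to_common_items(t, COMMON_FORM_ITEMS) for t in extracted]
    return map_to_invoice_items(classified, INVOICE_LINE_ITEMS)
```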

In this way, the LLM outputs used in a given workflow quickly become idiosyncratic to that particular workflow, which itself can be idiosyncratic to a given use case. So the criteria evaluating these outputs must assess the way they each contribute to the customer’s success criteria, which constitute the ultimate purpose of the application.

At a first pass, the purpose here was to achieve complete, consistent, and accurate population of the client’s invoicing CRM from handwritten forms. The intermediate purpose of, for example, LLM call B was to ensure each extracted element of text was mapped to standard items, so this information could be passed to the next call C. So the evaluation criterion for B is whether it completely, consistently, and accurately classifies raw text into standard items. Notice, however, that this criterion depends on the system’s ultimate goal. If the final purpose of the system were something entirely different — for instance, to generate outputs that are funny or entertaining rather than accurate — then it would no longer make sense to judge call B by accuracy. Instead, we would evaluate it by how well it contributes to that new purpose (e.g., by how entertaining or funny the results are).

This is a purely theoretical criterion, with use-case-specific details waiting to be added. More concretely, different use cases can have very different success criteria and different types of data involved in their LLM calls. Currently, the core NLP tasks listed above provide a value proposition for enterprise in two broad types of use case: automation of administrative tasks (e.g., filling out forms, transferring data, sending routine messages or notifications, and performing time-consuming, boring tasks) and work-product creation (e.g., generation of leads, boilerplate documents and reports, due diligence and analysis, and presentation decks).

Success for the first type is mainly about achieving the correct outcome based on initial conditions. For the second type, it is more about meeting professional quality standards for the work product. The specifics of both change depending on the organization using them. The correct outcome for administrative tasks often depends on unwritten, heuristic rules developed by employees over time. In contrast, quality standards for work products often are contextual and vary between company, team, and individual.

A generic high-level criterion of “complete, consistent, and true” is perhaps sufficient for the task-agnostic process of LLM pretraining but not for the task-specific evaluations of LLM outputs as used in real-world applications.

Furthermore, as mentioned, there will be technical criteria reflecting errors specific to your implementation (e.g., generating too many words in a text response, including escape characters in a JSON output, or selecting the same item from a list when instructed not to). Pulling from a variety of sources, a list of automated eval criteria for call B could be something like:

  1. The image of every mapping is a common item.

  2. The pre-image of every mapping is in the raw text.

  3. Every element of raw text that seems relevant is the pre-image of a mapping.

  4. Each common item is the image of at most one mapping (i.e., the map is injective).

  5. All common items returned are drawn from the list provided.

  6. Any reference to brand name a should be replaced by generic name b.

  7. Raw text string s or similar should be interpreted as common item i instead of common item k.

  8. If string s1 appears before s2, then map to common item j.

  9. Only map to common item l when string s3 is present in the raw text.

  10. No escape character appears in the response JSON.

Criteria 1-5 are basic definitions for the type of map required by this call in the workflow. Criteria 6-9 are heuristic criteria specific to how this administrative task is actually accomplished by the nurses (in my veterinary example), and criterion 10 is for response formatting. Because this is a classification task, it is framed as a simple map from a raw text string to a list of items. Of course, making the mapping more fuzzy or not exclusive to the provided list would require much richer heuristic criteria.
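
To illustrate how such a list splits between deterministic and LLM-judged checks, here is a sketch of automated evals for call B. The variable names and JSON shape are assumptions; criteria 1, 2, 4, 5, and 10 reduce to string and set assertions, while criterion 3 and criteria 6-9 are heuristic and would be delegated to an LLM judge like the one sketched earlier.

```python
import json

def eval_call_b(raw_text: str, response_json: str, common_items: list[str]) -> dict:
    """Deterministic checks for call B; heuristic criteria go to an LLM judge."""
    results = {}

    # Criterion 10: no escape characters in the response JSON (backslash as a proxy).
    results["10_no_escape_chars"] = "\\" not in response_json

    mapping = json.loads(response_json)  # assumed shape: {raw text fragment: common item}

    # Criteria 1 and 5: every image is a common item drawn from the provided list.
    results["1_5_images_are_common_items"] = all(v in common_items for v in mapping.values())

    # Criterion 2: every pre-image actually appears in the raw text.
    results["2_preimages_in_raw_text"] = all(k in raw_text for k in mapping)

    # Criterion 4: no two fragments map to the same item (the map is injective).
    results["4_map_is_injective"] = len(set(mapping.values())) == len(mapping)

    # Criterion 3 (relevance) and criteria 6-9 (domain heuristics) are not
    # string-checkable; they would be scored by an LLM judge instead.
    return results

print(eval_call_b(
    raw_text="amoxicillin 250mg x14",
    response_json='{"amoxicillin 250mg": "Amoxicillin tablets"}',
    common_items=["Amoxicillin tablets", "Consultation"],
))
```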

As we know from software product design in general, discovering the heuristics and type of mapping idiosyncratic to a particular use case and audience cannot be done from an armchair. The idiosyncrasies here are contingent on the facts of the specific environment the application is deployed in and thus not deducible a priori. Instead, the developer must collect specific feedback from end users — that is, people who will actually use the application.

It’s critical that the evaluation for a given LLM output target how that particular call contributes to the broader purpose of the system it’s in. This is neither reducible to generic benchmarks nor deducible from either the prompt or the use case alone. Instead, it is an application of the success criteria from the customer to that output’s place in the workflow.

The Future of Evaluations

This discussion assumes the current state of the field, in which evals and the tooling that makes them useful must be written from scratch for each application, with lots of humans in the loop. Of course, the state of this field doesn’t stay in one place for long.

Two years ago, LLM capability was barely mature enough to provide enterprise value. Today, it is relied on for many of the tasks previously assigned to junior software developers. In another two years, even with linear improvement, we expect many of the processes described in the previous section to be automated.

Building a pipeline to generate test cases, run evals on them, and send the results to a self-refinement step currently requires significant manual integration with your app and how it implements an AI workflow. But how do these guidelines change in a world where an AI coding agent built that entire workflow and therefore knows how to build the integrated evals pipeline, too?

Several vendors offer evaluation tools that automate some of the work. As mentioned, OpenAI has an evaluations suite in its API and Web platform. Currently, it runs the evaluation and calculates scores for you, but this should expand over time to automatically run evals on calls made to its models in your application. Perhaps it could offer a functionality to group such calls into a workflow or project, providing the kind of architectural information discussed in the previous section. Or perhaps it could offer integrations with developers’ self-refinement pipelines, expose its prompt optimizer in the API, or even host these pipelines itself. These functionalities make sense for closed-source model providers, since they already know exactly what you’re sending to and receiving from their models.

There are also third-party platforms offering more advanced capabilities. Scale is the largest and best-funded, offering human-annotated data for LLM pretraining. It also brings human-in-the-loop evaluations to enterprise applications and offers some ability to use eval results for prompt refinement. However, the humans in question are anonymous and unrelated to the end users of your app.

Flow AI specializes in generating synthetic test data for agents, which is more complex, and offers an open source model fine-tuned for LaaJ. These third parties target developers or organizations that prefer open source, or at least the choice of model provider, but they run the risk of being made obsolete by major model providers that already have access to the critical data and inexpensive compute.

It is unclear how the market forces between closed and open source (or open weight) models, and between single and multiple model providers, will play out in the AI space, especially if the cost of writing code trends toward zero. Nevertheless, developers will likely have far more tools and prebuilt infrastructure at their disposal in the near future for building evaluation suites and even entire AI applications and integrations, which themselves may look different in a world where AI communications protocols like MCP (Model Context Protocol) and A2A (Agent2Agent) are as important to our digital infrastructure as HTTP (Hypertext Transfer Protocol). What happens when the entire process of writing and running evals is as simple as calling an agent hosted in the cloud?

Advances in fundamental research may also change how models and applications are evaluated, potentially streamlining the collection and incorporation of user feedback. For example, achieving interpretability of deep learning models may allow us to more precisely target the weights implicated in desirable or undesirable behavior.

One approach to this is neuro-symbolic AI, which integrates neural nets with classical logic-based approaches to intelligence.14 This is interesting because it could allow us to bridge stochastic and deterministic types of processing, getting the best of both worlds. Although there are many different branches, a common goal of the field is to logically structure model inference in a more direct way than scaffolding or simple prompting.

In a similar vein, alignment of deep learning models to human goals and preferences could benefit from formal control theory, which attempts to state these explicitly, such that model behavior necessarily adheres to them.15 Much work remains to be done, but the hope is that, instead of an experimental and iterative approach to evaluations, we could give the model itself formal constraints similar to the assertions of traditional unit tests.

I believe the effect of a more automated world will be to increase the value of human judgments — and originality more generally. It may take a different form than the current human-in-the-loop architecture, but we are building these applications for our preferences, our values, and ourselves.

Even if much of the engineering becomes abstracted away, end users and their needs will remain. App development would still be behavior-driven, but the behavior would be much higher level and more accessible to the end user, reducing the friction involved in translating implicit, heuristic user needs into evaluation criteria and raising the value of direct touchpoints with developers. Perhaps in the not-too-distant future, end users won’t even need developers.

References

1 Low-level code tests are a fundamental level of software testing in which individual components or modules of an application are tested in isolation to ensure they function as intended.

2 Thanisch, Eystein (ed.). “Disciplining AI, Part I: Evaluation Through Industry Lenses.” Amplify, Vol. 38, No. 5, 2025.

3 “Cost of Building and Deploying AI Models in Vertex AI.” Google Cloud, accessed 2025.

4 Seemann, Florian. “Defensibility in the Application Layer of Generative Artificial Intelligence.” Medium, 12 April 2023.

5 Shankar, Shreya, and Hamel Husain. “Application-Centric AI Evals for Engineers and Technical PMs.” Course notes, May 2025.

6 Husain, Hamel. “Creating a LLM-as-a-Judge That Drives Business Results.” Blog post, 29 October 2024.

7 Shankar, Shreya, et al. “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” arXiv preprint, 18 April 2024.

8 Geach, P.T. “Plato’s ‘Euthyphro’: An Analysis and Commentary.” The Monist, Vol. 50, No. 3, July 1966.

9 Shankar and Husain (see 5).

10 Hunter, Julie, Nicholas Asher, and Alex Lascarides. “A Formal Semantics for Situated Conversation.” Semantics and Pragmatics, Vol. 11, No. 10, 2018.

11 Pleyer, Michael, and Stefan Hartmann. Cognitive Linguistics and Language Evolution. Cambridge University Press, 2024. 

12 Joseph Farrington notes this in Part I of this Amplify series, stating: “Just as importantly, building the simulator prompted early engagement between stakeholders. Blood bank staff, clinicians, and data scientists worked together to define the decisions that mattered, the constraints that applied, and the metrics that should be used to determine success. This collaboration ensured that any model developed would be evaluated against practical criteria and shaped from the outset to fit the real-world context in which it would operate”; see: Farrington, Joseph. “A Simulation-First Approach to AI Development.” Amplify, Vol. 38, No. 5, 2025.

13 Vaswani, Ashish, et al. “Attention Is All You Need.” arXiv preprint, 2 August 2023.

14 Hitzler, Pascal, and Md Kamruzzaman Sarker. “Neuro-Symbolic Artificial Intelligence: The State of the Art.” vaishakbelle.org, accessed 2025.

15 Perrier, Elija. “Out of Control — Why Alignment Needs Formal Control Theory (and an Alignment Control Stack).” arXiv preprint, 21 June 2025.

About The Author
Dan North
Dan North is Global IT Developer at Arthur D. Little, where he develops LLM-based applications for internal use. Prior to this role, he worked at an early-stage startup building AI agents for enterprise applications, and at AWS Bedrock, contributing to its LLM training pipeline. Mr. North has a deep passion for philosophy and strongly believes in its relevance and applicability to everyday life, including business.