As AI and machine learning (ML) capabilities continue to advance, they will increasingly be embedded into complex operational workflows. In these settings, it is essential to evaluate their impact within an organization using key performance indicators (KPIs) rather than standard model prediction metrics or benchmark scores. Good models are unlikely to deliver value if they cannot be used within the constraints of a workflow.
In risk-sensitive and regulated sectors, simulation is typically used to estimate utility only after a model has been developed. As this article explores, however, adopting a simulation-first paradigm is more beneficial: building a workflow simulation to explore how predictions of varying quality will affect KPIs before starting model development. This gives teams an early answer to a critical question: “Would a model even help here, and if so, how good would it need to be?”
This piece first examines why models that perform well on narrow tasks may fail to deliver value in real-world settings. It then reviews how simulation is currently used to evaluate AI systems and introduces a simulation-first approach that places operational context at the center of model development. To illustrate the benefits of this perspective, the article describes a project from the author’s research: one aimed at reducing blood-product waste at a major London hospital. (The examples presented are mainly drawn from healthcare; however, the article also considers how a simulation-first approach can be applied in other industries.)
Why Good Models Fail
It is common practice to evaluate traditional ML models using predictive metrics, such as precision, recall, or mean squared error. General-purpose AI models, such as large language models, are typically benchmarked across a wide range of tasks, while enterprise deployments often rely on custom evaluation sets to assess whether a model performs acceptably for a defined use case.
These metrics are essential for judging performance on narrow tasks. In relatively simple feedback loops, such as recommending a product or serving an advert, they may be sufficient. In such settings, it is often straightforward to measure business value through A/B testing and iterate quickly using live deployments. But when AI and ML models are incorporated into more complicated workflows, these metrics are no longer a reliable proxy for real-world value. A wide range of factors can limit whether a well-performing model leads to meaningful impact.
The timing of a prediction can be critical. For example, the utility of an ML model developed to identify patients who might benefit from palliative care planning was limited in simulation studies by the fact that hospital staff were often too busy to act on predictions before the patients were discharged.1
Resource constraints present another challenge. Accurate sepsis-prediction models may fail to improve patient outcomes if there are only a small number of intensive care unit beds; there may simply be no capacity to respond to the early warnings.2
Data quality can also break the link between model performance and value. A diabetic retinopathy screening tool deployed by Google in Thailand struggled in real-world conditions because the retinal scans collected by nurses were often of insufficient quality for the model to analyze.3
Human behavior and workflow fit can also limit the benefits of a good model. In the pharmaceutical supply chain, a study found that staff frequently overrode highly accurate algorithmic forecasts in an effort to incorporate their own knowledge, and these adjustments often made the forecasts worse.4 Primary care providers have reported that a key barrier to doctors using decision support tools is the perception that they disrupt the consultation and slow down work.5
These examples highlight the need to consider not just performance on the predictive task, but also the broader organizational context in which the model will be used. Good performance on an isolated benchmark may not translate into real-world value if the model’s output cannot be acted on effectively within the constraints of the surrounding workflow.
The path from prediction to impact is often more complex than it first appears due to timing, resource availability, data quality, and human factors. To ensure models contribute meaningfully to business or organizational goals, their development should be guided from the outset by a clear understanding of the processes they are intended to support.
Current Use of Simulation for Evaluation
By replicating key elements of the workflow, simulation allows teams to explore the potential impact of a model’s predictions with fewer regulatory and safety hurdles than would be required for a pilot study or live deployment. For example, researchers used simulation to assess how predictive admission and discharge policies affect patient flow in hospitals, and another research group modeled discharge decision-making to estimate the potential impact of a triage support tool.6,7
Simulation is also central to the design and use of digital twins, which have become increasingly popular in manufacturing, logistics, energy, and healthcare. Digital twins are high-fidelity virtual representations of real-world systems that are frequently updated to reflect the current state of their physical counterparts. This allows ML models to be evaluated in realistic operational environments, such as testing a predictive maintenance model within a factory twin or inserting a forecasting model into a simulated warehouse to assess effects on stock levels or delivery delays. These tools are widely used for robustness testing, scenario evaluation, and deployment planning, but typically only after a model has already been developed.
In many sectors, there are significant barriers to starting model development. In healthcare, working with patient-level data often requires extensive ethical approvals, formal data governance procedures, and secure computing environments. In such cases, before committing resources to these processes, it is valuable to understand whether a model is likely to deliver impact even if it performs well.
A Simulation-First Approach
A simulation-first approach uses the same tools as post hoc evaluation but applies them at the start of the process, shifting the focus from validating a model that has already been built to deciding whether to build one at all.
By injecting synthetic AI or ML model outputs (e.g., predictions, recommendations, generated content) of varying quality into a simulator that models the workflow, teams can explore how the model’s performance translates into operational value. Would perfect foresight improve outcomes? If not, there may be little reason to invest further. But the case for development becomes stronger if a reasonably accurate model could generate impact.
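To make the idea concrete, the sketch below injects synthetic predictions of varying accuracy into a deliberately simple toy workflow with limited daily capacity and reports a single KPI (the fraction of true events acted on in time). Everything here, from the function names to the capacity constraint and the numbers, is hypothetical; a real simulator would model the organization's actual decisions and constraints.

```python
import random

def make_synthetic_predictions(true_events, accuracy, rng):
    """Mimic an imperfect model: keep each true outcome with probability
    `accuracy`, otherwise flip it. Purely illustrative."""
    return [e if rng.random() < accuracy else not e for e in true_events]

def run_workflow(predictions, true_events, daily_capacity=3):
    """Toy workflow: each flagged case consumes one unit of limited daily
    capacity. KPI: fraction of true events that were flagged and acted on."""
    acted, capacity = 0, daily_capacity
    for predicted, actual in zip(predictions, true_events):
        if predicted and capacity > 0:
            capacity -= 1
            if actual:
                acted += 1
    return acted / max(1, sum(true_events))

rng = random.Random(0)
true_events = [rng.random() < 0.3 for _ in range(20)]   # one simulated day
for accuracy in (0.6, 0.8, 1.0):                        # 1.0 = perfect foresight
    predictions = make_synthetic_predictions(true_events, accuracy, rng)
    print(f"accuracy={accuracy:.1f}  KPI={run_workflow(predictions, true_events):.2f}")
```

Even in this toy setting, when the number of cases needing action exceeds the daily capacity, perfect predictions cannot deliver a perfect KPI; that is precisely the kind of workflow effect that predictive metrics alone cannot reveal.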
This early insight can support project prioritization. A data science team always has more candidate projects than capacity, and a single workflow may offer multiple points where AI could be useful. Simulation allows teams to compare these opportunities based on likely business value before committing to data collection or model development.
A simulation-first approach also encourages early, cross-functional collaboration. To simulate a system, data scientists must engage with domain experts to understand decisions, constraints, and KPIs before model deployment is considered. These conversations ensure that the model’s performance is measured against the outcomes that really matter to the organization.
Crucially, a simulation-first approach enables experimentation not only with ML models themselves but also with the workflows in which they operate. For instance, a model may only deliver value if surrounding processes are adapted to act on its predictions — and if decision makers are willing to trust and use them. Simulation provides a safe environment to (1) test those adaptations without disrupting live operations and (2) understand whether taking full advantage of new technology will require bigger changes than simply replacing one part of an existing process.
This approach is intuitive and has been applied in a limited number of studies on supply chain forecasting and resource allocation.8-12 In settings where development is costly due to regulatory approvals, privacy risks, or high labeling effort, a simulation-first strategy offers a low-risk way to focus resources where they will most likely deliver value.
Case Study: Reducing Waste of Blood Products
Platelets (blood components essential for clotting) present a unique inventory challenge. With a shelf life of (at most) five days, hospitals must carefully balance stock levels to ensure they have enough on hand to meet unpredictable demand while avoiding waste due to expired units. At a large London teaching hospital, my research team observed that many platelet units were returned from wards unused after being requested by clinicians. The standard policy — issuing the oldest available unit — is optimal when all issued units are transfused. However, when units are returned, this practice often prevents them from being reissued before expiring.
This seemed like a good opportunity to use data to improve practice. If an ML model could predict which requests were likely to result in returns, it could support a new policy: issuing the oldest units when a transfusion is likely and the youngest when a return is expected. This approach would increase the chances that returned units remain usable.
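A minimal sketch of that issuing rule, assuming stock is represented simply as each unit's remaining shelf life in days, might look as follows (the function name and data representation are illustrative, not the hospital's actual system):

```python
def select_unit(stock, return_predicted):
    """Illustrative issuing rule, with stock given as each unit's remaining
    shelf life in days: issue the oldest unit (shortest remaining life) when a
    transfusion is expected, and the youngest (longest remaining life) when a
    return is predicted, so the unit is more likely to still be usable if it
    comes back."""
    if not stock:
        return None  # stock-out
    return max(stock) if return_predicted else min(stock)
```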
However, building the model would require patient-level data, involving long approval processes, integration of data from multiple health systems, and a significant investment of analyst time.
We therefore began by building a simulator to model the workflow in the hospital blood bank, including placing a replenishment order in the morning, selecting a unit to meet each clinical request, and disposing of expired units at the end of each day. We then simulated predictions from models with various levels of performance and assessed how they would affect key outcomes like waste and service level.
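The daily loop can be sketched as follows. This is a deliberate simplification of the actual simulator, with illustrative names: stock is a list of remaining shelf lives, each request carries a predicted and an actual outcome, and returned units simply re-enter stock the same day.

```python
SHELF_LIFE = 5  # maximum shelf life of a platelet unit, in days

def simulate_day(stock, order_qty, requests, issue_policy):
    """One simulated day in the blood bank: morning replenishment, unit
    selection for each clinical request, then end-of-day ageing and disposal.

    stock: remaining shelf life (days) of each unit on hand.
    requests: (predicted_return, actual_return) pairs for the day's requests.
    issue_policy: function choosing which unit to issue given the prediction.
    Returns the end-of-day stock plus counts of shortages and wasted units.
    """
    stock = stock + [SHELF_LIFE] * order_qty          # morning delivery
    shortages = 0
    for predicted_return, actual_return in requests:
        if not stock:
            shortages += 1                            # met by emergency order in practice
            continue
        unit = issue_policy(stock, predicted_return)  # e.g., the oldest-vs.-youngest rule above
        stock.remove(unit)
        if actual_return:
            stock.append(unit)                        # unused unit comes back into stock
    stock = [life - 1 for life in stock]              # overnight ageing
    waste = sum(1 for life in stock if life <= 0)     # expired units are discarded
    stock = [life for life in stock if life > 0]
    return stock, shortages, waste
```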
Model performance was controlled by adjusting the assumed sensitivity and specificity of the predictions. These measure, respectively, how well the model identified the cases we wanted to flag and how well it avoided false alarms. Each ranges from 0% to 100%. By setting both values to 100%, we could test whether even perfect predictions would make a difference. By varying sensitivity and specificity, we explored how different levels of performance would translate into improvements.
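As an illustration, synthetic predictions at a chosen operating point can be generated along the following lines (a sketch assuming NumPy; the names are hypothetical):

```python
import numpy as np

def synthetic_predictions(actual_returns, sensitivity, specificity, rng):
    """Generate hypothetical model predictions at a chosen operating point.

    A request that truly ends in a return is flagged with probability
    `sensitivity`; one that does not is falsely flagged with probability
    `1 - specificity`. Setting both to 1.0 mimics a perfect model."""
    actual = np.asarray(actual_returns, dtype=bool)
    draws = rng.random(actual.shape)
    return np.where(actual, draws < sensitivity, draws < (1.0 - specificity))

rng = np.random.default_rng(seed=0)
predictions = synthetic_predictions(
    actual_returns=[True, False, False, True], sensitivity=0.9, specificity=0.95, rng=rng
)
```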
The results showed that the model was worth building. A moderately accurate model would meaningfully reduce waste, assuming its predictions were acted on. The improvement was much larger still when the issuing policy was combined with optimized replenishment orders (how many units the blood bank should order from its supplier each day). This operational insight, showing how changes to multiple decision-making processes interact, would have been impossible to learn from predictive performance metrics alone.
The simulation results gave our team the confidence to proceed with model development, knowing that the time and effort required to secure and process clinical data would be worthwhile. Once the model was developed, the workflow simulator was used to help tune the hyperparameters of the ML model and estimate its real-world impact. We also explored how these benefits might vary across hospitals, finding that the model would be especially valuable in hospitals where a greater proportion of units are returned and where units tend to be older upon delivery. This demonstrates that a good simulator can support evaluation throughout the model development process.
Just as importantly, building the simulator prompted early engagement between stakeholders. Blood bank staff, clinicians, and data scientists worked together to define the decisions that mattered, the constraints that applied, and the metrics that should be used to determine success. This collaboration ensured that any model developed would be evaluated against practical criteria and shaped from the outset to fit the real-world context in which it would operate.
The simulation-first approach helped us identify whether ML would add value, how good the model would need to be to be “good enough,” and whether it would be useful in a workflow not originally designed for ML. The resulting policy is being tested further — not just because a model was built but because the system around it was understood, challenged, and carefully modeled.13
Broader Applications
This article focuses on healthcare, but a simulation-first approach is equally applicable wherever success depends not only on model quality but also on how well the model can be integrated into a workflow, or wherever there are significant barriers to starting model development.
In the energy sector, forecasting models are critical for balancing supply and demand, especially with increasing reliance on renewables. Deploying a new demand-forecast model involves more than improving predictive accuracy; it requires understanding how grid operators will act on those forecasts and how their decisions affect overall system stability, cost, and emissions. A simulation-first approach could help teams investigate these dynamics safely before committing to model development or infrastructure changes.
In manufacturing and logistics, a simulation-first approach could help teams assess whether predictive maintenance models or delay forecasts would meaningfully reduce downtime or improve service levels, especially when predictions must be embedded in tight production schedules or just-in-time inventory systems. Similarly, in public service delivery such as social care, emergency response, or transport planning, simulation-first could help teams assess whether better predictions about risk or demand would lead to improved outcomes or simply shift bottlenecks elsewhere.
Conclusion
In domains like digital advertising or product recommendation, A/B testing and rapid iteration make it straightforward to link model performance to business value. In contrast, that connection is much harder to establish in settings where interventions are high-stakes, data access is restricted, and broader system constraints limit how model outputs can be used. The simulation-first approach helps teams prioritize projects with the greatest potential for real-world impact by assessing a model’s likely business value before it is built — grounding that evaluation in a realistic simulation of how decisions are actually made.
Standard performance metrics used to evaluate ML and AI models offer only a partial view of a model’s usefulness. A simulation-first approach focuses evaluation on the KPIs that truly matter while encouraging early collaboration and exposing hidden constraints, ensuring that new models are not only accurate but also impactful.
References
1 Jung, Kenneth, et al. “A Framework for Making Predictive Models Useful in Practice.” Journal of the American Medical Informatics Association (JAMIA), Vol. 28, No. 6, December 2020.
2 Singh, Karandeep, Nigam H. Shah, and Andrew J. Vickers. “Assessing the Net Benefit of Machine Learning Models in the Presence of Resource Constraints.” JAMIA, Vol. 30, No. 4, February 2023.
3 Beede, Emma, et al. “A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy.” CHI’20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery (ACM), 2020.
4 Fildes, Robert, and Paul Goodwin. “Stability in the Inefficient Use of Forecasting Systems: A Case Study in a Supply Chain Company.” International Journal of Forecasting, Vol. 37, No. 2, April–June 2021.
5 Meunier, Pierre-Yves, et al. “Barriers and Facilitators to the Use of Clinical Decision Support Systems in Primary Care: A Mixed-Methods Systematic Review.” Annals of Family Medicine, Vol. 21, No. 1, January 2023.
6 Mišić, Velibor V., Kumar Rajaram, and Eilon Gabel. “A Simulation-Based Evaluation of Machine Learning Models for Clinical Decision Support: Application and Analysis Using Hospital Readmission.” npj Digital Medicine, Vol. 4, No. 98, June 2021.
7 Wornow, Michael, et al. “APLUS: A Python Library for Usefulness Simulations of Machine Learning Models in Healthcare.” Journal of Biomedical Informatics, Vol. 139, March 2023.
8 Dumkreiger, Gina. “Data Driven Personalized Management of Hospital Inventory of Perishable and Substitutable Blood Products.” Doctoral dissertation, Arizona State University, 2020.
9 Fildes, Robert, and Brian Kingsman. “Incorporating Demand Uncertainty and Forecast Error in Supply Chain Planning Models.” Journal of the Operational Research Society, Vol. 62, No. 3, 2011.
10 Altendorfer, Klaus, Thomas Felberbauer, and Herbert Jodlbauer. “Effects of Forecast Errors on Optimal Utilisation in Aggregate Production Planning with Stochastic Customer Demand.” International Journal of Production Research, Vol. 54, No. 12, March 2016.
11 Sanders, Nada R., and Gregory A. Graman. “Quantifying Costs of Forecast Errors: A Case Study of the Warehouse Environment.” Omega, Vol. 37, No. 1, February 2009.
12 Doneda, Martina, et al. “Robust Personnel Rostering: How Accurate Should Absenteeism Predictions Be?” arXiv preprint, 26 June 2024.
13 Farrington, Joseph, et al. “Many Happy Returns: Machine Learning to Support Platelet Issuing and Waste Reduction in Hospital Blood Banks.” arXiv preprint, 22 November 2024.