
Accountable AI?

Posted October 13, 2025 | Technology | Amplify

AMPLIFY, VOL. 38, NO. 6

ABSTRACT
Paul Clermont reminds us of a crucial but under-recognized trait of true AI: learning and improvement in response to feedback independent of explicit human design. The implementation of human requirements is thus both highly feasible and dialogic. Clermont stresses the high level of responsibility borne by humans when interacting with AI. Critical thinking about inputs and outputs, and awareness of both objectives and social context, remain firmly human (and sometimes regulatory) responsibilities — no matter how widely terms like “AI accountability” have gained currency.

 

The term “artificial intelligence” goes back to 1956, but the general public heard little of it beyond occasional headlines when IBM’s Deep Blue beat the reigning world chess champion in 1997 and Watson beat the all-time Jeopardy champions in 2011. Remarkable achievements, but like putting men on the Moon, unrelated to daily life.

Behind the scenes, serious money was being invested by serious people, and in late 2022, ChatGPT startled the world. We could type in a natural language request and — within seconds — get sensible answers in readable, natural-sounding, grammatically correct language. Suddenly, AI was all over the news, and a broad swath of the public started using one or more of the products that rolled out.

Simultaneously, businesses and governments began exploring and implementing AI for everyday processes. Today, it’s not difficult to envision all kinds of routine office work being done better, faster, and much more cheaply with AI. It’s also easy to envision overdependence on AI to the point where no one understands how it works when something goes wrong, propagating problems that take months or years to unravel, if they can be unraveled at all.1

What Is AI Really?

As often happens when an idea emerges that promises great profit opportunities, there’s a bandwagon effect. Executives claim to be implementing it, even if the application is trivial or does not embody the idea, strictly speaking. Indeed, some seem to be using the term “AI” to describe any smart application that is supposed to make the kind of intelligent decisions once the sole province of humans.

No matter how sophisticated it is, if every step is prescribed by the designer, it’s not AI (it’s the designer’s natural intelligence — nothing artificial). If the application is designed to learn to improve its performance, it’s AI.

Generative AI (GenAI), by far the most visible, is trained on text and images and can respond to natural language questions and directions, producing answers or appropriate images. Since OpenAI introduced ChatGPT in 2022, it has released improved versions and been joined by competing products. These products rely on huge statistical models (large language models [LLMs]) to find relevant information and turn it into good-quality text, and they’re designed for wide public use. A subset of GenAI is domain-specific GenAI. It’s the same basic idea but with smaller, thoroughly vetted databases relevant to a specific knowledge domain (e.g., protein folding). These systems are designed by experts for experts, not the public.

Real-time process control has gone public in the form of robotaxis plying the streets of San Francisco, California, and other cities. They use AI to integrate continuous inputs from multiple cameras plus multiple radar, LiDAR (light detection and ranging), and acoustic sensors to establish situational awareness of their vicinity.

Abnormality identification involves distinguishing signal from noise in tasks like scanning X-rays or surveilling facilities. Both this and real-time process control involve on-the-job training by humans. A different but conceptually related application is the use of drones to identify the best targets for military action and course-correct in flight to hit them.

Facial recognition, while a natural for AI, is highly controversial because it performs less well with darker-skinned people, leading to false matches (and/or failures to match) for racial minorities. Darker-skinned people are often underrepresented in training data. Increasing their representation may improve these products, but there is no guarantee.

Combinatorial challenges are the oldest AI application and exist in games like chess and Go, where success depends on identifying optimal moves, anticipating countermoves, and planning effective responses. Today, similar approaches are used to uncover vulnerabilities in our systems — or to proactively identify weaknesses in competitors’ systems.
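
To make the game-playing idea concrete, the sketch below shows classic minimax search: evaluate each legal move by assuming the opponent will answer with its own best reply, and choose the move with the best worst-case outcome. This is only an illustrative sketch in Python; the game interface (legal_moves, apply, is_over, score) is hypothetical, and modern systems such as AlphaGo layer learned evaluation functions and far more sophisticated search on top of this basic idea.

    # Minimal minimax sketch. The `game` object is a hypothetical stand-in for a
    # real engine; score() returns a value that is positive when the position
    # favors the maximizing player.
    def minimax(game, depth, maximizing):
        if depth == 0 or game.is_over():
            return game.score()
        values = (minimax(game.apply(move), depth - 1, not maximizing)
                  for move in game.legal_moves())
        return max(values) if maximizing else min(values)

    def best_move(game, depth=4):
        # Pick the move whose resulting position looks best after the
        # opponent's (assumed optimal) replies.
        return max(game.legal_moves(),
                   key=lambda move: minimax(game.apply(move), depth - 1, False))

Because the full game trees of chess and Go are astronomically large, real programs add pruning and learned evaluation, which is precisely the combinatorial challenge described above.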

AI’s Advantages Are Compelling

Compared with humans:

  • AI is faster by orders of magnitude.

  • AI doesn’t get bored, distracted, or tired (many auto accidents are due to momentary attention lapses2).

  • AI is thorough; it doesn’t cut corners that it hasn’t learned are safe to cut.

Perhaps most importantly, AI can learn. In games, it learns from mistakes, avoiding moves that lead to dead ends. In robotaxi training, it improves through real-time feedback from a human driver. Over time, an AI system can build on its own history, adding value beyond what its original programmers designed in.

GenAI Is Not Without Frailties

GenAI is a major accomplishment that descriptors like “stochastic parrot” tend to diminish.3 That said, its shortcomings must be acknowledged.

First, GenAI lacks inherent common sense and has no intuitive ability to distinguish the implausible — or even the absurd — without targeted training. It may appear moral or politically correct if its training data leans that way (or in the opposite direction), but such behavior is not intrinsic. It can be tuned to favor positive, neutral, or skeptical feedback. For example, GPT-5 adopted a much less positive flavor than GPT-4o, to the annoyance of some users, and GPT-4o’s allegedly indiscriminate positivity may have contributed to the suicide of a teenage boy.4,5

Second, GenAI has a propensity to make stuff up, which the industry euphemizes as “hallucinations.” Unfortunately, it’s good at this, to the dismay of lawyers who submitted briefs referring to nonexistent but plausible-sounding cases.6 It’s also good at making up plausible-sounding and correctly formatted citations in scholarly work.7

Recent attempts to build in some semblance of reasoning have, paradoxically, increased the incidence of hallucinations in some cases. Chain-of-thought reasoning that shows explicit steps reduces the incidence if the questioner has some notion of what the chain should look like and the initial premises are clear.8 Some research has shown severe limitations in these models’ problem-solving capabilities.9

Third, GIGO (garbage in, garbage out) still applies. An AI application is no better than the information it was trained on. There have been cases where innocent prompts elicited dangerous, outrageous, or obscene responses. This is called “going rogue” or adopting a “bad-boy persona,” and a fair amount of attention has been paid to understanding how this happens (some bad stuff finds its way into the LLM) and developing techniques to get the model back on track.10 There’s also a danger that future LLMs will be trained on too much unvetted material, including output from earlier LLMs and scholarly or technical papers that have been challenged and retracted.11 As with hallucinations, it’s “caveat user” (let the user beware).

3 Challenges

As organizations and societies race to adopt AI, it’s easy to be swept up by bold promises and breathtaking demonstrations. But beneath the surface, critical weaknesses remain that demand clear-eyed attention. To separate genuine progress from misplaced optimism, we must confront three challenges.

Challenge 1: Retain Some Skepticism

One thing is crystal clear: if the stakes are high, trusting the first response to a GenAI request is not a good idea. One should pose the request more than once, using different wording and syntax and coming at it from different directions. Independent verification from other sources is ideal; if that isn’t possible, there’s no substitute for a good common-sense sniff test.

We must realize that the issues discussed above arise from the fact that a GenAI has no idea what we’re talking about in our prompts. To the model, a prompt is just a string of words that helps it find useful data (if it’s in the LLM) and formulate answers based on the probability distribution of which words follow others; hence the term “stochastic parrot.”
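
As a rough illustration of that mechanism, the toy Python sketch below generates text purely from the observed frequencies of which word follows which in a tiny corpus. Real LLMs use deep neural networks over subword tokens and far longer contexts, but the basic loop of sampling a plausible next token is the same in spirit; the corpus and names here are purely illustrative.

    import random
    from collections import Counter, defaultdict

    # Toy "stochastic parrot": count how often each word follows another in a
    # tiny corpus, then generate text by repeatedly sampling a likely next word.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()

    follower_counts = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        follower_counts[current][nxt] += 1

    def generate(start, length=8):
        word, output = start, [start]
        for _ in range(length):
            followers = follower_counts.get(word)
            if not followers:  # dead end: no observed continuation
                break
            words, weights = zip(*followers.items())
            word = random.choices(words, weights=weights)[0]
            output.append(word)
        return " ".join(output)

    print(generate("the"))  # e.g., "the dog sat on the mat and the cat"

The sketch has no notion of truth or meaning, only of which words tend to follow which, which is exactly why fluent but false output is a natural failure mode.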

That it works at all seems like a miracle, and it may be that the LLM approach simply cannot be refined to a sufficient level of trustworthiness for some tasks (e.g., agentic applications that act on a user’s behalf). That’s not to minimize the achievements to date; it’s to recognize that the technical approach has inherent limitations that, at the very least, call into question enthusiasts’ forecasts of artificial general intelligence (AGI) as just a few steps down the current technical path.

I don’t claim to understand the technology that could enable this huge leap, but I do know something about the IT industry’s history of overpromising and underdelivering. If we take the term “general intelligence” literally (i.e., the intellectual capacity one would expect of a person considered generally intelligent), I suggest we not hold our breath for AGI. Any claims to having achieved it over the next few years should be met with skepticism. There will likely be large improvements over what we have today, but calling it AGI will almost certainly be based on a notion of general intelligence that’s so diminished it would be hard to recognize as such.

Challenge 2: Measure Against the Right Standards

If AI is to be trusted, it must be measured against the right standards. Speed benchmarks are nearly meaningless — producing a hallucination faster is no achievement. What matters are quality and safety. Just as early automobiles were judged not by acceleration but by braking distance and reliability, today’s AI must be assessed on trustworthiness.

That means asking questions such as: Does the system minimize hallucinations? Can it interpret prompts without being derailed by slight wording changes? Is it energy efficient, given the enormous power demands of large models? And, in specialized domains, is the training data sufficiently vetted to guarantee near-perfect accuracy?

Legal and ethical compliance is another key benchmark. At a minimum, AI systems must operate within the law. Ethics are harder to define, but violations can be equally damaging, especially in fields like medicine or law that have explicit professional codes.

For now, expecting AI to deliver measurable ROI is premature. The New York Times recently noted that billions invested in AI have yet to pay off in office productivity.12 But this should not be surprising. Like earlier waves of computing, AI requires organizational adaptation before benefits emerge. That process takes time.

As AI evolves, evaluation methods must adapt. The temptation will be to seek cost savings quickly, often by cutting staff. But customer-facing applications show the risks of premature reliance. Who has not been infuriated by chatbots that fail to grasp a simple request? Quality of experience — whether for customers or employees — must come first.

Challenge 3: Address Societal Issues

Most inventions have provided good answers to “what can this do for us” questions, even if some ill effects took decades to materialize. AI is different: it raised “what can this technology do to us” questions right from the start.

Failure to address this can result in a backlash that ends in (1) throwing the baby out with the bathwater or (2) dystopian change. Problem areas include:

  • Mass creation and instant global distribution of misinformation and disinformation could flood the zone with trash to the point where most people give up trying to find out what’s really going on — creating an environment in which autocrats and oligarchs can easily wreak havoc. The rapidly improving quality of image and sound deepfakes increases the likelihood of this future.

  • GenAIs could seem so human that some people forget they’re just parrots that, if tuned to provide positive feedback, can encourage self-harm to the point of suicide.

  • Material harmful to children and many adults could be mass-produced and quickly distributed.

  • Invasions of privacy, harassment of individuals, and scams could be facilitated as databases are hacked (using AI to find security holes).

  • The finding and exploiting of security holes in government and financial databases and physical process-control systems could lead to cyberwarfare.

  • AI could be the first technology that creates massive persistent unemployment well beyond low-level jobs previously mechanized or automated away.

Conclusion

For better and worse, AI is with us. The question mark in this article’s title is deliberate. Obviously, the AI hardware and software can’t be accountable, but those who use or distribute its products can be; hence, the importance of trustworthiness.

The current state of GenAI is far less trustworthy than it needs to be, and experts are increasingly questioning whether the technology can ever reach that level — casting doubt on the staggering scale of investments built on the assumption that it can and will.

Governments have a natural role in dealing with innovations like AI that have a potential downside for societies, but they’re notoriously slow in addressing today’s problems, never mind getting out in front of tomorrow’s. Of course, that doesn’t mean they shouldn’t try. We must hope that the copious money Big Tech has available to spread around won’t cause too many legislators and officials to look the other way.

We are living in increasingly interesting times.

References

1. “To err is human, but to really foul things up, you need a computer” — quote attributed to American biologist Paul Ehrlich.

2. “Virginia Tech Transportation Institute Releases Findings on Driver Behavior and Crash Factors.” Virginia Tech News, 20 April 2006.

3. Bender, Emily M., et al. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery (ACM), March 2021.

4. Noreika, Alius. “OpenAI Restores GPT-4o After Encountering User Dissatisfaction With GPT-5.” Technology.org, 11 August 2025.

5. Yousif, Nadine. “Parents of Teenager Who Took His Own Life Sue OpenAI.” BBC, 27 August 2025.

6. Mangan, Dan. “Judge Sanctions Lawyers for Brief Written by AI with Fake Citations.” CNBC, 22 June 2023.

7. Gedeon, Joseph. “RFK Jr’s ‘Maha’ Report Found to Contain Citations to Nonexistent Studies.” The Guardian, 29 May 2025.

8. Yao, Zijun, et al. “Are Reasoning Models More Prone to Hallucination?” arXiv preprint, 29 May 2025.

9. Shojaee, Parshin, et al. “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” arXiv preprint, 18 July 2025.

10. Hall, Peter. “OpenAI Can Rehabilitate a Model That Has Developed a ‘Bad-Boy Persona.’” MIT Technology Review, 18 June 2025.

11. Ananya. “AI Models Are Using Material from Retracted Scientific Papers.” MIT Technology Review, 23 September 2025.

12. Lohr, Steve. “Companies Are Pouring Billions into AI. It Has Yet to Pay Off.” The New York Times, 13 August 2025.

About The Author
Paul Clermont
Paul Clermont is a Cutter Expert. He has been a consultant in IT strategy, governance, and management for 40 years and is a founding member of Prometheus Endeavor, an informal group of veteran consultants in that field. His clients have been primarily in the financial and manufacturing industries, as well as the US government. Mr. Clermont takes a clear, practical view of how information technology can transform organizations and what it takes…