Advisor

When Good Data Goes Bad, Part V

Posted November 9, 2021 in Data Analytics & Digital Technologies

In Part IV of this Advisor series, I introduced the phrase data rot. (I might even claim to have coined it!) However, rather like the title of the series, it implies a certain passivity. Data rots or “goes bad” of its own accord, perhaps naturally — organic matter decays as a matter of course — or through neglect, often benign, like that rotten siding on your porch you forgot to paint for five years.

Most of the examples we have discussed fall into this latter category. IT didn’t provide sufficient data management resources or tools to ensure data quality. Data governance teams didn’t think deeply enough about the ethical or other consequences of reusing data for a new artificial intelligence (AI) program. “Technochauvinists,” as described by Meredith Broussard in her 2018 book Artificial Unintelligence: How Computers Misunderstand the World, tried — and keep trying — to solve every problem and improve every process solely through the application of magical technology.

But there are also particular business behaviors that actively poison data or degrade data quality. Of course, that’s not their intention, but it’s the direct consequence of how these activities are performed. Let’s take a look at data poisoning!

When Business “Excels” Itself

In the midst of the COVID-19 pandemic, 16,000 UK test results went missing for a week in October 2020. In March 2019, a construction company in the US state of South Dakota submitted a bid of US $6.5 million, some $3 million lower than the correct price. A month earlier, the cannabis industry’s largest company had to refile its results, adjusting its EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization) loss from CA $52 million to CA $155 million.

What is the common thread in these and dozens of other financial and reporting snafus? The clue is in the source. The European Spreadsheet Risks Interest Group (EuSpRIG) has collected more than 100 horror stories of major errors arising from the use of spreadsheets reported since the mid-1980s. Given the embarrassment involved, it is reasonable to assume that many more remain unreported. EuSpRIG offers a set of best practices that is well worth reading, but the question we must pose is: what single practice, if any, could underlie such problems among otherwise professional practitioners?
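To see how easily such figures go wrong, consider a minimal, contrived sketch in Python of one of the classic failure modes behind these stories: a hard-coded formula range, the spreadsheet equivalent of =SUM(B2:B4), that silently ignores rows added later. The numbers are invented for illustration and are not taken from any of the incidents above.

```python
# A contrived sketch of the hard-coded-range bug. In a spreadsheet,
# a total written as =SUM(B2:B4) keeps summing only those three cells,
# no matter how many cost lines are appended below them.

line_items = [120_000, 450_000, 310_000]  # costs when the "formula" was written
FORMULA_RANGE = 3                         # the range the formula still covers

# Later, someone appends two more cost lines but never updates the range.
line_items += [95_000, 275_000]

reported_total = sum(line_items[:FORMULA_RANGE])  # what the stale formula shows
true_total = sum(line_items)                      # what the total should be

print(f"Reported total: {reported_total:,}")  # Reported total: 880,000
print(f"Correct total:  {true_total:,}")      # Correct total:  1,250,000
print(f"Shortfall:      {true_total - reported_total:,}")  # Shortfall: 370,000
```

The code runs without error, just as the spreadsheet recalculates happily, and that is precisely the danger: nothing signals that the total is stale.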

Proponents of business intelligence (BI) suggest the problem lies in spreadsheets themselves. Wayne Eckerson, then-director of research at The Data Warehousing Institute, lamented in 2003, “Spreadsheets run amuck in most organizations. They proliferate like poisonous vines, slowly strangling organizations by depriving them of a single consistent set of information and metrics….”1 The lament has been sung many times since, and the mourners aren’t wrong in their description of the problem and its consequences. However, the most common solution — death to the infidel spreadsheets and their users — has never succeeded. Spreadsheets are too widespread and too deeply embedded in organizations of every kind. And they do offer significant value — when they are properly designed and coded.

Unfortunately, BI advocates, and now their counterparts in analytics and AI, promote and exacerbate the same problem under the banner of “self-service for all.” The argument is that the business need for data-driven decision making is so extensive, so changeable, and so rapidly growing that the poor IT department, overstretched and underfunded, can never hope to keep up. And, of course, businesspeople would really like to be self-sufficient. Et voilà, the dreadful IT bottleneck is removed at a stroke, and the business is happy again. For these reasons, self-service has been a major push across the IT industry since almost the earliest days of data warehousing.

The concept of self-service is, I propose, the underlying problem in data poisoning.

Self-Serve Data Is Like Soft-Serve Ice Cream

You may love to eat soft serve but, to avoid food poisoning, preparing it at scale is best left to professionals. In my experience, businesspeople love to consume data, but most prefer it well sweetened and oven-ready. To mix metaphors even more wildly, they further demand that it be perfectly packaged, cleansed, and ready to eat. To paraphrase Freddie Mercury, they want it all, they want it perfect, and they want it now.

BI practitioner Martijn ten Napel has addressed this topic at length. In his opening statement to the December 2019 Cutter Business Technology Journal (CBTJ), he says:

I have yet to encounter a data architecture that accounts for the consequences of the use of data or has stern requirements regarding the consequences of data use. If mentioned at all, consequences are often viewed as a risk to be mitigated and not regarded as a fundamental property of a data architecture.

He offers his own connected architecture framework, where:

“Connected” emphasizes the necessary connection between different people, with different skills, who need to collaborate in acquiring, sorting, storing, modeling, processing, analyzing, interpreting, drawing conclusions, and taking action on the conclusions.

This emphasis on collaboration between different people with different skills lands us right in the problem zone with self-service. It is unreasonable to expect businesspeople — often with limited data management, programming, or testing skills — to adequately perform all the steps necessary to deliver consistently good spreadsheets or BI analyses from scratch. Like young children tackling their first soft-serve ice cream, they end up wearing most of it on their t-shirts. And although children quickly learn the tricks of getting all the ice cream into their mouths, successful consumption of self-service data demands skills and tricks a couple of orders of magnitude more difficult to learn.

Many businesspeople will never come to grips with the full breadth of what is needed. Indeed, some will rightly ask why they should be expected to do so.

Context Is Key but Complex in Nature

In architectural terms, the issue here is that information always exists within a context. Indeed, the same piece of information will move from context to context throughout its lifecycle. Understanding, describing, and managing this context flow is complex and has long been largely ignored by the industry. The same observation applies to data, but with even greater impact, because data is simply information from which some or most of the context has been stripped.

Beyond “basic” programming errors, the cause of most spreadsheet and BI problems is a lack of understanding of the context in which the data was created and the constraints on its use. In data preparation for a data warehouse, a vital activity is contextualizing the data in the warehouse so that businesspeople are less likely to misunderstand or misuse it. Sadly, data management often fails in this complex process. Worse still, self-service moves responsibility for this step to businesspeople who likely have little or no knowledge of the original data context or its constraints. Eliminating the IT bottleneck may enable faster delivery of insights, but at what cost when the data is wrongly interpreted and used?
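To make the idea of contextualization concrete, here is a minimal sketch in Python. It is purely illustrative (the class, field names, and figures are my own invention, not part of any warehouse product or of the connected architecture framework), but it shows the principle: carry the context, such as units, period, lineage, and known constraints, along with the value, rather than handing businesspeople a bare number.

```python
# A minimal, hypothetical sketch of "data with its context attached."
# A bare float is easy to misread; a value that carries its unit,
# period, lineage, and known constraints is much harder to misuse.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ContextualValue:
    value: float
    unit: str                   # e.g., "CAD millions", never just a number
    as_of: date                 # the period the figure describes
    source: str                 # lineage: where the figure came from
    unsuitable_for: tuple = ()  # documented limits on reuse

    def for_use(self, purpose: str) -> float:
        # Refuse silent reuse outside the documented constraints.
        if purpose in self.unsuitable_for:
            raise ValueError(
                f"{self.source}: not suitable for '{purpose}' "
                f"({self.unit}, as of {self.as_of})"
            )
        return self.value

# Illustrative figures only, loosely echoing the EBITDA example above.
ebitda = ContextualValue(
    value=-155.0,
    unit="CAD millions",
    as_of=date(2019, 2, 28),
    source="restated quarterly filing",
    unsuitable_for=("comparison with pre-restatement figures",),
)

print(ebitda.for_use("board report"))  # -155.0, context checked first
```

Nothing here is sophisticated; the point is architectural. The constraints travel with the data instead of living only in the heads of the IT staff who prepared it.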

My own contribution to the above-mentioned CBTJ issue deals at length with the problem of context and offers the manifest meaning model (m3) as a way of dealing with it. It posits that the path from data to data-driven — or preferably, information-informed — decision making traverses both the internal knowledge of the decision makers and the socially influenced meaning they ascribe to the information. This process is poorly understood and seldom considered by the promoters of self-service BI and analytics.

Conclusion: When Good Data Goes Bad

Data often doesn’t go bad in some inevitable and passive rotting process. It is poisoned by the abuse or misuse perpetrated by businesspeople who don’t understand the constraints on its use: constraints imposed by the many and varied contexts through which the information has passed. This is not to blame the business. Why should we expect sales managers, production supervisors, or supply chain controllers to know or understand the often-convoluted lineage of the data they receive and use? Such knowledge exists mainly in the IT and data management disciplines. Self-service without proper prior data management — considered to be a “bottleneck” by many proponents of the approach — is a recipe for poisoned data.

Note

1. Sadly, this historic article is no longer available online.

Image by Josué Nunes from Pixabay

About The Author
Barry Devlin
Dr. Barry Devlin is a Senior Consultant with Cutter’s Data Analytics and Digital Technologies practice, a member of Arthur D. Little's AMP open consulting network, and an expert in all aspects of data architecture, including data warehousing, data preparation, analytics, and information management. He is dedicated to moving business beyond mere intelligence toward real insight and innovation.