In a world increasingly dominated by artificial intelligence (AI) in its various guises, the danger of good data going bad (what we might call “data rot”) is becoming pervasive. In this fourth Advisor of my series, I consider a couple of real, and possibly unexpected, ways the rot can set in.
Data Rot Is the Decay of Data Integrity
The pivotal role of data in AI is undisputed. Meanwhile, the need for data integrity — accuracy, consistency, and appropriate contextualization — is becoming more widely appreciated. Sadly, few business leaders and data scientists realize just how lacking in integrity common data can often be. One problem may be the basic assumption that data can actually be collected. Data rot may have set in even as the project was specified.
In her 2018 book, Artificial Unintelligence, Meredith Broussard, data journalism professor at New York University, provides a perfect example of the problem. She tells a story, predating AI euphoria, of attempting to computerize schoolbook management in Philadelphia in the last decade, beginning with a 2009 quote from US Education Secretary Arne Duncan, who declared, “I am a deep believer in the power of data to drive our decisions. Data gives us the roadmap to reform.” Most data scientists would still applaud.
Broussard discovered that even creating a “simple” database of books in schools was hamstrung by the fact that school staff did not have time to do basic administrative and data-entry tasks. An ever-changing curriculum, an ongoing lack of funding, and high staff and student turnover all contributed to what was at heart a social and environmental problem. Databases and the concept of being data-driven derive from an engineering mindset. As Broussard says, “Engineering solutions are ultimately mathematical solutions. Math works beautifully on well-defined problems in well-defined situations with well-defined parameters. School is the opposite of well-defined.” So too is business. Take credit scoring, for example.
The Ill-Defined, Social Concept of a Credit Score
Modern data-driven, algorithmic, engineer-defined credit scoring uses hundreds or thousands of data points from a range of online and offline sources to assign credit ratings to potential loan customers. But what really is a credit rating? The data starts with traditional measures such as individual payment history and outstanding debt that can be reasonably expected to predict future credit worthiness in many cases. Such data is good (accurate, consistent, and in context, in principle and often in practice) and has been the foundation of FICO scoring since its widespread adoption in the 1970s, in part at least to eliminate racial, demographic, and similar biases prevalent in lending at the time. In effect, a credit rating was an attempt to get to some rational, consistent way of assessing credit worthiness without the lender having to personally know the borrower.
With the advent of big data, however, parameters for credit scoring have now been extended to browsing history, call records, retail behavior, demographic data, employment and address history, social networks, and even the behaviors of the applicant’s social media contacts. The assumption is that these data points are proxies for credit worthiness. The more data you have, and the more parameters you can tweak, the better you can estimate the credit worthiness of an applicant — so the story goes.
The assumptions are largely unprovable. Furthermore, the data — even if individually good — is being aggregated in ways that were never intended in algorithms that take no account of the underlying context or meaning of the data. Worse still, according to Mikella Hurley and Julius Adebayo, writing in the Yale Journal of Law and Technology, “These tools may also perpetuate and, indeed, intensify, existing bias by scoring consumers on the basis of their religious, community, and familial associations, as well as on the basis of sensitive features such as race or gender.” Which brings us full circle to “good data gone bad” and the question of what the definition of credit score is and which problems it was originally designed to address.
Does the AI End Justify the Data Means?
As AI and the data underlying it extend their reach into all areas of society, hidden dangers lurk in seemingly “normal” business goals. Algorithms and big data have dramatically altered the balance of power between businesses and their customers, as well as governments and their citizens. When that happens, data rot may invisibly creep into projects to update, automate, augment, or generally “improve” existing processes using machine learning.
The Fair Credit Reporting Act of 1970 was conceived by the US government to protect citizens against biased and exploitative behavior by some businesses. Of course, business also benefited through managing risk of default even as the credit market expanded to potentially riskier audiences. The underlying data was good and the scoring algorithm straightforward. More recently, as government regulation has become lax, big data and AI have enabled lenders to refocus efforts on risk avoidance irrespective of societal side effects.
As Cathy O’Neil explains in Weapons of Math Destruction, the system creates a nasty feedback loop for borrowers. Based on their social milieu, the scoring system will give a borrower from a risky area or population segment a low score because a lot of people default there. The result is less available credit and higher interest rates — discrimination against those who are poor and already struggling. Companies using these systems “measure success by gains in efficiency, cash flow, and profits. With few exceptions, concepts like justice and transparency don’t fit into their algorithms,” according to O’Neil.
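The feedback loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of any real scoring system: the function name `simulate_feedback`, the linear pricing rule, and every parameter value are hypothetical and chosen purely for illustration. It assumes the lender prices credit from the default rate of the borrower's area, and that costlier credit in turn pushes that area's default rate higher.

```python
def simulate_feedback(initial_default_rate, rounds=5, base_rate=0.05,
                      risk_premium=0.5, sensitivity=0.4):
    """Toy model of area-based scoring feeding back into defaults.

    All parameters are illustrative assumptions, not empirical values.
    Returns a list of (interest_rate, default_rate) pairs, one per round.
    """
    history = []
    default_rate = initial_default_rate
    for _ in range(rounds):
        # The lender prices the loan from the area's default rate,
        # not from the individual borrower's own record.
        interest_rate = base_rate + risk_premium * default_rate
        history.append((interest_rate, default_rate))
        # Assumption: the extra cost of credit causes additional defaults.
        extra_cost = interest_rate - base_rate
        default_rate = min(1.0, default_rate + sensitivity * extra_cost)
    return history


if __name__ == "__main__":
    for label, start in (("higher-risk area", 0.10), ("lower-risk area", 0.02)):
        print(label)
        for rate, defaults in simulate_feedback(start):
            print(f"  interest {rate:.3f}  defaults {defaults:.3f}")
```

Under these assumptions, the gap in default rates between the two areas widens each round, even though nothing about any individual borrower has changed.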
Credit scores, which in their original form were “good,” have been devalued and debased to “bad” by a lack of attention (perhaps a generous interpretation) to the implications of using the underlying data in inappropriate ways. Herein lies the second incorrect assumption about data: if you have it, there are multiple, novel ways in which it can be used in both new and existing processes. Sometimes, indeed, it can. But often, this new use may be incorrect, illogical, unethical, or illegal: the result of insidious forms of data rot that may arise in previously good data collected for valid reasons but thoroughly unsuitable for the proposed new use.
When Good Data Goes Bad: Conclusion IV
AI drives new ideas about and processes for doing business, perhaps transforming in unplanned ways the underlying relationship between a business and its customers and partners. Often these transformations rely on data previously collected for other purposes. Sometimes, they may depend on people inputting data manually. In either case, system designers must ask the questions: Is the data we’re getting fit for purpose in the new process even though it may have been perfectly adequate for the process for which it was originally defined? Does this new process respect the social contract, either explicit or implicitly understood, under which the data was collected? Where manual input is required, what social or environmental conditions may cause the data to be uncollectable, incorrect, or deliberately contaminated? In short, is the data still good, or did it go bad en route to that marvelous, new AI system?
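One way a design team might make these questions routine is to record, alongside each data set, the purposes for which it was collected and whether it depends on manual entry, and then check every proposed new use against that record. The sketch below is a minimal illustration of that idea; the `DatasetRecord` type, the `fitness_concerns` function, and the purpose labels are all hypothetical names invented here, not part of any existing tool or standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata kept alongside a data set."""
    name: str
    collected_for: frozenset        # purposes the data was originally collected for
    manually_entered: bool = False  # True if people must key the data in by hand


def fitness_concerns(dataset, proposed_use):
    """Return a list of questions to resolve before reusing the data."""
    concerns = []
    if proposed_use not in dataset.collected_for:
        concerns.append(
            f"'{dataset.name}' was not collected for '{proposed_use}': "
            "is it fit for this purpose, and does the reuse respect the "
            "terms under which it was gathered?"
        )
    if dataset.manually_entered:
        concerns.append(
            f"'{dataset.name}' relies on manual input: what conditions could "
            "make it uncollectable, incorrect, or deliberately contaminated?"
        )
    return concerns


if __name__ == "__main__":
    browsing = DatasetRecord("browsing_history", frozenset({"ad_targeting"}))
    for concern in fitness_concerns(browsing, "credit_scoring"):
        print(concern)
```

A check like this cannot answer the questions; it only ensures that someone is prompted to ask them before the data reaches the new process.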