Data Integration Vs. Data Quality: Friends or Foes?

You are here

Data Integration Vs. Data Quality: Friends or Foes?

Advisor
Posted May 1, 2018 in Data Analytics & Digital Technologies
Data Analytics & Digital Technologies



[This article is based on the authors’ upcoming book, Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data.]

Traditionally, applications and databases were organized around functional domains, such as accounting, human resources, logistics, CRM, and so on. Every department or business unit worked with its own data silo (e.g., a file or database) with no cross-department integration. Whereas the silos mainly aimed at operational support, a next phase saw the emergence of business intelligence (BI) and analytics applications, fueled by the need for data-driven tactical and strategical decision making, with a company-wide impact. To sustain this company-wide view, data from the silos was transformed, integrated, and consolidated into a company-wide data warehouse. ETL (extract, transform, load) processes supported the asynchronous data extraction and transfer from the source systems (the operational data silos) to the target data warehouse. However, the ETL process was typically time-consuming, so there was a certain latency between the up-to-date operational data stores and the slightly outdated data warehouse. This latency was tolerable: real-time business intelligence was not the goal in traditional data warehousing.

Because of this evolution, we were confronted for nearly two decades with a dual data storage and processing landscape, supported by two very distinct scenes of tool vendors and products. Nowadays, we see a complete convergence of the operational and tactical/strategic data needs of the corresponding data integration tooling.

This trend was initiated by new marketing practices centered on proactive (instead of reactive) actions requiring a complete understanding of the customer, and quickly spread toward other functional domains. It culminates in the term “operational BI,” with a twofold meaning. First, analytics techniques are more and more used at the operational level as well as by front-line employees. Second, analytics for tactical/strategic decision making increasingly uses real-time operational data combined with the aggregated and historical data found in more traditional data warehouses. This evolution poses interesting challenges to the landscape of data storage and data integration solutions.

Data Integration

Data integration aims at providing a unified view and/or unified access over different, and possibly distributed, data sources. The data itself may be heterogeneous and reside in difference resources (e.g., XML files, legacy systems, relational databases). The desired extent of data integration will highly depend upon the required quality of service characteristics. Data will never be of perfect quality, so a certain level of inaccurate, incomplete, or inconsistent data may have to be tolerated for operational BI to succeed. Different data integration patterns exist to provide this unified view:

  • Data consolidation aims at capturing the data from multiple, heterogeneous source systems and integrating it into a single persistent store (e.g., a data warehouse, data mart, or data lake). This is typically accomplished using ETL routines.

  • Data federation typically follows a pull approach, where data is pulled from the underlying source systems on an on-demand basis. Enterprise information integration is an example of a data federation technology and can be implemented by realizing a virtual business view on the dispersed underlying data sources.

  • Data propagation corresponds to the synchronous or asynchronous propagation of updates or, more generally, events in a source system to a target system. It can be applied in the interaction between two applications (enterprise application integration) or in the synchronization between two data stores (enterprise data replication).

  • Changed data capture can detect update events in the source data store, and trigger the ETL process based on these updates. In this way, a “push” model to ETL is supported.

  • Data virtualization builds upon the basic data integration patterns discussed previously but isolates applications and users from the actual integration patterns used.

  • Data as a service offers data services as part of the overall SOA, where business processes are supported by a set of loosely coupled software services.

The aspect of data integration is also heavily related to that of data quality. Data quality can be defined as “fitness for use,” meaning that the required level of quality of data depends on the context or task at hand. Data quality is a multidimensional concept involving various aspects or criteria by which to assess the quality of a data set or individual data record. The following data quality dimensions are typically highlighted as being important:

  • Data accuracy — referring to whether the data values stored are correct (e.g., the name of the customer should be spelled correctly)

  • Data completeness — referring to whether both metadata and values are represented to the degree required and are not missing (e.g., a birth date should be filled out for each customer)

  • Data consistency — relating to consistency between redundant or duplicate values, and consistency among different data elements referring to the same or a related concept (e.g., the name of a city and postal code should be consistent)

  • Data accessibility — which reflects the ease of retrieving the data

Approached from the angle of data integration, it is important to mention that data integration can aid in improving data quality, but might also hamper it. Data consolidation and ETL allow the performance of different transformation and cleansing operations, so the consolidated view of the data should be of higher quality, but one might, appropriately, wonder why it wouldn’t be better to invest in data quality improvements at the source. The same is true for environments where, throughout time, different integration approaches have been combined, leading to a jungle of legacy and newer systems and databases that now all must be maintained and integrated with one another. This is a key challenge for many organizations and one that is indeed very difficult to solve. In these settings, master data management (MDM) is frequently mentioned as a management initiative to counteract these quality-related issues.

Master Data Management

MDM comprises a series of processes, policies, standards, and tools to help organizations define and provide a single point of reference for all data that is “mastered.” Its key concern is to provide a trusted, single version of the truth on which to base decisions to ensure that organizations do not use multiple, potentially inconsistent versions of the same concept in different parts of their operations. The focus is on unifying company-wide reference data types, such as customers and products.

This might seem straightforward, but imagine the situation of a large bank where one department is using an operational customer database for its day-to-day interactions, while the marketing department is setting up a campaign by selecting leads from a data warehouse using a BI tool, which is running behind compared to the operational view. A customer that has just taken out a mortgage at the bank might receive a mortgage solicitation a week later, as the customer information used by the marketing department lacks fast or solid integration with the customer operational systems.

Putting a data federation or virtualization solution on top can help, but converting all departments and applications to go through this newer layer can take years, let alone coming up with a clear mapping on the current data and systems overview and architecture. Modern information systems can be very complicated and entangled constructs, which should emphasize the necessity for master data management.

Setting up an MDM initiative involves many steps and tools, including data source identification; mapping out the systems architecture; constructing data transformation, cleansing, and normalization rules; providing data storage capabilities, monitoring, and governance facilities, and so on. Another key element is a centrally governed data model and metadata repository. Perhaps surprisingly, many vendor “solutions” to set up an MDM initiative look very similar to data integration solutions (i.e., data consolidation, federation, propagation, or virtualization techniques). These integration approaches can be used as a method to achieve maturity in MDM. Note, however, that this assumes these solutions are used to set up a trusted, single version of the truth of the data on which decisions are based, and that no other representation of the data is used anywhere in the organization.

The challenge hence lies in the execution: following these core principles is a daunting task, and integration officers must avoid adding yet another few strands of “spaghetti systems” and cross links between data stores, causing the master data repository to become yet another half-integrated data silo.

About The Author

Bart Baesens is a professor of Big Data and Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (UK). He has done extensive research on big data and analytics, credit risk modeling, fraud detection, and marketing analytics. He has written more than 200 scientific papers, some of which have been published in well-known international journals (e.g., MIS Quarterly, Machine Learning, Management Science, MIT Sloan... Read More

Wilfried Lemahieu's picture

Wilfried Lemahieu is a professor at KU Leuven, Faculty of Economics and Business, where he also holds the position of Dean. His teaching includes database management, enterprise information management, and management informatics. His research focuses on big data storage and integration, data quality, business process management, and service-oriented architectures. In this context, he... Read More

Seppe vanden Broucke is working as a postdoctoral researcher at the department of Decision Sciences and Information Management at KU Leuven. His research interests include business data mining and analytics, machine learning, process management, process mining. His work has been published in well-known international journals and presented at top conferences. He uses Java daily, giving him insight into useful applications for practitioners.... Read More

Leave a reply

Your email address will not be published. Required fields are marked*