CUTTER BUSINESS TECHNOLOGY JOURNAL VOL. 32, NO. 1
Cutter Fellow Vince Kellen outlines the six new rules for managing 21st-century data and analytics that organizations should embrace in 2019. He discusses the technologies that enable the rules and stresses the need to organize IT teams very differently around key data activities. According to Kellen, “In this environment, analytics is a team sport, not an individual one…. Thus, the organizational culture needs to shift to ensure that data and information are communal assets, not individual assets.”
With 2019 here and 2020 around the corner, it is time to recognize there are new rules in analytics and data management. These rules have created a wildly different analytics environment from the past. It’s time for organizations to embrace these new rules. But first, let’s look at the technologies behind these changes.
Five technologies have done the most, in my mind, to enable this transformation. These technologies are:
Improvements in scale-out, low-cost, near-real-time streaming technologies. While usually associated with Internet of Things (IoT) applications, streaming technology is poised to take over the synchronization of data for analytics. These technologies include Apache’s Kafka, NiFi, and Storm and Amazon’s Kinesis Data Firehose. These tools support real-time and near-real-time processing and can scale out to handle big data movement, usually at extremely low cost. When data movement shifts to streaming, the real-time nature of the tools hardens the environment, because errors get rooted out quickly, and enables real-time analytics as data moves through the stream.
High-speed, in-memory analytics. These tools, like SAP HANA, sport scale-out, parallel designs that make mincemeat out of billion-row data sets. These environments also heavily compress the data, maximizing use of higher-speed memory. Very large data sets can be analyzed in these environments, which resemble those of supercomputers.
Low-cost, big data environments. Tools like Hadoop, Amazon Redshift, and Google’s serverless BigQuery let organizations store petabytes of data cost-effectively. The high-speed tools referred to above can now federate their queries with these environments, providing companies with a two-tier approach. Tier 1 is very fast, but more expensive. Tier 2 is super big and slower, but super cheap. Petabytes of data can be modeled in one, synoptic architecture.
Artificial neural network resurgence. After enduring an impossibly long and cold winter of a couple decades, artificial intelligence (AI) has undergone a renaissance, partly enabled by improvements a decade ago in computer hardware such as CPUs and memory, but also due to software improvements represented by the various forms of new neural networks available today.
Hyperscale cloud providers. These cloud providers, dominated by Amazon Web Services, but with Microsoft Azure and Google Cloud Platform following close behind, are enabling all the technologies above to run in very dynamic and elastic environments. Analytics data processing can ebb and flow in a pattern of big bursts punctuated by dry spells, and the resource consumption and pricing can also ebb and flow.
The 6 New Rules
These five technologies conspire to subvert the dominant paradigm of data and analytics and beg for a new set of rules. These rules are an inversion of sorts of the old rules. Let’s take a closer look at this set of six rules.
Rule 1: Everything Is a Verb
In my own work, my team in a university setting has found that the old focus on nouns and their relationships to each other (entity-relationship modeling, among other approaches) is much less important. Relational modeling grew originally out of the need to divorce logical hierarchies (relationships) from physical data structures, providing great flexibility. In time, relational modeling also had to genuflect before the performance altar, and today, most data warehouse designers cut out of this cloth cannot stop themselves from trying to conserve performance and hence adjust their designs.
With the five technologies highlighted earlier, we can now put verbs first. For each noun, we work out all the events that can change the noun. Each of those events comes from one or more source systems in tiny batches of data creations, updates, or deletions. Rather than try to ensure we have all the additional fields that describe the noun perfectly confirmed, we instead work on ensuring the stream of changes regarding the noun is suitably captured. In short, we are essentially replicating a transaction log that describes all updates. We call this a “replayable log,” meaning we have the time history of changes captured in the stream of events.
All these events, or verbs, are placed into one very long table. Specifically, the table is one that has a variable record length, with different uses of columns, but all placed in one big, fat, wide table. While this is raising the hair on the back of the neck of old-school data warehousing folk, with these new tools, these are actually mostly benign manipulations. For example, while the developer sees a single wide table, in-memory columnar database tools store that activity log in an entirely different internal structure.
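A minimal sketch of the replayable-log idea, in Python, may help make it concrete. It is an illustration only, not our production design: the `Event` record, its field names, the sample data, and the `replay` function are all hypothetical. Each event carries only the columns it touches, which mirrors the variable record length of the activity table, and replaying the log up to a chosen point recovers the state as of that moment.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One row of a hypothetical activity table (names are illustrative)."""
    seq: int      # ordering key for the log
    entity: str   # the noun being changed, e.g. "student:42"
    verb: str     # "create", "update", or "delete"
    fields: dict  # only the columns this event touches


def replay(log, upto=None):
    """Rebuild entity state by applying events in sequence order.

    If `upto` is given, stop after that sequence number, which yields
    the state as it stood at that point in the history.
    """
    state = {}
    for ev in sorted(log, key=lambda e: e.seq):
        if upto is not None and ev.seq > upto:
            break
        if ev.verb == "delete":
            state.pop(ev.entity, None)
        else:
            # create and update both merge the touched columns
            state.setdefault(ev.entity, {}).update(ev.fields)
    return state


log = [
    Event(1, "student:42", "create", {"name": "Ada", "major": "Math"}),
    Event(2, "student:42", "update", {"major": "Physics"}),
    Event(3, "course:7", "create", {"title": "Statistics"}),
    Event(4, "course:7", "delete", {}),
]

current = replay(log)            # state after every event
as_of_seq_1 = replay(log, upto=1)  # state after the first event only
```

Because nothing is overwritten in the log itself, the same table answers both "what is true now" and "what was true then."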
These large activity tables then serve as a data lake in a very simple, very big, and often very wide data structure. This simplicity enables all sorts of extensibility since it relaxes so many rules of data design. And as you may suspect, the streaming technologies let us more easily ingest data — IoT-style — from a variety of differing systems.
Rule 2: Express Maximum Semantic Complexity
With the old way of doing things, analytics developers often include only data that matches the need required. My team does the opposite. We try to bring in all data we can find in any given stream, whether we think we will use the data or not. This is just like in home construction, where it is much easier to put in electrical cabling before the walls go up, not after.
A second aspect of this rule is that we also bring in data at the lowest level of granularity possible. While this can often explode the number of rows of data within our models, the new technologies come to our aid with all sorts of tiering between high-cost and low-cost storage and several automatic compression methods.
Bringing in as many attributes as possible and all data at the lowest level of granularity also ensures that we will be able to answer any question that the business may ask. The only unanswerable questions are those for which there are no data or for which the source system did not capture data at the required level of granularity.
We have found that the in-memory, parallel columnar analytics environment also means that we do not have to handcraft the preaggregation of data. All our models can rely on the lowest level of detail at all times. If we do any preaggregation work in our designs, it is for the convenience of the analyst. Sometimes a preaggregated number is much easier to work with in visualization tools. Thus, we aggregate data for convenience only, not for speed.
Rule 3: Build Provisionally
I have found this rule to be the hardest for data warehouse developers to swallow. These developers, historically, build fact tables and dimensions that must stand the test of time, which they do. But these models also end up requiring additional data structures on top of them to support analysts.
In our environment, my team employs activity tables to serve as a data lake, which by itself is not really usable. We must build something on top of that, which would not only endure the test of time but also be considered ephemeral. We have used the term curated view to capture two opposing intents here. The first intent is to ensure a view of the data is appropriate for a specific use case. We call these analytics vignettes. Analysts use data for specific purposes, so we build the curated view specifically to serve a specific, and, perhaps narrow, but evolving purpose. The second intent is to ensure the curated view has as much structure as possible contained within it. Thus, the structure of hierarchies associated with any given data element, the exact data type and formatting, and so on, are carefully worked out. While the curated view may be provisional, the deep structures within it often are not.
Curated views can be redundant and overlapping. Several curated views can be combined and also built on top of each other. Our curated view designs typically have three or four levels of hierarchy so that we can reuse code (SQL for us) when constructing curated views. Thus, what the user sees is a single, flat file with often a few hundred attributes (columns) that an analyst can easily use. We take any joining away from the analysis. All our curated views are designed just so with all needed joins made.
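The layering described above can be sketched with SQL views, here run through Python's built-in `sqlite3` module so the example is self-contained. This is an assumption-laden toy, not our actual environment: SQLite stands in for an in-memory columnar database, and the `activity` table, its columns, and the view names (`v_courses`, `v_active_enrollments`, `cv_enrollment`) are all hypothetical. The point is only that lower-level views are reused by a higher-level curated view, and the analyst queries one flat result with the joins already made.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical wide activity table: every event lands here.
CREATE TABLE activity (
    seq INTEGER, entity TEXT, verb TEXT,
    student TEXT, course TEXT, dept TEXT
);
INSERT INTO activity VALUES
    (1, 'enrollment', 'create', 'Ada', 'STAT101', NULL),
    (2, 'enrollment', 'create', 'Bob', 'STAT101', NULL),
    (3, 'course',     'create', NULL,  'STAT101', 'Statistics'),
    (4, 'enrollment', 'delete', 'Bob', 'STAT101', NULL);

-- Level 1: small reusable building blocks.
CREATE VIEW v_courses AS
    SELECT course, dept
    FROM activity
    WHERE entity = 'course' AND verb = 'create';

CREATE VIEW v_active_enrollments AS
    SELECT a.student, a.course
    FROM activity a
    WHERE a.entity = 'enrollment' AND a.verb = 'create'
      AND NOT EXISTS (
          SELECT 1 FROM activity d
          WHERE d.entity = 'enrollment' AND d.verb = 'delete'
            AND d.student = a.student AND d.course = a.course
            AND d.seq > a.seq
      );

-- Level 2: the curated view, flat, with all joins already made.
CREATE VIEW cv_enrollment AS
    SELECT e.student, e.course, c.dept
    FROM v_active_enrollments e
    LEFT JOIN v_courses c ON c.course = e.course;
""")

# The analyst sees one flat result and writes no joins.
rows = conn.execute(
    "SELECT student, course, dept FROM cv_enrollment ORDER BY student"
).fetchall()
```

A second curated view could reuse `v_courses` or even `cv_enrollment` itself, which is where the overlap and redundancy come from.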
While our data lake may carry with it lots of unstructured data (i.e., data not adequately described in terms of attributes and relationships), unstructured data will continually be desperately seeking structure, now chiefly through advanced algorithms, including AI and machine learning (ML) ones. While I have been focusing the argument here on structured data, as unstructured data grows exponentially, so does the need to structure it. Hence, over time, structure will be added.
Normally this sounds hideously wasteful of computational resources. While that may be, the five technologies outlined in this article have made this approach eminently feasible and cost-effective. Because our curated views are built on top of highly reusable components and are themselves reusable in other curated views, we can have as much model overlap or outright redundancy as our analysts need while still keeping the views quickly changeable. By merging the data lake (the activity logs) and the curated views into one environment, we can ingest very complex and unstructured data in the activity tables right away and then incrementally add structure to the data and include it in existing or new curated views.
Rule 4: Design for the Speed of Thought
I often get asked, why the need for speed? Many decisions in business are not made against real-time or near-real-time data. My answer is twofold. First, designing with speed from the start is far cheaper than trying to add it back in later. Second, speed is important for analysts, data developers, and data scientists. A fast environment lets analysts work at the speed of thought with sub-second responses for all clicks in an analytics tool, regardless of whether the task is to aggregate a single column across 500 million rows or to drill into 500 rows of fine detail. A fast environment lets data scientists build and deploy models that much faster.
Designing for the speed of thought also requires moving as much of the complexity the analyst has to contend with as possible to the infrastructure. For example, in our environments, we will handle filtering logic (sometimes complex Boolean and set logic) normally expressed in the front-end visualization tool in the back end, relieving the analyst of that work. This means our curated views will contain what look like redundant fields (permutations of a single field, such as last name and first name combined into one field in that order and first name and last name in that order, side by side). While a downstream analyst can easily do that, we found that providing these small details increases analyst usage and throughput.
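As a small illustration of moving this work to the back end, the Python sketch below adds both name orderings and one precomputed filter flag to a record. Everything in it is hypothetical: the `curate` function, the column names, the `STEM_MAJORS` set, and the flag itself are invented for the example and are not taken from our curated views.

```python
# Hypothetical reference set used by the precomputed filter below.
STEM_MAJORS = {"Math", "Physics", "Computer Science"}


def curate(row):
    """Return a copy of `row` with convenience columns added.

    The added columns are redundant by design: they spare the analyst
    from rebuilding the same expressions in a visualization tool.
    """
    out = dict(row)
    # Both orderings of the name, side by side.
    out["name_last_first"] = f'{row["last_name"]}, {row["first_name"]}'
    out["name_first_last"] = f'{row["first_name"]} {row["last_name"]}'
    # A Boolean-and-set filter resolved once, in the back end.
    out["is_active_stem_undergrad"] = (
        row["status"] == "active"
        and row["level"] == "undergraduate"
        and row["major"] in STEM_MAJORS
    )
    return out


student = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "status": "active",
    "level": "undergraduate",
    "major": "Math",
}
curated = curate(student)
```

The analyst then filters on a single yes/no column instead of composing the logic by hand.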
In addition, designing for speed of thought requires a robust and flexible framework for handling many different analytics tools, including traditional statistics, neural networks, older and newer ML algorithms, and graphing algorithms, alongside different deployment options. In the near future, many organizations will need to be able to manage dozens, or hundreds, of AI or ML models, all running in real time, acting on the stream of data as it flows in. These models need to be placed into service and taken out of service much more dynamically and frequently than analytics of yore.
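One way to picture models moving in and out of service while the stream keeps flowing is a simple registry, sketched below in Python. This is a toy under stated assumptions, not a description of any particular product: the `ModelRegistry` class, its method names, and the two stand-in "models" (plain functions) are hypothetical.

```python
from typing import Callable, Dict

# A "model" here is any callable that scores one event (an assumption
# made to keep the sketch small).
Model = Callable[[dict], object]


class ModelRegistry:
    """Tracks which models are currently in service."""

    def __init__(self) -> None:
        self._live: Dict[str, Model] = {}

    def deploy(self, name: str, model: Model) -> None:
        """Place a model into service (or replace one of the same name)."""
        self._live[name] = model

    def retire(self, name: str) -> None:
        """Take a model out of service; unknown names are ignored."""
        self._live.pop(name, None)

    def score(self, event: dict) -> dict:
        """Run every live model against one event from the stream."""
        return {name: model(event) for name, model in self._live.items()}


registry = ModelRegistry()
registry.deploy("late_login_alert", lambda e: e["hour"] >= 23)
first = registry.score({"hour": 23, "clicks": 4})

# Swap models without stopping the stream.
registry.deploy("engagement", lambda e: min(e["clicks"] / 10, 1.0))
registry.retire("late_login_alert")
second = registry.score({"hour": 23, "clicks": 4})
```

At the scale of dozens or hundreds of models, the same deploy and retire operations would also need monitoring and versioning, which this sketch leaves out.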
These new analytics models will be handling alerting, nudging, personalization, system communication, and control of the activities frequently connected to human beings. This new analytics environment must be capable of delivering timely information that fits within the cognitive time frame of each person’s task at hand.
Rule 5: Waste Is Good
While this rule is implied or directly called out in the prior four, it is worthwhile touching on it further. In working with developers from traditional environments who make the transition to the new environment, all of them spend at least a few months struggling to let go of what was so ingrained in them: the need to conserve computing resources. This is where younger and perhaps less experienced developers and data scientists may have an advantage. We are constantly reminding them that all this supposed waste enables agile, flexible, super-fast, and super-rich analytics environments at a price that was unheard of 10 years ago.
In this environment, we routinely take what would be a parsimonious data set and explode it, often to enormous sizes. Why? In a nutshell, we are trading off a larger data structure for a simpler algorithm. For example, in a typical large university, a class schedule for all courses offered can fit in a decently sized spreadsheet. Exploding that data set to show each room’s usage for each minute of the day, for every day in a year for 20 years, results in 2.3 billion rows. Why would anyone do such a thing? Because visualizing an “exploded” data set is trivial and allows the analyst an incredible level of granularity, whether with a visual or an analytics algorithm.
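The trade can be sketched in a few lines of Python. The `explode` function, the room name, and the sample booking are hypothetical. The arithmetic at the end is a back-of-the-envelope check, assuming every minute of every day is enumerated for each room and ignoring leap days; under that assumption, 2.3 billion rows corresponds to a little over two hundred rooms, a count the example above does not state.

```python
from datetime import datetime, timedelta


def explode(room, start, end):
    """Yield one (room, minute) row for each minute in [start, end)."""
    t = start
    while t < end:
        yield (room, t)
        t += timedelta(minutes=1)


# One compact schedule row: a 50-minute class in a hypothetical room.
booking = ("HALL-101", datetime(2019, 1, 7, 9, 0), datetime(2019, 1, 7, 9, 50))
rows = list(explode(*booking))  # 50 rows, one per minute

# With the data exploded, "was the room in use at 09:30?" is a lookup,
# not an interval-overlap calculation.
in_use_at_0930 = ("HALL-101", datetime(2019, 1, 7, 9, 30)) in set(rows)

# Rough size check, assuming every minute is enumerated per room.
MINUTES_PER_DAY = 24 * 60
rows_per_room = MINUTES_PER_DAY * 365 * 20       # 10,512,000
rooms_implied = 2_300_000_000 / rows_per_room    # a bit under 219
```

The algorithm on the exploded side is a membership test or a simple count; the complexity has moved into the size of the table, which the new tools absorb.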
Thus, we can throw away the old rules for normalizing databases. But this is not a free-for-all environment. Quite the contrary. A new strict set of construction rules replaces the old rules, yes, but now it comes with an inverted set of assumptions regarding space, size, and performance. For example, our methods for handling activity tables and curated views are as rigorous as any of the old rules for dimensional modeling. It’s just that we don’t care about the growth or explosion of data. We embrace it.
Rule 6: Democratize It
Data democratization means providing equal access to everyone — leveling the playing field between parts of the organization so that all parties can get access to the data. Today’s economy is an information economy, filled with information workers — and information workers need information. If these key staff members in your organization end up in an environment not suitable for their intellectual skills, they will opt to leave. So, when considering access to data more broadly, it’s not just about the data itself, it’s about recognizing the information-oriented nature of today’s work and recognizing the complexity of organizations.
Organizations that invest in decentralized decision making and make the necessary investments in technology and organizational practices perform much better than their peers. We must free data from silos and transcend traditional hierarchies, making data hoarders a thing of the past.
The implications here are manifold. We will need to organize our IT teams very differently around the following activities:
Data movement design
Data movement orchestration and monitoring
Data architecture and data design
Data science modeling
Data science orchestration and monitoring
Data democratization community development and management
When thinking about these skills, we need to recognize that they are a repackaging of older skills now wrapped around dynamic cloud technologies, a sort of data science design and operations group. With software as a service growing by leaps and bounds, many IT shops no longer develop software. Instead, with a coterie of various systems in the cloud originating data, a new DevOps team called DataOps — which I try to think of as “data and data science ops” — is being born.
With these new technologies, of course, we have more legal and ethical concerns. Privacy law and policies are growing rapidly and differentially across multiple regions of the globe. Moreover, new AI and ML approaches can introduce new forms of legal liabilities not previously imagined. While data democratization is both important and helpful, ensuring a highly secure environment takes on greater importance.
How humans react to and handle information needs to change. In this environment, analytics is a team sport, not an individual one. Whatever one person creates in these tools, it is likely another person can replicate it. Thus, the organizational culture needs to shift to ensure that data and information are communal assets, not individual assets.
Other implications abound. We need to ponder them, but we don’t have the luxury of time. These are the new rules of the big data world. Read them, contemplate them, and adopt them, but ignore them at your own peril. Analytics of the 21st century aren’t merely coming, they have arrived.