Advisor

Cloud Data Warehouses: A Paradigm Shift in Data Platforms

Posted August 26, 2020 in Business & Enterprise Architecture

Enterprises need access to data-driven insights faster than ever before. Analytics use cases have evolved from traditional, precanned reports to self-service and guided analytics. Exploratory data analysis, machine learning (ML), and augmented analytics are becoming common requirements. Enterprises running their data warehouses on traditional on-premises platforms are finding it difficult to keep pace with exponential data growth and the ever-increasing demands of business users for fresher, faster insights. With no simple way of apportioning IT resource consumption costs, platform ownership is becoming a major concern, and business units are building their own data silos instead of sharing a common data platform.

Enterprises exploring solutions to these challenges are steadily embracing the cloud data warehouse. These platforms provide agility, scalability, and simplicity that enable the enterprise to focus on data solutions rather than spending valuable effort on peripheral overheads.

A Brief History of the Data Warehouse

The data warehouse has been an integral part of an enterprise’s data and analytics landscape for the last four decades. During this period, even though the fundamentals of the data warehouse have not altered significantly, the underlying technologies have transformed (see Figure 1).

Figure 1 — The transition of the data warehouse from relational database to the cloud data warehouse.

Enterprises initially built their data warehouses using traditional RDBMS tools. As data volumes kept multiplying, traditional RDBMS tools were not able to keep up, paving the way for the data warehouse appliance.

The data warehouse appliance was custom-built for running analytics at scale. IBM Netezza and Teradata initially dominated this market, which Oracle joined later with its Exadata offering. These appliances were well suited to vertical scaling but could not scale horizontally, and they carried a high cost of ownership.

The next stage of evolution was driven by the massive data growth of the SMAC (social, mobile, analytics, and cloud) era, which brought an explosion of unstructured, semistructured, and streaming data. Hadoop-based big data platforms offered an alternative: instead of investing in one big, powerful appliance, processing could be distributed across commodity servers, and capacity could be scaled simply by adding new nodes to the cluster. However, many organizations that tried to force-fit a data warehouse onto these tools realized that it was not the best solution; some of the prominent reasons included the following:

  • Hadoop is not ACID (atomicity, consistency, isolation, and durability) compliant.

  • The SQL tools that accompany these big data solutions are not as robust as those of time-tested relational databases.

  • Performing updates on data is complicated.

  • Maintaining these platforms requires significant administration.

  • Role-based access control is not available out of the box.

There is no denying that the Hadoop ecosystem is essential and can be used to implement use cases that are difficult or impossible to implement in a data warehouse. However, a data warehouse warrants special features, and force-fitting them into a Hadoop-based data lake may not be the best choice.

The Advent of the Cloud Data Warehouse

A cloud data warehouse platform is a cloud-native, ACID-compliant relational database that offers extreme scalability in both storage and compute, the ability to scale up and down quickly, and minimal administration overhead.

Over the past few years, cloud data warehouse platforms have evolved considerably, and numerous enterprises have either moved entirely to them or started experimenting with a few use cases. These platforms do away with the complexity associated with the Hadoop ecosystem while providing scalability that was difficult to achieve beyond a point with data warehouse appliances. The prominent benefits include:

  • Agility. The ability to scale up and down instantly enables the enterprise to be truly agile and provision resources immediately for new data initiatives.

  • Chargebacks. A single data platform can be used to cater to the needs of the entire enterprise and cost can be apportioned as per the actual consumption of a particular business unit or group.

  • Cost optimization. Some of these platforms provide the ability to pause compute resources when not in use and only charge for storage during this period. By leveraging this feature, organizations can pay for what they truly use.

  • Reduced administration overhead. Patching, upgrades, security, and backups are taken care of by the platform provider, significantly reducing the administration overhead and associated costs.

  • Rich partner ecosystem. Enterprises are free to use tools of their choice for data integration and reporting. Many data integration tools provide special connectors that enable pushing processing to the cloud data warehouse platform.

  • Support for semistructured data. Most of these platforms also support semistructured data formats like JSON, providing options to onboard new data processing use cases.
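The cost-optimization and chargeback benefits above boil down to simple arithmetic: compute is billed only while it runs, while storage is billed continuously. The following is a minimal sketch of that comparison; every rate and usage figure in it is a hypothetical placeholder, not any vendor's actual pricing.

```python
# Illustrative sketch: always-on compute vs. compute that is paused
# when idle. All rates and usage figures are hypothetical assumptions.

HOURS_PER_MONTH = 730

def monthly_cost(compute_rate_per_hour, storage_cost_per_month,
                 active_hours=HOURS_PER_MONTH):
    """Total monthly cost; compute is billed only for active hours."""
    return compute_rate_per_hour * active_hours + storage_cost_per_month

# Assumed figures: $4/hour compute, $250/month storage, and a
# warehouse that is busy roughly 220 hours per month.
always_on = monthly_cost(4.0, 250.0)                    # never paused
pay_per_use = monthly_cost(4.0, 250.0, active_hours=220)

savings = always_on - pay_per_use
print(f"always-on: ${always_on:,.0f}, pay-per-use: ${pay_per_use:,.0f}, "
      f"savings: ${savings:,.0f}")
```

Under these assumed numbers, pausing idle compute cuts the monthly bill by roughly two-thirds; the same per-unit consumption figures can also drive chargebacks to individual business units.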

Key Considerations for Choosing a Cloud Data Warehouse

There are numerous competent cloud data warehouse platforms in the market, each offering a unique value proposition. Choosing one is a daunting task. To make it somewhat easier, we provide a guidance framework for shortlisting the choices (see Figure 2).

Figure 2 — Four key considerations for choosing a cloud data warehouse.
  1. Capability and feature relevance to business use cases. Use an objective scoring mechanism to evaluate the capabilities and features of the candidate platforms against your business use cases and identify the best fit.

  2. Leverage existing cloud investment. An existing investment in a cloud platform will play a major role in the decision; enterprises may want to choose a cloud data warehouse platform closely aligned with it.

  3. Ecosystem impact. The ETL tool, reporting tool, and scheduling tool are key components of the data warehouse ecosystem. The impact on these components will need to be carefully analyzed and will determine the time to market for the modernized data platform.

  4. Total cost of ownership (TCO). The two determinants of TCO are the overall migration cost of moving to the cloud data platform and the cost of running the platform. The overall migration cost covers the database migration, data migration, ETL, and report changes. The cost of running the platform includes the subscription charges for the cloud data warehouse and the cost of the manpower for platform support.
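The TCO determinant above can be made concrete with a back-of-the-envelope model: one-time migration cost plus recurring running cost over a planning horizon. The sketch below follows that structure; every dollar figure is an assumed placeholder meant only to show the shape of the calculation.

```python
# Illustrative TCO model following the two determinants named above:
# one-time migration cost + recurring running cost over a horizon.
# All figures are assumed placeholders (USD).

def migration_cost(db_migration, data_migration, etl_changes, report_changes):
    """One-time cost of moving to the cloud data platform."""
    return db_migration + data_migration + etl_changes + report_changes

def running_cost(monthly_subscription, monthly_support_staff, months):
    """Recurring cost of operating the platform over the horizon."""
    return (monthly_subscription + monthly_support_staff) * months

# Assumed placeholder figures, evaluated over a 3-year horizon.
one_time = migration_cost(120_000, 80_000, 150_000, 60_000)
recurring = running_cost(25_000, 15_000, months=36)
tco = one_time + recurring
print(f"TCO over 3 years: ${tco:,}")
```

Even a crude model like this is useful for comparing shortlisted platforms, since the migration-side inputs (database, data, ETL, and report changes) vary sharply between vendors while the running-cost inputs are easier to quote.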

Popular Cloud Data Warehouse Platforms

Snowflake, Azure Synapse, Amazon Web Services Redshift, Google BigQuery, and Oracle Autonomous Data Warehouse are prominent players competing in this space. A detailed comparison of these platforms is a different topic altogether, but here is a brief value proposition for each of them:

  • Snowflake is built on a decoupled architecture that enables independent scaling of compute and storage. Separate virtual warehouses (compute instances) can be instantiated for diverse purposes; for instance, you can have a medium-size virtual warehouse for your reporting needs and a large one for your data science needs.

  • Azure Synapse (SQL Pools) is a massively parallel processing (MPP) platform with separation of storage and compute. Compute can be sized independently of storage and can be paused. Using PolyBase, huge volumes of data can be copied from Azure storage into Azure Synapse (SQL Pools).

  • AWS Redshift is a fully managed data warehouse based on a massively parallel processing architecture and columnar storage. Redshift Spectrum allows users to query data stored in Amazon S3 buckets, and Redshift can write query results back to S3 in open formats like Parquet for further analysis by other analytical tools.

  • Oracle Autonomous Data Warehouse is a fully managed database built on the Exadata platform with automated patching, upgrades, backup, and performance tuning. It runs the same Oracle software as on-premises systems, making it compatible with existing Oracle deployments.

  • Google BigQuery is a completely serverless solution with no configuration required. Users do not choose or scale compute; configuration and scaling are managed by the platform. BigQuery ML enables data scientists to run ML models in BigQuery using standard SQL queries.
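Several of the platforms above share the same architectural idea: data lives once in shared storage, while independently sized compute clusters attach to it and can be resized or paused without touching the data. The toy model below illustrates that pattern only; the class and method names are invented for the sketch and are not any vendor's API.

```python
# Toy model of the decoupled storage/compute pattern: one shared
# storage layer, multiple independently managed compute instances.
# All names here are hypothetical, for illustration only.

class SharedStorage:
    """Holds the data once, regardless of how many compute instances read it."""
    def __init__(self):
        self.tables = {}                  # table name -> list of rows

    def load(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

class VirtualWarehouse:
    """A compute instance: sized, paused, and billed independently."""
    def __init__(self, name, size, storage):
        self.name, self.size, self.storage = name, size, storage
        self.paused = False

    def query_count(self, table):
        if self.paused:
            raise RuntimeError(f"{self.name} is paused")
        return len(self.storage.tables.get(table, []))

storage = SharedStorage()
storage.load("sales", [{"id": i} for i in range(1000)])

# A medium warehouse for reporting and a large one for data science
# read the same data; pausing one has no effect on the other.
reporting = VirtualWarehouse("reporting_wh", "M", storage)
science = VirtualWarehouse("science_wh", "XL", storage)
science.paused = True                     # stop compute charges

print(reporting.query_count("sales"))
```

The design point the sketch makes is that capacity decisions (size, pause, resume) are per-workload, while the data itself is loaded and governed once.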

Conclusion

The data warehouse has played a vital role in the enterprise’s data and analytics landscape for decades and will continue to do so. Cloud is the natural next step in the evolution of the data warehouse, providing virtually unlimited scalability, near-zero in-house maintenance, robust security, high performance, and the flexibility to move toward a pay-as-you-go model. The try-before-you-buy model enables enterprises to pilot business use cases on these platforms without huge upfront investment. By leveraging these platforms, enterprises can become truly agile and keep up with their growing data and analytics needs.




About The Authors
Sagar Gole
Sagar Gole is a Solution Architect working with Tata Consultancy Services Limited (TCS). For the last 16 years, he has engaged in designing and building business intelligence, data integration, and data management solutions. In his current role as a Business Intelligence Solution Architect with TCS’s Automotive and Industrial business unit of the Manufacturing & Utilities business group, Mr. Gole advises customers on building data platforms…
Kamal Gupta
Kamal Gupta is a Managing Partner and heads the Data and Analytics Practice for the Manufacturing Automotive & Industrials business unit at Tata Consultancy Services Limited (TCS). Kamal has more than 21 years of experience in data and analytics strategy and in architecting and designing enterprise data and analytics solutions for a range of customers. In his current role, he provides consulting, thought leadership, and transformation…
Vidyasagar Uddagiri
Vidyasagar Uddagiri is an Enterprise Architect working with Tata Consultancy Services Limited (TCS). For the last 24 years, he has been engaged at TCS in application development and maintenance, technology modernization, and architecture definition engagements for customers in manufacturing, banking, and retail domains across North America, Europe, and the UK. In his current role of Enterprise Architecture Practice Leader with TCS’s Automotive…