Data Warehousing in the traditional on-premises world is a well-elaborated and studied discipline. With all the logical and physical architectures, data modeling techniques and methodologies, industrial data models, business glossaries, and data governance principles, one can find plenty of guidelines on how to build the solution properly.
However, with emerging cloud technologies and a variety of options, finding the best practice and proven architecture for the cloud might be tricky. Thus, it is necessary to balance the options from the practicality, sustainability, and time-to-market perspective.
Overview of Topics Covered
Typical functional blocks of the DWH
Let us first summarize, what are the typical functional blocks of the Modern Data Warehouse.
- Data ingestion from various heterogeneous data sources is a must. Whether we are talking about operational systems, additional external sources of the data, other data lakes, and data ponds. The modern data warehouse shall support also multiple integration strategies, varying from batch integration of the data (e.g., on the daily basis) up to the streaming and processing of the data continuously. The functional block shall also deal with different ways how the source data are stored, whether in a controlled and structured way or an unstructured format
- Typical Data Warehouse stores replicated data from the source systems in the raw format – meaning, that they are not (yet) transformed so that they can be eventually consumed also by other data consumers. Based on the logical architecture, this can be named a data lake or data landing area of the Data Warehouse. This typically also involves archiving the snapshots of the source data for a certain period, for additional reprocessing in the future if necessary, and playing the part in the backup and restore strategy in the future => the scope of the landing area typically involves also referential data from the different master and reference data management systems/platforms as well as core and master data from the source systems
- The actual business transformation of the data into the model suitable for the reporting (and other consumers of the data) is happening between the mentioned landing area and the core of the data warehouse, where batch storage of the data is happening
- Consumers of the data might differ, but they typically involve end business users, data engineers, scientists as well as “technical” users of the data (consuming applications and surrounding IT ecosystem).
- Data governance is the additional functional block, which typically supports the cataloging of the data products, reports, and different data end-products (like statistical models and others). Also providing the means for a thoughtful understanding of the data content is a must (for example description of the business terms and data items used in the reports and reporting layers of the data warehouse)
- Metadata management is crucial when doing metadata-driven data pipeline development. The best practice for modern Data Warehouses nowadays is to generate data pipelines automatically instead of hard coding them using various ETL technologies and methods. Metadata management is then crucial for the metadata orchestration (for the data transformation purposes as well as dependencies management between various data pipelines)
- Last, but not least – Data Quality management support involves the development and management of various data quality rules, data quality indicators, reporting, and resolution of the DQ issues, that might be originating either directly in the source system, or issues related to the heterogenous environment and data inconsistencies between various core systems
Simplified reference architecture
The following diagram depicts the best practices of the toolset covering necessary functional blocks.
- Data ingestion from the sources usually requires fast and reliable replication for cost-effective storage. Using Cloud Dataflow for data replication to Cloud Storage is a good choice
- Data pipelines with most of the business logic for the typical core of the Data Warehouse requires complex transformation logic to be implemented. Here, again, using BigQuery stored procedures and storing the data in BigQuery itself is a very grounded decision
- For data streaming purposes shall be real-time integration necessary, working with Google Pub/Sub, Cloud Dataflow, and propagating the data to Cloud AppEngine is a very good practice
- From reporting and data science perspective, one can leverage the Looker and Google AutoML capabilities
- Data governance capabilities of GCP are supported mostly by Google Dataplex
Typical obstacles to data solutions migration to the cloud
Data Warehouses, Data Lakes, and other data-related solutions are typically very complex systems developed over a very long period. They bear plenty of knowledge, which is often not materialized in the appropriate documentation.
Data transformation jobs are often very delicate and without very proper knowledge of the data transformation rules and purpose, re-implementation of the data pipeline or the ETL jobs might become very cumbersome.
Migrating the ETL jobs and data pipelines is very often also the most critical and time-consuming part of the migration to the cloud.
Being able to plan this migration properly is crucial for the adaptation of the consuming business users and applications (or simply data consumers). Being able to accelerate this ETL migration is very viable for any transformation program/project, which is dealing with the migration of the Data Warehouse to the cloud.
How to accelerate the migration?
The diagram above describes the reference architecture for the Data Warehouse being built from scratch. The same architecture principles can be applied to the migrated / re-platformed legacy Data Warehouse. However, with legacy, things are getting more complicated.
What to do with existing historical data? How to make sure, that all the functionalities, ETL jobs, and data pipelines are migrated properly with the same behavior?
This is where ADELE is most helpful. Understanding legacy Data Warehouse, harvesting metadata, and providing automated generation capabilities for Google BigQuery stored procedures.
There is a plethora of tools, which support the creation of ETL jobs or data pipelines from the scratch. Few of them, however, support re-platforming of the existing/legacy data pipelines to the target technology.
This is what makes tools, like ADELE very competitive, being able to support and de-risk actual cloud data migration journey.