This is a high-level summary that provides a few options of data warehousing and business intelligence solutions used by organizations and is not a recommendation for any one solution. There are many variables that must go into the decision and selection of an architecture that fits your organizational needs. You must not only consider your current needs but anticipate future needs, so please be sure to understand all aspects of a solution prior to committing to it. Data solutions can be very expensive and take a long time to develop, but can be an invaluable asset in driving data-driven decisions for your organization.
This article outlines common data warehousing and business intelligence reference architectures. There are many variations to these that I will not cover, but I hope this will provide you with a general understanding of the different architectures and their pros and cons.
Data Vault Architecture
The Data Vault Architecture is a methodology that makes it easy to store structured data. Performance to load a data vault can reach near real time loads, as there are mostly inserts rather than updates or deletes. This architecture is largely used where auditing of the data is needed – what source systems changed what data. This is a flexible architecture that allows the addition of additional data sources to be integrated easily. To accomplish the speed of inserts and auditing of data, a large number of tables are needed when compared to a dimensional model. This can add additional complexity if trying to report from or get data out of a data vault. Therefore, the data vault is often between source systems and a data warehouse (dimensional model).
Figure 1: Data Vault Reference Architecture
|· Agile architecture makes it easy to integrate new data sources or make changes to existing load procedures
· Strong audit history that logs each change
· Pattern based loads allow for code generation to create load packages
· Faster load times as most data are inserts rather than updates
|· Selecting data can be slow because of the number of joins required
· Contains many tables, which complicates ad hoc reporting
· Unable to load non-structured data
Table 1: Data Vault Pros and Cons
Data Warehouse Architecture
The Data Warehouse or Dimensional Model follows the Kimball Methodology, which is a denormalized architecture that is designed for reporting. The denormalized architecture reduces the number of tables and complex table joins to improve performance on getting data out. A staging environment which could be a data vault, or a data lake is used to land data without transformations from each of the source systems. The data integration and interoperability layer handles the complex transformation logic to shape the data into a dimensional model.
Figure 2: Data Warehouse Reference Architecture
|· Transformed and structured, making it easier to report from
· Uses business names and definitions
· Conforms data from disparate systems
|· Requires upfront architecture and development
· Not flexible with changing business needs
Table 2: Data Warehouse Pros and Cons
Modern Data Warehouse Architecture
With cloud offerings, such as those offered by Microsoft Azure, Amazon Web Services (AWS) or Google Cloud Platform (GCP), organizations have even more options to implement their data strategies. One newer data option is an IaaS (Infrastructure as a Service) offering called a Data Lake. Data Lakes allow you to store data, both structured and unstructured data, in a single location. This centralized data store can be a good option for organizations that have both structured and unstructured data that they plan to use, now or in the future. There is minimal design or architecture needed up front to plan how you will store the data. Think of a file system where you store your text documents, images, videos, PDFs, etc. and how you create directories and folders and then store files within. You will need to think through how you will setup this structure before loading. Once you have designed the structure and made a few other technical decisions, you can then load data. The data is loaded incrementally regularly.
With more companies looking at and implementing a data science solution, a data lake can be a key place for your business analytic or data scientist to access data from the entire organization. One enormous struggle faced by data scientists is accessing the data from multiple systems and bring that data into a centralized data store. With the data already loaded to the data lake, data scientists can focus more on building value from the data rather than gathering the data.
A data lake is not a data store that non-technical users would have access to, as there is still an effort in transforming the data in the data lake into a usable format.
Figure 3: Modern Data Warehouse Reference Architecture
|· Can contain structured and non-structured data
· Easy to load data
· Flexible and change adjust to new data
· Utilizing cloud offerings, there is no upfront cost to license or build out servers
· Storage costs are lower
|· Getting data out can be more difficult
· Newer or different toolset, which requires a learning curve to get your team acclimated
Table 3: Modern Data Warehouse Pros and Cons
Modern Business Intelligence Architecture
The Modern Business Intelligence (BI) Architecture came about with the release of reporting tools with functionality to connect and report from multiple sources. Take Power BI as an example. Power BI is an growing reporting tool with functionality being added regularly that allows one to build interactive reports and dashboards from data sourced from multiple different systems. This approach can be a brilliant solution for smaller and less complex data to drive business decisions.
Power BI and other related tools can connect to different sources and transform that data into a reportable shape. Contrast this with other solutions that have a layer of ETL or Extract, Transform, and Load that does this transformation. This ETL layer does a lot of heavy lifting to transform the data into a usable shape and loads that data into a data model that makes reporting easier. The ETL processes can