By: Dinesh Asanka | Updated: 2024-01-05 | Comments | Related: More > Data Warehousing
Problem
I am new to SQL Server data warehousing. What are the best practices that I should follow to ensure a successful long-term project?
Solution
What is a Data Warehouse?
In the last article (Learn about Data Warehousing for Analytical Needs), we introduced data warehouses and the key differences between data warehouses and operational systems. Before discussing the technical implementation details of a data warehouse, it is important to discuss best practices to focus on during the development of the data warehouse. When identifying best practices, it is important to understand that data warehouses should be scalable for the next 4-5 years.
Requirement Elicitation
Every project should start with gathering requirements from business users. However, a significant challenge when seeking requirements for data warehouses is that most business users do not understand what a data warehouse is and how it works. Therefore, it is essential to provide them with the necessary details. The following are important details that you need to discuss with the client:
- There will be a data latency between the operational data sources
and the data warehouse.
- Business users are accustomed to seeing real-time reports in operational systems. However, the data warehouse will have a latency that may be typically daily. This latency should be indicated to the business users as early as possible so that there won't be any issues during the deployment stage of the data warehouse.
- The data warehouse is mandated to extract data from multiple sources
and transform them into business values.
- These multiple systems will have their own versions of business equations and definitions. Once the data warehouse is built, you must cater to all stakeholders. Therefore, bringing all these stakeholders for the requirement elicitation discussions is essential. The best practice would be to conduct a workshop with all stakeholders where the data warehouse team will direct the conversations while finding answers to their questions.
- When determining requirements, most of the time, business users
start with the reports.
- A data warehouse is an analytics framework, so we should not confine the data warehouse design to the reports. However, as a data warehouse designer, you can use reports as an entry point to identify the requirement but should not be limited to those reports.
Data Warehouse Modeling
Let's look at the best practices that can be employed in data warehouse modeling.
Tracking History
Handling historical data is one of the crucial tasks in data warehousing. Typically, type 2 slowly changing dimensions (SCD) are used to track the history in data warehousing. However, many designers fail to identify this aspect at the design stage. It is challenging to incorporate the changes if it is not identified at the design stage.
Another important factor is to determine what the type 2 SCD attributes. For example, if you decide to include all attributes in a customer dimension as type 2, it may add additional performance burden to the dimensions. Therefore, the best practice would be to determine the type 2 SCDs and their attributes.
Aggregation Fact Tables with Details Fact Table
The father of business intelligence, Ralph Kimbal, advocates keeping fact table details with maximum granularity. The data may be overwhelming and challenging if the granularity is too fine. If it's too coarse, important details and patterns may be obscured. It's essential to strike a balance that provides the necessary level of detail for the specific analysis or reporting goals.
The following example shows a fact table to store customer calls for a telco operator:
However, this can be summarized in the following aggregated fact table. By aggregating, the query execution performance of the fact table will be improved. Note: There can be data loss by aggregating, and we need to make decisions to minimize the data loss. In the example below, the number of calls is added. But, if you want details such as maximum or minimum, that information is unavailable.
To maintain aggregation and detail records, the best practice is to maintain two or more fact tables. In the data warehouse, it is common to duplicate fact tables rather than having only one fact table. The aggregated fact table is typically populated from the details fact table.
Surrogate Keys
Surrogate keys should be introduced to the dimensions as a primary key instead of the existing business key. Even though technically, you may not need surrogate keys at the initial stage, there will be a requirement for surrogate keys in the future. However, if you are to implement type 2 SCD, surrogate keys are unavoidable.
Extract-Transform-Load
Extract-Transform-Load (ETL) is a complex data warehousing process that brings data to the data warehouse from multiple operational and external data sources. Considering the complexities of ETL, there should be best practices to design them.
Staging
A staging area is used to save data from data sources as it is and will be used as staging data to be processed later. Typically, a staging environment is essential when there are conflicts of extraction. However, in most cases, you will initially extract data from one or two data sources. In that type of scenario, you don't need a staging environment. However, the data warehouse will be extended in the future. In the future, there can be different data sources that will conflict with existing data sources, and at that time, it will be challenging to introduce a staging environment. Therefore, the best practice is to introduce a staging environment early in the data warehouse development.
Independent Extractions
It is better to design and develop extractions independently when there are multiple extractions. When they are designed independently, you can handle data source extraction when there are conflicts. Also, when there are scheduled maintenance tasks on data sources and when they are unavailable, extraction will fail if you have not implemented them independently.
Auditing
Auditing is used to identify the data load and time duration for each ETL step. By analyzing these steps, teams can identify potential future problematic tasks. Therefore, it is important to implement ETL auditing from the start.
Error Handing
Data warehouses will get data from multiple sources, and due to different data types and ranges, there can be conflicts. For example, in Oracle, the valid date range is 0001-01-01 to 9999-12-31; in the case of Microsoft SQL Server, it is 1753-01-01 to 9999-12-31. Consider the date 0020-11-04; it will be successful in Oracle but fail to write to SQL Server. When these fail, it is important to identify and handle the error according to the requirement.
Apart from handling the error, it is also essential to fix the error in the data source to overcome future issues. To achieve this, you need to log other relevant information as well.
Data Governance and Security
System access can be handled relatively easily since operational data sources are physically isolated. However, security will become more complex when the data is accumulated in a central place such as a data warehouse. Further, there can be situations where a single dimension table will contain data from multiple operational sources. For example, an employee dimension may be populated by numerous companies. When access is provided to the employee dimension, providing necessary access to the users is required. This is called row-level security and must be adequately planned from the beginning.
Scalability
There is a myth regarding any system development that scalability can be achieved by increasing hardware resources such as memory, processing, and disk space. This may somewhat solve the scalability and performance, but that won't be a permanent solution. Proper modeling is the best solution for future scalability issues.
Next Steps
This article looked into the best practices for data warehouse development, including requirements, modeling, and ETL. The next article will look at steps to build a data warehouse.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2024-01-05