Data Warehousing on Azure and on SQL Server 2016
This blog post is intended for readers who already have some experience on building Data Warehouses with earlier versions of SQL Server (2008 or 2012) and are looking for updating their knowledge on how to do it today on SQL Server 2016 and/or on Azure. In addition, I included some insight on the technology trends on the matter.
Investing into designing and implementing a solid Data Warehouse creates a foundation for a successful Data Platform architecture. As Azure SQL Data Warehouse reached General Availability and SQL Server 2016 was released, it is time to take a look at what is new, how it might affect the existing Best Practices, and how do these two platforms differ from Data Warehousing perspective.
Microsoft Best Practices guideline for building an On-Premise Data Warehouse can be found here. Although that document was written years ago, many of the fundamentals still hold true today. Obviously, there are also quite a few things that have been introduced since and have an impact, and I am going to spend some time on them on this blog post:
- Azure SQL Database
- Azure SQL Data Warehouse
- Clustered Columnstore Indexing
- In-Memory Tables
- Power BI
There are two PaaS services for building a Data Warehousing solutions: Azure SQL Data Warehouse and Azure SQL Database. Currently, the Azure SQL Database max size limit is 1 TB (on Premium tier). In addition, you are not likely to experience the great performance of Azure SQL Data Warehouse with a data set smaller than 1 TB. Later on in this blog post, I will only concentrate on Azure SQL Data Warehouse as all Enterprise Data Warehouses are larger than that.
Regardless of the platform, one still needs to consider many of the things outlined in the previously mentioned Best Practices document:
- Is there an existing Database model? Can it be improved or used?
- Is there a requirement for an OLAP cube or Tabular Model? Do we have to load the data into it or can we use DirectQuery functionality on reporting layer?
- What kind of schema would be the best for the current case?
- How about hardware requirements? How many users are there? What is the amount of data?
In order to reach the goal, the Database model as well as the overall architecture have to be designed according to the answers to the questions above. Database model should be designed thoroughly before implementing in order to fully meet the requirements and avoid unnecessary work (through the changes to the Database model). Obviously, if you are going to be deploying your solution to Azure, you have the option of scaling your hardware within minutes so you don't have to worry about your hardware requirements as much.
Many fundamentals still hold true:
- It is extremely important to use correct data types
- Large tables should be partitioned
- Fastest way to load data into Data Warehouse table is by partition switching
- Logging should be used only when really required
- Statistics and indexes should be used (and updated/rebuilt according to a plan where necessary)
Some of these fundamentals are actually implemented a little different now than they were few years ago. Let's walk through the new technologies.
Clustered Columnstore Indexing
The de facto standard for modern Data Warehousing is columnar compression of data. Typically, the bottleneck of a DW server is disk IO, because data needs to be read from disk if it's too big to fit into RAM of the server. Columnar compression boosts the performance by reducing the amount of data that needs to be read from disk. It is especially well-suited for data warehousing, because similar values in a column lead into high compression rates. Having a data compression rate of 90% for instance, also means that you can have 10 times more data in-memory compared to uncompressed data resulting into even further performance gains. All of the modern data warehousing platforms include an implementation of this technology and you should use it regardless of the product. They also all have similar limitations. I will scope this blog post just to focus on Clustered Columnstore Indexing (CCI) as it is called on SQL Server 2016 as well as on Azure SQL Data Warehouse.'
First of all, CCI does not support all data types. On the other hand, I would say that it supports all the relevant data types for a data warehouse implementation. If you compare the list of supported data types on Azure SQL Data Warehouse and on SQL Server 2016 Clustered Columnstore Indexed tables, you will find that the lists are identical (with the exception of sysname, which really is an nvarchar(128) NOT NULL). The root cause for this is of course, that Azure SQL Data Warehouse is also using Clustered Columnstore Indexing which does not allow other data types.
In addition, primary and foreign keys are not supported. From data warehousing perspective this does not pose a problem. You should design your ETL to include the necessary data validation. If you absolutely want to have similar functionality than what a primary key introduces to a table, you could define your column UNIQUE (primary key enforces uniqueness on a column) as well as add a Nonclustered Columnstore Index on it (index is also automatically created on a primary key column). In short, this does not pose a problem if you are developing something new from scratch. If you have an existing data warehouse, and you are looking for performance improvements, you might want to modify it around a little bit to have clustered columnstore indexes especially on your large fact tables.
In-Memory Tables are likely to introduce significant performance improvements on your necessary transformations during ETL processes. Thus, I would recommend using the technology on your Staging database where possible and minimizing the unnecessary read and write operations to disk.
Power BI includes SQL Server Analysis Services (Tabular Model) under the hood and thus, has an impact on your Data Warehousing and Business Intelligence capabilities. Depending on your requirements, it may remove the need for additional SSAS Server on your architecture and thus, save significant amount of resources. The SSAS DirectQuery functionality enables you to use both SQL Server as well as Azure SQL Data Warehouse to host your data, while the Unified Data Model (UDM) could still be hosted on Power BI for your Power BI reports. If you already happen to have an SSAS Tabular model implementation, all you need to do is connect to it using Power BI and start creating reports on your existing UDM.
Azure SQL Data Warehouse
Both Inmon’s and Kimball’s Database modeling approaches are recommended by Microsoft. Dimensional Modeling can be seen as a better fit out of these two, because JOINs are expensive operations and Clustered Columnstore Indexing compressed recurring values in a column really well.
The best performance is achieved when data is partitioned evenly. There are two different keys that can be defined when table is created: PARTITION and DISTRIBUTION. Date is commonly used key for PARTITION. DISTRIBUTION key should not be the same as PARTITION key. You should consider a key that is used in JOINs (and/or GROUP BY) in your queries and that would distribute the data evenly.
STATISTICS can also be used for performance tuning when necessary. You should consider creating statistics for all the “foreign key” columns on your fact tables. Multi-column statistics can also be useful for dimension tables. At least the “primary key” column of the dimension should have statistics.
Azure SQL Data Warehouse is based on SQL Server code and uses the T-SQL programming language. Nevertheless, I decided to include actual code examples in the following chapters that I used for performance tests a few weeks back, because the authentication and PolyBase functionalities are new to a lot of people and sometimes it is just the best of way of sharing information.
Azure Active Directory authentication for Azure SQL Data Warehouse reached General Availability and should be used by default for authenticating users. The contained database user model enables portability of the database as logins are not required and thus, is also recommended practice. You should use roles to manage access rights to data. The following example shows an example how to. It is assumed that you have set Active Directory admin for the database and created an AD Group in Azure AD for the purpose. The name of this AD group is “readgroup” in the following example:
CREATE USER [readgroup] FROM EXTERNAL PROVIDER; CREATE ROLE db_readsometables AUTHORIZATION db_securityadmin; EXEC sp_addrolemember db_readsometables, readgroup GRANT SELECT ON OBJECT::[somedatabase].[dbo].[onetable] TO [readgroup];
This creates a role “db_readsometables” that can be managed by members of db_securityadmin group. readgroup is added to the role. Finally, read access rights are added to the role for the table named onetable on the database named somedatabase. As a result, all the members of “readgroup” in the Azure AD can query onetable using their AD account for authentication.
The Fastest way to load data: PolyBase
The most common scenario is that you have a lot of data in Azure Blob Storage that you want to have in your Azure SQL Data Warehouse. PolyBase is the right tool for the loading job. The following commands show how to do that step by step. First, create a credential to access the data source:
CREATE MASTER KEY; CREATE DATABASE SCOPED CREDENTIAL YetAnotherStorageCredential WITH IDENTITY = 'user', SECRET = '<YourSecret>' ;
Then, define the data source. Please notice that we use wasbs instead of wasb to have secure connection using TLS. criteo is the name of the container within the oskaristorage Storage account:
CREATE EXTERNAL DATA SOURCE criteo WITH ( TYPE = HADOOP, LOCATION = 'wasbs://email@example.com', CREDENTIAL = YetAnotherStorageCredential );
File format needs to be defined as well. Reading Gzipped Text files is also an option as follows:
CREATE EXTERNAL FILE FORMAT GzippedTextFile WITH ( FORMAT_TYPE = DelimitedText, FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'), DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec' );
Finally, we create a table using the data source and file format defined previously. Location refers to the virtual directory under your container. (Remember to partition your data as discussed in earlier chapter.):
CREATE EXTERNAL TABLE ext.Criteo ( col1 varchar(8) NULL, col2 integer NULL, ) WITH ( LOCATION='/data/', DATA_SOURCE=criteo, FILE_FORMAT=GzippedTextFile );
The external table is ready for queries using PolyBase (SELECT * FROM ext.Criteo).