Improve Power BI Performance with SQL Server Indexing

By: Semjon Terehhov | Comments (1) | Related: > Power BI

Problem

What Power BI queries do to an underlying SQL Server data warehouse can be both complicated and confusing to understand. I am writing this tip to help DBAs and developers get the most out of their resources and avoid service degradation over time by providing an indexing strategy for Power BI applications. We will go into details of different Power BI modes and what they do to the database in terms of T-SQL as well as how indexes should be structured to efficiently support the workload. By efficiently, I mean minimum impact indexes (not over-engineered) for Kimball data mart star architecture, as every index comes at a cost.

Solution

Some basic pre-requisites for this demonstration. We will be using SQL Server 2017 on-premises and Power BI desktop co-located on the same Hyper-V VM. I am using the AdventureWorksDW2017 database provided by Microsoft as my data source. Power BI has to modes it gets data from a source, and these to modes work very differently from one another:

Import Mode
DirectQuery Mode

Each of which needs its own dedicated indexing strategy. In case you are not familiar with Power BI as an application I will recommend this tip as an introduction.

Power BI Import Mode

The name speaks for itself. In this mode the Power BI engine will cache all the datasets data into memory and then use DAX language to execute transformations. For illustration purposes I have chosen to use following tables from AdventureWorksDW2017:

FactResellerSales
FactProductInventory
FactInternetSales
DimProduct
DimCustomer

Open Power BI and select SQL Server from the list of available data source in "Get Data" tab and select import mode as shown below:

Figure 1: SQL Server connection - Import Mode

Figure 2: Power BI model - Import Mode

You can find more detailed help on setting up Power BI data set to SQL Server in this tip. Once the model is ready refresh it at least several times for SQL server to build up usage statistics. Word of warning, all usage statistic resets automatically on server restart. In other words, DMV (dynamic management views) and DMF (Dynamic Management Functions) historical data only available from the last server start up. Read more on what are DMVs and DMFs in this tip.

Figure 3: Power BI - Refresh data model

Once SQL server has enough usage statistics available. The T-SQL query given below will show filtered out usage for Power BI Import Mode only:

SELECT
SUBSTRING(SqlText.text,PATINDEX('%from%',SqlText.text) + 5,PATINDEX('% as [[]$Table]%',SqlText.text) - PATINDEX('%from%',SqlText.text) - 4) AS TableName,
SqlText.text AS SqlQuery,
QueryPlan.query_plan AS ExecutionPlan,
QueryStat.execution_count,
QueryStat.last_logical_reads, -- logical reads: number of 8KB pages that has been read from memory cache or disk
QueryStat.max_logical_reads,
QueryStat.total_logical_reads
FROM sys.dm_exec_query_stats AS QueryStat -- Query Plan Statistics
CROSS APPLY sys.dm_exec_sql_text(QueryStat.sql_handle) AS SqlText -- Text for SQL batch
CROSS APPLY sys.dm_exec_query_plan(QueryStat.plan_handle) AS QueryPlan --Query Plan
WHERE SqlText.text like '%$Table%' AND  SqlText.text not like '%SqlText.text AS SqlQuery%' -- Power BI ONLY queries
ORDER BY QueryStat.last_logical_reads desc -- Sorted by last most expensive query
Query result:

For illustration purposes let us use one table from the top of the list called dbo.FactResellerSales. The demo database has a Clustered Rowstore index (without compression) already.

Figure 4: Existing Indexes on AdventureWorksDW2017.dbo.FactResellerSales

To create a heap structure (table without clustered index) simply drop the existing PK_FactResellerSales_SalesOrderNumber_SalesOrderLineNumber:

-- HEAP (no index); NO compression
 
ALTER TABLE [dbo].[FactResellerSales] DROP CONSTRAINT [PK_FactResellerSales_SalesOrderNumber_SalesOrderLineNumber] WITH ( ONLINE = OFF )
 
GO

For Clustered Rowstore index without compression:

-- Clustered Rowstore; NO compression
 
ALTER TABLE [dbo].[FactResellerSales] ADD  CONSTRAINT [PK_FactResellerSales_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED 
(
   [SalesOrderNumber] ASC,
   [SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 
GO

For Clustered Rowstore index with ROW compression (add DROP_EXISTING = ON if you are recreating the clustered index without explicitly dropping it first):

-- Clustered Rowstore; ROW compression
 
ALTER TABLE [dbo].[FactResellerSales] ADD  CONSTRAINT [PK_FactResellerSales_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED 
(
   [SalesOrderNumber] ASC,
   [SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, 
DATA_COMPRESSION = ROW) ON [PRIMARY]
 
GO

For Clustered Rowstore index with PAGE compression (add DROP_EXISTING = ON if you are recreating the clustered index without explicitly dropping it first):

-- Clustered Rowstore; PAGE compression
 
ALTER TABLE [dbo].[FactResellerSales] ADD  CONSTRAINT [PK_FactResellerSales_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED 
(
   [SalesOrderNumber] ASC,
   [SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, 
DATA_COMPRESSION = PAGE) ON [PRIMARY]
 
GO

For Clustered Columnstore with Default compression (Drop any existing clustered indexes before running the script):

-- Clustered Rowstore; PAGE compression
 
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactResellerSales ON [dbo].[FactResellerSales]
       WITH (COMPRESSION_DELAY = 0 Minutes );
GO

Performance results for different index types:

Index Type	Compression	Logical Reads (Number of 8KB pages)	Elapsed time to UPDATE a single record in (milliseconds)
Heap (No Index)	NONE	1677	22 - table scan
Clustered Rowstore	NONE	1680	1
Clustered Rowstore	ROW	961	4
Clustered Rowstore	PAGE	367	2
Clustered Columnstore	COLUMNSTORE(Default)	287	22 – index scan
Clustered Rowstore + NON-Clustered Columnstore	NONE + COLUMNSTORE(Default)	287	1

Let me explain the results shown above for the logical reads. Power BI performers FactResellersSales table scan. Application reads all 8KB data pages that are assigned to this table. Heap is an unstructured collection of pages whereas Clustered Rowstore index arranges pages according to index columns. Since our application is performing a scan of all pages, aligning pages in the row order provides no performance gain. On the contrary significant read performance gains are achieved with help of compression of Clustered Rowstore index, but ultimately the best-read result comes with the Columnstore index. Read this article if you would like to know more about data compression.

Unfortunately for tables that require record updates, this is only part of the story. The clustered Columnstore index as good as it sounds for pure read performance, is slow when locating a single page to perform an update operation on the data in it. To support such an operation clustered Columnstore index will require a full scan. This is extremely resource intensive.

Best Index Strategy for Power BI Import Mode

From Power BI Import Mode point of view the columnstore index offers best read performance comparing to even the most compressed Rowstore index. Although one must keep in mind that Columnstore has its limitations especially when updating records. Significant performance degradation when updating or deleting records.

Therefore, if you have a table that requires frequent record updates, I would suggest having a combination of these two indexes:

Clustered Rowstore index that covers the where clause of update statement
Non-clustered columnstore index with all used table columns included

T-SQL code example for [YourTableName] is shown below (please replace comments with desired values):

CREATE CLUSTERED INDEX [CRIX_YourTableName] ON [YourTableName]
(
   -- INDEX PAGE ORDER BY COLUMNS
)
GO
 
CREATE NONCLUSTERED COLUMNSTORE INDEX [NCIX_YourTableName] ON [YourTableName]
(
   -- INCLUDE ALL COLUMNS HERE separating by comma
)WITH (COMPRESSION_DELAY = /****** HALF THE TIME BETWEEN ETLs  ******/ Minutes)
GO

With both indexes on the FactResellersSales table the SSMS view of the index folder should look as follows:

Figure 5:Clustred Rowstore & Non-Clustered Columnstore indexes

To illustrate how these two practically work together run the update query given below:

UPDATE dbo.FactResellerSales
SET UnitPrice = 0.0
WHERE SalesOrderNumber = N'SO46970' --Updates only one record
AND SalesOrderLineNumber = 5 --Covered by CRIX_FactResellerSales index
GO

The actual execution plan for the query above (Use Ctrl + M shortcut to Include Actual Execution Plan in SSMS):

Figure 6: Update statement actual execution plan for FactResellerSales

Plan for update query shows clustered Rowstore index seek that locates the row in question and then performs the Clustered Index Update. The Non-clustered Columnstore index will get new data written into delta store first and once the compression delay has timed out changes will be written into compressed column segments. Read more on Columnstore index for ETL. Below you will find a Power BI Import mode query for FactResellerSales table:

select [$Table].[ProductKey] as [ProductKey],
    [$Table].[OrderDateKey] as [OrderDateKey],
    [$Table].[DueDateKey] as [DueDateKey],
    [$Table].[ShipDateKey] as [ShipDateKey],
    [$Table].[ResellerKey] as [ResellerKey],
    [$Table].[EmployeeKey] as [EmployeeKey],
    [$Table].[PromotionKey] as [PromotionKey],
    [$Table].[CurrencyKey] as [CurrencyKey],
    [$Table].[SalesTerritoryKey] as [SalesTerritoryKey],
    [$Table].[SalesOrderNumber] as [SalesOrderNumber],
    [$Table].[SalesOrderLineNumber] as [SalesOrderLineNumber],
    [$Table].[RevisionNumber] as [RevisionNumber],
    [$Table].[OrderQuantity] as [OrderQuantity],
    [$Table].[UnitPrice] as [UnitPrice],
    [$Table].[ExtendedAmount] as [ExtendedAmount],
    [$Table].[UnitPriceDiscountPct] as [UnitPriceDiscountPct],
    [$Table].[DiscountAmount] as [DiscountAmount],
    [$Table].[ProductStandardCost] as [ProductStandardCost],
    [$Table].[TotalProductCost] as [TotalProductCost],
    [$Table].[SalesAmount] as [SalesAmount],
    [$Table].[TaxAmt] as [TaxAmt],
    [$Table].[Freight] as [Freight],
    [$Table].[CarrierTrackingNumber] as [CarrierTrackingNumber],
    [$Table].[CustomerPONumber] as [CustomerPONumber],
    [$Table].[OrderDate] as [OrderDate],
    [$Table].[DueDate] as [DueDate],
    [$Table].[ShipDate] as [ShipDate]
from [dbo].[FactResellerSales] as [$Table]

Actual Execution Plan showing Non-clustered Columnstore Index Scan:

Figure 7: Power BI Import Mode Columnstore Index Execution Plan

Plan shows that all the Power BI application needs is to perform a read operation of the table is a Columnstore index scan in order to import all table data into tabular model for further DAX processing. The clustered Rowstore index is there, but it is not used for reading.

Power BI DirectQuery Mode

The DirectQuery Mode compiles Power BI logic into T-SQL queries that are executed at the moment in time a user requests data in the application. The logic that is translated into T-SQL includes filters, measures, graphs, DAX, tables, aggregations and more. This mode allows to get the latest data from SQL Datawarehouse but at the same time poses a significant challenge to performance of SQL engine. As all the logic in the application is expressed into T-SQL it is basically exposed to SQL Server and this in turn complicates the indexing strategy. Let us now go through the setup.

To begin with you need to create new Datasets as shown in the Figure 1, but this time choose a DirectQuery connectivity mode. Repeat this for following tables:

FactResellerSales
FactProductInventory
FactInternetSales
DimProduct
DimCustomer

I have also added and replaced the indexes to all tip tables as recommended by indexing strategy for Power BI Import Mode (see previous section). If you will choose to refresh the model now, no data will be gathered from SQL Server. The reason for this is that DirectQuery mode only queries data that is touched by the current user setup. We have so far, not setup any visualizations and this in turn needs no data to be refreshed. I will be using a simple Power BI visualization, such as table with some aggregation logic. Go ahead and add a table to the Power BI model showing total number of products sold aggregated by the product name and size. The source tables and columns are following:

FactResellerSales(OrderQuantity)
DimProduct(ModelName,Size)

Once the table is populated with data it should look as shown below:

Figure 8: Power BI DirectQuery mode model table

Now, click refresh model button several times and you will see the data being pulled from SQL Server to the Power BI application. Notice that Power BI only gets data from two tables that are used in our model. Once SQL server has enough usage statistics available the T-SQL query given below will present filtered out usage statistics for the Power BI DirectQuery Mode only:

SELECT
SqlText.text AS SqlQuery,
QueryPlan.query_plan AS ExecutionPlan,
QueryStat.execution_count,
QueryStat.last_logical_reads, -- logical reads: number of 8KB pages that has been read from memory cache or disk
QueryStat.max_logical_reads,
QueryStat.total_logical_reads
FROM sys.dm_exec_query_stats AS QueryStat -- Query Plan Statistics
CROSS APPLY sys.dm_exec_sql_text(QueryStat.sql_handle) AS SqlText -- Text for SQL batch
CROSS APPLY sys.dm_exec_query_plan(QueryStat.plan_handle) AS QueryPlan --Query Plan
WHERE SqlText.text like '%$Table%' AND  SqlText.text not like '%SqlText.text AS SqlQuery%' -- Power BI ONLY queries
AND SqlText.text not like '%SourceQuery where 1 = 2%'-- Excludes metadata scan queries from Power BI
ORDER BY QueryStat.last_logical_reads desc -- Sorted by last most expensive query

The SqlQuery text for most expensive query from SQL server usage statistics shown below (Simplified T-SQL without all columns shown):

SELECT
TOP (1000001) [t1].[Size],[t1].[ModelName],SUM(
CAST([t0].[OrderQuantity] as BIGINT)
)
 AS [a0]
FROM 
((select [$Table].[ProductKey] as [ProductKey],
    [$Table].[OrderDateKey] as [OrderDateKey],
    [$Table].[DueDateKey] as [DueDateKey],
    ... --All Cloumns in table
 
from [dbo].[FactResellerSales] as [$Table]) AS [t0]
 
 left outer join 
 
(select [$Table].[ProductKey] as [ProductKey],
    [$Table].[ProductAlternateKey] as [ProductAlternateKey],
    [$Table].[ProductSubcategoryKey] as [ProductSubcategoryKey],
    ... --All Cloumns in table
from [dbo].[DimProduct] as [$Table]) AS [t1] on 
(
[t0].[ProductKey] = [t1].[ProductKey]
)
)
 
GROUP BY [t1].[Size],[t1].[ModelName]

The most striking difference is that now, instead of running two separate table scans Power BI performs one query using a left outer join and a group by aggregation by Size and Model name. The execution plan for this query:

Figure 9: Execution Plan for DirectQuery without filter

The query requests all data from both tables and the way SQL Server provides this is by scanning non-clustered columnstore index in both tables and then performing a hash match join. This is the most effective way to perform large data scans and then joining two tables. Let us run one more scenario with filters added to out Power BI report. I have chosen SalesTerritoryKey column from FactResellerSales table as my Page level filter:

Figure 10: Power BI SalesTerritoryKey Page level filter

The reason for selecting exactly this column for filtering is selectiveness of results. By choosing just one value of 9, the SQL query should only return 1713 rows out of total 60855 in FactResellerSales table. This is 2.8% of total number of rows in the table. As a good SQL server developer, I would normally choose to add an index to support such filter:

--Non-Clustered Rowstore; PAGE compression
 
CREATE NONCLUSTERED INDEX [NRIX_FactResellerSales_SalesTerritoryKey] ON [dbo].[FactResellerSales]
(
   [SalesTerritoryKey] ASC
)
INCLUDE([ProductKey],[OrderDateKey],[DueDateKey]... All columns)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON,
DATA_COMPRESSION = PAGE) ON [PRIMARY]
 
GO

You should now have 3 indexes available for dbo.FactResellerSales table:

Figure 11: Indexes for FactResellerSales Table

Click Refresh in Power BI a few times to buildup the SQL server statistics and let us have a look at what SQL server uses to return results to Power BI report with a selective filter (Use the query provided in the beginning of this chapter). Shown below is the left part of the execution plan:

Figure 12: Execution Plan for Power BI Direct Query Mode with filter

Execution plan shows that SQL server chooses to perform columnstore index scan of FactResellerSales even though it has a non-clustered Rowstore index that perfectly covers the query. Based on the cost estimate the SQL considers scanning of a columnstore index the more efficient strategy than a lookup of non-clustered Rowstore. You can double check this disabling the columnstore index and running the query again. I have used SET STATISTICS IO ON and SET STATISTICS TIME ON to analyze query performance with columnstore index and non-clustered Rowstore for FactResellerSales. My results showed the following:

Index Type	Compression	Logical Reads (Number of 8KB pages)	Elapsed time (milliseconds)
Non-Clustered Rowstore	PAGE	13	3
Columnstore	COLUMNSTORE(Default)	48 (LOB)	2

It looks like Columnstore index scan has a slight edge over the compressed Rowstore index look-up even with a filter that returns less than 3% of records. This may also be due to fact that I am using a rather small sample database.

Best Index Strategy for Power BI DirectQuery Mode

After completing an index performance study for this tip, I am inclined to say that Power BI DirectQuery Mode Index strategy should be very similar to what we have already discussed for Import Mode:

Clustered Rowstore index that covers the where clause of update statement
Non-clustered columnstore index with all used table columns included

But, as the Power BI logic (including reports and DAX) is exposed to the SQL server, I will strongly recommend for reports that use selective filters like: datetime range, types, persons, product etc. to have a covering non-clustered Rowstore index in addition to those already mentioned indexes for table size is more that 1GB and query that should return less than 3% of total number of rows. This is an educated guess and should be adjusted based on the resources (CPU, DISK IO, MEMORY) that are available to SQL server.

Next Steps

Learn more about Power BI in this 3 hour training course.

Click here to start the Power BI course

About the author

Semjon Terehhov is an MCSE in Data Management and Analytics, Data consultant & Partner at Cloudberries Norway.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Friday, June 4, 2021 - 1:16:18 AM - Stuart French	Back To Top (88796)
So what I don't get Semjon, is whether the table indexes for the Import Mode are there to increase the speed the data is imported, or the speed the data is sliced and processed in PowerBI. I am after the latter. Can you advise?