What is Direct Lake mode in Fabric?
You are tired of waiting for your dataset to refresh in Power BI service? Say goodbye to these waiting times, Microsoft Fabric has a Direct Lake connection from Power BI to your OneLake! One of the great innovations in Microsoft Fabric is Direct Lake. Keep on reading if you want to know more about this awesome new feature! But first things first, a little introduction about Fabric if you haven’t heard about it already.
What is Microsoft Fabric?
Microsoft Fabric is the latest offering from Microsoft in the data and analytics space. It is a comprehensive, end-to-end solution covering the entire spectrum of data and analytics operations, from data movement and science to real-time analytics and business intelligence. All these functionalities are conveniently centralized in a single SaaS platform, with Microsoft OneLake as the foundation. This makes Fabric a powerful and versatile tool for enterprises of all sizes. It can be used to build and manage data pipelines, prepare and analyse data, and create and share insights with business users.
One Copy for all computes
How many lakes are there? One! How many copies of your data do you have? One! There is only OneLake and one copy of your data! By using an open-source storage format based on Delta which is an additional layer on top of Parquet which provides us with ACID properties, time-travelling possibilities and other features, we can have only one copy of your data which is natively supported for all compute engines in Fabric! No more importing or exporting. All data is stored in a single format in your lake. In this blog, we are going to take a closer look at Direct Lake mode for Power BI. A little spoiler, you don’t need to import your data anymore!
Overview of different storage modes in Power BI
Let’s first have a look at the different storage modes in Power BI. Import and Direct Query both have their use cases. However, Direct Lake combines the best of both worlds. Let’s have a closer look.
When a table has Direct Query as Storage Mode configured, data is always up-to-date in your report. This also means when a report is opened by a user, the DAX queries which are used by the visuals in the report need to be translated into SQL queries. These SQL statements are executed on the source database. The source database needs to read from its own storage. Whenever this is done, the results are pushed back to the report and shown to the report consumer. This process entails quite a lot of steps and is consequently quite slow but data is always fresh. There should be no scheduled refreshes of the data.
If you use Import Mode for a table, that table is read from the data source and saved into a proprietary format. This proprietary format is a highly compressed columnar storage format. When a user interacts with a visual, the data is loaded by the Vertipaq engine into memory from this proprietary storage format. The main problem with this approach is the loading of the data has to be scheduled. Another side-effect of this loading into another format is data duplication. This storage mode is often preferred as it provides superior performance compared to Direct Query.
Direct Lake uses the same methodology to load the data as Import Mode. The main difference is the storage format. Direct Lake can directly read Delta files from OneLake! The Delta storage format is mainly based on top of Parquet. Parquet is a columnar-based storage format which is quite similar to the proprietary one used by Microsoft for Import Mode. The Vertipaq engine queries the Delta files and the required columns are loaded in memory. The Vertipaq engine transcodes the Parquet compression format into a format which can be understood by Analysis Services. This transcoding on the fly eliminates the need to duplicate the data in another format. The DAX queries also don’t need to be translated which is a computationally expensive step when using Direct Query mode.
Let’s take a deep dive into Direct Lake
Before you can make use of Direct Lake, your data should be stored in Delta format in OneLake. You can create Delta tables in different ways:
- Fabric Notebook and (ADF) Pipeline:
Delta tables can also be created with Spark by using either SQL or Python.
- Load to Tables:
CSV or Parquet files can be converted into Delta tables by using the ‘Load to Tables’-wizard in Fabric.
- Shortcut to existing Delta table:
You can use a Delta table created by another Spark engine or Databricks as long as it uses a supported Delta protocol version.
Currently, you can only create a Direct Lake dataset from the web interface. This functionality is not available in Power BI desktop at the time of writing.
To achieve nearly the same performance as import mode, Microsoft has introduced V-ordering. It is a proprietary algorithm for writing parquet files which provides better query performance while still being compatible with the open-source Delta format. V-Ordering does not violate any of the open standards of the Delta format, therefore it is not a proprietary format. V-ordered parquet files offer the best performance for Direct Lake. Delta tables created by Fabric engines are V-ordered by default. Let’s not confuse Z-Ordering with V-Ordering. These are all fully compatible with the Delta open standard and can be used together for extra benefits. Z-ordering works on the Delta level of the files wile V-ordering modifies the ordering within a parquet file.
When you open your report for the first time, you might experience slightly slower response times. This is caused by the fact that data needs to be loaded into memory from the Delta files in OneLake. The selected columns needed for the DAX query of your visual are loaded into memory or cache. When you open the report for the second time, the response time will be a lot faster because the data is already cached.
Microsoft uses ‘temperatures’ under the hood to determine which columns to load or keep in cache. The more queries use a specific column, the higher the temperature will be for that column. You can have a look at the temperatures of the columns yourselves by querying the dynamic management views (DMV). All DMVs can be found in the Microsoft Documentation.
The refresh button in Power BI service is still available. If you click this button, it executes a “Framing” operation. All the data is dropped from the cache. The same thing happens when new Delta files arrive in OneLake (when you have turned on the option to automatically get new data into your dataset). Whenever a “Framing” operation happens, the metadata from the underlying Delta files is scanned by the Power BI service.
When using Direct Lake mode, data still needs to be loaded into memory from the Delta files. Currently, SQL Warehouse views are not supported. Only Delta tables created in the ways mentioned above. Memory limits are defined based on your Fabric SKU. If your Direct Lake dataset exceeds the memory of your Fabric capacity, it will fall back to Direct Query mode. In this case, you need to be careful. If you are using DAX expressions which are not supported in Direct Query, your measures will return an error. Another possibility is to turn off the Direct Query fallback. However, if your memory of your capacity is exceeded, it will throw an error. The best option is to leave Direct Query fallback on for large datasets and make sure your measures are not using DAX expressions not compatible with Direct Query.
Direct Lake is one of the best features of Microsoft Fabric in my opinion. We are happy to see Microsoft embracing open formats and the Lakehouse architecture. This new feature enables the use of Delta files from other engines to be read directly by Power BI. As an example, one can use Databricks for advanced data transformations and Fabric to serve the data to the end-user by using Direct Lake and Power BI. The possibilities are endless. If you want to use Direct Lake at your organization, do not hesitate to reach out!
Niels De Swaef
Niels De Swaef is a Data Engineer passionate about using technology to solve complex problems and improve people’s lives. He has experience with the Microsoft Azure data stack, including Synapse, Databricks, Azure Data Factory, and CI/CD pipelines. Niels is also a lifelong learner who is always looking for new ways to use data to make a positive impact.