Where did my Dedicated SQL Pool go?
Note: This blog was written when Fabric was still in Public Preview. Some features may have changed at the moment of reading. Features are described at the moment of writing (Mid-October 2023).
Microsoft Fabric is the newest product of Microsoft in the Data and Analytics world. Fabric provides an all-in-one analytics solution tailored for enterprises which covers the entire spectrum of Data and Analytics operations. From data movement and data science to real-time analytics and business intelligence. These functionalities are all conveniently centralized in a single platform and are offered as Software as a Service (SaaS) with Microsoft OneLake as the main building block.
However, where is my dedicated pool as we used to have in Synapse? I cannot find it anywhere. Well, it is gone (at least as we currently know it…). If this is a blessing or a curse? I let you decide! In this blog post, we will focus on Fabric’s new SQL Warehouse offering.
Never forget where we come from…
Microsoft took their first steps in massively parallel processing (MPP) around 2008. Project “Madison” was an internal project to develop an MPP data warehouse system. In 2010, they first launched SQL Server Parallel Data Warehouse better known under the abbreviation PDW. Another release of the product was launched in 2012. Four years later, they introduced Azure SQL Data Warehouse (ASDW), built on the same foundations as PDW but optimised for cloud deployments. In late 2019, Microsoft launched Azure Synapse Analytics. This solution is branded as a unified analytics platform which embodies big data, data integration, analytics, and visualisation. They included ASDW as Synapse Dedicated SQL Pools. No special features were added to the Synapse Dedicated Pools, it was still based on the Microsoft MPP engine. However, they also added something new: Synapse Serverless SQL pools based on the Polaris engine. This serverless engine allows you to query files from your Data Lake and it scales automatically based on the workload and you don’t have to manage it yourself! In my opinion, this is one of the most interesting features of Synapse Analytics.
Almost 4 years later, Fabric goes into public preview. Is it Synapse 2.0? Is it just a silver lining around Synapse? Actually, it is not. Keep on reading to get to know why this is not the case.
Synapse Data Warehouse
Synapse Data Warehouse, is the new name for the Warehouse experience in Fabric. However, it is not the Synapse Dedicated Pool nor the Synapse Serverless Pool. Actually, the new Warehouse is a combination of both worlds! However, if we really need to make an analogy with already existing Azure components, it would look like something in Figure 1.
- The SQL Endpoint of the Lakehouse is a read-only endpoint of your Lakehouse table saved in Delta format in your OneLake.
- The Data Warehouse SQL Endpoint provides read-write capabilities to Delta tables on your OneLake.
However, there is a catch. You cannot write to Delta tables created via the Lakehouse experience with the SQL Endpoint. You can only write to tables created via the Data Warehouse experience.
Figure 1 is not completely correct. Microsoft revised the underlying SQL engine but still making the analogy with the Synapse SQL engines makes sense.
Moreover, there is only one SQL endpoint. The Warehouse and Lakehouse tables are exposed in different databases on the same endpoint. Microsoft calls these “Virtual Warehouses”. These just look like different databases on the same server. This enables cross-database querying! You can read data from your Lakehouse and write it into your Warehouse just with T-SQL code. The other way around, writing to a Lakehouse is currently only supported by using Spark, Dataflows Gen2 or pipelines. You can query data from different sources by only using one engine and without data duplication. This all works seamlessly as long as your data resides in OneLake (or is exposed via a shortcut).
Improvements compared to the Classical Synapse Data Warehouse
Simplified workload management
Previously, managing workloads on Synapse DWH was quite complicated and it required some manual tuning. Fabric DWH experience eliminates the need to do manual tuning. Autonomous allocation and automatic scaling are built-in. Isolation is also provided by separating workloads based on characteristics. Therefore, ETL jobs and ad-hoc reporting queries cannot interfere.
In the Synapse DWH, files were stored in a proprietary format. In the new DWH experience, data is stored in parquet files with delta logs. This results in files completely compliant with the open-source Delta Lake format based on Parquet. The Delta format offers ACID transaction support, time-travelling possibilities, audit history, schema evolution and many more features. By using an open format, the files can be queried with different engines and eliminate the need to copy files into another format.
Separation of storage and compute
Compute and storage are now uncoupled. This allows us to scale compute and storage independently while still allowing for DQL, DML, and DDL T-SQL support including transactions. The Warehouse can automatically scale when demands are required. You do not have to worry anymore about increasing the number of DWUs before your ETL process starts, decreasing the number of DWUs afterwards or pausing your Warehouse.
How SQL Serverless and Dedicated Pools became one
As we already mentioned earlier, there is only one SQL endpoint per workspace. This also means there is only one SQL engine. How did Microsoft combine the two separate engines into one?
Microsoft combined the best of their technologies. They took their query optimizer from SQL Server, combined that with the query processor of Vertipaq and used their knowledge from Polaris to scale this solution.
Microsoft modified the engine to be able to read Parquet instead of the proprietary storage format used by Vertipaq. In the end, Parquet and Vertipaq are quite similar. They both apply similar compression techniques: columnar storage format, dictionary encoding, bit packing and Run Length Encoding (RLE). All that’s left is doing a transcoding from Parquet to Vertipaq. However, Vertipaq does also row shuffling to do more efficient RLE. That’s where V-ordering comes into play. Rows are reshuffled to optimize performance. This new way of storing data combined with the Distributed Query Processing engine from Polaris (link to Polaris paper) results in the new SQL engine used in Fabric.
Your SQL Dedicated pool is not gone. It just became better and more integrated than ever before! Serverless and Dedicated pools became one. The engine has been revised and now supports an open-source storage format which can be read with multiple compute engines. T-SQL is not dead. You can combine different compute engines in your data architecture while not duplicating data. This also means you can leverage your existing skillsets! We are looking forward to the additional features Microsoft will add to this new Warehouse Experience!
Niels De Swaef
Niels De Swaef is a Data Engineer passionate about using technology to solve complex problems and improve people’s lives. He has experience with the Microsoft Azure data stack, including Synapse, Databricks, Azure Data Factory, and CI/CD pipelines. Niels is also a lifelong learner who is always looking for new ways to use data to make a positive impact.