Synapse vs Databricks: which one should you choose?

Synapse or Databricks: which one should you choose? This question comes up a lot when implementing a future-proof data platform. Both platforms are cloud-based, state-of-the-art technologies for building a Data Lakehouse architecture. However, each solution has its own strengths.

Today, the ‘lakehouse’ trend is real, and companies are switching from classic data warehouses to lakehouses to cope with the greater variety, larger volumes and higher velocity of the data being generated. But which technology do you choose to build your lakehouse on? Which criteria do you need to consider to make a thoughtful decision? We will answer these questions in this blog post.

We will compare both platforms across three key areas: Data Engineering, Data Warehousing and Machine Learning. Do note that both products are evolving fast, with new features released every month. Contact us for a check of updated functionalities that are of high value for your organization.

Are you new to Azure Synapse and Azure Databricks? Read the following blogs written by Lytix colleagues:

Azure Synapse

Azure Synapse is not just a rebrand of Azure SQL Data Warehouse. It is a complete service that brings together data integration, data warehousing and big data analytics. Azure Synapse offers a unified experience to load, explore, transform and serve data for either business intelligence or machine learning needs.

Figure 1: Overview of Azure Synapse (Source: Microsoft)

Databricks

The Databricks platform, in particular its Data Lakehouse offering, combines the best of data warehouses and data lakes in one platform. The Lakehouse handles your data integration, analytics and AI use cases, and does so mostly with open-source technologies; Databricks recently open-sourced the Delta Lake format. Moreover, Databricks is a multi-cloud offering: it can run on Google Cloud Platform, AWS or Azure.

Figure 2: Databricks Lakehouse Platform (Source: Databricks)

Comparison

Data Engineering

Ingesting

Building ETL (Extract, Transform, Load) flows is the biggest part of a data engineer's task list. Synapse provides a lot of easy-to-use connectors to various data sources, which enables you to load data from a source to your data lake in a couple of minutes. If you are already using Microsoft technology for your OLTP systems, Synapse offers some advantages: Synapse Link provides a hybrid transactional and analytical processing (HTAP) capability that enables near real-time analytics over operational data. This service supports Azure Cosmos DB, Dataverse and SQL (Azure SQL and SQL Server 2022, currently in preview).

In Databricks, data ingestion is mainly done through code. Databricks supports various data sources such as S3, Amazon Redshift, Google BigQuery, … It also recently launched “Auto Loader”, which efficiently and incrementally loads new data files as they arrive in your cloud storage account. On the ingestion side, Synapse offers more native support in terms of functionality and ease of use, while Databricks requires more coding.
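To give an idea of what “more coding” means in practice, here is a minimal sketch of an Auto Loader stream as it could look in a Databricks notebook (where the `spark` session is predefined). The paths, file format and target table name are illustrative placeholders, not a recommended setup:

```python
# Incrementally ingest new files from cloud storage with Auto Loader.
df = (spark.readStream
      .format("cloudFiles")                                   # Auto Loader source
      .option("cloudFiles.format", "json")                    # format of incoming files
      .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/orders")
      .load("/mnt/lake/raw/orders"))

# Write the stream to a Delta table, tracking progress via a checkpoint.
(df.writeStream
   .option("checkpointLocation", "/mnt/lake/_checkpoints/orders")
   .trigger(availableNow=True)                                # process new files, then stop
   .toTable("bronze_orders"))
```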

Exploring

Data engineers must also explore their data before doing the necessary transformations. Both platforms support Spark, which can be used in notebooks to explore the data on your lake: you can directly query your files. However, in Databricks you need to mount a data lake and set up credential handling before you can start querying it, while in Synapse this is a more streamlined experience.
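Once the lake is reachable, the exploration itself looks the same on both platforms. A minimal sketch, assuming an illustrative mount path and a hypothetical `country` column:

```python
# Query raw Parquet files on the lake directly from a Spark notebook.
df = spark.read.parquet("/mnt/lake/raw/sales/")

df.printSchema()                        # inspect the inferred schema
df.groupBy("country").count().show()    # quick profiling of one column
```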

Both platforms support Python, Scala and Spark SQL; for niche use cases, Databricks also supports R. In addition, Synapse supports T-SQL to query your data via Synapse Serverless. The main advantage of Synapse Serverless is that your Spark cluster does not need to be running to query your files. However, Databricks is launching a comparable solution to query your lake without a running Spark cluster: Databricks Serverless SQL Warehouse (formerly SQL endpoints). It offers instant compute managed by Databricks, which a workspace admin can easily enable for your workspace. For now, this serverless compute feature is in public preview and only available on AWS. For exploring data, we do not see a clear winner. We could say that Databricks best serves data engineers who are used to Python or R, while Synapse is more suitable for ‘business-oriented’ data engineers who are used to working with T-SQL.

Transforming

After you have explored your data, you need to transform it. Both platforms decouple compute from storage, which enables you to scale compute independently. The preferred way to transform your data is with the Spark engine. Data is mostly stored as Parquet files on the data lake, but the Delta format is preferred: it adds extra functionality (e.g. time travel) on top of the standard Parquet file format. Both Synapse and Databricks support the Delta format, but Databricks offers more features and extra optimizations for it, and new Delta releases become available on Databricks earlier. The most common way of working is to write your transformations in a notebook. Databricks is a bit ahead on the notebook experience: both platforms support co-authoring, but Databricks also supports real-time co-authoring. Each platform has its own notebook interface, but Databricks additionally supports connecting your local IDE to a Databricks Spark cluster. Databricks only supports developing your transformations in code, while Synapse also has a visual transformation tool called Data flows, which lets you drag and drop building blocks for your transformations. Currently, however, the features in Data flows are quite limited, and you will still need to write code for advanced transformations.
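The time-travel feature mentioned above works the same on both platforms. A minimal sketch, assuming a transformed DataFrame `df` and an illustrative table path:

```python
# Persist the result of a transformation as a Delta table.
df.write.format("delta").mode("overwrite").save("/mnt/lake/silver/customers")

# Time travel: read the table back as it was at an earlier version.
previous = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/mnt/lake/silver/customers"))
```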

Databricks launched a feature called Delta Live Tables (DLT), which makes it easy to build and manage data pipelines. It includes declarative pipeline development, automatic data testing, and advanced monitoring and recovery capabilities.
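To illustrate the declarative style, here is a minimal DLT sketch; the source table `orders_raw` and the quality expectation are hypothetical examples:

```python
import dlt
from pyspark.sql.functions import col

# Declare a table in the pipeline; DLT manages its creation and updates.
@dlt.table(comment="Orders with basic quality checks applied")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # declarative data test
def orders_clean():
    return dlt.read("orders_raw").where(col("status").isNotNull())
```

Rather than writing orchestration code, you declare the tables and their quality expectations, and DLT works out the dependency graph and monitoring for you.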

Databricks offers more advanced features on the transformation side, while Synapse offers some nice visual tools for the ‘citizen’ data engineer.

Streaming

Handling streaming data is always a bit more complex than traditional batch loads. Both platforms support streaming data, but in slightly different ways. As Databricks takes the open-source route, it supports Spark Structured Streaming, which is neatly integrated into the Databricks environment. The more recent Auto Loader feature is a nice application of Structured Streaming for incrementally loading your data: it loads data in micro-batches in near real-time. Synapse, on the other hand, integrates neatly with Azure Event Hubs and Azure Stream Analytics. Databricks offers more advanced features for now, but Synapse is getting closer.
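As a minimal Structured Streaming sketch, here is a stream that continuously moves records from one Delta table to another; the paths are illustrative placeholders:

```python
# Read a Delta table as a stream of incoming changes.
events = (spark.readStream
          .format("delta")
          .load("/mnt/lake/bronze/events"))

# Continuously append the stream to a downstream Delta table.
(events.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/lake/_checkpoints/events")
       .start("/mnt/lake/silver/events"))
```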

Data Warehousing 

Serving for BI and Reporting 

Databricks provides some limited reporting capabilities in its web interface, called ‘Databricks Dashboards’, which let you build visualizations based on data generated by notebooks. However, Databricks recently launched ‘Databricks SQL’, a serverless way to query data sitting on a data lake using SQL: no need to spin up clusters anymore to query your data. The Synapse variant of this service is called Synapse SQL Serverless. It allows you to query data on your Azure Data Lake in different formats (Parquet, CSV, Delta…) near-instantly, without having to set up any infrastructure. The other variant of Synapse SQL is Synapse Dedicated SQL Pools, formerly called SQL Data Warehouse. It provides instant query responses when the service is not paused, relying on the power of a distributed query engine to deliver quick insights to the end user. These services can supply your BI team and tools with data: you can connect your favourite BI tools to any of the SQL endpoints (Databricks SQL, Synapse SQL Serverless, Synapse Dedicated SQL Pools).

However, Databricks does not give you the ‘full’ relational experience we are used to in traditional data warehouses. The Delta format supports ACID transactions and time travel, but enforcing referential integrity is still quite complicated. Nevertheless, Databricks recently introduced identity columns on Databricks Spark runtimes and Databricks SQL, which allow you to consistently generate surrogate keys for your star schemas. Synapse, on the other hand, lets you define your traditional views and stored procedures. Synapse Dedicated SQL Pools give you the full relational model you are used to working with, including the ability to define indexes, whereas Databricks only lets you choose a good partitioning strategy. However, a straight comparison of these two services may not be entirely fair: Synapse Dedicated SQL Pools require loading your data into the pool, which sacrifices the advantages of decoupling storage and compute. This is not the case for Databricks, where your data stays in the lake, but you do need to take cluster start-up time into account.
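For the identity columns mentioned above, a minimal sketch of a dimension table definition, run as Spark SQL from a notebook; the table and columns are illustrative, and the syntax assumes a recent Databricks runtime:

```python
# Delta identity column generating surrogate keys for a star schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_id STRING,
        name        STRING
    ) USING DELTA
""")
```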

Additionally, Synapse lets you use Power BI directly from the Synapse Studio web interface. Databricks also allows you to connect to different reporting tools via its ‘Partner Connect’ feature. If you want the fullest integration between Power BI and your analytical engine, Azure Synapse is the preferred choice. Azure also provides ‘Microsoft Purview’, which scans Power BI and Azure Synapse for data lineage and dependencies all the way back to the source.

Machine Learning 

Synapse offers built-in support for Azure Machine Learning. Azure ML provides a service for the end-to-end machine learning lifecycle, empowering data scientists to build, deploy and manage models. You can also use Spark MLlib in Spark notebooks, and Microsoft recently added GPU support to Synapse Spark clusters, enabling you to train your models even faster. To use your trained models in Synapse, you can simply call the T-SQL PREDICT function to get predictions without moving any data out of your warehouse.
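As an idea of what such a quick MLlib experiment looks like in a Spark notebook (on either platform), here is a minimal sketch with a tiny, made-up dataset and column names:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Toy training data; in practice this would come from your lake.
train_df = spark.createDataFrame(
    [(0.0, 1.2, 0.0), (1.5, 0.3, 1.0), (2.1, 0.8, 1.0), (0.2, 1.9, 0.0)],
    ["feature_a", "feature_b", "label"])

# Assemble feature columns into a vector and fit a simple classifier.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)
model.transform(train_df).select("label", "prediction").show()
```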

Databricks has been in the data science game for a long time. Consequently, it offers some neat solutions to push your machine learning models to the next level. It provides a managed MLflow environment that guides you through the various aspects of the machine learning model lifecycle, with advanced feature stores, a model registry, experiment tracking, logging, and monitoring capabilities. Models can be versioned and moved through the various stages (experimenting, staging, production…). Databricks allows you to put your best model into production with low latency and expose the deployed ML model via a REST API. Databricks also supports AutoML, which lets you quickly experiment with different machine learning algorithms. All of this integrates neatly with Databricks Repos to leverage automated CI/CD processes and improve collaboration within teams.
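A minimal experiment-tracking sketch with the managed MLflow on Databricks, where the tracking server is preconfigured in the notebook environment; the dataset and hyperparameters are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Every parameter, metric and model logged inside the run is tracked
# and versioned by MLflow.
with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("r2_test", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # artifact that can be registered
```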

For advanced machine learning, we would definitely recommend Databricks. However, for quick machine learning experiments to see what you can do with your existing data, Synapse also offers some nice possibilities to get quick insights.

Conclusion

Should we go for Synapse or Databricks? Unfortunately, the answer is: it depends. The decision is use-case specific. If you already have a lot of Microsoft and/or Azure services running, your transformations are not too complex, and your machine learning needs are currently quite low, you should go for one integrated solution: Azure Synapse.

When you have advanced transformations, multiple streaming sources, and advanced machine learning requirements, you should go for Databricks. 

However, the decision is not always black or white: Databricks and Synapse can work well together. One can opt for an architecture where Synapse pipelines ingest data into a data lake, Databricks builds the Delta Lakehouse on top of that lake, and Synapse Serverless SQL serves the data to BI tools. Many combinations are possible; one should use the right tool for the right use case. A hybrid scenario can be the best fit for your use case as well.

We are happy to assist you with this architectural decision to build your future-proof data platform. Do not hesitate to reach out. Our experts in both Databricks and Synapse are happy to help you out. 

Some example scenarios

“I need to use drag-and-drop data transformations”  
Use Azure Synapse Data Wrangling/Data Flows.  

“I need to use a Serverless SQL layer on top of my Data Lake”  
Synapse already offers Synapse Serverless SQL. Databricks SQL is currently in preview.  

“I need to use single node clusters for simple PoC’s”  
Single-node clusters are only available in Databricks at the moment. The smallest Synapse Spark cluster still consists of one master node and two workers.

Niels De Swaef

Analytics consultant at Lytix