Azure Synapse Analytics

Sep 8, 2020

Organizations understand the value of data more than ever. A Data Warehouse as a single source of truth, a data lake to store data for analytical exploration, self-service tools for data transformation, visualisation, and consumption as well as clusters to process immense data volumes. All these different use cases require other specialised tools resulting in a dispersity of knowledge and data throughout the organisation. Microsoft’s Azure Synapse Analytics provides a single platform servicing each data-related process: Big Data, Data Science, Data Lakes, Data Warehousing, Data Transformation and (Self-Service) Reporting.

A single platform

At the end of 2019 Microsoft introduced Synapse Analytics, a new service that merges the capabilities of Azure SQL Data Warehouse with new enhancements such as on-demand query as a service. Azure Synapse is an end-to-end cloud-native analytics platform that brings together data ingestion, data warehousing and Big Data analytics into a single service.”

Figuur 1 – Synapse Architecture

While there are a lot of bells and whistles in this service, the availability of three service to handle your data is one of our favourites:

  1. Provisioned SQL Pools (former Azure SQL Data Warehouse) store your data in an always-available cluster. This is ideal for frequently accessed data warehouses that require quite some calculation power.
  2. Provisioned Spark provide you with a set of spark clusters that can be used by data engineering and data science teams to answer their broad set of questions.
  3. On-Demand / Serverless provides a pay-per-query mode to access files on your data lake. No more any of those long data-integration tracks before being able to query the data in your lake.

The next chapters discuss these ‘ways of compute’ in more detail.

Provisioned SQL Pools

Provisioned SQL Pools (former Azure SQL Data Warehouse) is a service that stores your data in a fast-accessible way and can scale up to petabytes of size with only limited performance decrease.  The size of the SQL pool is determined by Data Warehousing Units (DWU). By scaling up when more performance is needed or scaling down to reduce costs. Billing is done while the server is running, smart scaling is advised!

How to choose the best performance level of your SQL Pool?

We suggest to start with a low performance level like DW300c. This has one compute node with 60 distributions per compute node and 180 GB memory available for you. When you need more power you can scale up easily. Until DW500c you will only get more memory available. From DW1000c you get two compute nodes and double the memory of DW500c, but at twice the price.

SQL On-Demand

SQL On-demand provides serverless queries over your data lake. Without any setup or infrastructure maintenance it allows you to explore data in Azure Data Lake Storage, Spark Tables and Cosmos DB.

By making use of the OPENROWSET function in SQL statements, you can easily access files stored in ADLS. SQL On-Demand supports Parquet, JSON and CSV. We think that it will fully live-up to its potential once Delta

It uses the pay-per-query model. You are only charged for the data processed per query and it auto-scales depending on the amount of data processed.

Example SQL script to query a CSV file from ADLS with Azure Synapse Studio

Via the on-demand SQL endpoint provided in the Azure Synapse workspace, developers can utilize tools such as SSMS and Azure Data Studio with the on-demand compute engine.

“SQL On-Demand offers an ad-hoc cost-effective way of querying your data lake. Only the data passing through the query is billed.”

Now there are two possibilities to query your data. What could be the best fit?
From our point of view we suggest using SQL pools for constant and high workloads. If you want to do ad-hoc analysis then SQL On-demand will be a better fit for you.

Apache Spark

Azure Synapse provides simple to use Spark clusters. It is a SaaS solution meaning that Azure Synapse will take care of all underlying infrastructure components for you. This allows you to use Spark immediately in your Azure Synapse environment. When creating your Apache Spark Pool you can choose three different node sizes: Small, Medium, Large.

Four languages are available in the Notebook experience of Azure Synapse: PySpark (Python), Spark (Scala), SparkSQL and .NET for Apache Spark (C#). Thus, the ‘R’ addicts will be disappointed.

As this is a cloud service, it is easy to reduce costs by auto-scaling and enable auto-pausing.

              

Data Integration

Azure Data Factory is directly integrated inside the Azure Synapse environment. Azure Synapse uses the same technology to provide data integrations features like ADF. There’s even a native integration between Spark and SQL.

Azure Synapse Workspace

The workspace is the heart of Azure Synapse. It’s the place where data engineers and data scientists collaborate on their analytics solutions. Access to an Azure Synapse workspace is managed by the Role Based Access Controls (RBAC) applied to all other Azure resources. For example, to enable Power BI developers to launch the Azure Synapse Studio and to access or build Power BI content from within the Azure Synapse studio, the developers need to be granted the required permissions to the Azure Synapse workspace.

Azure Synapse Studio

Synapse studio is a web-based interface that provides an end-to-end workspace and development experience for all Azure Synapse resources. All development and management activities supported by Azure Synapse are carried out in the Azure Synapse Studio via access to an Azure Synapse Workspace. Other development tools such as SSDT and SSMS can be used to interface with the Azure Synapse resources.

 

Conclusion

Azure Synapse has a lot to offer. If needed, we can help you understand Azure Synapse Analytics, so you get the most out of it for your business case.

Pieter-Jan Serlet

Pieter-Jan Serlet

Pieter-Jan has over 10 years of experience in data- and analytics environments. He has a passion for helping business users reach their data in a more effective and easier way while maintaining governance. There is no data Pieter-Jan can’t handle.