7 pro-tips and tricks for Synapse
Since its launch in 2019, Azure Synapse has emerged as a powerful analytics service, dramatically accelerating time to insight across data warehouses and big data systems. The data applications for Synapse are endless and span many industries. This blog will therefore guide you through this ‘jungle’ and provide several key tips and tricks to kick-start your Synapse journey. Without them, you risk getting stuck in a ‘data swamp’ and never fully unlocking the potential of this tool.
Deploy Synapse as Code in Bicep
Firstly, I would like to stress the importance of setting up Synapse in Azure via code instead of using the ‘click configuration approach’ in the Azure portal. By defining your infrastructure as code, changes become traceable, your setup becomes reusable, and you automatically have a backup of your configuration. Infrastructure code can easily be written in Visual Studio Code using the Bicep extension. Bicep is a domain-specific language for ARM templates that simplifies the provisioning and management of Azure resources, including Azure Synapse. On the left-hand side of the image below, several Bicep modules are shown, each containing Bicep code to configure a specific resource in Azure. As you can see, several typical Azure resources appear, such as a Key Vault, a Storage Account, and a SQL Server. On the right-hand side, part of the code for setting up Synapse is shown. First, some parameters are defined, which are reused throughout all the modules. Next, the attentive reader can see the creation of firewall rules for Synapse. In this particular case, the firewall is effectively disabled by allowing all possible IP addresses to connect to Synapse. I would not recommend this setting: internet access should be strictly limited to a handful of internal developers and processes.
Figure 1: Setting up Azure resources via Bicep in Visual Studio Code
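To make the idea concrete, here is a minimal Bicep sketch of a Synapse workspace with a firewall rule, similar in spirit to the code in Figure 1. All names, parameters, and API versions below are illustrative assumptions, not the exact code from the figure:

```bicep
// Illustrative sketch only: names, API versions and parameters are assumptions.
param location string = resourceGroup().location
param workspaceName string
param dataLakeUrl string        // e.g. the DFS endpoint of your storage account
param dataLakeFilesystem string // the default container for the workspace

resource synapse 'Microsoft.Synapse/workspaces@2021-06-01' = {
  name: workspaceName
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    defaultDataLakeStorage: {
      accountUrl: dataLakeUrl
      filesystem: dataLakeFilesystem
    }
  }
}

// The 'allow everything' rule discussed above, shown here only to
// illustrate the syntax. Restrict this range in any real deployment.
resource allowAll 'Microsoft.Synapse/workspaces/firewallRules@2021-06-01' = {
  parent: synapse
  name: 'AllowAll'
  properties: {
    startIpAddress: '0.0.0.0'
    endIpAddress: '255.255.255.255'
  }
}
```

Because this is plain code, the whole setup can be redeployed to a new subscription or environment with a single deployment command.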
Leverage DevOps for version control
Effective version control is the backbone of collaboration and agility. Integrating your Synapse setup code with DevOps repositories such as Azure DevOps or GitHub is a must. For example, the Bicep setup code from above can be cloned to DevOps, so the infrastructure code resides in a repository there. These repositories support version control, which tracks all changes. In a team, this is essential for seeing who made which change, and why. In addition, branching strategies can be enforced so that changes are reviewed before they are merged into a common (master) branch. Once reviewed and merged, the master code can be deployed to create components such as a Synapse workspace and a Data Lake Storage Gen2 account.
Implement CI/CD for seamless deployment
DevOps is also needed to implement a good continuous integration and continuous delivery setup. Everything you create in a Synapse workspace, credentials, datasets, linked services, notebooks, pipelines, scripts, and so on, can be exported as JSON. In the picture below, you can see a repo named ‘Synapse’ in the main branch, containing a folder for each type of component in Synapse. When your Synapse environment is linked to DevOps, you can start tracking changes between environments and create a pipeline to automatically deploy from development to test and then to production. Setting this up takes some time, but the efficiency you gain in return is invaluable. Definitely a must-have!
Figure 2: Azure DevOps example of a synced Synapse repository
Start using Parquet and Delta Lake format over CSV
The most common data storage format is CSV; however, other, sometimes more suitable, options are available. One of the alternatives is Parquet, which is depicted in the middle of Figure 3. Parquet is an open-source columnar storage file format: it stores data column by column, as opposed to CSV, which stores it row by row. The advantage of columnar storage is that the data for each column is stored together, making it easy to skip over irrelevant columns while querying. In terms of storage, Parquet also supports various compression codecs such as Snappy, GZIP and LZO, which can significantly reduce storage space. Parquet files can be queried directly by Synapse, which makes the format very easy to use. Lastly, Delta Lake is an extension of Parquet. By maintaining an additional delta log alongside the Parquet files, it adds functionality such as ACID transactions, UPDATE/INSERT/MERGE/DELETE commands, and time travel, making data warehousing possible directly on cloud storage.
Figure 3: Storage formats
Avoid expensive data mapping flows
While Azure Synapse empowers complex data transformations, it’s crucial to avoid overcomplicating your data mapping flows. Complex transformations can lead to high compute costs and slower processing times. In fact, I would advise against employing data flows altogether. In my opinion, a tangled web of blocks for filtering, joining and selecting data is way less clear than a well-written SQL script with comments.
Choose wisely between serverless and dedicated pools
For handling large-scale data processing and complex queries, dedicated SQL pools in Azure Synapse provide the required performance and scalability. By allocating dedicated resources, you can fine-tune query execution, optimize data distribution, and achieve faster response times. However, for most small and medium-sized companies that don’t require advanced analytics, a serverless pool suffices. A serverless SQL pool is designed for ad-hoc workloads, automatically allocating resources on demand based on query requirements. This results in substantial cost savings, because you pay only for the queries you run. In summary, analyze your needs first before spinning up an expensive dedicated pool you may not need.
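To make the serverless model concrete, here is a hedged sketch of how such a pool is typically addressed: the workspace exposes a separate ‘-ondemand’ endpoint, and ad-hoc queries read files in the data lake directly via `OPENROWSET`. The workspace name, storage account, and file path below are placeholders; in practice you would send the query through a SQL client such as pyodbc.

```python
# Sketch: composing a serverless SQL pool endpoint and an ad-hoc query.
# Workspace, storage account and path are hypothetical placeholders.
workspace = "myworkspace"  # placeholder Synapse workspace name

# Serverless pools are reached through a dedicated '-ondemand' endpoint.
serverless_endpoint = f"{workspace}-ondemand.sql.azuresynapse.net"

# Ad-hoc query: read a Parquet file straight from the data lake,
# paying only for the data this single query scans.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

print(serverless_endpoint)  # myworkspace-ondemand.sql.azuresynapse.net
```

Note that no table has to be created up front: the serverless pool infers the schema from the Parquet file at query time.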
Connect Power BI to Synapse via virtual network gateways
If you want to connect Power BI to a serverless database in Synapse, a few things have to be considered. If Synapse has no firewall in place, the connection from Power BI to Synapse works immediately. However, companies should have security policies in place that block unknown inbound internet connections with a firewall. So when Synapse blocks these inbound connections, how can Power BI retrieve data for reporting? The recommended way to connect is by using a virtual network gateway attached to the VNet in which your Synapse workspace resides. This virtual network gateway can be configured in the ‘manage connections and gateways’ tab in Power BI, which is depicted in the image below. Please take into account that this feature requires Power BI Premium.
Figure 4: Create a virtual network gateway in the Power BI service
In conclusion, Azure Synapse Analytics represents a significant advancement for organizations looking to extract valuable insights from their data. To ensure your setup remains future-proof, efficient and consistent, consider these tips and tricks. Azure Synapse continuously evolves with new features and enhancements, so I highly recommend staying updated by exploring the official documentation, attending webinars, and engaging in relevant communities.
While Fabric has emerged as a newcomer in the field, Synapse is currently the more mature and reliable choice. However, it’s wise to start considering migration paths to Fabric for when it eventually exits preview and becomes a viable product in the market.
Joachim Depovere, who has been active for two years as a data consultant at Lytix, is always eager to dive into new data tools and topics to bring customers’ data platforms to the next level.