CI/CD of Synapse Serverless – Part 1/2

Introduction

As noted in several other blog posts, Azure Synapse provides a wide variety of functionalities to support organizations in building a Data Platform. In this blog post, we’ll focus specifically on ‘Synapse Serverless’, which virtualizes data residing in your data lake and makes it SQL-queryable for business users and data scientists. Because Synapse Serverless offers clear added value, the majority of the customers for whom we provide some sort of Data Platform (such as Data Lakes, Data Warehouses, Data Lakehouses or Data Science environments) use it! As Synapse Serverless is implemented that often, we will explain how we (as Lytix) work around one of its current drawbacks: the lack of CI/CD support for Synapse Serverless.

Note: The lack of CI/CD support only applies to the ‘Synapse Serverless’ functionality, not to Synapse Pools or Synapse Data Factory.

Explaining CI/CD

CI/CD (Continuous Integration/Continuous Deployment) refers to the practice of automating deployments. As the days of manual deployments are long gone, CI/CD helps us separate the development environment from the others: test, acceptance and production. Modern Data Platforms provide the business with the processes and tools to push their requested features to these higher environments. For example, if a business user has tested and approved new functionality on the acceptance environment, then he/she should be able to approve the deployment to production; a deployment that should then run automatically. Hence, the times when you needed a developer to manually migrate code from development to production are long gone.

Explaining Synapse Serverless

While the Synapse product as a whole supports the use of git and CI/CD, this support is mainly limited to scripts, data flows and configurations. A ‘Synapse Serverless’ environment is a SQL-endpoint environment that consists of external tables and views. External tables describe the format of the underlying file(s): data types, column names and the location in the data lake. Views can then be built on top of these external tables to present a well-formatted layer of data objects that the business can use for reporting. These views can contain logic that joins two or more external tables together where that improves user-friendliness. Thus, what we want to achieve is a layer of views, presented to the business, that can be used in their reporting tools.
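As a minimal sketch of such a layer (all names, paths and the schema below are illustrative assumptions, not part of an actual environment), the T-SQL could look like this in a Synapse Serverless database:

```sql
-- Hypothetical data source and file format; the storage account,
-- container and schema are assumptions for illustration only.
CREATE EXTERNAL DATA SOURCE DataLake
WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/raw');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- The external table describes the file(s): column names, data types
-- and the location in the data lake.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION    = 'sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

-- The view presents a well-formatted object to the business.
CREATE VIEW dbo.vw_Sales
AS
SELECT SaleId, CustomerId, Amount
FROM ext.Sales;
```

Reporting tools then connect only to the view layer (`dbo.vw_Sales`), never to the external tables directly.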

What is the problem?

The problem lies in Synapse Serverless, for which there is no default way of automatically deploying the objects (external tables and views) to higher environments. A Visual Studio Database Solution typically supports T-SQL-ish databases such as (Azure) SQL Databases and Azure Synapse Pools. Since Synapse Serverless exposes a SQL endpoint, one would expect a similar experience. However, Synapse Serverless is fundamentally different from those tools, as it simply provides a virtualization layer on top of your data lake; this makes it incompatible with the beloved Visual Studio Database Solution. If you try deploying through a .dacpac, the following error is returned:

“An unexpected exception was encountered while modifying the Server Explorer: FUNCTION ‘SESSIONPROPERTY’ is not supported.”

As deployment through .dacpac files isn’t an option, we are obliged to deploy using a SQL (or SQLCMD) script. As Synapse Serverless only contains a limited set of object types (such as views, external tables, credentials and roles), creating such a script does not pose problems (a useful script that extracts an already existing Synapse Serverless database into a Visual Studio Database Solution will be revealed in part II).

Core Concept 1: Three Best-Practices

1. Use of External Tables by views

As a general first best-practice within Lytix, views are typically not created directly on top of files, but on a layer of external tables, one for each functional file in the data lake. By ‘functional’ we mean that if data is split up into partitions (e.g. per month), we still create only one external table that iterates over the whole folder. The added value is that there is only one central place to define the data types of the columns within one functional file.

2. Prepare for changes

Additionally, the data lake and Synapse Serverless are designed in such a way that data loads can easily switch from full loads to incremental loads and vice versa (by using smart wildcards in the file locations of the external tables).
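The two best-practices above can be sketched with a wildcard in the external table’s location (folder layout and names are assumptions for illustration): one external table covers the whole partitioned folder, and switching between full and incremental loads only changes the path pattern, not the table definition or the views.

```sql
-- One external table over all monthly partitions, e.g.
--   sales/2023/01/*.parquet, sales/2023/02/*.parquet, ...
-- The wildcard makes Serverless read every matching file, so the
-- column definitions live in exactly one place.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION    = 'sales/*/*/*.parquet',  -- year/month partitions
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

-- Switching to a full load would only change the LOCATION, e.g.
--   LOCATION = 'sales/full/*.parquet'
-- while the views on top of ext.Sales remain untouched.
```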

3. Recreation of External Tables

A third best-practice is the recreation of external tables after each file run. For several sources, source metadata (such as column types and names) can be harvested from informational tables. These informational tables are used to regenerate our external tables on each run. If such a table isn’t at our disposal (as with CSVs), data engineers should maintain a table that defines the metadata of the files. This technique alerts us to changing structures in the source system (e.g., integers that aren’t integers anymore) and reduces potential errors.
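The recreation step itself is a simple drop-and-create; in practice the column list would be generated from the harvested (or engineer-maintained) metadata. A hedged sketch, with all object names assumed:

```sql
-- Drop the external table if it exists; this only removes the
-- metadata object, never the underlying files in the data lake.
IF OBJECT_ID('ext.Sales') IS NOT NULL
    DROP EXTERNAL TABLE ext.Sales;

-- Re-create it with the column list generated for this run.
-- If the source structure changed (e.g. an integer column that is
-- no longer an integer), the generated definition changes with it.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (
    LOCATION    = 'sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);
```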

Core Concept 2: CETAS as a workaround for the “Non-Existing Folder/File” problem.

As you can tell from the core concepts, we shouldn’t actually need to deploy external tables, as they are recreated anyway after each deploy. And as a CI/CD best-practice, a data run should automatically be triggered after a deploy. Yet while it is tempting not to deploy external tables, we are in fact obliged to create them, as they are a prerequisite for deploying our views.

A main problem you’ll unfortunately run into is the following: you want to deploy an external table that uses a file/folder that doesn’t exist yet on production. As this file/folder doesn’t exist, trying to create the external table results in an error. You would probably think of creating dummy files through PowerShell or advanced scripting? Well, luckily you can also do this with the creation of an external table itself, using the CETAS (Create External Table As Select) statement. A CETAS statement persists the result of a query to the data lake as files. This functionality can be reused to create dummy files, enabling the creation of our external tables and views. Thumbs-up if you had a eureka moment reading this clever workaround.
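A sketch of the workaround (object names, path and schema are assumptions): CETAS writes a dummy file into the folder the real external table will point at, so the subsequent CREATE EXTERNAL TABLE no longer fails on an empty lake.

```sql
-- 1. CETAS writes a dummy file into the (not yet existing) folder.
--    Note: CETAS requires the target folder to be empty or absent,
--    which is exactly our situation on a fresh production environment.
CREATE EXTERNAL TABLE ext.Sales_Dummy
WITH (
    LOCATION    = 'sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT CAST(0   AS INT)            AS SaleId,
       CAST(0   AS INT)            AS CustomerId,
       CAST(0.0 AS DECIMAL(18, 2)) AS Amount;

-- 2. Drop the CETAS table; the file it wrote remains in the lake.
DROP EXTERNAL TABLE ext.Sales_Dummy;

-- 3. The real external table (and the views on top of it) can now be
--    created without hitting the "non-existing folder/file" error.
```

Once the first real data run completes, the recreation best-practice from Core Concept 1 replaces the dummy-backed table anyway.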

Continue to Part II: The technical solution.

While we have discussed our main approach here, the second part of this blog will explain how we technically create such a deploy script.