Standardize your data models with Azure Synapse Lake Database Templates
A lot of companies struggle with the same issue: departments are operating in siloes. These siloes have all their own terms, concepts, definitions and data. When management wants to get a clear overview of all the available data and introduce company-wide reporting, they hit a roadblock. The data is not easily combined into a data warehouse or reports. The main reason: the business does not speak the same language over the departments. They do not know exactly what other departments do. A common business understanding is needed to develop a futureproof enterprise data warehouse.
Microsoft launched a solution for this: industry-standard database models. These models allow you to get an integrated and comprehensive view of your company data. This enables you to build data warehouses, reports and optimize business processes faster than ever before. These database models come with Azure Synapse Analytics at no extra cost. They allow you to structure your data lake and move faster than developing models yourself from scratch.
In June 2020, Microsoft acquired ADRM software. This company is a leading provider of detailed industry models. These models are used by large companies to develop their information systems. Developing data models is very time-intensive. These data models were refined for many years and validated by industry experts. You do not have to go through all the modelling iterations yourself. Microsoft had already their common data model which is mainly used on the Power-platform. However, the acquisition of ADRM software strengthened their position in this area and is a welcome addition to their data products.
At Microsoft Ignite in November 2021, Microsoft announced they integrated some of these industry-standard models into Azure Synapse Analytics. These features are now available in public preview.
First, you need to create a Lake database in Synapse. This can be done in the “Data” tab. You need to connect this lake database to your Azure Data Lake. Afterwards, you have the possibility to create virtual tables on top of the files sitting in your data lake in the “Database designer”. You can either create the tables yourself, use the industry-standard templates or create tables from existing files on your data lake. Currently, in this preview, templates for these industries are available:
We will illustrate the functionalities by using the AdventureWorksLT database sample from Microsoft. We store all the AdventureWorks data in CSV files on our Azure Data Lake. For our example, we choose the retail template from the industry-standard templates.
We search for “Customer” in the model gallery. This results in an elaborate definition of a customer, including all the needed columns to characterize a customer.
Relationships to other tables are also defined as we can see in the presence of foreign keys in the column list. You can add related tables with the click of a button. However, be aware that the data models are very extensive. Clicking on “Add related tables” for the Customer table resulted in 80 new tables added to the model! In this case, we will manually add the necessary tables.
We add Order, OrderLine, RetailProduct, ProductType, ProductCategory, CustomerLocation, Location, LocationType. This results in the following data model.
Now we need to map our data to these data models. We click on “Map data (preview)”. This starts the mapping dataflow cluster. We chose a file source which is in our case our Azure Data Lake which contains the CSV files. The mapping dataflow provides you with a graphical interface to map the source data to the data model.
After we did all the mappings, we can click on “Create pipeline”. This generates the corresponding dataflows and the pipelines for orchestrating the dataflows.
What does happen when you run these pipelines? First, the data flow executes the transformations you defined in the mappings. Then, it saves these transformations in the chosen format, this can be either parquet or CSV. Additionally, you have different options to choose from to optimize performance. Finally, when these files are created in the Data Lake, the dataflow creates the external SQL tables in your Synapse Analytics workspace.
Another very nice advantage of having these standard data models is the fact that you can use standard machine learning models on your data. Azure Synapse Analytics provides a gallery of machine learning models which you can directly use on your data. You either have the option to develop models on your own data yourself by using AutoML or you have the possibility to use standard machine learning models. For developing your own models, you have the choice to train a classification or regression model or you can try to forecast a time series. This training happens on your Apache spark pools in your Synapse Workspace. At this particular moment, there is only one solution available for the standard ML models. It is the “Retail – Product recommendations” AI solution. The solution provides a notebook that contains example pyspark code. It allows you to predict products that can be bought by a specific customer. This is all nicely integrated into your Synapse workspace.
These new database templates fit exactly in the way Synapse is going, mainly the decoupling of storage and compute while also integrating analytics directly into the same workspace. It combines the power of Data Flows which are built on top of Apache Spark clusters and SQL on-demand which provide the virtual layer on top of the transformed CSV or parquet files sitting on the Data Lake. For now, the options to transform your data to fit these standard data models is limited. We will hope to see better integration with Spark pools with these templates. Furthermore, an exciting extension of the functionality would be an integration with Azure Purview. Would this be a dream coming true? We are excited about what Microsoft will bring when database templates will go into global availability. Stay tuned for updates on this!
Niels De Swaef