Virtualizing your Azure Data Lake with SAP HANA
Organizations are looking for best-of-breed solutions to tackle data challenges. While lots of organizations are using SAP HANA for reporting and analyzing SAP ECC or SAP S/4HANA data, ‘data lakes’ are often used in parallel for other use cases. Data Lakes are ideal for storing large volumes of information in a cost-effective way. In an ideal world, the information stored in the Data Lake is also accessible form within SAP HANA without physically transferring data. This is exactly what will be demonstrated in this blog post in an Azure setup: How to virtualize your Azure Data Lake to SAP HANA.
It is advised to recap on following topics before continue reading:
1. A Data Lake as the Storage Container for Big Data
The image below depicts a setup in Azure and SAP HANA. The data lake (Azure Data Lake Gen2) can be used for several use cases:
- Storing the cold path of streaming components fed by Azure IoT HUB. This is typically high velocity, big volume information.
- Storing unstructured complex files, such as video and image captures, audio recordings, …. These unstructured files can be analyzed by Machine Learning algorithms and turned into features/attributes (= structured information).
- Storing semi-structured files originating from logs, web scraping, operational processes, file transfers, …
- Storing results of Machine Learning and AI algorithms.
The Lambda Architecture
Two types of insights can be gathered from streaming information:
· Hot Path: The near-real time measurements of the stream allowing for direct decisions and actions. Some aggregations and transformations can be done on the fly to improve the correctness and accuracy of real time observations; e.g. show me the average temperature of the last 30 seconds (instead of the temperature right now).
· Cold Path: The data gets historically stored conform a specific snapshot (e.g. bucketed in 1 minute) on a Data Lake. This historical information is tracked incrementally and allows for long-term analytical insights.
As these two different views on our streaming source depict the Lambda sign λ, this is called the Lambda Architecture.
2. Virtualizing our Data Lake for Querying using Synapse Serverless SQL Pools
As insights need to be gathered from the data residing in our Data Lake, we want to make it available for analysis. To do this, data is transformed into a structured format in a ‘Gold’ layer. The market standards of these formats are Parquet and Delta, the latter if your information requires CRUD operations. Synapse Serverless SQL Pools (short: Synapse Serverless) are than connected to this data lake, allowing business users to write SQL queries on top of your data: filtering, transforming, grouping and pivoting data all becomes possible. At the time that a query is launched, Synapse Serverless analyzes your query and reads and transforms only that select information from your data lake. Synapse Serverless doesn’t phyiscally keep data in cpu or memory and is therefore a true ‘pay what you query’ service. The price of Synapse Serverless is only $5 per Terabyte processed making it a truly cost-efficient calculation engine. Synapse Serverless exposes a ‘SQL Endpoint’ which means that it acts like a SQL Database without actually being a SQL database. The data of your data lake is virtualized in Views and External Tables.
3. SDA connection of SAP HANA
As we now have virtualized our Data Lake and making it act like a SQL Database, we are only a few steps away from our final solution. SAP HANA can virtualize SQL Databases by using Smart Data Access (otherwise known as SDA). Using SDA, we can create a remote source to a MS SQL Database using the MS ODBC Connector by filling in the SQL Endpoint and Database Name of Synapse Serverless in our SDA connection (similar like configuring an SDA Connection to a standard MS SQL Database). After the connection has been made, we simply create remote tables in a HANA schema of your choosing and our data lake is successfully virtualized in SAP HANA. Any query executed on the remote table, will directly query the Data Lake without physically importing the data into SAP HANA. Physical replication into SAP HANA remains possible should the need occur, either by flowgraphs (directly in HANA) or loading the data into BW/4HANA aDSO’s.
Remarks / Pitfalls
Do note that, in the proposed setup, data stored in our data lake is ‘at rest’ on disks. As a result, querying a data lake will never provide you with a snappy experience as data needs to be fetched, loaded for analysis and providing the results (typical latency of minimum 1 second). Nevertheless, the provided solutions still has lots of benefits in enabling your Data Lake in SAP HANA for analytical reporting.
Honesty commands us to tell you that SAP Note 2600176 mentions that MS SQL sources are only officially supported on-premise, not if they are residing on Azure. Given the support base, we sincerely hope that SAP makes Azure connections to SDA officially in the near future.
Get in touch!
Do you want some help in enabling your Data Lake within SAP HANA? Send a mail to firstname.lastname@example.org for a chat on sharing best-practices.
Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all of its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. Do you need help on architectural decisions, do not hesitate to invite Sander over for a coffee to share some ideas.