Managed Big Data: DataBricks, Spark as a Service

The title of this blog post is quite a mouthful. This post explains why you should be using Spark and, if such a use case makes sense for you, introduces the DataBricks product, which is available on Azure. Being recognised as a Leader in the Magic Quadrant underlines the operational strength and vision of DataBricks.

If you would like to know more about DataBricks, Lytix and Cubis are providing a free expert session on Wednesday the 4th of March 2020 (in ‘De Kantien’ in Ghent, starting at 18:00) where we will walk you through demos to get started with DataBricks on Azure. It doesn’t matter whether you’re a business analyst, a C-level executive, a manager or a student; feel free to register for this intro course using the following link: http://bit.ly/DataBrytix

What is Spark and what ‘need’ did it fulfil? 

If you’ve been around the ‘Data and Analytics’ domain for several years, you’ll probably remember the hype around ‘Big Data’. The most common scenario in which companies switch to Big Data is when they do not have the analytical power to perform a workload on one (and only one) server. Such a workload could be ‘categorising your documents using Natural Language Processing’, ‘creating segmentations of all website visitors’, ‘training a very accurate prediction model’, … In the most common sense of the word, ‘Big Data’ processing is done in a distributed way, i.e. by coordinating several machines at once.
Spark is an open-source analytical engine which allows technical users to set up such a distributed system, thus allowing companies to tackle their Big Data projects. Out of the box, Spark also gives you the ability to capture streaming events, provides a set of machine learning algorithms and allows for working with graph data.
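
To make this concrete, here is a minimal PySpark sketch of a distributed aggregation; the file name and column names are purely hypothetical and assume a working Spark installation:

```python
# Minimal PySpark sketch: a distributed aggregation over a (hypothetical) CSV file.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("visitor-segmentation-demo").getOrCreate()

# 'visits.csv' and its column names are placeholders for your own data.
visits = spark.read.csv("visits.csv", header=True, inferSchema=True)

# The aggregation is split over all workers in the cluster and combined at the end.
per_country = (
    visits.groupBy("country")
          .agg(F.count("*").alias("page_views"),
               F.countDistinct("visitor_id").alias("unique_visitors"))
)

per_country.show()
```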

Now that we know Spark, what is DataBricks?

While Spark is great at what it does, it is hard to maintain and configure, hard to spin up and spin down, and hard to add servers to or remove them from your cluster. DataBricks addresses this problem by providing ‘Spark as a Service’ while also adding enterprise-required features. As a large part of the DataBricks product team also created the core of Spark, they have made API and performance improvements to the analytical engine they provide you with. As such, we believe that DataBricks is the most enterprise-ready Big Data and Data Science platform.

Get started in minutes instead of days: 

Setting up a DataBricks cluster with several nodes is easy. The ‘New Cluster’ wizard lets you pick your cluster size and the kind of compute it requires: RAM? GPU? Delta Optimized?

The ‘Autoscaling’ feature allows DataBricks to add or remove workers in your cluster based on your workload. If the queries sent to your cluster require large amounts of processing power, additional workers are added to your cluster so you get your results faster!
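
As a sketch of how little configuration this takes, the snippet below creates an autoscaling cluster through the Databricks Clusters REST API; the workspace URL, token, runtime version and VM size are placeholders you would replace with values from your own workspace:

```python
# Sketch: creating an autoscaling cluster through the Databricks Clusters REST API.
# All values below are placeholders; pick a runtime and node type available in your workspace.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "demo-cluster",
    "spark_version": "6.4.x-scala2.11",                  # Databricks runtime version
    "node_type_id": "Standard_DS3_v2",                   # Azure VM size for the workers
    "autoscale": {"min_workers": 2, "max_workers": 8},   # autoscaling boundaries
}

resp = requests.post(f"{workspace_url}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=cluster_spec)
print(resp.json())  # returns the cluster_id on success
```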

Delta: ACID Transactions on Data Lakes

An absolute game changer that DataBricks has brought to the market is Delta (Lake). Data lakes are (often) a large collection of files, files which are ‘immutable’ and thus cannot be changed. Delta, on the other hand, enforces ACID properties on your data lake. Adding ACID properties (Atomicity, Consistency, Isolation and Durability) to a data lake enables additional analytical scenarios that involve inserts, updates and deletes.
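
A minimal sketch of what this enables, assuming a DataBricks cluster where Delta is available out of the box (the paths, table and column names are hypothetical):

```python
# Sketch: ACID-style updates and deletes on a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a DataFrame as Delta; unlike plain immutable files, the result can be modified afterwards.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
orders.write.format("delta").mode("overwrite").save("/mnt/datalake/orders_delta")

# Register the Delta files as a table so SQL can be used against it.
spark.sql("CREATE TABLE IF NOT EXISTS orders USING DELTA LOCATION '/mnt/datalake/orders_delta'")

# Updates and deletes run as ACID transactions on the data lake.
spark.sql("UPDATE orders SET status = 'shipped' WHERE order_id = 42")
spark.sql("DELETE FROM orders WHERE status = 'cancelled'")
```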

The DataBricks Notebook Experience

If you haven’t worked with analytical notebooks (Jupyter, Azure Data Studio or DataBricks), you’re missing out! Being able to write documentation and code within the same document is a big step forward. It helps you make clear to readers why you created the queries and guides them in their first steps in analytics. As opposed to other notebooks, DataBricks can connect to version control (Git, TFS, …) and allows you to combine R, Python, SQL and Scala in the same notebook.
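
As an illustration of that language mixing, the sketch below shows two cells of a Python-default DataBricks notebook; the paths and names are placeholders, and the second cell switches language with the %sql magic command:

```python
# Sketch of two cells in a Python-default DataBricks notebook (paths and names are placeholders).

# --- Cell 1 (Python): register a temporary view on top of a Delta table ---
orders = spark.read.format("delta").load("/mnt/datalake/orders_delta")
orders.createOrReplaceTempView("orders_vw")

# --- Cell 2: the %sql magic switches the cell language; %scala, %r and %md work the same way ---
# %sql
# SELECT status, COUNT(*) AS nr_orders
# FROM   orders_vw
# GROUP BY status
```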

Managing your Data Engineers and Data Scientists

DataBricks allows you to apply security on folders and notebooks using Azure Active Directory; notebooks that contain sensitive data can only be seen by a specific security group. If you combine this with Azure Data Lake Storage Gen2, which allows applying security on data folders, you have an enterprise-ready data science environment.

Why shouldn’t I use DataBricks?

Delta Lake does not support operational (OLTP) workloads, as its concurrency on a single file is limited. We at Lytix and Cubis therefore mainly use it for Data Warehousing scenarios, which require little concurrency.
Because your data is divided over several partitions across many servers and the results of queries and calculations have to pass through your head (driver) node, do not expect distributed computing to feel snappy! The technology is meant for analytical workloads on large amounts of data and will backfire if you use it any other way.

Our XTL Framework

In addition to all the features DataBricks provides you with, Lytix/Cubis has a framework which simplifies development and implements best practices. This framework uses Azure Data Factory as orchestrator and, as a result, provides you with a data layer that is easily accessible by any reporting tool (Power BI, SAP Analytics Cloud, …). As our logging components are automatically wrapped around business transformations, the time spent on data engineering is significantly reduced. Using ‘Delta’, we even provide a ‘Data Warehouse on Spark/DataBricks’ solution.


If you want help/guidance on your DataBricks or Azure Platform implementation, do not hesitate to reach out.

 

Lytix’ XTL Framework

  You’ll often hear a Lytix consultant say that he/she is proud of our XTL Framework. What is it exactly? And how can it improve your business?  

Why do I need a Data Warehouse (DWH)?

Data Warehouses support organisations in their decision making by providing data in a well-governed way. This includes:

– Integration of source systems: organisations prefer a mix of solutions to achieve a competitive advantage. Often, large ERP systems (Dynamics AX for Finance and Operations, SAP, Odoo, …) are combined with niche products (SalesForce, Google Analytics, Silverfin, Exact Online, …) to support their business processes. In an analytical setting, this data also needs to be enriched with open APIs (weather APIs, ID verification, …).

– Single source of truth: valuable time is wasted if numbers cannot be trusted. A DWH provides you with one validated catalogue of data.

– Easy managed access: data is provided in a way that lets business users crunch large quantities of data, while still applying security; users should be able to see only what they’re entitled to see (no more, no less).

– A foundation for Data Science / AI: provide a well-structured data repository for your data scientists and minimise the valuable time they spend cleaning up and combining sources.

Why do I need Lytix’ XTL Framework?

A lack of internal engineering experience can result in a solution that is unsustainable in the long term, and it gets even worse when the only engineer who understands it leaves the organisation. Our framework provides unified dimensional models and minimises ETL engineering.

 

Advantages of the XTL Framework:

– Unified way of working: in each of our supported technologies (Azure SQL DB, SnowFlake, DataBricks and Azure Synapse), the main task of engineers is to convert business logic to SQL, a query language that is widely known and which can be ported to other technologies.

– Maximised logging (with minimal effort during development): logging modules are automatically wrapped around ETL statements, which takes a time-intensive burden off developers. Within these modules, all events are captured: the number of reads, inserts, updates and deletes, as well as warnings and errors.

– Built for the cloud: the framework is built in such a way that it takes advantage of cloud benefits: pay-as-you-use pricing and easy scaling up and down.

– Applied Kimball: the modelling techniques are based on the best practices defined by the renowned data modeller Ralph Kimball (The Data Warehouse Toolkit, Ralph Kimball & Margy Ross). The most popular Slowly Changing Dimension (SCD) types are supported, such as ‘SCD0: Retain the original’, ‘SCD1: Overwrite with newer values’, ‘SCD2: Track history by adding new rows’, ‘SCD3: Add new attribute’, ‘SCD6: Add type 1 attributes to a type 2 dimension’ and ‘SCD7: Dual type 1 and type 2 dimensions’. Switching an attribute from SCD type 1 to SCD type 2 (and thus tracking history) is just a matter of a click; the sketch after this list shows what such type 2 handling boils down to.
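
As a hedged illustration of SCD type 2 on Delta (not the actual XTL implementation; the table and column names are made up), a load could first close the changed rows and then append the new versions:

```python
# Hedged SCD type 2 sketch on a Delta dimension table; all names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

# Today's snapshot of the source, staged as a Delta table.
staged = spark.read.format("delta").load("/mnt/stage/customers")
staged.createOrReplaceTempView("staged_customers")

# Step 1: close the currently active dimension rows whose tracked attributes changed.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING staged_customers AS s
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHEN MATCHED AND d.address <> s.address THEN
      UPDATE SET is_current = false, valid_to = current_date()
""")

# Step 2: append the new versions of changed (or brand-new) customers.
current = spark.table("dim_customer").filter("is_current = true").select("customer_id", "address")
new_rows = (
    staged.join(current, ["customer_id", "address"], "left_anti")
          .withColumn("valid_from", F.current_date())
          .withColumn("valid_to", F.lit(None).cast("date"))
          .withColumn("is_current", F.lit(True))
)
new_rows.write.format("delta").mode("append").saveAsTable("dim_customer")
```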

Additional Features

Automated Documentation

The XTL framework is accompanied by a comprehensive report providing you with insights into the ETL processes. This report provides both standard metric evolutions (number of updates, deletes, failures, …) and a dependency view which maps your sources to their targets and shows the applied transformations.

Dependency Tree

Our framework automatically generates dependency trees based on the metadata of your transformations. Generating the most optimal parallel execution steps within your whole ETL becomes child’s play. Do you need to refresh only one, two or three tables because you need updated data fast (e.g. for month closing)? No problem: the dependency tree will only trigger those ETL steps which are directly related to the tables you want updated, as the sketch below illustrates.
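
A minimal sketch of the idea, assuming a simple metadata model in which each table lists the tables it depends on (the names and structure are illustrative, not the actual XTL metadata):

```python
# Hedged sketch: derive parallel execution "waves" from table dependencies.

# Each target table lists the tables it depends on (illustrative metadata).
dependencies = {
    "dim_customer":  [],
    "dim_product":   [],
    "fact_sales":    ["dim_customer", "dim_product"],
    "agg_sales_day": ["fact_sales"],
}

def execution_waves(deps):
    """Group tables into waves; every table in a wave can be loaded in parallel."""
    remaining = dict(deps)
    waves = []
    while remaining:
        ready = [t for t, d in remaining.items() if all(p not in remaining for p in d)]
        if not ready:
            raise ValueError("Circular dependency detected")
        waves.append(ready)
        for t in ready:
            del remaining[t]
    return waves

print(execution_waves(dependencies))
# [['dim_customer', 'dim_product'], ['fact_sales'], ['agg_sales_day']]
```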

No ‘All Or Nothing Approach’ in Batches

We often hear the following complaint from customers not using our framework: “I know one little package failed, but does that mean that the whole dataflow to the business should fail?!”

Our framework addresses this issue! The dependency tree allows you to decide what to do upon failure: do you simply ignore the failure and continue with your ETL, or do you stop abruptly to make 100% sure the error doesn’t cause any further problems? This is fully configurable in our framework, and it allows you to make that decision per ETL step, as sketched below.
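
A hedged sketch of what such a per-step policy could look like; the step names and policy values are illustrative and do not represent the actual XTL configuration format:

```python
# Illustrative per-step failure policy (not the actual XTL configuration format).
failure_policy = {
    "load_dim_customer": "continue",     # log the failure, keep running downstream steps
    "load_fact_sales":   "stop_branch",  # skip only the steps that depend on this one
    "publish_to_bi":     "abort_all",    # stop the whole batch immediately
}
```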

Data Quality Reporting

Our framework includes a ‘Data Quality’ module. By inputting business rules, the framework looks for inconsistencies and faulty input and reports the findings back to the business.
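
As a hedged sketch of how such business rules could be evaluated on Spark (the rules, table and column names are made up and do not represent the XTL module itself):

```python
# Hedged sketch: count violations of named business rules on a Spark table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
orders = spark.table("orders")

# Business rules expressed as named boolean expressions.
rules = {
    "amount_not_negative": F.col("amount") >= 0,
    "customer_id_present": F.col("customer_id").isNotNull(),
}

# Count the number of violations per rule so they can be reported back.
findings = orders.select(
    [F.sum(F.when(~expr, 1).otherwise(0)).alias(name) for name, expr in rules.items()]
)
findings.show()
```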

 

Components

The following components are used throughout the framework:

– Azure Data Factory is used as orchestrator and scheduler. This component kicks off at the requested time (at midnight, once per hour, …) and makes sure all dependencies are initiated in a logical order.


– A ‘Storage and Compute’ module to perform the necessary transformations and to provide the data to data consumers, be it business users, data scientists or even mobile devices and web portals. One of four technologies can be implemented as the back-end for this component:
    o Azure SQL Database: a Microsoft product for customers who are already familiar with SQL Server and whose Data Warehouse needs lots of integrations.
    o SnowFlake: provides both a data lake and strong compute for your analytical needs.
    o DataBricks: if streaming and distributed computing are necessary to build an all-encompassing Data Warehousing solution. This platform integrates best with organisations that have a large data science team!
    o Azure Synapse (formerly Azure SQL Data Warehouse): provides virtually limitless compute for customers with more than 4 TB of data.


– An analytics engine which allows business users to create their own reports, be it Azure Analysis Services or Power BI Pro/Premium.

Wrap-Up

This article describes one of our proven ways to fuel your data-driven organisation, covering a multitude of use cases ranging from traditional reporting to data science, all in a governed way.

In need of some help? Do not hesitate to contact Lytix or the author, Sander Allert; we will be glad to exchange ideas on this subject!