Kimball in a data lake? Come again?
Most companies are already familiar with data modelling (be it Kimball or any other modelling technique) and data warehousing with a classical ETL (Extract-Transform-Load) flow. In the age of big data, an increasing number of companies are moving towards a data lake using Spark to store massive amounts of data. However, we often see that modelling in a data lake is not done at all, or not done properly. This most often happens because the modelling possibilities are not that well known and data engineers have less of an eye for data modelling than classical BI profiles. Now, in this blog post, I want pave the way to proper data modelling in a data lake.
First of all, for the people that need a refresher, let’s talk about Kimball modelling. Kimball modelling consists of creating star schema’s based on facts and dimensions. Let me illustrate this technique with an example. In the image below, we have modelled the business process of a sales transaction as the fact. In the dimensions, we add the descriptive attributes of the fact table. In this particular example, every sale is linked to a certain product, a certain date and a store.
Implementing a Kimball-modelled Data Platform has following advantages:
- The data platform is more user-friendly by providing a simplified data model.
- The data model supports retrieving data in the most efficient way (= no complex joins).
- Adhering to a ‘modelling standard’ makes our data lake more future proof (no need to reinvent the wheel).
Slowly changing dimensions
Now let’s say, at a certain moment in time, the address of a store changes. In current data warehouses this is easily resolved using the principle of SCD (Slowly Changing Dimensions). In SCD, the most used types are SCD1, meaning that you overwrite the old address of the store and set a new address, or SCD2, meaning that you keep both the old address and the new address in the data warehouse. We then usually indicate with a valid from and valid to date to indicate the newest record. Creating slowly changing dimensions in a data warehouse is not hard, because SQL supports operations such as update and merge.
Kimball in a data lake
Moving to a data lake and performing Kimball modelling is challenging, and this has several reasons:
- File formats such as .csv, .parquet, etc. don’t support updates and merges and are immutable, thus making it really hard to create SCD’s and perform incremental loading. Moreover, writing SCD functions in PySpark is a difficult process, which is error prone and can consume a lot of memory.
- In Spark, data is stored across several nodes. Hence, a lot of knowledge is needed about how data is stored and which optimizations (such as partitions) can be done to increase the speed of operations. If you do not take this into account, it is very well possible that a simple query can take up a long time because it is performing a lot of shuffles (reading and combining data from several nodes of the Spark cluster). Taking into account all these difficulties can be a big hurdle for classical BI profiles going to a big data setup.
Spark 3.0 and delta lake to the rescue
To solve these issues, we have two new big boys on the block: Spark 3.0 and Databricks Delta Lake.
Lets first focus on Databricks Delta Lake. Delta files are a new type of files that extend the functionality of Parquet with a set of features that make them perfectly suitable for building a data warehouse. They do this by providing a transaction log in addition to the actual data. This transaction log supports:
- ACID (atomicity, consistency, isolation and durability) transactions to guarantee data validity despite errors, power failures or other incidents. Every operation (update, delete, remove, …) executed on the data is recorded as ordered, atomic commits to the transaction log.
- Time travel: this is a feature that can be used to restore a previous version of the data, in case a pipeline has corrupted your data file. This is either done by keeping a timestamp or by keeping a version in your data.
- Schema evolution: When the schema of your data changes, this can easily be added to the delta table without having to delete the entire table.
- Updates and deletes: a very important one for the people coming from a data warehouse. Delta files support updates and deletes, not only in the Python syntax but also in classical SQL syntax, as shown below.
Here, I have listed the most important features for this blog, but feel free to check out all other features at their website.
Now, Delta files have solved the SCD problem quite nicely for us! And as it happens, Spark 3.0 solves our next problem, which is the difficulty of optimizing queries. I want to focus on the following two improvements in Spark 3.0:
- Dynamic partition pruning (say what?): let’s explain this with an example. Suppose we have a fact table and a dimension table which we are joining. When we have nicely partitioned the fact table to optimize a query and filter on the fact table, nothing will go wrong. But if we perform the query and put a filter on the dimension table, old versions of Spark will not make use of the partitioned fact table and perform a full scan, which is horrible for the performance. In the new version of Spark, this has been taken care of and the filter of the dimension table is pushed towards the fact table, hence making use of the partitioned data.
- Adaptive query execution: in the earlier Spark versions, it was the responsibility of the data engineer to reshuffle your data across nodes in order to optimize your query execution. Spark 3.0 comes shipped with an Adaptive Query Execution Framework (AQE). After every stage of the Spark pipeline, this optimizer automatically checks whether certain optimizations can be done (switching join strategies, dynamically coalescing shuffle partitions, optimizing skew joins). In order to improve the query speed. This all sounds very technical, but fortunately, this is all done for you now!
As this blog shows, it is wrong to think of a data lake as just a data dump where Kimball modelling is impossible. With these new big boys on the block, it is perfectly possible to have a good Kimball modelling strategy in your data lake, without having the need for extensive data engineering experience. Have any questions on the execution of this strategy? Please contact us!
This blog is written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to make his hands dirty by optimizing Spark (in any language), implementing data lake architectures and training algorithms.