Things to consider when creating a Data Lake
Have you ever wondered what a data lake is? What are typical use cases for such a lake, and how can you benefit from one? In this blog post, we will show you the added value of a data lake while pointing out some pitfalls and best practices.
Before diving into data lakes (ba-dum-tsss), let us start with an example that anyone can relate to. When I bought my first computer, there was only room for about 100 GB of files. This storage quickly filled up with personal documents, scans, photos of friends and family, and lots of music. Every three years I buy a new laptop on which I ‘start fresh’ and collect new photos, videos, music and files, while the old ones move to an external drive or cloud storage. Such backups let you sleep at night: you are covered if your laptop crashes, when you need to consult old files, or when you want to restore an earlier version. Moreover, you can store far more files than the drive of your laptop can hold, and with cloud storage these files are accessible anywhere and at any time. Sharing with friends and family becomes easy too: just right-click a folder or file and make it available to them, either via a URL or via a mail sign-in.
With a data lake, similar advantages apply to your organisation. You export all kinds of files to your data lake: images, videos, data from operational systems, reports, logs, flat files, … Once they are in the lake, they are easily accessible whenever you need them. While it may seem a long shot today, some of that data may deliver added value in the future. Having history already built up gives any future analysis a jumpstart and results in more accurate insights.
The Layered Data Lake approach: Bronze, Silver & Gold
A pitfall we typically see is that all data is simply dumped into one big data lake without any structure and then exposed to everyone. Exporting your data and building up history may already tick some boxes for your data lake implementation, but it lacks long-term planning. Not all data is easy to work with for data engineers: technical keys need to be translated to terms that make sense, some files need aggregation, and some files need to be merged together before they can be used.
To tackle these problems, we suggest working with a layered approach (also referred to as the Multi-Hop Architecture):
- Bronze Layer: A one-on-one copy of the data from the source into the data lake. ‘Bronze data’ is raw, untransformed data, and all your sources land in this layer.
- Silver Layer: Once a business case has been identified and requires analysis, the raw Bronze data is transformed into datasets that add value. This can mean replacing codes with meaningful values, adding sanity constraints, filtering out unneeded information, … . The result is concise, useful datasets that other use cases can build on as well.
- Gold Layer: The Gold layer provides well-constructed datasets ready for analysis by data scientists and business analysts. The data is presented in the way that appeals to them most, which may include aggregations, joins and merges, encoding, etc.
In a typical scenario, we harvest product data from several different source systems. All these sources land in the Bronze layer as one-on-one copies. The data from these sources is then blended in the Silver layer into one single source of ‘product’ information. This ‘Product’ Silver dataset can in turn feed several datasets that are presented to users in the Gold layer (e.g. a stock keeping unit dataset, a market basket analysis, …).
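The flow above can be sketched in a few lines. This is a minimal illustration in plain Python (in practice the layers would live as files in the lake and the transformations would run in e.g. Spark); the source systems, codes and column names are purely hypothetical.

```python
# Bronze: one-on-one copies of two hypothetical source systems,
# technical codes and all.
bronze_webshop = [
    {"sku": "P1", "cat_code": "C01", "qty": 2},
    {"sku": "P2", "cat_code": "C02", "qty": 1},
    {"sku": "P1", "cat_code": "C01", "qty": -5},  # bad record from the source
]
bronze_store = [
    {"sku": "P1", "cat_code": "C01", "qty": 4},
]

# Hypothetical mapping of technical codes to meaningful values.
CATEGORY_NAMES = {"C01": "Laptops", "C02": "Monitors"}

def to_silver(records):
    """Silver: replace codes with meaningful values, apply sanity constraints."""
    return [
        {**r, "category": CATEGORY_NAMES[r["cat_code"]]}
        for r in records
        if r["qty"] > 0  # filter out rows that fail the sanity check
    ]

# Blend both sources into one single 'product sales' Silver dataset.
silver_sales = to_silver(bronze_webshop) + to_silver(bronze_store)

def to_gold(records):
    """Gold: aggregate per category, ready for analysts."""
    totals = {}
    for r in records:
        totals[r["category"]] = totals.get(r["category"], 0) + r["qty"]
    return totals

gold_sales_per_category = to_gold(silver_sales)
print(gold_sales_per_category)  # {'Laptops': 6, 'Monitors': 1}
```

Note how the Silver dataset is the single reusable building block: the Gold aggregation here is one consumer, but other Gold datasets could be derived from the same Silver data.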
Technology: Azure Data Lake Storage Gen2!
We often recommend Azure Data Lake Storage Gen2 (ADLS Gen2) to our customers. This ‘Gen2’ merges the best of two resources, ‘Blob Storage’ and ‘Data Lake Storage Gen1’ (the latter being deprecated). ADLS Gen2 exposes a Hadoop-compatible file system (HDFS) interface to optimise searching and reading through large volumes of data. ADLS Gen2 also allows you to implement very granular security: you choose who can access which folders and files.
Note: Typical on-premises data lakes do not provide out-of-the-box functionality for implementing security or easily relocating files. In such scenarios, every user can access all data in the lake. We are all used to navigating through folders of information, and the typical data lake tries to mimic this by prepending the folder names to the file names (resulting in very long names). Hence, renaming a folder actually copies every file under the old [folder-filename] to a new [folder-filename] (an expensive operation). In addition, as on-premises data lakes typically bring lots of disk management, you can probably see why we recommend an out-of-the-box cloud technology.
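The rename cost is easy to demonstrate. Below is a toy illustration (not Azure code, and the paths are made up) of a flat-namespace store where ‘folders’ are just filename prefixes: renaming one means copying and deleting every object underneath it.

```python
# A flat-namespace store: keys are full [folder-filename] strings,
# so folders exist only as name prefixes.
flat_store = {
    "raw/products/2023/file1.csv": b"...",
    "raw/products/2023/file2.csv": b"...",
    "raw/customers/2023/file1.csv": b"...",
}

def rename_folder_flat(store, old_prefix, new_prefix):
    """Every object under the prefix must be rewritten under a new name:
    O(number of files), with each one a full copy-and-delete."""
    ops = 0
    for name in list(store):
        if name.startswith(old_prefix):
            store[new_prefix + name[len(old_prefix):]] = store.pop(name)
            ops += 1
    return ops

ops = rename_folder_flat(flat_store, "raw/products/", "raw/articles/")
print(ops)  # 2 copy-and-delete operations, one per file under the folder
```

With a hierarchical namespace, by contrast, the same rename is a single metadata operation, independent of how many files the folder holds.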
ADLS Gen2 makes use of hierarchical namespaces. This means that files are organised in real folders: querying your data lake (for example using Spark or Synapse’s SQL On-Demand) is sped up when your filters match your folder structure. And because the folders are genuine directory entries (not implemented as part of the filename), renaming and reorganising folders takes mere seconds!
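The speed-up when filters match folders is often called partition pruning: whole folders can be skipped without reading a single file. A small sketch, with hypothetical paths and partition keys:

```python
# Hypothetical lake layout: one folder per year/month partition.
paths = [
    "sales/year=2023/month=11/part-0.parquet",
    "sales/year=2023/month=12/part-0.parquet",
    "sales/year=2024/month=01/part-0.parquet",
    "sales/year=2024/month=02/part-0.parquet",
]

def prune(paths, **filters):
    """Keep only the paths whose folder names match all key=value filters,
    mimicking how a query engine skips non-matching partitions."""
    wanted = {f"{key}={value}" for key, value in filters.items()}
    return [p for p in paths if wanted <= set(p.split("/"))]

print(prune(paths, year=2024))            # only the two 2024 folders are read
print(prune(paths, year=2023, month=12))  # a single folder remains
```

Engines such as Spark apply exactly this kind of folder-based filtering automatically when the layout follows a key=value convention, which is why aligning your folder structure with your most common filters pays off.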
Data lakes help you build up a trail of data that will bring value in the future. Azure Data Lake Storage Gen2 is the go-to Azure resource that helps us provide a data platform for your organisation. Interested in creating your own data lake? Contact us!
Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. If you need help with architectural decisions, do not hesitate to invite Sander over for a coffee to share some ideas.