Month: April 2022

Feature Store

Anyone who has come into contact with data science has heard of the features used in data science models. One aspect that can become quite challenging is reusing features consistently across team members, projects and environments. In this article, I will explain the most commonly used way to resolve these […]

Kimball in a data lake? Come again?

Most companies are already familiar with data modelling (be it Kimball or any other modelling technique) and data warehousing with a classical ETL (Extract-Transform-Load) flow. In the age of big data, an increasing number of companies are moving towards a data lake using Spark to store massive amounts of data. However, we often see that […]

Pandas, Koalas and PySpark in Python

If you landed on this page to learn more about animals, I have to disappoint you. Pandas, Koalas and PySpark are all packages that serve a similar purpose in the Python programming language. Python has gained increasing traction over the past years, as illustrated by the Stack Overflow trends. Originally designed as a general purpose […]

Transfer learning in Spark for image recognition

Transfer learning in Spark, demystified in less than 3 minutes of reading. Businesses that want to classify a huge set of images in a daily batch can do this by combining the parallel processing power of PySpark with the accuracy of models trained on a huge set of images using transfer learning. Let’s first explain the […]

How ALM streamlines BI projects: Azure DevOps

Application Lifecycle Management (ALM) refers to a (software) development process that has been set up in a governed and easy-to-manage way. ALM provides added value to the development team, project managers and business users. While the term ‘ALM’ is mostly used in pure software development projects (…written in 100% programming languages), BI projects (which are by nature […]

The Journey of attaining the Azure Data Engineer certificate

On February 23, 2021, Microsoft released a new beta certification exam, Exam DP-203: Data Engineering on Microsoft Azure. It replaces the exams DP-200: Implementing an Azure Data Solution and DP-201: Designing an Azure Data Solution. These previous exams will be retired on June 30, 2021. When passing the two old exams or […]

Things to consider when creating a Data Lake

Have you wondered what a data lake is? What are typical use cases for such a lake? How can you benefit from a data lake? In this blog post, we will show you the added value of a data lake while pointing out some pitfalls and best practices. Before diving into data lakes (ba-dum-tsss), let us start […]

Managed Big Data: DataBricks, Spark as a Service

The title of this blog post is quite the mouthful. This blog post will explain why you should be using Spark. If Spark makes sense for your use case, we will then introduce you to the DataBricks product, which is available on Azure. Being recognised as a Leader in the Magic Quadrant emphasizes the operational […]