Everyone who has already come in touch with data science, has already heard of features used in such models. One aspect that can become quite challenging, is reusing features in a consistent way, across several team members, projects and in environments. In this article, I will explain the most commonly used way to resolve these […]
Month: April 2022
Kimball in a data lake? Come again?
Most companies are already familiar with data modelling (be it Kimball or any other modelling technique) and data warehousing with a classical ETL (Extract-Transform-Load) flow. In the age of big data, an increasing number of companies are moving towards a data lake using Spark to store massive amounts of data. However, we often see that […]
Pandas, Koalas and PySpark in Python
If you landed on this page to learn more about animals, I have to disappoint you. Pandas, Koalas and PySpark are all packages that serve a similar purpose in the programming language Python. Python has increasingly gained traction over the past years, as illustrated in the Stack Overflow trends. Originally designed as a general purpose […]
Transfer learning in Spark for image recognition
Transfer learning in Spark demystified in less than 3 minutes reading Businesses that want to classify a huge set of images in batch per day can do this by leveraging the parallel processing power of PySpark and the accuracy of models trained on a huge set of images using transfer learning. Let’s first explain the […]
How ALM streamlines BI projects: Azure DevOps
Application Lifecycle Management (ALM) refers to a (software) development process which has been setup in a governed and easy-to-manage way. ALM provides added value to the development team, project managers and the business users. While ‘ALM’ is mostly coined by pure software development projects (…written in 100% programming languages), BI projects (which are by nature […]
The Journey of attaining the Azure Data Engineer certificate
On February 23, 2021, Microsoft released a new beta certification exam, Exam DP-203: Data Engineering on Microsoft Azure. It is replacing the exams DP-200: Implementing an Azure Data Solution, and DP-201: Designing an Azure Data Solution. These previous exams DP-200 and DP-201 will retire on June 30, 2021. When passing the two old exams or […]
Things to consider when creating a Data Lake
Have you wondered what a data lake is? What are typical use cases for this lake? How can you benefit from a data lake are? In this blog post, we will show you the added value of a data lake while pointing-out some pitfalls and best-practices. Before diving into data lakes (ba-dum-tsss), let us start […]
Managed Big Data: DataBricks, Spark as a Service
The title accompanying this blog post is quite the mouth full. This blog post will explain why you should be using Spark. If a use case would make sense, then we will introduce you to the DataBricks product, which is available on Azure. Being recognised as a Leader in the Magic Quadrant, emphasizes the operational […]