The Journey of attaining ‘DA-100: Analyzing data with Power BI’


Introduction

Microsoft’s DA-100 certification demonstrates a broad knowledge of Power BI. In this article, I will share some tips and tricks on how to succeed.

The DA-100 exam is a new Power BI certification exam, out of its beta phase since July. It replaces the older exam 70-778, “Analyzing and Visualizing Data with Microsoft Power BI”. Microsoft is reshaping its assortment of certifications, and the older certifications are therefore scheduled to retire on January 31st, 2021.
The new certifications are based on specific roles (Data Analyst, Data Engineer, Azure Developer and so on), while the older ones focused on a specific tool such as Power BI or SQL Server. Passing the exam grants you a certification that is valid for two years.

Exam preparation

When the exam was in beta, there were no study guides. Preparing for the exam was therefore much more of a challenge than it is now, mostly due to the lack of information about what the certification covered. Today you can find an outline of the specific skills measured in the exam on the certification’s website, together with the weight of each area:

  • Prepare the data (20-25%)
  • Model the data (25-30%)
  • Visualize the data (20-25%)
  • Analyze the data (10-15%)
  • Deploy and maintain deliverables (10-15%).

Microsoft also provides learning paths to gain the skills needed to become certified. These learning paths are divided into modules, which contain a lot of reading material and, most of the time, a lab session at the end. Working through them is a great way to get to know the material better and to get a better feel for the kind of questions asked on the exam.
To be fair, I think not knowing how the questions are phrased is one of the greatest difficulties of this exam. Still, I would not recommend taking the exam after studying only this material; it is not in-depth enough. Some questions are based on real-life scenarios, which is a good thing, but it makes the exam harder to pass without a fair amount of hands-on experience. The best way to study for the exam is simply to practice a lot. Obviously, the exam will be easier for people who use Power BI every day.

Another way of studying that I used myself is taking quizzes at https://www.learndatainsights.com/. There are a lot of test questions for every subject, which makes it a fun way of going through the material and testing your knowledge. However, you will notice that some of the questions are dated.

The exam

My exam consisted of 47 questions and 2 case studies, each with 4 questions. Everything is multiple choice and there is no penalty for guessing. After the exam, you immediately get your results, which is nice.

Here are some examples of tasks you might have to perform on the exam:
- Filtering in DAX.
- Importing data from different SharePoint sources.
- Modelling within Power BI, especially related to the date dimension.

Conclusion

Studying for the certificate is definitely a nice way to start with Power BI, but to pass the exam you will need some working experience with it. Good luck to all the readers!

About the author: Wim Roten is a Business Intelligence consultant with two years of experience. He has a broad knowledge of data architectures within Azure and likes practicing his Data Science knowledge in his spare time.

 

Wim Roten

Microsoft Analytics Consultant @ Lytix

Transfer learning in Spark for image recognition


 

Transfer learning in Spark, demystified in less than 3 minutes of reading

 

Introduction

Businesses that need to classify huge batches of images every day can do so by combining the parallel processing power of PySpark with the accuracy of models trained on a huge set of images through transfer learning. Let’s first explain the two buzzwords used in the title.

  

Transfer learning

Transfer learning is a machine learning method in which a model that was pretrained for a different task is reused for a new task. It is common in fields such as computer vision and natural language processing. An advantage of this approach is that it can also be used for data sets where you don’t have much data, since the starting point of your final model is a model that has already been trained on lots of data. Moreover, these models have been optimized by experienced developers and may give you a head start in the accuracy of your model.

In this example, I will use transfer learning for image recognition. Three of the more popular base models used for transfer learning are VGG (e.g. VGG19), GoogLeNet (e.g. InceptionV3) and Residual Networks (e.g. ResNet50). These are all neural networks. I will not go into the details of these networks, but instead explain the basic principle.

The image below shows a schematic overview of transfer learning, with the different layers of the neural network depicted. The input of the model is an image, which passes through the successive layers of the network, and the output is a final label. In the original model, this can be a cat, a dog or a fish.

 

Now assume we have a data set with images of shoes, pants and t-shirts, and we want to predict which is which. To do so, we adapt the model, but only a specific part of it. The base of the model (the orange bars) remains unaltered; this part, often called the backbone, transforms the input images and extracts basic image features such as colors and shapes. The retrained part corresponds to the last layers of the model, often called the head, which map these basic features onto the right category.
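To make the principle concrete, here is a minimal sketch (not the code from this article’s repository) of freezing a pretrained backbone and adding a new head in Keras; the three output classes correspond to the shoes, pants and t-shirts example:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Pretrained backbone: keeps the layers that extract basic image features.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone; only the new head will be trained

# New head: maps the extracted features onto our three categories.
model = models.Sequential([
    base,
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```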

Spark

Spark is a unified analytics engine built for big data processing. It allows you to use multiple nodes to perform computations. Since feature extraction with neural networks can be quite a computationally intensive process, Spark is used here to run the first layers of the model in parallel on a lot of images.

 

 

Practical

The link to the code on GitHub can be found at the end of this blog. For this setup, I used Azure Databricks and mounted an ADLSGen2 storage account to the cluster. To do this, please follow this link: https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html. Using Azure Storage Explorer, you can transfer the images of your training set onto this storage account.
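As an illustration, the mount can look roughly like this; the service principal, secret scope, container and storage account names are placeholders, and the linked documentation covers the full setup:

```python
# Service-principal (OAuth) configuration for the ABFS driver.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-scope", key="service-principal-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen2 container so the images are reachable under /mnt/images.
dbutils.fs.mount(
    source="abfss://images@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/images",
    extra_configs=configs,
)
```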

 

The first step is to read the images into Spark, which can be done with the command below. This command scans the entire folder for images and loads them in a binary format. The columns of the resulting dataframe are path, length and content (the binary data), and the loading happens in parallel, distributed over the Spark nodes.
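A sketch of that command, assuming the images were uploaded under the /mnt/images mount from the previous step:

```python
# Read every image in the folder (and subfolders) as binary content.
images_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")       # only pick up image files
    .option("recursiveFileLookup", "true")   # scan subfolders as well
    .load("/mnt/images/train")
)

# Resulting columns include path, length and content (the raw bytes).
images_df.printSchema()
```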

In the next step, the base model for transfer learning is used to extract the base features from the images in this dataframe. The image below indicates how a Spark user-defined function (UDF) extracts the features from the images in parallel and creates a new column ‘features’, using the base model. A Spark UDF is a function that executes computational logic on the rows of a dataframe, in parallel over multiple Spark nodes. For the full code of this UDF, please see the GitHub repository linked at the end of this article.
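A simplified sketch of the idea, using ResNet50 without its top layers as the base model (the full version is in the repository):

```python
import io

import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input


@pandas_udf(ArrayType(FloatType()))
def featurize(content_series: pd.Series) -> pd.Series:
    # The headless base model only extracts features, it no longer classifies.
    model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    def extract(content: bytes):
        img = Image.open(io.BytesIO(content)).convert("RGB").resize((224, 224))
        batch = preprocess_input(
            np.expand_dims(np.asarray(img, dtype="float32"), axis=0))
        return model.predict(batch).flatten().tolist()

    return content_series.apply(extract)


# Each Spark task featurizes its own slice of the images in parallel.
features_df = images_df.withColumn("features", featurize("content"))
```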

Now, where does that leave us? At this point, we have a dataframe composed of classical numerical features, on which we can apply classical machine learning techniques such as logistic regression, random forests, etc.

In this last step, we need to attach the labels to this dataframe based on the path or name of each image and create an MLlib pipeline. This pipeline contains an additional layer on top of the base model that predicts the target for our case. The code used for the MLlib pipeline is shown below. I created a VectorAssembler to get the features into the right format and a StringIndexer to convert the text label into a numerical label. As the final model, I used a logistic regression. All of these are combined into a Spark ML pipeline, which bundles the steps into a single object (similar to an sklearn pipeline).
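A sketch of such a pipeline; deriving the label from a folder-per-class layout is an assumption here, and the exact code is in the GitHub repository:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import element_at, split, udf

# Derive the text label from the image path (assuming one folder per class).
labeled_df = features_df.withColumn("label_text",
                                    element_at(split("path", "/"), -2))

# MLlib expects Vector-typed features, so convert the array column first.
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
labeled_df = labeled_df.withColumn("features_vec", to_vector("features"))

indexer = StringIndexer(inputCol="label_text", outputCol="label")
assembler = VectorAssembler(inputCols=["features_vec"],
                            outputCol="assembled_features")
lr = LogisticRegression(featuresCol="assembled_features", labelCol="label")

# One pipeline object that combines indexing, assembling and the classifier.
pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(labeled_df)
```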

Alternative methods

There are multiple design choices to be made in this method:

  • The base model for transfer learning (ResNet, VGG, …) can be altered.
  • The final Spark pipeline can be changed to include multiple stages, or a tree-based model can be used instead of a logistic regression.
  • The hyperparameters of the model can be tuned with a grid search, as sketched below.

Feel free to experiment with this in order to achieve the best performance for your use case.
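For instance, reusing the names from the pipeline sketch above (indexer, assembler, labeled_df), a tree-based model combined with a small grid search might look like this:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Swap the logistic regression for a random forest.
rf = RandomForestClassifier(featuresCol="assembled_features", labelCol="label")

# A small hyperparameter grid to search over.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(
    estimator=Pipeline(stages=[indexer, assembler, rf]),
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(labelCol="label",
                                                metricName="accuracy"),
    numFolds=3,
)
cv_model = cv.fit(labeled_df)
```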

 

Conclusion

The method above shows how you can train a model in parallel on a massive image data set using transfer learning. This way of training is ideal if you need to classify a lot of images in batch every day. In case of questions, feel free to contact us and we will help you out!

Github link:

https://github.com/tntn123/spark_transferlearning/blob/main/main.py

Tom Thevelein


This blog was written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to get his hands dirty by optimizing Spark (in any language), implementing data lake architectures and training algorithms.

Premium Per User (PPU) explained


Just last week, the Power BI CAT team extended the current licensing model with the release of Premium Per User (PPU) in public preview. How exactly does this new release differ from the existing licensing models? We’re reaching our next destination: Premium Per User!

 

Which Power BI licensing offerings did Microsoft already have in store?

First things first: let’s get you up to speed concerning the different licensing models Power BI comes with.

 

User-based licensing: Free & Pro (€8,40 per user per month)

With the release of Power BI in mid-2015, users could have two different types of licenses. On the one hand, there was a free Power BI license; on the other hand, Power BI Pro was introduced as a paid license. Among other benefits, this Pro license enabled users to actually share their Power BI reports by means of collaboration workspaces on the Power BI service portal. Power BI Pro licenses come at a cost of €8,40 per user per month.

 

Capacity-based licensing: Premium (Starting at €4212,30 per month)

 

Two years later, in mid-2017, the Premium licensing model reached general availability. As opposed to the earlier introduced Free and Pro licenses, which are user-based, this new license is capacity-based. It served, and still serves, as a dedicated node with a certain number of CPU cores and amount of RAM for the enterprise as a whole. Workspaces backed by Premium capacity offer enhanced performance as well as the ability to share reports published to an app workspace with Free users at no additional cost. Some Premium-only features have also been introduced, such as AI-related components, extended workspace/dataset storage, XMLA endpoint support and paginated reports. Power BI Premium starts at €4212,30 per month for the lowest tier (P1).

What is this Premium Per User (PPU) all about?

Three years after Premium reached general availability, at the end of 2020, a new licensing option was introduced: Premium Per User (PPU).

 

User-based licensing: Premium Per User (PPU) (Price still undefined)

Power BI Premium Per User allows organizations to license Premium features on a per-user basis. PPU gives users access to almost all Premium capabilities, such as AI-related components, extended workspace/dataset storage, XMLA endpoint support, paginated reports and many others.

Imagine a PPU license as a first-class train ticket. You didn’t have to buy the whole train, yet you can bathe in silence, relaxing in your soft leather seat while others sit in the second-class carriage. The train brings everyone to his or her destination; however, not everyone gets to enjoy those extra features!

Why could PPU become a game changer?

There’s no denying the huge pricing gap between purchasing a handful of Pro licenses and affording Premium at a massive cost of over €4200 per month. I believe PPU can lower that threshold, especially for small and medium-sized enterprises (SMEs). Many of these enterprises have a rather small user base of 25 people or fewer covering the data analytics department. In a second phase, if the reporting requirements or user base grow, a logical next step is to upgrade to a full-blown Premium tier.

Please do note that, as described earlier, Premium is a capacity-based license: Power BI Pro licenses still need to be purchased to enable report creators to share their reports or collaborate in workspaces on the service portal. At the moment of writing, PPU pricing details have not yet been decided.

Can I test it out myself?

YES, you can! At the moment of writing, this option should be available to all tenants who subscribed for priority access. The remaining tenants should get their hands on this public preview towards the end of November, according to Microsoft’s latest communication. Renewable 60-day trials of PPU can be requested within your O365 environment at no cost.

  

Requesting the trial only takes a few clicks, and you’re good to go.


Wrap up

This article explains how the newly released licensing model, Power BI Premium Per User (PPU), bridges the gap between the existing Pro and Premium licensing models. A free trial of this licensing model is now available in public preview for an undefined period of time, and pricing details have not yet been decided: the time for testing is now. If you are unsure whether PPU could be beneficial for your company, feel free to get in touch with me or any of my colleagues at Lytix.

Lou Segers
Microsoft Analytics Consultant @ Lytix
Follow me on LinkedIn

Things to consider when creating a Data Lake


Have you ever wondered what a data lake is? What are typical use cases for such a lake? And how can you benefit from one? In this blog post, we will show you the added value of a data lake while pointing out some pitfalls and best practices.

Before diving into data lakes (ba-dum-tsss), let us start with an example that anyone can relate to. When I bought my first computer, there was only room for about 100 GB of files. This storage was quickly filled up with personal documents, scans, photos of friends and family, and lots of music. Every three years, I buy a new laptop on which I ‘start fresh’ and collect new photos, videos, music and files. Backups to an external or cloud drive make sure that you can sleep at night: you are covered if your laptop crashes, when you need to consult old files, or when you want to restore an earlier version. Moreover, you can store more files than the drive of your laptop can hold, and these files are accessible anywhere and at any time. Sharing with friends and family also becomes easy: you can just right-click a folder or file and make it available to them, either via a URL or via an e-mail sign-in.

 

 

With a data lake, similar advantages apply to your organization. You export all kinds of files to your data lake: images, videos, exports from operational systems, reports, logs, flat files, … Once they are in the lake, they are easily accessible when you need them in the future. While this may seem like a long shot today, you may well get added value out of some of this data later on. Having already built up history then gives that analysis a head start and results in more accurate insights.

The Layered Data Lake approach: Bronze, Silver & Gold

A pitfall we typically see is that all data is just dumped into one big data lake without any structure and then exposed to everyone. Exporting your data and building up history may already tick some boxes for your data lake implementation, but it lacks long-term planning. Not all data is easy to work with for data engineers: technical keys need to be translated into terms that make sense, some files need aggregation, and some files need to be merged together before you can work with them.

To tackle these problems, we suggest working with a layered approach (also referred to as the Multi-Hop Architecture):

    • Bronze Layer: a one-on-one copy of the data from the source into the data lake. ‘Bronze data’ is raw, untransformed, unmodified data, and all your sources land in this layer.
    • Silver Layer: once a business case has been identified and requires analysis, the raw bronze data is transformed into data sets that add additional value. This can imply replacing codes with meaningful values, adding sanity constraints, filtering out unneeded information, … The result is a set of concise, useful datasets that can be reused for other purposes as well.
    • Gold Layer: the gold layer provides well-constructed datasets ready for analysis by data scientists and business analysts. The data is presented in the way that appeals to them the most, which may include aggregations, joins and merges, encoding, etc.

In a typical scenario, we harvest product data from different source systems. All these sources land in the bronze layer as a one-on-one copy. The data from these sources is then blended in the silver layer, resulting in one single source of ‘product’ information. This ‘Product’ silver dataset can in turn feed several datasets that are presented to users in the gold layer (e.g. a stock keeping unit dataset, a market basket analysis, …).
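As a small sketch of what this layering can look like in practice (the paths, column names and the use of PySpark are illustrative assumptions, not a prescription):

```python
from pyspark.sql import functions as F

# Bronze: a one-on-one copy of a source extract, stored as-is.
bronze_products = spark.read.json("/datalake/bronze/erp/products/2020-11-30/")

# Silver: replace technical codes with meaningful values and apply sanity checks.
silver_products = (
    bronze_products
    .withColumn("category",
                F.when(F.col("cat_cd") == "A", "Apparel").otherwise("Other"))
    .filter(F.col("unit_price") > 0)          # basic sanity constraint
    .select("product_id", "product_name", "category", "unit_price")
)
silver_products.write.mode("overwrite").parquet("/datalake/silver/products/")

# Gold: an aggregated, analysis-ready dataset for analysts and data scientists.
gold_category_prices = (
    silver_products
    .groupBy("category")
    .agg(F.avg("unit_price").alias("avg_unit_price"),
         F.count("*").alias("product_count"))
)
gold_category_prices.write.mode("overwrite").parquet("/datalake/gold/category_prices/")
```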

Technology: Azure Data Lake Storage Gen2!

We often recommend Azure Data Lake Storage Gen2 (ADLSGen2) to our customers. This ‘Gen2’ merges the best of ‘Blob Storage’ and the original ‘Data Lake Storage’ (the latter is being deprecated). ADLSGen2 exposes a Hadoop-compatible file system (HDFS-style APIs) to optimise searching and reading through large volumes of data. ADLSGen2 also allows you to implement very granular security: you choose who gets access to which folders and files.

Note: typical on-premise data lakes do not provide out-of-the-box functionality for implementing security and easily relocating files. In such scenarios, every user can access all data in the lake. We are all used to navigating through folders of information; the typical data lake mimics this by prepending the folder names to the file names (resulting in very long file names). Hence, renaming a folder actually executes a copy of the old [folder-filename] and a paste with the new [folder-filename] (an expensive operation). Add to this that on-premise data lakes typically require a lot of disk management, and you probably see why we recommend an out-of-the-box cloud technology.

ADLSGen2 makes use of hierarchical namespaces. This means that files are organised in real folders rather than in name prefixes: querying your data lake (for example using Spark or Synapse’s SQL On-Demand) is sped up if your filters match your folder structure. And because the folders are not implemented as part of the filename, renaming and reorganising folders takes mere seconds!
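For example, with a hypothetical folder-per-year layout, a query engine only has to read the folder you point it at instead of the whole lake:

```python
# Only the 2020 'folder' of the sales data is scanned, not the whole lake.
sales_2020 = spark.read.parquet(
    "abfss://datalake@<storage-account>.dfs.core.windows.net/gold/sales/year=2020/"
)
```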

Wrap-Up

Data Lakes help you build up a trail of data that will bring value in the future. Azure Data Lake Storage Gen2 is the go-to Azure resource that helps us provide a data platform for your organisation. Interested in creating your own data lake? Contact us!

Sander Allert


BI Architect

Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all of its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. If you need help with architectural decisions, do not hesitate to invite Sander over for a coffee to share some ideas.

Azure Synapse Analytics


Organizations understand the value of data more than ever: a data warehouse as a single source of truth, a data lake to store data for analytical exploration, self-service tools for data transformation, visualisation and consumption, and clusters to process immense data volumes. All these use cases require different specialised tools, resulting in a dispersion of knowledge and data throughout the organisation. Microsoft’s Azure Synapse Analytics provides a single platform serving each data-related process: Big Data, Data Science, Data Lakes, Data Warehousing, Data Transformation and (Self-Service) Reporting.

A single platform

At the end of 2019, Microsoft introduced Synapse Analytics, a new service that merges the capabilities of Azure SQL Data Warehouse with new enhancements such as on-demand querying as a service. Azure Synapse is an end-to-end cloud-native analytics platform that brings together data ingestion, data warehousing and Big Data analytics in a single service.

Figure 1 – Synapse Architecture

While there are a lot of bells and whistles in this service, the availability of three compute options to handle your data is one of our favourites:

  1. Provisioned SQL Pools (formerly Azure SQL Data Warehouse) store your data in an always-available cluster. This is ideal for frequently accessed data warehouses that require quite some computing power.
  2. Provisioned Spark Pools provide you with a set of Spark clusters that can be used by data engineering and data science teams to answer their broad set of questions.
  3. On-Demand / Serverless SQL provides a pay-per-query model to access files in your data lake, without the long data-integration tracks that used to precede querying the data in your lake.

The next chapters discuss these ‘ways of compute’ in more detail.

Provisioned SQL Pools

Provisioned SQL Pools (formerly Azure SQL Data Warehouse) store your data in a fast-accessible way and can scale up to petabytes in size with only a limited decrease in performance. The size of a SQL pool is expressed in Data Warehousing Units (DWU): you scale up when more performance is needed and scale down to reduce costs. You are billed for as long as the pool is running, so smart scaling is advised!

How to choose the best performance level of your SQL Pool?

We suggest starting with a low performance level like DW300c. This gives you one compute node with 60 distributions and 180 GB of memory. When you need more power, you can easily scale up. Up to DW500c you only get more memory; from DW1000c onwards you get two compute nodes and double the memory of DW500c, but at twice the price.
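Scaling itself is a single T-SQL command against the pool’s master database; a sketch, with placeholder server, pool and credential names, of doing this from Python:

```python
import pyodbc

# Connect to the master database of the workspace's dedicated SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace-name>.sql.azuresynapse.net;"
    "DATABASE=master;UID=<sql-admin>;PWD=<password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)

# Scale the dedicated SQL pool 'mypool' up to DW1000c.
conn.cursor().execute(
    "ALTER DATABASE mypool MODIFY (SERVICE_OBJECTIVE = 'DW1000c')"
)
```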

SQL On-Demand

SQL On-demand provides serverless queries over your data lake. Without any setup or infrastructure maintenance it allows you to explore data in Azure Data Lake Storage, Spark Tables and Cosmos DB.

By making use of the OPENROWSET function in SQL statements, you can easily access files stored in ADLS. SQL On-Demand supports Parquet, JSON and CSV. We think it will fully live up to its potential once the Delta format is supported as well.

It uses a pay-per-query model: you are only charged for the data processed per query, and it auto-scales depending on the amount of data processed.

Example SQL script to query a CSV file from ADLS with Azure Synapse Studio

Via the on-demand SQL endpoint provided in the Azure Synapse workspace, developers can utilize tools such as SSMS and Azure Data Studio with the on-demand compute engine.
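As a sketch, such an OPENROWSET query can also be submitted from Python over that endpoint; the workspace, storage account, credentials and file path below are placeholders:

```python
import pyodbc

# Serverless ("on-demand") endpoint of the Synapse workspace.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<sql-admin>;PWD=<password>"
)

# Query a CSV file in the data lake directly, without loading it anywhere first.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/datalake/sales/2020/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```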

“SQL On-Demand offers an ad-hoc cost-effective way of querying your data lake. Only the data passing through the query is billed.”

Now there are two possibilities to query your data. Which one is the best fit?
From our point of view, SQL Pools are best for constant, heavy workloads; if you want to do ad-hoc analysis, SQL On-Demand will be a better fit.

Apache Spark

Azure Synapse provides simple-to-use Spark clusters. It is a SaaS solution, meaning that Azure Synapse takes care of all underlying infrastructure components for you. This allows you to use Spark immediately in your Azure Synapse environment. When creating your Apache Spark Pool, you can choose between three node sizes: Small, Medium and Large.

Four languages are available in the Notebook experience of Azure Synapse: PySpark (Python), Spark (Scala), SparkSQL and .NET for Apache Spark (C#). Thus, the ‘R’ addicts will be disappointed.

As this is a cloud service, it is easy to reduce costs by enabling auto-scaling and auto-pausing.

              

Data Integration

Azure Data Factory is directly integrated into the Azure Synapse environment: Azure Synapse uses the same technology to provide the data integration features known from ADF. There is even a native integration between Spark and SQL.

Azure Synapse Workspace

The workspace is the heart of Azure Synapse. It is the place where data engineers and data scientists collaborate on their analytics solutions. Access to an Azure Synapse workspace is managed through the same Role-Based Access Control (RBAC) that applies to all other Azure resources. For example, to enable Power BI developers to launch Azure Synapse Studio and to access or build Power BI content from within it, the developers need to be granted the required permissions on the Azure Synapse workspace.

Azure Synapse Studio

Synapse Studio is a web-based interface that provides an end-to-end workspace and development experience for all Azure Synapse resources. All development and management activities supported by Azure Synapse are carried out in Azure Synapse Studio, via access to an Azure Synapse workspace. Other development tools such as SSDT and SSMS can also be used to interface with Azure Synapse resources.

 

Conclusion

Azure Synapse has a lot to offer. If needed, we can help you understand Azure Synapse Analytics, so you get the most out of it for your business case.

Pieter-Jan Serlet


Pieter-Jan has over 10 years of experience in data and analytics environments. He has a passion for helping business users reach their data in a more effective and easier way while maintaining governance. There is no data Pieter-Jan can’t handle.