Process Mining, the beating heart of the intelligent enterprise

Process Mining gives you a quick and objective diagnosis of how your business processes are actually executed, by extracting the necessary information from your IT systems. In the past this was a tedious task based on lots of interviews, and in more than 98% of cases it resulted in a misrepresentation of reality. Process Mining removes the unreliability of that tedious task and unlocks much more untapped potential, all of it in Power BI, giving you the flexibility to do a 360° analysis of your processes. What could Process Mining mean for you? Just keep on reading.

 

Interested in what this untapped potential could mean for you? Process Mining has a lot of benefits. Just naming a few clear ones: eliminating unnecessary steps, detecting bottlenecks, shortening lead times, enabling process automation, compliance benefits & fraud detection, … Typically these add to the bottom line through reduced costs, improved customer experience & better compliance.

 

The top 3 gains we typically see in Process Mining projects are:

  • Reduced costs by eliminating loops, rework and unnecessary steps
  • Improved throughput time of a process
  • Process standardization, which leads to easier RPA or automation and also easier software migrations. Think of the importance of having your processes exactly mapped out before you start selecting a new ERP.

That is why we often experience Process Mining as the beating heart of the intelligent enterprise. Process Mining has the ability to show you the full process reality across your organization. By doing so, you have all the power at your fingertips to make changes to the processes your organization would benefit from most.

You are probably wondering, how does this work?

Let us show you.

As your organization executes its business model, which is translated into processes, it leaves a digital trace. These digital traces are the transactions in your operational systems.

Every transaction in your operational system leaves a digital trace. The combination of these traces is what we call an event log. Too fast? Let us show you what an event log looks like once we have extracted it from your operational systems.

Basically, an event log needs at least three important fields to be able to visualize your processes (a small illustrative example follows the list below):

  • Case Id: This identifies one specific process execution from start to finish, e.g. a paid invoice grouped with all its preceding steps such as Goods Receipt, Purchase Order & Purchase Request.
  • Activity Name: These are typically the names of your different process steps, e.g. Purchase Order Creation, Change PO price, Goods Receipt, … So every name of a step in the process.
  • Timestamp: This is the easiest part: it is just the recorded time of the activity. This makes it possible to calculate the duration between the execution of the different activities.
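To make this concrete, below is a small sketch (in Python, using pandas) of what such an extracted event log could look like. The case IDs, activity names and timestamps are purely made up for the example:

    import pandas as pd

    # A tiny, illustrative purchase-to-pay event log:
    # one row per executed activity, with the three mandatory fields.
    event_log = pd.DataFrame({
        "case_id":   ["PO-1001", "PO-1001", "PO-1001", "PO-1002", "PO-1002"],
        "activity":  ["Create Purchase Request", "Create Purchase Order", "Goods Receipt",
                      "Create Purchase Order", "Change PO price"],
        "timestamp": pd.to_datetime(["2020-11-02 09:15", "2020-11-03 14:02", "2020-11-10 08:30",
                                     "2020-11-04 10:45", "2020-11-05 16:20"]),
    })

    # Because every activity is timestamped, durations between steps can be derived per case.
    event_log = event_log.sort_values(["case_id", "timestamp"])
    event_log["time_since_previous_step"] = event_log.groupby("case_id")["timestamp"].diff()
    print(event_log)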

Based upon this event log we build Process Mining insights. Some typical scenarios are:

  • Root Cause Analysis: Do you want to know the root cause of certain bottlenecks in your processes?
  • Benchmarking: Are you interested in why your plant in Italy is more productive than your plant in Germany?
  • Automation: Do you want to know which (sub-) processes are prime candidates to automate?
  • And many more…

Here is just one straightforward example of a process tree that is automatically generated from the event log.

Why should you invest in Process Mining now?

Process Mining is a domain that has existed for many years. However, it is only recently that it stepped out of the shadows of the academic world and was made practical and pragmatic enough to prove instant business value. It now becomes available with a low entry barrier in your favorite reporting tool, Power BI, together with PAFnow’s highly specialized data visualisations.

Another reason we see why Process Mining should be a top priority is the hyper-speed of digitalization we are in. Let us explain. Having a 100% accurate process overview enhances your agility as a business. Examples here are: Robotic Process Automation, more accurate selection of new applications that fully align with your processes, next-step prediction based on previous process executions, … These are more and more becoming the game-changers that let you outpace your competitors.

 

Interested?

Reach out directly to one of our Process Mining experts: ProcessMining@Lytix.be

Or simply attend our next webinar on Process Mining, scheduled on the 21st of January, 15h. https://bit.ly/LytixPM

Our Process Mining Experts:

 

The Journey of attaining ‘DA-100: Analyzing data with Power BI’

Introduction

Microsoft’s DA-100 certificate proves a vast knowledge of Power BI. In this article, I will give you some tips and tricks on how to succeed. 

The DA-100 exam is a new Power BI certification exam, out of the beta phase since July. It is a replacement for the older 70-778 “Analyzing and Visualizing Data with Microsoft Power BI”. Microsoft is reshaping their assortment of certifications. Therefore, the older certifications are scheduled to retire on January 31st, 2021.
The new certifications are now based on specific roles (Data Analyst, Data Engineer, Azure Developer and so on), while the older ones are more focused on a specific tool like Power BI or SQL Server. Passing the exam grants you a certificate that lasts for 2 years. 

Preparation exam

When the exam was in beta, there were no study guides. Hence, preparing for the exam was much more of a challenge than it is now, mostly due to the lack of information about what the certification covered. Now you can find the outline of the specific skills measured in the exam on the certification’s website, together with their weight:

  • Prepare the data (20-25%)
  • Model the data (25-30%)
  • Visualize the data (20-25%)
  • Analyze the data (10-15%)
  • Deploy and maintain deliverables (10-15%).

Microsoft also provides us with learning paths to gain the skills needed to become certified. These learning paths are divided into modules. They contain a lot of reading material and, most of the time, a lab session at the end. Working through them is a great way to get to know the material better and to get a better idea of the questions on the exam.
To be fair, I think not knowing how the questions are asked is one of the greatest difficulties of this exam. But I would not recommend taking the exam after only studying this material, as it is not in-depth enough. Some questions are based on real-life scenarios, which is a good thing, but that makes it harder to pass without a fair amount of hands-on experience. The best way to study for the exam is simply to practice a lot. It is obvious the exam will be easier for people who use Power BI every day.

Another way of studying for the test that I used is taking quizzes at https://www.learndatainsights.com/. There are a lot of different test questions for every subject. This is a fun way of going through the material and testing your knowledge. However, you will notice that some of the questions are dated.

The exam

My exam consisted of 47 questions and 2 case studies, each with 4 questions. Everything is multiple choice and there is no correction for guessing. After the exam, you immediately get your results, which is nice.

Here are some examples of tasks you might have to perform on the exam:
- Do some filtering in DAX.
- Import data from different SharePoint sources.
- Modelling within Power BI, especially related to the DATE dimension.

Conclusion

Studying for the certificate is definitely a nice way to start with Power BI, but to pass the exam you will need some working experience with it. Good luck to all the readers!

About the author: Wim Roten is a Business Intelligence consultant with two years of experience. He has a broad knowledge of data architectures within Azure and likes practicing his Data Science knowledge in his spare time.

 

Wim Roten

Microsoft Analytics Consultant @ Lytix

Power BI: Paginated Reports

 

One of the most valuable, yet lesser-used functionalities of Power BI is ‘Paginated Reports’. This article will explain what they are, when you need to use them and how they can benefit your organisation. At the end, some samples that demo the functionality can be downloaded.

 

Starting-Off with an Example

As you can see in the GIF below, ‘Power BI Paginated Reporting’ provides pixel-perfect and print-friendly reporting. There is a large discrepancy with standard Power BI reports, which offer lots of flexibility but are not print-friendly. The way of interacting with filters is fundamentally different, as is the interaction with visuals: selecting a category in a bar chart does not filter or highlight other visuals in the report.

Pixel-Perfect Reporting Throughout Time

Before the rise of Power BI, Microsoft provided another tool for reporting: ‘SQL Server Reporting Services’ (= SSRS). SSRS provided reports to business users on the IT premises of the customer. Most reports first prompted you for parameters before even showing a result. Once these parameters had been filled in, the report was generated by issuing pre-made queries (= SSRS datasets). Often, the design of these reports was limited to long lists and matrices of numbers and statistics. The typical goal of such reports is mass distribution of a PDF variant or exports to Excel. This pre-made exporting/rendering functionality is often referred to as ‘pixel perfect’. Only a more experienced developer could implement features such as cascading filters, drill-through actions and groupings. Nevertheless, these reports always did exactly what was expected of them and offered great trustworthiness and stability.

Developing a Paginated Report

Creating a paginated report is completely different from creating a Power BI Desktop model. This is already demonstrated by the fact that you need to install a separate tool called ‘Power BI Report Builder’. Some terminology might be confusing; a dataset serves a totally different purpose in Power BI Desktop than it does in Paginated Reports. There is hardly any drag-and-drop experience for creating the perfect dataset, and you should know upfront how you want things visualized. Therefore, we often recommend that a Paginated Report be created by someone who has experience with SSRS or has a more technical background. Thus, creating Paginated Reports doesn’t provide the same self-service capabilities as standard Power BI reports.

Export Examples

The ‘image gallery’ below shows the different ways of exporting from Power BI. Typically, such an export is distributed via mail to a long list of recipients, often referred to as mail bursting. But these reports can also be optimized for CSV exports and integrations. Do you want to experience the difference yourself? Download the examples here: https://cubissolutions.sharepoint.com/:u:/s/Lytix/EfCkQNAP_ntOntgZHx2KNbsBMNDcCexpFqWueUC37J3U9Q?e=7NRL0e

Licensing

Paginated Reports can only be created if you make use of any of these licenses:
– Power BI Premium.
– Power BI Embedded, at least an A4 instance.
– Power BI Premium per User.
Hence, you will not be able to work with Paginated Reports if you only have a Pro license. 

Other Tips and Tricks

– If you can’t find the correct documentation or walkthroughs for Paginated Reports, try your luck and add the keyword ‘SSRS’ to it.


– The use cases where Paginated Reports shine are the following: long multi-page lists (such as price lists), pixel-perfect rendering (such as Profit and Loss statements) and mass distribution of PDFs or flat files.


– Paginated Reports use the same extension (.rdl) as SSRS reports. This way SSRS reports can easily be converted into Paginated Reports.

– Better together: You can create links between Power BI and Paginated Reports. As such, you could start in a standard Power BI report and, when users click a button, a Paginated Report generates a print-friendly version.


Sander Allert

BI Architect

Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all of its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. Do you need help with architectural decisions? Do not hesitate to invite Sander over for a coffee to share some ideas.

Transfer learning in Spark for image recognition

 

Transfer learning in Spark, demystified in less than 3 minutes of reading

 

Introduction

Businesses that want to classify a huge set of images in batch every day can do so by leveraging the parallel processing power of PySpark and the accuracy of models trained on a huge set of images using transfer learning. Let’s first explain the two buzzwords used in the title.

  

Transfer learning

Transfer learning is a machine learning method where a model that was pretrained for a different task is reused for a new task. This is most common in tasks such as computer vision and natural language processing. An advantage of this approach is that it can also be used for data sets where you don’t have much data, since the starting point of your final model is a model that has already been trained on lots of data. Moreover, these models have been optimized by experienced developers and may give you a jumpstart in the accuracy of your model.

In this example, I will use transfer learning for image recognition. Three of the more popular base models used for transfer learning are VGG (e.g. VGG19), GoogLeNet (e.g. InceptionV3) and Residual Networks (e.g. ResNet50). These are all neural networks. I will not go into the details of these networks, but instead explain the basic principle.

The image below shows a schematic overview of transfer learning, with the different layers of the neural network depicted. The input of the model is an image, which then goes through the several layers of the neural network and comes out as a final label. In the original model, this can be a cat, a dog or a fish.

 

Now assume we have a data set with images of shoes, pants and t-shirts, and want to predict which one is which. To do so, we reuse the model, but retrain only a specific part. The basis (orange bars) of the model remains unaltered and is commonly named the ‘head’ of the model. This head transforms the input images and extracts basic image features, such as colors and shapes. The retrained part of the model is the ‘tail’ and corresponds to the last layers of the model, which map these basic features onto the right category.

Spark

Spark is a unified analytics engine built for big data processing. It allows you to use multiple nodes to perform computations. Since transfer learning and feature extraction using neural networks can be quite computationally intensive, Spark is used here to run the first layers of the model in parallel on a lot of images.

 

 

Practical

The link to the code on GitHub can be found at the end of this blog. In this setup, I have used Azure Databricks and mounted an ADLSGen2 blob storage to the cluster. To do this, please follow this link: https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html . Using Azure Storage Explorer, you can transfer the images of your training set onto this blob.

 

The first step is to read the images in Spark, which can be done using the command below. Basically, this command scans the entire folder for images and converts them to a binary format. The columns of this dataframe are path, length and content (binary format). This happens in parallel, distributed over the Spark nodes.
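As a sketch, assuming a Databricks notebook (where the spark session is predefined) and an illustrative mount point /mnt/images, reading the images could look like this:

    # Scan the folder (and subfolders) for JPEG images and load them as binary content.
    images_df = (
        spark.read.format("binaryFile")
             .option("pathGlobFilter", "*.jpg")       # only pick up image files
             .option("recursiveFileLookup", "true")   # also scan subfolders
             .load("/mnt/images/training_set/")       # illustrative mount point
    )

    # The resulting columns include path, length and content (the raw bytes of the image).
    images_df.printSchema()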

In the next step, the base model of transfer learning is used to extract the base features from the images in this dataframe. The image below indicates how a Spark User-Defined Function (UDF) extracts the features in parallel from the images and creates a new column ‘features’, using the base model. A Spark UDF is a function that allows you to execute computational logic on the rows of a dataframe in parallel over multiple Spark nodes. For the full code of this UDF, please see the GitHub link at the end of this article.
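The actual UDF is on the GitHub link at the end of this article; the snippet below is only a simplified sketch of the idea, assuming Keras’ ResNet50 (without its classification layers) as the base model and the images_df dataframe created in the previous step:

    import io
    import numpy as np
    import pandas as pd
    from PIL import Image
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, FloatType
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    @pandas_udf(ArrayType(FloatType()))
    def extract_features(content_series: pd.Series) -> pd.Series:
        # Load the pretrained base model without its final classification layers,
        # so it only outputs generic image features (the unaltered part described above).
        model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

        def featurize(content):
            img = Image.open(io.BytesIO(content)).convert("RGB").resize((224, 224))
            batch = preprocess_input(np.expand_dims(np.asarray(img, dtype="float32"), axis=0))
            return model.predict(batch)[0].tolist()

        return content_series.apply(featurize)

    # Runs in parallel over the Spark nodes and adds a 'features' column to the dataframe.
    features_df = images_df.withColumn("features", extract_features("content"))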

Now, where does that leave us? At this moment, we have a dataframe composed of classical numerical features, on which we can execute classical machine learning techniques such as logistic regression, random forests, etc.

In this last step, we need to attach the labels to this dataframe based on the path or name of the image and create an MLlib pipeline. This pipeline contains an additional layer on top of the base model, which allows us to predict the target in our case. The code used for the MLlib pipeline is shown below. I have created a VectorAssembler in order to have the right format for the features and a StringIndexer to convert the text of the label to a numerical label. As the final model, I used a logistic regression. These are all combined into a Spark ML pipeline, which bundles these steps into a single object (similar to a scikit-learn pipeline).
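The full pipeline code is on GitHub; a condensed sketch could look as follows, assuming the label can be derived from the parent folder of each image (e.g. …/shoes/img001.jpg) and that features_df comes out of the previous step:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import element_at, split, udf

    # Derive a text label from the image path (illustrative folder layout: .../shoes/img001.jpg).
    labeled_df = features_df.withColumn("label_text", element_at(split("path", "/"), -2))

    # MLlib expects a Vector column, so convert the array of floats produced by the UDF.
    to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    labeled_df = labeled_df.withColumn("features_vec", to_vector("features"))

    indexer = StringIndexer(inputCol="label_text", outputCol="label")      # text -> numeric label
    assembler = VectorAssembler(inputCols=["features_vec"], outputCol="model_features")
    lr = LogisticRegression(featuresCol="model_features", labelCol="label")

    # One reusable object that bundles all steps, similar to a scikit-learn pipeline.
    pipeline = Pipeline(stages=[indexer, assembler, lr])
    model = pipeline.fit(labeled_df)
    predictions = model.transform(labeled_df)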

Alternative methods

There are multiple design choices to be made in this method:

  • The base model for transfer learning (ResNet, VGG, …) can be changed.
  • The final Spark pipeline can be changed to include multiple stages, or one can use a tree-based model instead of a logistic regression.
  • To tune the hyperparameters of the model, one can use a grid search.

Feel free to experiment with this in order to achieve the best performance for your use case.

 

Conclusion

The method above shows how you can train a model in parallel on a massive data set of images using transfer learning. This way of training is optimal if you need to classify a lot of images in batch every day. In case of questions, feel free to contact us and we can help you out!

Github link:

https://github.com/tntn123/spark_transferlearning/blob/main/main.py


Tom Thevelein

This blog was written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to get his hands dirty by optimizing Spark (in any language), implementing data lake architectures and training algorithms.

Premium Per User (PPU) explained

Just last week, the Power BI CAT team extended the current licensing model with the release of Premium Per User (PPU) in public preview. How exactly does this new release differ from the current licensing models? We’re reaching our next destination: Premium Per User!

 

Which Power BI licensing offerings did Microsoft already have in store?

First things first: let’s get you up to speed concerning the different licensing models Power BI comes with.

 

User-based licensing: Free & Pro (€8,40 per user per month)

With the release of Power BI mid-2015, users could have two different types of licenses. On the one hand, there was a free Power BI license, whereas on the other hand, Power BI Pro was introduced as a paid license. Among other benefits, this Pro license enabled users to actually share their Power BI reports by means of collaboration workspaces located in the Power BI service portal. Power BI Pro licenses come at a cost of €8,40 per user per month.

 

Capacity-based licensing: Premium (Starting at €4212,30 per month)

 

Two years later, mid-2017, the Premium licensing model reached general availability. As opposed to the earlier-introduced Free and Pro licenses, which were user-based, this new license was capacity-based. It served, and still serves, as a dedicated node with a certain number of CPU cores and amount of RAM for the enterprise as a whole. Workspaces backed by Premium capacity offer enhanced performance as well as the ability to share reports published to an app workspace with Free users without any additional costs. Also, some extra features have been introduced which are Premium-only, such as AI-related components, extended workspace/dataset storage, XMLA endpoint support and paginated reports. Power BI Premium starts at a cost of €4212,30 per month for the lowest tier (P1).

What is this Premium Per User (PPU) all about?

Three years after Premium reached general availability, at the end of 2020, a new licensing option was introduced called Premium Per User (PPU).

 

User-based licensing: Premium Per User (PPU) (Price still undefined)

Power BI Premium Per User allows organizations to license Premium features on a per-user basis. PPU enables users to benefit from almost all Premium capabilities, such as AI-related components, extended workspace/dataset storage, XMLA endpoint support, paginated reports and many others.

Imagine a PPU license as a first-class train ticket. You didn’t have to buy the whole train, yet you can bathe in silence, chilling in your soft leather seat while others sit in the second-class carriage. The train will bring everyone to his or her destination; however, not everyone gets to enjoy those extra features!

Why could PPU become a game changer?

There’s no denying the huge gap in pricing between purchasing some Pro licenses and affording Premium at a massive cost of over €4200 per month. I believe PPU could have a threshold-lowering impact, especially for small and medium-sized enterprises (SMEs). Many of these enterprises have a rather small user base of 25 people or less covering the data analytics department. In a second phase, if the reporting requirements or user base grow, a logical next step would be to upgrade to a full-blown Premium tier.

Please do note, as described earlier, that Premium is a capacity-based license. Power BI Pro licenses still need to be purchased to enable report creators to share their reports or collaborate in workspaces in the service portal. At the moment of writing, PPU pricing details have not yet been decided.

Can I test it out myself?

YES, you can! At the moment of writing, this option should be available to all tenants who subscribed for priority access. The remaining tenants should get their hands on this public preview towards the end of November, according to Microsoft’s latest communication. Renewable 60-day trials of PPU can be requested within your O365 environment at no cost.

  

Simply follow these steps and you’re good to go:

 

 

 

Considerations

 

 

Wrap up

This article elaborates on how the newly released Power BI licensing model, Premium Per User (PPU), bridges the gap between the existing Pro and Premium licensing models. A free trial of this licensing model is now available in public preview for an undefined period of time, and pricing details have not been decided yet. This means the time for testing is now. If you are unsure whether PPU could be beneficial for your company, feel free to get in touch with me or any of my colleagues at Lytix.

Lou Segers
Microsoft Analytics Consultant @ Lytix
Follow me on LinkedIn

Things to consider when creating a Data Lake

Have you ever wondered what a data lake is? What are typical use cases for such a lake? How can you benefit from a data lake? In this blog post, we will show you the added value of a data lake while pointing out some pitfalls and best practices.

Before diving into data lakes (ba-dum-tsss), let us start with an example that anyone can relate to. When I bought my first computer, there was only room for about 100 GB of files. This storage was quickly filled up with personal documents, scans, photos of friends and family and lots of music. Every three years, I buy a new laptop on which I ‘start fresh’ and on which I collect new photos, videos, music and files. Backups that you make to an external (cloud) drive make sure that you can sleep at night: you are covered if your laptop crashes, when you need to consult old files, when you want to restore an earlier version, … Moreover, you can store more files than the drive of your laptop can manage! In addition, these files are accessible anywhere and at any time, and sharing with friends and family becomes easy, as you can just right-click a folder/file and make it available for them, either via a URL or via a mail sign-in.

 

 

With a data lake, similar advantages apply to your organization. You export all kinds of files to your data lake: images, videos, operational systems, reports, logs, flat files, … Once they are in the lake, they are easily accessible whenever you need them in the future. While this may seem like a longshot today, you may get added value out of some data in the future. Having already built up history then provides a jumpstart for that analysis and results in more accurate insights.

The Layered Data Lake approach: Bronze, Silver & Gold

A pitfall we typically see is that all data is just dumped into one big data lake without any structure and then exposed to everyone. While exporting your data and building up history may already tick some boxes for your data lake implementation, it lacks long-term planning. Not all data is easy to work with for data engineers: technical keys need to be translated into terms that make sense, some files need aggregation and some files need to be merged together before you can work with them.

To tackle these problems, we suggest working with a layered approach (also referred to as the Multi-Hop Architecture):

    • Bronze Layer: A one-on-one copy of the data from the source into the data lake. ‘Bronze data’ is raw, untransformed, unmodified data, and all your sources land in this layer.
    • Silver Layer: Once a business case has been identified and requires analysis, the ‘raw Bronze data’ is transformed into sets of data that add additional value. This can imply replacing codes with meaningful values, adding sanity constraints, filtering out unneeded information, … Hence, resulting in concise, useful datasets that can be combined with other pieces of information as well.
    • Gold Layer: The gold layer then provides a well-constructed dataset ready for analysis by data scientists and business analysts. The data is presented in the way that appeals to them the most, which may include aggregations, joins and merges, encoding, etc.

In a typical scenario we harvest product data from different source systems. All these sources land in the Bronze layer as a one-on-one copy. The data from all these sources is then blended in the Silver layer, resulting in one single source of the ‘product’ information. This ‘product’ Silver dataset can then be used in several datasets that are presented to users in the Gold layer (e.g. a stock keeping unit dataset, market basket analysis, …). A minimal sketch of this flow is shown below.
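As a minimal PySpark sketch of this multi-hop flow (all paths, formats and column names are illustrative assumptions, not a prescribed implementation):

    # Bronze: raw one-on-one copies of the sources, landed in the lake as-is.
    erp_products = spark.read.option("header", "true").csv("/mnt/lake/bronze/erp/products/")
    shop_products = spark.read.json("/mnt/lake/bronze/webshop/products/")

    # Silver: clean and blend the sources into one single 'product' dataset.
    products_silver = (
        erp_products.selectExpr("product_code", "product_name", "unit_price")
                    .join(shop_products.selectExpr("product_code", "category"), "product_code", "left")
                    .dropDuplicates(["product_code"])
    )
    products_silver.write.mode("overwrite").parquet("/mnt/lake/silver/products/")

    # Gold: datasets shaped for analysis, e.g. the number of products per category.
    products_per_category = products_silver.groupBy("category").count()
    products_per_category.write.mode("overwrite").parquet("/mnt/lake/gold/products_per_category/")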

Technology: Azure Data Lake Storage Gen2!

We often recommend Azure Data Lake Storage Gen2 (ADLSGen2) to our customers. This ‘Gen2’ merges the best of the resources ‘Blob Storage’ and ‘Data Lake Storage’ (the latter to be deprecated). ADLSGen2 uses the Hadoop File System (= HDFS) to optimise searching and reading through large volumes of data. ADLSGen2 also allows you to implement very granular security: you choose who can access which folders and files.

Note: Typical on-premise data lakes do not provide out-of-the-box functionalities for implementing security and easily relocating files. In such scenarios, every user can access all data in the lake. We are all used to navigating through folders of information. The typical data lake tried to mimic this functionality by appending the folder names to the files (resulting in very long file names). Hence, renaming a folder actually executes a copy of the old [folder-filename] and a paste with the new [folder-filename] (= an expensive operation). In addition, as on-premise data lakes typically result in lots of disk management, you probably see why we recommend an out-of-the-box cloud technology.

ADLSGen2 makes use of hierarchical namespaces. This means that files are organised in (virtual) folders: querying your data lake (for example using Spark or Synapse SQL On-Demand) is sped up if your filters match your folders. As the folders are truly virtual (= not implemented as part of the filename), renaming and reorganising folders only takes mere seconds!
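A small illustration of that speed-up, assuming (hypothetically) that the files were written partitioned into year/month folders:

    # Illustrative layout in the lake: /mnt/lake/silver/sales/year=2020/month=11/...
    sales = spark.read.parquet("/mnt/lake/silver/sales/")

    # Because the filter matches the folder structure, Spark only reads the matching
    # folders instead of scanning the entire data lake.
    sales_nov_2020 = sales.where("year = 2020 AND month = 11")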

Wrap-Up

Data lakes can help you build up a trail of data that will bring value in the future. Azure Data Lake Storage Gen2 is the go-to Azure resource that helps us provide a data platform for your organisation. Interested in creating your own data lake? Contact us!


Sander Allert

BI Architect

Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all of its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. Do you need help with architectural decisions? Do not hesitate to invite Sander over for a coffee to share some ideas.