Process Mining: Understanding Simple Process Discovery Techniques using Python
Hi and welcome to this blog on process mining.
Process mining is a set of techniques from the field of process management and improvement that supports the analysis of processes based on event logs. Different algorithms can be applied to an event log to identify patterns and trends in your process.
Several tools for process mining already exist, such as Disco, Celonis and PAFnow (a Power BI plugin). Yet, thanks to the open-source Python library PM4PY, you can get started with process mining without committing to a tool straight away.
In this blog, I will show you how to get started with a process discovery analysis in Python. The data we will cover is open source and made available by Technical University of Eindhoven. The event log holds information on the execution of the receiving phase of a building permit application process.
For references and more information, I recommend visiting the PM4PY website. It provides a lot of ready-to-use code with elaborate explanations.
- Anaconda Navigator
- Jupyter Notebook
- Python Libraries: Pandas, Numpy and PM4PY (Process Mining for Python)
- Event Log Files (.XES, .CSV, .PNML).
- (.XES and .CSV files are provided already)
Packages needed for your analysis
Depicted below you see all the packages we imported into our workbook. The most important package used is the PM4PY package.
Import event log and read as Pandas Dataframe
The first step is importing the event log (in .csv format) into a Pandas DataFrame. Based on this DataFrame we can already conduct some basic data analysis.
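As a minimal sketch of this step, the snippet below reads a tiny inline CSV into a DataFrame and counts events and cases. The column names (`case_id`, `activity`, `timestamp`) and the rows are assumptions made for illustration; the real export will have its own column names and far more rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV export. Column names and rows
# are illustrative assumptions, not taken from the actual dataset.
CSV_TEXT = """case_id,activity,timestamp
c1,Confirmation of receipt,2011-06-01 09:00:00
c1,T05 Print and send confirmation of receipt,2011-06-01 10:30:00
c2,Confirmation of receipt,2011-06-02 08:15:00
c2,T10 Determine necessity to stop indication,2011-06-02 11:00:00
c2,T05 Print and send confirmation of receipt,2011-06-03 14:45:00
"""

# Parse the timestamp column as datetimes so it can be sorted and grouped.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["timestamp"])

# Order events chronologically within each case.
df = df.sort_values(["case_id", "timestamp"]).reset_index(drop=True)

n_events = len(df)                 # one row per event
n_cases = df["case_id"].nunique()  # one distinct id per case
print(f"{n_events} events in {n_cases} cases")
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.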
Basic Process Discovery Analysis
The log contains 8577 events. An event is one execution of an activity, that is, a step in the process that takes place for a given case. A series of successive activities forms a process variant.
A case on the other hand describes a business case that runs through a particular process variant. 1434 cases are available in the process. For this process, this means that we have 1434 requests for a building permit.
Activity Distributions over period
Most activities occurred during the period June 2011 – December 2011. There are some peaks and ups and downs.
Activity Distributions by days
Let’s re-aggregate and zoom in on the number of activities performed per weekday. Here it is clear that most activities are performed during business hours. Some activities are also performed during the weekend. These could, for example, be batch processes which run every weekend. It’s possible to perform a batch process analysis, but I will not go further into this.
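The aggregation behind such a chart can be sketched with the standard library alone. The timestamps below are made up for illustration; in the notebook they would come from the timestamp column of the event log.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps (assumed for illustration only).
timestamps = [
    "2011-06-06 09:00:00",  # a Monday
    "2011-06-06 14:30:00",  # a Monday
    "2011-06-08 11:15:00",  # a Wednesday
    "2011-06-11 02:00:00",  # a Saturday
]

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
events_per_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in parsed)

# The same idea per hour of day shows whether work falls in business hours.
events_per_hour = Counter(ts.hour for ts in parsed)

for day in WEEKDAYS:
    print(f"{day:<10} {events_per_weekday[day]}")
```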
Activity and Variance Analysis
Start and end Activities
It looks like our process has one starting activity, which is confirmation of receipt. All 1434 cases (building permit requests) start with this activity. The majority of the cases end with T10 determine necessity to stop indication or T05 print and send confirmation of receipt.
Process variants are variations in the process flow: a process variant is a unique path from the beginning of a process to the end. Briefly, a process variant is a unique sequence of activities. This process has 116 different variants, so a process map of the full log will probably look like spaghetti.
From the following Pareto chart we can conclude that almost 50% of the cases follow the sequence of activities from Variant 0. Now what is the sequence of Variant 0?
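Counting start and end activities comes down to looking at the first and last event of every case. The sketch below does this on a toy log; the three cases are invented, only the activity names are borrowed from the text above.

```python
from collections import Counter

# Toy log: case id -> activities in chronological order (assumed data).
traces = {
    "c1": ["Confirmation of receipt",
           "T05 Print and send confirmation of receipt"],
    "c2": ["Confirmation of receipt",
           "T10 Determine necessity to stop indication"],
    "c3": ["Confirmation of receipt",
           "T10 Determine necessity to stop indication",
           "T05 Print and send confirmation of receipt"],
}

# First activity of each case and last activity of each case.
start_activities = Counter(trace[0] for trace in traces.values())
end_activities = Counter(trace[-1] for trace in traces.values())

print("start:", dict(start_activities))
print("end:  ", dict(end_activities))
```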
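To make the idea of variants and the Pareto chart concrete, here is a small sketch that groups traces into variants, ranks them by frequency and computes the cumulative share of cases. The traces use placeholder activity names A, B and C and are assumptions for illustration; "Variant 0" is simply the most frequent sequence.

```python
from collections import Counter

# Toy traces, one tuple of activities per case (assumed data).
traces = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("A", "C"),
    ("A", "B", "C"),
    ("A", "B", "B", "C"),
]

# A variant is a unique sequence of activities; count cases per variant.
variant_counts = Counter(traces)

# Rank variants by frequency and accumulate their share of all cases.
total_cases = len(traces)
pareto = []
running = 0
for index, (variant, count) in enumerate(variant_counts.most_common()):
    running += count
    pareto.append((f"Variant {index}", variant, count, running / total_cases))

for name, variant, count, cumulative_share in pareto:
    print(f"{name}: {' > '.join(variant)} | {count} cases | "
          f"cumulative {cumulative_share:.0%}")
```

The last column is what the line in a Pareto chart plots: how many of the cases are covered once the top variants are included.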
Process Discovery using process mining algorithms
Directly-follows graph
Now it might be interesting to show our process map. For this we can use a directly-follows graph (DFG). A process execution starts from the places included in the initial marking and finishes at the places included in the final marking. DFGs are graphs where the nodes represent the activities in the log and an edge is present between two nodes if there is at least one trace in the log where the source activity is directly followed by the target activity. As you can see, this is quite a spaghetti-like process map.
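The edge rule of a DFG is simple enough to sketch by hand. The snippet below is not PM4PY's implementation, just a plain-Python illustration on a toy log with placeholder activities A, B and C.

```python
from collections import Counter

# Toy traces with placeholder activity names (assumed data).
traces = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["A", "B", "C"],
]

# An edge (source, target) exists when target directly follows source in at
# least one trace; the counter also records how often that happens.
dfg = Counter()
for trace in traces:
    for source, target in zip(trace, trace[1:]):
        dfg[(source, target)] += 1

for (source, target), frequency in sorted(dfg.items()):
    print(f"{source} -> {target}: {frequency}")
```

With 116 variants in the real log, the number of distinct edges grows quickly, which is why the resulting map looks like spaghetti.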
Simplify process map (focus on completed cases)
Let us simplify our process map by focusing only on the completed cases. Based on the activity names, I assumed that T15 Print document X request unlicensed is the desired end point of the process. To keep only the completed cases, I set the following filter: End event = T15 Print document X request unlicensed.
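The filter boils down to keeping the cases whose last activity is the chosen end event. A minimal sketch on a toy log (the three cases are invented, the activity names are taken from the text):

```python
END_ACTIVITY = "T15 Print document X request unlicensed"

# Toy log: case id -> activities in chronological order (assumed data).
traces = {
    "c1": ["Confirmation of receipt",
           "T05 Print and send confirmation of receipt",
           "T15 Print document X request unlicensed"],
    "c2": ["Confirmation of receipt",
           "T10 Determine necessity to stop indication"],
    "c3": ["Confirmation of receipt",
           "T15 Print document X request unlicensed"],
}

# Keep only the cases that end with the desired end activity.
completed = {
    case_id: trace
    for case_id, trace in traces.items()
    if trace and trace[-1] == END_ACTIVITY
}

print(f"{len(completed)} of {len(traces)} cases are completed")
```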
Alpha miner (discover simplified process using Alpha miner)
At the top left we have 2 activities: Adjust Confirmation of Receipt and T07 – Draft Intern Advice aspect 1 which are not connected in the shown petri net that was discovered by the Alpha Algorithm. This is because the Alpha algorithm can’t handle loops. The alpha algorithm assumes event logs to be complete with respect to all allowable binary sequences and assumes that the event log does not contain any noise.
Some downsides of using alpha algorithm are:
- It cannot handle loops of length 1 and length 2
- Invisible and duplicated tasks cannot be discovered
- Discovered model might not be sound (soundness is a generic correctness criterion, like normal forms in databases)
- Weak against noise
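The loop limitation can be made visible with the ordering relations that the alpha algorithm derives from the log. The sketch below only computes those relations (the footprint), not the full algorithm, on a toy log with placeholder activities in which B and C form a loop of length 2.

```python
from itertools import product


def footprint(traces):
    """Return the alpha-style relation for every ordered pair of activities."""
    activities = sorted({a for trace in traces for a in trace})
    follows = {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}
    relations = {}
    for a, b in product(activities, repeat=2):
        forward, backward = (a, b) in follows, (b, a) in follows
        if forward and not backward:
            relations[(a, b)] = "->"   # causality: a is followed by b only
        elif backward and not forward:
            relations[(a, b)] = "<-"   # reverse causality
        elif forward and backward:
            relations[(a, b)] = "||"   # seen in both orders: treated as parallel
        else:
            relations[(a, b)] = "#"    # never directly follow each other
    return relations


# Toy log (assumed data): B -> C -> B is a loop of length 2.
traces = [["A", "B", "C", "B", "D"], ["A", "B", "D"]]
relations = footprint(traces)

print("A vs B:", relations[("A", "B")])
print("B vs C:", relations[("B", "C")])
```

Because B and C follow each other in both directions, the relation between them comes out as parallel rather than as a loop, so the causal link needed to connect the loop activity is never derived. A single stray event in a noisy log can flip a relation in the same way.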
The heuristic miner is a better approach here because it can handle certain types of noise, so on a noisy log it gives a better result than the alpha algorithm. The outcome already looks much better. We see that the activities which weren’t well integrated in the Petri net discovered by the alpha algorithm are now properly integrated. The heuristic net gives more information on the reliability of the paths used and is therefore more suitable for determining the main process.
Some upsides and downsides of the Heuristic Miner
- Takes frequency into account
- Detects short loops
- Does not guarantee a sound model
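What "takes frequency into account" means can be sketched with the dependency measure commonly cited for the heuristic miner: (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), where |a>b| is how often b directly follows a. The toy log and the threshold of 0.6 below are assumptions for illustration, and PM4PY's own implementation may differ in its details and defaults.

```python
from collections import Counter


def dependency(dfg, a, b):
    """Dependency measure between a and b based on directly-follows counts."""
    ab, ba = dfg[(a, b)], dfg[(b, a)]
    if a == b:
        return ab / (ab + 1)           # length-1 loop variant of the measure
    return (ab - ba) / (ab + ba + 1)


# Toy log (assumed data): nine regular traces and one noisy trace.
traces = [["A", "B", "C"]] * 9 + [["A", "C", "B"]]

# Directly-follows counts.
dfg = Counter()
for trace in traces:
    for source, target in zip(trace, trace[1:]):
        dfg[(source, target)] += 1

THRESHOLD = 0.6  # hypothetical cut-off chosen for this example

# Keep only the edges whose dependency value reaches the threshold.
kept = {pair for pair in dfg if dependency(dfg, *pair) >= THRESHOLD}

for pair in sorted(dfg):
    print(pair, round(dependency(dfg, *pair), 3))
```

The frequent paths A to B and B to C score well above the threshold, while the edges that only appear in the single noisy trace fall below it and are dropped.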
PM4PY is a great tool to kick off with process mining. It offers a lot of functions for process analysis, and the functions covered in this blog are just a fraction of all the possibilities.
I hope you learned more about how you can start with your process discovery analysis by first conducting a high level analysis.
High level analysis:
- Activity Distributions
- Activity and Variance Analysis
- Process performance
Next to the high level analysis, we also looked at ways to discover your process using process mining algorithms such as the alpha miner and the heuristic miner, which produce a Petri net or a heuristic net. These Petri nets can afterwards be used to perform conformance checking by token-based replay. With token-based replay you can put your ‘as is’ event log on top of the desired process Petri net. This is something I will cover in upcoming blogs.
SAP Data Analytics consultant