Lesson learned at the customer: Merging two tables to calculate On Time In Full (OTIF). [Part III]


In this final part, we discuss the measures that need to be created.

Note: this is the third part of the blog. Missed the second part? Read it here.

Calculations

From here on, we walk through the calculations needed to arrive at the time aspect of OTIF.

Step 1

Still in the Power Query editor, with our new merged table selected, we start off by adding a new column to the table.
Under the “Add Column” tab we select “Custom Column”.

 

The first custom column we create calculates the date difference between the Requested Delivery Date and the Actual Delivery Date.
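For reference, the generated step could look roughly like the sketch below. The step and column names are assumptions (the original post showed this as a screenshot); the idea is simply to subtract the two dates and express the result in days.

    // Sketch: date difference in days between actual and requested delivery
    // "Merged" and the column names are placeholder names
    = Table.AddColumn(
        Merged,
        "DateDifference",
        each Duration.Days([Actual Delivery Date] - [Requested Delivery Date]),
        Int64.Type
    )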

 

 

Step 2

Repeat step 1, but this time calculate an OnTime flag.
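A sketch of what this flag could look like, assuming that “on time” means delivered on or before the requested date (the exact definition in the original screenshot may differ):

    // Sketch: 1 = on time, 0 = late, based on the DateDifference column from step 1
    // Step and column names are hypothetical
    = Table.AddColumn(
        #"Added DateDifference",
        "OnTime",
        each if [DateDifference] <= 0 then 1 else 0,
        Int64.Type
    )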


Step 3

Lastly, we are going to calculate the measures for the time aspect.
I split up the measures since this generally allows for cleaner code. We only need two measures to calculate our OTIF time aspect. Be aware that these measures are probably not the exact answer to your problem: they are context and data model dependent, but the ones you see here can help you on your way.
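As an illustration only (the original measures were shown as screenshots), two measures along these lines could cover the time aspect; 'Merge' and the OnTime column are the names assumed in the earlier steps:

    // Sketch in DAX: count the on-time rows, then express them as a percentage
    OnTime Rows :=
    CALCULATE ( COUNTROWS ( 'Merge' ), 'Merge'[OnTime] = 1 )

    OTIF Time % :=
    DIVIDE ( [OnTime Rows], COUNTROWS ( 'Merge' ) )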

And there you go! You have now successfully calculated your OTIF time aspect values!
Below you can find an image of how we situated our table in the model.

In order to apply filtering on our new table, we placed the merged table between our Orders and Deliveries tables in the model, with a one-to-many, single-directional relationship to Deliveries and a one-to-many, bidirectional relationship to Orders.


The reason for the bidirectional relationship is that it allows us to filter the table through our calendar dimension. Right now, filtering on the calendar table filters the Orders table, which in turn filters the SalesDocumentSalesItemKey column in our Merge table, but this could very well be a different column to your liking.

Tibo Vandermeersch


Tibo is a BI consultant with a focus on Power BI and data modelling. His interests mainly lie in writing DAX and developing Paginated Reports and Power Apps.

Lesson learned at the customer: Merging two tables to calculate On Time In Full (OTIF). [Part II]


Bringing everything together in one table

Note: this is the second part of the blog. Missed the first part? Read it here.

Step 1

Select the new Orders table you just made and click “Append Queries as New”.

 

Now append the newly created Orders table with the newly created Deliveries table.

If you have more than two tables to append, select “Three or more tables” at the top of the Append window.
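Behind the scenes, the append step generates something like the sketch below (the query names are assumptions based on how referencing names the new queries):

    // Sketch: append the two referenced queries into one table
    = Table.Combine({#"Orders (2)", #"Deliveries (2)"})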

Step 2

Remove all columns except for the key column (SalesDocumentSalesItemKey).

Now you have one column containing the values of SalesDocumentSalesItemKey from both the Orders table and Deliveries table.
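In M this boils down to a single step along these lines (the step name is an assumption):

    // Sketch: keep only the key column of the appended table
    = Table.SelectColumns(#"Appended Query", {"SalesDocumentSalesItemKey"})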

Step 3


It is expected that this new appended column now contains duplicates, so we remove the duplicates from this column again. This leaves us with a table of unique key values that we can merge with the new Orders and Deliveries tables to extract the needed columns and their related values.
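A sketch of the generated step (the step name is an assumption):

    // Sketch: keep one row per key in the single-column appended table
    = Table.Distinct(#"Removed Other Columns")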


Step 4

Make sure you are now on the appended table and merge the appended table with the new Orders table to add the Requested Delivery Date column.

It’s not necessary to make a new table from this unless you want to segment your steps in different tables.

Go ahead and select the first table you want to merge with. In this case we will merge with the new Orders table.

Select the two key columns on which the join will be made. Here we select SalesDocumentSalesItemKey for both tables.


After doing this, you still need to make sure that the correct kind of join is selected. We chose “Left Outer (all from first, matching from second)”.
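The generated merge step could look roughly like this (query and step names are assumptions):

    // Sketch: left outer join of the appended key table with the new Orders table
    = Table.NestedJoin(
        #"Removed Duplicates", {"SalesDocumentSalesItemKey"},
        #"Orders (2)", {"SalesDocumentSalesItemKey"},
        "Orders (2)", JoinKind.LeftOuter
    )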

Step 5

Your values are now joined and you can expand the table again. It’s handy to include all columns so that you can verify later whether the join was made correctly. It’s also good to check the box “Use original column name as prefix” so that you know later where each column comes from and which ones you can delete at the end.
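The expand step then looks something like this sketch (names are assumptions):

    // Sketch: expand the nested table, prefixing columns with the source query name
    = Table.ExpandTableColumn(
        #"Merged Queries", "Orders (2)",
        {"SalesDocumentSalesItemKey", "Requested Delivery Date"},
        {"Orders (2).SalesDocumentSalesItemKey", "Orders (2).Requested Delivery Date"}
    )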

Step 6

Repeat steps 4 and 5, but with the new Deliveries table instead of Orders.

Step 7

The columns from the new Orders table and the new Deliveries table are now combined in one big table.
Check whether the values look correct. To be certain that you don’t have duplicates, you can use Group By on the SalesDocumentSalesItemKey column (from the append step) again to verify. Don’t forget to delete this verification step afterwards.

Step 8

Remove the SalesDocumentSalesItemKey columns that were merged in from the Orders and Deliveries tables, and rename the remaining date columns to something easier to work with.

 

Step 9

Since you now have two tables that won’t be used for reporting and don’t need to be added to the model, it’s good to uncheck “Enable load” for these tables.

This blog will continue in part 3 where we discuss the measures that have to be made.

Tibo Vandermeersch


Tibo is a BI consultant with a focus on Power BI and data modelling. His interests mainly lie in writing DAX and developing Paginated Reports and Power Apps.

Lesson learned at the customer: Merging two tables to calculate On Time In Full (OTIF).


It’s always fun to make your customers happy and learn new things simply by doing your job. That’s what we did at one of my customers recently. Together with a colleague at the customer, we figured out a way in Power BI to calculate the OTIF over multiple fact tables. We thought this would be interesting to share for both beginner and intermediate profiles.

And so we did! In this blogpost we will explain the process we went through to make the OTIF calculation happen.

What is OTIF?

To start, let us first explain what OTIF is for those who don’t know it yet. “OTIF” stands for “On Time In Full”. It is a measure of how many of your orders are delivered on the requested delivery date and how many of your orders are delivered in the correct quantities.
This means that there are two aspects, both represented as a percentage of the total number of SalesDocumentSalesItemKeys (in our situation), that have to be measured before being able to follow up on an OTIF analysis:

  • Time aspect
  • Quantity aspect

The quantity aspect is simple: you compare the ordered quantity to the delivered quantity and, if there is no difference, you count this as an “In Full” delivery.

The Time aspect is less simple. We have to figure out a way to compare the Requested Delivery Date, situated in the Orders table, to the Actual Delivery Date, which is the regular date column in our Deliveries table.

 

The problem statement

Initially, one would just make a measure to calculate the difference between the Requested Delivery Date from Orders and the Actual Delivery Date from Deliveries on the spot. Though this was an option, it wouldn’t have been the best one due to the sheer size of the dataset and the fact that we are calculating this over multiple fact tables. That’s why we thought it best to move this calculation to the transformation part of Power BI and use Power Query to calculate the Time aspect of OTIF.

 

The solution

The first thing we did was go to the transformation part of Power BI and reference (not duplicate) the Orders and Deliveries tables. Referencing makes sure that any change to the tables only has to be made in the original and not twice, once in the original and once in a copy.
We reference these tables because we will eventually reduce them to just the columns we need to bring together in one table. These columns are the following:

  • From Orders:
    • SalesDocumentSalesItemKey
    • Requested Delivery Date
  • From Deliveries:
    • SalesDocumentSalesItemKey
    • Calendar Day
      • This is the normal date column that will become the Actual Delivery Date later on.

The first table to start with is the Orders table, since this is where a logistics-related process begins. This is also the table where, we would assume, you can find the most SalesDocumentSalesItemKeys.

 

Where to Perform Transformations?

Within this blog we will use Power Query for the data transformations. However, we could just as well have done them using DAX expressions. Why do we prefer Power Query over DAX? As a general rule of thumb, we want to perform transformations as close to the source as possible. Hence, if the source system had allowed data transformations (for example in SQL), we would have done them there instead.

 

How to navigate to the Transform Data part of Power BI

Orders

Step 1

We started out by exploring our dataset and getting to know it. First, we checked whether there are duplicates in the SalesDocumentSalesItemKey column by using the Group By functionality in Power Query. This is also a good way to figure out whether your table is a historical table or whether it already contains only the most up-to-date record for every key.

After having figured out whether there are any duplicates for your key column, don’t forget to delete this step; it is merely exploratory.
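One way to do this exploratory check in M is sketched below (names are assumptions): group on the key, count the rows per key and keep the keys that occur more than once.

    // Sketch: list the keys that appear more than once
    = Table.SelectRows(
        Table.Group(Orders, {"SalesDocumentSalesItemKey"},
            {{"RowCount", each Table.RowCount(_), Int64.Type}}),
        each [RowCount] > 1
    )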

Step 2

Figure out which column causes the duplication by filtering on one key (with duplicates, of course) and looking for columns that show differences in record values.
For us this was a column in which the stages of the process were saved (i.e. for an order: processing – fulfillment – fulfilled).

Next, find the column that allows you to filter out the most up-to-date record for all instances of your key. In our example, the most recent record was the one with the latest “Created on” date.

Let’s call this kind of column the “MRC” (Most Recent Column) for future reference.

Step 3

We used the Group By functionality to keep, for every key, the record with the most recent value of the MRC. To do this, apply the Max operation on the MRC. Make sure to use the advanced section of the Group By to add the rest of the columns as well.

Your result will look something like the following:

For the column that’s generated which contains the rest of the table, click the arrows on the right-hand side of the column header. This opens the expand menu where you can select the columns you still need to include in the table you are creating.
Here you deselect all columns and search and select the column(s) you need for the calculation of the time aspect of your OTIF.

Now is also a good moment to rename the column(s) from the generated name(s) to something simpler and more meaningful.
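An excerpt of what the query could look like at this point is sketched below, assuming “Created on” is the MRC (all names are assumptions): the Max of the MRC becomes the “Latest Record” column, the remaining columns are kept as a nested table, and only the column needed for the time aspect is expanded afterwards.

    // Sketch: group per key, keep the Max of the MRC plus the remaining columns
    #"Grouped Rows" = Table.Group(
        Orders, {"SalesDocumentSalesItemKey"},
        {
            {"Latest Record", each List.Max([Created on]), type nullable datetime},
            {"All Rows", each _, type table}
        }
    ),
    // Sketch: expand only the column needed for the time aspect
    #"Expanded All Rows" = Table.ExpandTableColumn(
        #"Grouped Rows", "All Rows",
        {"Requested Delivery Date"}, {"Requested Delivery Date"}
    )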

Step 4

Now you can go ahead and delete the MRC column (here “Latest Record”) since you have no use for it anymore.

Step 5

Remove duplicates based on your key column (SalesDocumentSalesItemKey).
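In M this is a one-liner (the step name is an assumption):

    // Sketch: remove duplicates based on the key column only
    = Table.Distinct(#"Removed Columns", {"SalesDocumentSalesItemKey"})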

 

Deliveries

Step 1

Repeat steps 1-3 from Orders, but instead of searching for the Requested Delivery Date in step 3, search for the column(s) that contain the Actual Delivery Date.

We will keep referring to this column as “Actual Delivery Date”.
Don’t forget to rename the columns like you did in step 3.

Step 2

Correspondingly, you can now go ahead and remove the MRC column that you created with your Group By.

Step 3

You guessed it! It’s time to remove the duplicates again based on the SalesDocumentSalesItemKey.

 

This blog will continue in part 2, where we bring everything together in one table.

Tibo Vandermeersch


Tibo is a BI consultant with a focus on Power BI and data modelling. His interests mainly lie in writing DAX and developing Paginated Reports and Power Apps.

Kimball in a data lake? Come again?


 

Introduction

Most companies are already familiar with data modelling (be it Kimball or any other modelling technique) and data warehousing with a classical ETL (Extract-Transform-Load) flow. In the age of big data, an increasing number of companies are moving towards a data lake using Spark to store massive amounts of data. However, we often see that modelling in a data lake is not done at all, or not done properly. This most often happens because the modelling possibilities are not that well known and data engineers have less of an eye for data modelling than classical BI profiles. In this blog post, I want to pave the way to proper data modelling in a data lake.

 

 

Kimball

First of all, for the people who need a refresher, let’s talk about Kimball modelling. Kimball modelling consists of creating star schemas based on facts and dimensions. Let me illustrate this technique with an example. In the image below, we have modelled the business process of a sales transaction as the fact. In the dimensions, we add the descriptive attributes of the fact table. In this particular example, every sale is linked to a certain product, a certain date and a store.

Implementing a Kimball-modelled Data Platform has the following advantages:

  • The data platform is more user-friendly by providing a simplified data model.
  • The data model supports retrieving data in the most efficient way (= no complex joins).
  • Adhering to a ‘modelling standard’ makes our data lake more future proof (no need to reinvent the wheel).

 

Slowly changing dimensions

Now let’s say that, at a certain moment in time, the address of a store changes. In current data warehouses this is easily resolved using the principle of SCD (Slowly Changing Dimensions). The most used types are SCD1, meaning that you overwrite the old address of the store with the new address, and SCD2, meaning that you keep both the old address and the new address in the data warehouse, usually with a valid-from and valid-to date to indicate the newest record. Creating slowly changing dimensions in a data warehouse is not hard, because SQL supports operations such as UPDATE and MERGE.

 

Kimball in a data lake

Moving to a data lake and performing Kimball modelling is challenging, and this has several reasons:

  • File formats such as .csv, .parquet, etc. don’t support updates and merges and are immutable, which makes it really hard to create SCDs and perform incremental loading. Moreover, writing SCD functions in PySpark is a difficult, error-prone process that can consume a lot of memory.
  • In Spark, data is stored across several nodes. Hence, a lot of knowledge is needed about how the data is stored and which optimizations (such as partitioning) can be applied to increase the speed of operations. If you do not take this into account, it is very well possible that a simple query takes a long time because it is performing a lot of shuffles (reading and combining data from several nodes of the Spark cluster). Taking all these difficulties into account can be a big hurdle for classical BI profiles moving to a big data setup.

 

Spark 3.0 and delta lake to the rescue

To solve these issues, we have two new big boys on the block: Spark 3.0 and Databricks Delta Lake.

Let’s first focus on Databricks Delta Lake. Delta files are a new file format that extends the functionality of Parquet with a set of features that make them perfectly suitable for building a data warehouse. They do this by providing a transaction log in addition to the actual data. This transaction log supports:

  • ACID (atomicity, consistency, isolation and durability) transactions to guarantee data validity despite errors, power failures or other incidents. Every operation (update, delete, remove, …) executed on the data is recorded as ordered, atomic commits to the transaction log.
  • Time travel: this is a feature that can be used to restore a previous version of the data, in case a pipeline has corrupted your data file. This is either done by keeping a timestamp or by keeping a version in your data.
  • Schema evolution: When the schema of your data changes, this can easily be added to the delta table without having to delete the entire table.
  • Updates and deletes: a very important one for the people coming from a data warehouse. Delta files support updates and deletes, not only in the Python syntax but also in classical SQL syntax, as shown below.
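For illustration only (table and column names are made up, matching the store example above), an update and a merge on a Delta table in Spark SQL could look like this:

    -- Sketch: update a record in a Delta dimension table
    UPDATE dim_store
    SET address = '42 New Street'
    WHERE store_id = 1;

    -- Sketch: upsert changed store records from a staging table
    MERGE INTO dim_store AS t
    USING stg_store AS s
      ON t.store_id = s.store_id
    WHEN MATCHED THEN
      UPDATE SET t.address = s.address
    WHEN NOT MATCHED THEN
      INSERT (store_id, address) VALUES (s.store_id, s.address);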

Here, I have listed the most important features for this blog, but feel free to check out all other features at their website.

Now, Delta files have solved the SCD problem quite nicely for us! And as it happens, Spark 3.0 solves our next problem, which is the difficulty of optimizing queries. I want to focus on the following two improvements in Spark 3.0:

  • Dynamic partition pruning (say what?): let’s explain this with an example. Suppose we have a fact table and a dimension table that we are joining. When we have nicely partitioned the fact table to optimize a query and we filter on the fact table, nothing goes wrong. But if we perform the query with a filter on the dimension table, older versions of Spark will not make use of the partitioned fact table and will perform a full scan, which is horrible for performance. In the new version of Spark this has been taken care of: the filter on the dimension table is pushed towards the fact table, thereby making use of the partitioned data.
  • Adaptive query execution: in earlier Spark versions, it was the responsibility of the data engineer to reshuffle data across nodes in order to optimize query execution. Spark 3.0 ships with the Adaptive Query Execution (AQE) framework. After every stage of the Spark pipeline, this optimizer automatically checks whether certain optimizations can be applied (switching join strategies, dynamically coalescing shuffle partitions, optimizing skew joins) in order to improve query speed. This all sounds very technical, but fortunately it is now all done for you!

Conclusion

As this blog shows, it is wrong to think of a data lake as just a data dump where Kimball modelling is impossible. With these new big boys on the block, it is perfectly possible to have a good Kimball modelling strategy in your data lake, without needing extensive data engineering experience. Have any questions on the execution of this strategy? Please contact us!

Tom Thevelein


This blog was written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to get his hands dirty optimizing Spark (in any language), implementing data lake architectures and training algorithms.

Process Mining, the pounding heart of the intelligent enterprise


Process Mining provides you with a quick, objective diagnosis of the actual execution of your business processes by extracting the necessary information from your IT systems. In the past this was a tedious task based on lots of interviews and, in more than 98% of cases, a misrepresentation of reality. Process Mining removes that unreliability and unlocks far more untapped potential, all of it in Power BI, giving you the flexibility to do a 360° analysis of your processes. Wondering what Process Mining could mean for you? Just keep on reading.

 

Interested in what this untapped potential could mean for you? Process Mining has a lot of benefits. To name a few clear ones: eliminating unnecessary steps, detecting bottlenecks, shorter lead times, enabling process automation, compliance benefits & fraud detection, … Typically these add to the bottom line by reducing costs and improving customer experience and compliance.

 

The top 3 gains we typically see in Process Mining projects are:

  • Reduced costs by eliminating loops, rework and unnecessary steps
  • Improving throughput time of a process
  • Process standardization, which leads to easier RPA or automation and also easier software migrations. Think of the importance of having your processes exactly mapped out before you start selecting a new ERP.

That is why we often experience Process Mining as the pounding heart of the intelligent enterprise. Process Mining has the ability to show you the full process reality across your organization. By doing so, you have all the power at your fingertips to make changes in your processes where your organization would benefit most.

You are probably wondering, how does this work?

Let us show you.

As your organization executes its business model, which is translated into processes, it leaves a digital trace. These digital traces are the transactions in your operational systems.

Every transaction in your operational system leaves a digital trace. The combination of these traces is what we call an event log. Too fast? Let us show you what an event log looks like in the end after we have extracted it from your operational systems.

Basically, an event log needs at least three important fields to be able to visualize your processes:

  • Case Id: This identifies one specific process execution from start to finish, e.g. a paid invoice grouped with all its preceding steps such as Goods Receipt, Purchase Order & Purchase Request.
  • Activity Name: These are typically the names of your different process steps, e.g. Purchase Order Creation, Change PO Price, Goods Receipt, … So every name of a step in the process.
  • Timestamp: This is the easiest part: it is simply the recorded time of the activity. It makes it possible to calculate the duration between the execution of the different activities.
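To make this concrete, a minimal event log (with made-up values) could look like this:

    Case Id | Activity Name            | Timestamp
    4711    | Purchase Request         | 2021-01-03 09:12
    4711    | Purchase Order Creation  | 2021-01-05 14:30
    4711    | Goods Receipt            | 2021-01-12 08:05
    4711    | Invoice Paid             | 2021-01-20 16:45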

Based upon this event log we build Process Mining insights. Some typical scenarios here are:

  • Root Cause Analysis: Do you want to know the root cause of certain bottlenecks in your processes?
  • Benchmarking: Are you interested why your plant in Italy is more productive than your plant in Germany?
  • Automation: Do you want to know which (sub-) processes are prime candidates to automate?
  • And many more:

Here is just one straight-forward example of a process tree that is automatically generated from the event log.

Why should you invest in Process Mining now?

Process Mining is a domain that has existed for many years. However, it is only recently that it has stepped out of the shadows of the academic world and been made practical and pragmatic enough to prove instant business value. It is now available with a low entry barrier in your favorite reporting tool Power BI, together with PAFnow’s highly specialized data visualisations.

Another reason we see why Process Mining should be a top priority is the hyper-speed of digitalization we are in. Let us explain. Having a 100% accurate process overview enhances your agility as a business. Examples here are: Robotic Process Automation, more accurate vendor selection for new applications that fully align with your processes, next-step prediction based on previous process executions, … These are more and more becoming the game-changers to transcend your competitors.

 

Interested?

Reach out directly to one of our Process Mining experts: ProcessMining@Lytix.be

Or simply attend our next webinar on Process Mining, scheduled on the 21st of January, 15h. https://bit.ly/LytixPM

Our Process Mining Experts: