One Year at Lytix – Data Consultant Tibo Vandermeersch

I’ve been working at Lytix for a year now, so I thought it would be interesting to share my experience of that first year as a Data Analytics Consultant.

Training

 

From day one, the other starters and I jumped straight into a training program that Lytix put together. As someone just starting out as a Data Analytics Consultant, I certainly needed it.

During this training, I learned and refreshed not only the hard skills needed to do my job but also the soft skills that come with consultancy.
People coming straight out of school start in a group of new hires who all follow this training together.
Most of the training sessions are given by people from Lytix themselves, which immediately gives you, as a newcomer, someone to contact if you need guidance or expertise in a certain field. And people here actively invite you to do so.

The training has definitely come in handy. It’s also a good way to gauge the kind of knowledge and expertise that lives within Lytix.

 

Working experience

 

Even during the training period, I was already deployed at a customer under the guidance of a more senior profile, which allowed me to learn on the job.

At Lytix, they try not only to find the right person for the job at the customer but also the right customer for the person doing the job. Of course, it’s not always possible, but when it is, they think about the most suitable place for you.
I personally have been very happy with the customers I got to work with. I felt comfortable doing my job and also felt like I learned from my customers. You won’t only get feedback from your colleagues at Lytix; you often get it from your customer as well, which is something I appreciate a lot.
I like to approach feedback in the following way: “There is no such thing as bad feedback. It’s only the way it’s delivered that can be bad.” Don’t be afraid to ask the customer whether you have any points to work on. Approach it in an open-minded fashion.

Lytix gives you opportunities to learn and gain experience quickly. This allowed me to work with the data of 7 different customers in one year. Working in this many contexts taught me how to handle different kinds of expectations, how to get to grips with a dataset I didn’t know yet, different ways of working, and how to ask the right questions in the early phases of a project.

 

Development

 

At Lytix there is a lot of room for development, whether it’s something small or big. As long as it’s job-related, they will stand behind your proposal if you have one.

There are regular check-in moments to figure out where your personal interests are heading in terms of self-development. On top of that, they can always recommend something development-wise that might suit you. Being self-reflective doesn’t mean you have to find all the answers yourself; simply asking someone what they think would be interesting for you is also a form of self-reflection.
I, for example, am someone who learns best by doing something and figuring it out, which gets put to good use at Lytix. I sometimes get challenged by my customers, and it’s at those moments that I feel I am growing professionally the most. This gives you the opportunity to gain more knowledge.
And again: when a customer plans in a moment for personal feedback with you, I appreciate that a lot. It tells you a great deal about how you are meeting expectations, but also about what you can work on for yourself. They know that I’m young and don’t know everything, but they let you know what you can work on based on their experience. This also helps you prevent mistakes.

There’s a lot to learn at Lytix, not only about the job itself but also about yourself. I, for example, have learned that I’m better at certain things than I initially thought.

 

Teamwork

 

Say that you’re at a customer and they ask you about something in which you have zero experience. You can always go ahead and ask someone at Lytix if they can help you out. Eventually, you will get there.

The people at Lytix and Cubis (Lytix’s sister company) know that not everybody can do everything. That is why it is easy to contact someone with more expertise in a certain subject. On my most recent project, a customer asked which option was better: connecting their customer portal to SAP directly (and how one would do that), or connecting SAP to Power BI and hooking the customer portal up to that. Not knowing a lot about SAP myself, I went looking at our sister company, Cubis, for someone who did. Within a day a meeting was scheduled and an answer was given. A couple of days later (contract negotiations included), someone from Cubis implemented a solution that satisfied the customer.

Next to having easy access to expertise across platforms, there are also fun activities organized to build good, friendly relationships between employees. This way everyone gets to know each other a little better, and you end up knowing who has expertise in which subjects.

Apart from the things mentioned above, Lytix and Cubis have monthly internal status updates and regular knowledge-sharing sessions in the form of Lytix’s “Bits & Bokes”, where someone from Lytix shares a piece of interesting knowledge in a short 30-minute session during the lunch break.

Tibo Vandermeersch


In case you have questions regarding MS BI, don’t hesitate to contact Lytix for more information!  

Power BI and Spotify – A match made in heaven?

Data isn’t life. They say music is. Well, is it? Wouldn’t it be awesome to simply combine both worlds? Let’s find out if this could turn out to be a perfect marriage.

Today, with approximately 286 million users to date, including 130 million subscribers, Spotify can be considered one of the world’s most popular music streaming services. In a remarkable coincidence, I happen to consume their service every once in a while myself. Actually, that’s an understatement, as I stream music basically on a daily basis: during travel, while developing, while exercising, or even while writing this blog at home.

What would we like to find out?

With regard to my streaming behavior, I came up with the following questions:

> What artists did I favor the most?

> What songs did I listen to the most?

> Can certain events, such as festivals, cause a change in my streaming behavior?

> When did I stream music?

What KPIs can we define to provide the answers?

Of course, we need to quantify these questions with KPIs. I believe there are two ways of measuring whether I like or dislike a specific song. The first is the time I actually listened to the song, which I will refer to as the listening time (in minutes). The second KPI is the song count: the number of times a song started, regardless of whether I manually pressed the play button or Spotify’s autoplay started the song.
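To make these KPIs concrete, here is a minimal Python/pandas sketch that computes both of them from the streaming-history export you can request via Spotify’s “Download your data” page. The file name and field names (endTime, artistName, trackName, msPlayed) reflect that export at the time of writing and should be treated as assumptions rather than guarantees.

```python
import pandas as pd

# Load Spotify's streaming history export (field names assumed from the
# "Download your data" package: endTime, artistName, trackName, msPlayed).
history = pd.read_json("StreamingHistory0.json")

# KPI 1: listening time in minutes per play
history["listening_minutes"] = history["msPlayed"] / 60000

# KPI 2: song count = number of times a track was started
kpis = (
    history.groupby(["artistName", "trackName"])
           .agg(listening_minutes=("listening_minutes", "sum"),
                song_count=("trackName", "size"))
           .reset_index()
           .sort_values("listening_minutes", ascending=False)
)

print(kpis.head(10))
```

The resulting table can be loaded straight into Power BI as the basis for the visuals discussed below.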

Where can we gather the data we need?

Spotify allows its users to retrieve data through the Spotify API. Recently, they launched a refined version of this API which, at the moment of writing, is still in beta. I can only praise Spotify for how well they document their API: they offer a clear description of how to connect to it, written-out call definitions, and even a test console for developers. More information can be found here.
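As an alternative to the bulk data export, here is a hedged sketch of pulling recent listening data straight from the Web API’s “recently played” endpoint. The access token is assumed to come from Spotify’s OAuth authorization flow, and the response fields used are those documented at the time of writing.

```python
import requests

# Illustrative only: fetch the 50 most recently played tracks.
# ACCESS_TOKEN is assumed to be an OAuth token obtained through Spotify's
# authorization flow (see their developer documentation).
ACCESS_TOKEN = "<your-oauth-access-token>"

response = requests.get(
    "https://api.spotify.com/v1/me/player/recently-played",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"limit": 50},
)
response.raise_for_status()

for item in response.json()["items"]:
    track = item["track"]
    print(item["played_at"], track["artists"][0]["name"], "-", track["name"])
```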

Evaluation time!

As we now have all data in place, we can assemble our very own Spotify dashboards with Power BI. Let’s analyze my streaming behavior of the past year by trying to formulate an answer to our predefined questions. 😊

What artists did I favor the most?

Post Malone seems to be the clear winner based on my streaming behaviour; there’s no doubt there. I tend to have a preference for R&B, with Tyga and Blackbear claiming a spot in my top 5 artists.

What is interesting here is that my top 5 artists differ slightly when switching between our two predefined KPIs. Spots 1 to 3 are identical regardless of the KPI we base ourselves on. However, in terms of song count, Nickelback and Loud Luxury are my numbers 4 and 5 respectively, whereas in terms of play time, Loud Luxury and Blackbear take those positions. This can have two possible explanations:

1) Nickelback songs are simply shorter than Blackbear songs

or

2) I’m excited to hear the refrains of Nickelback songs and get bored afterwards, forcing myself to switch songs once the refrain has passed.

What songs did I listen to the most?

It won’t surprise you that I listened more than 150 times to Post Malone’s top songs, such as A Thousand Bad Times, Sugar Wraith and Goodbyes.

Can certain events, such as festivals, cause a change in my streaming behavior?

Yes! Last August I attended the Sziget festival in Hungary. According to my streaming history, I enjoyed Post Malone’s performance so much that in the months that followed, he skyrocketed to become my number one streamed artist.

When did I stream music?

It turns out I’m actually getting value out of my monthly subscription: over the past year, I streamed music on 357 out of 365 days. My listening time stays more or less steady throughout the week. During the work week, I mainly stream music while travelling between home and work. And in case you were wondering: no, I’m not an early bird. 😊

Wrap-Up

This article described one way of visualizing your Spotify streaming behavior with Power BI. Of course, there are many more insights these reports could be enriched with. I tried to provide a high-level summary of how I created this analysis. If you’re interested in the technical details or would like help with your own data visualization challenges, do not hesitate to get in touch with me or any of my colleagues at Lytix.

All aboard the data train today, CHOO CHOO 🚂 

Lou Segers
Microsoft Analytics Consultant @ Lytix
Follow me on LinkedIn

10 Performance tips to speed up your Power BI reports

You arrive at work, and the first thing you do is open your Power BI report and go for a coffee while waiting for it to render. Sound familiar?

Well, it shouldn’t. Your reports should render instantly, especially if there is a large or C-level audience consulting them on a regular basis. Luckily enough there are some tips and tricks to melt your frustrations like snow in the sun.

1. Only keep necessary visuals

Every single additional visual needs time to render and will therefore slow down your report. If, in some cases, such as when relying on bookmarks, you really see added value in enriching your report with a bunch of shapes or data visualizations, please do hide them whenever possible. Hidden visuals are not rendered, so they have no impact on report performance.

2. Use integers instead of text values whenever possible

String values do not compress as well as integer values. By replacing string values with integer values, preferably at source level, both the .pbix file size and the total memory footprint will shrink significantly. Examples: adjust the data types of numeric codes to integers where possible and use 1/0 instead of TRUE/FALSE.
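As an illustration of doing this at source level, the sketch below uses Python/pandas on a hypothetical extract; the file and column names are made up for the example.

```python
import pandas as pd

# Hypothetical source extract with string codes and TRUE/FALSE flags
df = pd.read_csv("sales_extract.csv")

# Numeric codes stored as text compress poorly in Power BI's engine;
# cast them to integers before the data reaches the model.
df["product_code"] = df["product_code"].astype("int64")

# Replace TRUE/FALSE text values with 1/0
df["is_active"] = df["is_active"].map({"TRUE": 1, "FALSE": 0}).astype("int8")

df.to_csv("sales_extract_clean.csv", index=False)
```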

3. Stick to Power BI’s default visuals

Report developers are often tempted to scroll through the list of third-party visuals in the marketplace and end up importing them. In some specific business scenarios they can be very suitable, but please use them with care. Some of the biggest performance wins I’ve ever made were accomplished by simply converting third-party visuals to default visuals.

4. Avoid bi-directional relationships in your data model

The number one performance killer. A bi-directional relationship between two tables means both tables cross-filter each other. This seriously hurts report performance, as every interaction is likely to trigger a cascade of filtering actions, especially when your data model contains multiple active bi-directional relationships. Optimal data model design is key. Alternatively, DAX’s CROSSFILTER() function can come in handy as well.

5. Understand the difference between duplicate and reference queries

A duplicated query simply serves as a one-time copy of the M query code, whereas a referenced query also pushes any changes made to it through to the queries that refer to it. Does this impact report performance? No, but it goes without saying that the latter improves query maintainability, as you only need to adjust the referenced query.

6. Only load data which will be reported upon

What about the options to “enable load” and “include in report refresh”?

Many of you have probably come across these options when developing reports without knowing exactly how to interpret them. When the “enable load” checkmark is unchecked, the data won’t be loaded into memory. The respective query will still be available in the query tab but will not clutter your data modeling or report view. This is often used for intermediary queries: intermediate results you want to refer to or merge/append other queries with.

There is a clear distinction with the “include in report refresh” option: when you uncheck this checkmark, the data simply won’t be refreshed when you click the refresh button of your report. The query fields will still be visible in your data modeling and report view as long as “enable load” is checked. This feature is perfect for preventing static data from refreshing redundantly.

Additionally, incremental refresh can be configured, for example to prevent frozen historic data from refreshing unnecessarily. This feature used to be Premium-exclusive, but it is generally available for Pro users as of the end of February 2020 as well, so make sure to check it out!


7. Slicers can be real performance killers

These are quick wins: change your slicers to dropdowns instead of lists. When a user opens a report, all (non-hidden) visuals require render time: the time Power BI needs to actually load and show a data visualization. Slicer lists take longer to render than slicer dropdowns, as the former presents its filter values on report initialization, while the latter only loads filter values when a user opens the dropdown.


Even more important is the choice of slicer visual. As stated before, you should stick to default visuals where possible instead of relying on third-party visuals from the marketplace. However, there are some value-adding visualizations in there as well. A clear example of a visual that significantly contributes to user experience is the marketplace’s so-called ‘HierarchySlicer’. This visual, unsurprisingly, enables users to add multiple fields to one and the same slicer, creating a hierarchy. At the time I created the report, Power BI did not yet offer the possibility to add multiple fields to its default slicer. As of Power BI’s February 2020 update, the default slicer also supports hierarchies. In fact, one of the biggest performance wins I came across consisted of simply changing the third-party ‘HierarchySlicer’ to the Power BI default variant, reducing report render time by an astonishing 90%. Continue reading to learn how you can measure performance.

 

8. Overthink your data modeling technique

I’ve had fantastic experiences reporting on a dimensional model. Bottom line: creating the report look and feel should only take up a small proportion of your total effort. The design of your data model largely determines whether your reports will perform well. I am aware there are several ways to design a data model, but I can only strongly advise sticking to Kimball’s approach. Other best practices to keep in mind include:

> The use of surrogate keys

> The use of role-playing dimensions

> The determination of fact(s), possibly enabling parallel loads

> No snow-flaking

Please bear in mind that I’ve only scratched the surface of dimensional modeling here. Always question your decisions on the journey to the ideal data model. Critical thinking is key, and in most cases an additional pair of eyes won’t hurt.

9. Split granularity to lower cardinality

To perform calculations, one or more requests are sent to the storage engine where the data is stored. This can be either an external relational engine (DirectQuery) or Power BI’s own storage engine, called the Vertipaq engine (Import). Due to the variety of DirectQuery connections, I’ll focus on the latter. In contrast to a traditional SQL database, which evaluates data row by row, the Vertipaq engine stores data in memory in a columnar format.

With this small introduction to the Vertipaq engine, I want to draw your attention to how Power BI handles imported data: the higher the cardinality of a column, the longer Power BI needs to perform calculations on that data. In other words: try to limit the number of distinct values of an attribute.

For example, you could split datetime fields into a date and a time field. Memory load will drop considerably, as both fields have a much lower cardinality than the original datetime field. This is very case-specific, but I hope you grasp the idea.
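A hedged sketch of such a split, done at source level with Python/pandas on a hypothetical extract (column names are made up; adjust the time precision to what your reports actually need):

```python
import pandas as pd

# Hypothetical fact extract with a high-cardinality datetime column
df = pd.read_csv("events.csv", parse_dates=["event_datetime"])

# Splitting datetime into a date and a time column lowers the cardinality
# of each column, which helps Power BI's columnar (Vertipaq) compression.
df["event_date"] = df["event_datetime"].dt.date
df["event_time"] = df["event_datetime"].dt.floor("min").dt.time  # minute precision

df = df.drop(columns=["event_datetime"])
df.to_csv("events_clean.csv", index=False)
```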

10. Measure performance!

Get a clear view on how a report is currently performing so you can quantitatively measure and compare the outcome of the actions you take. Power BI has a built-in feature, called the Performance Analyzer, that lets you measure how much time is spent rendering visuals and how much time is spent executing DAX code. It even gives you the full extract of the DAX code, which you can analyze further in tools such as DAX Studio.

Remember the difference between Power BI’s updated default slicer and the HierarchySlicer, imported from the market place? I’ve compared their performance on one of my reports resulting in the following excerpts:

Wrap-Up

The above tips and tricks are some of the actions you can take when tuning the performance of Power BI reports. I did not intend to assemble a complete list; there are several more options, but I can only advise starting with the ones described above. If none of them result in the performance you were hoping for, do not hesitate to get in touch with me or any of my colleagues at Lytix.

All aboard the data train today, CHOO CHOO 🚂 

Lou Segers
Microsoft Analytics Consultant @ Lytix
Follow me on LinkedIn

Managed Big Data: DataBricks, Spark as a Service

The title accompanying this blog post is quite a mouthful. This post explains why you would use Spark. If such a use case makes sense for you, we will introduce you to the DataBricks product, which is available on Azure. Its recognition as a Leader in Gartner’s Magic Quadrant underlines DataBricks’ operational strength and vision.

If you would like to know more about DataBricks, Lytix and Cubis are hosting a free expert session on Wednesday 4 March 2020 (in ‘De Kantien’ in Ghent, starting at 18:00) where we will walk you through demos to get started with DataBricks on Azure. It doesn’t matter whether you’re a business analyst, a C-level executive, a manager or a student; feel free to sign up for this intro session using the following link: http://bit.ly/DataBrytix

What is Spark and what ‘need’ did it fulfil? 

If you’ve been around the ‘Data and Analytics’ domain for several years, you’ll probably remember the hype around ‘Big Data’. The most common scenario in which companies switch to Big Data is when they do not have the analytical power to perform a workload on one (and only one) server. Such a workload could be ‘categorising your documents using Natural Language Processing’, ‘creating segmentations of all website visitors’, ‘training a very accurate prediction model’, … In the most common sense of the word, ‘Big Data’ processing is done in a distributed way, i.e. by coordinating several machines at once.
Spark is an open-source analytical engine which allows technical users to set up a distributed system, thus allowing companies to tackle their Big Data projects. Out of the box, Spark also gives you the ability to capture streaming events, provides a set of machine learning algorithms and allows for working with graph databases.
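To give a feel for what such a distributed workload looks like, here is a minimal PySpark sketch of the ‘segmentation on website visitors’ example. The paths and column names are hypothetical, and on DataBricks the SparkSession is already provided for you.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On DataBricks a SparkSession ("spark") is already available; locally you
# would create one yourself, as sketched here.
spark = SparkSession.builder.appName("segmentation-demo").getOrCreate()

# Hypothetical path and schema: read website-visitor events and
# aggregate them per visitor, distributed across the cluster.
visits = spark.read.parquet("/mnt/datalake/raw/web_visits")

visitor_stats = (
    visits.groupBy("visitor_id")
          .agg(F.count("*").alias("page_views"),
               F.countDistinct("session_id").alias("sessions"))
)

visitor_stats.write.mode("overwrite").parquet("/mnt/datalake/curated/visitor_stats")
```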

Now that we know Spark, what is DataBricks?

While Spark is great at what it does, it is hard to maintain and configure, hard to spin up and spin down, and hard to add servers to or remove them from your cluster. DataBricks addresses this problem and provides ‘Spark as a Service’ while also adding enterprise-required features. As the majority of the DataBricks product team also created the core of Spark, they have made API and performance improvements to the analytical engine they provide you with. As such, we believe DataBricks is the most enterprise-ready Big Data and Data Science platform.

Get started in minutes instead of days: 

Setting up a DataBricks cluster with several nodes is easy. The ‘New Cluster’ wizard lets you pick your cluster size and the kind of compute it requires: RAM? GPU? Delta Optimized?

The ‘Autoscaling’ feature allows DataBricks to add workers to or remove workers from your cluster based on your workload. If the queries sent to your cluster require large amounts of processing power, additional servers are added to your cluster so you get your results faster!
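For those who prefer automation over the wizard, the sketch below creates an autoscaling cluster through the DataBricks REST API (Clusters endpoint). The workspace URL, token, runtime version and node type are placeholders you would replace with your own values.

```python
import requests

# Sketch of creating an autoscaling cluster through the DataBricks REST API.
# The workspace URL, token and node/runtime identifiers below are placeholders.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "<runtime-version>",       # pick a current runtime in your workspace
    "node_type_id": "Standard_DS3_v2",           # Azure VM size for the workers
    "autoscale": {"min_workers": 2, "max_workers": 8},  # DataBricks scales within this range
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```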

Delta: ACID Transactions on Data Lakes

An absolute game changer that DataBricks has brought to the market is Delta (Lake). Data lakes are (often) a large collection of files, files which are ‘immutable’ and thus cannot be changed. Delta, on the other hand, enforces ACID properties on your data lake. Adding ACID properties (Atomicity, Consistency, Isolation and Durability) to a data lake enables analytical scenarios that involve inserts, updates and deletes.
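A minimal sketch of what those updates and deletes look like with the Delta Lake Python API, assuming a Spark session with Delta enabled and a hypothetical customers table:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Assumes a DataBricks (or delta-spark enabled) session and a hypothetical
# customers dataset already stored in Delta format.
path = "/mnt/datalake/curated/customers"
customers = DeltaTable.forPath(spark, path)

# UPDATE: correct a country code in place (not possible on plain immutable files)
customers.update(
    condition=F.col("country") == "UK",
    set={"country": F.lit("GB")},
)

# DELETE: remove rows flagged for GDPR erasure
customers.delete(F.col("erasure_requested") == True)
```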

The DataBricks Notebook Experience

If you haven’t worked with analytical notebooks (Jupyter, Azure Data Studio or DataBricks), you’re missing out! Being able to write documentation and code within the same document is a big step forward. It helps you make clear to readers why you created the queries and guides them through their first steps in analytics. As opposed to other notebooks, DataBricks can connect to version control (Git, TFS, …) and allows you to combine R, Python, SQL and Scala in the same notebook.
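As a small illustration of mixing languages, the sketch below shows a Python cell registering a temporary view that a colleague could then query from a SQL cell via DataBricks’ %sql magic command; the path and column names are hypothetical.

```python
# Cell 1 (Python): read data and expose it as a temporary view
orders = spark.read.parquet("/mnt/datalake/raw/orders")   # hypothetical path
orders.createOrReplaceTempView("orders")

# Cell 2 could be pure SQL by starting the cell with the %sql magic command:
#   %sql
#   SELECT order_date, SUM(amount) AS revenue
#   FROM orders
#   GROUP BY order_date
#   ORDER BY order_date

# The same query can also be run from Python in this cell:
revenue = spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date"
)
revenue.show()
```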

Managing your Data Engineers and Data Scientists

DataBricks allows you to apply security to folders and notebooks using Azure Active Directory; notebooks that contain sensitive data can only be seen by a specific security group. If you combine this with Azure Data Lake Storage Gen2, which allows security to be applied to data folders, you’ll have an enterprise-ready data science environment.

Why shouldn’t I use DataBricks?

Delta Lake will not support operational processes the way an OLTP system does, due to its limited concurrency on a single file. That is why we at Lytix and Cubis mostly use it for data warehousing scenarios, which require little concurrency.
As your data is divided over several partitions across many servers and the results of queries/calculations must pass through your head node, do not expect distributed computing to behave snappily! The technology is meant for analytical workloads on large amounts of data and will backfire if you use it any other way.

Our XTL Framework

In addition to all the features DataBricks provides, Lytix/Cubis has a framework that simplifies development and implements best practices. This framework uses Azure Data Factory as orchestrator and, as a result, provides you with a data layer that is easily accessible by any reporting tool (Power BI, SAP Analytics Cloud, …). As our logging components are automatically wrapped around business transformations, time spent on data engineering is significantly reduced. Using ‘Delta’, we even provide a ‘Data Warehouse on Spark/DataBricks’ solution.


If you want help/guidance on your DataBricks or Azure Platform implementation, do not hesitate to reach out.

 

Lytix’ XTL Framework

  You’ll often hear a Lytix consultant say that he/she is proud of our XTL Framework. What is it exactly? And how can it improve your business?  

Why do I need a Data Warehouse (DWH)?

Data Warehouses support organisations in their decision making by providing data in a well-governed way. This includes:

– Integration of source systems: organisations prefer a mix of solutions to achieve a competitive advantage. Often, large ERP systems (Dynamics AX for Finance and Operations, SAP, Odoo, …) are combined with niche products (SalesForce, Google Analytics, Silverfin, Exact Online, … ) to support their business processes. In an analytical setting, this data also needs to be enriched with open-source APIs (weather APIs, ID verification, … ).

– Single source of truth: valuable time can be wasted if numbers cannot be trusted. A DWH provides you with one validated catalogue of data.

– Easy Managed Access: data is provided in a way that lets business users crunch large quantities of data, while still applying security; users should only be able to see what they’re entitled to see (no more, no less).

– A foundation for Data Science / AI: provide a well-structured data repository for your data scientists and minimize the valuable time they spend cleaning up and combining sources.

Why do I need the Lytix’ XTL Framework?

A lack of internal engineering experience can result in a solution that is unsustainable in the long term. It gets even worse if the person who built it leaves the organisation. Our framework provides unified dimensional models and minimizes ETL engineering.

 

Advantages of the XTL Framework:

– Unified way of working: in each of our supported technologies (Azure SQL DB, SnowFlake, DataBricks and Azure Synapse), the main task of engineers is to convert business logic to SQL, a query language that is widely known and easily portable to other technologies.

– Maximized logging (with minimal effort during development): logging modules are automatically wrapped around ETL statements, which removes a time-intensive burden from developers. Within these modules, all events are captured: the number of reads, inserts, updates and deletes, as well as warnings and errors.

– Built for the cloud: the framework is built in such a way that it takes advantage of cloud benefits: pay-as-you-use pricing and easy scaling up and down.

– Applied Kimball: the modelling techniques are based on best practices defined by the renowned data modeller Ralph Kimball (The Data Warehouse Toolkit, Ralph Kimball & Margy Ross). The most-used and popular Slowly Changing Dimension (SCD) types are supported, such as ‘SCD0: Retain the original’, ‘SCD1: Overwrite with newer values’, ‘SCD2: Track history by adding new rows’, ‘SCD3: Add a new attribute’, ‘SCD6: Add type 1 attributes to a type 2 dimension’ and ‘SCD7: Dual type 1 and type 2 dimensions’. Switching an attribute from SCD Type 1 to SCD Type 2 (and thus tracking history) is just a matter of a click; a generic SCD Type 2 sketch follows below.
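For illustration only, here is a generic, simplified SCD Type 2 merge on a Delta/PySpark back-end. This is not the XTL framework’s actual implementation; the table paths, column names and the single tracked attribute (address) are assumptions made for the example.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim_path = "/mnt/dwh/dim_customer"          # hypothetical dimension table (Delta)
dim = DeltaTable.forPath(spark, dim_path)   # `spark` is provided on DataBricks

# Latest source snapshot (hypothetical): customer_id, address
updates = spark.read.parquet("/mnt/staging/customers")

# Customers whose tracked attribute changed need an extra "new current row"
changed = (
    updates.alias("u")
    .join(dim.toDF().alias("d"),
          (F.col("u.customer_id") == F.col("d.customer_id")) &
          (F.col("d.is_current") == True))
    .where(F.col("u.address") != F.col("d.address"))
    .select("u.*")
)

# Stage rows twice: a NULL merge key forces an insert of the new version,
# while the real key lets the merge close the old version.
staged = (
    changed.withColumn("merge_key", F.lit(None).cast("long"))
    .unionByName(updates.withColumn("merge_key", F.col("customer_id")))
)

(dim.alias("d")
 .merge(staged.alias("s"), "d.customer_id = s.merge_key AND d.is_current = true")
 .whenMatchedUpdate(                      # close the old version
     condition="d.address <> s.address",
     set={"is_current": "false", "valid_to": "current_date()"})
 .whenNotMatchedInsert(                   # insert new versions and new customers
     values={"customer_id": "s.customer_id",
             "address": "s.address",
             "is_current": "true",
             "valid_from": "current_date()",
             "valid_to": "cast(null as date)"})
 .execute())
```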

Additional Features

Automated Documentation

The XTL framework is accompanied by a comprehensive report providing insights into the ETL processes. This report provides both standard metric evolutions (number of updates, deletes, failures, …) and a dependency view that maps your sources to their targets and shows the applied transformations.

Dependency Tree

Our framework automatically generates dependency trees based on the metadata of your transformations. Generating the most optimal parallel execution steps within your whole ETL becomes child’s play. Do you need to refresh only one, two or three tables because you need updated data fast (e.g. for month-closing)? No problem: the dependency tree will only trigger those ETL steps that are directly related to the tables you want updated.
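The idea behind such selective, dependency-aware refreshes can be illustrated with a small Python sketch; the dependency map and table names are invented for the example, whereas the real framework derives them from ETL metadata.

```python
from collections import defaultdict, deque

# Illustrative only: a hand-written dependency map (target table -> the
# tables it reads from).
dependencies = {
    "stg_sales":      [],
    "stg_customers":  [],
    "dim_customer":   ["stg_customers"],
    "fact_sales":     ["stg_sales", "dim_customer"],
}

def refresh_order(targets, dependencies):
    """Return the ETL steps needed to refresh `targets`, in execution order."""
    # Collect the targets plus everything they (transitively) depend on
    needed, stack = set(), list(targets)
    while stack:
        table = stack.pop()
        if table not in needed:
            needed.add(table)
            stack.extend(dependencies.get(table, []))

    # Topological sort restricted to the needed tables (Kahn's algorithm)
    indegree = {t: 0 for t in needed}
    children = defaultdict(list)
    for table in needed:
        for dep in dependencies.get(table, []):
            if dep in needed:
                indegree[table] += 1
                children[dep].append(table)

    queue = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in children[table]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

# Month-closing: only fact_sales and its upstream tables get refreshed
print(refresh_order(["fact_sales"], dependencies))
```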

No ‘All Or Nothing Approach’ in Batches

We often hear the following complaint from customers not using our framework: “I know one little package failed, but does that mean that the whole dataflow to the business should fail?!”

Our framework addresses this issue! The dependency tree lets you decide what to do upon failure: do you just ignore the failure and continue with your ETL, or do you stop abruptly to make 100% sure the error doesn’t cause any further problems? This is fully configurable in our framework, and you can make this decision per ETL step.

Data Quality Reporting

Our framework includes a ‘Data Quality’ module. By inputting business rules, the framework looks for inconsistencies and invalid input and reports the findings back to the business.
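To illustrate what rule-based checking can look like in general, here is a generic Python sketch; it is not the framework’s module, and the column names and rules are invented.

```python
import pandas as pd

# Illustrative only: a rule-based quality check on a hypothetical extract.
orders = pd.read_csv("orders.csv")

rules = {
    "order_id must be unique":       lambda df: df["order_id"].duplicated(),
    "amount must be positive":       lambda df: df["amount"] <= 0,
    "customer_id must not be null":  lambda df: df["customer_id"].isna(),
}

findings = []
for rule_name, violates in rules.items():
    bad_rows = orders[violates(orders)]
    if not bad_rows.empty:
        findings.append({"rule": rule_name, "violations": len(bad_rows)})

# Report the findings back to the business, e.g. as a small summary table
print(pd.DataFrame(findings))
```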

 

Components

The following components are used throughout the framework:

– Azure Data Factory is used as orchestrator and scheduler. This component kicks off at the requested time (at midnight, once per hour, …) and makes sure all dependencies are initiated in a logical order.


– A ‘Storage and Compute Module’ to perform the necessary transformations and provide the data to data consumers, be it business users, data scientists or even mobile devices and web portals. One of four technologies can be implemented as the back-end for this component:
    o Azure SQL Database: a Microsoft product for customers who are already familiar with SQL Server and where the Data Warehouse needs lots of integrations.
    o SnowFlake: provides both a data lake and strong compute for your analytical needs.
    o DataBricks: if streaming and distributed computing are necessary to build an all-encompassing Data Warehousing solution. This platform integrates best with organisations that have a large data science team!
    o Azure Synapse (formerly Azure SQL Data Warehouse): provides virtually limitless compute for customers with more than 4 TB of data.


– An analytics engine that allows business users to create their own reports, be it Azure Analysis Services or Power BI Pro/Premium.

Wrap-Up

This article describes one of our proven ways to fuel your data-driven organization, covering a multitude of use cases ranging from traditional reports to data science, all in a governed way.

In need of some help? Do not hesitate to contact Lytix or the author Sander Allert and we/he will be glad to exchange ideas on this subject!