How ALM streamlines BI projects: Azure DevOps

Application Lifecycle Management (ALM) refers to a (software) development process that is set up in a governed and easy-to-manage way. ALM provides added value to the development team, project managers and business users. While the term 'ALM' is mostly associated with pure software development projects (written in 100% programming languages), BI projects (which are by nature less programmatic) can also adhere to these high standards.
In this blog you will learn how we at Lytix adhere to ALM standards by using Azure DevOps, what the benefits are, and why it is a 'must-have' component in BI implementations.

What problems does ALM solve?

When not working in an ALM-compliant manner, you often hear the following complaints from business and development:

  • I've accidentally made a change to an object (report, ETL task, …) and my last copy is three weeks old… All my intermediate changes are lost!
  • There are different versions circulating on the different environments; there are structural differences between the reports of the ‘production’ environment and the ‘business sandbox’ environment.
  • I have no idea what the development team is currently doing; while they should be prioritising bug fixes, they are probably building unwanted features.
  • Why is it so hard to provide me a list of all changes that are made to resolve this problem?!
  • Even the simplest tests have not been carried out by the development team; if the developer had just had a quick look at the data, he would've known something was off.
  • Can you make sure that a key-user has tested and validated the changes before you deploy to production?
  • Manual scripts need to be run to orchestrate the deployment to production. A developer always needs to reserve spare time to shepherd a successful deployment, which makes deploying stressful.

Working in an ALM-compliant way resolves these complaints. Azure DevOps is one of the tools that helps you work in a process-driven way throughout your development lifecycle. Its modules Boards, Pipelines, Repos and Test Plans make sure that all ALM aspects are covered.

Azure Boards

Azure Boards provides Kanban boards and overviews of all kinds of work items. As standard processes, Azure Boards offers the project methodologies 'Agile', 'Scrum' and 'CMMI' (Capability Maturity Model Integration). Depending on the nature of the project (large, medium, small) or the business requirements (the need to follow a certain standard), you can customise these processes until they fit your needs. Manage and plan the availability of your team, plan ahead, budget sprints and prioritise tasks: all of this is possible in the Boards module.

For example, for the development of our XTL Framework we follow the Agile methodology. Epics, features, user stories, issues and bugs are defined. Based on the sprint planning and the available days per consultant (all of which can be managed within Azure Boards), tasks are resolved. As such, a transparent view on the roadmap of our framework is always available.

Azure Repos

Repositories, or Repos, store the different versions of the code you are working on (known as version control). All changes made to your code are saved historically using a combination of branches, merges, commits, pushes and pull requests. A private repository (Git) is used to track when a change was made and by whom. As such, for each tiny piece of code you can retrace which changes happened per sprint, or revert your code to a specific point in time. Linking changes in your code/reports to work items creates a tight integration between planning and development: retrace a code change to a specific task and get to know the reason for the alteration.

A common misconception is that 'all Git repositories are open source and thus globally accessible'. While Azure Repos provides a Git repository, this repository is only accessible to users of the project. All users wanting to view your code need to authenticate with their Microsoft account, which often also requires two-factor authentication. Hence, you can be sure that the code you commit to your Azure Repository is only accessible to those allowed to see it.

Azure Pipelines

Azure Pipelines makes sure the written code passes sanity checks (a 'build' verifies that your code is consistent) and is then deployed to new environments in a structured way. Pipelines can be configured so that no intervention from the BI team is required when deploying to a 'higher' environment (e.g. from QA to Production). It also makes sure that multiple environments are always kept in sync with each other.

If one of the stakeholders requires a new environment (e.g. a production-like sandbox for business), this takes just a few clicks.

For our XTL Framework, we use the following stages:

  • The '_XTLFramework' stage contains a consistent version of the code. A trigger is in place that releases to the next phase after every code commit.
  • The ‘Build’ phase is used as a no-data sanity environment. This is a first filter implemented to reach a higher quality of code and logic.
  • At the end of each sprint, a business user accepts a new version of the code (= pre-deployment approval), which triggers an automatic deploy to the Test environment. Technical testers try out the functionalities and check whether the requirements have been met (= post-deployment approval). Upon approval by the technical tester, the code is automatically deployed to the Quality Acceptance environment.
  • Either the requester of the change or a key-business user (= functional tester) then evaluates the new functionality on the Quality Acceptance environment. Again, a formal approval of a key-user is required to continue to the next stage. Upon acceptance, code is automatically deployed to the ‘Business Evaluation’ environment and ‘Business Sandbox’ environment. The code is not automatically deployed to production.
  • A ‘pre-deployment’ approval of the BI product owner is required before this version of the code is deployed to production. After acceptance, the release has fully reached all environments and is in sync.

In the setup above, no developer or technical intervention is needed once the code has been committed. Every phase is promoted by the owner of that specific stage.

While throughout this article we've mainly provided examples from our XTL Framework, we configure CI/CD pipelines for all kinds of software and tools:

  • Cloud Components: Azure Data Factory, Azure SQL Database, Azure Synapse, SnowFlake, DataBricks, Azure Functions, …
  • Business Applications: PowerApps, Power BI, Microsoft Power Automate (formerly Flow), …
  • On-Premise Applications: SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), On-Premise SQL Server Databases, …

Test Plans

To demonstrate that newly developed code has been tested and proven to be of high quality, the 'Test Plans' module is used. These tests can be executed either manually or automatically. A pop-up shows testers the tests they should perform and allows them to easily log deviations from the expected process. All test results can be accessed and are linked to sprints, work items and branches, thus enabling 360° traceability of the improvements to your BI solution.

Good to Knows & Pricing

Most functionalities are free up to the first five 'contributors' (= people using DevOps who require more than view-only rights). As we like to keep our teams lean and mean, small to medium projects can often use this service for free. If you plan on using 'Test Plans', it is good to know that this requires a more costly license plan; if you happen to have a Visual Studio Enterprise Subscription, it is automatically included. In addition, Azure DevOps is easily accessible from other tools: export and manage your own data using Power BI, PowerApps and Power Automate.

Taking into account all the above functionalities, the low pricing and the tight integration, Lytix is convinced that the Azure DevOps suite provides added value in BI projects.

Author: Sander Allert

BI Architect

Interested in a demo or a hands-on session? Wanting to talk to an architect? Send a mail to the author of this article and get in contact: sander.allert@lytix.be

Snowflake, a DWH as a Service

'Cloud' or 'clowned'? With the broad adoption of cloud computing across all processes, analytics is the area that can benefit from it the most. Analytics can take the shape of a data warehouse used for reporting or ad-hoc analysis, a data lake that serves as a repository for value-added data, or Big Data workloads for which a single server isn't sufficient to meet your analytical needs. Typical for analytics is the combination of different sources, a high variation in usage (on some days it is used intensively, on others it is barely consulted) and the need for lots of computational power.

As a cloud PaaS solution, Snowflake is taking the data world by storm. The tool has a lot of benefits, but is it really as good as it claims to be? Can it address the challenges listed above? This blog post goes through the most prominent advantages and disadvantages of Snowflake.

What SnowFlake Enables…

But… what is PaaS?

Before we dive into the nitty-gritty of Snowflake, it is important to define what a 'Platform as a Service' (PaaS) is.

As the name describes, a platform is offered as a service to customers, who can execute the required developments on it without having to take care of the underlying infrastructure. That infrastructure is provided and maintained by the owner of the PaaS solution.

So, everything is set for me to develop! What are the limitations then?

Well, exactly as you say: everything is set up for you, so you cannot change anything in the infrastructure. Any limitations the solution has, you will either need to live with or hope that the provider resolves them.

This is an issue you don't have when you build the platform yourself: if you need additional infrastructure developments or changes, you can plan and execute them yourself.

It is always best to look at what the PaaS solution can offer you and whether it ticks all the boxes you need.

Where does Snowflake fit in here?

Snowflake offers a lot of capabilities you can use on your data journey without having to take care of the environment yourself. All the technical set-up is done for you; hence, it is PaaS.

What makes Snowflake so special besides being PaaS?

We will look into a few aspects that make Snowflake shine.

  • Cloud Availability

One of the main selling points of Snowflake is the fact that it is fully cloud-based and not an on-premises system made available in the cloud.

It's available on every major cloud platform on the market, meaning Azure, AWS and Google Cloud. On top of this, Snowflake maintains a cross-cloud strategy.

This strategy ensures the following:

  • The possibility to negotiate the best cloud environment without having any data-migration issues.
  • In case your chosen cloud provider faces an outage, you can simply switch to another cloud provider and ensure uptime of your systems.
  • Data Storage Capacity

You often hear data is the new currency. Nothing could be more true in the business world. Companies keep collecting data and so their volumes keep growing.

Luckily, Snowflake has no limit on data storage. As the storage and compute layers are separated, you can simply load all relevant business data into SnowFlake, which will then treat it as a data lake (read: cheap storage and a virtually infinite storage amount).

Of course, you might not need a large amount now, but in the years to come you can expect your data volumes to increase. If you are provided with a large volume of value-added data, you do not need to hesitate and can add it directly to SnowFlake. All data residing within SnowFlake (= your data lake) is directly accessible.

Another interesting storage feature is that copying a database does not take up double the space: SnowFlake recognises the records that are identical and only requires extra space for new or changed records.
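A minimal sketch of such a zero-copy clone, with hypothetical database and table names:

```sql
-- Clone a database: the clone initially shares the storage of the original;
-- only new or changed data takes up extra space.
CREATE DATABASE analytics_sandbox CLONE analytics_prod;

-- The clone is immediately queryable like any other database.
SELECT COUNT(*) FROM analytics_sandbox.sales.orders;
```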

  • Dynamic Scaling

Snowflake often uses the phrase 'pay for what you need', which is exactly what dynamic scaling is about. This is best explained with an example:

Imagine you run your daily load in the morning between 3 AM and 4 AM. At the start of the working day, a lot of users are interested in the latest insights available within SnowFlake (let's say between 8 AM and 10 AM). For the rest of the day, only one super user sends a query now and then.

In more traditional cost-effective cloud scenarios (= not SnowFlake), you would start up your BI system only for the periods during which it is planned to be used (3 AM to 4 AM and 8 AM to 10 AM). Your BI environment takes a while to spin up and spin down.

SnowFlake doesn't incur a delay in spinning up and down and is thus directly accessible, even if no cluster is running. We often configure it in such a way that it immediately spins up whenever someone sends a query and automatically spins down after one minute of inactivity. Paying only when compute is actually being used will save your organization a lot of money!

“SnowFlake directly spins up when users issue queries against the database, even from an idle state. In addition, SnowFlake is automatically paused after one minute of inactivity. This combination saves a significant amount of infrastructure costs that would otherwise be billed.”
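As a minimal sketch of how this is typically configured (the warehouse name is hypothetical):

```sql
-- Warehouse that pauses after 60 seconds of inactivity and resumes
-- transparently as soon as the next query arrives.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60     -- seconds of inactivity before suspending (no compute billed while suspended)
  AUTO_RESUME    = TRUE;  -- spins up automatically on the next query
```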

  • Instant Adjustable Processing

One of the most interesting features is the ability to change the size of your warehouse: the bigger the size, the faster large queries are executed. This can be changed on the fly, depending on the needs of the organization at that point in time. SnowFlake uses T-shirt sizes (XS, S, M, L, XL, XXL, XXXL) and each size doubles the previous one (Medium equals twice the processing power of Small).
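Resizing is a single statement; a small sketch, reusing the hypothetical warehouse from above:

```sql
-- Scale up for a heavy month-end load ...
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ... and back down afterwards, so the extra power is only paid for while it is needed.
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'XSMALL';
```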

  • Query Memory

It is very important to optimise your query usage as much as possible, since you are paying per use. Luckily, Snowflake has this covered for you!

If you execute a query and the underlying data doesn't change, the result is cached by Snowflake's engine.

This means that the second time you execute the query, it will not cost you anything as you will not be querying your data directly but the cached version.
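A small sketch of this behaviour (table names are hypothetical); the second, identical query is answered from the result cache and does not need a running warehouse:

```sql
-- First execution: the warehouse computes the result and compute time is billed.
SELECT region, SUM(amount) AS revenue
FROM   sales.orders
GROUP BY region;

-- Identical re-run while the data is unchanged: served from the result cache, no compute billed.
SELECT region, SUM(amount) AS revenue
FROM   sales.orders
GROUP BY region;

-- The behaviour can be toggled per session, e.g. when benchmarking.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```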

That’s quite a lot of features! Are there any others?

Yes, there are. We won't go into detail about all of them, but here are some smaller yet interesting features:

  • Time-Travel Feature: the ability to view data from the past (see the sketch after this list).
  • Copy-Clone: you can copy an existing database and it is directly available for use.
  • ANSI-SQL support.
  • Querying directly on JSON, XML, … files (also shown in the sketch below).
  • Streaming.
  • Data Sharing.
  • Active Directory Integration.
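A minimal sketch of two of these features, with hypothetical table and column names:

```sql
-- Time travel: read a table as it was one hour ago.
SELECT *
FROM   sales.orders AT (OFFSET => -60*60);

-- Semi-structured data: a VARIANT column holding raw JSON can be queried with path notation.
SELECT payload:customer.name::STRING AS customer_name,
       payload:items[0].sku::STRING  AS first_sku
FROM   raw.webshop_events;
```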

Are there any aspects you would say Snowflake is not suited for?

Keep in mind that Snowflake is, at its core, designed for handling Big Data, Data Warehouses and so on. If your use case lies in Data Science, we suggest you look at tools built with that in mind, such as DataBricks and Spark. More info regarding DataBricks can be found on our blog.

Very interesting! How can I get in touch?

If you need help (development, architecture, strategy, …) with one of your BI implementations, Lytix can assist you. Lytix believes in the SnowFlake platform and knows its advantages and disadvantages inside out. Our ETL framework 'XTL' uses SnowFlake as a database engine and can give you a head start in an implementation.

The XTL Framework & Technology Drivers

Managed Big Data: DataBricks, Spark as a Service

The title accompanying this blog post is quite the mouthful. This blog post explains why you should be using Spark. If a use case makes sense, we will introduce you to the DataBricks product, which is available on Azure. Being recognised as a Leader in the Magic Quadrant emphasises the operational use and vision of DataBricks.

If you would like to know more about DataBricks, Lytix and Cubis are providing a free expert session on Wednesday the 4th of March 2020 (in 'De Kantien' in Ghent, starting at 18:00) where we will provide you with demos to get started with DataBricks on Azure. It doesn't matter whether you're a business analyst, a C-level, a manager or a student; feel free to subscribe to this intro course using the following link: http://bit.ly/DataBrytix

What is Spark and what ‘need’ did it fulfil? 

If you've been around the 'Data and Analytics' domain for several years, you'll probably remember the hype around 'Big Data'. The most common scenario in which companies switch to Big Data is when they do not have the analytical power to perform a workload on one (and only one) server. Such a workload could be 'categorising your documents using Natural Language Processing', 'creating segmentations of all website visitors', 'training a very accurate prediction model', … In the most common sense of the word, 'Big Data' processing is done in a distributed way, i.e. by coordinating several machines at once.
Spark is an open-source analytical engine which allows technical users to set up a distributed system, thus allowing companies to tackle their Big Data projects. By default, Spark also gives you the ability to capture streaming events, provides a set of machine learning algorithms and allows for working with graph databases.

Now we know Spark, but what is DataBricks?

While Spark is great at what it does, it is hard to maintain and configure, hard to spin up and spin down, and hard to add servers to or remove them from your cluster. DataBricks addresses this problem and provides 'Spark as a Service', while also adding enterprise-required features. As the majority of the DataBricks product team also created the core of Spark, they have made API and performance improvements to the analytical engine they provide you with. As such, we believe that DataBricks is the most enterprise-ready Big Data and Data Science platform.

Get started in minutes instead of days: 

Setting up a DataBricks cluster with several nodes is easy. The 'New Cluster' wizard lets you pick your cluster size and the kind of compute it requires: RAM? GPU? Delta optimised?

The 'Autoscaling' feature allows DataBricks to add or remove workers within your cluster based on your workload. If the queries sent to your cluster require large amounts of processing power, additional servers are added to your cluster so you get your results faster!

Delta: ACID Transactions on Data Lakes

An absolute game changer that DataBricks has brought to the market is Delta (Lake). Data lakes are (often) a large collection of files, files which are 'immutable' and thus cannot be changed. Delta, on the other hand, enforces ACID properties on your data lake. Adding ACID properties (Atomicity, Consistency, Isolation and Durability) to a data lake allows for analytical scenarios that involve inserts, updates and deletes.
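A minimal sketch of what this enables, with hypothetical table names: on a Delta table, updates, deletes and upserts are plain SQL statements, while Delta keeps the underlying immutable files consistent.

```sql
-- A managed Delta table.
CREATE TABLE IF NOT EXISTS customers (
  customer_id INT,
  name        STRING,
  city        STRING
) USING DELTA;

-- Row-level changes that a classic file-based data lake cannot offer.
UPDATE customers SET city = 'Ghent' WHERE customer_id = 42;
DELETE FROM customers WHERE customer_id = 99;

-- Upserts from a staging table are expressed as a MERGE.
MERGE INTO customers AS tgt
USING customers_staging AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```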

The DataBricks Notebook Experience

If you haven't worked with analytical notebooks (Jupyter, Azure Data Studio or DataBricks), you're missing out! Being able to write documentation and code within the same document is a big step forward: it helps you make clear to readers why you created the queries and guides them through their first steps in analytics. As opposed to other notebooks, DataBricks can connect to version control (Git, TFS, …) and allows you to combine R, Python, SQL and Scala in the same notebook.

Managing your Data Engineers and Data Scientists

DataBricks allows applying security on folders and workbooks using Azure Active Directory; workbooks which contain sensitive data can only be seen by a specific security group. If you combine this with Azure Data Lake Storage Gen2, which allows applying security on data-folders, you’ll have an enterprise-ready data science environment. 

Why shouldn’t I use DataBricks?

Delta Lake does not support operational processes the way an OLTP system does, due to its limited write concurrency on a single file. Therefore, we at Lytix and Cubis mostly use it for Data Warehousing scenarios that require little concurrency.
As your data is divided over several partitions across many servers, and the results of queries/calculations have to pass through your head node, do not expect distributed computing to feel snappy! The technology is meant for analytical workloads on large amounts of data and will backfire if you use it any other way.

Our XTL Framework

In addition to all the features DataBricks provides you with, Lytix/Cubis has a framework that simplifies development and implements best practices. This framework uses Azure Data Factory as orchestrator and, as a result, provides you with a data layer that is easily accessible by any reporting tool (Power BI, SAP Analytics Cloud, …). As our logging components are automatically wrapped around business transformations, time spent on data engineering is significantly reduced. Using Delta, we even provide a 'Data Warehouse on Spark/DataBricks' solution.


If you want help/guidance on your DataBricks or Azure Platform implementation, do not hesitate to reach out.

 

Lytix’ XTL Framework

  You’ll often hear a Lytix consultant say that he/she is proud of our XTL Framework. What is it exactly? And how can it improve your business?  

Why do I need a Data Warehouse (DWH)?

Data Warehouses support organisations in their decision making by providing data in a well-governed way. This includes:

– Integration of source systems: organisations prefer a mix of solutions to achieve a competitive advantage. Often, large ERP systems (Dynamics AX for Finance and Operations, SAP, Odoo, …) are combined with niche products (SalesForce, Google Analytics, Silverfin, Exact Online, … ) to support their business processes. In an analytical setting, this data also needs to be enriched with open-source APIs (weather APIs, ID verification, … ).

– Single source of truth: valuable time can be wasted if numbers cannot be trusted. A DWH provides you with one validated catalogue of data.

– Easy managed access: data is provided in a way that business users can crunch large quantities of it, while still applying security; users should be able to see only what they're entitled to see (no more, no less).

– A foundation for Data Science / AI: provide a well-structured data repository for your data scientists and minimise the valuable time they spend cleaning up and combining sources.

Why do I need the Lytix’ XTL Framework?

A lack of internal engineering experience can result in a solution that is unsustainable in the long term; it gets even worse when the person who built it leaves the organisation. Our framework provides unified dimensional models and minimises ETL engineering.

 

Advantages of the XTL Framework:

– Unified way of working: in each of our supported technologies (Azure SQL DB, SnowFlake, DataBricks and Azure Synapse), the main task of engineers is to convert business logic to SQL, a query language that is widely known and can be ported to other technologies.

– Maximised logging (with minimal effort during development): logging modules are automatically wrapped around ETL statements, which removes a time-intensive burden from developers. Within these modules, all events are captured: the number of reads, inserts, updates and deletes, as well as warnings and errors.

– Built for the cloud: the framework is built to take advantage of cloud benefits such as pay-as-you-use and easy scaling up and down.

– Applied Kimball: the modelling techniques are based on best practices defined by the renowned data modeller Ralph Kimball (The Data Warehouse Toolkit, Ralph Kimball & Margy Ross). The most popular Slowly Changing Dimension (SCD) types are supported, such as 'SCD0: Retain the Original', 'SCD1: Overwrite with newer values', 'SCD2: Track history by adding new rows', 'SCD3: Add new attribute', 'SCD6: Add type 1 attributes to a type 2 dimension' and 'SCD7: Dual Type 1 and Type 2 Dimensions'. Switching an attribute from SCD Type 1 to SCD Type 2 (and thus tracking history) is just a matter of a click (a generic SCD2 sketch follows below).
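Purely as an illustration of the SCD2 pattern (this is not the framework's generated code, and the table and column names are hypothetical), a type 2 load can be expressed in two SQL statements:

```sql
-- Step 1: expire the current version of customers whose tracked attributes changed.
MERGE INTO dim_customer AS d
USING stg_customer AS s
   ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.city <> s.city OR d.segment <> s.segment) THEN
  UPDATE SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE;

-- Step 2: insert a new 'current' row for every new or changed customer.
INSERT INTO dim_customer (customer_id, city, segment, valid_from, valid_to, is_current)
SELECT s.customer_id, s.city, s.segment, CURRENT_TIMESTAMP, NULL, TRUE
FROM   stg_customer AS s
LEFT JOIN dim_customer AS d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE  d.customer_id IS NULL;
```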

Additional Features

Automated Documentations

The XTL framework is accompanied by a comprehensive report providing insights into the ETL processes. This report provides both standard metric evolutions (number of updates, deletes, failures, …) and a dependency view which maps your sources to the targets and shows the applied transformations.

Dependency Tree

Our framework automatically generates dependency trees based on the metadata of your transformations. Generating the most optimal parallel execution steps within your whole ETL becomes child's play. Do you need to refresh only one, two or three tables because you need updated data fast (e.g. for month-end closing)? No problem: the dependency tree will only trigger those ETL steps which are directly related to the tables you want updated (see the sketch after this paragraph).
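As an illustration only (this is not the XTL metadata model, and the names are hypothetical), such a dependency tree can be derived from a simple metadata table; the recursive query below, assuming an engine that supports recursive CTEs, returns every step needed to refresh one target table:

```sql
-- One row per 'this step loads target_tbl and reads from source_tbl'.
CREATE TABLE etl_dependency (
  step_name  STRING,
  target_tbl STRING,
  source_tbl STRING
);

-- All steps that must run, directly or indirectly, to refresh FactSales.
WITH RECURSIVE upstream AS (
  SELECT step_name, target_tbl, source_tbl
  FROM   etl_dependency
  WHERE  target_tbl = 'FactSales'
  UNION ALL
  SELECT d.step_name, d.target_tbl, d.source_tbl
  FROM   etl_dependency AS d
  JOIN   upstream       AS u ON d.target_tbl = u.source_tbl
)
SELECT DISTINCT step_name FROM upstream;
```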

No ‘All Or Nothing Approach’ in Batches

We often hear the following complaint from customers not using our framework: “I know one little package failed, but does that mean that the whole dataflow to the business should fail?!”

Our framework addresses this issue! The dependency tree allows you to decide what to do upon failure: do you ignore the failure and continue with your ETL, or do you stop abruptly to make 100% sure the error doesn't cause any further problems? This is fully configurable in our framework, per ETL step.

Data Quality Reporting

Our framework includes a 'Data Quality' module. By entering business rules, the framework looks for inconsistencies and faulty input and reports the findings back to the business (a sketch of such a rule follows below).
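Purely illustrative (not the framework's actual module; table and column names are hypothetical), a business rule such as 'every sale must have a positive amount and a known customer' can be captured as a query whose violations are logged for follow-up:

```sql
-- Log every record that violates the rule into a findings table.
INSERT INTO dq_findings (rule_name, table_name, record_key, checked_at)
SELECT 'positive_amount_and_known_customer',
       'FactSales',
       f.sales_id,
       CURRENT_TIMESTAMP
FROM   FactSales AS f
LEFT JOIN DimCustomer AS c ON c.customer_key = f.customer_key
WHERE  f.amount <= 0
   OR  c.customer_key IS NULL;
```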

 

Components

The following components are used throughout the framework:

– Azure Data Factory is used as orchestrator and scheduler. This component kicks off at the requested time (at midnight, once per hour, …) and makes sure all dependencies are initiated in a logical order.


– A 'Storage and Compute Module' to perform the necessary transformations and to provide the data to data consumers, be it business users, data scientists or even mobile devices and web portals. One of four technologies can be implemented as the back-end for this component:
    o Azure SQL Database: a Microsoft product for customers who are already familiar with SQL Server and where the Data Warehouse needs lots of integrations.
    o SnowFlake: provides both a data lake and strong compute for your analytical needs.
    o DataBricks: if streaming and distributed computing are necessary to build an all-comprehensive Data Warehousing solution. This platform integrates best with organisations that have a large data science team!
    o Azure Synapse (formerly Azure SQL Data Warehouse): provides virtually limitless compute for customers with more than 4 TB of data.


– An analytics engine that allows business users to create their own reports, be it Azure Analysis Services or Power BI Pro/Premium.

Wrap-Up

This article describes one of our proven ways to fuel your data-driven organization, covering a multitude of use cases ranging from traditional reports to data science, all in a governed way.

In need of some help? Do not hesitate to contact Lytix or the author Sander Allert and we/he will be glad to exchange ideas on this subject!

 

Lytix’ Power BI 360° Certified: Uplift your Power BI game

We want you to attend!

Hello analytics enthusiasts! Lytix organizes four (4!) free meetups that demonstrate the 360° process of using Power BI within an organization. After completing the four hands-on sessions, you'll be granted the 'Lytix 360° Power BI' certification. Are you ready to become a Power BI expert? Then make sure to tune in at our next session on the 18th of December!

We are a group of motivated BI and Data Science consultants eager to share our knowledge with you.
Lytix is a start-up specialized in Microsoft Analytics and frequently organizes knowledge sharing sessions.
You are invited to follow our evening sessions (18:00 – 21:00) and we can already promise you will not leave empty-headed nor with an empty stomach.  🍕

This is the outline of our Power BI Track on becoming a 360° Power BI expert:

  • Quadrant 1 on 2/10: Power BI Introduction – Learn to create a simple data model and build reports on your data.
  • Quadrant 2 on 23/10: Power BI Measures & Columns – Learn the difference between calculated columns and measures, and get acquainted with Power BI.
  • Quadrant 3 on 27/11: Power BI Service – Deploy your first report to the Power BI Service and get to know the differences with Power BI Report Server.
  • Quadrant 4 on 18/12: Power BI DAX Advanced – Learn how to create everything imaginable using DAX and discover where its limits are!

Our ‘Quadrant Trainings’ will be hosted in the region of Mechelen, Belgium. We hope to see you there!

We want you to attend!