The Journey of Attaining the Azure Data Engineer Certificate

Introduction

On February 23, 2021, Microsoft released a new beta certification exam, Exam DP-203: Data Engineering on Microsoft Azure. It replaces the exams DP-200: Implementing an Azure Data Solution and DP-201: Designing an Azure Data Solution, which will retire on June 30, 2021. By passing either the two old exams or only the new one, you earn the Microsoft certification of Azure Data Engineer Associate. I ventured into this adventure and would like to tell you all about my journey.

Preparation

I planned enough time for a thorough preparation before the exam. Sometimes it felt like studying for exams in college again, albeit with a much shorter exam period.

The study material for this exam consists of 10 learning paths, which you can find on Microsoft Docs. The quantity of the learning material should not be underestimated, considering that each path in turn consists of multiple modules. For each module, an estimated time is indicated to give you an idea of how long it will take to read and understand the theory. Each module ends with a knowledge check, where two or three multiple-choice questions test how much you remember of the topic. Various topics are addressed in the learning paths, such as Azure Data Storage, Azure Databricks, data streaming, and Azure Synapse Analytics. A big part of the learning material is dedicated to this last subject.

In addition to the study material, I also did some practice exams on Udemy. On this website, example questions are provided in the form of half or full exams where you can test your knowledge against the clock. Some of the questions are the same as the multiple-choice questions at the end of each module, but there are also other, more thorough or concrete questions. Normally, you can find a lot of example questions from previous exams, but since this is a new (beta) exam, the real exam questions are not known yet. Nonetheless, taking these practice tests was a very good exercise to prepare for the real exam.

The Exam

Once I had processed all the study material, I made an appointment to take the exam and chose the option to take it from home. In my familiar environment, at my desk, I completed the exam using the Pearson VUE technology. After uploading the necessary pictures of my ID, my room, and my face, I could start the online exam while being filmed and supervised the whole time.

My exam consisted of 61 multiple-choice questions, which I had 120 minutes to solve. First, there were 2 case studies, followed by about fifty standalone questions. Once I had answered the questions about the case studies, this part was closed, and I could not return to it after starting the next set of questions.

The content of the questions covers different subjects, including:

  • Design and implement data storage (40-45%)
  • Design and develop data processing (25-30%)
  • Design and implement data security (10-15%)
  • Monitor and optimize data storage and data processing (10-15%)

After the exam, I had the feeling that quite a few questions covered the Synapse Analytics material, for example how to create external tables in SQL pools.

Overall, I found the questions quite difficult and was not sure whether I would pass. You need a score of 700 out of 1000 to pass the exam. Moreover, beta exams are not scored immediately. First, answers are gathered to check whether the quality of the questions meets expectations. Usually, you receive your exam score about two weeks after the exam comes out of its beta version. In my case, it took more than 5 suspenseful weeks after finishing my exam before I received the long-awaited email with ‘Congratulations’ as the subject.

Conclusion

Looking back, it was an interesting experience to take this beta exam. I believe there is more uncertainty associated with a beta exam than is the case with a normal certification exam because there is less information available about the learning material and the questions.

As for the Data Engineering subject matter, I found the material very informative, and there were several new things I could use and implement directly in my work. During the preparation and the exam, it helps a lot to already have some work experience with different Azure services. The learning paths sometimes include extensive exercises to get familiar with the tools, but this is not always the case.

Another thing that was helpful for me was the practice exam questions; they give you an idea of what the exam questions will look like. Something I would really recommend to anyone who is planning to take this exam is to solve lots of example questions and practice exams. Doing a lot of exercises will really help you master the learning material.

Update

The exam DP-203 went live on May 4, 2021, so this certification exam is no longer in beta.

Emma Willaert

In case you have questions regarding MS BI, don’t hesitate to contact Lytix for more information!  

Connecting and Using MS Graph in Azure Data Factory

Introduction

Companies are creating more and more data from which they want to gain insights. One valuable source is data from within the company itself, about the company’s own structure. For this type of data, the MS Graph API is something to look at.

The API provides a single endpoint to access all kinds of data from within your company. That could be data from Office 365 services, such as MS Teams and Outlook/Exchange, from Enterprise Mobility and Security services, such as Azure Active Directory (users, groups, …) and Identity Manager, or data from Dynamics 365. This allows MS Graph to be used for many purposes. Once the API is implemented in a pipeline, only minor adjustments are needed to reuse it for another problem.

In this post, we will explain how to use the MS Graph API in Azure Data Factory. We will do this in several parts. First, we will explain how the API authentication is done. Then we will describe how to complete an App Registration and how to store its Client Secret in an Azure Key Vault. Finally, the different steps of building the pipeline in Data Factory will be explained in detail. Before we get to all this, we will briefly discuss a user-friendly tool to start exploring the API.

MS Graph Explorer

Although the information about this API on the internet can seem overwhelming due to its many features, Microsoft provides a very nice tool to start playing with the API. MS Graph Explorer allows a user to get to know the API. Especially for users who have no experience with APIs, this tool can be very convenient. MS Graph Explorer offers a long list of sample queries, which the user can simply click on to get the desired data. After selecting a query, the API request will be displayed and a set of sample data for that particular query will be returned. MS Graph Explorer also allows you to log in with your personal Office 365 credentials, so the queries return actual data from your organization.
If you want to check out this tool, you can go to https://developer.microsoft.com/en-us/graph/graph-explorer.

MS Graph Authentication

The authentication of the MS Graph API will be done with Client Credentials. For this, an App Registration is required. Note that it is considered best practice to authenticate with a Managed Service Identity, since this makes it easier for an administrator to manage everything. When using a Managed Service Identity, the required permissions need to be assigned; however, MS Graph currently does not allow this through the Azure Portal, so it has to be done with a PowerShell script. While this is not hard, we decided to do the authentication with Client Credentials in this post, as it allows users with no PowerShell experience to complete all the steps.
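
For readers who want to try this flow outside of Data Factory first, the snippet below is a minimal Python sketch of the same client-credentials token request against the Azure AD v2.0 token endpoint. The tenant ID, client ID and client secret are placeholders; they correspond to the values you will create in the App Registration and Key Vault sections below.

```python
# Minimal sketch of the OAuth2 client-credentials flow used with MS Graph.
# The three placeholders below are assumptions: fill them in with the values
# from your own tenant and App Registration (created in the next section).
import requests

tenant_id = "<your-tenant-id>"
client_id = "<application-client-id>"
client_secret = "<client-secret>"

token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
payload = {
    "client_id": client_id,
    "client_secret": client_secret,
    "scope": "https://graph.microsoft.com/.default",
    "grant_type": "client_credentials",
}

# The token endpoint expects a form-urlencoded body, which `data=` produces.
response = requests.post(token_url, data=payload)
response.raise_for_status()
access_token = response.json()["access_token"]
print("Token received, first characters:", access_token[:20], "...")
```

This is exactly the request the Data Factory web activity will perform later on, just expressed as a standalone script.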

App Registration

Before we can use the API, we need something that will authorize the API to access the requested data. Azure App Registrations offers an easy way to do this.

  • In Azure Portal, go to ‘App Registrations’, and make a new registration.
  • When a new App Registration is created, open it and go to ‘API Permissions’ in the left ribbon.


  • Here we see an overview of the permissions of the App. By default, the App Registration has the ‘User.Read’ permission in MS Graph. This will not be sufficient, so new permissions need to be added by clicking on the ‘Add a Permission’ button.
  • In the window that opens, we click on ‘Microsoft Graph’, followed by ‘Application Permissions’. Now we get an overview of the permissions that can be granted. In our use case, we need a ‘Read All’ on both the Groups and the Users (Group.Read.All and User.Read.All). Select these and click on ‘Add Permission’ at the bottom of the window.
  • An administrator needs to grant these permissions. Once this is done, the permissions should look like the ones below.
  • Before we can continue, we need two things. First, we need the ‘Application Client Id’. This can be found in the App Registration overview. Copy this value into a temporary document, as it will be needed in a later step.

Second, we need the ‘Client Secret’. For this, we go to ‘Certificates & Secrets’ in the left ribbon, and click on ‘New client secret’. Give the Client Secret a name, set an expiration date and click on ‘Add’.

  • When the Client Secret is created, copy its Value. This value will only be visible for a few minutes. After that, it cannot be copied anymore, and a new Client Secret must be created.

Key Vault

We do not want to hard-code the Application Client Id or the Client Secret in the Azure Data Factory pipeline. To make it more secure, we will store the Client Secret in an Azure Key Vault. It is possible to store the Application Client Id in the Key Vault as well, but that will not be covered here.


  • Go to your Azure Key Vault. If you do not have a Key Vault yet, you can easily create a new one in the Azure Portal.
  • In the Key Vault, go to ‘Secrets’ in the left ribbon, and click on ‘Generate/Import’. In the window that opens, fill in a Name and the Value. The Value is the Client Secret that we copied from the App Registration.
  • When the Secret is created, click on the Secret, followed by clicking on the current version. A window will open showing the ‘Secret Identifier’. Copy this temporarily in a document for later.

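If you want to double-check the stored secret outside of Data Factory, the short Python sketch below reads it back with the Azure SDK. This assumes the `azure-identity` and `azure-keyvault-secrets` packages are installed and that your signed-in identity has permission to read secrets in this vault; the vault URL and secret name are placeholders.

```python
# Sketch: read the Client Secret back from Key Vault using the Azure SDK.
# Assumes `pip install azure-identity azure-keyvault-secrets` and that the
# signed-in identity (e.g. via `az login`) is allowed to read secrets here.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-key-vault-name>.vault.azure.net"  # placeholder
secret_name = "<your-secret-name>"                           # placeholder

credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)

secret = client.get_secret(secret_name)
print("Secret retrieved, value length:", len(secret.value))
```

In the pipeline itself we will not use the SDK: a web activity with Managed Service Identity authentication will perform the equivalent REST call against the Secret Identifier.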
Azure Data Factory

Once the previous steps are done, we can start building the ADF Pipeline where the API Call will be done.
The pipeline we will build is structured as follows. It will start with a web activity in which the App Registration Client Secret is retrieved from the Key Vault (GetAppRegSecret), followed by another web activity that generates a Bearer Token (GetBearerToken). This Bearer Token is needed to authenticate the API calls in the next two copy activities. The first copy activity will call the MS Graph API for Azure AD Groups data (GetADGroupsData), the second one will call the API for Azure AD Users data (GetADUsersData). The retrieved data will be moved to Blob storage. After this, a lookup activity will collect the group data (LookupADGroups). Finally, for each group, a copy activity will transfer the unique ids of the members to the Blob storage (ForEachGroup). Each activity is discussed in more detail below.

  • Open Azure Data Factory. Again, if the resource has not yet been created, this can easily be done in the Portal.


  • Create a new Web activity that will retrieve the App Registration Client Secret. In the ‘General’ tab, set the ‘Secure output’ to true. When the input or output of an activity is set to secure, it will not be logged for monitoring.
  • In the ‘Settings’ tab, set the URL to the secret identifier that was copied from the Key Vault. The Method is a ‘GET’. The Authentication should be set to ‘MSI’ and the Resource will be ‘https://vault.azure.net’.
  • Now we will add the Web activity that generates the Bearer Token. For this activity, set both ‘Secure Output’ and ‘Secure Input’ to true. In the ‘Settings’ tab, the URL should be ‘https://login.microsoftonline.com/<TenantID>/oauth2/v2.0/token’. Your Tenant ID can be found in the Azure Portal. The Method is ‘POST’, and for the headers we add one with Name ‘Content-Type’ and Value ‘application/x-www-form-urlencoded’. In the Body we add dynamic content, which can be seen below. In this body, the client_id is the Application Client Id of the App Registration, the scope is ‘https%3A%2F%2Fgraph.microsoft.com%2F.default’, the grant_type is ‘client_credentials’ and the client_secret is the output of the previously created activity.
  • Before we can add the Copy activities, we need to create a new dataset. In the left ribbon, go to Datasets and click on ‘New dataset’. Choose REST as data store.
  • For the Linked service, click on ‘New’.
  • Give the linked service a name, set the Base URL to ‘https://graph.microsoft.com’ and set Authentication type to ‘Anonymous’.
  • Select this Linked service for the new dataset, create a new dataset parameter and set the Relative URL to this parameter.
  • Now go back to the pipeline and add a Copy activity for the Azure AD Groups data. Set ‘Secure Input’ to true. In the Source tab, select the newly created dataset as ‘Source dataset’. For the RelativeUrl parameter, enter ‘v1.0/groups’. Add an Additional header with Name ‘Authorization’ and a Value with dynamic content, such that the value equals ‘Bearer <BearerToken>’, where the BearerToken is the output of the previous web activity. Finally, add a Pagination rule with Name ‘AbsoluteUrl’ and Value ‘$['@odata.nextLink']’. The MS Graph API allows a maximum of 999 objects per call; if there are more objects, a link for the next call is added to the response. ADF lets you handle this with Pagination rules (the sketch after this list illustrates the same pagination in Python). Note that while this Pagination rule is a handy way to solve pagination, it is currently not very flexible. If the sink is a JSON file, this Pagination rule can cause problems; for this reason, the sink is set to a Parquet file, which does not give any problems with the pagination. For more info about paging, see https://docs.microsoft.com/en-us/graph/paging
  • In the Mapping tab, select the columns you want to copy. It is important to check the box of ‘Collection reference’ for the Value.
  • Do the same for the Copy activity for the Users data, but change the RelativeUrl field in the source to ‘v1.0/users’.
  • Now we add a Lookup activity. This activity will look up all the groups with their group id. With this id, we can get the members of each group with the API request https://graph.microsoft.com/v1.0/groups/<group_id>/members
  • For the Lookup activity, fill in the required settings. The Source dataset is the sink of the previous copy activity. Set ‘First row only’ to false.
  • After the Lookup, add a ForEach activity to the pipeline. In the Settings, set the Items to the output value of the Lookup.
  • The final activity to add is a Copy activity inside the ForEach loop. The settings of this activity are the same as for the other two Copy activities, but for the RelativeUrl we add dynamic content that uses the id of each group in the ForEach loop (see the members call in the sketch after this list).
  • Now link the different activities like the image below.
  • The final step is to execute the pipeline and start playing with the freshly retrieved data.
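
To make the Copy activities, the pagination rule and the ForEach loop more concrete, the Python sketch below reproduces the same Graph calls outside of Data Factory: it lists all groups and users while following the ‘@odata.nextLink’ pages and then fetches the members of every group. The access token placeholder is assumed to come from a client-credentials request like the one shown earlier; the required Group.Read.All and User.Read.All permissions must be granted.

```python
# Sketch: the Graph calls performed by the Copy and ForEach activities,
# reproduced in Python. The access_token placeholder is an assumption and
# should come from the client-credentials example shown earlier.
import requests

access_token = "<access-token-from-the-earlier-sketch>"  # placeholder
headers = {"Authorization": f"Bearer {access_token}"}
base_url = "https://graph.microsoft.com/v1.0"

def get_all(url):
    """Collect all pages by following @odata.nextLink (the pagination rule)."""
    items = []
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        body = response.json()
        items.extend(body.get("value", []))
        url = body.get("@odata.nextLink")  # absent on the last page
    return items

groups = get_all(f"{base_url}/groups")  # corresponds to GetADGroupsData
users = get_all(f"{base_url}/users")    # corresponds to GetADUsersData

# Corresponds to ForEachGroup: fetch the members of every group by its id.
members_per_group = {
    group["id"]: get_all(f"{base_url}/groups/{group['id']}/members")
    for group in groups
}
print(len(groups), "groups and", len(users), "users retrieved")
```

The pipeline does the same work declaratively: the pagination rule plays the role of the while loop, and the ForEach activity plays the role of the final comprehension.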

Conclusion

Although this blog only covered Azure AD data, you can easily access many more insights from your Office 365 products with MS Graph. This is a big advantage of MS Graph, as it offers a single endpoint for multiple Microsoft services. All of this data can be obtained by changing a few parameters in the pipeline described above, which makes the MS Graph API an easy solution to multiple problems. A final point to note is that MS Graph can be used in multiple ways: here it is used in Azure Data Factory, but the API can just as well be called from a Python script.


If you have any questions about implementing this solution or how MS Graph can be used in your company, feel free to contact us.

About the Author: Jasper Puts


Jasper Puts is a Data Professional at Lytix. He can be found optimizing a customer’s analytics architecture and improving the accuracy of data science algorithms.

How ALM streamlines BI projects: Azure DevOps

Application Lifecycle Management (ALM) refers to a (software) development process that has been set up in a governed and easy-to-manage way. ALM provides added value to the development team, project managers and business users. While the term ‘ALM’ is mostly associated with pure software development projects (…written in 100% programming languages), BI projects (which are by nature less programmatic) can also adhere to these high standards.
In this blog you will get to know how we at Lytix adhere to ALM standards by using Azure DevOps. You will learn the benefits and see why it is a ‘must-have’ component in BI implementations.

What problems does ALM solve?

When not working in an ALM-compliant manner, you often hear the following complaints from business and development:

  • I’ve accidentally made a change to an object (report, ETL task, …) and I only made a copy three weeks ago… All my intermediate changes have been overwritten!
  • There are different versions circulating on the different environments; there are structural differences between the reports of the ‘production’ environment and the ‘business sandbox’ environment.
  • I have no idea what the development team is currently doing; while they should be prioritising bug fixing, they are probably creating unwanted features.
  • Why is it so hard to provide me a list of all changes that were made to resolve this problem?!
  • Even the simplest tests have not been carried out by the development team; if the developer had just had a quick look at the data, he would’ve known something was off.
  • Can you make sure that a key-user tested and validated before you deploy to production?
  • Manual scripts need to be run to orchestrate the deployment to production. A developer always needs to reserve some spare time to make the deployment succeed, which makes deploying stressful.

Working in an ALM-compliant way resolves these complaints. Azure DevOps is one of the tools that helps you work in a better, process-driven way throughout your development lifecycle. The functionalities ‘Boards’, ‘Pipelines’, ‘Repos’ and ‘Test Plans’ make sure that all ALM aspects are covered.

Azure Boards

Azure Boards provides Kanban boards and overviews of all kinds of work items. As standard processes, Azure Boards offers the project methodologies ‘Agile’, ‘Scrum’ and ‘CMMI’ (Capability Maturity Model Integration). Depending on the nature of the project (large, medium, small) or the business requirements (e.g. the need to follow a certain standard), you can customise these processes until they fit your needs. Manage and plan the availability of your team, plan ahead, budget sprints and prioritise tasks: all of this is possible in the Boards module.

For example, for the development of our XTL Framework, we follow the Agile methodology. Epics, features, user stories, issues and bugs are defined. Based on a sprint planning and the available days per consultant (all of which can be managed within Azure Boards), tasks are resolved. As such, a transparent view of the roadmap of our framework is always available.

Azure Repos

Repositories, or Repos, store the different versions of the code you are working on (known as version control). All changes made to your code are saved historically using a combination of branches, merges, commits, pushes and pull requests. A private repository (Git) is used to track the time of each change and the user who made it. As such, for each tiny piece of code, you can retrace which changes happened per sprint or revert your code to a specific point in time. Linking changes to your code/reports creates a tight integration between your work items and your development: retrace a code change to a specific task and get to know the reason for the alteration.

A common misconception is that ‘all Git repositories are open source and thus globally accessible’. While Azure Repos provides a Git repository, this repository is only accessible to users of the project. All users wanting to view your code need to authenticate using their Microsoft account, which often also requires two-factor authentication. Hence, you can be sure that the code you commit to your Azure Repository is only accessible to those allowed to see it.

Azure Pipelines

Azure Pipelines makes sure the written code passes some sanity tests (‘builds’ = making sure your code is consistent) and is then deployed to new environments in a structured way. Pipelines can be configured in such a way that no intervention is required from the BI team upon a deployment to a ‘higher’ environment (e.g. from QA to Production). It also makes sure that multiple environments are always kept in sync with each other.

If one of the stakeholders requires a new environment (e.g. a production-like sandbox for business), it is just a few clicks away.

For our XTL Framework, we use the following stages:

  • The ‘_XTLFramework’ stage contains a consistent version of the code. A trigger is put in place that always releases to the following phase after a code commit.
  • The ‘Build’ phase is used as a no-data sanity environment. This is a first filter implemented to reach a higher quality of code and logic.
  • At the end of each sprint, a business user accepts a new version of the code (= pre-deployment approval), which triggers an automatic deployment to the Test environment. Technical testers will try out the functionalities and check whether the requirements have been met (= post-deployment approval). Upon approval by the technical tester, the code is automatically deployed to the Quality Acceptance environment.
  • Either the requester of the change or a key business user (= functional tester) then evaluates the new functionality in the Quality Acceptance environment. Again, a formal approval by a key user is required to continue to the next stage. Upon acceptance, the code is automatically deployed to the ‘Business Evaluation’ and ‘Business Sandbox’ environments. The code is not automatically deployed to production.
  • A ‘pre-deployment’ approval by the BI product owner is required before this version of the code is deployed to production. After acceptance, the release has fully reached all environments and everything is in sync.

In the setup above, no developer or other technical intervention needs to happen once the code has been committed. Every phase is pushed forward by the owner of that specific stage.

While we have mainly used our XTL Framework as an example throughout this article, we configure CI/CD pipelines for all kinds of software and tools:

  • Cloud Components: Azure Data Factory, Azure SQL Database, Azure Synapse, Snowflake, Databricks, Azure Functions, …
  • Business Applications: PowerApps, Power BI, Power Automate (Microsoft Flow), …
  • On-Premise Applications: SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), On-Premise SQL Server Databases, …

Test Plans

To demonstrate that newly developed code has been tested and proven to be of high quality, the ‘Test Plans’ module is used. These tests can be executed either manually or automatically. A pop-up shows testers the tests they should perform and allows them to easily log deviations from the expected process. All the test results can be accessed and are linked to sprints, work items and branches, thus enabling 360° traceability of the improvements to your BI solution.

Good to Knows & Pricing

Most functionalities are free for up to five ‘contributors’ (= people using DevOps who require more elevated rights than just viewing). As we like to keep our teams lean and mean, small to medium projects can often use this service for free. If you plan on using ‘Test Plans’, it is good to know that this requires a more costly license plan; if you happen to have a Visual Studio Enterprise subscription, it is automatically included in the license. Next to that, Azure DevOps is easily accessible from other tools: you can export and manage your own data using Power BI, PowerApps and Power Automate.

Taking into account all the above functionalities, the low pricing and the seamless integration, Lytix is convinced that the Azure DevOps suite provides added value in BI projects.

Author: Sander Allert

BI Architect

Interested in a demo or a hands-on session? Wanting to talk to an architect? Send a mail to the author of this article and get in contact: sander.allert@lytix.be