Connecting and Using MS Graph in Azure Data Factory
Companies are creating more and more data, on which they want to gain insights. One valuable source of data is data from within the company itself, from the companies’ structure. For this type of data, the MS Graph API is something to look at.
The API provides a single endpoint to access all kinds of data from within your company. That could be from Office 365 services, such as MS Teams, Outlook/Exchange, … or Enterprise Mobility and Security services, such as Azure Active Directory (Users, groups, …) and Identity Manager, or data from Dynamics 365. This allows MS Graph to be used for many purposes. Once the API is implemented in a pipeline, only minor adjustments must be made if you want to use the API for another problem.
In this post, we will explain how to use the MS Graph API in Azure Data Factory. We will do this in several parts. First, we will explain how the API authentication is done. Then we will describe how to complete an App Registration and how to store its Client Secret in an Azure Key Vault. Finally, the different steps of building the pipeline in Data Factory will be explained in detail. Before we get to all this, a brief overview of a user-friendly tool to start exploring the API will be discussed.
MS Graph Explorer
Although the information about this API on the internet can seem overwhelming due to its many features, Microsoft provides a very nice tool to start playing with the API. MS Graph Explorer allows a user to get to know the API. Especially for users who have no experience with APIs, this tool can be very convenient. MS Graph Explorer offers a long list of sample queries, which the user can easily click on to get the desired data. After selecting the query, the request from the API will be displayed and a set of sample data for that particular query will be returned. MS Graph Explorer also allows you to log in with your personal Office 365 credentials, so the queries will return actual data of your organization.
If you want to check out this tool, you can go to https://developer.microsoft.com/en-us/graph/graph-explorer.
MS Graph Authentication
The authentication of the MS Graph API will be done with Client Credentials. For this, an App Registration is required. It should be noticed that it is best practice to do the authentication with a Managed Service Identity, since this makes it easier for an administrator to manage everything. When using Managed Service Identity, the required permissions need to be assigned. However, MS Graph currently does not allow to do this using Azure Portal, so this needs to be done with a PowerShell script. While this is not hard, there is decided to do the authentication with Client Credentials in this post, as it will allow users with no PowerShell experience to complete all the steps.
Before we can use the API, we need something that will authorize the API to access the requested data. Azure App Registrations offers an easy way to do this.
- In Azure Portal, go to ‘App Registrations’, and make a new registration.
- When a new App Registration is created, open it and go to ‘API Permissions’ in the left ribbon.
- Here we see an overview of the permissions of the App. By default, the App Registration has ‘User Read’ permission in MS Graph. This will not be sufficient, so new permissions need to be added. This is done by clicking on the ‘Add a Permission’ button.
- In the window that opens, we click on ‘Microsoft Graph’, followed by ‘Application Permissions’. Now we get an overview of the possible permissions that can be granted. In our use-case we need a ‘Read All’ on the Groups and on the Users. Select this and click on ‘Add Permission’ below in the window.
- An administrator needs to grant these permissions. Once this is done, the permissions should look like the ones below.
- Before we can continue, we need two things. First, we need the ‘Application Client Id’. This can be found in the App Registration Overview. Copy this value in a temporary document, as it will be needed in a later step.
Second, we need the ‘Client Secret’. For this, we go to ‘Certificates & Secrets’ in the left ribbon, and click on ‘New client secret’. Give the Client Secret a name, set an expiration date and click on ‘Add’.
- When the Client Secret is created, copy the Value. This value will only be visuable for a few minutes. After that time, it cannot be copied anymore, and a new Client Secret must be created
We do not want to place both the Application Client Id and the Client Secret hard coded in the Azure Data Factory pipeline. To make it more secure, we will store the Client Secret in an Azure Key Vault. It is possible to store the Application Client Id in the Key Vault as well, but that will not be addressed in this case.
- Go to you Azure Key Vault. If you do not have a Key Vault yet, you can easily create a new one in Azure Portal.
- In the Key Vault, go to ‘Secrets’ in the left ribbon, and click on ‘Generate/Import’. In the window that opens, fill in a Name and the Value. The Value is the Client Secret that we copied from the App Registration.
- When the Secret is created, click on the Secret, followed by clicking on the current version. A window will open showing the ‘Secret Identifier’. Copy this temporarily in a document for later.
Azure Data Factory
Once the previous steps are done, we can start building the ADF Pipeline where the API Call will be done.
The pipeline we will build is structured as follows. It will start with a web activity where the App Registration Client Secret is retrieved from the Key Vault (GetAppRegSecret), followed by another web activity that will generate a Bearer Token (GetBearerToken). This Bearer Token is needed to authenticate the API in the next two copy activities. The first copy activity will call the MS Graph API for data of Azure AD Groups (GetADGroupsData), the second one will call the API for Azure AD Users data (GetADUsersData). The retrieved data will be moved to a Blob storage. After this, a lookup activity will collect the group data (LookupADGroups). Finally, for each group, a copy activity will transfer the unique ids of the members to the Blob storage (ForEachGroup). Each activity will be discussed more in detail below.
- Open Azure Data Factory. Again, if the resource has not yet been created, this can easily be done in the Portal.
- Create a new Web activity that will retrieve the App Registration Client Secret. In the ‘General’ tab, set the ‘Secure output’ to true. When the input or output of an activity is set to secure, it will not be logged for monitoring.
- In the ‘Settings’ tab, set the URL to the secret identifier that was copied from the Key Vault. The Method is a ‘GET’. The Authentication should be set to ‘MSI’ and the Resource will be ‘https://vault.azure.net’.
- Now we will add the Web activity that generates the Bearer token. For this activity, set ‘Secure Output’ and ‘Secure Input’ both to true. In the ‘Settings tab’, the URL should be ‘https://login.microsoftonline.com/<TenantID>/oauth2/v2.0/token’. Your TenantId can be found in Azure Portal. The Method is ‘POST’ and for the header we add one with the Name ‘Content-Type’ and Value ‘application/x-www-form-urlencoded’. In the Body we add dynamic content which can be seen below. In the body, the client_id is the Application Client Id of the App Registration. The scope is ‘https%3A%2F%2Fgraph.microsoft.com%2F.default’, the grant_type is ‘client_credentials’ and the client_secret is the output of the previously created activity.
- Before we can add the Copy activities, we need to create a new dataset. In the left ribbon, go to Datasets and click on ‘New dataset’. Choose REST as data store.
- For the Linked service, click on ‘New’.
- Give the linked service a name, set the Base URL to ‘https://graph.microsoft.com’ and set Authentication type to ‘Anonymous’.
- Select this Linked service for the new dataset, create a new dataset parameter and set the Relative URL to this parameter.
- Now go back to the pipeline and add a Copy activity for the Azure AD Groups data. Set ‘Secure Input’ to true. In the Source tab, select the newly created dataset as ‘Source dataset’. For the RelativeUrl parameter, enter ‘v1.0/groups’. Add an Additional header with Name ‘Authorization’ and a Value with dynamic content, such that the value equals ‘Bearer <BearerToken>’. The BearerToken is the output of the previous web activity. Finally, add a Pagination rule with Name ‘AbsoluteUrl’ and Value ‘$[‘@odata.nextLink’]’. The MSGraph API allows a maximum of 999 objects per call. If there are more objects, a link will be added for a next call. ADF allows to solve this problem with Pagination rules. It should be noted that this Pagination rule can be nice to solve a pagination problem, but it is currently not very flexible to change. If the sink is a JSON file, this Pagination rule can cause problems. For this reason, the sink will be set to a parquet file, which will not give any problems with the pagination. For more info about the pagination, you can go to https://docs.microsoft.com/en-us/graph/paging
- In the Mapping tab, select the columns you want to copy. It is important to check the box of ‘Collection reference’ for the Value.
- Do the same for the Copy activity of data of the Users. But change the RelativeUrl field in the source to ‘v1.0/users’.
- Now we add a Lookup activity. This activity will lookup all the groups with their group id. With this id, we can get for each group the members with the API request https://graph.microsoft.com/v1.0/groups/<group_id>/members
- For the Lookup activity, fill in the required settings. The Source dataset is the sink of the previous copy activity. Set ‘First row only’ to false.
- After the Lookup, add a ForEach activity to the pipeline. In the Settings, set the Items to the output value of the Lookup.
- The final activity to add is a Copy activity in the ForEach loop. The settings of this activity are the same as the other two Copy activities, but for the RelativeUrl, we add dynamic content that we will use the Id of each group in the ForEach loop.
- Now link the different activities like the image below.
- The final step is executing the pipeline and start playing with the freshly retrieved data.
Although this blog only covered Azure AD data, you can easily access many more insights from your Office365 products with MS Graph. This is a big advantage of MS Graph, as it offers a single endpoint for multiple Microsoft services. All of this data can be easily obtained by changing a few parameters in the pipeline described above. This makes the MS Graph API an easy solution to multiple problems. A final point to note is that MS Graph can be used in multiple ways. Here it is used in Azure Data Factory, but the API can be called in a Python script as well.
If you have any questions about implementing this solution or how MS Graph can be used in your company, feel free to contact us.
About the Author: Jasper Puts
Jasper Puts is a Data Professional at Lytix. He can be found optimizing a customer’s analytics architecture and improving the accuracy of data science algorithms.