Step-by-step: Gather a detailed dataset on SharePoint Sites using MGDC and Fabric
0. Overview
This blog shows a step-by-step guide to getting SharePoint Sites information using the Microsoft Graph Data Connect for SharePoint and the Microsoft Fabric. This includes detailed instructions on how to extract SharePoint and OneDrive site information and use that to run analytics for your tenant.
If you follow these steps, you will have a Microsoft Fabric Report like the one shown below, which includes number of sites by type, and total storage used by type. You can also use the many other properties available in the SharePoint Sites dataset.
To get there, you can split the process into 3 distinct parts:
Set up your tenant for the Microsoft Graph Data Connect
Get data about SharePoint Sites using Microsoft Fabric
Create a report on these sites using Microsoft Fabric
Note: Following these instructions will create Azure resources and this will add to your tenant’s Azure bill. For more details on pricing, see How can I estimate my Azure bill?
1. Setting up the Microsoft Graph Data Connect
The first step in the process is to enable the Microsoft Graph Data Connect and its prerequisites. You will need to do a few things to make sure everything is ready to run the pipeline:
Enable Data Connect in your Microsoft 365 Admin Center. This is where your Tenant Admin will check the boxes to enable the Data Connect and enable the use of SharePoint datasets.
Create an application identity to run your pipelines. This is an application created in Microsoft Entra Id which will be granted the right permission to access MGDC.
Create an Azure Resource Group for all the resources we will use for Data Connect, like the Azure Storage account and the Azure Synapse workspace.
Create a Fabric Workspace and Fabric Lakehouse to store your MGDC data.
Add your Microsoft Graph Data Connect application in the Azure Portal. Your Microsoft Graph Data Connect application needs to be associated with a subscription, resource group, Fabric Workspace and Fabric Lakehouse.
Finally, your Global Administrator needs to use the Microsoft Admin Center to approve the Microsoft Graph Data Connect application access.
Let us look at each one of these.
1a. Enable the Microsoft Graph Data Connect
The first step is to go into the Microsoft 365 Admin Center and enable the Microsoft Graph Data Connect.
Navigate to the Microsoft 365 Admin Center at http://admin.microsoft.com/ and make sure you are signed in as a Global Administrator.
Select the option to Show all options on the left.
Click on Settings, then on Org settings.
Select the settings for Microsoft Graph Data Connect.
Check the box to turn Data Connect on.
Make sure to also check the box to enable access to the SharePoint and OneDrive datasets.
IMPORTANT: You must wait 48 hours to onboard your tenant and another 48 hours for the initial data collection and curation. For example, if you check the boxes on August 1st, you will be able to run your first data pull on August 5th, targeting the data for August 3rd. You can continue with the configuration, but do not trigger your pipeline before that.
1b. Create the Application Identity
You will need to create an Application in Microsoft Entra ID (formerly Azure Active Directory) and setup an authentication mechanism, like a certificate or a secret. You will use this Application later when you configure the pipeline.
Here are the steps:
Navigate to the Azure Portal at https://portal.azure.com
Find the Microsoft Entra ID service in the list of Azure services.
Select the option for App Registration on the list on the left.
Click the link to New Registration to create a new one.
Enter an app name, select “this organizational directory only” and click on the Register button.
On the resulting screen, select the link to Add a certificate or secret.
Select the “Client secrets” tab and click on the option for New client secret.
Enter a description, select an expiration period, and click the Add button.
Copy the secret value (there is a copy button next to it). We will need that secret value later. Secret values can only be viewed immediately after creation. Save the secret before leaving the page.
Click on the Overview link on the left to view the details about your app registration.
Make sure to copy the Directory (tenant) ID and the Application (client) ID, found on the Application’s Overview page. We will need those values later as well.
1c. Create the Azure Resource Group
You will need to create an Azure Resource Group for all the resources we will use for Data Connect, including the Storage Account and Synapse Workspace.
Here are the steps.
Navigate to the Azure Portal at https://portal.azure.com
Find the Resource Groups in the list of Azure services.
Click on the Create link to create a new resource group.
Select a name and a region.
IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant.
Click on Review + Create, make sure you have everything correctly entered and click Create.
1d. Create a Fabric Workspace and Fabric Lakehouse
Next, you will need to create a Microsoft Fabric Workspace. This is where you will you’re your data-related items like your pipelines and your Lakehouse.
Here are the steps:
Navigate to the Microsoft Fabric Portal at https://fabric.microsoft.com
I the main Microsoft Fabric page, select the option for “Data Engineering”
Find the “Workspaces” icon on the bar on the left and click on it.
At the very bottom of the workspace list, select the option for “+ New workspace”. You may already have a “My workspace”, but it generally best to have a separate workspace for MGDC so it’s easier to organize your projects and collaborate.
Give your workspace a name, a description and a license mode. Then click “Apply”.
In your empty workspace, using the option for “+ New” and then select “Lakehouse”.
Give your Lakehouse a name and create it.
Please note down the name of your workspace and the name of the Lakehouse. You will need those in upcoming steps.
1e. Add your Microsoft Graph Data Connect application
Your Microsoft Graph Data Connect application needs to be associated to a subscription, resource group, application identity, Fabric workspace, Fabric Lakehouse and datasets. This will define everything that the app will need to run your pipelines.
Here are the steps:
Search for the “Microsoft Graph Data Connect” service in the Azure Portal at https://portal.azure.com or navigate directly to https://aka.ms/MGDCinAzure to get started.
Select the option to Add a new application.
Under Application ID, select the one from step 1b and give it a description.
Select Microsoft Fabric for Compute Type.
Select Copy Activity for Activity Type.
Fill the form with the correct Subscription and Resource Group (from step 1c).
Under Destination Type, select Fabric Lakehouse.
Fill the form with the correct Workspace and Fabric Lakehouse (from step 1d).
Click on “Next: Datasets”.
In the dataset page, under Dataset, select BasicDataSet_v0.SharePointSites_v1.
Under Columns, select all.
Click on “Review + Create”.
Click “Create” to finish.
You will now see the app in the list for Graph Data Connect.
1f. Approve the Microsoft Graph Data Connect Application
Your last step in this section is to have a Global Administrator approve the Microsoft Graph Data Connect application.
Make sure this step is performed by a Global administrator who is not the same user that created the application.
Navigate to the Microsoft 365 Admin Center at http://admin.microsoft.com/
Select the option to Show all options on the left.
Click on Settings, then on Org settings.
Click on the tab for Security & privacy.
Select the option for settings for Microsoft Graph Data Connect applications.
You will see the app you defined with the status Pending Authorization.
Double-click the app name to start the authorization.
Follow the wizard to review the app data, the datasets, the columns and the destination, clicking Next after each screen.
In the last screen, click on Approve to approve the app.
NOTE: The Global administrator that approves the application cannot be the same user that created the application. If it is, the tool will say “app approver and developer cannot be the same user.”
Run a Pipeline
Next, you will configure a pipeline in Microsoft Fabric. We will use Synapse here. You will trigger this pipeline to pull SharePoint data from Microsoft 365 and drop it on the Azure Storage account. Here is what you will need to do:
Go to the Fabric Workspace and create a pipeline
Define the source (Dataset from MGDC)
Define the destination (Table in the Lakehouse)
Save and run the pipeline.
Monitor the pipeline to make sure it has finished running.
Let us look at each one of these.
2a. Go to the Fabric Workspace and create a pipeline
Navigate to the Microsoft Fabric Portal at https://fabric.microsoft.com
I the main Microsoft Fabric page, select the option for “Data Engineering”
Find the “Browse” icon on the bar on the left and click on it.
Find the Fabric workspace you defined in step 1d and click on it.
In the workspace, click on “+ New” item.
Select “Data pipeline”.
Give the new pipeline a name and click on “Create”
Select the “Copy data assistant” option
2b. Define the source (Dataset from MGDC)
In the first step of the “Copy data assistant”, search for 365 to help you find that source.
Select the source called “Microsoft365”.
To create a new connection to the Microsoft365 data source, enter a new connection name, select “Service principal” for Authentication kind and the credentials for the Application registration you created in step 1b, including Tenant ID, Service principal client ID and Service principal Key.
Click on “Next” and wait for the dataset information to load.
Select “BasicDataSet_v0.SharePointSites_v1” as the table.
Keep the default scope (SharePoint datasets do not use scope filters).
Select SnapshotDate as a column filter and select a date. Since we are doing a full pull, you should use the same date for Start time and End time.
Click on “Next”
IMPORTANT: Valid dates go from 23 days ago to 2 days ago.
IMPORTANT: You cannot query dates before the date when you enabled SharePoint dataset collection.
2c. Define the destination (Table in the Lakehouse)
Next, you will choose a data destination.
To start choosing a destination, select the Lakehouse you created in step 1d under the OneLake data hub (not the Azure blobs or ADLS destinations).
Select the option to “Load to new table”, which will show the column mappings for the Sites dataset.
You can keep the default name “BasicDataSet_v0.SharePointSites_v1” or use something simpler like “Sites”.
2d. Save and run the pipeline
The last step in the copy data definition is to review the details
Click on the “Save + Run” button to save the new task and run it immediately.
2e. Monitor the pipeline to make sure it has finished running
After the assistant ends, you land on the definition of the pipeline, where you can see the copy data activity on the top and you can monitor the running pipeline at the bottom.
At this point, your pipeline state should be “queued”, “initializing” or “in progress”. Later it will go into “extracting data” and “persisting data”.
Wait for the pipeline to run. This should take around 20 minutes to run, maybe more if your tenant is large.
After everything runs, the status will show as “Succeeded”.
2f. View the newly created Lakehouse table
After your pipeline finishes running, you can see the new table
Select “Browse” icon on the bar on the left, then click on the Lakehouse.
Create a Fabric report
The last step is to use the data you just got to build a Power BI dashboard. You will need to:
Create a semantic model
Create a new Fabric report
Add a visualization
3a. Create a new semantic model
Find the “Browse” icon on the bar on the left and click on it.
Find the Lakehouse you defined in step 1d and click on it.
In the Lakehouse, select the option to create a “new semantic model”.
Give the semantic model a name and expand the tree to find the newly created “Sites” table. Make sure you select it.
Click “Confirm” to create the semantic model
3b. Create a new Fabric report
Find the “Browse” icon on the bar on the left and click on it.
Find the Fabric workspace you defined in step 1d and click on it.
In the workspace, click on “+ New” item and select “Report”.
Select the option to “pick a published semantic model”
Select the semantic model and use the option to “create a blank report”
You will end up with an empty report in Power BI, with panels on the right for Filters, Visualizations and Data (from the semantic model).
3d. Add a visualization
Click on the stacked bar chart in the Visualizations to add to the report
Resize the bar chart visualization
Add the “Id” column to the y-axis property
Add the “RootWeb.WebTemplate” to the x-axis property
Use the “File”, “Save as” menu option to name the report “Site Summary”
The final report will show
Conclusion
You have triggered your first pipeline and created your first report using MGDC with the Microsoft Fabric. Now there is a lot more that you could do.
Here are a few suggestions:
Investigate the many datasets in the Microsoft Graph Data Connect, which you can easily use in your Microsoft Fabric workspace.
Trigger your pipeline on a schedule, to always have fresh data in your storage account.
Use a Delta pull to get only the data that has changed since your last pull.
Extend your pipeline to do more, like join multiple data sources.
Share your Report with other people in your tenant.
You can read more about the Microsoft Graph Data Connect for SharePoint at https://aka.ms/SharePointData. There you will find links to many details, including a list of datasets available, complete with schema definitions and samples.
Microsoft Tech Community – Latest Blogs –Read More