Microsoft Fabric Metadata Driven Pipelines with Mirrored Databases
Microsoft Fabric’s database mirroring feature is a game changer for organizations using Azure SQL DB, Azure Cosmos DB or Snowflake in their cloud environment! Mirroring offers near real-time replication with just a few clicks AND for at no cost!
Features:
No data movement cost when mirroring
No storage cost for mirrored tables
No consumption of Fabric Capacity Units
Mirror all tables in your source database or just a few, with the capability to add more tables as your Fabric analytics environment grows
Source data continuously replicated with no data pipelines to configure
Data landed in delta tables in One Lake which are optimized by default
SQL endpoint and default semantic model automatically created
What can you do?
Run near real-time queries against the SQL endpoint with no impact to your source system since the data is replicated
Share the mirrored database across your Fabric tenant
Run cross database queries from within the mirrored database SQL endpoint or from One Lake when the mirrored database is shared
Create SQL views over mirrored data, which can include joins or unions with data from other mirrored databases, warehouses or lakehouse SQL endpoints
Configure row-level security and object level security against the mirrored data
Create Direct Lake near real-time Power BI reports against the default semantic model or against a new model
Copy mirrored data into a lakehouse and used in Spark notebooks or Data Science workloads
And… build faster, simpler metadata driven pipelines! Which is the focus of this article!
Metadata Driven Pipelines with Microsoft Fabric Mirrored Databases
Metadata-driven pipelines in Microsoft Fabric enable you to streamline data ingestion and transformations with minimal coding, lower maintenance, and enhanced scalability. And when your source is Azure SQL DB, Azure Cosmos, or Snowflake, this becomes even easier!
Architecture Overview
With Mirroring, the data is replicated and readily available in OneLake. This eliminates the pipeline which brings data into Fabric. After mirroring is configured:
A SQL endpoint is automatically created, allowing you and your users to query the data in near real time without impacting source data performance
A default semantic model is also automatically created, allowing you and your users to create near real time Power BI reports. However, best practice is to create a new semantic model, which provides more features and options than the default semantic model
A Fabric Data Warehouse is created to contain:
The fact and dimension tables (star schema) which simplifies and optimizes the analytical data model for SQL querying and semantic models
SQL views and stored procedures used in data transformations
The metadata table which holds the information on how to transform load each fact or dimension table
The Fabric Data Pipelines are created and scheduled to perform:
A Lookup activity on the metadata table to get the information on how to load each fact or dimension table in the Fabric Data Warehouse
Copy Data activities for full loads with the source being a SQL view over the mirrored tables and the destination a table in the Fabric Data Warehouse
Stored Procedure activities for incremental loads to merge the latest data from the mirrored data source into the Data Warehouse destination table
A default semantic model is automatically created over the Fabric Data Warehouse. But create a new one per best practices
Build Power BI reports for analytical reporting
Why 2 semantic models? Fabric Data Warehouse vs Mirrored Database SQL Endpoint for semantic model reporting
For this architecture, I considered using:
Just the mirrored tables
SQL views over the mirrored tables
A new Fabric Data Warehouse with data loaded from the mirrored tables
Since SQL views over mirrored tables always resort to Direct Query rather than Direct Lake, I decided against building a semantic model over SQL views. Instead, I created a semantic model over the mirrored tables and a separate Fabric Data Warehouse and semantic mode over it.
The semantic model over the mirrored tables:
Is used only by users who understand the source data schema
Allows for near-real time access for data to answer ad-hoc questions that are not analytical in nature such as the order shipment status of a particular order number or current stock availability of a specific product at a certain location
Incur no data storage cost but could be offset by consumption of Capacity Units if reports/queries are complex
Can leverage Power BI Direct Lake connection for faster report performance if the report is not too complex; if the query is too complex, it will resort to Direct Query
Could include other lakehouses, mirrored databases, or data warehouse tables in the SQL endpoint and thus the semantic model but Power BI reports using these tables will always resort to direct query rather than direct lake connect
Can include complex relationships between tables and unexpected results may be returned if the semantic model and/or reports are not configured correctly
The semantic model over the Fabric Data Warehouse:
Requires scheduled data refreshes but will be relatively fast since the source data is already in Fabric
Is best for more analytical questions such as “What was our sales revenue by month by customer location?” or “What is our days on hand for products in a particular shipping warehouse?”
Allows for user friendly data warehouse table and column names rather than using the potentially cryptic mirrored database table and column names
Eliminates snowflake schema, allowing for better performing reports and delivery of consistent results without having to understand complex relationships and filtering rules
Is more likely to leverage Direct Lake connection in Power BI reports since model is simpler
Allows other data sources to be loaded into the same warehouse, eliminating cross database joins in reports that automatically resort to direct query
Leverages a simpler star schema model resulting in faster reports with less consumption of capacity units
This solution addresses two key use cases: providing near real-time responses to queries about specific data transactions and statuses, and delivering rapid analytics over large datasets. Continue reading to learn how to implement this architecture in your own environment.
Solution details
Below are detailed steps to build the metadata pipeline. The data source is the Wide World Importers SQL database, which you can download here. Then follow the instructions to import into an Azure SQL DB.
Configure database mirroring
From the Synapse Data Warehouse experience, choose the Mirrored Azure SQL DB Option:
Then choose the tables to mirror:
After the mirroring has started, the main canvas will say “Mirrored Azure SQL Database is running’; Click on Monitor replication to see the number of rows replicated and the last completion time:
At this point, both a SQL analytics endpoint and default semantic model are created (1a). But I created a new semantic model(1b) and set up the table relationships:
2. Create a Fabric Data Warehouse
Create the fact tables or any other tables that will be incrementally loaded (2a). You can also manually create the dimension tables or tables are full loaded OR you can specify to auto-create the tables in the pipeline Copy Data activity, as I do later.
Create the views over the mirrored database tables and stored procedures (2b):
Create and load the metadata table (2c) with information on how to load each fact or dimension table:
3. Create the data pipelines to load the fact and dimension tables
Below is the orchestrator pipeline:
Set variable – set pipelinestarttime to the current date/time. This is logged in the metadata driven pipeline table for each table
Lookup – get the table load attributes from metadata table for each table to load
For each table to load
Invoke the pipeline, passing in the current row object and the date/time the orchestrator pipeline started:
Load warehouse table pipeline
Set variable pipeline start time for tracking the time of each table load
If activity – check if full or incremental load
If full load
Use Copy Data Activity to load the Data Warehouse table, set the pipeline end time variable and update the metadata table with load information
Copy data activity
Source settings reference the view over the mirrored database:
Destination settings reference data warehouse table:
Note that the data warehouse table will be dropped and re-created each time
Set the pipeline end time variable
Run Script to update the pipeline run details for this table
If not a full load, then run the incremental load activities
Lookup activity calls a stored procedure to insert or update new or changed records into the destination table. The value for the StartDate parameter is the latest date of the previous load of this table. The value for the EndDate parameter is usually a null value and only set if there is a need to load or reload a subset of data.
The stored procedure performs an insert or update, depending upon whether or not the key value exists in the destination. Only the records from the source that have changed since the last table load are selected. This reduces the number of updates performed.
The stored procedure returns how many rows were inserted or updated, along with the latest transaction data of the data loaded, which is needed for the next incremental load.
Set the pipeline end time
Script activity updates the table load details:
4. Build a new semantic model in the Fabric/Power BI service
Create relationships between the tables, DAX calculations, dimension hierarchies, display formats – anything you need for your analytics
Note that all tables have Direct Lake connectivity as noted by the dashed, blue line. Direct Lake has the performance of Import semantic models without the overhead of refreshing the data.
5. Create reports
Create reports from your semantic model
Continue on building out more reports and dashboards, setting up security, scheduling data warehouse refresh (which will now be super fast since the source data is already in Fabric), creating apps, adding more data sources – whatever it takes to get the analytics your organization needs into Fabric!
Mirroring – Microsoft Fabric | Microsoft Learn
What is data warehousing in Microsoft Fabric? – Microsoft Fabric | Microsoft Learn
Data Factory in Microsoft Fabric documentation – Microsoft Fabric | Microsoft Learn
Work with semantic models in Microsoft Fabric – Training | Microsoft Learn
Create reports in the Power BI – Microsoft Fabric | Microsoft Learn
Dimensional modeling in Microsoft Fabric Warehouse – Microsoft Fabric | Microsoft Learn
If your source database is not supported for mirroring (yet!), check out these other articles I wrote:
Metadata Driven Pipelines for Microsoft Fabric – Microsoft Community Hub
Microsoft Tech Community – Latest Blogs –Read More