Relational Data Synchronization between environments
Relational Data Synchronization between environments
There are business and/or technical cases where relational data should be duplicated to another environment. Since the demands of those business and/or technical cases are not the same, there are multiple technical solutions to achieve the goal.
In this article, I will discuss of the various solutions according to difference business needs, with deep dive into one family of solutions – sync solutions that is based on the database engine (DB engine). The content is Azure oriented, but the same concepts are true for other clouds as well.
I would expect that anyone that needs to sync relational data between environment can find here a good guideline.
General synchronization demands
Let us start with the typical demands:
Scenario
Latency
Typical solution family
Data Warehouse
Hours to day
ETL
Data mart
Minutes to hours
DB engine Sync
High utilized DB
Seconds to minutes
DB engine Full or Sync
High availability
Seconds
DB engine Full
Disaster Recovery
Seconds to minutes
DB engine Full
Network separation
Vary
Vary
DB engine Sync is the focus if this article. See below.
Here is high level description of those solution families:
ETL (Extract,Transform,Load):
Used for populating data warehouses or data marts from production systems
Usually, the schema on the target is more reporting friendly (star schema) than the production system
The data in the target can be in delay (usually hours) compared to the source
The source and the target can be utilizing different technologies
Tools in the market: Azure Data Factory, Informatica, Ascend
DB engine full:
Built-in replica mechanism to have another copy of the full database
With or without the ability to have one or more replicas that can be utilized as a read replica
Based on high availability, log shipping, backup & restore or storage-based solutions
Used for HA/DR and or read scale operation
Minimal latency (seconds)
Same technology
Read only on the target
DB engine sync
Tools in scope: SQL Data sync, Fabric Mirroring, Replication
Those tools support partial copy of the database
See more in the next chapter
Each option has its own pros and cons and sometimes you might use more than one solution in the same project.
In the rest of this article, I will focus on the DB engine sync solutions family usage.
More information:
ETL – Extract, transform, and load
Read only Replica: Azure SQL, PostgreSQL, MySQL
DB engine Sync Solutions Family
The need:
I cannot exaggerate the importance of choosing a synchronization solution based on your specific business needs. This is the reason that multiple solutions exist – to be able to support your specific need with a good-enough solution.
A sync process is responsible for sync data between environments. To be more exact, between source and one or more targets. The different solutions might have various kinds of characteristics.
Here are typical characteristics that you might be interested in:
Various kinds of technology
Different schema
Updates on both sides (conflict might happen)
Latency between the two copies
Maintenance efforts, skills required
The level of provider/user responsibility for the sync including re-sync probability, tools and efforts
I chose three key technologies (replication, SQL data sync, Fabric Mirroring) to discuss. The discussion is based on multiple discussions with my customers.
Replication:
Very mature technology which is supported by the majority of the relational database products
Low latency – usually seconds
Multiple flavors – transactional, merge, snapshot
Different table structure in the source and target are possible with limitations but add complexity
Multiple subscribers per source are supported
Monitoring is your responsibility and in case of failure, deep knowledge is needed to avoid reinitializing
For SQL server, you have a built-in replication monitor tool. For other databases you should check.
The monitor is not doing correction actions. Failing to track the replication status might cause a non-updated target environment
Replication of the data to a database of another provider might be possible usually with limitations. You will need a third-party tool to implement such a solution. For SQL Server Heterogeneous Database Replication is deprecated.
Azure SQL database cannot be a publisher
You must have a good DBA with specific replication knowledge to maintain the system
Typical scenarios for replication:
Filtering (part of the rows and/or the columns should be replicated
Low latency needs
Cross security boundaries with SQL authentication (see in the security section)
Cross database technologies (SQL server à Oracle)
More information:
Replication: Azure SQL MI, Azure SQL DB, PostgreSQL, MySQL
SQL Data Sync for Azure:
SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple databases, both on-premises and in the cloud, but only SQL Server based.
Azure SQL Data Sync does not support Azure SQL Managed Instance or Azure Synapse Analytics at this time
Source and target should be with the exact same schema
Multiple subscribers are supported
Typical scenarios for SQL Data Sync:
Considerable number of tables to be replicated
Managed by Azure experts (limited database knowledge needed)
SaaS solution preferred
Azure SQL database source
Bi-directional synchronization
More information:
Data Sync: Overview, Best Practices
Azure SQL Data Sync | Tips and Tricks
Mirroring in Microsoft Fabric (private preview):
The target for the synced data is sorted in delta lake table format – no need for relational database
The primary business scenario is reporting on the target
The schema cannot be changed on the target
Azure Cosmos DB, Azure SQL DB and Snowflake customers will be able to use Mirroring to mirror their data in OneLake and unlock all the capabilities of Fabric Warehouse, Direct Lake Mode, Notebooks and much more.
SQL Server, Azure PostgreSQL, Azure MySQL, Mongo DB and other databases and data warehouses will be coming in CY24.
Typical scenarios for Mirroring with Microsoft Fabric:
The target is reporting only that might integrate data from multiple sources
The cost associated with maintaining another relational engine for reporting is high. This aspect is even more significant for ISVs that are managing different environments for each customer (tenant)
Azure SQL or IaaS environment
Replacing an ETL system with no code solution
Part of your OneLake data architecture
More information:
Mirroring: Announcement, Copilot, Cosmos DB
Other aspects:
For the completeness of this article, here is a brief discussion of other aspects of the solutions that you should be aware of:
Identity and Security:
In all solutions – integrate solution is the best (replication authentication and replication security , SQL Data Sync, Mirroring).
For replication, you might use SQL authentication. For Azure SQL managed instance it is necessary.
Cost:
All the solutions do not have direct cost except for the services utilized for the source and target and possible cross data centers network bandwidth utilized.
Bi-directional and conflict resolution:
The only Azure native solution support is for SQL Data Sync.
Transactional replication – bi-directional (peer to peer) is rare but has multiple options. Last write wins is the automatic way as defined here.
Note:
Peer to peer is not supported by Azure SQL database offerings
Merge replication has more options but not on Azure SQL database offerings – see here
SQL Data Sync – Hub wins or Member wins (see here)
Mirroring – one direction only , so, not applicable
Scalability and performance:
In all solutions. You can expect reasonable pressure on the source (publisher) is expected.
SQL Data Sync add triggers to the source database while replication is using log reader (less pressure).
Monitoring and sync status:
For Replication – you have replication monitor and the tablediff utility
For SQL data Sync and Fabric mirroring – Monitoring Azure SQL Data Sync using OMS Log Analytics or Azure SQL Data Sync Health Checker
Real-time vs. Batch Synchronization:
All the solutions are well suited to real-time and short transactions. However, batch will work as well with more pressure on the SQL server log.
For Data Sync, empty tables provide the best performance at initialization time. If the target table is empty, Data Sync uses bulk insert to load the data. Otherwise, Data Sync does a row-by-row comparison and insertion to check for conflicts. If performance is not a concern, however, you can set up sync between tables that already contain data.
More information:
Empty tables provide the best performance
Choosing a DB engine Sync solution
Here is a short list of criteria that might help you choose a solution:
SQL Data Sync
The best solution for Azure SQL DB
Portal/script managed
Target should be from the SQL server family
Replication
The only solution for Azure SQL Managed Instance
Customable (filtering, schema change)
Deep database knowledge required
Fabric mirroring
Your solution where the destination can be/preferred on delta lake table format
Support multi sources (Azure SQL, Cosmos, Snowflake, more to come)
Portal/script managed
More information:
Compare SQL Data Sync with Transactional Replication
Conclusion
In the realm of data management, the need to synchronize relational data across environments arises from diverse business and technical requirements. This article has delved into the various solutions available, with a particular focus on database engine-based synchronization in the Azure ecosystem.
From the high-level demands of scenarios such as Data Warehouse, Data mart, High Utilized DB, High Availability, Disaster Recovery, to the intricacies of choosing between ETL, DB engine full, and DB engine sync solutions, we’ve explored the landscape of options available.
In the family of DB engine sync solutions, we’ve highlighted the importance of aligning your choice with specific business needs. Replication, a mature technology, offers low latency and supports various scenarios, though it requires vigilant monitoring. SQL Data Sync provides bi-directional synchronization for a considerable number of tables, managed by Azure professionals, while Microsoft Fabric’s Mirroring offers a unique approach for reporting scenarios.
Considerations such as identity and security, cost implications, conflict resolution, scalability, and monitoring have been discussed to provide a holistic view. Whether you prioritize low latency, transactional consistency, or ease of management, choosing the right solution is paramount.
As you navigate the complexities of relational data synchronization, keep in mind the nuances of each solution and the unique demands of your project. Whether opting for a well-established solution like Replication or embracing innovative approaches like Mirroring with Microsoft Fabric, make an informed decision based on your specific use case.
In conclusion, successful data synchronization is not a one-size-fits-all endeavor. By understanding the characteristics, advantages, and limitations of each solution, you empower yourself to make informed decisions that align with the dynamics of your data ecosystem. Explore further, stay updated on evolving technologies, and tailor your approach to meet the ever-evolving demands of your business.
You should remember that the technology world in general and in the cloud area in particular are constantly changing. The dynamic nature of data management and the importance of staying abreast of evolving technologies only emphasize that the reader should explore emerging solutions and best practices.
Microsoft Tech Community – Latest Blogs –Read More