Effectively troubleshoot latency in SQL Server Transactional replication: Part 1
High level transactional replication architecture
The initial stage of transactional replication is initializing the subscriber. Although this can be done via backup, the typical approach generating a snapshot by the Snapshot Agent and storing it in the snapshot folder. It’s C:Program FilesMicrosoft SQL Server<INST>MSSQLReplData by default and configurable. Then, the Distribution Agent transfers the snapshot to the subscriber.
Afterwards, incremental changes in the published database are tracked and replicated to subscriber database. This replication process is done in three phases:
Transactions are marked “for replication” in the transaction log.
The Log Reader Agent reader thread scans through the transaction log using sp_replcmds and looks for transactions that are marked “for replication.” These transactions are then saved to the distribution database by the Log Reader agent writer thread using sp_MSadd_replcmds.
The Distribution Agent reader thread scans through the distribution database using sp_MSget_repl_commands. Then, by using the distribution writer thread, this agent connects to the subscriber to apply those changes to the subscriber using sp_MSupd…, sp_MSins…, and sp_Msdel_* (where the “*” denotes the schema and object name of the published article)..
Figure 1. Transactional Replication architecture that shows the locations of each thread and agent in the case of Remote distributor and pull subscription case
Troubleshooting steps
The following graph shows the process we use to troubleshoot. We troubleshoot by dividing the process into two parts.
Step 1. Get information about “Big Picture”
Before you dive into solving any issue, you need to fully understand the type of environment you have as there might have been changes you are unaware of. An easy way to do that is to run script SQLServer/Script Replication Topology at master · sqlserver-parikh/SQLServer (github.com) which gives output like below.
Step 2. Get tracer tokens
After confirming the environment, insert tracer tokens and identify where we are stuck. Tokens can be inserted via Replication Monitor:
For historical tracer token results, you can run below query in distributor and compare with current results, being the last row the last result:
USE Distribution
SELECT p.publication_id, p.publication, agent_id,
Datediff(s,t.publisher_commit,t.distributor_commit) as ‘Time To Dist (sec)’,
Datediff(s,t.distributor_commit,h.subscriber_commit) as ‘Time To Sub (sec)’
FROM MStracer_tokens t
JOIN MStracer_history h
ON t.tracer_id = h.parent_tracer_id
JOIN MSpublications p
ON p.publication_id = t.publication_id
NOTE –
“distribution” is the default for the distribution database. Be sure to change this if you configured a different name for the distribution database. Additionally, this history is cleaned up any time replication upgrade scripts are executed, there is a change to the distributor configuration, sp_MShistory_cleanup (depending on retention duration specified), or sp_MSdelete_tracer_history is executed (again, depending on parameters used).
If you observe latency or “Pending” status in “Publisher to Distributor”, the issue is with Log Reader agent (refer to Step 3.1). If the latency is seen in “Distributor to Subscriber” as the screenshot above, the issue is with Distribution agent (refer to Step 3.2).
Step 3.1. Troubleshoot latency in Log Reader agent
Check agent history table for any errors by specifically paying attention to comments and error text columns:
USE distribution
SELECT a.name AS agent_name,
CASE [runstatus]
WHEN 1 THEN ‘Start’
WHEN 2 THEN ‘Succeed’
WHEN 3 THEN ‘In progress’
WHEN 4 THEN ‘Idle’
WHEN 5 THEN ‘Retry’
WHEN 6 THEN ‘Fail’
END AS Status
,[start_time]
,h.[time] — The time the message is logged.
,[duration] –The duration, in seconds, of the message session.
,[comments]
,h.[xact_seqno] — The last processed transaction sequence number.
,[delivery_time] — The time first transaction is delivered.
,[delivered_transactions] –The total number of transactions delivered in the session.
,[delivered_commands] — The total number of commands delivered in the session.
,[average_commands] — The average number of commands delivered in the session.
,[delivery_rate] — The average delivered commands per second.
,[delivery_latency] — The latency between the command entering the published database and being entered into the distribution database. In milliseconds.
,[error_id] — The ID of the error in the MSrepl_error system table.
,e.error_text — error text
FROM [distribution].[dbo].[MSlogreader_history] h
JOIN MSlogreader_agents a
ON a.id = h.agent_id
LEFT JOIN MSrepl_errors e
ON e.id = h.error_id
ORDER BY h.time DESC
Furthermore, 5-min interval performance statistics have been added to the history table.
If stats state=1, both reader and writer thread of Log Reader agent are performing as expected. If state=2, writer thread is taking a long time to write changes to distribution database. In this case, you should investigate, writer thread (Step 4.2). State=3 means the reader thread is taking a long time scanning the replicated changes from the transaction log and this thread should be investigated (Step 4.1). For example, below, the writer thread is causing latency as reader thread waited for it for 300 seconds to free queue buffer for new replicated data.
Ref: Statistics for Log Reader and Distribution agents – SQL Server | Microsoft Learn
Sometimes the history table is not enough to resolve latency issues. In this case, you should enable verbose logging for detailed logs (-Output C:TempOUTPUTFILE.txt -Outputverboselevel 3 ). https://learn.microsoft.com/en-US/sql/relational-databases/replication/troubleshoot-tran-repl-errors?view=sql-server-ver16#enable-verbose-logging-on-any-agent
You can investigate the verbose detailed logs for any errors. Particularly pay attention to “Status” logs.
Furthermore, verbose logging provides 5-min interval Log Reader agent statistics as below. Check the Fetch time (reader thread performance) and Write time (writer thread performance) for any latency.
If you cannot find any error logs but you detect latencies in either reader or writer thread, go to Step 4 and check the corresponding thread. For example, in the above example, we detected high latencies in Fetch time compared to Write time. So, the issue is probably with reader thread (refer to Step 4.1 in Part 2).
Step 3.2. Troubleshoot latency in Distribution agent
Check Distribution agent history for any errors by specifically paying attention to comments and error text: USE distribution
SELECT a.name AS agent_name,
CASE [runstatus]
WHEN 1 THEN ‘Start’
WHEN 2 THEN ‘Succeed’
WHEN 3 THEN ‘In progress’
WHEN 4 THEN ‘Idle’
WHEN 5 THEN ‘Retry’
WHEN 6 THEN ‘Fail’
END AS Status
,[start_time]
,h.[time] — The time the message is logged.
,[duration] –The duration, in seconds, of the message session.
,[comments]
,h.[xact_seqno] — The last processed transaction sequence number.
,[current_delivery_rate] — The average number of commands delivered per second since the last history entry.
,[current_delivery_latency] –The latency between the command entering the distribution database and being applied to the Subscriber since the last history entry. In milliseconds.
,[delivered_transactions] –The total number of transactions delivered in the session.
,[delivered_commands] — The total number of commands delivered in the session.
,[average_commands] — The average number of commands delivered in the session.
,[delivery_rate] — The average delivered commands per second.
,[delivery_latency] — The latency between the command entering the distribution database and being applied to the Subscriber. In milliseconds.
,[total_delivered_commands] — The total number of commands delivered since the subscription was created.
,[error_id] — The ID of the error in the MSrepl_error system table.
,e.error_text — error text
FROM MSdistribution_history h
JOIN MSdistribution_agents a
ON a.id = h.agent_id
LEFT JOIN MSrepl_errors e
ON e.id = h.error_id
ORDER BY h.time DESCFurthermore, 5-min interval performance statistics have been added to the history table.
If stats state=1, both reader and writer thread of Log Reader agent are performing as expected. If state=2, writer thread is taking a long time to write changes to distribution database. In this case, you should investigate, writer thread (Step 4.2). State=3 means the reader thread is taking a long time scanning the replicated changes from the transaction log and this thread should be investigated (Step 4.1). For example, below, the writer thread is causing latency as reader thread waited for it for 300 seconds to free queue buffer for new replicated data.
Ref: Statistics for Log Reader and Distribution agents – SQL Server | Microsoft Learn
Sometimes the history table is not enough to resolve latency issues. In this case, you should enable verbose logging for detailed logs (-Output C:TempOUTPUTFILE.txt -Outputverboselevel 3 ). https://learn.microsoft.com/en-US/sql/relational-databases/replication/troubleshoot-tran-repl-errors?view=sql-server-ver16#enable-verbose-logging-on-any-agentYou can investigate the verbose detailed logs for any errors. Particularly pay attention to “Status” logs.
Furthermore, verbose logging provides distribution agent statistics as below. Check the Fetch time (reader thread performance) and Write time (writer thread performance) for any latency.
If you cannot find any error logs but you detect latencies in either reader or writer thread, go to Step 4 and check the corresponding thread. For example, in the above example, we detected high latencies in Reader thread compared to Writer thread. Therefore, the issue is with reader thread (Step 4.3). Let us continue doing the next steps in Part 2!!
Supervisor: Colling Benkler, Sr. EE for SQL Server in Microsoft.
Microsoft Tech Community – Latest Blogs –Read More