Tuning logical replication on Azure Database for PostgreSQL – Flexible Server
Native logical replication is a built-in feature which was introduced in PostgreSQL 10. It is aimed to replicate data between tables in two separate Postgres databases using a replication identity, usually a primary key. It follows a publisher and subscriber model where there can be one or more publishers or subscribers. The publisher and subscriber databases can be on the same Azure Database for Postgres – flexible server or on two different servers.
The typical use-cases for logical replication are:
Sending incremental changes in a single database or a subset of a database to subscribers as they occur.
Sending the change to subscribers could also help in event driven architectures where the subscriber can process and publish the change to, for example, message brokers like Kafka etc.
Consolidating multiple databases into one data source (for analytical purposes).
Replicating data between different major versions of PostgreSQL.
Replicating between PostgreSQL instances on different platforms (for example Linux to Windows)
Replicating data between servers hosted in different clouds or between on-prem and cloud.
A typical flow involves copying the initial data snapshot from the publisher database to subscriber database and, once the data snapshot copy is completed, any incremental changes on the publisher database are sent to the subscriber database in near real-time.
Using a use-case scenario this blog will share tuning techniques that can be used to improve the performance of the initial data snapshot copy from the publisher to the subscriber database.
The blog is further divided into the following sections:
Server parameters
Use-case scenario
Best practices
Server parameters
The parameters that are used to configure and tune logical replication are as follows:
wal_level: Sets the level of information written to the Write Ahead Log (WAL) to logical.
max_worker_processes: Sets the maximum number of background processes that the system can support. The default value is 8. You can set this value to a number which is greater than the number of subscribers and reserve some for-table data synchronization and parallel apply worker.
max_logical_replication_workers: Specifies maximum number of workers to replicate table data between publisher and subscriber. It includes leader apply workers, parallel apply workers, and table synchronization workers. The default value is 4. Logical replication workers are taken from the pool defined by max_worker_processes.
max_replication_slots: The default is 10. You can set this parameter depending on the number of publishers and subscribers. It can be influenced by the value set to parameter max_sync_workers_per_subscription.
max_sync_workers_per_subscription: Controls the amount of parallelism of the initial data copy during the subscription initialization or when new tables are added. Currently, there can be only one synchronization worker per table. The workers are dependent on max_logical_replication_workers. The default value is 2.
The recommended value is generally about 50% – 75% of the number of vCores on the subscriber server. For example, if the subscriber server is an instance of Azure Database for PostgreSQL Flexible Server running on a Memory Optimized tier, and with an SKU of 16 vCores, then the value of the parameter can be set to 8. This indicates 8 tables can be copied parallelly during initial data copy. To make this change you need to set max_replication_slots, max_worker_processes and max_logical_replication_workers parameters also. A good value for max_worker_processes could be 16 (one worker per one core of CPU), max_replication_slots could be 16 (8 for table data synchronization + additional slots for other purposes) and max_logical_replication_workers could be 12 (for table data synchronization + parallel apply worker workers for in-flight transactions).
Note we recommend 50% -75% of total vCores because, if the number of workers doing initial copy is equal to number of vCores, then you could see 100% CPU utilization on the subscriber server. 50%-75% is just a general recommendation, you can set a higher value but making it too aggressive might result in 100% CPU utilization during initial data copy process on subscriber servers.
max_wal_senders: The maximum number of simultaneously running WAL sender processes. Default is 10. It is recommended to set a value equal to or greater than the number of logical replication slots.
Use-case scenario
In the use case scenario, the publisher database has 6+ tables on Postgres version 16, and the goal is to complete the initial snapshot as fast as possible.
The publisher server is 16 vCore General Purpose Azure Database for Postgres – flexible server with 1 TB storage, and the subscriber server is also a 16 vCore General Purpose Azure Database for Postgres – flexible server with 1 TB storage.
Setup On publisher server:
Server parameters:
wal_level: Logical
max_replication_slots: 12
Execute the following statements on the publisher database:
Grant required permissionsALTER ROLE <role name> WITH REPLICATION;
Note: Please make sure the replication user has read permissions on tables you are trying to replicate.
Create a publicationCREATE PUBLICATION publisher1 FOR TABLE member_claims, member_claims_May,member_claims_April,member_claims_March,member_claims_Feb,member_claims_January;
Setup on subscriber server:
Server parameters:
wal_level: Logical
max_logical_replication_workers: 12
max_worker_processes: 16
max_sync_workers_per_subscription: 6
Autovacuum: OFF
max_wal_size: 64 (GB)
checkpoint_timeout: 1(Hour)
Execute the following statements on the subscriber database:
Create a subscriptionCREATE SUBSCRIPTION <subscription name> CONNECTION’host=xx dbname=xx user=xx password=xx’ PUBLICATION <publisher name>;
Monitoring
Publisher database
Execute the following statement SELECT application_name, wait_event_type, wait_event, query, backend_type FROM pg_stat_activity WHERE state=’active’;
The output should look like this where 6 tables are being copied parallelly
Subscriber database
Execute the following statement to monitor the progress of the initial data copy. SELECT * FROM pg_stat_progress_copy;
For more details on pg_stat_progress_copy you can follow the document COPY-PROGRESS-REPORTING.
Best Practices
In general, it is recommended to have a publisher database and subscriber database on two different servers.
Follow the server parameter guidelines described above.
During the initial data copy phase, you can turn OFF autovacuum, increase max_wal_size to 64 GB if you have a storage size of 1 TB or greater than that [Generally recommended for large storage sizes] and have checkpoint_timeout parameter set to 1 Hour. [We would like the checkpoint to happen less frequently as checkpoint process consumes additional server resources for that purpose 1 hour is recommended. This is again a general recommendation value can be less than 1 hour, it depends on data volume being moved and you can also take into consideration time to recover in case there is a crash during initial data copy phase before coming up with a value]
Note: Once the initial copy is completed it is recommended to vacuum analyze the tables and then turn back autovacuum to ON. The checkpoint_timeout and max_wal_size parameters should also be changed back to previous values.
vacuumdb utility can be used to manually vacuum analyze the tables. For more information follow the document – vacuumdb.
Conclusion
Initial data copy after logical replication is set up between publisher and subscriber databases can be time consuming, particularly when the database size is large with hundreds of tables or database size is large with few large tables. With proper tuning of server parameters mentioned in the blog and having the publisher and subscriber databases on separate servers, you can help reduce the time required significantly.
If you have questions, please contact us at ASK Azure DB for PostgreSQL.
Microsoft Tech Community – Latest Blogs –Read More