Commit aa8ab9b

add anchors
1 parent 0677941 commit aa8ab9b

4 files changed: +30 −33 lines changed

docs/integrations/data-ingestion/clickpipes/mysql/controlling_sync.md

Lines changed: 6 additions & 6 deletions
@@ -12,28 +12,28 @@ import cdc_syncs from '@site/static/images/integrations/data-ingestion/clickpipe
This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.

-## Overview
+## Overview {#overview-mysql-sync}

Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.

There are two main ways to control the sync of a database ClickPipe. The ClickPipe will start pushing when one of the below settings kicks in.

-### Sync interval
+### Sync interval {#interval-mysql-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time taken to push the pulled records to ClickHouse is not included in this interval.

The default is **1 minute**.
The sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.

-### Pull batch size
+### Pull batch size {#batch-size-mysql-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records here means inserts, updates and deletes done on the tables that are part of the pipe.

The default is **100,000** records.
A safe maximum is 10 million.
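The two settings act as a whichever-comes-first trigger: a batch is pushed once the sync interval elapses or the pull batch size is reached. A minimal sketch of that pull loop (hypothetical names and a stub source, not the actual ClickPipes code):

```python
import time

def pull_one_batch(source, sync_interval_s=60, pull_batch_size=100_000):
    """Pull records until the sync interval elapses or the batch
    reaches the pull batch size, whichever happens first."""
    batch = []
    deadline = time.monotonic() + sync_interval_s
    while time.monotonic() < deadline and len(batch) < pull_batch_size:
        record = source.next_record(timeout=deadline - time.monotonic())
        if record is not None:
            batch.append(record)
    return batch  # this batch is then pushed to ClickHouse in one go

# Demo with a stub source that yields a fixed number of change records
class StubSource:
    def __init__(self, n):
        self.n = n
    def next_record(self, timeout):
        if self.n > 0:
            self.n -= 1
            return {"op": "insert"}
        time.sleep(min(max(timeout, 0), 0.01))  # idle wait, no new records
        return None

print(len(pull_one_batch(StubSource(10), sync_interval_s=0.05, pull_batch_size=3)))  # 3: batch size hit first
print(len(pull_one_batch(StubSource(1), sync_interval_s=0.05, pull_batch_size=3)))   # 1: interval elapsed first
```

Either trigger ends the pull phase; the push phase then runs separately, which is why push time is excluded from the interval.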

-### An exception: Long-running transactions on source
+### An exception: Long-running transactions on source {#transactions-mysql-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.

-### Configuring sync settings
+### Configuring sync settings {#configuring-mysql-sync}
You can set the sync interval and pull batch size when you create a ClickPipe or edit an existing one.
When creating a ClickPipe, the settings appear in the second step of the creation wizard, as shown below:
<img src={create_sync_settings} alt="Create sync settings" />

@@ -44,7 +44,7 @@ When editing an existing ClickPipe, you can head over to the **Settings** tab of
This will open a flyout with the sync settings, where you can change the sync interval and pull batch size:
<img src={edit_sync_settings} alt="Edit sync settings" />

-### Monitoring sync control behaviour
+### Monitoring sync control behaviour {#monitoring-mysql-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that the duration here includes push time; if there are no rows incoming, the ClickPipe waits, and that wait time is also included in the duration.

<img src={cdc_syncs} alt="CDC Syncs table" />

docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md

Lines changed: 8 additions & 12 deletions
@@ -10,16 +10,12 @@ import partition_key from '@site/static/images/integrations/data-ingestion/click
This document explains how parallelized snapshot/initial load works in the MySQL ClickPipe, and describes the snapshot parameters that can be used to control it.

-:::info This feature is currently behind a feature flag
-Please reach out to us via a support ticket to enable this feature for your ClickHouse organization.
-:::
-
-## Overview
+## Overview {#overview-mysql-snapshot}

Initial load is the first phase of a CDC ClickPipe, where the ClickPipe syncs the historical data of the tables in the source database over to ClickHouse, before then starting CDC. A lot of the time, developers do this in a single-threaded manner.
However, the MySQL ClickPipe can parallelize this process, which can significantly speed up the initial load.

-### Partition key column
+### Partition key column {#key-mysql-snapshot}

You should see the below setting in the ClickPipe table picker (both during creation and editing of a ClickPipe):
<img src={partition_key} alt="Partition key column" />
@@ -30,24 +26,24 @@ The MySQL ClickPipe uses a column on your source table to logically partition th
The partition key column must be indexed in the source table to see a good performance boost. This can be verified by running `SHOW INDEX FROM <table_name>` in MySQL.
:::

-### Logical partitioning
+### Logical partitioning {#logical-partitioning-mysql-snapshot}

Let's talk about the below settings:

<img src={snapshot_params} alt="Snapshot parameters" />

-#### Snapshot number of rows per partition
+#### Snapshot number of rows per partition {#numrows-mysql-snapshot}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.

-#### Initial load parallelism
+#### Initial load parallelism {#parallelism-mysql-snapshot}
This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it at a reasonable value depending on your source instance specs, to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.

-#### Snapshot number of tables in parallel
+#### Snapshot number of tables in parallel {#tables-parallel-mysql-snapshot}
While not strictly related to parallel snapshot, this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that this is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
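Putting the three settings together: the partition count per table comes from the rows-per-partition setting, and the peak number of concurrent source reads is the partition parallelism multiplied by the number of tables snapshotted in parallel. A quick sketch of the arithmetic (the function and table names are illustrative, not part of the product):

```python
from math import ceil

def snapshot_plan(table_rows, rows_per_partition=100_000,
                  initial_load_parallelism=4, tables_in_parallel=1):
    """Estimate partitions per table and peak concurrent source reads."""
    partitions = {t: ceil(n / rows_per_partition) for t, n in table_rows.items()}
    peak_parallel_reads = initial_load_parallelism * tables_in_parallel
    return partitions, peak_parallel_reads

# Example: two tables snapshotted together, 4 partitions each in flight
partitions, peak = snapshot_plan(
    {"orders": 1_250_000, "users": 90_000},
    tables_in_parallel=2,
)
print(partitions)  # {'orders': 13, 'users': 1}
print(peak)        # 8 concurrent reads, matching the 4 partitions x 2 tables example
```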

-### Monitoring parallel snapshot in MySQL
+### Monitoring parallel snapshot in MySQL {#monitoring-parallel-mysql-snapshot}
You can run **SHOW PROCESSLIST** in MySQL to see the parallel snapshot in action. The ClickPipe will create multiple connections to the source database, each reading a different partition of the source table. If you see **SELECT** queries with different ranges, it means that the ClickPipe is reading the source tables. You can also see the COUNT(*) and the partitioning query here.

-### Limitations
+### Limitations {#limitations-parallel-mysql-snapshot}
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
docs/integrations/data-ingestion/clickpipes/postgres/controlling_sync.md

Lines changed: 8 additions & 7 deletions
@@ -12,27 +12,28 @@ import cdc_syncs from '@site/static/images/integrations/data-ingestion/clickpipe
This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.

-## Overview
+## Overview {#overview-pg-sync}

Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.

There are two main ways to control the sync of a database ClickPipe. The ClickPipe will start pushing when one of the below settings kicks in.

-### Sync interval
+### Sync interval {#interval-pg-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time taken to push the pulled records to ClickHouse is not included in this interval.

The default is **1 minute**.
The sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.

-### Pull batch size
+### Pull batch size {#batch-size-pg-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records here means inserts, updates and deletes done on the tables that are part of the pipe.

The default is **100,000** records.
+A safe maximum is 10 million.
-### An exception: Long-running transactions on source
+### An exception: Long-running transactions on source {#transactions-pg-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.

-### Configuring sync settings
+### Configuring sync settings {#configuring-pg-sync}
You can set the sync interval and pull batch size when you create a ClickPipe or edit an existing one.
When creating a ClickPipe, the settings appear in the second step of the creation wizard, as shown below:
<img src={create_sync_settings} alt="Create sync settings" />

@@ -43,12 +44,12 @@ When editing an existing ClickPipe, you can head over to the **Settings** tab of
This will open a flyout with the sync settings, where you can change the sync interval and pull batch size:
<img src={edit_sync_settings} alt="Edit sync settings" />

-### Tweaking the sync settings to help with replication slot growth
+### Tweaking the sync settings to help with replication slot growth {#tweaking-pg-sync}
Let's talk about how to use these settings to handle a large replication slot of a CDC pipe.
The time to push to ClickHouse does not scale linearly with the time spent pulling from the source database. This can be leveraged to reduce the size of a large replication slot:
by increasing both the sync interval and the pull batch size, the ClickPipe will pull a large amount of data from the source database in one go, and then push it to ClickHouse.
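To see why larger, less frequent batches can help drain a backlog, assume (purely for illustration) that each sync pays a fixed per-batch overhead on top of a per-row cost. The constants below are made up, not measured ClickPipes numbers:

```python
from math import ceil

def drain_time_s(backlog_rows, batch_size,
                 per_batch_overhead_s=5.0, per_row_cost_s=0.00001):
    """Rough model: every batch pays a fixed overhead, so fewer,
    larger batches spend less total time on per-batch overhead."""
    batches = ceil(backlog_rows / batch_size)
    return batches * per_batch_overhead_s + backlog_rows * per_row_cost_s

# Draining a hypothetical 10M-row replication slot backlog:
print(drain_time_s(10_000_000, batch_size=100_000))    # ~600 s (100 batches)
print(drain_time_s(10_000_000, batch_size=1_000_000))  # ~150 s (10 batches)
```

Under this toy model, the per-row work is unchanged; only the repeated per-batch overhead shrinks, which is the intuition behind raising both settings together.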

-### Monitoring sync control behaviour
+### Monitoring sync control behaviour {#monitoring-pg-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that the duration here includes push time; if there are no rows incoming, the ClickPipe waits, and that wait time is also included in the duration.

<img src={cdc_syncs} alt="CDC Syncs table" />

docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md

Lines changed: 8 additions & 8 deletions
@@ -9,33 +9,33 @@ import snapshot_params from '@site/static/images/integrations/data-ingestion/cli
This document explains how parallelized snapshot/initial load works in the Postgres ClickPipe, and describes the snapshot parameters that can be used to control it.

-## Overview
+## Overview {#overview-pg-snapshot}

Initial load is the first phase of a CDC ClickPipe, where the ClickPipe syncs the historical data of the tables in the source database over to ClickHouse, before then starting CDC. A lot of the time, developers do this in a single-threaded manner - such as using pg_dump or pg_restore, or using a single thread to read from the source database and write to ClickHouse.
However, the Postgres ClickPipe can parallelize this process, which can significantly speed up the initial load.

-### CTID column in Postgres
+### CTID column in Postgres {#ctid-pg-snapshot}
In Postgres, every row in a table has a unique identifier called the CTID. This is a system column that is not visible to users by default, but it can be used to uniquely identify rows in a table. The CTID is a combination of the block number and the offset within the block, which allows for efficient access to rows.

-### Logical partitioning
+### Logical partitioning {#logical-partitioning-pg-snapshot}
The Postgres ClickPipe uses the CTID column to logically partition source tables. It obtains the partitions by first performing a COUNT(*) on the source table, followed by a window-function partitioning query to get the CTID ranges for each partition. This allows the ClickPipe to read the source table in parallel, with each partition being processed by a separate thread.
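Since a CTID is a (block number, offset) pair, CTID partitioning amounts to splitting the table's block range into chunks. A simplified sketch of the idea, assuming a known average of rows per block (the real ClickPipe derives the ranges with a COUNT(*) plus a window-function query, so treat this as an illustration only):

```python
def ctid_ranges(total_blocks, rows_per_block, rows_per_partition=100_000):
    """Split the block range [0, total_blocks) into CTID ranges that
    each cover roughly rows_per_partition rows."""
    blocks_per_partition = max(1, rows_per_partition // rows_per_block)
    ranges, start = [], 0
    while start < total_blocks:
        end = min(start + blocks_per_partition, total_blocks)
        # each range would be scanned with a predicate like:
        #   WHERE ctid >= '(start,0)'::tid AND ctid < '(end,0)'::tid
        ranges.append((start, end))
        start = end
    return ranges

# e.g. a table of 2,500 blocks at ~100 rows/block, default 100k rows/partition
print(ctid_ranges(2500, rows_per_block=100))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Each resulting range can be scanned by a separate connection, which is what makes the CTID a convenient partitioning handle.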

Let's talk about the below settings:

<img src={snapshot_params} alt="Snapshot parameters" />

-#### Snapshot number of rows per partition
+#### Snapshot number of rows per partition {#numrows-pg-snapshot}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.

-#### Initial load parallelism
+#### Initial load parallelism {#parallelism-pg-snapshot}
This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it at a reasonable value depending on your source instance specs, to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.

-#### Snapshot number of tables in parallel
+#### Snapshot number of tables in parallel {#tables-parallel-pg-snapshot}
While not strictly related to parallel snapshot, this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that this is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.

-### Monitoring parallel snapshot in Postgres
+### Monitoring parallel snapshot in Postgres {#monitoring-parallel-pg-snapshot}
You can query **pg_stat_activity** to see the parallel snapshot in action. The ClickPipe will create multiple connections to the source database, each reading a different partition of the source table. If you see **FETCH** queries with different CTID ranges, it means that the ClickPipe is reading the source tables. You can also see the COUNT(*) and the partitioning query here.

-### Limitations
+### Limitations {#limitations-parallel-pg-snapshot}
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
