**docs/integrations/data-ingestion/clickpipes/mysql/controlling_sync.md** (+6 −6)
This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.
## Overview {#overview-mysql-sync}
Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.
There are two main ways to control the sync of a database ClickPipe. The ClickPipe will start pushing when one of the below settings kicks in.
### Sync interval {#interval-mysql-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time taken to push the pulled records to ClickHouse is not included in this interval.
The default is **1 minute**.
The sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.
### Pull batch size {#batch-size-mysql-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records include inserts, updates, and deletes performed on the tables that are part of the pipe.
The default is **100,000** records.
A safe maximum is 10 million.
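The interaction between the two settings can be sketched as follows. This is an illustrative model only, not ClickPipes internals; the function and parameter names are hypothetical. A batch is pushed as soon as either the sync interval elapses or the pull batch size is reached, whichever comes first:

```python
from dataclasses import dataclass

@dataclass
class SyncConfig:
    sync_interval_s: int = 60       # default: 1 minute
    pull_batch_size: int = 100_000  # default: 100,000 records

def should_push(records_pulled: int, elapsed_s: float, cfg: SyncConfig) -> bool:
    """A batch is pushed when EITHER limit is hit, whichever comes first."""
    return records_pulled >= cfg.pull_batch_size or elapsed_s >= cfg.sync_interval_s

cfg = SyncConfig()
# Quiet source: the interval fires long before the batch fills up.
print(should_push(records_pulled=1_200, elapsed_s=60, cfg=cfg))   # True
# Busy source: the batch fills up well before the interval elapses.
print(should_push(records_pulled=100_000, elapsed_s=4, cfg=cfg))  # True
# Neither limit reached yet: keep pulling.
print(should_push(records_pulled=50_000, elapsed_s=30, cfg=cfg))  # False
```

On a quiet source the sync interval dominates (small, regular batches); on a busy source the batch size dominates (frequent full batches).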
### An exception: Long-running transactions on source {#transactions-pg-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.
### Monitoring sync control behaviour {#monitoring-mysql-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that this duration includes push time; if there are no rows incoming, the ClickPipe waits, and that wait time is also included in the duration.
**docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md** (+8 −12)
This document explains how parallelized snapshot/initial load works in the MySQL ClickPipe and describes the snapshot parameters that can be used to control it.
## Overview {#overview-mysql-snapshot}
Initial load is the first phase of a CDC ClickPipe, where the ClickPipe syncs the historical data of the tables in the source database over to ClickHouse before starting CDC. Developers often do this in a single-threaded manner.
However, the MySQL ClickPipe can parallelize this process, which can significantly speed up the initial load.
### Partition key column {#key-mysql-snapshot}
Once this feature is enabled, you should see the below setting in the ClickPipe table picker (both during creation and editing of a ClickPipe):
The MySQL ClickPipe uses a column on your source table to logically partition the source table.
The partition key column must be indexed in the source table to see a good performance boost. You can verify this by running `SHOW INDEX FROM <table_name>` in MySQL.
#### Snapshot number of rows per partition {#numrows-mysql-snapshot}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
#### Initial load parallelism {#parallelism-mysql-snapshot}

This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
#### Snapshot number of tables in parallel {#tables-parallel-mysql-snapshot}
While not strictly related to parallel snapshot, this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that this is on top of the partition parallelism, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
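The arithmetic behind these settings can be sketched as below. This is an illustrative calculation with hypothetical function and parameter names; the real ClickPipe adjusts partitioning internally:

```python
import math

def snapshot_parallel_reads(table_rows: int,
                            rows_per_partition: int = 100_000,
                            initial_load_parallelism: int = 4,
                            tables_in_parallel: int = 1) -> tuple[int, int]:
    """Return (partitions for one table, partition reads in flight overall)."""
    # Each partition holds at most rows_per_partition rows.
    partitions = math.ceil(table_rows / rows_per_partition)
    # Per-table partition parallelism multiplies with table-level parallelism.
    in_flight = min(partitions, initial_load_parallelism) * tables_in_parallel
    return partitions, in_flight

# A 1M-row table splits into 10 partitions; with parallelism 4 and
# 2 tables loading at once, 8 partition reads run concurrently.
print(snapshot_parallel_reads(1_000_000, tables_in_parallel=2))  # (10, 8)
```

This mirrors the 4 partitions × 2 tables = 8 parallel reads example above.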
### Monitoring parallel snapshot in MySQL {#monitoring-parallel-mysql-snapshot}
You can run **SHOW PROCESSLIST** in MySQL to see the parallel snapshot in action. The ClickPipe creates multiple connections to the source database, each reading a different partition of the source table. If you see **SELECT** queries with different ranges, the ClickPipe is reading the source tables. You can also see the `COUNT(*)` and the partitioning query here.
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
**docs/integrations/data-ingestion/clickpipes/postgres/controlling_sync.md** (+8 −7)
This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.
## Overview {#overview-pg-sync}
Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.
There are two main ways to control the sync of a database ClickPipe. The ClickPipe will start pushing when one of the below settings kicks in.
### Sync interval {#interval-pg-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time taken to push the pulled records to ClickHouse is not included in this interval.
The default is **1 minute**.
The sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.
### Pull batch size {#batch-size-pg-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records include inserts, updates, and deletes performed on the tables that are part of the pipe.
The default is **100,000** records.
A safe maximum is 10 million.
### An exception: Long-running transactions on source {#transactions-pg-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.
### Tweaking the sync settings to help with replication slot growth {#tweaking-pg-sync}
Let's talk about how to use these settings to handle a large replication slot of a CDC pipe.
The pushing time to ClickHouse does not scale linearly with the pulling time from the source database. This can be leveraged to reduce the size of a large replication slot.
By increasing both the sync interval and the pull batch size, the ClickPipe will pull a large amount of data from the source database in one go and then push it to ClickHouse, which helps the replication slot drain faster.
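The non-linear scaling can be illustrated with a toy cost model. The numbers and names here are hypothetical, assumed purely for illustration and not a model of ClickPipes internals: if each sync pays a fixed per-sync overhead plus a per-row cost, larger and less frequent batches amortize the overhead and achieve higher effective throughput:

```python
def sync_throughput(batch_size: int,
                    per_sync_overhead_s: float,
                    per_row_cost_s: float) -> float:
    """Records per second for one pull+push cycle under a toy cost model:
    a fixed per-sync overhead plus a per-row processing cost."""
    return batch_size / (per_sync_overhead_s + batch_size * per_row_cost_s)

# Hypothetical costs: 5 s fixed overhead per sync, 10 µs per row.
small = sync_throughput(100_000, per_sync_overhead_s=5.0, per_row_cost_s=1e-5)
large = sync_throughput(1_000_000, per_sync_overhead_s=5.0, per_row_cost_s=1e-5)
# Larger, less frequent batches amortize the fixed overhead.
print(large > small)  # True
```

Under such a model, a pipe that falls behind can catch up (and shrink its replication slot) by syncing bigger batches less often rather than small batches frequently.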
### Monitoring sync control behaviour {#monitoring-pg-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that this duration includes push time; if there are no rows incoming, the ClickPipe waits, and that wait time is also included in the duration.
**docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md** (+8 −8)
This document explains how parallelized snapshot/initial load works in the Postgres ClickPipe and describes the snapshot parameters that can be used to control it.
## Overview {#overview-pg-snapshot}
Initial load is the first phase of a CDC ClickPipe, where the ClickPipe syncs the historical data of the tables in the source database over to ClickHouse before starting CDC. Developers often do this in a single-threaded manner, such as using pg_dump/pg_restore, or using a single thread to read from the source database and write to ClickHouse.
However, the Postgres ClickPipe can parallelize this process, which can significantly speed up the initial load.
### CTID column in Postgres {#ctid-pg-snapshot}
In Postgres, every row in a table has a unique identifier called the CTID. This is a system column that is not visible to users by default, but it can be used to uniquely identify rows in a table. The CTID is a combination of the block number and the offset within the block, which allows for efficient access to rows.
The Postgres ClickPipe uses the CTID column to logically partition source tables. It obtains the partitions by first performing a COUNT(*) on the source table, followed by a window function partitioning query to get the CTID ranges for each partition. This allows the ClickPipe to read the source table in parallel, with each partition being processed by a separate thread.
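The partitioning idea can be sketched as below. This is a simplified illustration that splits a table's heap blocks into contiguous CTID ranges; the function and parameter names are hypothetical, and the actual ClickPipe derives its ranges via the `COUNT(*)` plus window-function query described above:

```python
def ctid_block_ranges(total_blocks: int, blocks_per_partition: int):
    """Split a table's heap blocks into contiguous ranges.

    Each (lo, hi) pair stands for a CTID range — CTIDs are (block, offset)
    pairs — that a separate worker could scan in parallel, e.g. with a
    predicate like: WHERE ctid >= '(lo,0)' AND ctid < '(hi,0)'.
    """
    ranges = []
    lo = 0
    while lo < total_blocks:
        hi = min(lo + blocks_per_partition, total_blocks)
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(ctid_block_ranges(total_blocks=10, blocks_per_partition=4))
# [(0, 4), (4, 8), (8, 10)]
```

Because each range maps to contiguous heap blocks, workers read disjoint parts of the table without coordinating with each other.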
#### Snapshot number of rows per partition {#numrows-pg-snapshot}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
#### Initial load parallelism {#parallelism-pg-snapshot}

This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
#### Snapshot number of tables in parallel {#tables-parallel-pg-snapshot}
While not strictly related to parallel snapshot, this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that this is on top of the partition parallelism, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
### Monitoring parallel snapshot in Postgres {#monitoring-parallel-pg-snapshot}
You can inspect **pg_stat_activity** to see the parallel snapshot in action. The ClickPipe creates multiple connections to the source database, each reading a different partition of the source table. If you see **FETCH** queries with different CTID ranges, the ClickPipe is reading the source tables. You can also see the `COUNT(*)` and the partitioning query here.
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.