Commit 893315f

add explicit anchors
1 parent cddbb80 commit 893315f

1 file changed

+8
-8
lines changed

docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md

Lines changed: 8 additions & 8 deletions
@@ -9,33 +9,33 @@ import snapshot_params from '@site/static/images/integrations/data-ingestion/cli

This document explains how parallelized snapshot/initial load works in the Postgres ClickPipe and describes the snapshot parameters that control it.

-## Overview
+## Overview {#overview}

Initial load is the first phase of a CDC ClickPipe, in which the ClickPipe syncs the historical data of the source tables to ClickHouse before starting CDC. Developers often do this in a single-threaded manner, for example with pg_dump and pg_restore, or with a single thread that reads from the source database and writes to ClickHouse.
However, the Postgres ClickPipe can parallelize this process, which can significantly speed up the initial load.

-### CTID column in Postgres
+### CTID column in Postgres {#ctid-column-postgres}
In Postgres, every row in a table has a unique identifier called the CTID. It is a system column that is hidden by default (it is not returned by `SELECT *`), but it can be selected explicitly to identify rows in a table. The CTID combines the block number and the row's offset within that block, which allows efficient access to rows.

-### Logical partitioning
+### Logical partitioning {#logical-partitioning}
The Postgres ClickPipe uses the CTID column to logically partition source tables. It obtains the partitions by first performing a COUNT(*) on the source table, followed by a window function partitioning query to get the CTID ranges for each partition. This allows the ClickPipe to read the source table in parallel, with each partition being processed by a separate thread.
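
To make the partitioning idea concrete, here is a minimal Python sketch. It is not the ClickPipe's actual query or code: the function names `num_partitions` and `ctid_ranges` are hypothetical, and CTIDs are modeled as plain `(block, offset)` tuples held in memory instead of being computed by a window function in Postgres.

```python
import math


def num_partitions(total_rows: int, rows_per_partition: int) -> int:
    """Number of partitions needed to cover total_rows (the COUNT(*) result)."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    return math.ceil(total_rows / rows_per_partition)


def ctid_ranges(ctids, rows_per_partition):
    """Split CTIDs into consecutive chunks and return (first, last) per chunk.

    Assumption: each CTID is a (block, offset) tuple, so sorting the tuples
    orders rows by their physical position.
    """
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    ordered = sorted(ctids)
    ranges = []
    for start in range(0, len(ordered), rows_per_partition):
        chunk = ordered[start:start + rows_per_partition]
        ranges.append((chunk[0], chunk[-1]))
    return ranges


# Five rows, two rows per partition: three CTID ranges to read independently.
sample = [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2)]
partitions = ctid_ranges(sample, rows_per_partition=2)
```

Each resulting range could then be handed to a separate reader, which is the property that makes the parallel read possible.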

The following settings control this behavior:

<img src={snapshot_params} alt="Snapshot parameters" />

-#### Snapshot number of rows per partition
+#### Snapshot number of rows per partition {#snapshot-number-of-rows-per-partition}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and each chunk will be processed in parallel. The default value is 100,000 rows per partition.

-#### Initial load parallelism
+#### Initial load parallelism {#initial-load-parallelism}
This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.

-#### Snapshot number of tables in parallel
+#### Snapshot number of tables in parallel {#snapshot-number-of-tables-in-parallel}
This setting is not specific to parallel snapshot, but it controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that this is on top of the partition parallelism: with an initial load parallelism of 4 and 2 tables in parallel, the ClickPipe will read up to 8 partitions at the same time.
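
The way the two parallelism settings combine can be written out as simple arithmetic. The helper below is a hypothetical illustration, not part of any ClickPipes API; its defaults mirror the values stated above (4 partitions in parallel, 1 table in parallel).

```python
def max_concurrent_partition_reads(
    initial_load_parallelism: int = 4,
    tables_in_parallel: int = 1,
) -> int:
    """Upper bound on partitions read at once across all tables.

    Each table being snapshotted can have up to initial_load_parallelism
    partitions in flight, so the two settings multiply.
    """
    if initial_load_parallelism < 1 or tables_in_parallel < 1:
        raise ValueError("both settings must be at least 1")
    return initial_load_parallelism * tables_in_parallel


# Parallelism of 4 with 2 tables in parallel: up to 8 partitions at once.
concurrent_reads = max_concurrent_partition_reads(4, 2)
```

This upper bound is a useful number to compare against the connection and CPU headroom of the source instance before raising either setting.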

-### Monitoring parallel snapshot in Postgres
+### Monitoring parallel snapshot in Postgres {#monitoring-parallel-snapshot-in-postgres}
You can query **pg_stat_activity** to see the parallel snapshot in action. The ClickPipe will create multiple connections to the source database, each reading a different partition of the source table. If you see **FETCH** queries with different CTID ranges, it means that the ClickPipe is reading the source tables. You can also see the COUNT(*) and the partitioning query there.

-### Limitations
+### Limitations {#limitations}
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
