docs/best-practices/json_type.md (+6 -6)
@@ -8,7 +8,7 @@ description: 'Page describing when to use JSON'
ClickHouse now offers a native JSON column type designed for semi-structured and dynamic data. It's important to clarify that **this is a column type, not a data format**—you can insert JSON into ClickHouse as a string or via supported formats like [JSONEachRow](/docs/interfaces/formats/JSONEachRow), but that does not imply using the JSON column type. Users should only use the JSON type when the structure of their data is dynamic, not when they simply happen to store JSON.
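As a minimal sketch of that distinction (table and field names here are illustrative, not from the docs), the same JSONEachRow payload can feed a table with classic columns or one with a JSON-typed column - the format is JSON in both cases, but only the second table uses the JSON type:

```sql
-- Classic types: fine when the structure is known and consistent.
CREATE TABLE events_static
(
    id UInt64,
    user String
)
ENGINE = MergeTree
ORDER BY id;

-- JSON column type: only needed because the payload structure is dynamic.
CREATE TABLE events_dynamic
(
    id UInt64,
    payload JSON
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO events_static FORMAT JSONEachRow {"id": 1, "user": "alice"}
INSERT INTO events_dynamic FORMAT JSONEachRow {"id": 1, "payload": {"user": "alice", "score": 42}}
```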
-## When to Use the JSON Type {#when-to-use-the-json-type}
+## When to use the JSON type {#when-to-use-the-json-type}
Use the JSON type when your data:
@@ -24,7 +24,7 @@ If your data structure is known and consistent, there is rarely a need for the J
You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.
-## Considerations and Tips for Using JSON {#considerations-and-tips-for-using-json}
+## Considerations and tips for using JSON {#considerations-and-tips-for-using-json}
The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. To use it effectively:
@@ -33,14 +33,14 @@ The JSON type enables efficient columnar storage by flattening paths into subcol
* **Avoid setting [`max_dynamic_paths`](/sql-reference/data-types/newjson#reaching-the-limit-of-dynamic-paths-inside-json) too high** - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.
:::note Type hints
-Type hits offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for [**discriminator columns**](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#storage-extension-for-dynamically-changing-data) or dynamic resolution during query time. This means that with well-defined type hints, JSON subfields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.
+Type hints offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for [**discriminator columns**](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#storage-extension-for-dynamically-changing-data) or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.
:::
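A minimal sketch of such a declaration (the table, column, and values are hypothetical): the number of dynamic paths is capped, and two frequently queried paths are pinned to concrete types so they are stored as ordinary columns:

```sql
CREATE TABLE logs
(
    payload JSON(
        max_dynamic_paths = 1024,  -- stay well below the 10,000 rule of thumb
        user_id UInt64,            -- type hint: stored like a regular column
        ts DateTime                -- type hint: no dynamic resolution at query time
    )
)
ENGINE = MergeTree
ORDER BY tuple();
```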
## Advanced Features {#advanced-features}
* JSON columns **can be used in primary keys** like any other columns. Codecs cannot be specified for a sub-column.
* They support introspection via functions like [`JSONAllPathsWithTypes()` and `JSONDynamicPaths()`](/sql-reference/data-types/newjson#introspection-functions).
-* You can read nested sub-objects using the .^ syntax.
+* You can read nested sub-objects using the `.^` syntax.
* Query syntax may differ from standard SQL and may require special casting or operators for nested fields.
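For illustration, a few queries against a hypothetical `payload JSON` column in a table named `logs` (the introspection functions are from the link above; the names are invented):

```sql
-- Which paths and types have been observed in this column?
SELECT JSONAllPathsWithTypes(payload) FROM logs LIMIT 1;

-- Which paths are stored dynamically, i.e. without type hints?
SELECT JSONDynamicPaths(payload) FROM logs LIMIT 1;

-- Read a nested sub-object with the .^ syntax.
SELECT payload.^server FROM logs LIMIT 1;
```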
For additional guidance, see [ClickHouse JSON documentation](/sql-reference/data-types/newjson) or explore our blog post [A New Powerful JSON Data Type for ClickHouse](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse).
@@ -62,7 +62,7 @@ Consider the following JSON sample, representing a row from the [Python PyPI dat
}
```
-Lets assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (json row per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types.
+Let's assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (one JSON object per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types.
```sql
CREATE TABLE pypi (
@@ -153,7 +153,7 @@ INSERT INTO arxiv FORMAT JSONEachRow
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]}
```
-Suppose another column is added `tags`. If this was simply a list of strings we could model as an `Array(String)`, but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). Our modified JSON document:
+Suppose another column called `tags` is added. If this were simply a list of strings, we could model it as an `Array(String)`, but let's assume users can add arbitrary tag structures with mixed types (notice `score` is a string or an integer). Our modified JSON document:

docs/best-practices/minimize_optimize_joins.md (+4 -4)
@@ -26,7 +26,7 @@ For a full guide on denormalizing data in ClickHouse see [here](/data-modeling/d
## When JOINs are required {#when-joins-are-required}
-When JOINs are required, ensure you're using **at least 24.12 and preferably the latest**, as JOIN performance continues to improve. As of ClickHouse 24.12, the query planner now automatically places the smaller table on the right side of the join for optimal performance - a task that previously had to be done manually. Even more enhancements are coming soon, including more aggressive filter pushdown and automatic re-ordering of multiple joins.
+When JOINs are required, ensure you’re using **at least version 24.12 and preferably the latest version**, as JOIN performance continues to improve with each new release. As of ClickHouse 24.12, the query planner now automatically places the smaller table on the right side of the join for optimal performance - a task that previously had to be done manually. Even more enhancements are coming soon, including more aggressive filter pushdown and automatic re-ordering of multiple joins.
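A sketch of what that manual ordering looks like on pre-24.12 versions (table names are hypothetical): the right-hand side is used to build the in-memory hash table, so it should be the smaller table:

```sql
SELECT o.id, c.name
FROM orders AS o            -- large table on the left
INNER JOIN customers AS c   -- small table on the right
    ON o.customer_id = c.id;
```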
Follow these best practices to improve JOIN performance:
32
32
@@ -44,13 +44,13 @@ When using dictionaries for JOINs in ClickHouse, it's important to understand th
## Choosing the right JOIN Algorithm {#choosing-the-right-join-algorithm}
-ClickHouse supports several join algorithms that trade off between speed and memory:
+ClickHouse supports several JOIN algorithms that trade off between speed and memory:
* **Parallel Hash JOIN (default):** Fast for small-to-medium right-hand tables that fit in memory.
* **Direct JOIN:** Ideal when using dictionaries (or other table engines with key-value characteristics) with `INNER` or `LEFT ANY JOIN` - the fastest method for point lookups as it eliminates the need to build a hash table.
* **Full Sorting Merge JOIN:** Efficient when both tables are sorted on the join key.
* **Partial Merge JOIN:** Minimizes memory but is slower—best for joining large tables with limited memory.
-* **Grace Hash JOIN:** Flexible and can memory-tunable, good for large datasets with adjustable performance characteristics.
+* **Grace Hash JOIN:** Flexible and memory-tunable, good for large datasets with adjustable performance characteristics.
<Image img={joins} size="md" alt="Joins - speed vs memory"/>
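The algorithm can also be selected per query via the `join_algorithm` setting; a minimal sketch, assuming the same hypothetical tables as above:

```sql
-- Force Grace Hash when the right-hand table may not fit in memory.
SELECT count()
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.id
SETTINGS join_algorithm = 'grace_hash';
```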
@@ -66,4 +66,4 @@ For optimal performance:
* Avoid more than 3–4 joins per query.
* Benchmark different algorithms on real data - performance varies based on JOIN key distribution and data size.
-For more on JOIN optimization strategies, join algorithms, and how to tune them, refer to the [ClickHouse documentation](/guides/joining-tables) and this [blog series](https://clickhouse.com/blog/clickhouse-fully-supports-joins-part1).
+For more on JOIN optimization strategies, JOIN algorithms, and how to tune them, refer to the [ClickHouse documentation](/guides/joining-tables) and this [blog series](https://clickhouse.com/blog/clickhouse-fully-supports-joins-part1).

docs/best-practices/partionning_keys.md (+3 -3)
@@ -46,19 +46,19 @@ With partitioning enabled, ClickHouse only [merges](/merges) data parts within,
## Applications of partitioning {#applications-of-partionning}
-Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level deletes or copy operations. Partitioning also integrates cleanly with ClickHouse features like TTL and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
+Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like TTL and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
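A hedged sketch of such metadata operations, assuming a hypothetical `logs` table partitioned by `toYYYYMM(timestamp)` and a storage policy with a volume named `cold`:

```sql
-- Drop an entire month of data in a single metadata operation.
ALTER TABLE logs DROP PARTITION 202401;

-- Move an older partition to a cheaper storage volume.
ALTER TABLE logs MOVE PARTITION 202402 TO VOLUME 'cold';
```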
While partitioning can improve query performance for some workloads, it can also negatively impact response time.
If the partitioning key is not in the primary key and you are filtering by it, users may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.
54
54
55
55
Conversely, if queries need to query across partitions, performance may be negatively impacted due to a higher number of total parts. For this reason, users should understand their access patterns before considering partitioning as a query optimization technique.
-In summary, users should primarily think of partitioning as a data management technique. For an example of managing data, see [here](/observability/managing-data) and [here](/partitions#data-management).
+In summary, users should primarily think of partitioning as a data management technique. For examples of managing data, see ["Managing Data"](/observability/managing-data) from the observability use-case guide and ["What are table partitions used for?"](/partitions#data-management) from Core Concepts - Table partitions.
## Choose a low cardinality partitioning key {#choose-a-low-cardinality-partitioning-key}
-Importantly, a higher number of parts will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of parts exceeds [limits either in total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).
+Importantly, a higher number of parts will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of parts exceeds specified limits either in [total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).
Choosing the right **cardinality** for the partitioning key is critical. A high-cardinality partitioning key - where the number of distinct partition values is large - can lead to a proliferation of data parts. Since ClickHouse does not merge parts across partitions, too many partitions will result in too many unmerged parts, eventually triggering the “Too many parts” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost.
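A minimal sketch of a low-cardinality choice (the schema is illustrative): partitioning by month yields only ~12 partitions per year, whereas partitioning by something like a user ID could create millions:

```sql
CREATE TABLE logs_by_month
(
    timestamp DateTime,
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)  -- low cardinality: one partition per month
ORDER BY timestamp;
```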

docs/best-practices/selecting_an_insert_strategy.md (+6 -6)
@@ -12,10 +12,10 @@ import async_inserts from '@site/static/images/bestpractices/async_inserts.png';
import AsyncInserts from '@site/docs/best-practices/_snippets/_async_inserts.md';
import BulkInserts from '@site/docs/best-practices/_snippets/_bulk_inserts.md';
-Efficient data ingestion is a basis of high-performance ClickHouse deployments. Selecting the right insert strategy can dramatically impact throughput, cost, and reliability. This section outlines best practices, tradeoffs, and configuration options to help you make the right decision for your workload.
+Efficient data ingestion forms the basis of high-performance ClickHouse deployments. Selecting the right insert strategy can dramatically impact throughput, cost, and reliability. This section outlines best practices, tradeoffs, and configuration options to help you make the right decision for your workload.
:::note
-The following assumes you are pushing data to ClickHouse via a client. If you are pulling data into ClickHouse e.g. using built in table functions such as [s3](/sql-reference/table-functions/s3) and [gcs](/sql-reference/table-functions/gcs), we recommend [this guide](/integrations/s3/performance).
+The following assumes you are pushing data to ClickHouse via a client. If you are pulling data into ClickHouse, e.g. using built-in table functions such as [s3](/sql-reference/table-functions/s3) and [gcs](/sql-reference/table-functions/gcs), we recommend our guide ["Optimizing for S3 Insert and Read Performance"](/integrations/s3/performance).
:::
## Synchronous inserts by default {#synchronous-inserts-by-default}
@@ -53,7 +53,7 @@ Using the values from that formatted data and the target table's [DDL](/sql-refe
Synchronous inserts are also **idempotent**. When using MergeTree engines, ClickHouse will deduplicate inserts by default. This protects against ambiguous failure cases, such as:
@@ -67,7 +67,7 @@ In both cases, it's safe to **retry the insert** - as long as the batch contents
For sharded clusters, you have two options:
* Insert directly into a **MergeTree** or **ReplicatedMergeTree** table. This is the most efficient option when the client can perform load balancing across shards. With `internal_replication = true`, ClickHouse handles replication transparently.
-* Insert into a Distributed table. This allows clients to send data to any node and let ClickHouse forward it to the correct shard. This is simpler but slightly less performant due to the extra forwarding step. `internal_replication = true` is still recommended.
+* Insert into a [Distributed table](/engines/table-engines/special/distributed). This allows clients to send data to any node and let ClickHouse forward it to the correct shard. This is simpler but slightly less performant due to the extra forwarding step. `internal_replication = true` is still recommended.
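A hedged sketch of the second option (the cluster, database, and table names are hypothetical):

```sql
-- Local table created on every shard of the cluster.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    id UInt64,
    ts DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- Distributed "front" table: inserts are forwarded to a shard chosen by rand().
CREATE TABLE events_all ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());

INSERT INTO events_all VALUES (1, now());
```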
**In ClickHouse Cloud all nodes read and write to the same single shard. Inserts are automatically balanced across nodes. Users can simply send inserts to the exposed endpoint.**
@@ -89,7 +89,7 @@ Compressing insert data reduces the size of the payload sent over the network, m
For inserts, compression is especially effective when used with the Native format, which already matches ClickHouse's internal columnar storage model. In this setup, the server can efficiently decompress and directly store the data with minimal transformation.
-#### Use LZ4 for Speed, ZSTD for Compression Ratio {#use-lz4-for-speed-zstd-for-compression-ratio}
+#### Use LZ4 for speed, ZSTD for compression ratio {#use-lz4-for-speed-zstd-for-compression-ratio}
ClickHouse supports several compression codecs during data transmission. Two common options are:
@@ -102,7 +102,7 @@ Best practice: Use LZ4 unless you have constrained bandwidth or incur data egres
In tests from the [FastFormats benchmark](https://clickhouse.com/blog/clickhouse-input-format-matchup-which-is-fastest-most-efficient), LZ4-compressed Native inserts reduced data size by more than 50%, cutting ingestion time from 150s to 131s for a 5.6 GiB dataset. Switching to ZSTD compressed the same dataset down to 1.69 GiB, but increased server-side processing time slightly.
Compression not only reduces network traffic—it also improves CPU and memory efficiency on the server. With compressed data, ClickHouse receives fewer bytes and spends less time parsing large inputs. This benefit is especially important when ingesting from multiple concurrent clients, such as in observability scenarios.
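As a hedged illustration of choosing the transport codec for the native protocol (defaults vary by client and version; the level value is arbitrary):

```sql
-- Prefer ZSTD over the network when bandwidth is constrained or egress is billed.
SET network_compression_method = 'ZSTD';
SET network_zstd_compression_level = 3;
```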