Replies: 3 comments 2 replies
Hey @g9yuayon! Yes, if the timestamp wasn't accurately parsed, it can definitely affect your compression ratio. I was able to successfully compress your example log using:

```
sbin/compress.sh --timestamp-key '\@timestamp' test.jsonl
```

To start with, can you give that a shot and see if things improve? We can then move on to seeing whether the compression of some other fields can be improved.
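If it's easier to drive that from a script, here's a minimal sketch of the same invocation (the script path and file name are taken from the command above; adjust them for your install):

```python
import subprocess

# Run the package's compression script with the escaped timestamp key.
# The backslash before '@' is passed through literally, matching the
# quoted '\@timestamp' in the shell command above.
result = subprocess.run(
    ["sbin/compress.sh", "--timestamp-key", r"\@timestamp", "test.jsonl"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print("compression failed:", result.stderr)
else:
    print(result.stdout)
```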
Thanks, @kirkrodrigues! Escaping fixed the timestamp parsing and I reran the command. The compression ratio increased to 25X.
Hey @g9yuayon,

Below are a few general things you could try out quickly that might help with compression ratio. We might also be able to give you more specific advice if you could tell us more about your use case (e.g., heavy search, long-term storage, scale, etc.).
The package creates fairly small archives by default, which helps make search more parallelizable at the cost of compression ratio. You can tweak how much data ends up in each archive by modifying `target_archive_size` and `target_segment_size` under `archive_output` in the package's config. The following parameter combinations are probably worth trying out:

```yaml
archive_output:
  target_archive_size: 2147483648   # 2 GiB
  target_segment_size: 1073741824   # 1 GiB
```

```yaml
archive_output:
  target_archive_size: 1073741824   # 1 GiB
  target_segment_size: 536870912    # 512 MiB
```

```yaml
archive_output:
  target_archive_size: 536870912    # 512 MiB
  target_segment_size: 268435456    # 256 MiB
```
There is a bit of a tradeoff here: if archives become too large, search can become slower for certain types of queries, so if search speed is important for your use case, you may not want to increase these parameters by too much. Larger archives also lead to higher memory usage during compression. Note that, in general, compression ratio tends to improve as these sizes increase, but only up to a point.
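If you'd like to script trying these combinations, a rough sketch of editing the config with PyYAML follows. The config path here is an assumption (point it at wherever your package's config file actually lives), and the sizes are the middle combination from above:

```python
import yaml  # PyYAML

CONFIG_PATH = "etc/clp-config.yml"  # assumed location; adjust for your install

# Middle combination from above: 1 GiB archives, 512 MiB segments.
new_sizes = {
    "target_archive_size": 1073741824,  # 1 GiB
    "target_segment_size": 536870912,   # 512 MiB
}

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f) or {}

# Merge the new sizes into the archive_output section, creating it if absent.
config.setdefault("archive_output", {}).update(new_sizes)

# Note: rewriting with safe_dump drops any comments in the YAML file.
with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

print("archive_output is now:", config["archive_output"])
```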
By default, we use ZStd as our second-stage compressor, with a compression level of 3. You can change the compression level by modifying `compression_level` under `archive_output`:

```yaml
archive_output:
  compression_level: 4
```

Increasing the compression level can lead to significant increases in compression ratio, at the cost of reduced compression speed.

There are a few more things we could potentially do to help you achieve higher compression ratios if the tweaks above aren't enough (e.g., offering the ability to parse and encode more than one column as a timestamp, offering LZMA as a second-stage compressor, exposing features currently not available in the package like array-structurization, etc.). We could also help you take a look at what's limiting your compression ratio more directly if you could send us some sample logs (here or in a private channel).
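To get a rough feel for the compression-level tradeoff before changing the package config, you could run a quick standalone experiment on a sample of your logs with the zstandard Python bindings. This compresses the raw text rather than CLP's encoded columns, so the absolute ratios won't match the package's, but it shows how ratio and speed move as the level changes (the levels and file name below are just examples):

```python
import time
import zstandard as zstd

# A representative sample of the logs (example path).
with open("test.jsonl", "rb") as f:
    data = f.read()

for level in (3, 4, 7, 12):  # 3 is the package's default mentioned above
    cctx = zstd.ZstdCompressor(level=level)
    start = time.perf_counter()
    compressed = cctx.compress(data)
    elapsed = time.perf_counter() - start
    print(f"level {level:2d}: ratio {len(data) / len(compressed):5.2f}x in {elapsed:.2f}s")
```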
I ran CLP's compress command on multi-line JSON files and got a compression ratio of about 23X.

23X is decent, but far from the claimed 92X compression ratio. Can I get some suggestions on how to increase it? What I know about the JSON files so far:
{ "host.hostname": "xxx.xxx.xxx", "type": "json", "host.mac": "YYYY", ... a lot more key-value pairs "message": "I251121 01:39:01.516291 561950819 3@pebble/event.go:999 ⋮ [n162,s162,pebble] 3011722 [JOB 544048] compacting(default) L0 [1074306] (2.8MB) Score=1.08 + L2 [1074266 1074268 1074269] (7.8MB) Score=0.99; OverlappingRatio: Single 2.81, Multi 0.00", "tags": [ "_no_valid_original_timestamp" ], "environment": "production", "event.ingested": "2025-11-21T01:39:01.916Z", "@timestamp": "2025-11-21T01:39:01.616Z", "event.created": "2025-11-21T01:39:01.616Z", }Not sure if timestamp matters to the compression ratio. CLP failed to parse the the
@timestampvalue as it reported that all the logs are in the time range of January 1, 1970 - January 1, 1970.Beta Was this translation helpful? Give feedback.
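For reference, January 1, 1970 is the Unix epoch, which is what tends to show up when no timestamp was actually extracted. A quick way to confirm that the `@timestamp` values themselves are valid ISO-8601 (i.e., that the problem was the key passed to CLP rather than the data) is a sketch like the following, assuming line-delimited JSON and an example file name:

```python
import json
from datetime import datetime

bad = 0
total = 0
with open("test.jsonl") as f:  # example file name
    for line in f:
        if not line.strip():
            continue
        total += 1
        record = json.loads(line)
        ts = record.get("@timestamp", "")
        try:
            # fromisoformat() on older Pythons can't handle a trailing 'Z'.
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            bad += 1

print(f"{bad} of {total} records have a missing or unparseable @timestamp")
```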