Commit 0eb21d2

Merge remote-tracking branch 'origin/RemoveMaxSeriesPerQueryReferences' into RemoveMaxSeriesPerQueryReferences
2 parents 882f3c6 + 648ad04 commit 0eb21d2

File tree

322 files changed

+6980
-17610
lines changed

.golangci.yml

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -12,6 +12,7 @@ run:
1212
- integration_querier
1313
- integration_ruler
1414
- integration_query_fuzz
15+
- integration_remote_write_v2
1516
- slicelabels
1617
output:
1718
formats:

ADOPTERS.md

Lines changed: 1 addition & 0 deletions
@@ -21,3 +21,4 @@ This is the list of organisations that are using Cortex in **production environm
2121
* [Platform9](https://platform9.com/)
2222
* [REWE Digital](https://rewe-digital.com/)
2323
* [SysEleven](https://www.syseleven.de/)
24+
* [Twilio](https://www.twilio.com/)

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,7 @@
44
* [CHANGE] StoreGateway/Alertmanager: Add default 5s connection timeout on client. #6603
55
* [CHANGE] Ingester: Remove EnableNativeHistograms config flag and instead gate keep through new per-tenant limit at ingestion. #6718
66
* [CHANGE] Validate a tenantID when to use a single tenant resolver. #6727
7+
* [FEATURE] Distributor: Add an experimental `-distributor.otlp.enable-type-and-unit-labels` flag to add `__type__` and `__unit__` labels for OTLP metrics. #6969
78
* [FEATURE] Distributor: Add an experimental `-distributor.otlp.allow-delta-temporality` flag to ingest delta temporality otlp metrics. #6934
89
* [FEATURE] Query Frontend: Add dynamic interval size for query splitting. This is enabled by configuring experimental flags `querier.max-shards-per-query` and/or `querier.max-fetched-data-duration-per-query`. The split interval size is dynamically increased to maintain a number of shards and total duration fetched below the configured values. #6458
910
* [FEATURE] Querier/Ruler: Add `query_partial_data` and `rules_partial_data` limits to allow queries/rules to be evaluated with data from a single zone, if other zones are not available. #6526
@@ -22,6 +23,7 @@
2223
* [FEATURE] Querier: Allow choosing PromQL engine via header. #6777
2324
* [FEATURE] Querier: Support for configuring query optimizers and enabling XFunctions in the Thanos engine. #6873
2425
* [FEATURE] Query Frontend: Add support /api/v1/format_query API for formatting queries. #6893
26+
* [ENHANCEMENT] Ingester: Add `cortex_ingester_tsdb_wal_replay_unknown_refs_total` and `cortex_ingester_tsdb_wbl_replay_unknown_refs_total` metrics to track unknown series references during wal/wbl replaying. #6945
2527
* [ENHANCEMENT] Ruler: Emit an error message when the rule synchronization fails. #6902
2628
* [ENHANCEMENT] Querier: Support snappy and zstd response compression for `-querier.response-compression` flag. #6848
2729
* [ENHANCEMENT] Tenant Federation: Add a # of query result limit logic when the `-tenant-federation.regex-matcher-enabled` is enabled. #6845

Makefile

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@ lint:
174174
golangci-lint run
175175

176176
# Ensure no blocklisted package is imported.
177-
GOFLAGS="-tags=requires_docker,integration,integration_alertmanager,integration_backward_compatibility,integration_memberlist,integration_querier,integration_ruler,integration_query_fuzz" faillint -paths "github.com/bmizerany/assert=github.com/stretchr/testify/assert,\
177+
GOFLAGS="-tags=requires_docker,integration,integration_alertmanager,integration_backward_compatibility,integration_memberlist,integration_querier,integration_ruler,integration_query_fuzz,integration_remote_write_v2" faillint -paths "github.com/bmizerany/assert=github.com/stretchr/testify/assert,\
178178
golang.org/x/net/context=context,\
179179
sync/atomic=go.uber.org/atomic,\
180180
github.com/prometheus/client_golang/prometheus.{MultiError}=github.com/prometheus/prometheus/tsdb/errors.{NewMulti},\

build-image/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
FROM golang:1.24.3-bullseye
1+
FROM golang:1.24.6-bullseye
22
ARG goproxyValue
33
ENV GOPROXY=${goproxyValue}
44
RUN apt-get update && apt-get install -y curl file gettext jq unzip protobuf-compiler libprotobuf-dev && \

docs/configuration/config-file-reference.md

Lines changed: 11 additions & 1 deletion
@@ -3143,7 +3143,7 @@ ha_tracker:
31433143
# EXPERIMENTAL: If true, accept prometheus remote write v2 protocol push
31443144
# request.
31453145
# CLI flag: -distributor.remote-writev2-enabled
3146-
[remote_write2_enabled: <boolean> | default = false]
3146+
[remote_writev2_enabled: <boolean> | default = false]
31473147
31483148
ring:
31493149
kvstore:
@@ -3265,6 +3265,11 @@ otlp:
32653265
# EXPERIMENTAL: If true, delta temporality otlp metrics to be ingested.
32663266
# CLI flag: -distributor.otlp.allow-delta-temporality
32673267
[allow_delta_temporality: <boolean> | default = false]
3268+
3269+
# EXPERIMENTAL: If true, the '__type__' and '__unit__' labels are added for
3270+
# the OTLP metrics.
3271+
# CLI flag: -distributor.otlp.enable-type-and-unit-labels
3272+
[enable_type_and_unit_labels: <boolean> | default = false]
32683273
```
32693274

32703275
### `etcd_config`
@@ -4114,6 +4119,11 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
41144119
# CLI flag: -frontend.max-queriers-per-tenant
41154120
[max_queriers_per_tenant: <float> | default = 0]
41164121

4122+
# [Experimental] Number of shards to use when distributing shardable PromQL
4123+
# queries.
4124+
# CLI flag: -frontend.query-vertical-shard-size
4125+
[query_vertical_shard_size: <int> | default = 0]
4126+
41174127
# Enable to allow queries to be evaluated with data from a single zone, if other
41184128
# zones are not available.
41194129
[query_partial_data: <boolean> | default = false]

docs/configuration/v1-guarantees.md

Lines changed: 1 addition & 0 deletions
@@ -118,6 +118,7 @@ Currently experimental features are:
118118
- `alertmanager-sharding-ring.final-sleep` (duration) CLI flag
119119
- OTLP Receiver
120120
- Ingest delta temporality OTLP metrics (`-distributor.otlp.allow-delta-temporality=true`)
121+
- Add `__type__` and `__unit__` labels (`-distributor.otlp.enable-type-and-unit-labels`)
121122
- Persistent tokens in the Ruler Ring:
122123
- `-ruler.ring.tokens-file-path` (path) CLI flag
123124
- Native Histograms
Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
1+
---
2+
title: "Partition Ring with Multi-AZ Replication"
3+
linkTitle: "Partition Ring Multi-AZ Replication"
4+
weight: 1
5+
slug: partition-ring-multi-az-replication
6+
---
7+
8+
- Author: [Daniel Blando](https://github.com/danielblando)
9+
- Date: July 2025
10+
- Status: Proposed
11+
12+
## Background
13+
14+
Distributors use a token-based ring to shard data across ingesters. Each ingester owns random tokens (32-bit numbers) in a hash ring. For each incoming series, the distributor:
15+
16+
1. Hashes the series labels to get a hash value
17+
2. Finds the primary ingester (smallest token > hash value)
18+
3. When replication is enabled, selects additional replicas by moving clockwise around the ring
19+
4. Ensures replicas are distributed across different availability zones
20+
21+
The issue arises when replication is enabled: each series in a request is hashed independently, causing each series to route to different groups of ingesters.
22+
23+
```mermaid
24+
graph TD
25+
A[Write Request] --> B[Distributor]
26+
B --> C[Hash Series 1] --> D[Ingesters: 5,7,9]
27+
B --> E[Hash Series 2] --> F[Ingesters: 5,3,10]
28+
B --> G[Hash Series 3] --> H[Ingesters: 7,27,28]
29+
B --> I[...] --> J[Different ingester sets<br/>for each series]
30+
```
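
The lookup described in the Background steps can be sketched as follows. This is a minimal illustration, not the Cortex ring implementation: the `tokenOwner` type, the `replicasFor` and `hashSeries` helpers, and the choice of FNV-1a as the hash are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// tokenOwner is a hypothetical pairing of a ring token with the ingester
// (and availability zone) that owns it.
type tokenOwner struct {
	token    uint32
	ingester string
	zone     string
}

// hashSeries hashes a series' label string to a 32-bit value.
// FNV-1a is used for illustration only.
func hashSeries(labels string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(labels))
	return h.Sum32()
}

// replicasFor starts at the first token greater than key and walks the ring
// clockwise, returning up to rf ingesters, each from a different zone.
// The ring must be sorted by token.
func replicasFor(ring []tokenOwner, key uint32, rf int) []string {
	if len(ring) == 0 {
		return nil
	}
	start := sort.Search(len(ring), func(i int) bool { return ring[i].token > key })
	seenZones := map[string]bool{}
	var out []string
	for i := 0; i < len(ring) && len(out) < rf; i++ {
		o := ring[(start+i)%len(ring)] // the modulo wraps past the last token
		if seenZones[o.zone] {
			continue // this zone already holds a replica
		}
		seenZones[o.zone] = true
		out = append(out, o.ingester)
	}
	return out
}

func main() {
	ring := []tokenOwner{
		{100, "ing-1", "az1"}, {200, "ing-2", "az2"}, {300, "ing-3", "az1"},
		{400, "ing-4", "az3"}, {500, "ing-5", "az2"},
	}
	fmt.Println(replicasFor(ring, 150, 3))
	fmt.Println(replicasFor(ring, hashSeries(`{name="http_request_latency"}`), 3))
}
```

Because every series produces its own key, two series in the same request generally resolve to different replica sets, which is the behavior the diagram above illustrates.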
31+
32+
## Problem
33+
34+
### Limited AZ Failure Tolerance with replication factor
35+
36+
While the token ring effectively distributes load across the ingester fleet, the independent hashing and routing of each series creates an amplification effect where a single ingester failure can impact a large number of write requests.
37+
38+
Consider a ring with 30 ingesters in which each series is replicated to three different ingesters:
39+
40+
```
41+
Sample 1: {name="http_request_latency",api="/push", status="2xx"}
42+
→ Ingesters: ing-5, ing-7, ing-9
43+
Sample 2: {name="http_request_latency",api="/push", status="4xx"}
44+
→ Ingesters: ing-5, ing-3, ing-10
45+
Sample 3: {name="http_request_latency",api="/push", status="3xx"}
46+
→ Ingesters: ing-7, ing-27, ing-28
47+
...
48+
```
49+
If ingesters `ing-15` and `ing-18` (in different AZs) are offline, any request containing a series that must be written to both of them fails completely, because that series can no longer reach a write quorum of two out of three replicas:
50+
51+
```
52+
Sample 15: {name="http_request_latency",api="/push", status="5xx"}
53+
→ Ingesters: ing-10, ing-15, ing-18 // Request fails
54+
```
55+
56+
This failure probability becomes critical in replicated deployments as request batch sizes grow. Given two failed ingesters in different AZs, each individual series has only a small chance of requiring both failed ingesters. However, the probability that at least one series in the batch hashes to both failed ingesters approaches certainty as the batch grows.
57+
58+
**Note**: This problem specifically affects Cortex deployments that use replication. Deployments with a replication factor of 1 are not impacted by this availability amplification issue.
59+
60+
## Proposed Solution
61+
62+
### Partition Ring Architecture
63+
64+
A new Partition Ring is proposed where the ring is divided into partitions, with each partition containing a set of tokens and a group of ingesters. Ingesters are allocated to partitions based on their order in the zonal StatefulSet, ensuring that scaling operations align with StatefulSet's LIFO behavior. Each partition contains a number of ingesters equal to the replication factor, with exactly one ingester per availability zone.
65+
66+
This approach provides **reduced failure probability**: the chance of two ingesters in the same partition being down at once is significantly lower than the chance of random ingester failures overlapping on some series. It also enables **deterministic replication**: data sent to `ing-az1-1` always replicates to `ing-az2-1` and `ing-az3-1`, making the system behavior more predictable and easier to troubleshoot.
67+
68+
```mermaid
69+
graph TD
70+
subgraph "Partition Ring"
71+
subgraph "Partition 3"
72+
P1A[ing-az1-3]
73+
P1B[ing-az2-3]
74+
P1C[ing-az3-3]
75+
end
76+
subgraph "Partition 2"
77+
P2A[ing-az1-2]
78+
P2B[ing-az2-2]
79+
P2C[ing-az3-2]
80+
end
81+
subgraph "Partition 1"
82+
P3A[ing-az1-1]
83+
P3B[ing-az2-1]
84+
P3C[ing-az3-1]
85+
end
86+
end
87+
88+
T1[Tokens 34] --> P1A
89+
T2[Tokens 56] --> P2A
90+
T3[Tokens 12] --> P3A
91+
```
92+
93+
Within each partition, ingesters maintain identical data, acting as true replicas of each other. Distributors maintain similar hashing logic but select a partition instead of individual ingesters. Data is then forwarded to all ingesters within the selected partition, making the replication pattern deterministic.
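
A minimal sketch of partition selection, assuming the same "first token greater than the hash" rule as the token ring. The `partition` and `partitionToken` types and the `partitionFor` helper are hypothetical names introduced for this example and are not part of the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// partition groups the ingesters that replicate each other, one per zone.
type partition struct {
	id        string
	ingesters []string
}

// partitionToken maps a ring token to the partition that owns it.
type partitionToken struct {
	token     uint32
	partition *partition
}

// partitionFor returns the partition owning the first token greater than key,
// wrapping around to the first token when key is past the last one.
// The ring must be sorted by token.
func partitionFor(ring []partitionToken, key uint32) *partition {
	if len(ring) == 0 {
		return nil
	}
	i := sort.Search(len(ring), func(i int) bool { return ring[i].token > key })
	return ring[i%len(ring)].partition
}

func main() {
	p1 := &partition{id: "partition-1", ingesters: []string{"ing-az1-1", "ing-az2-1", "ing-az3-1"}}
	p2 := &partition{id: "partition-2", ingesters: []string{"ing-az1-2", "ing-az2-2", "ing-az3-2"}}
	ring := []partitionToken{{100, p1}, {200, p2}, {300, p1}, {400, p2}}

	for _, key := range []uint32{50, 150, 450} {
		p := partitionFor(ring, key)
		// Every series that maps to this partition goes to the same ingesters.
		fmt.Println(key, "->", p.id, p.ingesters)
	}
}
```

Unlike the token ring walk, there is no per-series choice of replicas here: once the partition is known, the replica set is fixed.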
94+
95+
### Protocol Buffer Definitions
96+
97+
```protobuf
98+
message PartitionRingDesc {
99+
map<string, PartitionDesc> partitions = 1;
100+
}
101+
102+
message PartitionDesc {
103+
PartitionState state = 1;
104+
repeated uint32 tokens = 2;
105+
map<string, InstanceDesc> instances = 3;
106+
int64 registered_timestamp = 4;
107+
}
108+
109+
// Unchanged from current implementation
110+
message InstanceDesc {
111+
string addr = 1;
112+
int64 timestamp = 2;
113+
InstanceState state = 3;
114+
string zone = 7;
115+
int64 registered_timestamp = 8;
116+
}
117+
```
118+
119+
### Partition States
120+
121+
Partitions maintain a simplified state model that provides **clear ownership** where each series belongs to exactly one partition, but requires **additional state management** for partition states and lifecycle management:
122+
123+
```go
124+
type PartitionState int
125+
126+
const (
127+
NON_READY PartitionState = iota // Insufficient ingesters
128+
ACTIVE // Fully operational
129+
READONLY // Scale-down in progress
130+
)
131+
```
132+
133+
State transitions:
134+
```mermaid
135+
stateDiagram-v2
136+
[*] --> NON_READY
137+
NON_READY --> ACTIVE : Required ingesters joined<br/>across all AZs
138+
ACTIVE --> READONLY : Scale-down initiated
139+
ACTIVE --> NON_READY : Ingester removed
140+
READONLY --> NON_READY : Ingesters removed
141+
NON_READY --> [*] : Partition deleted
142+
```
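
The transitions in the diagram can be encoded as a lookup table. This is a sketch only: the `allowed` map and the `canTransition` helper are hypothetical, and partition deletion (`NON_READY --> [*]`) is left out because it removes the partition rather than changing its state.

```go
package main

import "fmt"

type PartitionState int

const (
	NON_READY PartitionState = iota // Insufficient ingesters
	ACTIVE                          // Fully operational
	READONLY                        // Scale-down in progress
)

// allowed mirrors the edges of the state diagram above.
var allowed = map[PartitionState][]PartitionState{
	NON_READY: {ACTIVE},
	ACTIVE:    {READONLY, NON_READY},
	READONLY:  {NON_READY},
}

// canTransition reports whether the diagram contains an edge from -> to.
func canTransition(from, to PartitionState) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(NON_READY, ACTIVE)) // true
	fmt.Println(canTransition(READONLY, ACTIVE))  // false
}
```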
143+
144+
### Partition Lifecycle Management
145+
146+
#### Creating Partitions
147+
148+
When a new ingester joins the ring:
149+
1. Check if a suitable partition exists with available slots
150+
2. If no partition exists, create a new partition in `NON_READY` state
151+
3. Add partition's tokens to the ring
152+
4. Add the ingester to the partition
153+
5. Wait for required number of ingesters across all AZs (one per AZ)
154+
6. Once all AZs are represented, transition partition to `ACTIVE`
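
The join flow above can be sketched as follows. The `ring` and `partition` types and the `join` method are hypothetical. The sketch assumes that the partition is identified by the ingester's StatefulSet ordinal, and it omits token registration (step 3) for brevity.

```go
package main

import "fmt"

type state string

const (
	nonReady state = "NON_READY"
	active   state = "ACTIVE"
)

// partition tracks which ingester fills each availability zone slot.
type partition struct {
	state  state
	byZone map[string]string // zone -> ingester ID
}

// ring holds the zones that must be represented and the known partitions.
type ring struct {
	zones      []string
	partitions map[int]*partition
}

// join registers an ingester in the partition for its ordinal, creating the
// partition in NON_READY if it does not exist yet, and promotes the partition
// to ACTIVE once every zone has an ingester.
func (r *ring) join(ordinal int, zone, ingester string) state {
	p, ok := r.partitions[ordinal]
	if !ok {
		p = &partition{state: nonReady, byZone: map[string]string{}}
		r.partitions[ordinal] = p
	}
	p.byZone[zone] = ingester
	for _, z := range r.zones {
		if _, present := p.byZone[z]; !present {
			return p.state // still waiting for at least one zone
		}
	}
	p.state = active
	return p.state
}

func main() {
	r := &ring{zones: []string{"az1", "az2", "az3"}, partitions: map[int]*partition{}}
	fmt.Println(r.join(1, "az1", "ing-az1-1")) // NON_READY
	fmt.Println(r.join(1, "az2", "ing-az2-1")) // NON_READY
	fmt.Println(r.join(1, "az3", "ing-az3-1")) // ACTIVE
}
```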
155+
156+
#### Removing Partitions
157+
158+
The scale-down process follows these steps:
159+
1. **Mark READONLY**: Partition stops accepting new writes but continues serving reads
160+
2. **Data Transfer**: Wait for all ingesters in partition to transfer data and become empty
161+
3. **Coordinated Removal**: Remove one ingester from each AZ simultaneously
162+
4. **State Transition**: Partition automatically transitions to `NON_READY` (insufficient replicas)
163+
5. **Cleanup**: Remove remaining ingesters and delete partition from ring
164+
165+
If READONLY mode is not used, removing an ingester moves the partition to `NON_READY`. When all ingesters are removed, the last one deletes the partition if `unregister_on_shutdown` is true.
166+
167+
### Multi-Ring Migration Strategy
168+
169+
To address the migration challenge for production clusters currently running token-based rings, this proposal also introduces a multi-ring infrastructure that allows gradual traffic shifting from token-based to partition-based rings:
170+
171+
```mermaid
172+
sequenceDiagram
173+
participant C as Client
174+
participant D as Distributor
175+
participant MR as Multi-Ring Router
176+
participant TR as Token Ring
177+
participant PR as Partition Ring
178+
179+
C->>D: Write Request (1000 series)
180+
D->>MR: Route request
181+
MR->>MR: Check percentage config<br/>(e.g., 80% token, 20% partition)
182+
MR->>TR: Route 800 series to Token Ring
183+
MR->>PR: Route 200 series to Partition Ring
184+
185+
Note over TR,PR: Both rings process their portion
186+
TR->>D: Response for 800 series
187+
PR->>D: Response for 200 series
188+
D->>C: Combined response
189+
```
190+
191+
Migration phases for production clusters:
192+
1. **Phase 1**: Deploy partition ring alongside existing token ring (0% traffic)
193+
2. **Phase 2**: Route 10% traffic to partition ring
194+
3. **Phase 3**: Gradually increase to 50% traffic
195+
4. **Phase 4**: Route 90% traffic to partition ring
196+
5. **Phase 5**: Complete migration (100% partition ring)
197+
198+
This multi-ring approach solves the migration problem for existing production deployments that cannot afford downtime during the transition from token-based to partition-based rings. It provides **zero downtime migration** with **rollback capability** and **incremental validation** at each step. The trade-offs are **dual ring participation** (ingesters must participate in both rings during migration), **increased memory usage**, and **migration coordination**, which requires careful percentage management and monitoring.
199+
200+
#### Read Path Considerations
201+
202+
During migration, the read path (queriers and rulers) must have visibility into both rings to ensure all functionality works correctly:
203+
204+
- **Queriers** must check both token and partition rings to locate series data, as data may be distributed across both ring types during migration
205+
- **Rulers** must evaluate rules against data from both rings to ensure complete rule evaluation
206+
- **Ring-aware components** (like shuffle sharding) must operate correctly across both ring types
207+
- **Metadata operations** (like label queries) must aggregate results from both rings
208+
209+
All existing Cortex functionality must continue to work seamlessly during the migration period, requiring components to transparently handle the dual-ring architecture.

go.mod

Lines changed: 15 additions & 12 deletions
@@ -45,7 +45,7 @@ require (
4545
github.com/prometheus/client_model v0.6.2
4646
github.com/prometheus/common v0.65.1-0.20250703115700-7f8b2a0d32d3
4747
// Prometheus maps version 2.x.y to tags v0.x.y.
48-
github.com/prometheus/prometheus v0.305.1-0.20250721065454-b09cf6be8d56
48+
github.com/prometheus/prometheus v0.305.1-0.20250808023455-1e4144a496fb
4949
github.com/segmentio/fasthash v1.0.3
5050
github.com/sony/gobreaker v1.0.0
5151
github.com/spf13/afero v1.11.0
@@ -113,19 +113,19 @@ require (
113113
github.com/alecthomas/kingpin/v2 v2.4.0 // indirect
114114
github.com/andybalholm/brotli v1.1.1 // indirect
115115
github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2 // indirect
116-
github.com/aws/aws-sdk-go-v2 v1.36.3 // indirect
116+
github.com/aws/aws-sdk-go-v2 v1.37.0 // indirect
117117
github.com/aws/aws-sdk-go-v2/config v1.29.15 // indirect
118118
github.com/aws/aws-sdk-go-v2/credentials v1.17.68 // indirect
119119
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.30 // indirect
120-
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.34 // indirect
121-
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.34 // indirect
120+
github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.0 // indirect
121+
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.0 // indirect
122122
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.3 // indirect
123-
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.12.3 // indirect
124-
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.12.15 // indirect
123+
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.0 // indirect
124+
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.0 // indirect
125125
github.com/aws/aws-sdk-go-v2/service/sso v1.25.3 // indirect
126126
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.30.1 // indirect
127127
github.com/aws/aws-sdk-go-v2/service/sts v1.33.20 // indirect
128-
github.com/aws/smithy-go v1.22.3 // indirect
128+
github.com/aws/smithy-go v1.22.5 // indirect
129129
github.com/beorn7/perks v1.0.1 // indirect
130130
github.com/blang/semver/v4 v4.0.0 // indirect
131131
github.com/caio/go-tdigest v3.1.0+incompatible // indirect
@@ -148,9 +148,8 @@ require (
148148
github.com/fatih/color v1.18.0 // indirect
149149
github.com/felixge/httpsnoop v1.0.4 // indirect
150150
github.com/fsnotify/fsnotify v1.9.0 // indirect
151-
github.com/go-chi/chi/v5 v5.0.7 // indirect
151+
github.com/go-chi/chi/v5 v5.2.2 // indirect
152152
github.com/go-ini/ini v1.67.0 // indirect
153-
github.com/go-jose/go-jose/v4 v4.0.5 // indirect
154153
github.com/go-logfmt/logfmt v0.6.0 // indirect
155154
github.com/go-logr/logr v1.4.3 // indirect
156155
github.com/go-logr/stdr v1.2.2 // indirect
@@ -227,7 +226,7 @@ require (
227226
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
228227
github.com/prometheus-community/prom-label-proxy v0.11.1 // indirect
229228
github.com/prometheus/exporter-toolkit v0.14.0 // indirect
230-
github.com/prometheus/otlptranslator v0.0.0-20250620074007-94f535e0c588 // indirect
229+
github.com/prometheus/otlptranslator v0.0.0-20250731173911-a9673827589a // indirect
231230
github.com/prometheus/sigv4 v0.2.0 // indirect
232231
github.com/puzpuzpuz/xsync/v3 v3.5.1 // indirect
233232
github.com/rantav/go-grpc-channelz v0.0.4 // indirect
@@ -240,7 +239,6 @@ require (
240239
github.com/shurcooL/vfsgen v0.0.0-20230704071429-0000e147ea92 // indirect
241240
github.com/sirupsen/logrus v1.9.3 // indirect
242241
github.com/soheilhy/cmux v0.1.5 // indirect
243-
github.com/spiffe/go-spiffe/v2 v2.5.0 // indirect
244242
github.com/stretchr/objx v0.5.2 // indirect
245243
github.com/tinylib/msgp v1.3.0 // indirect
246244
github.com/trivago/tgo v1.0.7 // indirect
@@ -249,7 +247,6 @@ require (
249247
github.com/weaveworks/promrus v1.2.0 // indirect
250248
github.com/xhit/go-str2duration/v2 v2.1.0 // indirect
251249
github.com/yuin/gopher-lua v1.1.1 // indirect
252-
github.com/zeebo/errs v1.4.0 // indirect
253250
go.mongodb.org/mongo-driver v1.17.4 // indirect
254251
go.opencensus.io v0.24.0 // indirect
255252
go.opentelemetry.io/auto/sdk v1.1.0 // indirect
@@ -326,3 +323,9 @@ replace github.com/google/gnostic => github.com/googleapis/gnostic v0.6.9
326323
// Same replace used by thanos: (may be removed in the future)
327324
// https://github.com/thanos-io/thanos/blob/fdeea3917591fc363a329cbe23af37c6fff0b5f0/go.mod#L265
328325
replace gopkg.in/alecthomas/kingpin.v2 => github.com/alecthomas/kingpin v1.3.8-0.20210301060133-17f40c25f497
326+
327+
// Wait for fix for https://github.com/grpc/grpc-go/pull/8504.
328+
replace google.golang.org/grpc => google.golang.org/grpc v1.71.2
329+
330+
// See https://github.com/envoyproxy/go-control-plane/issues/1083 as this version introduces checksum mismatch.
331+
exclude github.com/envoyproxy/go-control-plane/envoy v1.32.3
