Skip to content

Conversation

@b-wu26
Copy link
Contributor

@b-wu26 b-wu26 commented Dec 11, 2025

While using cortex, my team recently ran into an issue with the alert managers where a corrupt config was synced and extended across the entire alert manager ring. This led to our entire fleet being stuck in a crash loop back off and prompted a need to limit the blast radius of potential issues to the replication factor

What this PR does:

  • creates a feature flag which disables the replica set extension capabilities called: DisableReplicaSetExtension

Which issue(s) this PR fixes:
#7200

Checklist

  • [x ] Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@b-wu26 b-wu26 changed the title Fixing the shuffle sharding algorithm and doing api improments Fixing the shuffle sharding algorithm and doing api improvements Dec 23, 2025
randN := rand.Intn(len(replicationSet.Instances))
instances = replicationSet.Instances[randN : randN+1]
// For POST requests, add retry logic to PutSilence
if d.isUnaryWritePath(r.URL.Path) {
Copy link
Contributor

@rajagopalanand rajagopalanand Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be just a || in the if condition? for example if req.GetMethod() == "GET" || req.GetMethod() == "DELETE" || d.isUnaryWritePath(r.URL.Path)?

receivers = append(receivers, parsed...)
}

merged := mergeV2Receivers(receivers)
Copy link
Contributor

@rajagopalanand rajagopalanand Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be done as part of the for loop? I mean can we maintain a map and dedup as we go

@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from e8e8590 to 2e1bdf7 Compare January 8, 2026 18:13
)

// RingOp is the operation used for reading/writing to the alertmanagers.
// Original ring operations (with extension enabled for backward compatibility)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call it Original. We are not deprecating this one and DisableReplicaSetExtension is false by default.

// Original ring operations (with extension enabled for backward compatibility)
var RingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, func(s ring.InstanceState) bool {
// Only ACTIVE Alertmanager get requests. If instance is not ACTIVE, we need to find another Alertmanager.
// Original behavior: extend replica set if instance is not ACTIVE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove all original behavior from this PR

@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from 38b63a6 to 74784e6 Compare January 8, 2026 19:57
@b-wu26 b-wu26 changed the title Fixing the shuffle sharding algorithm and doing api improvements Creating feature flag for disabling replica set extension Jan 8, 2026
@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from 146915c to 3b363bf Compare January 8, 2026 20:07
@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch from 3b363bf to 78ced0f Compare January 8, 2026 21:37
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Member

@SungJin1212 SungJin1212 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 9, 2026
Copy link
Contributor

@rajagopalanand rajagopalanand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@friedrichg friedrichg merged commit bd66858 into cortexproject:master Jan 9, 2026
51 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/alertmanager lgtm This PR has been approved by a maintainer size/L type/feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants