Conversation

@fionaliao (Contributor) commented Oct 3, 2025

fsyncs for the WAL/WBL do not happen every time a record is logged, see: prometheus/prometheus#5869

This means that if Mimir is shut down uncleanly, it's possible for there to be WAL corruption errors due to not all data being written to disk before shutdown.

Adding a FsyncSegmentsUntilCurrent() function which will ensure all segments up until the current one have been fsynced before returning. Also added a corresponding function for the tsdb head, so it can be called from Mimir code.
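
For context, here is a minimal sketch of the shape of the API being added, assuming a head-level wrapper that simply delegates to its write logs. Apart from `FsyncSegmentsUntilCurrent()`, the names below are illustrative rather than the actual mimir-prometheus code:

```go
package sketch

// WriteLog stands in for the WAL/WBL type; only the new method matters here.
type WriteLog interface {
	// FsyncSegmentsUntilCurrent blocks until every segment up to and
	// including the current one has been fsynced to disk.
	FsyncSegmentsUntilCurrent() error
}

// Head is a stand-in for the TSDB head.
type Head struct {
	wal WriteLog // write-ahead log
	wbl WriteLog // write-behind log for out-of-order samples; may be nil
}

// FsyncWriteLogs is a hypothetical head-level wrapper so callers such as
// Mimir don't have to reach into the head's write logs directly.
func (h *Head) FsyncWriteLogs() error {
	if h.wal != nil {
		if err := h.wal.FsyncSegmentsUntilCurrent(); err != nil {
			return err
		}
	}
	if h.wbl != nil {
		return h.wbl.FsyncSegmentsUntilCurrent()
	}
	return nil
}
```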

Corresponding Mimir PR: grafana/mimir#12816

@fionaliao force-pushed the fl/sync-segments-until-current branch 3 times, most recently from e0126a0 to 4715bad on October 8, 2025 at 17:24
@fionaliao (Contributor, Author) commented:

@bboreham replying here to comments from grafana/mimir#12816

I have updated the logic a bit from the version you reviewed - the current segment fsync is now queued through the actor channel. The mutex is held while queuing the fsync; the actual fsync operations happen asynchronously after the mutex is released.


> For consistency, you could do the fsync inside the actor function.
> I can't come up with any concrete reason why this would be better; it just seems nicer that all fsyncs would happen in order.

I think one reason is that this allows the fsync to be executed without holding the mutex, but still avoids race conditions where the segment could be closed before it's fsynced (the write log Close() function waits for the actor channel to drain before closing the current segment).
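
To make the ordering concrete, here is a rough sketch of the queuing pattern described above. The field names loosely mirror the Prometheus write log type, but this is an illustration of the idea, not the actual implementation (which, among other things, also has to handle already-completed segments):

```go
package sketch

import (
	"os"
	"sync"
)

// WL is a minimal stand-in for the write log; this is not the real type.
type WL struct {
	mtx     sync.Mutex
	segment *os.File    // stand-in for the currently open segment
	actorc  chan func() // serialises background work on segments
}

// FsyncSegmentsUntilCurrent sketches the approach described above: the fsync
// of the current segment is enqueued on the actor channel while the mutex is
// held, and executes asynchronously after the mutex has been released.
func (w *WL) FsyncSegmentsUntilCurrent() error {
	w.mtx.Lock()
	done := make(chan error, 1)
	w.actorc <- func() {
		// Runs later on the actor goroutine, without holding w.mtx.
		done <- w.segment.Sync()
	}
	w.mtx.Unlock()

	// Close() drains actorc before closing the current segment, so the
	// queued fsync cannot run against an already-closed file.
	return <-done
}
```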


From grafana/mimir#12816 (comment):

> How can this happen?

I'm not sure what this refers to - w.segment being nil? w.segment changing or being closed?

The review comments below were left on this code:

```go
}
}

w.mtx.Unlock()
```
Reviewer (Contributor) commented:

I'm not sure about where we acquire and release the lock.

The potential problems I can think of right now are:

  1. There's a race between taking a copy of w.segment, unlocking the mutex and running the fsync. It's possible another goroutine replaced w.segment, so at that point we no longer have the current segment fsync'd. It may not be a huge problem since there's still the same race right before returning from FsyncSegmentsUntilCurrent: we can get a new segment.
  2. We end up with a deadlock where pushing to the channel blocks because the function currently executing from actorc is trying to acquire the lock. This currently doesn't happen, but I don't see anything that guarantees it can't.

I don't think it's possible to solve both. From reading the only other place which uses actorc, and the places which do operations on the WL while holding the lock, I think that 2. is assumed to never happen. So I think we should release the lock only after the done channel is closed.
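
For comparison, a sketch of the ordering suggested here, reusing the illustrative `WL` stand-in from the earlier sketch (again, not the real code): the lock is released only once the queued fsync has completed, so `w.segment` cannot be replaced in between.

```go
// fsyncCurrentSegmentLocked is a hypothetical variant that keeps holding the
// mutex until the queued fsync has finished.
func (w *WL) fsyncCurrentSegmentLocked() error {
	w.mtx.Lock()
	defer w.mtx.Unlock()

	done := make(chan error, 1)
	w.actorc <- func() {
		// Must not try to acquire w.mtx here, or it would deadlock with the
		// caller, which is still holding the lock (problem 2 above).
		done <- w.segment.Sync()
	}
	// Only return (and unlock) after the fsync has completed.
	return <-done
}
```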

The reviewer (Contributor) added:

Something I forgot: if you end up taking a copy of w.segment, then you should do that while holding the lock inside the actor func instead of before creating the actor closure; this will help avoid races.
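
A small sketch of that variant, again using the illustrative `WL` stand-in: the closure reads `w.segment` under the lock at execution time, rather than capturing a copy when the closure is created. This only applies to the version where the caller releases the lock before the fsync runs.

```go
// fsyncCurrentSegmentCopyInside is a hypothetical variant where the segment
// is read inside the actor function, under the lock.
func (w *WL) fsyncCurrentSegmentCopyInside() error {
	done := make(chan error, 1)
	w.actorc <- func() {
		w.mtx.Lock()
		seg := w.segment // read the *current* segment at execution time
		w.mtx.Unlock()
		// Ordering through actorc is what prevents syncing a segment that
		// has already been closed (see the Close() discussion above).
		done <- seg.Sync()
	}
	return <-done
}
```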

@fionaliao enabled auto-merge (squash) October 10, 2025 15:19
@fionaliao merged commit 34fc47d into main Oct 10, 2025
28 checks passed
@fionaliao deleted the fl/sync-segments-until-current branch October 10, 2025 15:29
fionaliao added a commit to grafana/mimir that referenced this pull request Oct 10, 2025

#### What this PR does

This PR introduces the
`-ingest-storage.write-logs-fsync-before-kafka-commit-enabled` flag,
enabled by default, to make sure all WAL/WBL segment files are fsynced
before the corresponding offset is committed in Kafka.

Currently unclean ingester shutdowns can cause data loss, as write log
files aren't guaranteed to be fsynced before the corresponding offsets
are committed. This means that when an ingester restarts, it could see
corrupted WAL data (and therefore discard it), but not replay all the
required data from Kafka as it only resumes from the committed offset.

An additional
`-ingest-storage.write-logs-fsync-before-kafka-commit-concurrency` flag
has been added so we can reduce the time taken to do the fsyncs (and
therefore the time taken to commit an offset).
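
A rough sketch of the intended ordering, with bounded fsync concurrency. All types and function names below are placeholders rather than Mimir's actual API; only the ordering (fsync every write log, then commit the offset) reflects what this PR does:

```go
package sketch

import "golang.org/x/sync/errgroup"

// tsdbHead is a placeholder for a per-tenant TSDB head exposing the new
// fsync hook from grafana/mimir-prometheus#998.
type tsdbHead interface {
	FsyncWriteLogs() error
}

// offsetCommitter is a placeholder for the Kafka offset committer.
type offsetCommitter interface {
	CommitOffset(offset int64) error
}

// commitWithFsync fsyncs the write logs of every open TSDB with bounded
// concurrency and only then commits the Kafka offset, so a committed offset
// always implies the corresponding WAL/WBL data is on disk.
func commitWithFsync(heads []tsdbHead, committer offsetCommitter, offset int64, concurrency int) error {
	var g errgroup.Group
	// Bounded by -ingest-storage.write-logs-fsync-before-kafka-commit-concurrency
	// (assumed > 0 here).
	g.SetLimit(concurrency)
	for _, h := range heads {
		g.Go(h.FsyncWriteLogs)
	}
	if err := g.Wait(); err != nil {
		return err // don't commit the offset if any fsync failed
	}
	return committer.CommitOffset(offset)
}
```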

A possible optimisation is to only fsync TSDBs which have had updates since
the last offset was committed; we'll see how the current version runs in
production and decide whether that's worth adding.

Depends on grafana/mimir-prometheus#998

#### Checklist

- [x] Tests updated.
- [x] Documentation added.
- [x] `CHANGELOG.md` updated - the order of entries should be
`[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry
is not needed, please add the `changelog-not-needed` label to the PR.
- [ ]
[`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md)
updated with experimental features. (didn't update as
`-ingest-storage.*` is already mentioned as a whole)

---------

Co-authored-by: Dimitar Dimitrov <[email protected]>