Publish post-mortem #2951
base: main
Signed-off-by: Thib <[email protected]>
Deploying matrix-website with
| Latest commit: | 54d3b35 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e1c9ad4e.matrix-website.pages.dev |
| Branch Preview URL: | https://publish-post-mortem.matrix-website.pages.dev |
Signed-off-by: Thib <[email protected]>
2 comments regarding structure/text flow, and a bunch of small tweaks for polish. Overall lgtm.
| Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with `mdraid`. |
presumably
| Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with `mdraid`. | |
| Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with [`mdraid`](https://docs.kernel.org/admin-guide/md.html). |
Unsure, I'd rather not link to something I'm not certain about.
Makes sense. Perhaps if we had the responsible team here (@neilisfragile), we could at least amend it. The name `mdraid` is unfamiliar to me as a private enthusiast and former `mdadm` user, so clarifying it would help overall readability.
Yes, that link is correct
(unresolving given the change was not yet applied)
| ### Timeline | ||
| At 11:03 UTC on Sept 2nd 2025, Mythic Beasts’ teams added 2 NVMe drives to both `db-02` and `db-01`, respectively the primary and secondary database servers. |
Reading top to bottom: is this supposed to imply the start of an `mdadm --grow` operation?
| But we also needed the team to get some rest. Given how slow it was to replay WALs, we reconfigured our backups to happen against the primary database rather than against the (missing) replica. We let the European team go to bed, while our American SRE kept tabs on everything. At 03:26 UTC a new incremental backup completed. | ||
| At 09:21 UTC we added the two NVMe disks to the RAID array and to the LVM volumes group of `db-01`. We rebooted to ensure the disks were properly detected and mounted \- but the server didn’t come back. We opened the lights-out console Mythic Beasts provides us, and saw that the RAID array was not in the functional state. We had rebooted `db-01` at a critical moment of the array reshaping. |
why was it reshaping at this point? wouldn't it have been ~empty?
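On the reshape question: Linux md reports a running reshape, resync, or recovery as a progress line in `/proc/mdstat`. Below is a minimal sketch, assuming that usual progress-line layout, of a check that could be run before a reboot. The sample text, the device names, and the `arrays_in_progress` helper are illustrative assumptions, not taken from the incident.

```python
import re

# Progress lines in /proc/mdstat look roughly like:
#   [=>...................]  reshape =  5.2% (1017/19532) finish=120.5min speed=256000K/sec
PROGRESS_RE = re.compile(r"\]\s+(reshape|resync|recovery|check)\s*=\s*([\d.]+)%")
# Array header lines look like: "md0 : active raid10 ..."
ARRAY_RE = re.compile(r"^(md\w+)\s*:")


def arrays_in_progress(mdstat_text):
    """Return {array_name: (operation, percent)} for arrays with a running operation."""
    busy = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        progress = PROGRESS_RE.search(line)
        if progress and current:
            busy[current] = (progress.group(1), float(progress.group(2)))
    return busy


# Made-up sample; on a real host the text would come from reading /proc/mdstat.
SAMPLE = """\
Personalities : [raid10]
md0 : active raid10 nvme3n1[5] nvme2n1[4] sdd[3] sdc[2] sdb[1] sda[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      [=>...................]  reshape =  5.2% (101725184/1953260544) finish=120.5min speed=256000K/sec

unused devices: <none>
"""

if __name__ == "__main__":
    busy = arrays_in_progress(SAMPLE)
    if busy:
        print("Do not reboot yet, operations still running:", busy)
    else:
        print("No reshape, resync, or recovery in progress.")
```

An empty result only means no progress line was found in the text; it says nothing about whether the array is otherwise healthy.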
Co-authored-by: Kim Brose <[email protected]>
Signed-off-by: HarHarLinks <[email protected]>
Signed-off-by: HarHarLinks <[email protected]>
I am not hard-blocking a merge at this point, but I strongly recommend clarifying the open points. At least to me, they raise questions when they should be answering them, and I think I am a fairly keen enthusiast who can interpret more than the average reader of the blog, whose audience is already technical.
| @@ -0,0 +1,182 @@ | |||
| +++ | |||
| date = "2025-10-10T10:00:00Z" | |||
may need adjustment
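If the `date` value does need adjusting, here is a minimal Python sketch of producing a timestamp in the same UTC `Z` form as the front matter line above. The helper name `front_matter_date` is hypothetical, and the example date is arbitrary rather than the intended publication date.

```python
from datetime import datetime, timezone


def front_matter_date(moment):
    """Format a timezone-aware datetime as UTC with second precision and a Z suffix."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    example = datetime(2025, 12, 1, 14, 0, tzinfo=timezone.utc)
    print(front_matter_date(example))  # prints 2025-12-01T14:00:00Z
```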
| The SRE team would like to thank our hosting provider Mythic Beasts. They reached out quickly and proactively when adding new disks, reporting the errors they were seeing. They have been much more than just a pair of remote hands. They also reached out with an offer of support during the incident. | ||
| Finally, we’d like to sincerely apologise again to everyone impacted by the outage. We hope you found the post-mortem informative and you’d like to talk about it more, several of us will be at the [Matrix Conference 2025](https://conference.matrix.org) in Strasbourg. In addition to a flurry of great talks, there will be workshops about how to set up a Matrix homeserver and tune the clients to your liking\! |
outdated
✔️ Checklist
`<>` to linkify them (learn more).
`[label](@/target.md)` syntax.
`/blog` page, especially for multiple posts on the same day. Prefer UTC format, e.g. `2025-12-01T14:00:00Z` for Dec 1st, 2025, 2pm UTC.
🧢 Website & WG cap