@thibaultamartin
Contributor

✔️ Checklist

  • Check for common mistakes:
    • Wrap plain URLs in <> to linkify them (learn more).
    • Use the right level of headings: The page title will use a level 1 heading, so your headings should use level 2 and below.
    • Use internal links: when linking to another page on https://matrix.org, use the Zola [label](@/target.md) syntax.
  • For blog posts:
    • Verify the date and post ordering on the /blog page, especially for multiple posts on the same day. Prefer UTC format, e.g. 2025-12-01T14:00:00Z for Dec 1st, 2025, 2pm UTC.
    • Set the correct author and category. Browse existing ones at https://matrix.org/author/ and https://matrix.org/category/ to match them.
  • Let us know if you are contributing in a specific role, such as on behalf of an organisation or team, for example.
  • Let us know if your PR is time-sensitive in any way.
  • Mention any issues related to the PR. Use GitHub keywords as appropriate.
  • Your individual commits or pull request is signed off.

🧢 Website & WG cap

@thibaultamartin thibaultamartin requested a review from a team as a code owner October 10, 2025 08:38
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Oct 10, 2025

Deploying matrix-website with Cloudflare Pages

Latest commit: 54d3b35
Status: ✅  Deploy successful!
Preview URL: https://e1c9ad4e.matrix-website.pages.dev
Branch Preview URL: https://publish-post-mortem.matrix-website.pages.dev


Signed-off-by: Thib <[email protected]>
Collaborator

@HarHarLinks HarHarLinks left a comment

2 comments regarding structure/text flow, and a bunch of small tweaks for polish. Overall lgtm.


![A schema showing Synapse connected to a primary database. It also shows a secondary database pulling WALs from the primary. Finally the primary database also pushes WALs to a S3 bucket](/blog/img/morg-high-level-architecture.png)

Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with `mdraid`.
Collaborator

presumably

Suggested change
Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with `mdraid`.
Confusingly, at the time of the incident, the primary database server is called `db-02`, and the secondary database server is called `db-01`. The deployment runs on bare metal servers at Mythic Beasts and the Postgres database servers both use their own logical RAID 10 array with [`mdraid`](https://docs.kernel.org/admin-guide/md.html).

Contributor Author

Unsure, I'd rather not link to something I'm not certain about.

Collaborator

Makes sense. Perhaps if we had the responsible team here (@neilisfragile) we could at least amend it. The name mdraid is unfamiliar to me as a private enthusiast and previous mdadm user, so clarifying it would help overall readability.

Contributor

Yes, that link is correct

Collaborator

(unresolving given the change was not yet applied)


### Timeline

At 11:03 UTC on Sept 2nd 2025, Mythic Beasts’ teams added 2 NVMe drives to both `db-02` and `db-01`, respectively the primary and secondary database servers.
Collaborator

Reading top to bottom: is this supposed to imply the start of an `mdadm --grow` operation?


But we also needed the team to get some rest. Given how slow it was to replay WALs, we reconfigured our backups to happen against the primary database rather than against the (missing) replica. We let the European team go to bed, while our American SRE kept tabs on everything. At 03:26 UTC a new incremental backup completed.

At 09:21 UTC we added the two NVMe disks to the RAID array and to the LVM volumes group of `db-01`. We rebooted to ensure the disks were properly detected and mounted \- but the server didn’t come back. We opened the lights-out console Mythic Beasts provides us, and saw that the RAID array was not in the functional state. We had rebooted `db-01` at a critical moment of the array reshaping.
Collaborator

why was it reshaping at this point? wouldn't it have been ~empty?
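
As an aside on the reshape question: the quoted paragraph describes a reboot that landed in the middle of an array reshape. A minimal sketch of a pre-reboot check follows, assuming the array state is read from `/proc/mdstat`, where the kernel md driver prints a progress line while an array is busy. The function name `busy_arrays` and the sample text are hypothetical and purely illustrative; nothing here is taken from the incident or from the SRE team's tooling.

```python
import re

# Progress line printed while an array is busy, e.g.
# "[==>....]  reshape = 12.5% (123/456) finish=95.2min speed=..."
# States without a percentage (such as a delayed resync) are not matched.
_PROGRESS = re.compile(r"\b(reshape|recovery|resync|check)\s*=\s*([\d.]+)%")
# Array header line, e.g. "md0 : active raid10 ..."
_ARRAY = re.compile(r"^(md\w+)\s*:")


def busy_arrays(mdstat_text):
    """Return {array_name: (operation, percent_done)} for arrays mid-operation."""
    busy = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        progress = _PROGRESS.search(line)
        if progress and current is not None:
            busy[current] = (progress.group(1), float(progress.group(2)))
    return busy


# Illustrative sample only (hypothetical device names and sizes).
SAMPLE = """\
Personalities : [raid10]
md0 : active raid10 nvme3n1[5] nvme2n1[4] sdd[3] sdc[2] sdb[1] sda[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  reshape = 12.5% (244157568/1953260544) finish=95.2min speed=299000K/sec

md1 : active raid1 sdf[1] sde[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(busy_arrays(SAMPLE))
```

On a real host one would pass in the contents of `/proc/mdstat` and hold off on rebooting while the returned mapping is non-empty. With the sample above, only `md0` is reported as busy.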

thibaultamartin and others added 3 commits October 10, 2025 18:04
Signed-off-by: HarHarLinks <[email protected]>
Signed-off-by: HarHarLinks <[email protected]>
Collaborator

@HarHarLinks HarHarLinks left a comment

I am not hard-blocking a merge at this point, but I strongly recommend clarifying the open points. At least to me, they raise questions when they should be answering them, and I think I am a rather big enthusiast who can interpret more than the average, already technical, blog audience.

@HarHarLinks HarHarLinks added the blog label Oct 10, 2025
@@ -0,0 +1,182 @@
+++
date = "2025-10-10T10:00:00Z"
Collaborator

may need adjustment


The SRE team would like to thank our hosting provider Mythic Beasts. They reached out quickly and proactively when adding new disks, reporting the errors they were seeing. They have been much more than just a pair of remote hands. They also reached out with an offer of support during the incident.

Finally, we’d like to sincerely apologise again to everyone impacted by the outage. We hope you found the post-mortem informative and you’d like to talk about it more, several of us will be at the [Matrix Conference 2025](https://conference.matrix.org) in Strasbourg. In addition to a flurry of great talks, there will be workshops about how to set up a Matrix homeserver and tune the clients to your liking\!
Collaborator

outdated

Labels

blog This issue is related to the blog section
