[DPE-9630] fix: safe pgdata symlink setup with accurate status reporting#1359
Draft
marceloneppel wants to merge 11 commits into16/edgefrom
Draft
[DPE-9630] fix: safe pgdata symlink setup with accurate status reporting#1359marceloneppel wants to merge 11 commits into16/edgefrom
marceloneppel wants to merge 11 commits into16/edgefrom
Conversation
…data setup Replace rm -rf with timestamped mv to backup existing pgdata directories to persistent storage, preventing data loss. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Replace container.isdir() check with explicit symlink test to avoid moving existing pgdata symlinks to backup when redeploying. This ensures only real directories are backed up, preventing data loss from incorrect symlink handling. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Codecov Report❌ Patch coverage is
❌ Your project check has failed because the head coverage (68.61%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## 16/edge #1359 +/- ##
===========================================
- Coverage 68.64% 68.61% -0.03%
===========================================
Files 16 16
Lines 3881 3891 +10
Branches 590 592 +2
===========================================
+ Hits 2664 2670 +6
- Misses 1009 1014 +5
+ Partials 208 207 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Remove unnecessary code that backed up existing pgdata directories when creating the pgdata symlink. The OCI image handling has been simplified to always create the symlink without checking if the target exists as a real directory. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add WaitingStatus when member_started is True but Patroni state is "starting" to avoid premature ActiveStatus. Add retry logic to async replication scale-up test to handle eventual consistency. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…covery Skip the "waiting for PostgreSQL to start" WaitingStatus during refresh so the refresh library does not consider the unit unhealthy. Exempt this status from the update-status early exit so update-status can re-evaluate and set ActiveStatus once PostgreSQL finishes starting. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… querying Restore the symlink guard in _ensure_pgdata_dirs_and_symlinks using test -L to detect when the pgdata path is a real directory instead of a symlink. This can happen when Patroni's remove_data_directory() deletes the symlink during failed pg_basebackup retries and the retry recreates the path as a real directory. Move the directory to persistent storage for debugging instead of deleting it. Add all_active waits before data consistency checks in test_data_replication and test_standby_promotion. Remove retry wrappers from test_scale_up and test_unrelate_and_relate since they already wait for all_active before querying. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…pgdata-symlink-setup Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… into fix/dpe-9630-safe-pgdata-symlink-setup Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Replace hardcoded "waiting for PostgreSQL to start" strings with POSTGRESQL_STARTING_MESSAGE constant for better maintainability Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…pgdata-symlink-setup Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
During pgdata symlink setup,
rm -rfwas used to remove/var/lib/postgresql/16/mainif it existed as a non-symlink. This could cause data loss if the directory unexpectedly contained data. Additionally, the charm setActiveStatuswhile PostgreSQL was still in Patroni's "starting" state (not yet accepting connections), causing tests and consumers waiting forall_activeto proceed prematurely.Solution
Replace
rm -rfwithmvto persistent storage (DPE-9630)Patroni's
remove_data_directory()can delete the pgdata symlink during failedpg_basebackupretries, and the retry recreates the path as a real directory. Sinceln -sfncannot replace a real directory, the charm now checks withtest -Land moves the directory to PV-backed storage (pgdata-backup-<timestamp>) instead of deleting it, preserving data for debugging.Show
WaitingStatuswhen PostgreSQL is startingWhen Patroni reports
member_started=Truebut the health state is"starting"(PostgreSQL not yet accepting connections),_set_active_statusnow setsWaitingStatus("waiting for PostgreSQL to start")instead ofActiveStatus. This prevents premature readiness signaling. The WaitingStatus is:_on_update_statusearly exit (soupdate-statuscan re-evaluate and setActiveStatusonce PostgreSQL finishes starting)Add
all_activewaits to async replication teststest_data_replicationandtest_standby_promotionlackedall_activewaits before querying units, causingConnection refusederrors when PostgreSQL was still starting.Checklist