
[DPE-9630] fix: safe pgdata symlink setup with accurate status reporting #1359

Draft
marceloneppel wants to merge 11 commits into 16/edge from fix/dpe-9630-safe-pgdata-symlink-setup


@marceloneppel (Member) commented Mar 17, 2026

Issue

During pgdata symlink setup, rm -rf was used to remove /var/lib/postgresql/16/main if it existed as a non-symlink. This could cause data loss if the directory unexpectedly contained data. Additionally, the charm set ActiveStatus while PostgreSQL was still in Patroni's "starting" state (not yet accepting connections), causing tests and consumers waiting for all_active to proceed prematurely.

Solution

Replace rm -rf with mv to persistent storage (DPE-9630)

Patroni's remove_data_directory() can delete the pgdata symlink during failed pg_basebackup retries, and the retry recreates the path as a real directory. Since ln -sfn cannot replace a real directory, the charm now checks with test -L and moves the directory to PV-backed storage (pgdata-backup-<timestamp>) instead of deleting it, preserving data for debugging.
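The guard described above can be sketched on a local filesystem as follows. This is a minimal illustration, not the charm's code: the charm runs `test -L`, `mv`, and `ln -sfn` inside the workload container, whereas this sketch uses `pathlib` and `shutil`. The function name `ensure_pgdata_symlink`, its parameters, and the timestamp format are assumptions made for the example.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def ensure_pgdata_symlink(link_path: Path, target: Path, backup_root: Path) -> Optional[Path]:
    """Make link_path a symlink to target without deleting existing data.

    Returns the backup location if a real directory was moved aside, else None.
    Hypothetical helper: names and timestamp format are illustrative only.
    """
    moved_to = None
    # is_symlink() does not follow links (like `test -L`), while is_dir() does,
    # so a healthy symlink-to-directory has to be ruled out first.
    if not link_path.is_symlink() and link_path.is_dir():
        # Move the real directory to persistent storage instead of removing it.
        moved_to = backup_root / f"pgdata-backup-{time.strftime('%Y%m%d-%H%M%S')}"
        shutil.move(str(link_path), str(moved_to))
    # Replace any existing symlink so the link always points at the target.
    if link_path.is_symlink():
        link_path.unlink()
    link_path.symlink_to(target, target_is_directory=True)
    return moved_to
```

The move matters because `ln -sfn` given an existing real directory creates the link inside that directory rather than replacing it, so the directory has to be out of the way before the symlink is created.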

Show WaitingStatus when PostgreSQL is starting

When Patroni reports member_started=True but the health state is "starting" (PostgreSQL not yet accepting connections), _set_active_status now sets WaitingStatus("waiting for PostgreSQL to start") instead of ActiveStatus. This prevents premature readiness signaling. The WaitingStatus is:

  • Skipped during refresh (so the refresh library doesn't consider the unit unhealthy)
  • Exempted from the _on_update_status early exit (so update-status can re-evaluate and set ActiveStatus once PostgreSQL finishes starting)
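The status rules above can be condensed into a small decision function. This is a sketch under stated assumptions rather than the charm's implementation: `Status`, `decide_status`, and `update_status_should_exit_early` are hypothetical names, statuses are modelled as plain strings instead of `ops` status objects, the refresh case is assumed to fall through to active as it did before the change, and the early exit is assumed to apply to blocked and waiting statuses. Only `POSTGRESQL_STARTING_MESSAGE` and its text come from the pull request.

```python
from dataclasses import dataclass
from typing import Optional

POSTGRESQL_STARTING_MESSAGE = "waiting for PostgreSQL to start"


@dataclass(frozen=True)
class Status:
    """Simplified stand-in for a unit status (name plus optional message)."""

    name: str
    message: str = ""


def decide_status(
    member_started: bool, patroni_state: str, refresh_in_progress: bool
) -> Optional[Status]:
    """Return the status to set, or None to leave the current status alone."""
    if not member_started:
        return None
    # Patroni has started the member, but PostgreSQL is not accepting connections yet.
    if patroni_state == "starting" and not refresh_in_progress:
        return Status("waiting", POSTGRESQL_STARTING_MESSAGE)
    # Assumption: during a refresh the waiting status is skipped and the unit
    # is reported active, so the refresh library does not see it as unhealthy.
    return Status("active")


def update_status_should_exit_early(current: Status) -> bool:
    """Assumed early-exit rule: skip re-evaluation for blocked or waiting units,
    except when the unit is only waiting for PostgreSQL to start."""
    if current.name not in {"blocked", "waiting"}:
        return False
    return current.message != POSTGRESQL_STARTING_MESSAGE
```

Without the exemption in the second function, a unit that entered the starting state would keep its WaitingStatus indefinitely, because update-status would return before re-checking Patroni.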

Add all_active waits to async replication tests

test_data_replication and test_standby_promotion lacked all_active waits before querying units, causing Connection refused errors when PostgreSQL was still starting.
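The idea behind such a wait can be shown with a generic polling helper. This is a stand-in for illustration only: the actual tests rely on the test framework's own all_active wait, and `wait_until_all_active`, `get_statuses`, and the default timeout and delay here are hypothetical.

```python
import time
from typing import Callable, Iterable


def wait_until_all_active(
    get_statuses: Callable[[], Iterable[str]],
    timeout: float = 600.0,
    delay: float = 5.0,
) -> None:
    """Poll until every reported unit status is "active", or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = list(get_statuses())
        # An empty list means no units were reported yet, which is not "all active".
        if statuses and all(status == "active" for status in statuses):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"units not all active after {timeout}s: {statuses}")
        time.sleep(delay)
```

Because a starting unit now reports WaitingStatus instead of ActiveStatus, a wait of this kind blocks until PostgreSQL accepts connections, which is what keeps the subsequent queries from hitting Connection refused.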

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

…data setup

Replace rm -rf with a timestamped mv that backs up existing pgdata directories
to persistent storage, preventing data loss.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel added the "bug" label (Something isn't working as expected) on Mar 17, 2026
@marceloneppel changed the title from "fix: preserve PostgreSQL data by moving instead of deleting during pgdata setup" to "[DPE-9630] fix: preserve PostgreSQL data by moving instead of deleting during pgdata setup" on Mar 17, 2026
Replace container.isdir() check with explicit symlink test to avoid
moving existing pgdata symlinks to backup when redeploying. This
ensures only real directories are backed up, preventing data loss
from incorrect symlink handling.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions bot added the "Libraries: Out of sync" label (The charm libs used are out-of-sync) on Mar 17, 2026

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.61%. Comparing base (b4e70ef) to head (2fff5c7).

Files with missing lines | Patch % | Lines
src/charm.py             | 58.33%  | 5 Missing ⚠️

❌ Your project check has failed because the head coverage (68.61%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1359      +/-   ##
===========================================
- Coverage    68.64%   68.61%   -0.03%     
===========================================
  Files           16       16              
  Lines         3881     3891      +10     
  Branches       590      592       +2     
===========================================
+ Hits          2664     2670       +6     
- Misses        1009     1014       +5     
+ Partials       208      207       -1     

marceloneppel and others added 7 commits March 18, 2026 09:57
Remove unnecessary code that backed up existing pgdata directories when creating the pgdata symlink. The OCI image handling has been simplified to always create the symlink without checking if the target exists as a real directory.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add WaitingStatus when member_started is True but Patroni state is
"starting" to avoid premature ActiveStatus. Add retry logic to async
replication scale-up test to handle eventual consistency.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…covery

Skip the "waiting for PostgreSQL to start" WaitingStatus during refresh
so the refresh library does not consider the unit unhealthy. Exempt this
status from the update-status early exit so update-status can re-evaluate
and set ActiveStatus once PostgreSQL finishes starting.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… querying

Restore the symlink guard in _ensure_pgdata_dirs_and_symlinks using
test -L to detect when the pgdata path is a real directory instead of
a symlink. This can happen when Patroni's remove_data_directory()
deletes the symlink during failed pg_basebackup retries and the retry
recreates the path as a real directory. Move the directory to
persistent storage for debugging instead of deleting it.

Add all_active waits before data consistency checks in
test_data_replication and test_standby_promotion. Remove retry
wrappers from test_scale_up and test_unrelate_and_relate since they
already wait for all_active before querying.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…pgdata-symlink-setup

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… into fix/dpe-9630-safe-pgdata-symlink-setup

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel changed the title from "[DPE-9630] fix: preserve PostgreSQL data by moving instead of deleting during pgdata setup" to "[DPE-9630] fix: safe pgdata symlink setup with accurate status reporting" on Mar 20, 2026
Replace hardcoded "waiting for PostgreSQL to start" strings with
POSTGRESQL_STARTING_MESSAGE constant for better maintainability

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…pgdata-symlink-setup

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>