Description
Steps to reproduce
juju deploy postgresql-k8s postgresql --channel=14/stable --trust
I related the database to two charms, https://charmhub.io/juju-jimm-k8s and https://charmhub.io/openfga-k8s. I didn't complete setting up the charms, so the OpenFGA charm was ready but the JIMM charm was in a crash loop. When I came back a few days later, the postgres charm was in an error state reporting <10% free space on the pgdata volume.
Expected behaviour
The Postgres charm doesn't run out of space when no data is being written.
Actual behaviour
The pg_wal directory continued to grow until the volume ran out of space.
Versions
juju show-application postgresql:
  postgresql:
    charm: postgresql-k8s
    base: ubuntu@22.04
    channel: 14/stable

juju status (excerpt):
  postgresql     14.15  waiting  1  postgresql-k8s  14/stable  495  10.87.198.202  no  installing agent
  postgresql/0*  blocked  idle  192.168.101.204  <10% free space on pgdata volume.
Operating system: Ubuntu 22.04
Juju CLI: 3.6.8
Juju agent: 3.4.5
Charm revision: 495
microk8s: N/A (Some variant of Charmed K8s)
Log output
I don't have the relevant Juju logs, but I have a backup of the pgdata directory and a backup of the /var/log/postgresql/ directory that I can share.
Additional context
Last week, after this single-node K8s database had run without issue for many months, the pgdata volume was suddenly full. This is a development environment with little to no traffic. After taking a copy of the DB and analyzing it locally, we saw the following:
Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges | Size | Tablespace | Description
-----------+----------+----------+-----------------+---------+---------+------------+-----------+------------------------------+---------+------------+--------------------------------------------
jimm | operator | UTF8 | libc | C | C.UTF-8 | | | operator=CTc/operator +| 92 MB | pg_default |
| | | | | | | | admin=CTc/operator +| | |
| | | | | | | | backup=CTc/operator +| | |
| | | | | | | | replication=CTc/operator +| | |
| | | | | | | | rewind=CTc/operator +| | |
| | | | | | | | monitoring=CTc/operator +| | |
| | | | | | | | relation_id_128=CTc/operator | | |
openfga | operator | UTF8 | libc | C | C.UTF-8 | | | operator=CTc/operator +| 9241 kB | pg_default |
| | | | | | | | admin=CTc/operator +| | |
| | | | | | | | backup=CTc/operator +| | |
| | | | | | | | replication=CTc/operator +| | |
| | | | | | | | rewind=CTc/operator +| | |
| | | | | | | | monitoring=CTc/operator +| | |
| | | | | | | | relation_id_95=CTc/operator | | |
postgres | operator | UTF8 | libc | C | C.UTF-8 | | | operator=CTc/operator +| 8665 kB | pg_default | default administrative connection database
| | | | | | | | backup=CTc/operator +| | |
| | | | | | | | replication=CTc/operator +| | |
| | | | | | | | rewind=CTc/operator +| | |
| | | | | | | | monitoring=CTc/operator +| | |
| | | | | | | | admin=c/operator | | |
template0 | operator | UTF8 | libc | C | C.UTF-8 | | | =c/operator +| 8641 kB | pg_default | unmodifiable empty database
| | | | | | | | operator=CTc/operator | | |
template1 | operator | UTF8 | libc | C | C.UTF-8 | | | =c/operator +| 8641 kB | pg_default | default template for new databases
| | | | | | | | operator=CTc/operator | | |
(5 rows)
Only one DB had any data (92 MB), but the pg_wal directory was taking up a significant amount of space (enough to fill the remainder of the 1 GB volume).
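For reference, the usual causes of unbounded pg_wal growth are an inactive replication slot pinning old segments or a failing archive_command. These are standard PostgreSQL catalog queries (nothing charm-specific) that I'd expect to show which one it is on a live instance:

```sql
-- Replication slots: an inactive slot (active = f) retains all WAL
-- from its restart_lsn onward, so segments are never recycled.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;

-- Archiver status: a growing failed_count means archive_command is
-- failing and PostgreSQL keeps every unarchived segment.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
```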
I then ran juju remove-application ... to remove the DB, cleaned up the K8s PVC and PV by hand (because I forgot to specify --destroy-storage), and redeployed Postgres.
I redeployed Postgres on the Friday and came back to it today, Tuesday, to find the volume full again. I've again taken a copy of pgdata; this time no database has more than 1 MB of data, yet the same problem has occurred: pg_wal is ~900 MB in size.
I can share the pgdata dir and logs if that would be helpful. The WAL files are all 16 MB; they started being created on Aug 15 when I redeployed the DB, new ones are added periodically, and old ones are never removed. I'm unsure whether the problem is my application doing a lot of writes or holding open a long transaction and interfering with WAL archival, but this seems unlikely to me: the OpenFGA charm runs a well-known open-source service, and the JIMM charm runs a Canonical service that didn't have all the necessary relations and was in a crash loop.
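To rule out a configuration cause, the WAL-retention settings can be checked with a standard pg_settings query (again, nothing charm-specific; column names are from stock PostgreSQL 14). If max_wal_size is modest but pg_wal still grows past it, something else (a slot or the archiver) must be holding segments back:

```sql
-- Settings that bound WAL retention under normal operation.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_wal_size', 'min_wal_size', 'wal_keep_size',
               'archive_mode', 'archive_command');
```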