Skip to content

Large pg_wal directory #1064

@kian99

Description

@kian99

Steps to reproduce

  1. juju deploy postgresql-k8s postgresql --channel=14/stable --trust

Related the database to 2 charms, https://charmhub.io/juju-jimm-k8s and https://charmhub.io/openfga-k8s. Didn't complete setting up the charms so the OpenFGA charm was ready but the JIMM charm was in a crash-loop. Came back a few days later and the postgres charm was in an error state reporting <10% free space on pgdata volume.

Expected behaviour

The Postgres charm doesn't run out of space when there is no data being written.

Actual behaviour

pg_wal directory continued to grow until the volume ran out of space.

Versions

juju show-application postgresql
postgresql:
  charm: postgresql-k8s
  base: ubuntu@22.04
  channel: 14/stable
postgresql            14.15    waiting      1  postgresql-k8s            14/stable      495  10.87.198.202  no       installing agent
postgresql/0*            blocked   idle   192.168.101.204         <10% free space on pgdata volume.

Operating system: Ubuntu 22.04

Juju CLI: 3.6.8

Juju agent: 3.4.5

Charm revision: 495

microk8s: N/A (Some variant of Charmed K8s)

Log output

I don't have relevant Juju logs but I have a backup of the pgdata directory and a backup of the /var/log/postgresql/ directory that I can share.

Additional context

Last week, after running a single node K8s database without issue for many months, suddenly the pgdata volume was full. This is a development environment with little to no traffic. After taking a copy of the DB and analyzing it locally, we could see the following,

 Name    |  Owner   | Encoding | Locale Provider | Collate |  Ctype  | ICU Locale | ICU Rules |      Access privileges       |  Size   | Tablespace |                Description                 
-----------+----------+----------+-----------------+---------+---------+------------+-----------+------------------------------+---------+------------+--------------------------------------------
 jimm      | operator | UTF8     | libc            | C       | C.UTF-8 |            |           | operator=CTc/operator       +| 92 MB   | pg_default | 
           |          |          |                 |         |         |            |           | admin=CTc/operator          +|         |            | 
           |          |          |                 |         |         |            |           | backup=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | replication=CTc/operator    +|         |            | 
           |          |          |                 |         |         |            |           | rewind=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | monitoring=CTc/operator     +|         |            | 
           |          |          |                 |         |         |            |           | relation_id_128=CTc/operator |         |            | 
 openfga   | operator | UTF8     | libc            | C       | C.UTF-8 |            |           | operator=CTc/operator       +| 9241 kB | pg_default | 
           |          |          |                 |         |         |            |           | admin=CTc/operator          +|         |            | 
           |          |          |                 |         |         |            |           | backup=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | replication=CTc/operator    +|         |            | 
           |          |          |                 |         |         |            |           | rewind=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | monitoring=CTc/operator     +|         |            | 
           |          |          |                 |         |         |            |           | relation_id_95=CTc/operator  |         |            | 
 postgres  | operator | UTF8     | libc            | C       | C.UTF-8 |            |           | operator=CTc/operator       +| 8665 kB | pg_default | default administrative connection database
           |          |          |                 |         |         |            |           | backup=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | replication=CTc/operator    +|         |            | 
           |          |          |                 |         |         |            |           | rewind=CTc/operator         +|         |            | 
           |          |          |                 |         |         |            |           | monitoring=CTc/operator     +|         |            | 
           |          |          |                 |         |         |            |           | admin=c/operator             |         |            | 
 template0 | operator | UTF8     | libc            | C       | C.UTF-8 |            |           | =c/operator                 +| 8641 kB | pg_default | unmodifiable empty database
           |          |          |                 |         |         |            |           | operator=CTc/operator        |         |            | 
 template1 | operator | UTF8     | libc            | C       | C.UTF-8 |            |           | =c/operator                 +| 8641 kB | pg_default | default template for new databases
           |          |          |                 |         |         |            |           | operator=CTc/operator        |         |            | 
(5 rows)

Only 1 DB had some data with 92 MB but the pg_wal directory was taking up a significant amount of space (enough to fill the remainder of the 1GB volume).

I proceeded to juju remove-application ... and removed the DB, cleaned up the K8s pvc and pv because I forgot to specify --destroy-storage and then redeployed Postgres.

I redeployed Postgres on the Friday and came back to it today on Tuesday to see it is full again. I've again taken a copy of pgdata and looking at the tables, this time no database has more than 1MB of data but still the same problem has occurred where pg_wal is ~900MB in size.

I can share the pgdata dir and logs if that would be helpful. I can see the WAL files are all 16MB, started being created on Aug 15 when I redeployed the DB and new ones added periodically but old ones are never removed. I'm unsure if the problem is my application doing a lot of writes or holding open a long transaction and causing an issue with WAL archival but this seems unlikely to me because the OpenFGA charm is running a well known open-source service and the JIMM charm is running a Canonical service that didn't have all the necessary relations and was in a crash-loop.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working as expected

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions