
Commit 97e6d7d

docs: Updated README.md (#193)
1 parent 7646481 commit 97e6d7d

File tree

1 file changed (+69, -0 lines)


README.md

Lines changed: 69 additions & 0 deletions
@@ -74,6 +74,75 @@ The following properties are understood and documented in the [Module Descriptor
* DB_MAXPOOLSIZE
* DB_PORT

### Issues

This module has a few "problem" scenarios that _shouldn't_ occur in general operation; their history, reasoning, and workarounds are documented below.

#### Locks and failure to upgrade

Certain approaches to upgrades can leave the module unable to self-right.
This occurs most often where the module or its container dies or is killed repeatedly during or shortly after the upgrade.
The issue documented here was exacerbated by transaction-handling changes brought about by the Grails 5 -> 6 upgrade as part of Quesnelia, and fix attempts are ongoing.

In order of importance to check:

- **CPU resource**
  - In the past we have commonly had these issues reported where the app was not getting enough CPU resources to run. Please ensure that the CPU resources allocated to the application are sufficient; see the module requirements for the version being run ([Ramsons example matrix](https://folio-org.atlassian.net/wiki/spaces/REL/pages/398983244/Ramsons+R2+2024+-+Bugfest+env+preparation+-+Modules+configuration+details?focusedCommentId=608305153)).
- **Liquibase**
  - The module uses Liquibase to facilitate module data migrations.
  - Unfortunately this is vulnerable to being shut down mid-migration.
  - Check that `<tenantName>_mod_serials_management.tenant_changelog_lock` does not have `locked` set to `true` (a SQL sketch of these checks follows this list).
  - If it does, the migration (and hence the upgrade itself) has failed, and it is difficult to extricate the module from this scenario.
  - It may be most prudent to revert the data and retry the upgrade.
  - In general, while the module is upgrading it is most likely to succeed if, after startup and tenant enabling/upgrading through Okapi, the module and its container are NOT KILLED for at least 2 minutes.
  - In addition, a death of the module while upgrading could be due to insufficient resources reaching the module.
- **Federated changelog lock**
  - The module also has a manual lock which is managed by the application itself.
  - This is to facilitate multiple instances accessing the same data.
  - In particular, this lock table "seeds" every 20 minutes or so, and a death in the middle of this _can_ lock up the application (although it can sometimes self-right from here).
  - If the Liquibase lock is clear, first try starting up and leaving the module running for a good 20 minutes.
    - If the module dies, resourcing is likely the issue.
    - The module may be able to self-right.
  - If the module cannot self-right:
    - Check `mod_serials_management__system.system_changelog_lock`.
      - The same advice applies as in the Liquibase section above, since this is also a Liquibase lock, but it is very unlikely to get stuck as the table is so small.
    - Finally, check `mod_serials_management__system.federation_lock`.
      - If this table has entries, it can prevent the module from performing any operations.
      - It should self-right from here, even if the entries point at dead instances.
      - See `mod_serials_management__system.app_instance` for a table of instance IDs; a killed and restarted module should eventually be cleared from here.
      - It is NOT RECOMMENDED to clear `app_instance` entries manually.
      - If there are entries in the federated lock table that do not clear after 20 minutes of uninterrupted running, then the `federation_lock` table should be manually emptied (see the sketch below).
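
A minimal SQL sketch of the checks described in the list above, run directly against the FOLIO database (for example via `psql`). The schema and table names are the ones documented here; `diku` is a hypothetical stand-in for `<tenantName>`, and `SELECT *` is used because only the `locked` column is documented:

```sql
-- 'diku' is a placeholder tenant; replace with the real tenant name.
-- Liquibase tenant changelog lock: "locked" should not be true.
SELECT * FROM diku_mod_serials_management.tenant_changelog_lock;

-- System-schema Liquibase lock: very unlikely to be stuck, but worth checking.
SELECT * FROM mod_serials_management__system.system_changelog_lock;

-- Federated lock table: entries here can block all operations.
SELECT * FROM mod_serials_management__system.federation_lock;

-- Registered instances, for reference only (do NOT clear this table manually).
SELECT * FROM mod_serials_management__system.app_instance;

-- Only if federation_lock entries have not cleared after ~20 minutes of
-- uninterrupted running, empty the table manually:
-- DELETE FROM mod_serials_management__system.federation_lock;
```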

#### Connection pool issues

As of the Sunflower release, issues with [federated locks](#locks-and-failure-to-upgrade) and connection pools have been ongoing since Quesnelia.
The attempted fixes and their history are documented in JIRA ticket [ERM-3851](https://folio-org.atlassian.net/browse/ERM-3851).

Initially, the Grails 6 upgrade caused federated lock rows themselves to lock in PG.
A fix was made for Sunflower (v2.0.0) and backported to Quesnelia (v1.0.5) and Ramsons (v1.1.6).
However, this fix is not fully complete, and it worsens an underlying connection pool issue.

The connection pool per instance can be configured via the `DB_MAXPOOLSIZE` environment variable.
Since the introduction of module federation for this module, this has been _doubled_ to ensure connections are available for the system schema as well.
This is necessary because a starved system schema would all but guarantee the federated lock issues documented above.
In response, our approach has been to request more and more connections, memory, and CPU time to lower the chances of this happening as much as possible.

As of right now, the recommended Sunflower connection pool is 10 per instance.
This leads to 20 connections per instance, almost all of which PG will see as idle.
The non-dropping of idle connections is a [chosen behaviour of Hikari](https://www.postgresql.org/message-id/[email protected]) (and so not a bug).
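
To see how PostgreSQL itself views these connections, a query along the following lines can be used (an illustrative sketch only; filtering on `application_name` is an assumption and may need to be swapped for `usename` or `datname` depending on how connections are tagged in your deployment):

```sql
-- Count connections by application and state as reported by PostgreSQL.
-- With DB_MAXPOOLSIZE=10 you would expect roughly 20 connections per
-- instance, most of them in the "idle" state.
SELECT application_name, state, count(*) AS connections
FROM pg_stat_activity
WHERE application_name ILIKE '%serials%'
GROUP BY application_name, state
ORDER BY connections DESC;
```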

At the moment, although Postgres sees most of this pool as idle, Hikari internally believes the connections to be active, causing pool starvation unless the instance is massively over-resourced.
This in turn locks up the instance entirely and leads to jobs silently failing.

The workarounds here are to over-resource the module and to restart problematic instances (or all instances) when this behaviour manifests, or to revert to versions where the problem is less prevalent (v6.1.3, v6.2.0) and handle the federated locking issues instead.
Obviously these are not proper solutions.

In Trillium (v2.1.0), the aim is both to fix these bugs, hopefully freeing up the connection pool to the extent that it can be run with _significantly_ fewer connections, and potentially to set up a way for the configured pool size to be split between the system and module schemas, so as to avoid the doubling of the pool.

The recommendation for the versions containing the fix is to run with a minimum of 10 connections per instance (which will be doubled to 20 to account for the system schema).
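
When sizing the pool, it may also be worth sanity-checking the server-side limit against the expected total. A small sketch, assuming direct SQL access and that other FOLIO modules share the same database server (an assumption about the deployment, not something stated here):

```sql
-- Per the recommendation above, each mod-serials-management instance uses
-- 2 x DB_MAXPOOLSIZE connections (module schema + system schema),
-- e.g. 3 instances x 2 x 10 = 60, plus headroom for anything else
-- connecting to the same server.
SHOW max_connections;
```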

## ModuleDescriptor

https://github.com/folio-org/mod-serials-management/blob/master/service/src/main/okapi/ModuleDescriptor-template.json
