
Commit 97e6d7d

docs: Updated README.md (#193)
1 parent 7646481 commit 97e6d7d

File tree

1 file changed (+69, -0 lines)


README.md

Lines changed: 69 additions & 0 deletions
@@ -74,6 +74,75 @@ The following properties are understood and documented in the [Module Descriptor
* DB_MAXPOOLSIZE
* DB_PORT

### Issues

This module has a few "problem" scenarios that _shouldn't_ occur in general operation; their history, reasoning, and workarounds are documented below.

#### Locks and failure to upgrade

Certain approaches to upgrades can leave the module unable to self-right.
This occurs most often where the module or its container dies or is killed repeatedly during or shortly after the upgrade.
The issue documented here was exacerbated by transaction-handling changes brought about by the Grails 5 -> 6 upgrade as part of Quesnelia, and fix attempts are ongoing.

In order of importance to check:

- **CPU resource**
  - In the past we have commonly had these issues reported where the app was not getting enough CPU resources to run. Please ensure that the CPU resources allocated to the application are sufficient; see the module requirements for the version being run ([Ramsons example matrix](https://folio-org.atlassian.net/wiki/spaces/REL/pages/398983244/Ramsons+R2+2024+-+Bugfest+env+preparation+-+Modules+configuration+details?focusedCommentId=608305153)).
- **Liquibase**
  - The module uses Liquibase to facilitate module data migrations.
  - Unfortunately this is vulnerable to being shut down mid-migration.
  - Check that `<tenantName>_mod_serials_management.tenant_changelog_lock` does not have `locked` set to `true` (a SQL sketch of these checks follows this list).
  - If it does, the migration (and hence the upgrade itself) has failed, and it is difficult to extricate the module from this scenario.
  - It may be most prudent to revert the data and retry the upgrade.
  - In general, while the module is upgrading it is most likely to succeed if, after startup and tenant enabling/upgrading through Okapi, the module and its container are NOT KILLED for at least 2 minutes.
  - In addition, a death of the module while upgrading could be due to insufficient resources reaching the module.
- **Federated changelog lock**
  - The module also has a manual lock which is managed by the application itself.
  - This is to facilitate multiple instances accessing the same data.
  - In particular, this lock table "seeds" every 20 minutes or so, and a death in the middle of this _can_ lock up the application (although it can sometimes self-right from here).
  - If the Liquibase lock is clear, first try starting up and leaving the module running for a good 20 minutes.
    - If the module dies, resourcing is likely the issue.
    - The module may be able to self-right.
  - If the module cannot self-right:
    - Check `mod_serials_management__system.system_changelog_lock`.
      - The same advice applies as in the Liquibase section above, since this is also a Liquibase lock, but it is very unlikely to get stuck as the table is so small.
    - Finally, check `mod_serials_management__system.federation_lock`.
      - If this table has entries, it can prevent the module from performing any operations.
      - It should self-right from here, even if the entries point at dead instances.
      - See `mod_serials_management__system.app_instance` for a table of instance IDs; a killed and restarted module should eventually be cleared from here.
      - It is NOT RECOMMENDED to clear `app_instance` entries manually.
      - If there are entries in the federated lock table that do not clear after 20 minutes of uninterrupted running, then the `federation_lock` table should be manually emptied (see the sketch below).
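
A minimal SQL sketch of the checks described in the list above, run directly against the FOLIO database (for example via `psql`). The schema and table names are the ones documented here; `diku` is a hypothetical stand-in for `<tenantName>`, and `SELECT *` is used because only the `locked` column is documented:

```sql
-- 'diku' is a placeholder tenant; replace with the real tenant name.
-- Liquibase tenant changelog lock: "locked" should not be true.
SELECT * FROM diku_mod_serials_management.tenant_changelog_lock;

-- System-schema Liquibase lock: very unlikely to be stuck, but worth checking.
SELECT * FROM mod_serials_management__system.system_changelog_lock;

-- Federated lock table: entries here can block all operations.
SELECT * FROM mod_serials_management__system.federation_lock;

-- Registered instances, for reference only (do NOT clear this table manually).
SELECT * FROM mod_serials_management__system.app_instance;

-- Only if federation_lock entries have not cleared after ~20 minutes of
-- uninterrupted running, empty the table manually:
-- DELETE FROM mod_serials_management__system.federation_lock;
```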

#### Connection pool issues

As of the Sunflower release, issues with [federated locks](#locks-and-failure-to-upgrade) and connection pools have been ongoing since Quesnelia.
The attempted fixes and their history are documented in JIRA ticket [ERM-3851](https://folio-org.atlassian.net/browse/ERM-3851).

Initially, the Grails 6 upgrade caused federated lock rows themselves to lock in PG.
A fix was made for Sunflower (v2.0.0) and backported to Quesnelia (v1.0.5) and Ramsons (v1.1.6).
However, this fix is not fully complete, and it worsens an underlying connection pool issue.

The connection pool per instance can be configured via the `DB_MAXPOOLSIZE` environment variable.
Since the introduction of module federation for this module, this has been _doubled_ to ensure connections are available for the system schema as well.
This is necessary because a starved system schema would all but guarantee the federated lock issues documented above.
In response, our approach has been to request more and more connections, memory, and CPU time to lower the chances of this happening as much as possible.

As of right now, the recommended Sunflower connection pool is 10 per instance.
This leads to 20 connections per instance, almost all of which PG will see as idle.
The non-dropping of idle connections is a [chosen behaviour of Hikari](https://www.postgresql.org/message-id/[email protected]) (and so not a bug).
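
To see how PostgreSQL itself views these connections, a query along the following lines can be used (an illustrative sketch only; filtering on `application_name` is an assumption and may need to be swapped for `usename` or `datname` depending on how connections are tagged in your deployment):

```sql
-- Count connections by application and state as reported by PostgreSQL.
-- With DB_MAXPOOLSIZE=10 you would expect roughly 20 connections per
-- instance, most of them in the "idle" state.
SELECT application_name, state, count(*) AS connections
FROM pg_stat_activity
WHERE application_name ILIKE '%serials%'
GROUP BY application_name, state
ORDER BY connections DESC;
```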

At the moment, although Postgres sees most of this pool as idle, Hikari internally believes the connections to be active, causing pool starvation unless the instance is massively over-resourced.
This in turn locks up the instance entirely and leads to jobs silently failing.

The workarounds here are to over-resource the module and to restart problematic instances (or all instances) when this behaviour manifests, or to revert to versions where the problem is less prevalent (v6.1.3, v6.2.0) and handle the federated locking issues instead.
Obviously these are not proper solutions.

In Trillium (v2.1.0), the aim is both to fix these bugs, hopefully freeing up the connection pool to the extent that it can be run with _significantly_ fewer connections, and potentially to set up a way for the configured pool size to be split between the system and module schemas, so as to avoid the doubling of the pool.

The recommendation for the versions containing the fix is to run with a minimum of 10 connections per instance (which will be doubled to 20 to account for the system schema).
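
When sizing the pool, it may also be worth sanity-checking the server-side limit against the expected total. A small sketch, assuming direct SQL access and that other FOLIO modules share the same database server (an assumption about the deployment, not something stated here):

```sql
-- Per the recommendation above, each mod-serials-management instance uses
-- 2 x DB_MAXPOOLSIZE connections (module schema + system schema),
-- e.g. 3 instances x 2 x 10 = 60, plus headroom for anything else
-- connecting to the same server.
SHOW max_connections;
```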

## ModuleDescriptor

https://github.com/folio-org/mod-serials-management/blob/master/service/src/main/okapi/ModuleDescriptor-template.json
