Question about massive error recovery #144
Replies: 2 comments
-
TL;DR: yes, you could, and you would most likely run into OOMs and bring the whole cluster down if the remaining capacity is not sufficient to hold all the data in memory, and "all the data" includes both the primary and the backup copies.
The rule is simple: Coherence can tolerate the failure of a *single* component, and you get to choose what that component is: JVM, machine, rack or site (AZ), by provisioning enough capacity to be able to store all the data *after* the failure of that component. In other words, if you need 30 JVMs to store all the data, you should run with 31 if all you care about is a single member/JVM failure (rarely the case); if you need 10 machines, you should run with 11 (often the case); etc.
Sizing for a site/AZ failure is a bit different, as it is usually impossible to "add" a whole site, especially when running in the cloud: you will typically have 3 AZs, so you need to size the cluster accordingly by making sure that if you lose one AZ there is enough capacity in the remaining 2 AZs to store all the data without running OOM. So in your case, if you need 30 nodes for the data, you will need 45 (5 "spares" per AZ) in order to be able to lose a whole AZ.
Yes, it can get expensive, but ultimately the choice is yours to make depending on what level of fault tolerance you need. There are also other ways to deal with it if you are using the commercial version, such as storing backups (or even primary data) on disk using the Elastic Data feature, but unfortunately that is not an option with the Community Edition.
As for bringing additional nodes up when the failure occurs, that is probably too late -- the cascading failure due to OOMs will likely happen faster than you can bring new capacity up, especially because the nodes will become unresponsive due to GC pressure as their heaps fill up, so it may be really difficult for the new members to even join the cluster.
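To make the 45-node figure concrete, here is a minimal sketch of the sizing arithmetic described above. This is plain Java, not a Coherence API; the numbers are the ones used in this thread.

```java
// Sketch of the capacity rule above: after losing one whole AZ, the surviving
// AZs must still be able to hold all the data (primary + backup copies).
public class AzSizingSketch {

    /**
     * Minimum total node count so that the remaining AZs can still hold all
     * data after losing one AZ.
     *
     * @param nodesNeededForData nodes required to hold all data (primaries + backups)
     * @param azCount            number of availability zones (typically 3)
     */
    static int minTotalNodes(int nodesNeededForData, int azCount) {
        int survivingAzs = azCount - 1;
        // Each AZ must hold ceil(nodesNeededForData / survivingAzs) nodes,
        // so that any (azCount - 1) AZs together provide enough capacity.
        int nodesPerAz = (nodesNeededForData + survivingAzs - 1) / survivingAzs;
        return nodesPerAz * azCount;
    }

    public static void main(String[] args) {
        // 30 nodes of data across 3 AZs -> 15 nodes per AZ -> 45 nodes total.
        System.out.println(minTotalNodes(30, 3)); // prints 45
    }
}
```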
-
Thanks, that was what I was assuming (and what we do today); I just wanted to check we are not overspending...
-
I have two questions about how Coherence recovers from massive errors. Let's say I have a cluster of 30 nodes spread over 3 availability zones (using the rack id set up so that no primary and its backup are in the same AZ) with backup count 1 (all data stored twice). The volume of data is fairly static.
As failures of more than one node at a time are fairly rare, I would like to keep only as much slack across the nodes as, say, two nodes hold, rather than the full 10 nodes of an availability zone. In the rare event of a massive failure of one availability zone I am wondering how Coherence reacts - all data should still be available and I assume the backups are promoted to primaries.
My first question is how gracefully Coherence would handle that there is, as in this case, no room to create new backups for all the data - is there a risk that I will get out-of-memory errors on some remaining nodes, leading to cascading errors and eventually data loss, or are there mechanisms in place that prevent this?
The second question is how Coherence handles it if I successively spin up replacement nodes in the remaining availability zones, in the end getting back to the full 30 nodes - will the cluster eventually recover, or is there once again a risk of out-of-memory errors or other problems, as a good number of nodes needs to be added back before there is room for backups of all the data again?
Are there, for instance, any settings where I can specify how many nodes should be available before Coherence tries to create backups, or some other way to configure the behaviour in these types of massive error situations?
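For reference, a minimal sketch of the kind of rack-aware setup described in the question above, assuming the standard Coherence member-identity system properties (coherence.rack, coherence.machine) and the Coherence bootstrap API; the AZ and host names are placeholders, not actual deployment values.

```java
import com.tangosol.net.Coherence;

// Sketch of a storage node whose rack id is set to its AZ, so the partitioned
// service places backups in a different AZ than the primary.
// Values ("az-1", "host-07") are placeholders.
public class StorageNode {
    public static void main(String[] args) {
        System.setProperty("coherence.rack", "az-1");        // AZ this node runs in
        System.setProperty("coherence.machine", "host-07");  // host/VM identity
        // (coherence.site can be used the same way at the site level.)

        // backup count is configured per service in the cache configuration
        // (<backup-count>1</backup-count> in the distributed scheme); 1 is the default.

        // Start a storage-enabled cluster member; join() blocks until startup completes.
        Coherence coherence = Coherence.clusterMember();
        coherence.start().join();
    }
}
```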