Question about massive error recovery #144
Replies: 2 comments
-
TL;DR: yes, you could, and you would most likely run into OOMs and bring the whole cluster down if the remaining capacity is not sufficient to hold all the data in memory, and "all the data" includes both the primary and the backup copies.
The rule is simple: Coherence can tolerate the failure of a *single* component, and you get to choose what that component is: JVM, machine, rack or site (AZ), by provisioning enough capacity to be able to store all the data *after* the failure of that component. In other words, if you need 30 JVMs to store all the data, you should run with 31 if all you care about is a single member/JVM failure (rarely the case); if you need 10 machines, you should run with 11 (often the case); etc.
Sizing for a site/AZ failure is a bit different, as it is usually impossible to "add" a whole site, especially when running in the cloud: you will typically have 3 AZs, so you need to size the cluster accordingly by making sure that if you lose one AZ there is enough capacity in the remaining 2 AZs to store all the data without running OOM. So in your case, if you need 30 nodes for the data, you will need 45 (5 "spares" per AZ) in order to be able to lose a whole AZ.
Yes, it can get expensive, but ultimately the choice is yours to make depending on what level of fault tolerance you need. There are also other ways to deal with it if you are using the commercial version, such as storing backups (or even primary data) on disk using the Elastic Data feature, but unfortunately that is not an option with the Community Edition.
As for bringing additional nodes up when the failure occurs, that is probably too late -- the cascading failure due to OOMs will likely happen faster than you can bring new capacity up, especially because the nodes will become unresponsive due to GC pressure as their heaps fill up, so it may be really difficult for the new members to even join the cluster.
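To make the 45-node figure concrete, here is a minimal sketch of the sizing arithmetic described above. This is plain Java, not a Coherence API; the numbers are the ones used in this thread.

```java
// Sketch of the capacity rule above: after losing one whole AZ, the surviving
// AZs must still be able to hold all the data (primary + backup copies).
public class AzSizingSketch {

    /**
     * Minimum total node count so that the remaining AZs can still hold all
     * data after losing one AZ.
     *
     * @param nodesNeededForData nodes required to hold all data (primaries + backups)
     * @param azCount            number of availability zones (typically 3)
     */
    static int minTotalNodes(int nodesNeededForData, int azCount) {
        int survivingAzs = azCount - 1;
        // Each AZ must hold ceil(nodesNeededForData / survivingAzs) nodes,
        // so that any (azCount - 1) AZs together provide enough capacity.
        int nodesPerAz = (nodesNeededForData + survivingAzs - 1) / survivingAzs;
        return nodesPerAz * azCount;
    }

    public static void main(String[] args) {
        // 30 nodes of data across 3 AZs -> 15 nodes per AZ -> 45 nodes total.
        System.out.println(minTotalNodes(30, 3)); // prints 45
    }
}
```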
-
Thanks, that was what I was assuming (and what we do today); I just wanted to check we are not overspending...
-
I have two questions about how Coherence recovers from massive errors. Let's say I have a cluster of 30 nodes spread over 3 availability zones (using the rack id set up so that no primary and its backup are in the same AZ) with backup count 1 (all data stored twice). The volume of data is fairly static.
As failures of more than one node at a time are fairly rare, I would like to keep only as much slack across the nodes as, say, two nodes hold, rather than the full 10 nodes of an availability zone. In the rare event of a massive failure of one availability zone I am wondering how Coherence reacts - all data should still be available and I assume the backups are promoted to primaries.
My first question is how gracefully Coherence would handle that there is, as in this case, no room to create new backups for all the data - is there a risk that I will get out-of-memory errors on some remaining nodes, leading to cascading errors and eventually data loss, or are there mechanisms in place that prevent this?
The second question is how Coherence handles it if I successively spin up replacement nodes in the remaining availability zones, in the end getting back to the full 30 nodes - will the cluster eventually recover, or is there once again a risk of out-of-memory errors or other problems, as a good number of nodes needs to be added back before there is room for backups of all the data again?
Are there, for instance, any settings where I can specify how many nodes should be available before Coherence tries to create backups, or some other way to configure the behaviour in these types of massive error situations?
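For reference, a minimal sketch of the kind of rack-aware setup described in the question above, assuming the standard Coherence member-identity system properties (coherence.rack, coherence.machine) and the Coherence bootstrap API; the AZ and host names are placeholders, not actual deployment values.

```java
import com.tangosol.net.Coherence;

// Sketch of a storage node whose rack id is set to its AZ, so the partitioned
// service places backups in a different AZ than the primary.
// Values ("az-1", "host-07") are placeholders.
public class StorageNode {
    public static void main(String[] args) {
        System.setProperty("coherence.rack", "az-1");        // AZ this node runs in
        System.setProperty("coherence.machine", "host-07");  // host/VM identity
        // (coherence.site can be used the same way at the site level.)

        // backup count is configured per service in the cache configuration
        // (<backup-count>1</backup-count> in the distributed scheme); 1 is the default.

        // Start a storage-enabled cluster member; join() blocks until startup completes.
        Coherence coherence = Coherence.clusterMember();
        coherence.start().join();
    }
}
```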