Skip to content

Conversation

mariofusco
Copy link
Contributor

@mariofusco mariofusco commented Sep 4, 2024

The intent of this pull request is to mitigate the concurrent loading (or define to be more precise) of the same classes by multiple threads as reported here by @gsmet.

Note that I wrote "mitigate" instead of "solve", because I believe that given the concurrent nature of our classloader I believe that it's impossible to structurally solve the problem or at least that a solution consistently preventing ANY concurrent class definition will have a cost in terms of performances and/or memory occupation that will largely overcome its advantages.

As suggested by @franz1981 the main problem that we have at the moment is in the fact that multiple threads could try to define the same class. This behaviour is expected and inherent with the classloader's concurrent mechanisms. When this happens the classloader raises a LinkageError that is properly managed by discarding the attempted duplicated class definition and returning the class defined by the thread that won the race. This means however that we're basically using an exception for the normal control flow, which is a known performance antipattern, and something that should be avoided as much as possible.

To diagnose this problem I simply printed a log statement like

System.out.println("Duplicated class definition: " + name);

in the catch block of that LinkageError. Giving a single run of the fullMicroProfile benchmark of the quarkus-startstop application with this set up I saw that in average that duplicated class definition is performed around 50 times per run:

Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.net.impl.TCPServerBase
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.netty.channel.AbstractChannel$AbstractUnsafe$6
Duplicated class definition: io.netty.buffer.PoolArena$1

As anticipated this pull request allows, with a negligible cost, to largely mitigate this problem, so that, with this fix in place, the typical run of the same application now only has from 2

Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3

to maximum 5

Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
Duplicated class definition: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
Duplicated class definition: io.netty.channel.AbstractChannel$AbstractUnsafe$6

of those concurrent class definition.

The implementation of this improvement uses a single value cache to store the class currently to be defined from a given jar and blocks the other threads attempting to deifne that same class, making them to wait until the first thread that initiated the class definition process has completed it. Note that for virtual threads this blocking heuristic cannot be used and it is necessary to keep the completely non-blocking behaviour that we had before.

/cc @dmlloyd @Sanne @geoand @gsmet @franz1981

@quarkus-bot quarkus-bot bot added the area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins label Sep 4, 2024
Copy link

quarkus-bot bot commented Sep 4, 2024

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit 3cb0952.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.

@dmlloyd
Copy link
Member

dmlloyd commented Sep 5, 2024

As far as I can tell, this change, and the discussion which preceded it, presumes that there is actually a measurable performance deficit associated with the existing algorithm. But I didn't see anything that shows that the overall performance could be improved in this way. I guess the theory is that loading the resources is expensive enough that doing it extra times would have a cost, but that's just my guess.

Is there any way we could measure to see if startup time and/or time to first request is improved with this change?

@gsmet
Copy link
Member

gsmet commented Sep 5, 2024

I think we probably need to test three things:

  • Quarkus main
  • Quarkus main + revert the CL patch
  • Quarkus main + this patch

Because otherwise it's quite hard to compare things.

@mariofusco
Copy link
Contributor Author

The intent of this pull request is reducing the number of concurrent classes definition as reported in the linked issue. Said that reverting the patch making this class loader non-blocking doesn't seem an option to me, unless we don't want to drop virtual threads support, I already tested these 3 options

* Quarkus main
* Quarkus main + revert the CL patch
* Quarkus main + this patch

against the quarkus-startup full-microprofile benchmark and in all honesty I didn't find any relevant difference, or at least no difference above the margin of variability resulting from running that benchmark on my laptop. Is there any other benchmark or in general a better way through which we could test this change?

@dmlloyd
Copy link
Member

dmlloyd commented Sep 5, 2024

When this was designed, duplicate loading was an expected behavior of the algorithm. I think the discussion on the linked bug lost sight of that pretty quickly. So reducing it as a primary goal doesn't make a lot of sense to me. OTOH if we can show that the duplicate loading does in fact have a negative performance impact (maybe in total CPU time?), then it would make sense to say "this algorithm isn't actually working the way we want" and make this change to the algorithm itself as a way to fix the deficit.

Otherwise my opinion, FWIW, is that we should leave it alone for now.

@mariofusco
Copy link
Contributor Author

When this was designed, duplicate loading was an expected behavior of the algorithm. I think the discussion on the linked bug lost sight of that pretty quickly. So reducing it as a primary goal doesn't make a lot of sense to me.

I totally agree on everything.

OTOH if we can show that the duplicate loading does in fact have a negative performance impact (maybe in total CPU time?), then it would make sense to say "this algorithm isn't actually working the way we want" and make this change to the algorithm itself as a way to fix the deficit.

In theory the duplicated class loading very likely has a negative impact. What I'm saying is that in practice, at least using the benchmark that I mentioned above, this impact is too small (a few tens of duplicated definitions, moreover made concurrently by different threads, on a total of many thousand of classes) for being measured reliably and repeatably.

@franz1981
Copy link
Contributor

OTOH if we can show that the duplicate loading does in fact have a negative performance impact (maybe in total CPU time?), then it would make sense to say "this algorithm isn't actually working the way we want" and make this change to the algorithm itself as a way to fix the deficit.

I didn't measured yet, but the effects here are:

  1. allocating (again - N times more) byte[] to read the resource
  2. using it to enter in the JNI realm in defineClass - which means performing a bunch of malloc/free - likely to fail later: see https://github.com/openjdk/jdk/blob/59c4649be37a387efaf100f368b3e9db06d44f3a/src/java.base/share/native/libjava/ClassLoader.c#L252-L278 (performing an heap -> off-heap copy via GetByteArrayRegion)
  3. throwing a linkage exception which likely cause (local) safepoint to happen in order to collect the stack trace

Having a single check which mitigate all this madness is a win to me, if it won't make the code less maintainable.
@mariofusco as a proof I suggest 2 things:

  1. create some custom JFR events to detect these behaviours - it will help debuggability in prod/containers/our tests
  2. setup the QE startstop test with an appropriate number of event loops and switch the event(s) on to see if it happens and off, to compare RSS vs baseline

And than decide which commits keep e.g. JFR events and/or these changes.

Copy link
Contributor

@franz1981 franz1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep this change simple without adding more states to this complex already state machine: we trade simplicity and some effectiveness embracing volatility.
synchronized if contended is not free, but instead can cause inflation, which increase RSS as well...
I won't introduce any form of waiting, but just checking if the last loaded resource name match what we want to load and in case, check again with findClass before retrying if nothing is found (which shouldn't happen really, if the proper barrier(s) are used)

@mariofusco
Copy link
Contributor Author

mariofusco commented Sep 5, 2024

I won't introduce any form of waiting, but just checking if the last loaded resource name match what we want to load and in case, check again with findClass before retrying if nothing is found (which shouldn't happen really, if the proper barrier(s) are used)

Checking again with a findClass instead of making the thread performing the concurrent class definition to wait as I'm attempting with this pull request, is indeed another heuristic that I could try. I just don't understand of which barrier(s) you're speaking about since I think that there isn't any at the moment. Can you please clarify?

Also I'm still concerned that whatever change I try in this area, its impact on the overall startup time will be however too small to be measured, but I'm open to try different implementations before taking a decision.

@franz1981
Copy link
Contributor

franz1981 commented Sep 5, 2024

just don't understand of which barrier(s) you're speaking about since I think that there isn't any at the moment. Can you please clarify?

Let's say you store in a field the name of the last successufully defined class in the jar resource: you have to use setVolatile/setRelease and getVolatile/getAcquire to make sure that there's the right synchronize-with relation (still an happens-before one , see https://preshing.com/20130823/the-synchronizes-with-relation/) with defineClass and findClass

i.e.

loadAcquire lastClassDefined;
// LoadLoad; LoadStore barriers
if (lastClassDefined same as className)  {
     // no need to loop here really
     return findClass 
}
// bla bla
try {
 defineClass(className)
 // LoadStore:StoreStore  barriers
 storeRelease lastClassDefined := className
} catch (LinkageEx ) {
    
}

@dmlloyd
Copy link
Member

dmlloyd commented Sep 5, 2024

OTOH if we can show that the duplicate loading does in fact have a negative performance impact (maybe in total CPU time?), then it would make sense to say "this algorithm isn't actually working the way we want" and make this change to the algorithm itself as a way to fix the deficit.

I didn't measured yet, but the effects here are:

  1. allocating (again - N times more) byte[] to read the resource
  2. using it to enter in the JNI realm in defineClass - which means performing a bunch of malloc/free - likely to fail later: see https://github.com/openjdk/jdk/blob/59c4649be37a387efaf100f368b3e9db06d44f3a/src/java.base/share/native/libjava/ClassLoader.c#L252-L278 (performing an heap -> off-heap copy via GetByteArrayRegion)
  3. throwing a linkage exception which likely cause (local) safepoint to happen in order to collect the stack trace

Having a single check which mitigate all this madness is a win to me, if it won't make the code less maintainable.

But, that assumes that the tradeoff - holding a lock and wait/notify ping-pong, plus the added cognitive load of tracking resource lifecycle right on the interface - is cheaper. We don't really know that's the case.

If the effect is as small as it seems, then maybe we can just let it be "bad" for now with the idea of not making it harder to improve later, for example by switching to mmap'd archives and changing the way we manage lifecycle. A clean abstraction is worth 1000x its LOC in microoptimizations.

@franz1981
Copy link
Contributor

that assumes that the tradeoff - holding a lock and wait/notify ping-pong

I refer to the idea of PR I got in mind here -> see #43022 (review)

Sorry I didn't materialized the comment earlier ^^

@cescoffier
Copy link
Member

@mariofusco @dmlloyd @franz1981 Where are we on this one?

@franz1981
Copy link
Contributor

franz1981 commented Nov 10, 2024

In this current form could be troublesome;
eg

  1. Thread 1 cas(null, clazzA) - win
  2. Thread 2 cas(null, classA) - fail but found that it has to wait (but doesn't start doing it yet)
  3. Thread 1 start define, complete it, cas(classA, null) and try notify - none, since Thread 1 is still not within the synchronized region
  4. Thread 1 enter in synchronized and just wait forever

Using a volatile field with an immutable pair with String and CompletableFuture<?> is easier to make it right imo

@dmlloyd
Copy link
Member

dmlloyd commented Nov 10, 2024

My intention is ultimately to replace the majority of this logic, so I don't think it matters too much what we do in the near term.

@mariofusco
Copy link
Contributor Author

My intention is ultimately to replace the majority of this logic, so I don't think it matters too much what we do in the near term.

Can you please clarify what you have in mind? Are you planning to read the jars through memory mapping? Can we maybe schedule a call to discuss this in more details?

@dmlloyd
Copy link
Member

dmlloyd commented Nov 11, 2024

Yeah probably memory mapping. We can chat this week if you want to. It's part of what I want to cover in the class loading WG though so chances are good that we would repeat the discussion when that is kicked off.

WG proposal link: #43749

@cescoffier
Copy link
Member

@mariofusco @dmlloyd what's the status on this one? Should it be recondiered when we will have a "classloading revamping" WG?

@franz1981
Copy link
Contributor

IDK @cescoffier but we have this problem (and TBH wildfly too):
image

This is collected via @brunobat benchmark on OTEL and it shows a huge amount of hibernate-related linkage errors due to the worker threads (which far exceed the number of cores, clearly) trying to define the same classes.
This is clearly a limit case since antl4 has become much "fatter" and it's using a worker pool (no virtual threads: with 2-3 FJ workers, things are different, clearly...) - but is such a common scenario that probably is worth to work on a solution before the revamping...
For wildfly, I wish to write to them because in some customer case I observed the same exact scenario - and they perform way more classloading than us so...

@dmlloyd
Copy link
Member

dmlloyd commented Sep 16, 2025

This is collected via @brunobat benchmark on OTEL and it shows a huge amount of hibernate-related linkage errors due to the worker threads (which far exceed the number of cores, clearly) trying to define the same classes.

Was this before or after a391ab7 was merged (that commit is found in 3.26.0 and later)?

@franz1981
Copy link
Contributor

franz1981 commented Sep 16, 2025

I think the benchmark from @brunobat was 3.26 RC as platform version

@mariofusco have a test for this IIRC?

@gastaldi gastaldi added the triage/needs-rebase This PR needs to be rebased first because it has merge conflicts label Sep 16, 2025
@dmlloyd
Copy link
Member

dmlloyd commented Sep 16, 2025

One idea I'm playing with is to do a limited scan of the class bytes and attempt to preload the superclass (and maybe interfaces). This could mitigate the problem by reducing the window of time that is the "black box" of defineClass (see #50091).

@brunobat
Copy link
Contributor

I think the benchmark from @brunobat was 3.26 RC as platform version

@mariofusco have a test for this IIRC?

My app is currently using 3.24.0.CR1.
I can create a new branch with a version of your preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins triage/needs-rebase This PR needs to be rebased first because it has merge conflicts

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants