Conversation

brettwooldridge commented Sep 22, 2025

Refactored (re-written) Cleaner class, as low-contention and simple as I can get it. Principally, I removed the linked-node structure and replaced it with a concurrent set. Fixes #1616.

An AtomicBoolean is used to track whether the cleaner thread is running and another (per-CleanerRef) is used to track whether the cleanup task has been run, either directly or by the cleaner thread, ensuring it is only executed once.
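
For illustration, here is a minimal sketch of the shape described above -- a concurrent set of phantom references plus the two AtomicBooleans. The field and method names are assumptions based on this description and the stack trace below, not the exact code in this PR:

import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

final class CleanerSketch {
    // Concurrent set replacing the old linked-node structure
    private final Set<CleanerRef> refs = ConcurrentHashMap.newKeySet();
    private final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();
    // Tracks whether the cleaner thread has been started
    private final AtomicBoolean cleanerRunning = new AtomicBoolean(false);

    final class CleanerRef extends PhantomReference<Object> {
        private final Runnable cleanupTask;
        // Ensures the cleanup task runs at most once, whether invoked
        // explicitly or by the cleaner thread after the referent is GCed
        private final AtomicBoolean cleaned = new AtomicBoolean(false);

        CleanerRef(Object referent, ReferenceQueue<Object> queue, Runnable cleanupTask) {
            super(referent, queue);
            this.cleanupTask = cleanupTask;
        }

        void clean() {
            if (cleaned.compareAndSet(false, true)) {
                refs.remove(this);
                cleanupTask.run();
            }
        }
    }
}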

I don't think this class could be simplified any further and still be correct.

@matthiasblaesing If you still have your JMH harness, I'd be interested in how it performs.

matthiasblaesing commented Sep 23, 2025

To start with, this change does not survive benchmarking:

Sep 23, 2025 3:47:55 PM com.sun.jna.internal.Cleaner$CleanerThread run
SEVERE: null
java.lang.NullPointerException: Cannot invoke "java.util.concurrent.atomic.AtomicBoolean.compareAndSet(boolean, boolean)" because "this.cleaned" is null
        at com.sun.jna.internal.Cleaner$CleanerRef.clean(Cleaner.java:97)
        at com.sun.jna.internal.Cleaner$CleanerThread.run(Cleaner.java:123)

I think that the object becomes immediately eligible for GC, and thus the cleaner is invoked before construction is done. This can be fixed by using a synchronized block as a reachability fence, like this:

    public Cleanable register(Object obj, Runnable cleanupTask) {
        // The important side effect is the PhantomReference that is yielded
        // after the referent is GCed
        Cleanable cleanable = add(new CleanerRef(obj, referenceQueue, cleanupTask));
        // The empty synchronized block keeps obj strongly reachable until the
        // CleanerRef above has been fully constructed and registered.
        synchronized (obj) {
        }

        if (cleanerRunning.compareAndSet(false, true)) {
            Logger.getLogger(Cleaner.class.getName()).log(Level.FINE, "Starting CleanerThread");
            Thread cleanerThread = new CleanerThread();
            cleanerThread.start();
        }

        return cleanable;
    }

With that change I ran the numbers and got this (5.18.0 is the last release, baseline-synchronized1 is 5.18.0 with a synchronized block similar to the suggestion above, and modified-synchronized1 is the code from this PR with the synchronized block added):

============== 5.18.0

# Run complete. Total time: 00:00:58
Benchmark                      Mode  Cnt        Score       Error  Units
MyBenchmark.testMethod        thrpt    5  1195202,456 ± 97808,898  ops/s

============== baseline-synchronized1

# Run complete. Total time: 00:00:59
Benchmark                      Mode  Cnt        Score        Error  Units
MyBenchmark.testMethod        thrpt    5  1151572,170 ± 105789,738  ops/s

============== modified-synchronized1

# Run complete. Total time: 00:02:02
Benchmark                      Mode  Cnt        Score        Error  Units
MyBenchmark.testMethod        thrpt    5  1232646,584 ± 543495,690  ops/s

The huge variance for the new code is reproducible and with that and considering the variance of the other results I would not want to merge as is.

The code used to produce the values: jnabench.zip

  • 5.18.0 was built as mvn -Djna.version=5.18.0 clean package
  • the two modified versions were built by running ant install from the jna source directory and then running mvn clean package in the jnabench directory

Here is the tree analysis from async profiler: profile-results.zip

My attention is drawn to the fact that the accessors for the ConcurrentHashMap show up prominently in the tree.

Edit:

Benchmark was run as (the shaded jars were copied after build):

java -jar jnabench-5.18.0.jar  -t 1000 -i 1 -wi 0 -prof async:output=tree\;dir=profile-results/5.18.0

@matthiasblaesing

Another attempt can be found in #1617, which takes a similar approach.

brettwooldridge commented Sep 23, 2025

I went down the GC rabbit hole. The principal fix is this:

    public Cleanable register(Object referent, Runnable cleanupTask) {
        // The important side effect is the PhantomReference, that is yielded
        // after the referent is GCed
        Cleanable cleanable = add(new CleanerRef(referent, referenceQueue, cleanupTask));

        if (cleanerRunning.compareAndSet(false, true)) {
            Logger.getLogger(Cleaner.class.getName()).log(Level.FINE, "Starting CleanerThread");
            Thread cleanerThread = new CleanerThread();
            cleanerThread.start();
        }

        // NOTE: This is a "pointless" check in the conventional sense, however it serves to guarantee that the
        // referent is not garbage collected before the CleanerRef is fully constructed which can happen due
        // to reordering of instructions by the compiler or the CPU. In Java 9+ Reference.reachabilityFence() was
        // introduced to provide this guarantee, but we want to stay compatible with Java 8, so this is the common
        // idiom to achieve the same effect, by ensuring that the referent is still strongly reachable at
        // this point.
        if (referent == null) {
            throw new IllegalArgumentException("The referent object must not be null");
        }

        return cleanable;
    }
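
For comparison only: on Java 9+ the same guarantee could be expressed directly with Reference.reachabilityFence, as in the sketch below. This is not part of this PR, which stays on the Java 8 idiom, and the cleaner-thread startup is omitted for brevity:

    public Cleanable register(Object referent, Runnable cleanupTask) {
        try {
            return add(new CleanerRef(referent, referenceQueue, cleanupTask));
        } finally {
            // Keeps the referent strongly reachable until the CleanerRef is fully
            // constructed and registered (java.lang.ref.Reference, Java 9+).
            Reference.reachabilityFence(referent);
        }
    }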

First, thanks for the JMH setup, it was useful!

I do however have some comments about JMH execution. I've been a long-time JMH user, basically since Shipilev released it. Running JMH with more threads than cores is basically benchmarking the OS scheduler, not the code. I tend to run with cores minus 2.

Also, warmups are super critical. First, the CPU starts to throttle due to heat soak, so iterations fall off a cliff (as they should). Second, the JIT needs a chance to see everything in the hot path -- it won't compile a method until it has been executed many thousands of times, and you don't want that happening during the actual measurement phase. Finally, garbage collection kicks in more as the run progresses, so longer runs provide more realistic measurements.

You can see what the "drop-off" looks like here:

# Run progress: 0.00% complete, ETA 00:06:40
# Fork: 1 of 5
# Warmup Iteration   1: 3607020.527 ops/s
# Warmup Iteration   2: 700146.028 ops/s
# Warmup Iteration   3: 758202.309 ops/s
# Warmup Iteration   4: 346324.146 ops/s
# Warmup Iteration   5: 42791.658 ops/s
Iteration   1: 9819.099 ops/s
Iteration   2: 29150.955 ops/s
Iteration   3: 14875.197 ops/s
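
For what it's worth, these knobs can also be pinned in the benchmark source rather than on the command line. A minimal sketch, assuming a JMH harness shaped roughly like the jnabench MyBenchmark (the class name, method name, and benchmark body are assumptions; the iteration counts mirror the flags discussed here):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)        // let the JIT, GC and CPU thermals settle
@Measurement(iterations = 10)  // longer measurement reduces variance
@Threads(8)                    // roughly cores minus 2, rather than -t 1000
@Fork(5)
@State(Scope.Benchmark)
public class MyBenchmark {
    @Benchmark
    public Object testMethod() {
        // Hypothetical body: allocate a native-backed object so that
        // Cleaner registration and cleanup are exercised.
        return new com.sun.jna.Memory(16);
    }
}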

Now, regarding running the benchmark. I ran with:

java -jar target/benchmarks.jar -t 8 -i 3 -wi 5

The interesting thing is that when running with these settings -- principally 3 measurement iterations with 5 warmups -- I could never get the 5.18.0 run to complete. It always hangs somewhere in the run: sometimes in the 5th and final fork, sometimes in the first, or somewhere in between. You might have better luck; maybe it is something with my Mac?

In any case, the new code never hung in all of my runs.

Even running with 5 measurement iterations the variability is still extremely high. It should probably be run with 20 iterations, but I ran out of patience -- and with 5.18.0 hanging it seemed pointless, because there would be nothing to compare against.

EDIT: I was not able to run with -prof async:output=tree\;dir=profile-results/5.18.0 due to it complaining about missing libraries that I didn't want to chase down (it's 4:40am here).

EDIT2: I was finally able to get a 5.18.0 run to complete. Here are the comparables (using the parameters I mentioned above).

v5.18.0

Benchmark                Mode  Cnt      Score       Error  Units
MyBenchmark.testMethod  thrpt   15  10193.227 ± 11216.322  ops/s

And 5.19.0-SNAPSHOT

Benchmark                Mode  Cnt      Score       Error  Units
MyBenchmark.testMethod  thrpt   15  17413.298 ± 16880.334  ops/s

As you can see, and as I mentioned above, the variability is extremely high with so few iterations. I got runs as high as 20k on the score, but again, with the plus/minus being nearly as large as the score itself, it's not very accurate. One rough 10,000-foot cheat is to add the score and the error together and treat that sum as the score. I'm pretty sure that even over a long run the new code is going to come out on top. I'll leave it running with a high iteration count while I sleep.

EDIT3: I figured out why the 5.18.0 run was hanging, because I saw the same thing with 5.19.0-SNAPSHOT at a higher iteration count -- an OutOfMemoryError. Neither Cleaner can keep up with the full blast of JMH at default memory settings, but 5.19.0-SNAPSHOT does substantially better, only failing at higher iteration counts. I'm re-running with higher memory settings.

@brettwooldridge

UPDATE: OK, I performed some long runs on 5.18.0 and 5.19.0-SNAPSHOT. I had to increase the JVM memory to 16g to get 5.18.0 to pass; 8g was enough for 5.19.0-SNAPSHOT, but to be fair I ran both with 16g.

5.18.0

Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   50  252504.321 ± 233446.548  ops/s

5.19.0-SNAPSHOT

Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   50  337477.401 ± 125591.772  ops/s

This was my command:

java -Xmx16g -jar target/benchmarks.jar -t 8 -i 10 -wi 5

With iterations at 10 the variance (error) was reduced by ~66% compared to shorter runs. In order to get it down to 10% variance I suspect iterations might need to be 50. I may update this comment after running on a dual-CPU server at the office, but for now these stand well.

@brettwooldridge

LOL. Scrap another attempt to address #1616. After hours of extensive testing today on our dual-CPU server, the HashMap-based version definitely performs worse (far worse) when put under high GC pressure.

I'm closing this, though I may make another attempt with a different approach. The existing implementation mainly suffers from classic linked-list "tail contention", because all appends occur at the tail. There are various approaches in the academic literature that are worth investigating -- including using two locks (head and tail) and alternating appends between the head and the tail, thereby reducing lock contention by 2x -- but the edge cases (zero to three nodes) are sticky.
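
For context, the two-lock idea traces back to the classic two-lock queue in the literature (Michael & Scott, 1996). Below is a minimal sketch of that baseline structure only -- purely illustrative: it supports append-at-tail and remove-at-head, not the Cleaner's need to unlink arbitrary nodes, and it is just the starting point the alternating-append idea would build on:

import java.util.concurrent.locks.ReentrantLock;

// Classic two-lock queue (Michael & Scott, 1996): appenders contend only
// on tailLock, removers only on headLock, so a single point of contention
// is split in two.
final class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        volatile Node<T> next;   // volatile so a remover sees the appender's link
        Node(T value) { this.value = value; }
    }

    private Node<T> head = new Node<>(null);  // dummy node, guarded by headLock
    private Node<T> tail = head;              // guarded by tailLock
    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        tailLock.lock();
        try {
            tail.next = node;
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    T dequeue() {
        headLock.lock();
        try {
            Node<T> first = head.next;
            if (first == null) {
                return null;      // queue is empty
            }
            head = first;         // old dummy node becomes unreachable
            return first.value;
        } finally {
            headLock.unlock();
        }
    }
}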

Anyway, killing this for now.
