
Conversation

@sebthom
Contributor

@sebthom sebthom commented Oct 25, 2025

  • Add MessageJsonHandler.serialize(Message, OutputStream, Charset)
  • Serialize into ByteArrayOutputStream and write via writeTo(output)
  • Remove String.getBytes(...) and toByteArray() clone
  • Cache Charset instead of using encoding String

No breaking changes: existing constructors retained; new overloads are additive.
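
For illustration, the reworked consume path then looks roughly like this (a sketch, not the exact diff; outputLock and the local variable names are illustrative):

		ByteArrayOutputStream contentBuffer = new ByteArrayOutputStream();
		// new overload from this PR: serialize straight into the byte buffer
		jsonHandler.serialize(message, contentBuffer, charset);

		String header = "Content-Length: " + contentBuffer.size() + "\r\n\r\n";
		synchronized (outputLock) {
			output.write(header.getBytes(StandardCharsets.US_ASCII));
			// writes the internal buffer directly, no toByteArray() clone
			contentBuffer.writeTo(output);
			output.flush();
		}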

@sebthom changed the title from "perf: eliminate intermediate byte[] copies in StreamMessageConsume" to "perf: eliminate intermediate byte[] copies in StreamMessageConsumer" on Oct 25, 2025
@sebthom force-pushed the StreamMessageConsumer branch from e3c9abe to 4462bbd on November 8, 2025, 20:19
@pisv
Contributor

pisv commented Nov 12, 2025

@sebthom Many thanks for all your contributions to the project.

In general, for performance-related improvements I'd like to see more details about the issue being addressed, including realistic benchmarks and actual measurements taken before and after the change.

Sometimes a small amount of micro-optimisation can make a huge difference. However, it is important to have evidence that we are optimizing an actual bottleneck. Otherwise, the code can end up being harder to maintain, and we'll quite possibly find that we've either missed the real bottleneck, or that our micro-optimisations are harming performance instead of helping.

Again, these general notes apply to all performance-related improvements.

@sebthom
Contributor Author

sebthom commented Nov 12, 2025

I don't see how to provide realistic benchmarks. What would be the exact criteria? Which tools would you accept, etc.? These PRs address issues like #815 ("The current parsing is memory inefficient"). These improvements (similar to #816) reduce CPU churn and GC pressure.

@jonahgraham what is your opinion?

@pisv
Contributor

pisv commented Nov 12, 2025

> I don't see how to provide realistic benchmarks.

OK. But have you measured the actual increase in performance somehow?

@pisv
Contributor

pisv commented Nov 12, 2025

In this particular case, the benefit is not that obvious when taking a deeper look at the code.

StreamMessageConsumer.consume before the change:

  • A byte-array is created for a StringWriter (AbstractStringBuilder.value)
  • It is then copied in StringWriter.toString (but note that StringBuffer.toString is annotated with @HotSpotIntrinsicCandidate, so it must be efficient, I guess)
  • A byte-array is created as the result of String.getBytes. (The bulk-encoding to UTF-8, which is the only encoding supported right now in LSP, is a special case, and must be quite efficient)
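
In code, that path is roughly (a sketch of the internals; MessageJsonHandler uses Gson under the hood):

		// before: serialize to a String, then bulk-encode to bytes
		StringWriter writer = new StringWriter();  // backed by a synchronized StringBuffer (AbstractStringBuilder.value)
		gson.toJson(message, writer);              // JSON streamed into the StringBuffer
		String content = writer.toString();        // copies the buffer's value array
		byte[] contentBytes = content.getBytes(StandardCharsets.UTF_8);  // bulk-encode into one more array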

StreamMessageConsumer.consume after the change:

  • A byte-array is created for a ByteArrayOutputStream (buf)
  • A byte-buffer is allocated for a StreamEncoder of the OutputStreamWriter
  • The StreamEncoder creates a new char array and wraps it into a new CharBuffer in each write call
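
Again in code, roughly (a sketch):

		// after: encode to bytes while serializing
		ByteArrayOutputStream buffer = new ByteArrayOutputStream();  // growable byte array (buf)
		Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8);
		// OutputStreamWriter delegates to a StreamEncoder, which allocates an
		// 8192-byte buffer up front and, for String arguments, copies the chars
		// into a fresh char[] / CharBuffer on each write call
		gson.toJson(message, writer);
		writer.flush();          // drain the StreamEncoder's buffer into buf
		buffer.writeTo(output);  // no toByteArray() clone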

As we can see, the two implementations are quite different. Which one is more efficient, and to what extent? I don't know without actually measuring it. But I do know that the implementation before the change looks more straightforward and readable to me; the content is written in exactly the same way as the header. This is just an example to illustrate my general point.

Contributor

@jonahgraham jonahgraham left a comment

I don't know if there is a speed performance improvement here*, and I don't think it matters much, because the cost of encoding to JSON is dominated by encoding all the inner types rather than by how long it takes to copy the resulting encoded message, even a couple of extra times.

However, this change gives a very probable memory performance improvement: it means one less copy of the message needs to be live at peak memory usage (see caveats below).

The existing code is:

			String content = jsonHandler.serialize(message);
			byte[] contentBytes = content.getBytes(encoding);

This means that message, content and contentBytes are all live at the same time**, i.e. 3x the memory footprint of the message itself.***

The new code, in the worst case, only has message and the ByteArrayOutputStream live at once, so 2x the memory footprint. This is because the "pieces" are converted to encoded bytes as we go.

* There is one likely speed performance improvement of this change regardless of anything else: the use of StringWriter means that all writes to it are synchronized, while ByteArrayOutputStream in contrast is not thread safe, so there is no synchronization overhead per call. I don't think the JVM will successfully be able to do escape analysis on our use of StringWriter, so the underlying StringBuffer will probably pay the overhead of synchronization.

** AFAICT message itself could be gc'ed by the time serialize returns, but LSP4E keeps a reference to it, keeping the message live longer than really needed - see the LSP4E code that keeps message live after the call to consumer.consume(message). The other places that wrap messages do the consume last so that the message itself can be gc'ed. That may not be a big change, as the overhead of message itself is small if the contents of the message are objects that the client/server has live anyway.

*** There are a bunch of other copies of the whole array that may kick in: String.getBytes may make an extra copy, and StringBuffer may make an extra copy as it grows (as does ByteArrayOutputStream in the new version). However, these extra copies are generally short lived and probably don't change the overall analysis here.

Not for nothing... but it's too bad we have to know the length of the message before transmitting it; if we had other framing options for messages, we could pipe them directly to output streams, meaning no significant memory overhead.


@pisv's request for benchmarks is reasonable, but I think it is overkill in this case. It would be nice to have some performance benchmarks for LSP4J, but to me this change is a very likely overall improvement, so I don't want to require benchmarks to progress it.

As for the style issue - I think it is reasonable to have a different style for the header and the body of the message because their requirements are very different, basically for all the reasons I outline above (memory overhead + needing the size of the message). If the header were bigger, we could consider streaming it directly to output instead of going via the String and byte[] intermediaries.

In conclusion, I am approving this change. But before I submit it, I would like @pisv to comment to give an opportunity to say if I am wrong, especially about the need for benchmarks.

@pisv
Contributor

pisv commented Nov 19, 2025

I don't think I have much to add here. My position remains unchanged in general regarding attempts at performance-related improvements without first profiling and measuring their effects. Therefore, I cannot approve such PRs, but I have not blocked them and will not.

@jonahgraham
Contributor

> Therefore, I cannot approve such PRs, but I have not blocked them and will not.

Then I think we should try harder to prove what our instincts tell us, but more importantly I want to set up the infrastructure to make this easier next time too. So I am going to add some profiling infrastructure to LSP4J's codebase using JMH. I have tried something on my machine, but I can't finish it right now.

My quick summary is that the speed of the existing code seems better on small messages, and the new code seems faster on big messages. I need to quantify that to make it a useful statement.

For example, on a small message like:

		message = new RequestMessage();
		message.setId("1");
		message.setMethod("foo");
		Map<String, String> map = new HashMap<>();
		for (int i = 0; i < 10; i++) {
			map.put(String.valueOf(i), "X".repeat(i));
		}
		message.setParams(map);

For the big message, the map is filled like this instead:

		for (int i = 0; i < 1000; i++) {
			map.put(String.valueOf(i), "X".repeat(i * 100));
		}
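
For reference, wired into JMH the small-message variant might look roughly like this (a sketch: the class and method names match the output below, but the exact harness is not shown in this thread, and the MessageJsonHandler construction is simplified; run with the GC profiler, e.g. java -jar benchmarks.jar -prof gc, to get the gc.* rows):

	import java.nio.charset.StandardCharsets;
	import java.util.HashMap;
	import java.util.Map;
	import java.util.concurrent.TimeUnit;

	import org.eclipse.lsp4j.jsonrpc.json.MessageJsonHandler;
	import org.eclipse.lsp4j.jsonrpc.messages.RequestMessage;
	import org.openjdk.jmh.annotations.*;

	@BenchmarkMode(Mode.AverageTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	@Warmup(iterations = 5)
	@Measurement(iterations = 5)
	@Fork(1)
	@State(Scope.Benchmark)
	public class MyBenchmark {

		MessageJsonHandler jsonHandler;
		RequestMessage message;

		@Setup
		public void setup() {
			jsonHandler = new MessageJsonHandler(new HashMap<>());  // no method metadata needed for plain serialization
			message = new RequestMessage();
			message.setId("1");
			message.setMethod("foo");
			Map<String, String> map = new HashMap<>();
			for (int i = 0; i < 10; i++) {  // BIG: i < 1000 and "X".repeat(i * 100)
				map.put(String.valueOf(i), "X".repeat(i));
			}
			message.setParams(map);
		}

		@Benchmark
		public byte[] measureSomething() {
			// BEFORE variant: serialize to a String, then bulk-encode to bytes;
			// the AFTER variant would serialize into a ByteArrayOutputStream instead
			String content = jsonHandler.serialize(message);
			return content.getBytes(StandardCharsets.UTF_8);
		}
	}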

This is what I see:

SMALL:

BEFORE:

Benchmark                                         Mode  Cnt     Score    Error   Units
MyBenchmark.measureSomething                      avgt    5  2396.100 ± 85.199   ns/op
MyBenchmark.measureSomething:·gc.alloc.rate       avgt    5   595.371 ± 21.095  MB/sec
MyBenchmark.measureSomething:·gc.alloc.rate.norm  avgt    5  1496.001 ±  0.001    B/op
MyBenchmark.measureSomething:·gc.count            avgt    5     5.000           counts
MyBenchmark.measureSomething:·gc.time             avgt    5     6.000               ms

AFTER:

Benchmark                                         Mode  Cnt     Score    Error   Units
MyBenchmark.measureSomething                      avgt    5  3475.309 ± 98.045   ns/op
MyBenchmark.measureSomething:·gc.alloc.rate       avgt    5  1163.359 ± 32.536  MB/sec
MyBenchmark.measureSomething:·gc.alloc.rate.norm  avgt    5  4240.001 ±  0.001    B/op
MyBenchmark.measureSomething:·gc.count            avgt    5    10.000           counts
MyBenchmark.measureSomething:·gc.time             avgt    5     8.000               ms

BIG:

BEFORE:

Benchmark                                         Mode  Cnt          Score          Error   Units
MyBenchmark.measureSomething                      avgt    5   75031666.111 ± 13210008.358   ns/op
MyBenchmark.measureSomething:·gc.alloc.rate       avgt    5       3453.545 ±      602.058  MB/sec
MyBenchmark.measureSomething:·gc.alloc.rate.norm  avgt    5  271310425.765 ±       68.538    B/op
MyBenchmark.measureSomething:·gc.count            avgt    5         46.000                 counts
MyBenchmark.measureSomething:·gc.time             avgt    5         44.000                     ms

AFTER:

Benchmark                                         Mode  Cnt          Score          Error   Units
MyBenchmark.measureSomething                      avgt    5   61261026.428 ± 15796510.918   ns/op
MyBenchmark.measureSomething:·gc.alloc.rate       avgt    5       4618.837 ±     1142.650  MB/sec
MyBenchmark.measureSomething:·gc.alloc.rate.norm  avgt    5  295734067.429 ±       59.512    B/op
MyBenchmark.measureSomething:·gc.count            avgt    5         51.000                 counts
MyBenchmark.measureSomething:·gc.time             avgt    5         42.000                     ms

What I don't see is any indication of what the peak memory usage is, just total allocations. But total bytes allocated looks much worse with the new code.

@pisv
Contributor

pisv commented Nov 20, 2025

Thanks for looking into this, Jonah! I really appreciate it 👍

As hinted in #902 (comment), it did not seem so clear-cut to me. The new code allocates an 8192-byte buffer for the StreamEncoder of the OutputStreamWriter for each message, and the StreamEncoder creates a new char array and wraps it into a new CharBuffer on each write call, which can offset the expected gain from eliminating intermediate byte[] copies.

However, my general point has been that any performance-related PR should be justified by its contributor with at least some description of the actual issue observed in practice, as evidenced by profiling results, and at least some measurement of the actual performance increase.

For example, when contributing to Eclipse Theia, each PR is required to include a "How to test" section that explains how a reviewer can reproduce a bug, test new functionality, or verify performance improvements (sic). This is enforced by the PR template.

To put it bluntly, committers should not be expected to do all the work we are doing now to justify each attempt at micro-optimization; in my experience, that usually just is not worth the effort.
