perf: eliminate intermediate byte[] copies in StreamMessageConsumer #902
Conversation
- Add MessageJsonHandler.serialize(Message, OutputStream, Charset)
- Serialize into ByteArrayOutputStream and write via writeTo(output)
- Remove String.getBytes(...) and toByteArray() clone
- Cache Charset instead of using encoding String

No breaking changes: existing constructors retained; new overloads are additive.
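As a hedged illustration of the shape of this change (a minimal sketch, with a plain Gson instance standing in for the handler's internals; the class name and method bodies here are assumptions, not the PR's exact code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.google.gson.Gson;

class StreamConsumerSketch {
    private final Gson gson = new Gson(); // stand-in for MessageJsonHandler's Gson

    // Shape of the new overload: encode the message straight into a stream.
    void serialize(Object message, OutputStream output, Charset charset) throws IOException {
        Writer writer = new OutputStreamWriter(output, charset);
        gson.toJson(message, writer);
        writer.flush();
    }

    // Consumer side: one buffer, no intermediate String, no toByteArray() clone.
    void consume(Object message, OutputStream output, Charset charset) throws IOException {
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        serialize(message, content, charset);
        output.write(("Content-Length: " + content.size() + "\r\n\r\n")
                .getBytes(StandardCharsets.US_ASCII));
        content.writeTo(output); // streams the internal buffer directly, no clone
        output.flush();
    }
}
```

The key point is `ByteArrayOutputStream.writeTo(OutputStream)`, which writes the internal buffer straight to the stream, whereas `toByteArray()` returns a defensive copy of it.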
pisv left a comment:

@sebthom Many thanks for all your contributions to the project. In general, for performance-related improvements I'd like to see more details about the issue being addressed, including realistic benchmarks to check the performance, and the actual measurements before and after the change. Sometimes a small amount of micro-optimisation can make a huge difference. However, it is important to have evidence that we are optimizing an actual bottleneck. Otherwise, the code can end up being harder to maintain, and we'll quite possibly find that we've either missed the real bottleneck, or that our micro-optimisations are harming performance instead of helping. Again, these general notes apply to all performance-related improvements.
sebthom left a comment:

I don't see how to provide realistic benchmarks. What would be the exact criteria? Which tools do you accept, etc.? These PRs address issues like #815 ("The current parsing is memory inefficient"). These improvements (similar to #816) reduce CPU churn and GC pressure. @jonahgraham what is your opinion?
pisv left a comment:

OK. But have you measured the actual increase in performance somehow?

pisv left a comment:

In this particular case, it is not that obvious when taking a deeper look at the code. As we can see, the two implementations are quite different. Which would be the more efficient, and to what extent? I don't know without actually measuring it. But I do know that the implementation before the change looks more straightforward and readable to me; the content is written in exactly the same way as the header. This is just an example to illustrate my general point.
jonahgraham left a comment:
I don't know if there is a speed performance improvement here*, and I don't think it matters much, because the cost of encoding to JSON is dominated by encoding all the inner types rather than by how long it takes to copy the resulting encoded message, even a couple of extra times.

However, this change gives a very probable memory performance improvement: it means one less copy of the message needs to be live at peak memory usage (see caveats below).
The existing code is:

```java
String content = jsonHandler.serialize(message);
byte[] contentBytes = content.getBytes(encoding);
```

which means that `message`, `content` and `contentBytes` are all live at the same time**, i.e. 3x the memory footprint of the message itself.***
The new code has, at worst, `message` and the `ByteArrayOutputStream` live at once, so 2x the memory footprint. This is because the "pieces" are converted to encoded bytes as we go.
\* There is one likely speed improvement from this change regardless of anything else: all writes to a `StringWriter` are synchronized, while `ByteArrayOutputStream`, in contrast, is not specified to be thread safe, so there is no synchronization overhead to rely on per call. I don't think the JVM will successfully be able to do escape analysis on our use of `StringWriter`, so the underlying `StringBuffer` will probably pay the overhead of synchronization.
\*\* AFAICT `message` itself could be gc'ed by the time `serialize` returns, but LSP4E keeps a reference to it, keeping the message live longer than really needed; see the LSP4E code that keeps `message` live after the call to `consumer.consume(message)`. The other places that wrap messages do the consume last, so that the message itself can be gc'ed. That may not be a big change, as the overhead of `message` itself is small if the contents of the message are objects that the client/server has live anyway.
\*\*\* There are a bunch of other copies of the whole array that may kick in: `String.getBytes` may make an extra copy, and `StringBuffer` may make an extra copy as it grows (as does `ByteArrayOutputStream` in the new version). However, these extra copies are generally short-lived and probably don't change the overall analysis here.
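The growth copies mentioned in \*\*\* can be illustrated with a small, hedged sketch; the presizing idea is for illustration only, not something this PR necessarily does:

```java
import java.io.ByteArrayOutputStream;

class GrowthSketch {
    public static void main(String[] args) {
        byte[] payload = new byte[1 << 20]; // a 1 MiB message body

        // The default buffer starts at 32 bytes; writing 1 MiB in small chunks
        // forces roughly log2(1 MiB / 32) = 15 reallocations, each of which
        // copies everything written so far (short-lived garbage).
        ByteArrayOutputStream unsized = new ByteArrayOutputStream();
        for (int off = 0; off < payload.length; off += 4096) {
            unsized.write(payload, off, 4096);
        }

        // Presized from an estimate (e.g. the previous message's size):
        // a single allocation if the estimate holds.
        ByteArrayOutputStream presized = new ByteArrayOutputStream(payload.length);
        presized.write(payload, 0, payload.length);
    }
}
```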
Not for nothing, but it's too bad we have to know the length of the message before transmitting it, because if we had other framing options for messages we could pipe them directly to output streams, meaning no significant memory overhead.
@pisv's request for benchmarks is reasonable, but I think it is overkill in this case. It would be nice to have some performance benchmarks for LSP4J, but to me this change is a very likely overall improvement, so I don't want to require benchmarks to progress it.
As for the style issue: I think it is reasonable to have different styles for the header and the body of the message because their requirements are very different, basically for all the reasons I outline above (memory overhead plus needing the size of the message). If the header were bigger, we could consider streaming it directly to the output instead of going via the String and byte[] intermediaries.
In conclusion, I am approving this change. But before I submit it, I would like to give @pisv an opportunity to say whether I am wrong, especially about the need for benchmarks.
pisv left a comment:

I don't think I have much to add here. In general, my position remains unchanged regarding attempts at perf-related improvements without first profiling and measuring their effects. Therefore, I cannot approve such PRs, but I have not blocked them and will not.
jonahgraham left a comment:

Then I think we should try harder to prove what our instincts tell us, but more importantly I want to set up the infrastructure to make it easier next time too. So I am going to add some profiling infrastructure to LSP4J's codebase using JMH. I have tried something on my machine, but I can't finish it right now. My quick summary is that the speed of the existing code seems better on small messages, while the new code seems faster on big messages. I need to quantify that to make it a useful statement.

For example, a small message looks like:

```java
message = new RequestMessage();
message.setId("1");
message.setMethod("foo");
Map<String, String> map = new HashMap<>();
for (int i = 0; i < 10; i++) {
    map.put(String.valueOf(i), "X".repeat(i));
}
message.setParams(map);
```

For big, the map is filled like this instead:

```java
for (int i = 0; i < 1000; i++) {
    map.put(String.valueOf(i), "X".repeat(i * 100));
}
```

This is what I see:

What I don't see is any indication of what the peak memory usage is, just total allocations. But total bytes allocated looks much worse with the new code.
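For reference, here is a hedged sketch of what such a JMH harness might look like; the benchmark class, its parameters, and the call to the PR's new `serialize(Message, OutputStream, Charset)` overload are my assumptions about the eventual infrastructure, not the actual LSP4J benchmark:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.eclipse.lsp4j.jsonrpc.json.MessageJsonHandler;
import org.eclipse.lsp4j.jsonrpc.messages.RequestMessage;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class SerializeBenchmark {
    @Param({"small", "big"})
    public String size;

    public MessageJsonHandler jsonHandler;
    public RequestMessage message;

    @Setup
    public void setup() {
        jsonHandler = new MessageJsonHandler(Collections.emptyMap());
        message = new RequestMessage();
        message.setId("1");
        message.setMethod("foo");
        Map<String, String> map = new HashMap<>();
        if ("small".equals(size)) {
            for (int i = 0; i < 10; i++) {
                map.put(String.valueOf(i), "X".repeat(i));
            }
        } else {
            for (int i = 0; i < 1000; i++) {
                map.put(String.valueOf(i), "X".repeat(i * 100));
            }
        }
        message.setParams(map);
    }

    @Benchmark
    public byte[] oldPath() {
        // Existing code path: JSON String, then a full encoded byte[] copy.
        return jsonHandler.serialize(message).getBytes(StandardCharsets.UTF_8);
    }

    @Benchmark
    public int newPath() throws Exception {
        // Path after this PR: serialize straight into a byte buffer
        // (the overload added by this PR).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        jsonHandler.serialize(message, buf, StandardCharsets.UTF_8);
        return buf.size();
    }
}
```

Running with JMH's `-prof gc` option reports allocation totals like the numbers quoted above; note that it measures total bytes allocated, not peak live-set size, which matches the caveat about peak memory usage.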
pisv left a comment:

Thanks for looking into this, Jonah! I really appreciate it 👍 As hinted in #902 (comment), it did not seem so clear-cut to me. The new code allocates a byte-buffer of size …

However, my general point has been that any performance-related PR should be justified by its contributor with at least some description of the actual issue observed in practice, as evidenced by profiling results, and at least some measurement of the actual performance increase. For example, when contributing to Eclipse Theia, each PR is required to include a "How to test" section that should explain how a reviewer can reproduce a bug, test new functionality, or verify performance improvements (sic). This is enforced by the PR template.

To put it bluntly, committers should not be expected to do all the work we are doing now to try and justify each attempt at micro-optimization, which usually just is not worth the effort, in my experience.