Lazy fields parsing #2072

Open
poslegm wants to merge 27 commits into scalapb:master from poslegm:fresh-benchmarks

poslegm commented Mar 8, 2026

Hello!

This is a revival of the work to support lazy fields (previous attempt #1376).

Context

In Java protobuf, string fields are handled using LazyField. This mechanism stores the field data as a ByteString and only parses it into a UTF‑8 string when the corresponding getter is called. When a message containing such fields is serialized, the raw ByteString is written directly, without performing UTF‑8 encoding or decoding (source).

Unlike Java protobuf, ScalaPB does not parse or serialize strings lazily. That leaves room to reduce overhead when both of the following hold:

  • the protobuf message consists of a large number of string fields;
  • only a small number of fields are actually read through their getters (parse message → read a few fields → serialize).

Such usage patterns are quite common for cloud-native applications.

The essence of the changes

When scalapb.options.lazy_fields is enabled, the generator emits LazyField[String] instead of String for string fields. LazyField[T] holds the original ByteString and parses the value lazily, on demand. The PR also introduces implicit conversions so that the generated case classes remain convenient to use.

Example:

message LazyWithRecursion {
  option (scalapb.message) = {
    lazy_fields: true
  };
  string data = 1;
  LazyWithRecursion nested = 2;
}

val original = LazyWithRecursion(data = "a lazy string", nested = Some(LazyWithRecursion(data = "nested string")))

val updated = original.update(_.nested.data := "updated string")

println(updated.nested.get.data == "updated string") // true

How it works with parsing and serialization:

val msg = LazyMessage.parseFrom(bytes)

// No parsing has happened for lazy_field yet.

val serialized = msg.toByteArray // <--- fast serialization without UTF-8 encoding

val upper = msg.lazyField.toUpperCase  // <--- Parsing happens here, on first access.

println(upper)
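
To make the mechanism above concrete, here is a minimal, self-contained sketch of a lazily decoded string field. It is not the code from this PR: LazyString, fromBytes, fromString, toBytes and isDecoded are hypothetical names, a plain Array[Byte] stands in for ByteString, and the sketch is not thread-safe. It only illustrates the idea of keeping the wire bytes, decoding on first access, and using implicit conversions to keep call sites unchanged.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.language.implicitConversions

// Hypothetical stand-in for LazyField[String]; Array[Byte] plays the role of ByteString.
// Exactly one of `bytes` / `text` is set at construction; the other is filled on demand.
final class LazyString private (private var bytes: Array[Byte], private var text: String) {
  // Decodes UTF-8 on first access only, then reuses the cached String.
  def value: String = {
    if (text == null) text = new String(bytes, UTF_8)
    text
  }

  // Returns the original wire bytes when they exist, so a parsed but
  // untouched field is written back without any UTF-8 encoding or decoding.
  def toBytes: Array[Byte] = {
    if (bytes == null) bytes = text.getBytes(UTF_8)
    bytes
  }

  def isDecoded: Boolean = text != null
}

object LazyString {
  def fromBytes(raw: Array[Byte]): LazyString = new LazyString(raw, null)
  def fromString(s: String): LazyString = new LazyString(null, s)

  // Implicit conversions (an assumption about how the PR's conversions might look)
  // so that a lazy field can be used where a String is expected, and vice versa.
  implicit def lazyToString(field: LazyString): String = field.value
  implicit def stringToLazy(s: String): LazyString = fromString(s)
}

object LazyStringDemo {
  def main(args: Array[String]): Unit = {
    val wire = "a lazy string".getBytes(UTF_8)
    val field = LazyString.fromBytes(wire) // "parsing": only the bytes are stored
    val serialized = field.toBytes         // same array comes back, nothing decoded
    println(field.isDecoded)               // false
    val upper = field.toUpperCase          // implicit conversion decodes here
    println(upper)                         // A LAZY STRING
  }
}
```

In this sketch the parse → serialize path never touches the UTF-8 codec, which is the situation the lazy_fields option targets; the cost of decoding is paid only by fields that are actually read.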

Benchmarks

New benchmarks have been added: roundTripScala and roundTripJava. They cover the full message lifecycle: parsing followed by serialization. What bothered me is that preparing the data in object ${Message}Test with the toJavaProto method affects the performance results. In my generated code, this method forces ByteString usage while the Java proto is being prepared, so the comparison is not entirely fair. The results for the existing benchmarks have also improved, but I wanted a cleaner comparison.

| Round-trip benchmark | Java | Scala |
| --- | --- | --- |
| LargeStringMessage | 9,345 ns/op | 9,088 ns/op |
| LazyFieldsStringMessage (same as LargeStringMessage but lazy_fields: true) | 9,484 ns/op | 2,734 ns/op |

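Looks great: more than a 3x speedup for ScalaPB with lazy_fields (9,088 → 2,734 ns/op, about 3.3x) 🚀 Even without this change, ScalaPB is already slightly faster than Java protobuf in this benchmark (9,088 vs 9,345 ns/op).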

Questions

  1. Should I commit the benchmark results into this PR?
  2. Should I add more tests?
  3. Should we keep using toJavaProto in data preparation for benchmarks (object ${Message}Test)?
  4. Anything else?

object ${Message}Test {
val scala = TestCases.make${Message}Scala

val java = protos.${Message}.toJavaProto(scala)
Author


The potential problem with data preparation is here: the Java protobuf message is created by converting the Scala message rather than by parsing bytes.

As a result, lazy_fields affects Java serialization performance:

| serializeJava | Java |
| --- | --- |
| LargeStringMessage | 2,584 ns/op |
| LazyFieldsStringMessage (same as LargeStringMessage but lazy_fields: true) | 1,111 ns/op |

I think this is caused by toJavaProto writing raw bytes when lazy_fields is enabled (ProtobufGenerator#348L). Either way, it does not look like a clean Java protobuf benchmark.

What about changing this line to val java = Protos.${Message}.parseFrom(bytes)?

java: Boolean = true
): Unit = {
ops.mkdir ! ops.pwd / 'results
val benchmarks0 = if (benchmarks.nonEmpty) benchmarks else testNames
Author


I had trouble running this script with a recent Ammonite version. This odd-looking code is what got the script to run.

Also, the value of the scalapb argument is ignored later on, so I had to hardcode the snapshot version in benchmarks/project/plugins.sbt.

@@ -0,0 +1,81 @@
# Agents
Author


I can delete it if it is not necessary.
