Description
Background
The substrait-rs crate currently provides basic protobuf bindings but has minimal validation and no builder APIs. Other implementations, like substrait-java and substrait-go, offer higher-level APIs that are less verbose and less error-prone, making it easier to construct and consume plans correctly.
The need for a validation layer has already been raised in #157, which proposes a standalone validation layer. This proposal seeks to build on that idea by integrating validation directly into a more ergonomic, type-safe API for both creating and consuming plans.
Proposal
Let’s add a more "Rustic" layer to substrait-rs. This would provide a type-safe layer over the raw protobufs, with builders to make it more ergonomic to construct correct plans. This new layer would also feature a conversion path from the raw protobufs that performs validation, making plan consumption, tree-walking, and manipulation safer and easier.
Suggested Implementation Path
There are multiple ways we could do this. After some thought, I have a rough plan I’d like to run by this group.
Firstly, I believe we should approach this differently in Rust than in Go or Java. Rust is not garbage-collected, and its developers typically prize “zero-cost abstractions”—ones where the type system and compiler provide ergonomics and validity guarantees with minimal performance cost.
With that in mind, I suggest we build a zero-cost abstraction over the raw protobufs. This is distinct from the Go and Java libraries, which define separate structs and require a full, not-quite-zero-cost conversion to and from the protobuf representation.
- Extension Registry: The foundation would be a global, application-lifetime `ExtensionRegistry`, as discussed in Add Extension Registry #342. This registry, constructed from YAML files, would hold all possible function definitions the application might use.
- Typed Wrappers & Builders: We could then introduce a typed layer on top of the protobufs (e.g., `struct Expression(proto::Expression)`). The public API for these types would enforce correctness.
- API Design & Function Handling: A user would not be able to construct these wrapper types directly; the APIs for these wrapper types would ensure validity by construction. This follows the "Parse, Don't Validate" design principle mentioned in the `parse` module; by successfully parsing the raw protobuf into our wrapper types, we guarantee by the type system that the plan is valid.
  - Parsing/Validation: A possible entry point would be `Plan::parse(&registry, &raw_proto_plan)?`. This could create an internal parsing context containing both the global `&registry` and a map of the plan's local extensions. This context would then be used to recursively parse and validate the entire plan tree, producing a `Plan` wrapper containing a "known-correct" tree of protobuf data.
  - Builder: A builder API, for example `ExpressionBuilder::call(&mut self, &registry, "add", vec![...])`, could find the "add" function in the global `registry` and handle the bookkeeping of adding the function to the builder's local extension list and setting the correct `uint` reference. At any given point, the Builder contains the protobuf built so far, as well as necessary metadata for correctness: e.g. a reference to the global extension registry and the local plan registry.
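To make the parsing path concrete, here is a minimal sketch of the "Parse, Don't Validate" wrapper. Everything in it is an assumption for illustration: the `proto` module is a tiny stand-in for the prost-generated types, the registry only tracks a function name and an arity, and `ParseError` and `Context` are hypothetical names. The real design would need to cover types, relations, and extension URIs.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the generated protobuf types; the real
// messages carry far more fields than this.
mod proto {
    #[derive(Debug, Clone, PartialEq)]
    pub enum Expression {
        Literal(i64),
        ScalarFunction { function_reference: u32, arguments: Vec<Expression> },
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Plan {
        // (anchor, function name) pairs declared by the plan itself.
        pub extensions: Vec<(u32, String)>,
        pub root: Expression,
    }
}

/// Global registry of known functions (assumed shape: name -> arity).
pub struct ExtensionRegistry {
    functions: HashMap<String, usize>,
}

impl ExtensionRegistry {
    pub fn new(functions: &[(&str, usize)]) -> Self {
        let functions = functions.iter().map(|(n, a)| (n.to_string(), *a)).collect();
        Self { functions }
    }
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnknownAnchor(u32),
    UnknownFunction(String),
    WrongArity { name: String, expected: usize, found: usize },
}

/// The inner field is private to this module, so outside code can only
/// obtain a `Plan` through `Plan::parse`.
#[derive(Debug)]
pub struct Plan(proto::Plan);

// Internal parsing context: the global registry plus the plan's local extensions.
struct Context<'a> {
    registry: &'a ExtensionRegistry,
    local: HashMap<u32, &'a str>,
}

impl Context<'_> {
    // Recursively checks that every function reference resolves and has the right arity.
    fn check(&self, expr: &proto::Expression) -> Result<(), ParseError> {
        match expr {
            proto::Expression::Literal(_) => Ok(()),
            proto::Expression::ScalarFunction { function_reference, arguments } => {
                let name = *self
                    .local
                    .get(function_reference)
                    .ok_or(ParseError::UnknownAnchor(*function_reference))?;
                let expected = *self
                    .registry
                    .functions
                    .get(name)
                    .ok_or_else(|| ParseError::UnknownFunction(name.to_string()))?;
                if expected != arguments.len() {
                    return Err(ParseError::WrongArity {
                        name: name.to_string(),
                        expected,
                        found: arguments.len(),
                    });
                }
                arguments.iter().try_for_each(|arg| self.check(arg))
            }
        }
    }
}

impl Plan {
    /// Validates `raw` against the registry and wraps it on success.
    /// This sketch clones the input; a by-value variant would avoid the copy.
    pub fn parse(registry: &ExtensionRegistry, raw: &proto::Plan) -> Result<Plan, ParseError> {
        let local = raw.extensions.iter().map(|(a, n)| (*a, n.as_str())).collect();
        Context { registry, local }.check(&raw.root)?;
        Ok(Plan(raw.clone()))
    }

    /// Read-only access to the validated protobuf data.
    pub fn as_proto(&self) -> &proto::Plan {
        &self.0
    }
}
```

Because the wrapper is a newtype around the protobuf value, consumers pay for validation once at the boundary and then read the underlying data through `as_proto` without a second representation.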
This approach would ensure that any plan represented in the new typed layer is semantically valid, leveraging Rust's type system to prevent errors while remaining a lightweight, zero-cost wrapper.
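The builder bookkeeping could look roughly like the sketch below. It is only an illustration under stated assumptions: `proto::Expression` is a simplified stand-in for the generated type, the registry maps a function name to an arity, anchors are assigned sequentially from 1, and `BuildError` is a hypothetical name. For brevity the builder returns raw expressions rather than accumulating a whole plan.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the generated protobuf expression type.
mod proto {
    #[derive(Debug, Clone, PartialEq)]
    pub enum Expression {
        Literal(i64),
        ScalarFunction { function_reference: u32, arguments: Vec<Expression> },
    }
}

/// Global registry of known functions (assumed shape: name -> arity).
pub struct ExtensionRegistry {
    functions: HashMap<String, usize>,
}

impl ExtensionRegistry {
    pub fn new(functions: &[(&str, usize)]) -> Self {
        let functions = functions.iter().map(|(n, a)| (n.to_string(), *a)).collect();
        Self { functions }
    }
}

#[derive(Debug, PartialEq)]
pub enum BuildError {
    UnknownFunction(String),
    WrongArity { name: String, expected: usize, found: usize },
}

/// Tracks the plan-local extension list while expressions are built.
#[derive(Default)]
pub struct ExpressionBuilder {
    anchors: HashMap<String, u32>,   // function name -> local anchor
    extensions: Vec<(u32, String)>,  // declarations to emit with the plan
}

impl ExpressionBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Looks up `name` in the global registry, registers it locally on
    /// first use, and emits a call carrying the matching reference.
    pub fn call(
        &mut self,
        registry: &ExtensionRegistry,
        name: &str,
        args: Vec<proto::Expression>,
    ) -> Result<proto::Expression, BuildError> {
        let expected = *registry
            .functions
            .get(name)
            .ok_or_else(|| BuildError::UnknownFunction(name.to_string()))?;
        if expected != args.len() {
            return Err(BuildError::WrongArity {
                name: name.to_string(),
                expected,
                found: args.len(),
            });
        }
        // Reuse the existing anchor, or allocate the next sequential one.
        let next = self.extensions.len() as u32 + 1;
        let anchor = *self.anchors.entry(name.to_string()).or_insert(next);
        if anchor == next {
            self.extensions.push((anchor, name.to_string()));
        }
        Ok(proto::Expression::ScalarFunction { function_reference: anchor, arguments: args })
    }

    /// The local extension declarations accumulated so far.
    pub fn extensions(&self) -> &[(u32, String)] {
        &self.extensions
    }
}
```

With this shape, calling "add" twice through the same builder yields a single local declaration, and both calls share the same reference, so the caller never handles anchors by hand.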
Plan Modification and Ergonomics
One challenge here is plan modifications. The above is an efficient method for parsing and building plans; however, modifying or removing parts of a plan is not straightforward. If we remove a function call, do we remove it from the local plan registry? That depends on whether the function is used elsewhere in the plan.
I suggest we consider two options:
1. Start with immutable plans. Modifications would require a copy-on-write approach. While this may not be the most performant option for modifications, it is simple, safe, leverages the existing builder logic, and is a common pattern in Rust (see `std::borrow::Cow`).
2. Allow mutations with deferred "garbage collection". If we need more performant modifications, we could allow in-place changes that add new simple extensions to the plan's extension list, and (if needed) mark it as 'may_contain_deleted'. This would then require a finalization step where the plan is traversed to purge unused extensions and remap references before serialization.
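For the deferred "garbage collection" option, the finalization step might be sketched as follows. The `proto` types are simplified stand-ins, `compact_extensions` is a hypothetical name, and the sketch assumes anchors are unique and that the plan was valid before modification, so references to undeclared anchors are simply left untouched.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins for the generated protobuf types.
mod proto {
    #[derive(Debug, Clone, PartialEq)]
    pub enum Expression {
        Literal(i64),
        ScalarFunction { function_reference: u32, arguments: Vec<Expression> },
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Plan {
        pub extensions: Vec<(u32, String)>, // (anchor, function name)
        pub root: Expression,
    }
}

// Collects every anchor still referenced somewhere in the tree.
fn collect_used(expr: &proto::Expression, used: &mut BTreeSet<u32>) {
    if let proto::Expression::ScalarFunction { function_reference, arguments } = expr {
        used.insert(*function_reference);
        for arg in arguments {
            collect_used(arg, used);
        }
    }
}

// Rewrites references according to `mapping` (old anchor -> new anchor).
fn remap(expr: &mut proto::Expression, mapping: &HashMap<u32, u32>) {
    if let proto::Expression::ScalarFunction { function_reference, arguments } = expr {
        if let Some(new_anchor) = mapping.get(function_reference) {
            *function_reference = *new_anchor;
        }
        for arg in arguments {
            remap(arg, mapping);
        }
    }
}

/// Purges unused extension declarations and renumbers the survivors
/// densely from 1, updating every reference in the tree to match.
pub fn compact_extensions(plan: &mut proto::Plan) {
    let mut used = BTreeSet::new();
    collect_used(&plan.root, &mut used);

    // Drop declarations that nothing refers to any more.
    plan.extensions.retain(|(anchor, _)| used.contains(anchor));

    // Assign new sequential anchors in declaration order.
    let mapping: HashMap<u32, u32> = plan
        .extensions
        .iter()
        .enumerate()
        .map(|(index, (old, _))| (*old, index as u32 + 1))
        .collect();

    for (anchor, _) in plan.extensions.iter_mut() {
        let new_anchor = mapping[&*anchor];
        *anchor = new_anchor;
    }
    remap(&mut plan.root, &mapping);
}
```

The pass walks the tree twice (once to collect, once to remap), which is the cost the proposal defers to serialization time instead of paying on every in-place edit.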
In my opinion, having efficient builders and decoders is a higher priority, as modification is likely a less common use case. I therefore suggest starting with the simpler immutable approach (Option 1); if there are performance concerns, we could revisit this and switch to Option 2.
Next Steps
If this sounds like the right approach, I might take a stab at it. It’s a big project, so it would probably make sense to start with #342, and see how else to break it down. But first, I’d like to hear thoughts from the community!