Feature: Add a 'Rustic' Ergonomic and Validating Layer #367

@wackywendell

Background

The substrait-rs crate currently provides basic protobuf bindings but has minimal validation and no builder APIs. Other implementations, like substrait-java and substrait-go, offer higher-level APIs that are less verbose and less error-prone, making it easier to construct and consume plans correctly.

The need for a validation layer has already been raised in #157, which proposes a standalone validation layer. This proposal seeks to build on that idea by integrating validation directly into a more ergonomic, type-safe API for both creating and consuming plans.

Proposal

Let’s add a more "Rustic" layer to substrait-rs. This would provide a type-safe layer over the raw protobufs, with builders to make it more ergonomic to construct correct plans. This new layer would also feature a conversion path from the raw protobufs that performs validation, making plan consumption, tree-walking, and manipulation safer and easier.

Suggested Implementation Path

There are multiple ways we could do this. After some thought, I have a rough plan I’d like to run by this group.

Firstly, I believe we should approach this differently in Rust than in Go or Java. Rust is not garbage-collected, and its developers typically prize “zero-cost abstractions”—ones where the type system and compiler provide ergonomics and validity guarantees with minimal performance cost.

With that in mind, I suggest we build a zero-cost abstraction over the raw protobufs. This is distinct from the Go and Java libraries, which define separate structs and require a full, not-quite-zero-cost conversion to and from the protobuf representation.

  1. Extension Registry: The foundation would be a global, application-lifetime ExtensionRegistry, as discussed in #342 (Add Extension Registry). This registry, constructed from YAML files, would hold all possible function definitions the application might use.

  2. Typed Wrappers & Builders: We could then introduce a typed layer on top of the protobufs (e.g., struct Expression(proto::Expression)). The public API for these types would enforce correctness.

  3. API Design & Function Handling: A user would not be able to construct these wrapper types directly; the APIs for these wrapper types would ensure validity by construction. This follows the "Parse, Don't Validate" design principle already mentioned in the existing parse module: once the raw protobuf has been successfully parsed into our wrapper types, the type system guarantees that the plan is valid.

    • Parsing/Validation: A possible entry point would be Plan::parse(&registry, &raw_proto_plan)?. This could create an internal parsing context containing both the global &registry and a map of the plan's local extensions. This context would then be used to recursively parse and validate the entire plan tree, producing a Plan wrapper containing a "known-correct" tree of protobuf data.

    • Builder: A builder API, for example ExpressionBuilder::call(&mut self, &registry, "add", vec![...]), could find the "add" function in the global registry and handle the bookkeeping of adding the function to the builder's local extension list and setting the correct uint reference. At any given point, the Builder contains the protobuf built so far, as well as necessary metadata for correctness: e.g. a reference to the global extension registry and the local plan registry.
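To make the builder bookkeeping concrete, here is a minimal sketch using only the standard library. Everything in it is a hypothetical stand-in: `Expr` is a toy replacement for the generated protobuf expression type, `ExtensionRegistry` is a simplified placeholder for the registry proposed in #342, and `local_extensions` and `BuildError` are names invented for illustration. It only shows how `call` could look up a function in the global registry and reuse or allocate a plan-local anchor.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the global registry (the real design is in #342).
#[derive(Default)]
pub struct ExtensionRegistry {
    /// Function name -> URI of the extension file that declares it.
    functions: HashMap<String, String>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, name: &str, uri: &str) {
        self.functions.insert(name.to_string(), uri.to_string());
    }

    pub fn lookup(&self, name: &str) -> Option<&str> {
        self.functions.get(name).map(String::as_str)
    }
}

/// Toy stand-in for the protobuf expression message (assumption, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Literal(i64),
    Call { function_reference: u32, args: Vec<Expr> },
}

#[derive(Debug, PartialEq)]
pub enum BuildError {
    UnknownFunction(String),
}

/// Builder that owns the plan-local extension list built so far.
#[derive(Default)]
pub struct ExpressionBuilder {
    /// (extension URI, function name) -> plan-local anchor.
    anchors: HashMap<(String, String), u32>,
}

impl ExpressionBuilder {
    /// Resolve `name` in the global registry, then reuse the existing local
    /// anchor for it or allocate the next one (anchors start at 1 here,
    /// which is an arbitrary choice for this sketch).
    pub fn call(
        &mut self,
        registry: &ExtensionRegistry,
        name: &str,
        args: Vec<Expr>,
    ) -> Result<Expr, BuildError> {
        let uri = registry
            .lookup(name)
            .ok_or_else(|| BuildError::UnknownFunction(name.to_string()))?;
        let next = self.anchors.len() as u32 + 1;
        let anchor = *self
            .anchors
            .entry((uri.to_string(), name.to_string()))
            .or_insert(next);
        Ok(Expr::Call { function_reference: anchor, args })
    }

    /// Local extension declarations as (anchor, uri, name), sorted by anchor.
    pub fn local_extensions(&self) -> Vec<(u32, &str, &str)> {
        let mut out: Vec<_> = self
            .anchors
            .iter()
            .map(|((uri, name), anchor)| (*anchor, uri.as_str(), name.as_str()))
            .collect();
        out.sort();
        out
    }
}
```

With this shape, calling `builder.call(&registry, "add", ...)` twice yields two calls that share one anchor and one local extension entry, while an unregistered name is rejected at build time instead of producing a dangling reference.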

This approach would ensure that any plan represented in the new typed layer is semantically valid, leveraging Rust's type system to prevent errors while remaining a lightweight, zero-cost wrapper.
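As a rough illustration of the parsing side, the sketch below wraps a raw plan in a newtype with a private field, so the only way to obtain a `Plan` is through `Plan::parse`. The `raw` module is a toy stand-in for the generated protobuf types, and `ExtensionRegistry`, `Context`, `ParseError`, and `as_raw` are hypothetical names. One deviation from the signature above: this version takes the raw plan by value so the wrapper can hold it without copying.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-ins for the generated protobuf types (assumption, not the real API).
pub mod raw {
    #[derive(Debug, Clone, PartialEq)]
    pub struct ExtensionDecl {
        pub anchor: u32,
        pub name: String,
    }

    #[derive(Debug, Clone, PartialEq)]
    pub enum Expression {
        Literal(i64),
        Call { function_reference: u32, args: Vec<Expression> },
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Plan {
        pub extensions: Vec<ExtensionDecl>,
        pub root: Expression,
    }
}

/// Hypothetical global registry: here just the set of known function names.
#[derive(Default)]
pub struct ExtensionRegistry {
    known: HashSet<String>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, name: &str) {
        self.known.insert(name.to_string());
    }
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnknownFunction(String),
    DuplicateAnchor(u32),
    UndeclaredAnchor(u32),
}

/// Parsing context: the global registry plus the plan's local extensions.
struct Context<'a> {
    registry: &'a ExtensionRegistry,
    local: HashMap<u32, &'a str>,
}

impl<'a> Context<'a> {
    fn declare(&mut self, decl: &'a raw::ExtensionDecl) -> Result<(), ParseError> {
        if !self.registry.known.contains(&decl.name) {
            return Err(ParseError::UnknownFunction(decl.name.clone()));
        }
        if self.local.insert(decl.anchor, decl.name.as_str()).is_some() {
            return Err(ParseError::DuplicateAnchor(decl.anchor));
        }
        Ok(())
    }

    /// Recursively check that every call refers to a declared anchor.
    fn check(&self, expr: &raw::Expression) -> Result<(), ParseError> {
        match expr {
            raw::Expression::Literal(_) => Ok(()),
            raw::Expression::Call { function_reference, args } => {
                if !self.local.contains_key(function_reference) {
                    return Err(ParseError::UndeclaredAnchor(*function_reference));
                }
                args.iter().try_for_each(|arg| self.check(arg))
            }
        }
    }
}

/// Validated plan. The field is private, so outside this module a `Plan`
/// can only be obtained through `Plan::parse`.
#[derive(Debug)]
pub struct Plan(raw::Plan);

impl Plan {
    pub fn parse(registry: &ExtensionRegistry, raw: raw::Plan) -> Result<Plan, ParseError> {
        {
            let mut ctx = Context { registry, local: HashMap::new() };
            for decl in &raw.extensions {
                ctx.declare(decl)?;
            }
            ctx.check(&raw.root)?;
        }
        // Validation passed: wrap the original message without copying it.
        Ok(Plan(raw))
    }

    /// Read-only access to the underlying protobuf data.
    pub fn as_raw(&self) -> &raw::Plan {
        &self.0
    }
}
```

The checks here are deliberately shallow (known function names, unique anchors, no dangling references); a real implementation would presumably also validate argument types and the rest of the plan tree.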

Plan Modification and Ergonomics

One challenge here is plan modifications. The above is an efficient method for parsing and building plans; however, modifying or removing parts of a plan is not straightforward. If we remove a function call, do we remove it from the local plan registry? That depends on whether the function is used elsewhere in the plan.

I suggest we consider two options:

  1. Start with immutable plans. Modifications would require a copy-on-write approach. While this may not be the fastest option for modifications, it is simple, safe, leverages the existing builder logic, and is a common pattern in Rust (see std::borrow::Cow).

  2. Allow mutations with deferred "garbage collection". If we need more performant modifications, we could allow in-place changes that add new simple extensions to the plan's extension list, and (if needed) mark the plan as 'may_contain_deleted'. This would then require a finalization step before serialization, in which the plan is traversed to purge unused extensions and remap references.
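For Option 2, the finalization step could look roughly like the sketch below. The types are toy stand-ins for the protobuf messages, `finalize` is a hypothetical name, and the `may_contain_deleted` flag is modeled as a plain field. The sketch walks the tree once to find the anchors still in use, drops the unused declarations, and renumbers the survivors densely in first-use order (starting at 1, an arbitrary choice).

```rust
use std::collections::BTreeMap;

/// Toy stand-ins for the protobuf messages (assumption, not the real types).
#[derive(Debug, Clone, PartialEq)]
pub struct ExtensionDecl {
    pub anchor: u32,
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    Literal(i64),
    Call { function_reference: u32, args: Vec<Expression> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Plan {
    pub extensions: Vec<ExtensionDecl>,
    pub root: Expression,
    /// Set by in-place edits that may have orphaned an extension.
    pub may_contain_deleted: bool,
}

/// Record every anchor still referenced, mapping old anchor -> new anchor.
fn collect_used(expr: &Expression, mapping: &mut BTreeMap<u32, u32>) {
    if let Expression::Call { function_reference, args } = expr {
        let next = mapping.len() as u32 + 1;
        mapping.entry(*function_reference).or_insert(next);
        for arg in args {
            collect_used(arg, mapping);
        }
    }
}

/// Rewrite every function reference according to `mapping`.
fn remap(expr: &mut Expression, mapping: &BTreeMap<u32, u32>) {
    if let Expression::Call { function_reference, args } = expr {
        let new_anchor = mapping[&*function_reference];
        *function_reference = new_anchor;
        for arg in args.iter_mut() {
            remap(arg, mapping);
        }
    }
}

/// Purge unused extensions and renumber the rest before serialization.
pub fn finalize(plan: &mut Plan) {
    if !plan.may_contain_deleted {
        return; // nothing was removed, so the extension list is already tight
    }
    let mut mapping = BTreeMap::new();
    collect_used(&plan.root, &mut mapping);
    plan.extensions.retain(|decl| mapping.contains_key(&decl.anchor));
    for decl in plan.extensions.iter_mut() {
        decl.anchor = mapping[&decl.anchor];
    }
    remap(&mut plan.root, &mapping);
    plan.may_contain_deleted = false;
}
```

This assumes every reference in the tree has a matching declaration, which the validating layer would already guarantee. The cost is one extra traversal per serialization of a modified plan.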

In my opinion, having efficient builders and decoders is a higher priority, as modification is likely a less common use case, so I suggest starting with the simpler immutable approach (Option 1); if there are performance concerns, we could revisit and switch to (2).
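To show what the copy-on-write flavor of Option 1 might feel like, here is a small sketch built on `std::borrow::Cow`. The `Expression` enum is a toy stand-in for the protobuf type and `replace_literal` is a made-up rewrite used purely for illustration. The function returns `Cow::Borrowed` when nothing changes and only allocates a new node when a child was actually rewritten.

```rust
use std::borrow::Cow;

/// Toy stand-in for the protobuf expression message (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    Literal(i64),
    Call { function_reference: u32, args: Vec<Expression> },
}

/// Replace every literal equal to `from` with `to`.
/// The original tree is never mutated; untouched subtrees are returned
/// as borrows, so a no-op rewrite allocates no new expression nodes.
pub fn replace_literal<'a>(expr: &'a Expression, from: i64, to: i64) -> Cow<'a, Expression> {
    match expr {
        Expression::Literal(value) if *value == from => Cow::Owned(Expression::Literal(to)),
        Expression::Literal(_) => Cow::Borrowed(expr),
        Expression::Call { function_reference, args } => {
            let rewritten: Vec<Cow<'a, Expression>> = args
                .iter()
                .map(|arg| replace_literal(arg, from, to))
                .collect();
            if rewritten.iter().all(|arg| matches!(arg, Cow::Borrowed(_))) {
                // No child changed: hand back the original node.
                Cow::Borrowed(expr)
            } else {
                // At least one child changed: build a new parent node.
                Cow::Owned(Expression::Call {
                    function_reference: *function_reference,
                    args: rewritten.into_iter().map(|arg| arg.into_owned()).collect(),
                })
            }
        }
    }
}
```

One caveat: because children are stored by value in a `Vec`, unchanged siblings of a modified node are still cloned into the new parent. Sharing them would need something like `Arc`-based children, which is outside what this sketch covers.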

Next Steps

If this sounds like the right approach, I might take a stab at it. It’s a big project, so it would probably make sense to start with #342, and see how else to break it down. But first, I’d like to hear thoughts from the community!
