Skip to content

Conversation

LaurenzV
Copy link
Contributor

@LaurenzV LaurenzV commented Sep 9, 2025

Motivation

Currently, the only way to do efficient clipping in Vello Sparse is by pushing a new isolated layer and associating a clip path with that layer. This might be sufficient for cases like for example SVG, but the problem is that doing so forces the user to conflate layers and clipping, which is not what should be done in some imaging models. For example, clip paths in PDF have nothing to do with isolation, and it is therefore desirable to be able to separate the two. I believe the same is the case for HTML canvas, where the clip stack is not really related to layers, and the fact they are causes problems (see #1117). In practice, this mostly doesn't cause issues as (to my understanding) source-over compositing with premultiplied alpha is associative and it therefore mostly doesn't matter whether you clip with isolation or not. However, problems start to occur once you add masks or blending to the mix, as the behavior with and without isolation is not identical.

It should be noted that above can be simulated by creating a new mask for each clip path, but this is obviously expensive, especially for deeply-nested clips.

Solution

Because of this, I propose a second way to support clipping in Vello CPU (could be added to hybrid as well) that is based on an independent clip stack that can be pushed and popped to without actually having to create new layers. In particular, you do not need to clear the clip stack before rasterizing, which should adddress the issue in #1117.

I think how this works is very neat and demonstrates how useful sparse strips are! Fundamentally, this is based on a new algorithm that, given two paths in sparse strips representation, computes a new path in sparse strips representation that represents the intersection of the two. While the actual implementation is a bit finicky, conceptually it's very simple: We iterate over each stripped and fill region of the two paths in lock step and determine all of the overlaps between the two. In case we have two fill regions, the overlap region will also be filled. In case we have one strip and one fill region, the overlap will copy the alpha mask of the strip region. Finally, if we have two strip regions, we combine the alpha masks of both. That's it, this is enough to compute a new sparse strips path that when filled is equivalent to intersecting the two strips. The beauty of this is that this is incredibly fast because of the usual sparseness of paths and the result will also still be as sparse as possible as we preserve sparse fill regions whenever possible.

Given this, we introduce two new push_clip_path and pop_clip_path methods that operate on their own clip stack and completely independently from the push/pop layer commands. In case a clip path is currently active, once we draw a path we simply calculate the intersection after converting it into sparse strips, and then use that new path as the basis in coarse rasterization instead. In case we call push_clip_path multiple times, we can simply intersect the new clip path with the one in the previous stack position, to end up with one single new clip path that represents the intersection of all previous ones. Therefore, even for deeply nested clips we only need to perform one intersection per path drawn.

The disadvantage is that in multi-threading, it requires us to process new clip paths on the main thread, but the advantage is that (unlike the current clip algorithm) causes no additional overhead during rasterization (which is currently the only serial part), and the clipping calculation is done by each worker separately for the path they are currently processing, which means it scales very well!

Testing

I will add more tests, but I confirmed the current version to cause no regressions in my PDF test suite.

@LaurenzV
Copy link
Contributor Author

LaurenzV commented Sep 9, 2025

cc @sagudev maybe interesting for you as well :)

@LaurenzV LaurenzV added the C-sparse-strips Applies to sparse strips variants of vello in general label Sep 9, 2025
@LaurenzV
Copy link
Contributor Author

LaurenzV commented Sep 9, 2025

I think this should be ready for review now.

@LaurenzV LaurenzV marked this pull request as ready for review September 9, 2025 19:56
@LaurenzV
Copy link
Contributor Author

LaurenzV commented Sep 9, 2025

@taj-p Lemme know if that would be useful for you in vello_hybrid, but since this hasn't really been talked about before I presume that the current clipping algo is good enough for you, so I haven't implemented it there.

@taj-p
Copy link
Contributor

taj-p commented Sep 9, 2025

@taj-p Lemme know if that would be useful for you in vello_hybrid, but since this hasn't really been talked about before I presume that the current clipping algo is good enough for you, so I haven't implemented it there.

We actually make quite a lot of use of clip paths, so I'm wondering whether this would be faster than the layer implementation. For significantly clipped images, I imagine this would be faster (because the layer implementation requires drawing the entire image to a layer for the next to sample from).

But, for milder clips, I'm not so sure since it's a question of whether it's faster to either:

  1. rasterise the strips using the GPU and suffer a texture copy to sample those pixels; or
  2. run intersect in the CPU and rasterise the strips directly to the target in the GPU without sampling

The beauty of this is that this is incredibly fast

My feeling is that intersect seems very quick, so this might be faster. I created a quick way to benchmark this in hybrid (#1204). Essentially, I just add more of the deeply nested circles to the scene and measure FPS.

I saw the clip_path variant to be roughly 1/3 slower than the clip_layer approach, which I thought was mildly surprising. I suspect this is also owing to running single threaded on a Mac with a powerful GPU 🤔 . I would suspect users on less powerful GPUs will see a significant boost with this PR.

Are you able to confirm whether I properly migrated this to work as you'd expect in hybrid?

image

Cc @grebmeg for your thoughts

@LaurenzV
Copy link
Contributor Author

LaurenzV commented Sep 9, 2025

Thanks! The result also surprises me a bit, though it's possible that pushing/popping on the GPU is so fast that it doesn't really matter, but I think it would matter for the CPU. I will try your experiment in vello_cpu tomorrow and see how it performs.

To be fair, I didn't mean to imply that this version is faster than the current layer-based clipping! Just that this version of clipping should be much faster than using masks for simulating non-isolated clips. 😄 Based on intuition, on the CPU I would at least expect that this version is faster when dealing with many nested clip paths where only few paths are drawn, while the layer-based approach is faster when few clips are used but many paths are drawn with those clips active. But I haven't benchmarked it and it's possible that things are different on the GPU.

@grebmeg
Copy link
Contributor

grebmeg commented Sep 10, 2025

It’s really nice that you can use sparse strips of two paths and get their intersection, and it’s SIMD-accelerated!

That said, I’d like to understand the main goal of this PR more clearly: is it primarily about performance, about enabling clipping of intermediate results (as in issue #1117), or about separating two concepts that don’t quite fit together, or all together?

From a performance perspective, push_clip_path form this PR seems to replace our existing GPU clipping algorithm, which is based on two textures ping-ponging. On the CPU side, we only compute where and when those clips should be applied, which is why it’s fairly fast. In contrast, this change adds some per-path intersection overhead. Roughly speaking, our current GPU clipping looks equivalent to intersect in this PR. My impression is that stronger GPUs would benefit more here, but I’m curious about where the crossover point lies where a CPU-only approach actually becomes faster. Would it be possible to benchmark the both clip approaches (similar to what Taj did) across various patterns?

That said, your change brings some nice improvements on CPU, especially avoiding extra clip buffer allocations per tile, and making it possible to immediately see clipping result when a clip is active.

I’m also wondering whether your intersection algorithm could be integrated into current layer clipping in pop_clip. Right now, that code works with clip bounding boxes and calculates all clip commands on pop. Would it be possible to use your intersect function to compute the intersection between a rectangular clip box and the path strips, and then run wide.generate on the resulting strips, assuming they are now all either clip alpha or clip fill?

I also think that for vello_hybrid it might be worth exploring whether we can avoid using intersect and instead emulate the push/clip/pop commands ourselves, allowing the shader code to perform the actual clipping for each path and this way, we could also see the clipping result immediately for each path.

@LaurenzV
Copy link
Contributor Author

That said, I’d like to understand the main goal of this PR more clearly: is it primarily about performance, about enabling clipping of intermediate results (as in issue #1117), or about separating two concepts that don’t quite fit together, or all together?

It's basically as outlined in the Motivation section in this PR: The main motivation here is not performance, but simply that the current clipping algorithm assumes that we apply layer isolation each time we use a clip path, which is not what we want to do in some imaging models. My intention isn't to replace the existing layer clipping, but simply provide an alternative so that we can achieve it without layer isolation, so you can still choose between the two if you want. With that said, Taj has successfully nerd-sniped me in wanting to understand the performance differences better now. :)

Would it be possible to use your intersect function to compute the intersection between a rectangular clip box and the path strips, and then run wide.generate on the resulting strips, assuming they are now all either clip alpha or clip fill?

Looking at it a bit more closely, I believe this should be possible and it could simplify the code logic a bit, but the core difficulty in the current algo seems to be in other places so I'm not sure how much simpler it would make the current code.

@LaurenzV LaurenzV marked this pull request as draft September 10, 2025 09:39
@LaurenzV
Copy link
Contributor Author

I can reproduce the slowdown in vello_cpu and the reason for it is that there is a lot of allocating/deallocating for the clip path data going as the number of circles increases. I probably know how to fix it, but it requires some small architectural changes which I will submit as separate PRs before revisiting this one.

@dominikh
Copy link
Contributor

This way of clipping does have the perhaps surprising effect that clips will not affect layer composition at all. When we pop a layer, the active non-layer clip will not restrict where the layer gets composited.

@LaurenzV
Copy link
Contributor Author

Yeah, the intention is that you pass a clip path to clip_layer instead if you want that (although it's still broken right now). But I agree that we need to make lots of improvements to the documentation, not only for this, but in general.

@dominikh
Copy link
Contributor

dominikh commented Sep 11, 2025

Yeah, the intention is that you pass a clip path to clip_layer instead if you want that

It might make sense to expose the active non-layer clip path in that case, so that it can easily be set on a layer, too.

I'm imagining a hypothetical UI building use case, where each child widget gets clipped to its dimensions before it is allowed to render itself. That's a prime example of nested clips that might benefit from using this new algorithm. However, every once in a while, a widget may have need for an isolated layer and fancy compositing, in which case it should be able to inherit the active non-layer clip.

(There might also be some value in renaming layer clips to something else to avoid confusion. Maybe stencils or shapes or masks?)

Interestingly, while Skia separates clips and layers, they share a save/restore stack. You can either save just the current clip (and transformation matrix), or save the current clip and create a new layer. In either case you use restore to pop the state and in the case of a saved layer, blend it. Under that model, it would be possible to apply the current clip to the new layer at layer creation time. I don't think Skia has a separate layer-specific clipping model. For vello_cpu, that'd mean intersecting the current non-layer clip with the layer's optional clip.

That model should also allow for the following optimization: when the current layer's clip is identical to the current non-layer clip, then we don't have to intersect paths with the non-layer clip, as the layer already takes care of it. Realistically, "identical" would be implemented as "neither clip has changed since the layer's creation". Of course at that point we're moving further away from the PDF model, where clips and layers (transparency groups) don't share the same stack, I think (the full PDF model has a lot of fun corner cases and full support for it should probably not be the primary goal...)

@LaurenzV LaurenzV marked this pull request as ready for review September 16, 2025 14:27
@LaurenzV
Copy link
Contributor Author

@taj-p Feel free to retry now, when I try it now the new clip version is faster! Though we do need to consider that it's kind of an edge case because we are only drawing one path per clip path, I assume the difference is going to amortize and turn around as the number of paths per clip path increases (haven't tested it carefully, though).

@taj-p
Copy link
Contributor

taj-p commented Sep 17, 2025

@taj-p Feel free to retry now, when I try it now the new clip version is faster! Though we do need to consider that it's kind of an edge case because we are only drawing one path per clip path, I assume the difference is going to amortize and turn around as the number of paths per clip path increases (haven't tested it carefully, though).

I'm seeing a roughly 20-25% improvement in performance 😮 ! It's likely to be more for users with worse GPUs too. I wouldn't consider this too much an edge case. It's common to, for example, use a clipping region across a single image.

cc @grebmeg / @ajakubowicz-canva

Copy link
Contributor

@grebmeg grebmeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m also wondering whether your intersection algorithm could be integrated into current layer clipping in pop_clip. Right now, that code works with clip bounding boxes and calculates all clip commands on pop. Would it be possible to use your intersect function to compute the intersection between a rectangular clip box and the path strips, and then run wide.generate on the resulting strips, assuming they are now all either clip alpha or clip fill?

Looking at it a bit more closely, I believe this should be possible and it could simplify the code logic a bit, but the core difficulty in the current algo seems to be in other places so I'm not sure how much simpler it would make the current code.

I suspect your intersection algorithm could also be applied to push_clip/pop_clip in coarse, which might remove the need for the current non-SIMD code while still replicating the existing push/pop clip buffer behavior and generating clip commands (I’m curious whether this approach would lead to better performance). It might also make sense to maintain a single clip stack.

Comment on lines +133 to +141
if use_clip_path {
ctx.push_clip_path(&clip_circle);
} else {
ctx.push_clip_layer(&clip_circle);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When toggling, the results seem to differ in antialiasing?

push_clip_path push_clip_layer
Image Image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that does look weird, will try to take a look when I find the time!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce this on my machine, both cases look like the push_clip_layer one for me...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would help to add a test showing a more complex option that uses both clipping APIs? Roughly, something like this:

#[vello_test]
fn clip_non_isolated_rectangle_with_star_evenodd(ctx: &mut impl Renderer) {
    let rect = Rect::new(0.0, 0.0, 100.0, 100.0);
    let rect2 = Rect::new(10.0, 10.0, 80.0, 80.0);
    let rect3 = Rect::new(40.0, 40.0, 90.0, 90.0);
    let rect4 = Rect::new(70.0, 70.0, 75.0, 75.0);

    ctx.push_clip_path(&rect2.to_path(0.1));
    ctx.push_layer(Some(&rect3.to_path(0.1)), None, None, None);
    ctx.push_clip_path(&rect4.to_path(0.1));
    ctx.set_paint(REBECCA_PURPLE);
    ctx.fill_rect(&rect);
    ctx.pop_clip_path();
    ctx.pop_layer();
    ctx.pop_clip_path();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, is it expected behavior that pushed clip paths (push_clip_path) still affect internal paths rendering within the layer (in push_layer)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is expected. All clip paths pushed via push_clip_path apply to all path drawing operations, regardless of which layer they are drawn on.

@LaurenzV LaurenzV force-pushed the laurenz/better-clip branch from 1725cda to 5c534c6 Compare October 2, 2025 14:38
@LaurenzV LaurenzV force-pushed the laurenz/better-clip branch from 5c534c6 to 8bd3fca Compare October 2, 2025 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-sparse-strips Applies to sparse strips variants of vello in general
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants