Dynamic filters blog post (rev 2) #103

Conversation
Co-authored-by: Copilot <[email protected]>
@adriangb an update
This has resulted in a bit of a frankenstein 👹 at the moment -- I will fix it up shortly. You can see the current setup here: https://datafusion.staged.apache.org/blog/2025/09/01/dynamic-filters/ Question: Do you mind if I add myself as the second author?
You're making extensive edits; you should add yourself 😄
FWIW I like the new figures, but I also like the idea of annotating a plan that looks roughly like the output from running
Yes absolutely -- I think there is room for both: basically I want to write an "everyman's explanation of TopK/dynamic filters" and then show how that translates into DataFusion.
Sounds like a good plan to me!
Reviewing / working on this article is absolutely on my list this week, but it will likely take me a few days.
I am very sorry for the delay -- I have been away, but I plan to keep working on this post now that I have returned. Thank you for your patience!
I plan to publish this blog post tomorrow unless there are further comments.
Love this post, really cool.
Really great read!
We can probably improve the wording but for:
Maybe it requires slightly more detail for the reader; I'm still trying to grasp the idea. 🤔 To get the filter value (it doesn't have to be super accurate, just close enough to reduce the reading scope) it is possible to scan. For the heap, though, the algorithm is still not clear to me. How does it make sure we don't need to scan 100M rows as before -- does that hold for any scenario, or only when the underlying files' data are sorted? If the heap stores the top K, doesn't it still need to read all the rows, with the benefit being that we don't pay for a full sort and only rebuild a heap?
I think one major idea is to reuse state / information that is already present in the operators -- so for example the TopK operator already has a topK heap, and the dynamic filter concept allows this information to be passed down to the scan.
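For readers following along, here is a minimal, self-contained sketch of that idea in plain Rust. It is not the actual DataFusion API: the `DynamicFilter`, `update`, and `might_contain_useful_rows` names are all hypothetical, and a real implementation would plug into the plan's filter expressions rather than a bare integer threshold.

```rust
// Conceptual sketch only: the TopK side publishes its current threshold into
// a shared slot, and the scan side reads it to decide whether a chunk of data
// is worth decoding at all.
use std::sync::{Arc, RwLock};

/// Shared filter state: "only rows with value > threshold can still matter"
/// (for a query like ORDER BY value DESC LIMIT k).
#[derive(Default)]
struct DynamicFilter {
    threshold: RwLock<Option<i64>>,
}

impl DynamicFilter {
    /// Called by the TopK operator whenever its heap changes.
    fn update(&self, new_threshold: i64) {
        *self.threshold.write().unwrap() = Some(new_threshold);
    }

    /// Called by the scan before decoding a batch / row group.
    fn might_contain_useful_rows(&self, max_value_in_chunk: i64) -> bool {
        match *self.threshold.read().unwrap() {
            None => true, // no threshold yet: the scan must read everything
            Some(t) => max_value_in_chunk > t, // could this chunk improve the top k?
        }
    }
}

fn main() {
    let filter = Arc::new(DynamicFilter::default());

    // Before any threshold exists, nothing can be skipped.
    assert!(filter.might_contain_useful_rows(5));

    // TopK side: the smallest value in a full heap becomes the threshold.
    filter.update(100);

    // Scan side: a chunk whose largest value is 42 cannot change the top k.
    assert!(!filter.might_contain_useful_rows(42));
}
```

The key point is that the scan and the TopK operator share state, so the filter gets tighter as the query runs.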
I don't think the dynamic filter has any guarantee that it will filter rows -- for example, in the pathological case where the data is scanned in reverse order, it will not filter anything. However, the idea is that updating the dynamic filter is cheap and it does help in many real-world settings, so it is overall a good optimization to do.
Besides, scanning the data in precisely the reverse order to the query is bad, dynamic filters or not, and we should fix that.
I am going to incorporate the feedback from @comphead and @nuno-faria in the next few hours.
For sure, if it is known that the data is in the wrong order. I am not sure DataFusion always knows how data is distributed across files, though. Another potentially pathological case is when the data is randomly distributed throughout the files (so no files or row groups can be pruned out).
@comphead -- the min value is the minimum of what has already been seen during query execution.
The TopK operator heap is here:
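To make "the minimum of what has already been seen" concrete, here is a hedged sketch (a hypothetical `TopKHeap` type, not DataFusion's actual `TopK` implementation) of how the threshold falls out of the heap the operator already maintains for `ORDER BY value DESC LIMIT k`:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keeps the k largest values seen so far in a min-heap; once full, the
/// smallest element in the heap is the filter threshold: no row with a
/// value <= threshold can enter the top k anymore.
struct TopKHeap {
    k: usize,
    heap: BinaryHeap<Reverse<i64>>,
}

impl TopKHeap {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    fn insert(&mut self, value: i64) {
        if self.heap.len() < self.k {
            self.heap.push(Reverse(value));
        } else if value > self.heap.peek().unwrap().0 {
            // Replace the current smallest of the top k.
            self.heap.pop();
            self.heap.push(Reverse(value));
        }
    }

    /// The current filter value; `None` until k rows have been seen.
    fn threshold(&self) -> Option<i64> {
        (self.heap.len() == self.k).then(|| self.heap.peek().unwrap().0)
    }
}

fn main() {
    let mut topk = TopKHeap::new(3);
    for v in [10, 50, 20, 90, 5, 70] {
        topk.insert(v);
    }
    // The top 3 values seen so far are {90, 70, 50}, so any future row with
    // value <= 50 can be filtered out before sorting.
    assert_eq!(topk.threshold(), Some(50));
}
```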
I am happy to keep making clarifications, etc. on tickets and this blog in follow-on PRs. But for now, let's publish it, as I have already delayed it by several weeks and I would like this to be available before the NYC meetup next week. Onwards 🚀
Thanks @alamb, I was referring to the statement
which implies the optimization doesn't have to read and decode 100M rows to get the top 10, and I cannot see exactly how that happens 🤔 It makes sense with the filter, but to get the min value for the filter we still need a full scan; that is something I'm still missing. Let's go ahead, yes. Thanks for the explanations.
Let's take the best case, which is
While it is true DataFusion still needs to check all remaining files to ensure this is actually the minimum value, it may not have to actually open, read, and decode the rows in each file -- for example, it could potentially prune (skip) all remaining files using statistics. And even if it can't prune out an entire file, it may be able to prune row groups, or ranges of rows (if …)
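A rough sketch of that statistics-based pruning decision, assuming hypothetical `ChunkStats` / `prune` helpers rather than DataFusion's real pruning machinery: a file or row group whose max statistic cannot beat the current TopK threshold is skipped without being opened or decoded.

```rust
/// Min/max-style statistics for a file or row group (Parquet footers carry these).
struct ChunkStats {
    name: &'static str,
    max_value: i64,
}

/// For `ORDER BY value DESC LIMIT k`, keep only the chunks that could still
/// contribute a row above the current TopK threshold.
fn prune(chunks: &[ChunkStats], threshold: Option<i64>) -> Vec<&ChunkStats> {
    chunks
        .iter()
        .filter(|c| match threshold {
            None => true,               // no filter yet: read everything
            Some(t) => c.max_value > t, // stats say this chunk might matter
        })
        .collect()
}

fn main() {
    let chunks = [
        ChunkStats { name: "file_1.parquet", max_value: 95 },
        ChunkStats { name: "file_2.parquet", max_value: 40 },
        ChunkStats { name: "file_3.parquet", max_value: 60 },
    ];

    // Suppose reading the first file already produced a TopK threshold of 50.
    let survivors: Vec<&str> = prune(&chunks, Some(50)).iter().map(|c| c.name).collect();

    // file_2 (max 40) is skipped entirely; file_3 (max 60) must still be read.
    assert_eq!(survivors, ["file_1.parquet", "file_3.parquet"]);
}
```

The same kind of check can apply at the row-group and page level, which is where the multi-level pruning mentioned in this thread comes in.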
Oh, I think I'm getting the picture now. So it is not derived only from the data itself (as I was told); it is hybrid: data + Parquet stats. That makes sense now: we use the value in the heap as an approximate threshold just to remove unnecessary reads, because it is still better than a full scan. Best case, we get the min value from the first batch; worst case, it should still be cheaper than a full scan.
Yeah, I think the predicate (min value) itself is derived only from the data, but actually using it to make the query faster relies on statistics (and all the other parts of multi-level pruning).
@nuno-faria what computer (or at least how many cores) did you run #103 (comment) on?
@adriangb I used a 6-core CPU (12 execution threads).
📰 See rendered preview here: https://datafusion.staged.apache.org/blog/2025/09/10/dynamic-filters/ 📰
Notes:
This is based on @adriangb's PR in #102, but is hosted in the apache/datafusion-site fork so