-
Notifications
You must be signed in to change notification settings - Fork 1.6k
better preserve statistics when applying limits #17381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
datafusion/common/src/stats.rs
Outdated
@@ -391,62 +391,85 @@ impl Statistics { | |||
/// parameter to compute global statistics in a multi-partition setting. | |||
pub fn with_fetch( | |||
mut self, | |||
schema: SchemaRef, | |||
_schema: SchemaRef, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you keeping the unused param to avoid API change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to just break it as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I incline to make it break in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
datafusion/common/src/stats.rs
Outdated
// input stats as is. | ||
// TODO: Can input stats still be used, but adjusted, when `skip` | ||
// is non-zero? | ||
return Ok(self); | ||
} else if nr - skip <= fetch_val { | ||
// After `skip` input rows are skipped, the remaining rows are | ||
// less than or equal to the `fetch` values, so `num_rows` must | ||
// equal the remaining rows. | ||
check_num_rows( | ||
(nr - skip).checked_mul(n_partitions), | ||
// We know that we have an estimate for the number of rows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the purpose of the comment was to say that the unwrap()
is safe?
If so, I think we can remove the comment.
self.column_statistics = self | ||
.column_statistics | ||
.into_iter() | ||
.map(ColumnStatistics::to_inexact) | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit
can make the column_statistics super different from the original, so here I'm wondering if it's better to give it an unknown. I'm worried we'll use inexact column_statistics to do some estimation, but in fact, the column_statistics is very deviant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible I'm not sure. I figured something is better than nothing, do you know of any cases where this would lead to a worse result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know of any cases where this would lead to a worse result?
No, just out of a conservative mindset.
But I don't have a strong bias for this, we can adjust anytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can think of scenarios where preserving the values even if inexact would be helpful. Namely file pruning from filters, e.g. a stats on a timestamp column will still be very effective.
feafb23
to
1ef8b37
Compare
After the CI is green, I'll have another look |
@xudong963 CI is green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
Closes #17380