@adriangb adriangb commented Sep 2, 2025

Closes #17380

@github-actions github-actions bot added common Related to common crate core Core DataFusion crate labels Sep 2, 2025
@xudong963 xudong963 self-requested a review September 3, 2025 15:06
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 3, 2025
@@ -391,62 +391,85 @@ impl Statistics {
     /// parameter to compute global statistics in a multi-partition setting.
     pub fn with_fetch(
         mut self,
-        schema: SchemaRef,
+        _schema: SchemaRef,
Member

Are you keeping the unused param to avoid an API change?

Contributor Author

Yep

Contributor Author

Happy to just break it as well

Member

I'm inclined to make the breaking change in this PR.

Contributor Author

done!

// input stats as is.
// TODO: Can input stats still be used, but adjusted, when `skip`
// is non-zero?
return Ok(self);
} else if nr - skip <= fetch_val {
// After `skip` input rows are skipped, the remaining rows are
// less than or equal to the `fetch` values, so `num_rows` must
// equal the remaining rows.
check_num_rows(
(nr - skip).checked_mul(n_partitions),
// We know that we have an estimate for the number of rows:
Member

Do you think the purpose of the comment was to say that the unwrap() is safe?

If so, I think we can remove the comment.
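
For readers following the `nr - skip <= fetch_val` branch quoted above, here is a minimal sketch of how the row-count estimate could be derived. It is not the code from this PR: `RowCount`, `rows_after_limit`, and `scaled` are hypothetical names, `RowCount` is a simplified stand-in for DataFusion's `Precision<usize>`, and the branches other than the one shown in the diff are assumptions about the surrounding logic.

```rust
/// Simplified stand-in for a row-count estimate (hypothetical type,
/// not DataFusion's `Precision<usize>`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowCount {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical helper: estimate the rows left after skipping `skip` rows
/// and fetching at most `fetch` rows, scaled by `n_partitions`.
pub fn rows_after_limit(
    input: RowCount,
    skip: usize,
    fetch: Option<usize>,
    n_partitions: usize,
) -> RowCount {
    let fetch_val = fetch.unwrap_or(usize::MAX);
    let (nr, exact) = match input {
        RowCount::Exact(nr) => (nr, true),
        RowCount::Inexact(nr) => (nr, false),
        // No input estimate: the fetch is the only bound available.
        RowCount::Absent => return scaled(fetch_val, n_partitions, false),
    };
    if nr <= skip {
        // Assumption: an inexact count is an upper bound, so nothing survives.
        RowCount::Exact(0)
    } else if nr <= fetch_val && skip == 0 {
        // Input and output are identical; keep the estimate as is.
        input
    } else if nr - skip <= fetch_val {
        // Fewer rows remain after `skip` than `fetch` allows.
        scaled(nr - skip, n_partitions, exact)
    } else {
        // The fetch caps the output.
        scaled(fetch_val, n_partitions, exact)
    }
}

/// Multiply by the partition count, falling back to `Absent` on overflow.
fn scaled(rows: usize, n_partitions: usize, exact: bool) -> RowCount {
    match rows.checked_mul(n_partitions) {
        Some(v) if exact => RowCount::Exact(v),
        Some(v) => RowCount::Inexact(v),
        None => RowCount::Absent,
    }
}
```

In this sketch the exactness flag is read in the same `match` that extracts `nr`, so no `unwrap()` is needed and no comment is required to justify one.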

Comment on lines +460 to +466
self.column_statistics = self
.column_statistics
.into_iter()
.map(ColumnStatistics::to_inexact)
.collect();
Member

A limit can make the column_statistics very different from the original, so I'm wondering whether it would be better to mark them as unknown here. I'm worried we'll use the inexact column_statistics for estimation when, in fact, they deviate a lot from the real values.

Contributor Author

It's possible; I'm not sure. I figured something is better than nothing. Do you know of any cases where this would lead to a worse result?

Member

> do you know of any cases where this would lead to a worse result?

No, just out of a conservative mindset.

But I don't have a strong opinion on this; we can adjust anytime.
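
To make the two options in this thread concrete, here is a minimal sketch contrasting "downgrade to inexact" with "discard as unknown". The types are simplified stand-ins and not DataFusion's real definitions: `ColumnStats` is a hypothetical struct with only three fields, and `to_unknown` and `demote_all` are hypothetical helpers that do not appear in the PR.

```rust
/// Simplified stand-in for a statistic with a precision level.
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Demote an exact value to inexact; other variants are unchanged.
    pub fn to_inexact(self) -> Self {
        match self {
            Precision::Exact(v) => Precision::Inexact(v),
            other => other,
        }
    }
}

/// Hypothetical, reduced version of per-column statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub min_value: Precision<i64>,
    pub max_value: Precision<i64>,
    pub null_count: Precision<usize>,
}

impl ColumnStats {
    /// Option A (the approach in the diff): keep the values, mark them inexact.
    pub fn to_inexact(self) -> Self {
        ColumnStats {
            min_value: self.min_value.to_inexact(),
            max_value: self.max_value.to_inexact(),
            null_count: self.null_count.to_inexact(),
        }
    }

    /// Option B (the conservative alternative): drop the values entirely.
    pub fn to_unknown(self) -> Self {
        ColumnStats {
            min_value: Precision::Absent,
            max_value: Precision::Absent,
            null_count: Precision::Absent,
        }
    }
}

/// Mirrors the `into_iter().map(...).collect()` shape of the diff.
pub fn demote_all(stats: Vec<ColumnStats>) -> Vec<ColumnStats> {
    stats.into_iter().map(ColumnStats::to_inexact).collect()
}
```

Option A keeps information that downstream consumers can still choose to ignore, while Option B removes the choice; that is the trade-off being discussed, not a statement about which one any particular optimizer rule prefers.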

Contributor Author

I can think of scenarios where preserving the values, even if inexact, would be helpful. Namely file pruning from filters: e.g. stats on a timestamp column will still be very effective.
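
As an illustration of the timestamp case, here is a minimal sketch of pruning a file against a `ts >= lower` filter using inexact min/max values. Everything here is hypothetical (`Bound`, `FileStats`, `can_prune_ge`) and it is not how DataFusion's pruning is implemented. The sketch relies on one assumption: the rows that survive a limit are a subset of the input, so the pre-limit min and max still bracket them and can be treated as usable bounds.

```rust
/// Hypothetical bound on a timestamp column (e.g. epoch milliseconds).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Bound {
    Exact(i64),
    Inexact(i64),
    Absent,
}

impl Bound {
    /// Return the value if one is known, whether exact or inexact.
    pub fn value(self) -> Option<i64> {
        match self {
            Bound::Exact(v) | Bound::Inexact(v) => Some(v),
            Bound::Absent => None,
        }
    }
}

/// Hypothetical per-file statistics for a single timestamp column.
#[derive(Debug, Clone, Copy)]
pub struct FileStats {
    pub min_ts: Bound,
    pub max_ts: Bound,
}

/// Returns true when no row in the file can satisfy `ts >= lower`.
/// With an absent max there is nothing to compare, so the file is kept.
pub fn can_prune_ge(stats: &FileStats, lower: i64) -> bool {
    match stats.max_ts.value() {
        Some(max) => max < lower,
        None => false,
    }
}
```

With the max kept as `Inexact(2000)`, a filter `ts >= 3000` can still skip the file; had the statistics been reset to `Absent`, the file would have to be read.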

@github-actions github-actions bot added physical-plan Changes to the physical-plan crate and removed core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 7, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Sep 7, 2025
@xudong963
Member

After the CI is green, I'll have another look

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 8, 2025
@adriangb
Contributor Author

adriangb commented Sep 8, 2025

@xudong963 CI is green

Member

@xudong963 xudong963 left a comment

🚀

@adriangb adriangb merged commit da1e7ae into apache:main Sep 9, 2025
28 checks passed
@adriangb adriangb deleted the stats-reuse branch September 9, 2025 16:01
Labels
common Related to common crate core Core DataFusion crate physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Smarter Statistics::with_fetch
2 participants