-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[df] Support MT snapshotting to RNTuple #19599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Test Results 21 files 21 suites 3d 5h 51m 4s ⏱️ For more details on these failures, see this check. Results for commit a0e8be9. ♻️ This comment has been updated with latest results. |
ced6d97
to
8a09b36
Compare
I am confused by the wording of this part of the commit log. |
for (decltype(values.size()) i = 0; i < values.size(); i++) { | ||
outputEntry->BindRawPtr(fFieldTokens[i], values[i]); | ||
} | ||
fillContext->Fill(*outputEntry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it meant to be FillNoFlush
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not needed here because UntypedSnapshotRNTupleHelper
exclusively owns the TFile
and there is only one RNTupleParallelWriter
appending to it. Without the need to lock on the user / "framework" side, there is no benefit to using FillNoFlush
(except it's longer and more code to write).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fillContext->Fill(*outputEntry); | |
// Any synchronization needed are handled by the underlying `RNTupleParallelWriter` | |
// which has exclusive access to its `TFile`. | |
fillContext->Fill(*outputEntry); |
Agreed, it's not well formulated. What I'm trying to say is that for sequential snapshotting (that is already supported before the PR), changing from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really really nice! I'm wondering if for completeness we should also add some of the existing tests as MT tests 🤔
... instead of the default entry. Then we also only need a bare model.
This is less expensive than string comparisons of field names during every call to Exec().
Switch the existing code to use the RNTupleParallelWriter with one RNTupleFillContext per slot. For sequential snapshotting, this should be (almost) as efficient as the RNTupleWriter (one additional cloned RNTupleModel for the only fill context), but save quite a bit of code duplication and in testing effort.
Use the same conditions as TTree, looking at fOutputFile instead of the data source.
8a09b36
to
a0e8be9
Compare
Yes, I wasn't sure either. We're of course massively benefiting that we use the exact same model creation and filling code that you already wrote all the tests for. Additionally, we have the challenge of MT scheduling, so even the best test case will have to deal with non-determinism and potentially still not test the relevant thing because all events end up on a single thread... |
That's fair, not to forget about the fact that the behavior of the parallel writer is already tested in isolation. I'm okay with leaving this like this then! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Would probably be good to get a second approval from someone else as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Looks good to me but I'll leave approval to RDF owners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot! I have one question left, does not need to be addressed in this PR specifically.
{ | ||
// In principle we would not need to flush a cluster here, but we want to benefit from parallelism for compression. | ||
// NB: RNTupleFillContext::FlushCluster() is a nop if there is no new entry since the last flush. | ||
fFillContexts[slot]->FlushCluster(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With Snapshot to TTree, users can specify a value (expressed in number of entries) for the size of the output TTree clusters via fAutoFlush. I wonder how this line impacts this feature, which at the moment is not supported for the Snapshot to RNTuple, but I imagine it would be requested at some point.
Switch the existing code to use the
RNTupleParallelWriter
with oneRNTupleFillContext
per slot. For sequential snapshotting, this should be (almost) as efficient as theRNTupleWriter
(one additional clonedRNTupleModel
for the only fill context), but save quite a bit of code duplication and in testing effort.