Full provenance of tasks executions #3447
pditommaso
started this conversation in
Ideas
Replies: 0 comments
Problem
The execution DAG created by Nextflow tracks the dependencies between processes, operators and any combination of them.
However, Nextflow is not able to track which task instance(s) triggered a specific task execution.
This means that it's possible to know, for example, that a process A is connected to a process B, e.g. A -> B. However, each of them can run n tasks, and it's not possible to establish which i-th task of A triggered the j-th task of B.
This information is important to precisely track the provenance of task executions and to solve other problems, such as the runtime cleanup of the temporary files created by each task (#452).
Heuristic solution
A possible solution to this problem could consist in using the unique hash id associated with each task execution.
The task hash is used to allocate a scratch work directory where all output files are created. For example, a task with id 9a001d8539def2552dd604427ec9aee1 will create all its output files in a directory with the path /some/work/dir/9a/001d8539def2552dd604427ec9aee1.
Therefore, all downstream tasks receiving these outputs as input files could use the hash encoded in the file path to infer the task instance that created those files and re-create the task execution DAG.
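The path-based inference could be sketched roughly as follows. This is only an illustration in Python, not Nextflow code: the function name `task_hash_from_path` is hypothetical, and the sketch assumes the work directory layout from the example above (a 2-character hex directory followed by a 30-character hex directory).

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumed layout, taken from the example path above:
#   <work dir>/<2 hex chars>/<30 hex chars>/<output file>
_HEAD = re.compile(r"^[0-9a-f]{2}$")
_TAIL = re.compile(r"^[0-9a-f]{30}$")


def task_hash_from_path(path: str) -> Optional[str]:
    """Return the task hash encoded in a file path, or None if there is none."""
    parts = PurePosixPath(path).parts
    # Scan adjacent path components for the <2 hex>/<30 hex> pattern.
    for head, tail in zip(parts, parts[1:]):
        if _HEAD.match(head) and _TAIL.match(tail):
            return head + tail
    return None


# A file inside a task work directory carries the hash of the task that made it.
found = task_hash_from_path(
    "/some/work/dir/9a/001d8539def2552dd604427ec9aee1/out.txt"
)

# A file written to an arbitrary location (hypothetical path) carries no hash.
missing = task_hash_from_path("/some/results/merged.txt")
```

Here `found` is the full 32-character hash, while `missing` is `None`, which is the case where the heuristic cannot recover the producing task.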
However, this approach does not provide a complete solution, because Nextflow processes can produce arbitrary values as outputs other than file paths, for which this approach cannot be used.
A similar problem can arise when a process execution is chained with an operator that can replace the output file path with an arbitrary path (e.g. collectFile), breaking the above assumption.
Object identity solution
This solution is to some extent similar to the previous approach; however, the main idea is to use the Java object identity associated with each input and output to track the relationships between tasks.
This could be done by storing in a dictionary structure the pair < object id, task id >, where object id is the identity hash code of the n-th output object, and task id is the Nextflow TaskRun.id attribute. Let's call this structure prov-map.
This information can be captured when the output of a task Tx is bound to the corresponding output channel, see here.
Correspondingly, when a new task execution Ty is triggered, each value in the list of inputs should be looked up in the prov-map using the input object's system identity as the access key.
If an entry is found, a relationship between the two tasks can be established and a direct edge Tx -> Ty can be added to the provenance graph.
The problem still remains for processes interleaved with operator execution, e.g. P1 -> map -> P2.
A similar approach could be taken, considering that all Nextflow operators are implemented as a DataflowProcessor class.
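To make the prov-map idea and its limitation with operators concrete, here is a minimal sketch in Python. It is not Nextflow code: `ProvMap`, `record_output`, `producer_of` and `edges_for` are hypothetical names, and Python's `id()` stands in for the Java identity hash code as an assumption of this illustration.

```python
from typing import Any, Dict, List, Optional, Set, Tuple


class ProvMap:
    """Hypothetical prov-map: object identity -> id of the producing task."""

    def __init__(self) -> None:
        # Keep a reference to each value so that its id() cannot be
        # reused by another object while the entry exists.
        self._entries: Dict[int, Tuple[Any, str]] = {}

    def record_output(self, value: Any, task_id: str) -> None:
        self._entries[id(value)] = (value, task_id)

    def producer_of(self, value: Any) -> Optional[str]:
        entry = self._entries.get(id(value))
        return entry[1] if entry else None


def edges_for(task_id: str, inputs: List[Any], prov_map: ProvMap) -> Set[Tuple[str, str]]:
    """Look up every input by identity and return the edges upstream -> task_id."""
    edges: Set[Tuple[str, str]] = set()
    for value in inputs:
        upstream = prov_map.producer_of(value)
        if upstream is not None:
            edges.add((upstream, task_id))
    return edges


prov_map = ProvMap()
x = ["sample.bam"]                 # output object bound by task Tx
prov_map.record_output(x, "Tx")

# The very same object reaches Ty, so the edge Tx -> Ty is found.
direct = edges_for("Ty", [x], prov_map)

# A map-like operator builds a new object, so the lookup finds nothing.
y = [name.upper() for name in x]
via_operator = edges_for("Ty", [y], prov_map)
```

In this sketch `direct` contains the single edge `("Tx", "Ty")`, whereas `via_operator` is empty, which is the P1 -> map -> P2 case that motivates tracking operators as well.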
The DataflowProcessor class allows the use of a listener interface whose methods beforeRun and afterRun can be used to propagate the task association to the corresponding output value. For example, let's consider the flow T1(x) -> map(x) -> T2(y):
1. T1 stores the pair < x, T1 > in the prov-map.
2. map receives the value x and invokes the beforeRun event listener.
3. The value x is resolved to T1 using the prov-map.
4. afterRun is invoked and the pair < y, T1 > is stored into the prov-map.
5. T2 receives the value y and fetches the value T1 from the prov-map.
6. The edge T1 -> T2 is created.
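The steps above could be sketched as follows. This is a Python illustration under stated assumptions, not the actual listener API: `ProvenanceListener`, `before_run`, `after_run` and `run_operator` are hypothetical stand-ins for the beforeRun/afterRun hooks, `id()` stands in for the Java identity hash code, and passing the operator's output to `after_run` is an assumption of this sketch, since the text does not specify how the listener obtains the value y.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# Hypothetical prov-map: object identity -> (value, ids of the originating tasks).
# The value is stored too, so its id() stays valid while the entry exists.
prov_map: Dict[int, Tuple[Any, Set[str]]] = {}


def record(value: Any, task_ids: Set[str]) -> None:
    prov_map[id(value)] = (value, set(task_ids))


def lookup(value: Any) -> Set[str]:
    entry = prov_map.get(id(value))
    return set(entry[1]) if entry else set()


class ProvenanceListener:
    """Hypothetical listener, loosely modelled on the beforeRun/afterRun hooks."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()

    def before_run(self, inputs: List[Any]) -> None:
        # Resolve each operator input to the task(s) it originated from.
        self._pending = set()
        for value in inputs:
            self._pending |= lookup(value)

    def after_run(self, outputs: List[Any]) -> None:
        # Associate the operator outputs with the same originating task(s).
        for value in outputs:
            record(value, self._pending)


def run_operator(fn: Callable[[Any], Any], value: Any, listener: ProvenanceListener) -> Any:
    """Run a map-like operator, invoking the listener hooks around it."""
    listener.before_run([value])
    result = fn(value)
    listener.after_run([result])
    return result


edges: Set[Tuple[str, str]] = set()

x = ["reads.fq"]
record(x, {"T1"})                                    # step 1: T1 stores < x, T1 >
y = run_operator(lambda v: [s.upper() for s in v],   # steps 2-4: map turns x into y
                 x, ProvenanceListener())
for upstream in lookup(y):                           # step 5: T2 resolves y to T1
    edges.add((upstream, "T2"))                      # step 6: edge T1 -> T2
```

The sketch keeps one pending set per listener and handles a single invocation at a time; a real implementation would also need to cope with concurrent operator runs and with operators that combine values from several tasks.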