Description
Crude outline of an idea that's been brewing for a while, written up as a placeholder. Needs much more thought!
Writing workflows in terms of data dependencies is clunky in Cylc. However, two simple changes could make this easier:
- Allow task outputs to accept task message patterns (rather than requiring an exact match).
- Provide triggering task outputs to tasks.
For context, see also #2764
Description
Cylc allows us to write fine-grained dependencies, e.g.:
    foo:file1 & bar:file2 => baz
The upstream task satisfies these outputs by running a cylc message command.
This example shows how a Cylc workflow can be written to follow an inputs/outputs paradigm rather than an abstract control flow paradigm:
#!Jinja2
{% set FILE1 = '$CYLC_WORKFLOW_SHARE_DIR/$CYLC_TASK_CYCLE_POINT/foo/file.dat' %}
{% set FILE2 = '$CYLC_WORKFLOW_SHARE_DIR/$CYLC_TASK_CYCLE_POINT/bar-output.csv' %}
[scheduling]
    [[graph]]
        R1 = foo:file1 & bar:file2 => baz
[runtime]
    [[foo]]
        # $FILE1 is expanded in the job environment, so no eval is needed
        script = mkdir -p "$(dirname "$FILE1")"; echo 'some data' > "$FILE1"; cylc message -- file1
        [[[environment]]]
            FILE1 = {{ FILE1 }}
        [[[outputs]]]
            file1 = file1
    [[bar]]
        script = mkdir -p "$(dirname "$FILE2")"; echo 'some,data' > "$FILE2"; cylc message -- file2
        [[[environment]]]
            FILE2 = {{ FILE2 }}
        [[[outputs]]]
            file2 = file2
    [[baz]]
        script = do-something-with "$FILE1" "$FILE2"
        [[[environment]]]
            FILE1 = {{ FILE1 }}
            FILE2 = {{ FILE2 }}
Note
An alternative way to achieve the above is cylc broadcast, which can be used to configure paths in downstream tasks.
However, this approach is a bit clunky because the file1 and file2 outputs are really abstract dependencies in disguise: they function as event-driven triggers, but they can't carry data.
If we supported patterns in task messages, this could be made much neater:
[scheduling]
    [[graph]]
        R1 = foo:file1 & bar:file2 => baz
[runtime]
    [[foo]]
        script = echo 'some data' > "somewhere"; cylc message -- "file1:somewhere"
        [[[outputs]]]
            file1 = file1:(.*)
    [[bar]]
        script = echo 'some,data' > "somewhere"; cylc message -- "file2:somewhere"
        [[[outputs]]]
            file2 = file2:(.*)
    [[baz]]
        script = do-something-with "$CYLC_TASK_INPUT_file1" "$CYLC_TASK_INPUT_file2"
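To make the pattern idea concrete, here is a minimal Python sketch of how a scheduler might match an incoming task message against output patterns and expose the captured data to a downstream task. This is not how Cylc currently behaves: the function names and the `OUTPUT_PATTERNS` table are hypothetical, and the `CYLC_TASK_INPUT_<output>` naming is simply borrowed from the example above.

```python
import re

# Hypothetical output definitions mirroring the [[[outputs]]] sections above:
# output name -> message pattern.
OUTPUT_PATTERNS = {
    "file1": r"file1:(.*)",
    "file2": r"file2:(.*)",
}


def match_message(message, patterns=OUTPUT_PATTERNS):
    """Return (output_name, captured_data) for a task message, or None.

    The whole message must match the pattern (re.fullmatch); the first
    capture group is treated as the data carried by the output.
    """
    for name, pattern in patterns.items():
        found = re.fullmatch(pattern, message)
        if found:
            return name, found.group(1)
    return None


def input_environment(messages, patterns=OUTPUT_PATTERNS):
    """Build the environment a downstream task could be given.

    Assumes one CYLC_TASK_INPUT_<output> variable per satisfied output.
    """
    env = {}
    for message in messages:
        matched = match_message(message, patterns)
        if matched:
            name, data = matched
            env[f"CYLC_TASK_INPUT_{name}"] = data
    return env


env = input_environment(["file1:somewhere", "file2:somewhere-else"])
print(env)
```

Requiring a full match, rather than a prefix match, keeps an exact-string output such as `file1` from being satisfied accidentally by a longer message.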
We could potentially go further than this by adding an explicit task [inputs] section for fully flexible mapping and many other things besides.
Blue Sky Ideas (Speculative)
- Tasks declare both inputs and outputs.
- Cylc provides an easier interface for declaring an output as satisfied than using cylc message manually (e.g. cylc output <output-name> <file-path/data>; note that the message template would be handled by Cylc internally).
- Cylc provides the ability to automatically sync outputs between install targets (when the output is a file path); see the --rsync option in the cylc clean proposal.
- The Cylc GUI links to "artefacts" (i.e. output files) via Jupyter Lab, using this cylc-ui feature.
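As a rough illustration of the "easier interface" idea, the sketch below shows how a wrapper could build the message from an output name and its data, so that the user never writes the template by hand. Everything here is an assumption: the `<name>:<data>` template and both function names are hypothetical, and only the `cylc message -- <message>` form is taken from the examples above.

```python
import shlex


def build_message(output_name, data):
    """Compose the message for an output.

    Assumes a "<name>:<data>" template; in the proposal the real template
    would be owned by Cylc, not by the user.
    """
    return f"{output_name}:{data}"


def message_command(output_name, data):
    """Return the `cylc message` command line the wrapper would run."""
    return ["cylc", "message", "--", build_message(output_name, data)]


cmd = message_command("file1", "/path/to/file.dat")
print(shlex.join(cmd))  # shown for inspection only; nothing is executed
```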
Proposal (Imminent)
- Allow task outputs to accept patterns (to carry data):
  - Note, we need to reject potentially duplicate outputs (duplicate output messages #6056).
  - In theory we could support multiple named regex patterns within a message to carry multiple pieces of data. Good idea?
- Provide triggering outputs to the task (to access the data):
  - Note, we once had a CYLC_TASK_DEPENDENCIES variable, but there was too much data to hold in the environment so it was scrapped (retire CYLC_TASK_DEPENDENCIES #5764).
  - We could investigate file-based solutions, e.g. JSON, CSV or YAML.
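Two of the points above, named regex groups and a file-based hand-off, can be sketched together in Python. This is only an illustration under stated assumptions: the pattern, the `path` and `checksum` group names, the `inputs.json` file name and the `task:output` keys are all hypothetical and not part of Cylc.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical output pattern carrying two pieces of data via named groups.
PATTERN = re.compile(r"file1:(?P<path>[^:]+):(?P<checksum>[0-9a-f]+)")


def parse_output(message):
    """Return the named groups of a matching message as a dict, else None."""
    found = PATTERN.fullmatch(message)
    return found.groupdict() if found else None


def write_inputs_file(directory, triggering_outputs):
    """Write triggering outputs to a JSON file and return its path.

    triggering_outputs maps "task:output" to the data parsed from its
    message. A file avoids holding large amounts of data in the environment.
    """
    path = Path(directory) / "inputs.json"  # hypothetical file name
    path.write_text(json.dumps(triggering_outputs, indent=2))
    return path


data = parse_output("file1:/share/foo/file.dat:d41d8cd9")
with tempfile.TemporaryDirectory() as tmp:
    inputs_file = write_inputs_file(tmp, {"foo:file1": data})
    # A downstream task would read the file instead of environment variables.
    loaded = json.loads(inputs_file.read_text())
print(loaded)
```

With a file, the job environment would only need to carry a single path, regardless of how many outputs triggered the task.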