task message: support file path patterns #6811

@oliver-sanders

Crude outline of an idea that's been brewing for a while, written up as a placeholder. Needs much more thought!

Writing workflows in terms of data dependencies is clunky in Cylc. However, two simple changes could make this easier:

  1. Allow task outputs to accept task message patterns (rather than requiring an exact match).
  2. Provide each task with the upstream outputs that triggered it (so it can access the data they carry).

For context, see also #2764

Description

Cylc allows us to write fine-grained dependencies, e.g.:

foo:file1 & bar:file2 => baz

The upstream task satisfies these outputs by running a cylc message command.

This example shows how a Cylc workflow can be written to follow an inputs/outputs paradigm rather than an abstract control flow paradigm:

#!Jinja2

{% set FILE1 = '$CYLC_WORKFLOW_SHARE_DIR/$CYLC_TASK_CYCLE_POINT/foo/file.dat' %}
{% set FILE2 = '$CYLC_WORKFLOW_SHARE_DIR/$CYLC_TASK_CYCLE_POINT/bar-output.csv' %}

[scheduling]
  [[graph]]
    R1 = foo:file1 & bar:file2 => baz

[runtime]
  [[foo]]
    script = mkdir -p "$(dirname "$FILE1")"; echo 'some data' > "$FILE1"; cylc message -- file1
    [[[environment]]]
      FILE1 = {{ FILE1 }}
    [[[outputs]]]
      file1 = file1

  [[bar]]
    script = mkdir -p "$(dirname "$FILE2")"; echo 'some,data' > "$FILE2"; cylc message -- file2
    [[[environment]]]
      FILE2 = {{ FILE2 }}
    [[[outputs]]]
      file2 = file2

  [[baz]]
    script = do-something-with "$FILE1" "$FILE2"
    [[[environment]]]
      FILE1 = {{ FILE1 }}
      FILE2 = {{ FILE2 }}

Note

An alternative way to achieve the above is cylc broadcast, which can be used to configure paths in downstream tasks.

However, the example above is a bit clunky because the file1 and file2 outputs are really abstract dependencies in disguise: they function as event-driven triggers, but they can't carry data.

If we supported patterns in task messages, this could be made much neater:

[scheduling]
  [[graph]]
    R1 = foo:file1 & bar:file2 => baz

[runtime]
  [[foo]]
    script = echo 'some data' > "somewhere"; cylc message -- "file1:somewhere"
    [[[outputs]]]
      file1 = file1:(.*)

  [[bar]]
    script = echo 'some,data' > "somewhere"; cylc message -- "file2:somewhere"
    [[[outputs]]]
      file2 = file2:(.*)

  [[baz]]
    script = do-something-with "$CYLC_TASK_INPUT_file1" "$CYLC_TASK_INPUT_file2"
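
To make the idea more concrete, here is a minimal Python sketch of how incoming messages might be matched against output patterns and turned into the CYLC_TASK_INPUT_* variables used above. It is not Cylc's implementation: the function names (match_output, input_environment), the use of re.fullmatch, and the rule of rejecting a message that satisfies more than one output are all assumptions made for illustration.

```python
import re

# Hypothetical output definitions mirroring the [[[outputs]]] sections above:
# output name -> message pattern.
OUTPUT_PATTERNS = {
    "file1": r"file1:(.*)",
    "file2": r"file2:(.*)",
}


def match_output(message, patterns=OUTPUT_PATTERNS):
    """Return (output_name, captured_data) for a task message, or None.

    Raises ValueError if the message satisfies more than one output,
    because the triggering output would then be ambiguous.
    """
    hits = []
    for name, pattern in patterns.items():
        found = re.fullmatch(pattern, message)
        if found:
            # Patterns without a capture group carry no data.
            data = found.group(1) if found.groups() else None
            hits.append((name, data))
    if len(hits) > 1:
        names = ", ".join(name for name, _ in hits)
        raise ValueError(f"message {message!r} matches several outputs: {names}")
    return hits[0] if hits else None


def input_environment(messages, patterns=OUTPUT_PATTERNS):
    """Build the CYLC_TASK_INPUT_<output> variables for a downstream task."""
    env = {}
    for message in messages:
        hit = match_output(message, patterns)
        if hit is not None and hit[1] is not None:
            name, data = hit
            env[f"CYLC_TASK_INPUT_{name}"] = data
    return env
```

With the workflow above, the messages "file1:somewhere" and "file2:somewhere" would give baz an environment containing CYLC_TASK_INPUT_file1 and CYLC_TASK_INPUT_file2, both set to "somewhere". Exact-match outputs still work under this scheme, since a pattern with no special characters only matches itself.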

We could potentially go further than this by adding an explicit task [inputs] section for fully flexible mapping and many other things besides.

Blue Sky Ideas (Speculative)

  • Tasks declare both inputs and outputs.
  • Cylc provides an easier interface for declaring an output as satisfied than using cylc message manually (e.g. cylc output <output-name> <file-path/data>; note that the message template would be handled by Cylc internally).
  • Cylc provides the ability to automatically sync outputs between install targets (when the output is a file path); see the --rsync option in the cylc clean proposal.
  • The Cylc GUI links to "artefacts" (i.e. output files) in Jupyter Lab via this cylc-ui feature.

Proposal (Imminent)

  1. Allow task outputs to accept patterns (to carry data):
    • Note: we need to reject potentially duplicate outputs (see "duplicate output messages" #6056).
    • In theory, we could support multiple named regex groups within a message to carry multiple pieces of data. Good idea?
  2. Provide triggering outputs to the task (to access the data):
    • Note: we once had a CYLC_TASK_DEPENDENCIES variable, but there was too much data to hold in the environment, so it was scrapped (see "retire CYLC_TASK_DEPENDENCIES" #5764).
    • We could investigate file-based solutions, e.g. JSON, CSV, YAML, etc.
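
As a rough illustration of the two open questions in this proposal, the Python sketch below uses named regex groups to pull several pieces of data out of one message, then writes the captured data to a JSON file instead of the environment. Everything here is hypothetical: the message layout ("file1:<path> checksum:<hex>"), the file name task-inputs.json, and the helper names are assumptions, not an agreed design.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical output pattern carrying two pieces of data via named groups.
PATTERN = re.compile(r"file1:(?P<path>\S+) checksum:(?P<checksum>[0-9a-f]+)")


def parse_message(message):
    """Return the named captures of a message as a dict, or None if no match."""
    found = PATTERN.fullmatch(message)
    return found.groupdict() if found else None


def write_inputs_file(inputs, directory):
    """Write {output_name: captured_data} as JSON and return the file path."""
    path = Path(directory) / "task-inputs.json"
    path.write_text(json.dumps(inputs, indent=2, sort_keys=True))
    return path


def read_inputs_file(path):
    """Load the triggering-output data back, as a downstream task might."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    data = parse_message("file1:/data/file.dat checksum:ab12")
    with tempfile.TemporaryDirectory() as tmp:
        inputs_file = write_inputs_file({"file1": data}, tmp)
        print(read_inputs_file(inputs_file))
```

A file sidesteps the environment-size problem noted above, at the cost of the task having to parse it; JSON is used here only because it round-trips nested data with the standard library.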
