The Renku Workflow File
If your data processing workflow has many steps, you might find it convenient to create a workflow definition file. A workflow file help organize your data processing flow, making it easier to execute portions of the pipeline at a time. A workflow file also makes your code easier for your collaborators to understand!
Adding a Workflow File to your Project
To create a workflow definition file in your project, create a file called
workflow.yml
.
You may have more than one workflow file in your project! Workflow files may be named however you like, as long as they end with the .yml or yaml file extension.
Note
Do you already have a workflow you created on the Renku CLI that you’d like to convert to a Workflow File?
You can export it! renku workflow export <workflow-name> --format renku --output workflow.yml
(Tip: Find your workflow’s name using renku workflow ls
)
Defining a Basic Workflow File
There are a few options for how you may define your workflow.
In the simplest version of the Renku workflow file, the command, inputs and outputs are simply listed, as in the example below:
name: data-pipeline
steps:
filter:
command: python src/filter.py data/input/flights.csv data/output/filtered.csv
inputs:
- src/filter.py
- data/input/flights.csv
outputs:
- data/output/filtered.csv
Note
Note that the script being run is always listed in the inputs
section (here, src/filter.py
).
This workflow file defines the workflow’s name and a sequence of steps. This
file only includes one step, which is named filter
. Within the filter
step, list the command
to run, and then we tell Renku which parts of this
command are inputs
and outputs
by copying those paths into the relevant
sections.
To run this workflow file, run it with renku run
:
$ renku run workflow.yml
Using Templating in a Workflow File
Renku provides a templating feature so that you never have to type the same path twice. There are a few different ways to use templating in your workflow file, and they can be mixed and matched depending on what works best for your command.
Templating by Group
In the example above, all of the inputs go together into the same place in the
command, followed by the output. So, rather than listing each input, output, and
parameter in the command individually, you can use the $inputs
,
$outputs
, and $parameters
templates to tell Renku “put all the inputs
here”.
name: data-pipeline
steps:
filter:
command: python $parameters $inputs $outputs
inputs:
- src/filter.py
- data/input/flights.csv
outputs:
- data/output/filtered.csv
parameters:
- -n
- 10
The inputs are filled in to the command at the $inputs
template in the order
in which they are specified in the inputs
section. The same goes for
$outputs
and $parameters
.
Templating by Argument Name
If the ordering of arguments in your command is more complex, you can reference
each argument individually by name. To do so, assign each input and output a
name (such as raw
) and a path
. Then, we reference those names in
the command
using $
.
name: data-pipeline
steps:
filter:
command: python $n $filter-py $raw $filtered
inputs:
- filter-py:
path: src/filter.py
- raw:
path: data/input/flights.csv
outputs:
- filtered:
path: data/output/filtered.csv
parameters:
- n:
prefix: -n
value: 10
Note
Renku uses basic YAML syntax for workflow definition files. Users should not use advanced YAML syntax like anchors, aliases, schema, etc. since the behavior is undefined. Moreover, in future we will implement a customized YAML parser that won’t allow these features.
Note
If your command uses the $
character, you can escape it by doing $$
.
A Multi-Step Workflow File
Below, you can see what the a workflow file looks like for a two-step workflow.
name: data-pipeline
steps:
filter:
command: python $filter-py $raw $filtered
inputs:
- filter-py:
path: src/filter.py
- raw:
path: data/input/flights.csv
outputs:
- filtered:
path: data/output/filtered.csv
count:
command: python $count-py $filtered $counts
inputs:
- count-py:
path: src/count.py
- filtered:
path: data/output/filtered.csv
outputs:
- counts:
path: data/output/counts.csv
Executing a Workflow File
Running renku run workflow.yml
will execute all steps
in the workflow file. Executing the workflow will commit all workflow inputs and
outputs, too, including the workflow file itself.
$ renku run workflow.yml
Executing step 'data-pipeline.filter': 'python src/filter.py data/input/flights.csv data/output/filtered.csv' ...
Executing step 'data-pipeline.count': 'python src/count.py data/output/filtered.csv data/output/counts.csv' ...
Note
Do you have output files you don’t want to be committed, such as log files?
You have 2 options: (1) Do not list these outputs in the workflow definition
file, and Renku will ignore them. Or, (2) include the file in the workflow
file, but use the persist: false
flag to tell Renku not to commit the
file.
Executing a Portion of a Workflow
Renku also helps you run only portions of your workflow at a time. For example, you can execute just one step of the workflow by referencing that step’s name:
$ renku run workflow.yml filter
You may specify more than one step to run:
$ renku run workflow.yml filter count
Workflow Step Execution Order
When you execute a workflow file, Renku builds an execution graph to determine how the steps in the workflow are related. Renku then executes the steps in that order. This means that only the data dependencies between steps determine the execution order, not the order of steps in the workflow file.
The --dry-run
and --no-commit
flags
By passing the --dry-run
flag to the renku run
command, you can instruct
Renku to only print the order of execution of the steps without actually running
any of them.
The --no-commit
flags causes Renku to run the workflow file but it won’t
create a commit after the execution. Renku also won’t create any metadata in
this case. This is a great option to use when developing or verifying a workflow!
Adding more Information to a Workflow File
Implicit Input and Output Files
If your script consumes or generates an input or output that is not explicitly
passed in the command, you may still list the file in the workflow file so that
it is tracked by Renku. When doing so, also add the implicit: true
key;
otherwise, Renku will warn that the file is not used in the command string.
name: script-with-implicit-input
steps:
filter:
command: python $my-script
inputs:
- my-script:
path: my-script.py
- hidden-input:
path: data/an-input.txt
implicit: true
Descriptions and Keywords
You may provide further details in your workflow definition, such as a description of each parameter, and keywords that describe your workflow.
name: data-pipeline
description: The workflow in the Renku Tutorial
keywords:
- tutorial
steps:
filter:
command: python $filter-py $raw $filtered
description: Filter the raw flights data to only flights to the destination of interest
inputs:
- filter-py:
path: src/filter.py
- raw:
description: The raw flights data
path: data/input/flights.csv
outputs:
- filtered:
description: Flights to the destination of interest
path: data/output/filtered.csv
count:
command: python $count-py $filtered $counts
description: Count the number of flights
inputs:
- count-py:
path: src/count.py
- filtered:
description: Flights to the destination of interest
path: data/output/filtered.csv
outputs:
- counts:
description: Number of flights to the destination of interest
path: data/output/counts.csv
Alternative Success Codes
By default, Renku considers a workflow step to have successfully executed if it returns a success code of 0. If the command is expected to return a success code other an 0, specify the acceptable codes in a success_codes key:
name: command-with-alternative-success-codes
steps:
head:
command: head -n 10 data/collection/models.csv data/collection/colors.csv > intermediate
success_codes: [0, 127]
...
Viewing a Workflow Visually
After executing a workflow, you can view a visual diagram of how any file created by that workflow was created.
To view this diagram, run renku workflow visualize
and pass the path to the file you would like to inspect:
$ renku workflow visualize data/output/counts.csv
┌─────────────────────────────────────────┐ ┌─────────────┐ ┌──────────────────────┐
│workflows/workflow-flights-tutorial-3.yml│ │src/filter.py│ │data/input/flights.csv│
└─────────────────────────────────────────┘ └─────────────┘ └──────────────────────┘
* ******* *** ***
* ************ **** *****
* ************** **** ****
* ************* ╔═══════════════════════╗
* **║python src/filter.py...║
* ╚═══════════════════════╝
* *
* *
* *
┌────────────┐ * ┌────────────────────────┐
│src/count.py│ * │data/output/filtered.csv│
└────────────┘ *** └────────────────────────┘
********* ***** *****
************ ***** ********
************* **** *********
************* ╔══════════════════════╗ *****
**║python src/count.py...║
╚══════════════════════╝
*
*
*
┌──────────────────────┐
│data/output/counts.csv│
└──────────────────────┘