Provenance of results¶
One of the most important ideas behind Renku is the concept of capturing the provenance of the analysis process. If you write a piece of code that takes some input data, processes it, and writes some output to disk, the provenance graph records the connections between the input data, the code, and the output.
A result may also be used as input data to a subsequent step, extending the graph.
In a real analysis, such a graph may become very complex. Without a detailed record of the connections between the different data, code and result blocks, it may be impossible to efficiently regenerate parts of the chain. Keeping track of the provenance allows us to easily recreate the final result if the original raw data changes, for example, or to examine what happens when we change our preprocessing pipeline. Recording provenance is also critical for enabling data and code audits, should they be required.
We hope that using Renku will encourage people to share their data, results, and analysis codes. By capturing the provenance not only within, but also across projects we ensure that if you use someone else’s results you can always track exactly where they came from. Conversely, you can also see how someone is using your shared data or code in their analysis. Renku will allow you to explore these connections in detail.
Recording Provenance in Renku¶
Keeping track of provenance manually is a tedious process. In Renku we try to automate this as much as possible by providing a simple command-line interface which, when used correctly, should take care of provenance recording for you. The basic idea is as follows: anything you run in the terminal to produce a result simply needs to have renku run prepended to it and you are done. This will work best if these assumptions are met:
The code which is run to compute a result can be started from the terminal.
The data inputs are specified as arguments to the command.
The data outputs are written within the project directory tree, not outside of it (i.e. they cannot be in a parent directory).
An example execution would look something like:
$ renku run --name run-analysis -- python run_analysis.py -i inputs -o outputs
Wrapping the execution of python run_analysis.py with renku run had the following consequences:
The command was executed.
If it completed successfully, a Plan entry was created with its name set to the name specified by --name. This is a workflow specification allowing the command to be re-executed with potentially modified inputs/outputs in the Renku workflow system. Think of Plans as recipes for executing workflows.
A Run entry was created, which is a record of this execution. Runs allow keeping track of what was done in a project and how files were created, ensuring reproducibility of data.
Everything was committed to the git repository.
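To see what was recorded, the Renku CLI can list and inspect workflows. A sketch, using the run-analysis name from the example above (the exact output format may vary between Renku versions):

$ renku workflow ls
$ renku workflow show run-analysis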
Renku uses its own Knowledge Graph based approach to store metadata about workflow executions and recipes (Runs and Plans, respectively). It has a plugin system that allows exporting these workflows to various workflow languages, such as the Common Workflow Language (CWL), as well as executing them with different workflow engines.
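For example, assuming the run-analysis Plan created above, an export to CWL might look like this (a sketch; consult renku workflow export --help for the flags supported by your Renku version):

$ renku workflow export --format cwl run-analysis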
Applying the Provenance¶
In Renku, we want to provide tools that not only record the provenance but also give you easy access to its benefits. Once the provenance is recorded, there are several ways in which it becomes immediately useful. The most common usage is to renku update results when any of the input data or code dependencies change. Because Renku knows exactly which results depend on a particular input, it can recompute only the necessary steps rather than the entire pipeline, potentially avoiding expensive calculations in complex settings. In addition, recorded workflows can be executed independently using renku workflow execute. To get familiar with the basic functionality, head to renkulab.io and follow the First Steps tutorial. See also the RenkuLab knowledge graph documentation.
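As a sketch of the commands mentioned above, again assuming the run-analysis Plan from the earlier example: after modifying an input file, outdated results can be regenerated, or the recorded workflow re-executed on its own:

$ renku update --all
$ renku workflow execute run-analysis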