Renku CLI

The core of the Renku Project is the renku command-line interface (CLI), which offers tools for easily capturing your data-science process as you work. With these tools, you can describe and annotate data and workflows, providing information that is used to build the lineage of your results, simplifying iterative development and making your work reproducible. The CLI can be used within Renkulab or locally, on your own machine.

The importance of version control for working with code is widely recognized. renku aims to be “git for research”, by extending version control to encompass elements central to research data and processes.

If that’s too abstract, you can check out the First Steps Tutorial.

renku can be decomposed into the following pieces, which are exposed to the user.

git

Data and code change frequently in a typical project. Knowing which exact version of code and data produced a particular result is critical for ensuring the robustness and veracity of your work. In Renku, version control is the base upon which everything else is built.

We rely on git, currently the most widespread version control system. If you are unfamiliar with git, it is worth reading at least some of the excellent tutorials available for it. In Renku we try to take care of most of the boilerplate git commands for you, but you should still be aware that git is being used under the hood.

One additional benefit of using a version control system like git is that it encourages you to be creative and explore new ideas without fear of breaking things. With git, you can experiment with complete peace of mind, knowing that you can always restore the last working version of your project if everything happens to go off the rails. This is a fantastic advantage in data science, where experimentation is a critical part of the discovery process.

Note that Renku also makes use of git-LFS, which allows keeping not only the code but also the data related to an analysis under version control, while keeping the git repository itself small.

Whenever you execute a command that changes the contents of your project, renku creates a git commit that records what was added or changed:

  • the commit includes some internal metadata with detailed information about what was done
  • the commit message contains the command you executed, so you can check the git log to see what you did in the past (running a workflow, creating a dataset, or initializing a project)
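Because the executed command ends up in the commit message, the project history can be filtered with ordinary git tooling. The following is a minimal sketch, not part of renku itself: the helper names `renku_commits` and `read_git_log` are hypothetical, the sample log text is made up, and it assumes that the commit subject contains the literal text `renku ` followed by the command. The exact message layout may differ between renku versions.

```python
import subprocess


def renku_commits(log_text):
    """Return (short_hash, subject) pairs whose subject mentions a renku command.

    Expects one commit per line in the form 'hash<TAB>subject'.
    """
    commits = []
    for line in log_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        short_hash, subject = line.split("\t", 1)
        # Assumption: the recorded command appears as "renku ..." in the subject.
        if "renku " in subject:
            commits.append((short_hash, subject))
    return commits


def read_git_log(repo_path="."):
    """Run `git log` in repo_path and return 'hash<TAB>subject' lines."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%h%x09%s"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical log text, made up for illustration only.
sample = "a1b2c3d\trenku run python analyze.py\n9f8e7d6\tFix typo in README\n"
print(renku_commits(sample))
```

Inside a real project, `renku_commits(read_git_log())` would apply the same filter to the actual history.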

External storage (git-LFS by default)

Git is very efficient at handling text-based files, but is not ideal for binary data. For reproducibility, though, it is necessary to version data and keep track of it together with the analysis code. Therefore, by default, Renku projects use git-LFS for handling data. If you use renku commands, most of the data handling is done for you. For example:

  • when you add data with renku dataset commands
  • when you call renku run to generate output data

Files that are flagged for storage in git-LFS are automatically separated from the other repository files when you push to the server. If the repository is cloned elsewhere, they can be retrieved from the external (LFS) storage when needed.

Keeping large files in LFS gives users the ability to control the amount of local space used by a project; LFS files can be left as pointers, and take up virtually no space, or can be pulled if needed.
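To make the idea of a pointer concrete: a file that has not been pulled from LFS is a tiny text file with `version`, `oid`, and `size` fields, as defined by the git-LFS pointer format. The sketch below shows one way to tell a pointer apart from real content. The function name `parse_lfs_pointer` is hypothetical, and the `oid` and `size` values are illustrative, not taken from a real project.

```python
def parse_lfs_pointer(text):
    """Return {'oid': ..., 'size': ...} if text looks like a git-LFS pointer, else None."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" ")
        if sep:  # keep only 'key value' lines
            fields[key] = value

    # A pointer names the LFS spec in its version line.
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        return None
    # It must also carry the content hash and the size of the real file.
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


# Illustrative pointer contents (example values only).
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393\n"
    "size 12345\n"
)

print(parse_lfs_pointer(pointer))            # a dict describing the real file
print(parse_lfs_pointer("import numpy\n"))   # None: ordinary file content
```

The `size` field is the size of the real file in bytes, so the pointer itself stays at around a hundred bytes no matter how large the data it stands in for.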

Renku provides a convenience command renku storage pull for retrieving data from LFS. Similarly, any renku command (e.g. renku run) will check whether the data it needs is stored in LFS, and if so, it will preemptively fetch it.

Datasets

  • import & publish datasets from/to repositories like Zenodo and Dataverse that have DOIs
  • auto-populate metadata for imported datasets (and created datasets based on their origins)
  • user-annotation of datasets with schema.org or domain-specific metadata

Lineage of results

Capturing the lineage of results is critical for understanding what input data were used, what code was run, and what results were produced.

The renku CLI gives researchers and analysts simple tools to:

  • track lineage for a workflow (generate a graph that shows input, execution, and output nodes)
  • iteratively develop a workflow (keep making changes to the code/data until you get the output you want)
  • compare outputs generated by the same (maybe stochastic) workflow: renku rerun
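The lineage graph mentioned above can be pictured as a small data structure in which each execution links its inputs to its outputs. The sketch below is a conceptual illustration only and does not reflect how renku stores or queries lineage internally; the `Lineage` class, its methods, and the file names are all hypothetical. It shows why recorded lineage makes iterative development practical: once the graph exists, it is easy to work out which results depend on a changed file.

```python
from collections import defaultdict


class Lineage:
    """Toy lineage graph: files are nodes, recorded executions connect them."""

    def __init__(self):
        self.consumers = defaultdict(list)  # file -> steps that read it
        self.outputs = {}                   # step -> files it writes

    def record(self, step, inputs, outputs):
        """Record one execution with the files it read and wrote."""
        for path in inputs:
            self.consumers[path].append(step)
        self.outputs[step] = list(outputs)

    def downstream(self, changed):
        """Return every file that depends, directly or indirectly, on `changed`."""
        stale, queue = [], [changed]
        while queue:
            path = queue.pop(0)
            for step in self.consumers.get(path, []):
                for out in self.outputs[step]:
                    if out not in stale:  # also guards against cycles
                        stale.append(out)
                        queue.append(out)
        return stale


graph = Lineage()
graph.record("clean", inputs=["raw.csv", "clean.py"], outputs=["clean.csv"])
graph.record("plot", inputs=["clean.csv", "plot.py"], outputs=["figure.png"])

print(graph.downstream("raw.csv"))  # both clean.csv and figure.png depend on it
print(graph.downstream("plot.py"))  # only figure.png depends on it
```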

Consult the CLI documentation for more!

Installing

If you wish to forgo using Renkulab, or need to interact with your project on your own machine, you can follow these installation instructions for running renku locally.