renku dataset

Renku CLI commands for handling of datasets.

Description

Create, edit and manage the datasets in your Renku project.

This is a core feature of Renku. You might want to go through the examples listed below to get an idea of how you can create, import, and edit datasets.

Commands and options

renku dataset

Dataset commands.

renku dataset [OPTIONS] COMMAND [ARGS]...

add

Add data to a dataset.

renku dataset add [OPTIONS] NAME [URLS]...

Options

-f, --force: Allow adding otherwise ignored files.

-o, --overwrite: Overwrite existing files.

-c, --create: Create dataset if it does not exist.

-d, --destination <destination>: Destination directory within the dataset path

--datadir <datadir>: Dataset’s data directory (defaults to ‘data/<dataset name>’).

-r, --ref <revision>: Add files from a specific commit/tag/branch.

-s, --source <sources>: Path(s) within remote git repo to be added

--ln, --link: Symlink files to the dataset’s data directory. Mutually exclusive with –copy and –move.

--mv, --move: Move files to the dataset’s data directory. Mutually exclusive with –copy and –link.

--cp, --copy: Copy files to the dataset’s data directory. Mutually exclusive with –move and –link.

-e, --external: Creates a link to external data.

--storage <storage>: Uri for the S3 bucket when creating the dataset at the same time when running ‘add’

Arguments

NAME: Required argument

URLS: Optional argument(s)

create

Create an empty dataset in the current repo.

renku dataset create [OPTIONS] NAME

Options

-t, --title <title>: Title of the dataset.

-d, --description <description>: Dataset’s description.

-c, --creator <creators>: Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.

-m, --metadata <metadata>: Custom metadata to be associated with the dataset.

-k, --keyword <keyword>: List of keywords.

-s, --storage <storage>: URI of the storage backend.

--datadir <datadir>: Dataset’s data directory (defaults to ‘data/<dataset name>’).

Arguments

NAME: Required argument

edit

Edit dataset metadata.

renku dataset edit [OPTIONS] NAME

Options

-t, --title <title>: Title of the dataset.

-d, --description <description>: Dataset’s description.

-c, --creator <creators>: Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.

-m, --metadata <metadata>: Custom metadata to be associated with the dataset.

-k, --keyword <keywords>: List of keywords or tags.

-u, --unset <unset>

Remove the given field (keywords, images, or metadata) from the dataset.

Options: keywords | k | images | i | metadata | m

Arguments

NAME: Required argument

export

Export data to 3rd party provider.

renku dataset export [OPTIONS] NAME {zenodo|olos|dataverse|local}

Options

-t, --tag <tag>: Dataset tag to export

--publish: Publish the exported dataset.

--dataverse-name <dataverse_name>: Dataverse name to export to.

--dataverse-server <dataverse_server>: Dataverse server URL.

-p, --path <path>: Path to copy data to.

--dlcm-server <dlcm_server>: DLCM server base url.

--publish: Publish the exported dataset.

Arguments

NAME: Required argument

PROVIDER: Required argument

import

Import data from a 3rd party provider or another renku project.

Supported providers: [Dataverse, Renku, Zenodo]

renku dataset import [OPTIONS] URI

Options

--short-name, --name <name>: A convenient name for the dataset.

-x, --extract: Extract files before importing to dataset.

-y, --yes: Bypass download confirmation.

--datadir <datadir>: Dataset’s data directory (defaults to ‘data/<dataset name>’).

--tag <tag>: Import a specific tag instead of the latest version.

Arguments

URI: Required argument

ls

List datasets.

renku dataset ls [OPTIONS]

Options

--format <format>

Choose an output format.

Options: tabular | json-ld | json

-c, --columns <columns>

Comma-separated list of columns to display: id, created, date_created, short_name, name, creators, creators_full, tags, version, title, keywords, description, storage, datadir.

Default: id,name,title,version,datadir

ls-files

List files in dataset.

renku dataset ls-files [OPTIONS] [NAMES]...

Options

-t, --tag <tag>: Tag for which to show dataset files.

--creators <creators>: Filter files which where authored by specific creators. Multiple creators are specified by comma.

-I, --include <include>: Include files matching given pattern.

-X, --exclude <exclude>: Exclude files matching given pattern.

--format <format>

Choose an output format.

Options: tabular | json-ld | json

-c, --columns <columns>

Comma-separated list of columns to display: added, checksum, creators, creators_full, dataset, full_path, path, short_name, dataset_name, size, lfs, source.

Default: dataset_name,added,size,path,lfs

Arguments

NAMES: Optional argument(s)

ls-tags

List all tags of a dataset.

renku dataset ls-tags [OPTIONS] NAME

Options

--format <format>

Choose an output format.

Options: tabular | json-ld

Arguments

NAME: Required argument

rm-tags

Remove tags from a dataset.

renku dataset rm-tags [OPTIONS] NAME [TAGS]...

Arguments

NAME: Required argument

TAGS: Optional argument(s)

show

Show metadata of a dataset.

renku dataset show [OPTIONS] NAME

Options

-t, --tag <tag>: Tag for which to show dataset metadata.

Arguments

NAME: Required argument

tag

Create a tag for a dataset.

renku dataset tag [OPTIONS] NAME TAG

Options

-d, --description <description>: A description for this tag

-f, --force: Allow overwriting existing tags.

Arguments

NAME: Required argument

TAG: Required argument

unlink

Remove matching files from a dataset.

renku dataset unlink [OPTIONS] NAME

Options

-I, --include <include>: Include files matching given pattern.

-X, --exclude <exclude>: Exclude files matching given pattern.

-y, --yes: Confirm unlinking of all files.

Arguments

NAME: Required argument

update

Updates files in dataset from a remote Git repo.

renku dataset update [OPTIONS] [NAMES]...

Options

--creators <creators>: Filter files which where authored by specific creators. Multiple creators are specified by comma.

-I, --include <include>: Include files matching given pattern.

-X, --exclude <exclude>: Exclude files matching given pattern.

--ref <ref>: Update to a specific commit/tag/branch.

--delete: Delete local files that are deleted from remote.

-e, --external: Deprecated

--no-external: Skip updating external data.

--no-local: Skip updating local files.

--no-remote: Skip updating remote files.

-c, --check-data-directory: Check datasets’ data directories for new files.

-a, --all: Update all datasets.

-n, --dry-run: Show what would have been changed

--plain: Show result as one entry per line for machine readability. ‘d’ = dataset update, ‘f’ = file update, ‘r’ = file removed.

Arguments

NAMES: Optional argument(s)

Examples

Create an empty dataset inside a Renku project:
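As a minimal sketch, an empty dataset can be created with the options documented for the create command. The function name create_dataset and the RENKU override variable are hypothetical conveniences for previewing the command, not part of Renku, and any names, titles, or creators passed to it are placeholders.

```shell
# RENKU is a hypothetical override: set RENKU=echo to print the arguments
# instead of running Renku.
RENKU="${RENKU:-renku}"

# create_dataset NAME TITLE CREATOR
# CREATOR follows the documented format: 'Forename Surname <email> [affiliation]'.
create_dataset() {
    "$RENKU" dataset create "$1" --title "$2" --creator "$3"
}
```

For example, create_dataset my-dataset "My Dataset" "Jane Doe <jane.doe@example.com> [SDSC]" would create a dataset named my-dataset with that title and creator.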

When listing datasets with renku dataset ls, you can select which columns to display by using --columns to pass a comma-separated list of column names:

$ renku dataset ls --columns id,name,date_created,creators
ID        NAME           CREATED              CREATORS
--------  -------------  -------------------  ---------
0ad1cb9a  some-dataset   2020-03-19 16:39:46  sam
9436e36c  my-dataset     2020-02-28 16:48:09  sam

Displayed results are sorted based on the value of the first column.

You can specify output formats by passing --format with a value of tabular, json-ld or json.

Showing dataset details:

$ renku dataset show some-dataset
Name: some-dataset
Created: 2020-12-09 13:52:06.640778+00:00
Creator(s): John Doe<john.doe@example.com> [SDSC]
Keywords: Dataset, Data
Annotations:
[
  {...}
]
Title: Some Dataset
Description:
Just some dataset

You can also show details for a specific tag using the --tag option.

Deleting a dataset:

$ renku dataset rm some-dataset
OK

Creating a dataset with a storage backend:

By passing a storage URI with the --storage option, you can tell Renku that the data for the dataset is stored in remote storage. At the moment, Renku supports only S3 backends. For example:

$ renku dataset create s3-data --storage s3://bucket-name/path

Renku prompts for your S3 credentials and can store them for future use.

Note

The data directory of a dataset that has a storage backend is ignored by Git. This avoids committing data pulled from remote storage to Git.

Working with data

Adding data to the dataset:

$ renku dataset add my-dataset http://data-url

This will copy the contents of data-url to the dataset and add it to the dataset metadata.

You can create a dataset when you add data to it for the first time by passing the --create flag to the add command:

$ renku dataset add --create new-dataset http://data-url

To add data from a Git repository, you can specify it via the https or git+ssh URL schemes. For example:

$ renku dataset add my-dataset git+ssh://host.io/namespace/project.git

Sometimes you want to add just specific paths within the parent project. In this case, use the --source or -s flag:

$ renku dataset add my-dataset --source path/within/repo/to/datafile \
    git+ssh://host.io/namespace/project.git

The command above will result in a structure like:

data/
  my-dataset/
    datafile

You can use shell-like wildcards (e.g. *, **, ?) when specifying paths to be added. Put wildcard patterns in quotes to prevent your shell from expanding them.

$ renku dataset add my-dataset --source 'path/**/datafile' \
    git+ssh://host.io/namespace/project.git
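The effect of quoting can be seen with plain shell, independent of Renku. The sketch below uses a hypothetical helper, count_args, and a scratch directory with two made-up CSV files to show how many arguments a command receives with and without quotes.

```shell
# Print the number of arguments received.
count_args() { echo "$#"; }

# Scratch directory with two CSV files (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv"

# Unquoted: the shell expands the pattern before the command runs.
unquoted=$(cd "$demo_dir" && count_args *.csv)

# Quoted: the pattern is passed through as one literal argument.
quoted=$(cd "$demo_dir" && count_args '*.csv')

echo "unquoted=$unquoted quoted=$quoted"
```

This prints unquoted=2 quoted=1: only the quoted form hands the pattern itself to the command, which is what Renku needs in order to match paths inside the remote repository.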

You can use the --destination or -d flag to set the location where the new data is copied to. This location will be under the dataset’s data directory and will be created if it does not exist.

$ renku dataset add my-dataset \
    --source path/within/repo/to/datafile \
    --destination new-dir/new-subdir \
    git+ssh://host.io/namespace/project.git

will yield:

data/
  my-dataset/
    new-dir/
      new-subdir/
        datafile

To add a specific version of files, use the --ref option to select a branch, commit, or tag. The value passed to this option must be a valid reference in the remote Git repository.
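As a minimal sketch, --ref can be combined with --source in the same way as the examples above. The function name add_from_ref and the RENKU override variable are hypothetical, not part of Renku, and the reference passed in is assumed to exist in the remote repository.

```shell
# RENKU is a hypothetical override: set RENKU=echo to print the arguments
# instead of running Renku.
RENKU="${RENKU:-renku}"

# add_from_ref DATASET REF SOURCE_PATH REPO_URL
# Adds SOURCE_PATH from REPO_URL as it exists at REF (branch, commit, or tag).
add_from_ref() {
    "$RENKU" dataset add "$1" --ref "$2" --source "$3" "$4"
}
```

For example, add_from_ref my-dataset v1.0 path/within/repo/to/datafile git+ssh://host.io/namespace/project.git would add the file as it exists at a tag named v1.0, if such a tag exists.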

Adding external data to the dataset:

Sometimes you might want to add data to your dataset without copying the actual files to your repository. This is useful, for example, when external data is too large to store locally. The external data must exist (i.e. be mounted) on your filesystem. Renku creates a symbolic link to your data, and you can use this link in renku commands as a normal file. To add an external file, pass --external or -e when adding local data to a dataset:

$ renku dataset add my-dataset -e /path/to/external/file

Updating a dataset:

After adding files from a remote Git repository or importing a dataset from a provider like Dataverse or Zenodo, you can check for updates in those files by using the renku dataset update --all command.

For Git repositories, this command checks all remote files and copies over new content if there is any. It does not delete files from the local dataset if they are deleted from the remote Git repository; to force the deletion, use the --delete flag. You can update to a specific branch, commit, or tag by passing the --ref option.

For datasets from providers like Dataverse or Zenodo, the whole dataset is updated to ensure consistency between the remote and local versions. Because of this, the --include and --exclude flags are not compatible with those datasets. Moreover, deleted remote files are automatically deleted locally without requiring the --delete flag. Modifying those datasets locally will prevent them from being updated.

The update command also checks for file changes in the project and updates datasets’ metadata accordingly. You can automatically add new files from the dataset’s data directory by using the --check-data-directory flag.

You can limit the scope of updated files by specifying dataset names, using --include and --exclude to filter based on file names, or using --creators to filter based on creators. For example, the following command updates only CSV files from my-dataset:

$ renku dataset update -I '*.csv' my-dataset

Note that glob patterns must be quoted to keep the shell from expanding them.

External data is also updated automatically. Because this requires a checksum calculation, which can take a long time when the data is large, you can exclude external data from an update by passing the --no-external flag to the update command:

$ renku dataset update --all --no-external

You can use the --dry-run flag to preview which files and datasets would be updated by an update operation.
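The --plain option is meant for machine-readable output, so a preview can be summarized with standard tools. The sketch below is a hypothetical helper, summarize_plain, that counts entries by their one-letter code. It assumes each output line starts with the code (‘d’, ‘f’, or ‘r’) followed by whitespace; the exact line layout is not specified here, so treat that as an assumption.

```shell
# summarize_plain: read plain update output on stdin and count entries by code.
# Assumes the first whitespace-separated field of each line is 'd', 'f', or 'r'.
summarize_plain() {
    awk '
        $1 == "d" { d++ }   # dataset update
        $1 == "f" { f++ }   # file update
        $1 == "r" { r++ }   # file removed
        END { printf "datasets=%d files=%d removed=%d\n", d, f, r }
    '
}
```

One possible use, assuming the two flags can be combined, is to pipe the output of renku dataset update --all --dry-run --plain into summarize_plain.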

Tagging a dataset:

A dataset can be tagged with an arbitrary tag to refer to the dataset at that point in time. A tag can be added like this:

$ renku dataset tag my-dataset 1.0 -d "Version 1.0 tag"

A list of all tags can be seen by running:

$ renku dataset ls-tags my-dataset
CREATED              NAME    DESCRIPTION      DATASET     COMMIT
-------------------  ------  ---------------  ----------  ----------------
2020-09-19 17:29:13  1.0     Version 1.0 tag  my-dataset  6c19a8d31545b...

A tag can be removed with:

$ renku dataset rm-tags my-dataset 1.0

Importing data from other Renku projects:

To import all data files and their metadata from another Renku dataset use:

$ renku dataset import \
    https://renkulab.io/projects/<username>/<project>/datasets/<dataset-id>

or

$ renku dataset import \
    https://renkulab.io/projects/<username>/<project>/datasets/<dataset-name>

or

$ renku dataset import \
    https://renkulab.io/datasets/<dataset-id>

You can get the link to a dataset from the UI, or you can construct it if you know the dataset’s ID.

By default, Renku imports the latest version of a dataset from the other project. If you want to import another version, pass the dataset version’s tag to the import command:

$ renku dataset import \
    https://renkulab.io/datasets/<dataset-id> --tag <version>

Importing data from an external provider:

$ renku dataset import 10.5281/zenodo.3352150

This will import the dataset with the DOI (Digital Object Identifier) 10.5281/zenodo.3352150 and make it locally available. Dataverse and Zenodo are supported, with DOIs (e.g. 10.5281/zenodo.3352150 or doi:10.5281/zenodo.3352150) and full URLs (e.g. http://zenodo.org/record/3352150). A tag with the remote version of the dataset is automatically created.
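The accepted identifier forms can be told apart with a simple pattern match. The helper below, identifier_kind, is a hypothetical illustration of the three forms mentioned above; it is not how Renku itself parses identifiers.

```shell
# identifier_kind ID: classify an import identifier as a URL or a DOI.
identifier_kind() {
    case "$1" in
        http://*|https://*) echo url ;;      # full URL to a record
        doi:10.*/*|10.*/*)  echo doi ;;      # DOI, with or without the doi: prefix
        *)                  echo unknown ;;
    esac
}
```

With this sketch, both 10.5281/zenodo.3352150 and doi:10.5281/zenodo.3352150 are classified as doi, and http://zenodo.org/record/3352150 as url.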

You can change the directory a dataset is imported to by using the --datadir option.

Exporting data to an external provider:

$ renku dataset export my-dataset zenodo

This will export the dataset my-dataset to zenodo.org as a draft, allowing for publication later on. If the dataset has any tags set, you can choose whether the repository HEAD version or one of the tags should be exported. The remote version will be set to the local tag that is being exported.

To export to a Dataverse provider, you must pass the Dataverse server’s URL and the name of the parent dataverse to export the dataset to. The server’s URL is stored in your Renku settings, so you don’t need to pass it every time.

To export a dataset to OLOS, you must pass the OLOS server’s base URL and supply your access token when prompted for it. You must also choose which organizational unit to export the dataset to from the list shown during the export. The export does not map contributors or license information from Renku to OLOS. Additionally, all file categories default to Primary/Derived. These have to be adjusted manually in the OLOS interface after the export is done.

Exporting data to a local directory:

Renku provides a local provider that can be used to get a copy of a dataset. For example, the following command creates a copy of the dataset my-dataset version v1 in /tmp/my-dataset-v1:

$ renku dataset export my-dataset local --tag v1 --path /tmp/my-dataset-v1

This also creates a copy of the dataset’s metadata at the given version and puts it in <destination>/METADATA.yml. If a destination path is not given, the command creates a directory in the project’s data directory using the dataset’s name and version: <data-dir>/<dataset-name>-<version>. The export fails if the destination directory is not empty.
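Since the export fails on a non-empty destination, it can help to check the target directory first. The sketch below uses two hypothetical helpers: default_export_dir builds the default destination from the pattern described above, and is_empty_dir checks that a directory has no entries. Treating a missing directory as empty is an assumption.

```shell
# default_export_dir DATA_DIR DATASET_NAME VERSION
# Builds <data-dir>/<dataset-name>-<version>.
default_export_dir() {
    printf '%s/%s-%s\n' "$1" "$2" "$3"
}

# is_empty_dir DIR
# Succeeds if DIR does not exist (assumed acceptable) or has no entries.
is_empty_dir() {
    [ ! -e "$1" ] || [ -z "$(ls -A "$1" 2>/dev/null)" ]
}
```

For example, default_export_dir data my-dataset v1 prints data/my-dataset-v1, and is_empty_dir /tmp/my-dataset-v1 can guard the export command shown above.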

Note

See our dataset versioning tutorial for example recipes using tags for data management.

Listing all files in the project associated with a dataset:

$ renku dataset ls-files
DATASET NAME         ADDED                PATH                           LFS
-------------------  -------------------  -----------------------------  ----
my-dataset           2020-02-28 16:48:09  data/my-dataset/add-me         *
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file1  *
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file2
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file3  *

You can select which columns to display by using --columns to pass a comma-separated list of column names:

$ renku dataset ls-files --columns dataset_name,creators,path
DATASET NAME         CREATORS   PATH
-------------------  ---------  -----------------------------
my-dataset           sam        data/my-dataset/add-me
my-dataset           sam        data/my-dataset/weather/file1
my-dataset           sam        data/my-dataset/weather/file2
my-dataset           sam        data/my-dataset/weather/file3

Displayed results are sorted based on the value of the first column.

You can specify output formats by passing --format with a value of tabular, json-ld or json.

To filter the files, pass dataset names as arguments or use the --include and --exclude flags:

$ renku dataset ls-files --include "file*" --exclude "file3"
DATASET NAME        ADDED                PATH                           LFS
------------------- -------------------  -----------------------------  ----
my-dataset          2020-02-28 16:49:02  data/my-dataset/weather/file1  *
my-dataset          2020-02-28 16:49:02  data/my-dataset/weather/file2  *

Dataset files can be listed for a specific version (tag) of a dataset using the --tag option. In this case, files from datasets which have that specific tag are displayed.

Unlink a file from a dataset:

$ renku dataset unlink my-dataset --include file1
OK

Unlink all files within a directory from a dataset:

$ renku dataset unlink my-dataset --include "weather/*"
OK

Unlink all files from a dataset:

$ renku dataset unlink my-dataset
Warning: You are about to remove following from "my-dataset" dataset.
.../my-dataset/weather/file1
.../my-dataset/weather/file2
.../my-dataset/weather/file3
Do you wish to continue? [y/N]:

Note

The unlink command does not delete files, only the dataset record.
