renku dataset

Renku CLI commands for handling of datasets.

Description

Create, edit and manage the datasets in your Renku project.

This is a core feature of Renku. You might want to go through the examples listed below to get an idea of how you can create, import, and edit datasets.

Commands and options

renku dataset

Dataset commands.

renku dataset [OPTIONS] COMMAND [ARGS]...

add

Add data to a dataset.

renku dataset add [OPTIONS] NAME [URLS]...

Options

-f, --force

Allow adding otherwise ignored files.

-o, --overwrite

Overwrite existing files.

-c, --create

Create dataset if it does not exist.

-d, --destination <destination>

Destination directory within the dataset path

--storage <storage>

URI of the cloud storage backend.

--datadir <datadir>

Dataset’s data directory (defaults to ‘data/<dataset name>’).

-r, --ref <revision>

Add files from a specific commit/tag/branch.

-s, --source <sources>

Path(s) within remote git repo to be added.

--ln, --link

Symlink files to the dataset’s data directory. Mutually exclusive with --copy and --move.

--mv, --move

Move files to the dataset’s data directory. Mutually exclusive with --copy and --link.

--cp, --copy

Copy files to the dataset’s data directory. Mutually exclusive with --move and --link.

Arguments

NAME

Required argument

URLS

Optional argument(s)

create

Create an empty dataset in the current repo.

renku dataset create [OPTIONS] NAME

Options

-t, --title <title>

Title of the dataset.

-d, --description <description>

Dataset’s description.

-c, --creator <creators>

Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.

-m, --metadata <metadata>

Custom metadata to be associated with the dataset.

-k, --keyword <keyword>

List of keywords.

-s, --storage <storage>

URI of the cloud storage backend.

--datadir <datadir>

Dataset’s data directory (defaults to ‘data/<dataset name>’).

Arguments

NAME

Required argument
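
The creator format accepted by --creator, ‘Forename Surname <email> [affiliation]’, can be checked before the value is passed to the CLI. The following Python sketch is an illustration only: the regular expression and the parse_creator helper are hypothetical names made up for this example, not Renku’s own parser, and the sketch assumes that the affiliation part may be omitted.

```python
import re
from typing import NamedTuple, Optional


class Creator(NamedTuple):
    name: str
    email: str
    affiliation: Optional[str]


# Hypothetical pattern for 'Forename Surname <email> [affiliation]'.
# Assumption of this sketch: the affiliation part is optional.
CREATOR_RE = re.compile(
    r"^\s*(?P<name>[^<>\[\]]+?)\s*"
    r"<(?P<email>[^<>\s]+@[^<>\s]+)>\s*"
    r"(?:\[(?P<affiliation>[^\[\]]+)\])?\s*$"
)


def parse_creator(text: str) -> Creator:
    """Split a creator string into name, email, and optional affiliation."""
    match = CREATOR_RE.match(text)
    if match is None:
        raise ValueError(
            f"not in 'Forename Surname <email> [affiliation]' form: {text!r}"
        )
    affiliation = match.group("affiliation")
    return Creator(
        match.group("name"),
        match.group("email"),
        affiliation.strip() if affiliation else None,
    )


print(parse_creator("John Doe <john.doe@example.com> [SDSC]"))
```

A value that passes this check could then be given to renku dataset create or renku dataset edit through the -c option.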

edit

Edit dataset metadata.

renku dataset edit [OPTIONS] NAME

Options

-t, --title <title>

Title of the dataset.

-d, --description <description>

Dataset’s description.

-c, --creator <creators>

Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.

-m, --metadata <metadata>

Custom metadata to be associated with the dataset.

--metadata-source <metadata_source>

Set the source field in the metadata when editing it. If not provided, the default is ‘renku’.

-k, --keyword <keywords>

List of keywords or tags.

-u, --unset <unset>

Remove the given field (keywords, images, or metadata) from the dataset.

Options:

keywords | k | images | i | metadata | m

Arguments

NAME

Required argument

export

Export data to 3rd party provider.

renku dataset export [OPTIONS] NAME {zenodo|olos|dataverse|local}

Options

-t, --tag <tag>

Dataset tag to export

--publish

Publish the exported dataset.

--dataverse-name <dataverse_name>

Dataverse name to export to.

--dataverse-server <dataverse_server>

Dataverse server URL.

-p, --path <path>

Path to copy data to.

--dlcm-server <dlcm_server>

DLCM server base URL.

Arguments

NAME

Required argument

PROVIDER

Required argument

import

Import data from a 3rd party provider or another renku project.

Supported providers: [Dataverse, Renku, Zenodo]

renku dataset import [OPTIONS] URI

Options

--short-name, --name <name>

A convenient name for dataset.

-x, --extract

Extract files before importing to dataset.

-y, --yes

Bypass download confirmation.

--datadir <datadir>

Dataset’s data directory (defaults to ‘data/<dataset name>’).

--tag <tag>

Import a specific tag instead of the latest version.

Arguments

URI

Required argument

ls

List datasets.

renku dataset ls [OPTIONS]

Options

--format <format>

Choose an output format.

Options:

tabular | json-ld | json

-c, --columns <columns>

Comma-separated list of columns to display: id, created, date_created, short_name, name, creators, creators_full, tags, version, title, keywords, description, storage, datadir.

Default:

id,name,title,version,datadir,storage

ls-files

List files in dataset.

renku dataset ls-files [OPTIONS] [NAMES]...

Options

-t, --tag <tag>

Tag for which to show dataset files.

--creators <creators>

Filter files that were authored by specific creators. Separate multiple creators with commas.

-I, --include <include>

Include files matching given pattern.

-X, --exclude <exclude>

Exclude files matching given pattern.

--format <format>

Choose an output format.

Options:

tabular | json-ld | json

-c, --columns <columns>

Comma-separated list of columns to display: added, checksum, creators, creators_full, dataset, full_path, path, short_name, dataset_name, size, lfs, source.

Default:

dataset_name,path,size,added,lfs

Arguments

NAMES

Optional argument(s)

ls-tags

List all tags of a dataset.

renku dataset ls-tags [OPTIONS] NAME

Options

--format <format>

Choose an output format.

Options:

tabular | json-ld

Arguments

NAME

Required argument

mount

Mount a cloud storage in the dataset’s data directory.

renku dataset mount [OPTIONS] NAME

Options

-e, --existing <existing>

Use an existing mount point instead of mounting the remote storage.

-u, --unmount

Unmount dataset’s backend storage.

-y, --yes

No prompt when removing non-empty dataset’s data directory.

Arguments

NAME

Required argument

pull

Pull data from a cloud storage.

renku dataset pull [OPTIONS] NAME

Options

-l, --location <location>

A directory to copy data to, instead of the dataset’s data directory.

Arguments

NAME

Required argument

rm

Delete a dataset.

renku dataset rm [OPTIONS] NAME

Arguments

NAME

Required argument

rm-tags

Remove tags from a dataset.

renku dataset rm-tags [OPTIONS] NAME [TAGS]...

Arguments

NAME

Required argument

TAGS

Optional argument(s)

show

Show metadata of a dataset.

renku dataset show [OPTIONS] NAME

Options

-t, --tag <tag>

Tag for which to show dataset metadata.

Arguments

NAME

Required argument

tag

Create a tag for a dataset.

renku dataset tag [OPTIONS] NAME TAG

Options

-d, --description <description>

A description for this tag

-f, --force

Allow overwriting existing tags.

Arguments

NAME

Required argument

TAG

Required argument

unmount

Unmount a backend storage in the dataset’s data directory.

renku dataset unmount [OPTIONS] NAME

Arguments

NAME

Required argument

update

Updates files in dataset from a remote Git repo.

renku dataset update [OPTIONS] [NAMES]...

Options

--creators <creators>

Filter files that were authored by specific creators. Separate multiple creators with commas.

-I, --include <include>

Include files matching given pattern.

-X, --exclude <exclude>

Exclude files matching given pattern.

--ref <ref>

Update to a specific commit/tag/branch.

--delete

Delete local files that are deleted from remote.

-e, --external

Deprecated

--no-external

Skip updating external data.

--no-local

Skip updating local files.

--no-remote

Skip updating remote files.

-c, --check-data-directory

Check datasets’ data directories for new files.

-a, --all

Update all datasets.

-n, --dry-run

Show what would have been changed

--plain

Show result as one entry per line for machine readability. ‘d’ = dataset update, ‘f’ = file update, ‘r’ = file removed.

Arguments

NAMES

Optional argument(s)

Examples

Create an empty dataset inside a Renku project:

$ renku dataset create my-dataset

When listing datasets with renku dataset ls, you can select which columns to display by using --columns to pass a comma-separated list of column names:

$ renku dataset ls --columns id,name,date_created,creators
ID        NAME           CREATED              CREATORS
--------  -------------  -------------------  ---------
0ad1cb9a  some-dataset   2020-03-19 16:39:46  sam
9436e36c  my-dataset     2020-02-28 16:48:09  sam

Displayed results are sorted based on the value of the first column.

You can specify output formats by passing --format with a value of tabular, json-ld or json.

Showing dataset details:

$ renku dataset show some-dataset
Name: some-dataset
Created: 2020-12-09 13:52:06.640778+00:00
Creator(s): John Doe<john.doe@example.com> [SDSC]
Keywords: Dataset, Data
Annotations:
[
  {...}
]
Title: Some Dataset
Description:
Just some dataset

You can also show details for a specific tag using the --tag option.

Deleting a dataset:

$ renku dataset rm some-dataset
OK

Creating a dataset with a storage backend:

By passing a storage URI with the --storage option, you can tell Renku that the data for the dataset is kept in remote storage. At the moment, Renku supports only S3 backends. For example:

$ renku dataset create s3-data --storage s3://bucket-name/path

Renku prompts for your S3 credentials and can store them for future use.

Note

The data directory of a dataset that has a storage backend is ignored by Git. This prevents data pulled from remote storage from being committed to Git.

Working with data

Adding data to the dataset:

$ renku dataset add my-dataset http://data-url

This will copy the contents of data-url to the dataset and add it to the dataset metadata.

Note

If the URL refers to a local directory, data is added differently depending on whether there is a trailing slash (/). If the URL ends in a slash, the files inside the directory are added to the target directory. If it does not end in a slash, the directory itself is added inside the target directory.
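
The trailing-slash rule can be made concrete with a few lines of Python. This is only a sketch of the rule as described in the note, not Renku’s implementation; the planned_targets helper and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def planned_targets(source_dir: str, files: List[str], target_dir: str) -> List[str]:
    """Return where each file under source_dir would land in target_dir.

    `files` are paths relative to `source_dir`.
    """
    target = PurePosixPath(target_dir)
    if not source_dir.endswith("/"):
        # No trailing slash: the directory itself is placed inside the target.
        target = target / PurePosixPath(source_dir).name
    return [str(target / name) for name in files]


files = ["file1", "sub/file2"]

# Trailing slash: only the directory's contents are added.
print(planned_targets("weather/", files, "data/my-dataset"))

# No trailing slash: the 'weather' directory is added as a whole.
print(planned_targets("weather", files, "data/my-dataset"))
```

The first call places file1 directly under data/my-dataset, while the second places it under data/my-dataset/weather.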

You can create a dataset when you add data to it for the first time by passing the --create flag to the add command:

$ renku dataset add --create new-dataset http://data-url

To add data from a Git repository, you can specify it via the https or git+ssh URL schemes. For example:

$ renku dataset add my-dataset git+ssh://host.io/namespace/project.git

Sometimes you want to add just specific paths within the parent project. In this case, use the --source or -s flag:

$ renku dataset add my-dataset --source path/within/repo/to/datafile \
    git+ssh://host.io/namespace/project.git

The command above will result in a structure like

data/
  my-dataset/
    datafile

You can use shell-like wildcards (e.g. *, **, ?) when specifying paths to be added. Put wildcard patterns in quotes to prevent your shell from expanding them.

$ renku dataset add my-dataset --source 'path/**/datafile' \
    git+ssh://host.io/namespace/project.git

You can use the --destination or -d flag to set the location where the new data is copied to. This location will be under the dataset’s data directory and will be created if it does not exist.

$ renku dataset add my-dataset \
    --source path/within/repo/to/datafile \
    --destination new-dir/new-subdir \
    git+ssh://host.io/namespace/project.git

will yield:

data/
  my-dataset/
    new-dir/
      new-subdir/
        datafile

To add a specific version of files, use the --ref option to select a branch, commit, or tag. The value passed to this option must be a valid reference in the remote Git repository.
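
When the add command is driven from a script, the options above can be assembled into an argument list. The Python sketch below only builds and prints the command; it does not run it. The dataset_add_command helper and the reference v1.0 are hypothetical, and the sketch assumes that --source may be repeated for several paths, as the plural in its help text suggests.

```python
import shlex
from typing import List, Optional, Sequence


def dataset_add_command(
    name: str,
    url: str,
    sources: Sequence[str] = (),
    destination: Optional[str] = None,
    ref: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for 'renku dataset add'."""
    argv = ["renku", "dataset", "add", name]
    for source in sources:
        # Assumption: one --source flag per path.
        argv += ["--source", source]
    if destination is not None:
        argv += ["--destination", destination]
    if ref is not None:
        argv += ["--ref", ref]
    argv.append(url)
    return argv


argv = dataset_add_command(
    "my-dataset",
    "git+ssh://host.io/namespace/project.git",
    sources=["path/**/datafile"],
    destination="new-dir/new-subdir",
    ref="v1.0",  # hypothetical tag in the remote repository
)

# shlex.join quotes the wildcard pattern, so a shell would not expand it.
print(shlex.join(argv))
```

Passing the list to subprocess.run inside a Renku project avoids shell expansion altogether, because no shell is involved when arguments are given as a list.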

Updating a dataset:

After adding files from a remote Git repository or importing a dataset from a provider like Dataverse or Zenodo, you can check for updates in those files by using the renku dataset update --all command.

For Git repositories, this command checks all remote files and copies over new content if there is any. It does not delete files from the local dataset if they are deleted from the remote Git repository; to force the deletion, use the --delete flag. You can update to a specific branch, commit, or tag by passing the --ref option.

For datasets from providers like Dataverse or Zenodo, the whole dataset is updated to ensure consistency between the remote and local versions. Because of this, the --include and --exclude flags are not compatible with those datasets. Moreover, deleted remote files are automatically deleted locally without requiring the --delete flag. Modifying those datasets locally will prevent them from being updated.

The update command also checks for file changes in the project and updates datasets’ metadata accordingly. You can automatically add new files from the dataset’s data directory by using the --check-data-directory flag.

You can limit the scope of updated files by specifying dataset names, using --include and --exclude to filter based on file names, or using --creators to filter based on creators. For example, the following command updates only CSV files from my-dataset:

$ renku dataset update -I '*.csv' my-dataset

Note that glob patterns must be put in quotes so that the Unix shell does not expand them.

You can use the --dry-run flag to preview which files and datasets would be updated by an update operation.

Tagging a dataset:

A dataset can be tagged with an arbitrary tag to refer to the dataset at that point in time. A tag can be added like this:

$ renku dataset tag my-dataset 1.0 -d "Version 1.0 tag"

A list of all tags can be seen by running:

$ renku dataset ls-tags my-dataset
CREATED              NAME    DESCRIPTION      DATASET     COMMIT
-------------------  ------  ---------------  ----------  ----------------
2020-09-19 17:29:13  1.0     Version 1.0 tag  my-dataset  6c19a8d31545b...

A tag can be removed with:

$ renku dataset rm-tags my-dataset 1.0

Importing data from other Renku projects:

To import all data files and their metadata from another Renku dataset use:

$ renku dataset import \
    https://renkulab.io/projects/<username>/<project>/datasets/<dataset-id>

or

$ renku dataset import \
    https://renkulab.io/projects/<username>/<project>/datasets/<dataset-name>

or

$ renku dataset import \
    https://renkulab.io/datasets/<dataset-id>

You can get the link to a dataset from the UI, or you can construct it if you know the dataset’s ID.

By default, Renku imports the latest version of a dataset from the other project. If you want to import another version, pass the dataset version’s tag to the import command:

$ renku dataset import \
    https://renkulab.io/datasets/<dataset-id> --tag <version>

Importing data from an external provider:

$ renku dataset import 10.5281/zenodo.3352150

This will import the dataset with the DOI (Digital Object Identifier) 10.5281/zenodo.3352150 and make it locally available. Dataverse and Zenodo are supported, with DOIs (e.g. 10.5281/zenodo.3352150 or doi:10.5281/zenodo.3352150) and full URLs (e.g. http://zenodo.org/record/3352150). A tag with the remote version of the dataset is automatically created.
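
The identifier forms listed above (a bare DOI, a DOI with the doi: prefix, and a full URL) can be told apart with a small check before calling the import command. The Python sketch below is an illustration, not how Renku resolves identifiers: the classify_identifier helper is hypothetical, and the DOI pattern is a deliberately loose approximation.

```python
import re
from urllib.parse import urlparse

# Loose DOI shape: '10.', a registrant code, '/', then a suffix.
DOI_RE = re.compile(r"^(?:doi:)?(10\.\d{4,9}/\S+)$", re.IGNORECASE)


def classify_identifier(uri: str):
    """Return ('doi', bare_doi) or ('url', uri); raise ValueError otherwise."""
    uri = uri.strip()
    match = DOI_RE.match(uri)
    if match:
        # Drop the optional 'doi:' prefix and keep only the DOI itself.
        return ("doi", match.group(1))
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return ("url", uri)
    raise ValueError(f"neither a DOI nor an http(s) URL: {uri!r}")


for candidate in (
    "10.5281/zenodo.3352150",
    "doi:10.5281/zenodo.3352150",
    "http://zenodo.org/record/3352150",
):
    print(classify_identifier(candidate))
```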

You can change the directory a dataset is imported to by using the --datadir option.

Exporting data to an external provider:

$ renku dataset export my-dataset zenodo

This will export the dataset my-dataset to zenodo.org as a draft, allowing for publication later on. If the dataset has any tags set, you can choose if the repository HEAD version or one of the tags should be exported. The remote version will be set to the local tag that is being exported.

To export to a Dataverse provider, you must pass the Dataverse server’s URL and the name of the parent dataverse that the dataset will be exported to. The server’s URL is stored in your Renku settings, so you don’t need to pass it every time.

To export a dataset to OLOS, you must pass the OLOS server’s base URL and supply your access token when prompted for it. You must also choose which organizational unit to export the dataset to from the list shown during the export. The export does not map contributors from Renku to OLOS, nor does it map license information. Additionally, all file categories default to Primary/Derived. These have to be adjusted manually in the OLOS interface after the export is done.

Exporting data to a local directory:

Renku provides a local provider that can be used to get a copy of a dataset. For example, the following command creates a copy of the dataset my-dataset version v1 in /tmp/my-dataset-v1:

$ renku dataset export my-dataset local --tag v1 --path /tmp/my-dataset-v1

This also creates a copy of the dataset’s metadata at the given version and puts it in <destination>/METADATA.yml. If a destination path is not given to this command, it creates a directory in the project’s data directory using the dataset’s name and version: <data-dir>/<dataset-name>-<version>. The export fails if the destination directory is not empty.
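
The destination rule for local exports can be sketched as follows. This Python snippet only mirrors the behaviour described above (the default <data-dir>/<dataset-name>-<version> location and the failure on a non-empty directory); the export_destination helper is hypothetical and is not part of Renku.

```python
import tempfile
from pathlib import Path
from typing import Optional


def export_destination(
    data_dir: Path,
    dataset_name: str,
    version: str,
    path: Optional[Path] = None,
) -> Path:
    """Pick the export directory and refuse to use a non-empty one."""
    if path is not None:
        destination = path
    else:
        # Default location: <data-dir>/<dataset-name>-<version>
        destination = data_dir / f"{dataset_name}-{version}"
    if destination.is_dir() and any(destination.iterdir()):
        raise FileExistsError(f"destination is not empty: {destination}")
    return destination


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # No explicit path: fall back to the default location.
    print(export_destination(data_dir, "my-dataset", "v1").name)
```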

Note

See our dataset versioning tutorial for example recipes using tags for data management.

Listing all files in the project associated with a dataset:

$ renku dataset ls-files
DATASET NAME         ADDED                PATH                           LFS
-------------------  -------------------  -----------------------------  ----
my-dataset           2020-02-28 16:48:09  data/my-dataset/add-me         *
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file1  *
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file2
my-dataset           2020-02-28 16:49:02  data/my-dataset/weather/file3  *

You can select which columns to display by using --columns to pass a comma-separated list of column names:

$ renku dataset ls-files --columns dataset_name,creators,path
DATASET NAME         CREATORS   PATH
-------------------  ---------  -----------------------------
my-dataset           sam        data/my-dataset/add-me
my-dataset           sam        data/my-dataset/weather/file1
my-dataset           sam        data/my-dataset/weather/file2
my-dataset           sam        data/my-dataset/weather/file3

Displayed results are sorted based on the value of the first column.

You can specify output formats by passing --format with a value of tabular, json-ld or json.

Sometimes you want to filter the files. To do so, pass dataset names as arguments or use the --include and --exclude flags:

$ renku dataset ls-files --include "file*" --exclude "file3"
DATASET NAME        ADDED                PATH                           LFS
------------------- -------------------  -----------------------------  ----
my-dataset          2020-02-28 16:49:02  data/my-dataset/weather/file1  *
my-dataset          2020-02-28 16:49:02  data/my-dataset/weather/file2  *
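
The effect of combining --include and --exclude can be reproduced outside Renku with a short Python sketch. It assumes that patterns are matched against the trailing components of each path, which is consistent with the output above but is not confirmed here as Renku’s actual matching rule; the filter_paths helper is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Sequence


def filter_paths(
    paths: Iterable[str],
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Keep paths matching any include pattern and no exclude pattern."""
    kept = []
    for raw in paths:
        path = PurePosixPath(raw)
        # PurePath.match anchors relative patterns at the end of the path.
        if include and not any(path.match(pattern) for pattern in include):
            continue
        if any(path.match(pattern) for pattern in exclude):
            continue
        kept.append(raw)
    return kept


files = [
    "data/my-dataset/add-me",
    "data/my-dataset/weather/file1",
    "data/my-dataset/weather/file2",
    "data/my-dataset/weather/file3",
]
print(filter_paths(files, include=["file*"], exclude=["file3"]))
```

With these inputs, only the file1 and file2 paths remain, matching the two rows listed by the command.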

Dataset files can be listed for a specific version (tag) of a dataset using the --tag option. In this case, files from datasets which have that specific tag are displayed.

Unlink a file from a dataset:

$ renku dataset unlink my-dataset --include file1
OK

Unlink all files within a directory from a dataset:

$ renku dataset unlink my-dataset --include "weather/*"
OK

Unlink all files from a dataset:

$ renku dataset unlink my-dataset
Warning: You are about to remove following from "my-dataset" dataset.
.../my-dataset/weather/file1
.../my-dataset/weather/file2
.../my-dataset/weather/file3
Do you wish to continue? [y/N]:

Note

The unlink command does not delete files, only the dataset record.