renku dataset
Renku CLI commands for handling of datasets.
Description
Create, edit and manage the datasets in your Renku project.
This is a core feature of Renku. You might want to go through the examples listed below to get an idea of how you can create, import, and edit datasets.
Commands and options
renku dataset
Dataset commands.
renku dataset [OPTIONS] COMMAND [ARGS]...
add
Add data to a dataset.
renku dataset add [OPTIONS] NAME [URLS]...
Options
- -f, --force
Allow adding otherwise ignored files.
- -o, --overwrite
Overwrite existing files.
- -c, --create
Create dataset if it does not exist.
- -d, --destination <destination>
Destination directory within the dataset path
- --datadir <datadir>
Dataset’s data directory (defaults to ‘data/<dataset name>’).
- -r, --ref <revision>
Add files from a specific commit/tag/branch.
- -s, --source <sources>
Path(s) within remote git repo to be added.
- --ln, --link
Symlink files to the dataset’s data directory. Mutually exclusive with --copy and --move.
- --mv, --move
Move files to the dataset’s data directory. Mutually exclusive with --copy and --link.
- --cp, --copy
Copy files to the dataset’s data directory. Mutually exclusive with --move and --link.
- -e, --external
Create a link to external data.
- --storage <storage>
URI of the S3 bucket, used when the dataset is created while running ‘add’.
Arguments
- NAME
Required argument
- URLS
Optional argument(s)
create
Create an empty dataset in the current repo.
renku dataset create [OPTIONS] NAME
Options
- -t, --title <title>
Title of the dataset.
- -d, --description <description>
Dataset’s description.
- -c, --creator <creators>
Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.
- -m, --metadata <metadata>
Custom metadata to be associated with the dataset.
- -k, --keyword <keyword>
List of keywords.
- -s, --storage <storage>
URI of the storage backend.
- --datadir <datadir>
Dataset’s data directory (defaults to ‘data/<dataset name>’).
Arguments
- NAME
Required argument
edit
Edit dataset metadata.
renku dataset edit [OPTIONS] NAME
Options
- -t, --title <title>
Title of the dataset.
- -d, --description <description>
Dataset’s description.
- -c, --creator <creators>
Creator’s name, email, and affiliation. Accepted format is ‘Forename Surname <email> [affiliation]’.
- -m, --metadata <metadata>
Custom metadata to be associated with the dataset.
- -k, --keyword <keywords>
List of keywords or tags.
- -u, --unset <unset>
Remove keywords from dataset.
- Options
keywords | k | images | i | metadata | m
Arguments
- NAME
Required argument
export
Export data to 3rd party provider.
renku dataset export [OPTIONS] NAME {zenodo|olos|dataverse|local}
Options
- -t, --tag <tag>
Dataset tag to export
- --publish
Publish the exported dataset.
- --dataverse-name <dataverse_name>
Dataverse name to export to.
- --dataverse-server <dataverse_server>
Dataverse server URL.
- -p, --path <path>
Path to copy data to.
- --dlcm-server <dlcm_server>
DLCM server base URL.
Arguments
- NAME
Required argument
- PROVIDER
Required argument
import
Import data from a 3rd party provider or another renku project.
Supported providers: [Dataverse, Renku, Zenodo]
renku dataset import [OPTIONS] URI
Options
- --short-name, --name <name>
A convenient name for dataset.
- -x, --extract
Extract files before importing to dataset.
- -y, --yes
Bypass download confirmation.
- --datadir <datadir>
Dataset’s data directory (defaults to ‘data/<dataset name>’).
- --tag <tag>
Import a specific tag instead of the latest version.
Arguments
- URI
Required argument
ls
List datasets.
renku dataset ls [OPTIONS]
Options
- --format <format>
Choose an output format.
- Options
tabular | json-ld | json
- -c, --columns <columns>
Comma-separated list of columns to display: id, created, date_created, short_name, name, creators, creators_full, tags, version, title, keywords, description, storage, datadir.
- Default
id,name,title,version,datadir
ls-files
List files in dataset.
renku dataset ls-files [OPTIONS] [NAMES]...
Options
- -t, --tag <tag>
Tag for which to show dataset files.
- --creators <creators>
Filter files that were authored by specific creators. Separate multiple creators with commas.
- -I, --include <include>
Include files matching given pattern.
- -X, --exclude <exclude>
Exclude files matching given pattern.
- --format <format>
Choose an output format.
- Options
tabular | json-ld | json
- -c, --columns <columns>
Comma-separated list of columns to display: added, checksum, creators, creators_full, dataset, full_path, path, short_name, dataset_name, size, lfs, source.
- Default
dataset_name,added,size,path,lfs
Arguments
- NAMES
Optional argument(s)
rm
Delete a dataset.
renku dataset rm [OPTIONS] NAME
Arguments
- NAME
Required argument
show
Show metadata of a dataset.
renku dataset show [OPTIONS] NAME
Options
- -t, --tag <tag>
Tag for which to show dataset metadata.
Arguments
- NAME
Required argument
tag
Create a tag for a dataset.
renku dataset tag [OPTIONS] NAME TAG
Options
- -d, --description <description>
A description for this tag
- -f, --force
Allow overwriting existing tags.
Arguments
- NAME
Required argument
- TAG
Required argument
unlink
Remove matching files from a dataset.
renku dataset unlink [OPTIONS] NAME
Options
- -I, --include <include>
Include files matching given pattern.
- -X, --exclude <exclude>
Exclude files matching given pattern.
- -y, --yes
Confirm unlinking of all files.
Arguments
- NAME
Required argument
update
Updates files in dataset from a remote Git repo.
renku dataset update [OPTIONS] [NAMES]...
Options
- --creators <creators>
Filter files that were authored by specific creators. Separate multiple creators with commas.
- -I, --include <include>
Include files matching given pattern.
- -X, --exclude <exclude>
Exclude files matching given pattern.
- --ref <ref>
Update to a specific commit/tag/branch.
- --delete
Delete local files that are deleted from remote.
- -e, --external
Deprecated
- --no-external
Skip updating external data.
- --no-local
Skip updating local files.
- --no-remote
Skip updating remote files.
- -c, --check-data-directory
Check datasets’ data directories for new files.
- -a, --all
Update all datasets.
- -n, --dry-run
Show what would have been changed
- --plain
Show result as one entry per line for machine readability. ‘d’ = dataset update, ‘f’ = file update, ‘r’ = file removed.
Arguments
- NAMES
Optional argument(s)
Examples
Create an empty dataset inside a Renku project:
$ renku dataset create my-dataset
Listing all datasets:
$ renku dataset ls
You can select which columns to display by using --columns to pass a
comma-separated list of column names:
$ renku dataset ls --columns id,name,date_created,creators
ID NAME CREATED CREATORS
-------- ------------- ------------------- ---------
0ad1cb9a some-dataset 2020-03-19 16:39:46 sam
9436e36c my-dataset 2020-02-28 16:48:09 sam
Displayed results are sorted based on the value of the first column.
You can specify output formats by passing --format with a value of tabular,
json-ld or json.
Showing dataset details:
$ renku dataset show some-dataset
Name: some-dataset
Created: 2020-12-09 13:52:06.640778+00:00
Creator(s): John Doe<john.doe@example.com> [SDSC]
Keywords: Dataset, Data
Annotations:
[
{...}
]
Title: Some Dataset
Description:
Just some dataset
You can also show details for a specific tag using the --tag option.
Deleting a dataset:
$ renku dataset rm some-dataset
OK
Creating a dataset with a storage backend:
By passing a storage URI with the --storage option, you can tell Renku that
the data for the dataset is stored in remote storage. At the moment, Renku
supports only S3 backends. For example:
$ renku dataset create s3-data --storage s3://bucket-name/path
Renku prompts for your S3 credentials and can store them for future use.
Note
The data directory of a dataset that has a storage backend is ignored by Git. This avoids committing data pulled from remote storage to Git.
Working with data
Adding data to the dataset:
$ renku dataset add my-dataset http://data-url
This will copy the contents of data-url to the dataset and add it
to the dataset metadata.
You can create a dataset when you add data to it for the first time by passing
the --create flag to the add command:
$ renku dataset add --create new-dataset http://data-url
To add data from a Git repository, specify it with an https or git+ssh URL. For example:
$ renku dataset add my-dataset git+ssh://host.io/namespace/project.git
Sometimes you want to add just specific paths within the parent project.
In this case, use the --source or -s flag:
$ renku dataset add my-dataset --source path/within/repo/to/datafile \
git+ssh://host.io/namespace/project.git
The command above will result in a structure like
data/
my-dataset/
datafile
You can use shell-like wildcards (e.g. *, **, ?) when specifying paths to be added. Put wildcard patterns in quotes to prevent your shell from expanding them.
$ renku dataset add my-dataset --source 'path/**/datafile' \
git+ssh://host.io/namespace/project.git
You can use the --destination or -d flag to set the location where the new
data is copied to. This location will be under the dataset’s data directory and
will be created if it does not exist.
$ renku dataset add my-dataset \
--source path/within/repo/to/datafile \
--destination new-dir/new-subdir \
git+ssh://host.io/namespace/project.git
will yield:
data/
my-dataset/
new-dir/
new-subdir/
datafile
To add a specific version of files, use the --ref option to select a
branch, commit, or tag. The value passed to this option must be a valid
reference in the remote Git repository.
Adding external data to the dataset:
Sometimes you might want to add data to your dataset without copying the
actual files to your repository. This is useful, for example, when external data
is too large to store locally. The external data must exist (i.e. be mounted)
on your filesystem. Renku creates a symbolic link to your data, and you can use
this symbolic link in renku commands as a normal file. To add an external file,
pass --external or -e when adding local data to a dataset:
$ renku dataset add my-dataset -e /path/to/external/file
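The symbolic-link behaviour described above can be tried with plain shell commands, independently of Renku. The following is a minimal sketch in a throwaway directory; the directory layout and file names are hypothetical, and it does not reproduce whatever metadata or checksum handling Renku performs for external files.

```shell
# Illustration only: plain shell commands in a temporary directory,
# no Renku project is involved. All paths are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/external" "$work/project/data/my-dataset"
echo "large external payload" > "$work/external/file"

# A symbolic link stores only the path of its target, not the contents.
ln -s "$work/external/file" "$work/project/data/my-dataset/file"

# Reading through the link returns the contents of the external file.
via_link=$(cat "$work/project/data/my-dataset/file")
echo "$via_link"

# The entry inside the dataset directory is a link, not a regular copy.
if [ -L "$work/project/data/my-dataset/file" ]; then
    is_link=yes
else
    is_link=no
fi
echo "is symlink: $is_link"
```

Because only the link lives inside the project, the external file must stay available at the same path for the link to keep working.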
Updating a dataset:
After adding files from a remote Git repository or importing a dataset from a
provider like Dataverse or Zenodo, you can check for updates in those files by
using the renku dataset update --all command. For Git repositories, this command
checks all remote files and copies over new content if there is any. It does
not delete files from the local dataset if they are deleted from the remote Git
repository; to force the deletion, use the --delete option. You can update to a
specific branch, commit, or tag by passing the --ref option.
For datasets from providers like Dataverse or Zenodo, the whole dataset is
updated to ensure consistency between the remote and local versions. Due to
this limitation, the --include and --exclude flags are not compatible
with those datasets. Moreover, deleted remote files are automatically deleted
without requiring the --delete option. Modifying those datasets locally
will prevent them from being updated.
The update command also checks for file changes in the project and updates
datasets’ metadata accordingly. You can automatically add new files from
the dataset’s data directory by using the --check-data-directory flag.
You can limit the scope of updated files by specifying dataset names, using
--include and --exclude to filter based on file names, or using
--creators to filter based on creators. For example, the following command
updates only CSV files from my-dataset:
$ renku dataset update -I '*.csv' my-dataset
Note that glob patterns must be put in quotes to prevent the Unix shell from expanding them.
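The effect of quoting can be checked without Renku. The sketch below uses a temporary directory with hypothetical file names and a small helper, count_args (not part of Renku), to show how many arguments a command actually receives in each case.

```shell
# Illustration only: no Renku project is needed.
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/notes.txt"

# Unquoted: the shell expands *.csv to the matching file names first,
# so the command receives one argument per matching file.
unquoted_count=$(cd "$demo_dir" && count_args *.csv)

# Quoted: the pattern reaches the command as a single literal argument,
# leaving the matching to the program that receives it.
quoted_count=$(cd "$demo_dir" && count_args '*.csv')

echo "unquoted: $unquoted_count argument(s)"
echo "quoted:   $quoted_count argument(s)"
```

With two CSV files in the directory, the unquoted form passes two arguments and the quoted form passes one, which is why the pattern given to -I should be quoted.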
External data are also updated automatically. Since they require a checksum
calculation which can take a long time when data is large, you can exclude them
from an update by passing the --no-external flag to the update command:
$ renku dataset update --all --no-external
You can use the --dry-run flag to get a preview of what files/datasets will be
updated by an update operation.
Tagging a dataset:
A dataset can be tagged with an arbitrary tag to refer to the dataset at that point in time. A tag can be added like this:
$ renku dataset tag my-dataset 1.0 -d "Version 1.0 tag"
A list of all tags can be seen by running:
$ renku dataset ls-tags my-dataset
CREATED NAME DESCRIPTION DATASET COMMIT
------------------- ------ --------------- ---------- ----------------
2020-09-19 17:29:13 1.0 Version 1.0 tag my-dataset 6c19a8d31545b...
A tag can be removed with:
$ renku dataset rm-tags my-dataset 1.0
Importing data from other Renku projects:
To import all data files and their metadata from another Renku dataset use:
$ renku dataset import \
https://renkulab.io/projects/<username>/<project>/datasets/<dataset-id>
or
$ renku dataset import \
https://renkulab.io/projects/<username>/<project>/datasets/<dataset-name>
or
$ renku dataset import \
https://renkulab.io/datasets/<dataset-id>
You can get the link to a dataset from the UI or you can construct it by knowing the dataset’s ID.
By default, Renku imports the latest version of a dataset from the other project. If you want to import another version, pass the dataset version’s tag to the import command:
$ renku dataset import \
https://renkulab.io/datasets/<dataset-id> --tag <version>
Importing data from an external provider:
$ renku dataset import 10.5281/zenodo.3352150
This will import the dataset with the DOI (Digital Object Identifier)
10.5281/zenodo.3352150 and make it locally available.
Dataverse and Zenodo are supported, with DOIs (e.g. 10.5281/zenodo.3352150
or doi:10.5281/zenodo.3352150) and full URLs (e.g.
http://zenodo.org/record/3352150). A tag with the remote version of the
dataset is automatically created.
You can change the directory a dataset is imported to by using the
--datadir option.
Exporting data to an external provider:
$ renku dataset export my-dataset zenodo
This will export the dataset my-dataset to zenodo.org as a draft,
allowing for publication later on. If the dataset has any tags set, you
can choose whether the repository HEAD version or one of the tags should be
exported. The remote version will be set to the local tag that is being
exported.
To export to a Dataverse provider, you must pass the Dataverse server’s URL and the name of the parent dataverse to which the dataset will be exported. The server’s URL is stored in your Renku settings, so you don’t need to pass it every time.
To export a dataset to OLOS, you must pass the OLOS server’s base URL and supply your access token when prompted for it. You must also choose which organizational unit to export the dataset to from the list shown during the export. The export does not map contributors or license information from Renku to OLOS. Additionally, all file categories default to Primary/Derived; this has to be adjusted manually in the OLOS interface after the export is done.
Exporting data to a local directory:
Renku provides a local provider that can be used to get a copy of a
dataset. For example, the following command creates a copy of the dataset
my-dataset version v1 in /tmp/my-dataset-v1:
$ renku dataset export my-dataset local --tag v1 --path /tmp/my-dataset-v1
This also creates a copy of the dataset’s metadata at the given version and puts
it in <destination>/METADATA.yml. If a destination path is not given to this
command, it creates a directory in the project’s data directory using the
dataset’s name and version: <data-dir>/<dataset-name>-<version>. Export fails if
the destination directory is not empty.
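Since the export fails on a non-empty destination, a script may want to check the directory before calling the export command. The following is a minimal sketch of such a pre-check; is_empty_or_missing is a hypothetical helper, not part of Renku, and it assumes that a destination which does not exist yet is acceptable, which the text above does not state explicitly.

```shell
# Hypothetical helper, not part of Renku: succeeds when the path does not
# exist or is a directory without any entries (hidden files included).
is_empty_or_missing() {
    [ ! -e "$1" ] && return 0
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Stand-in for a destination such as /tmp/my-dataset-v1.
dest="$(mktemp -d)/my-dataset-v1"

# The destination does not exist yet, so the check passes.
if is_empty_or_missing "$dest"; then before=ok; else before=blocked; fi

# Once the directory contains a file, the check reports it as blocked.
mkdir -p "$dest" && touch "$dest/leftover"
if is_empty_or_missing "$dest"; then after=ok; else after=blocked; fi

echo "before: $before, after: $after"
```

In a real script, the export command shown earlier would only be run when the check succeeds.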
Note
See our dataset versioning tutorial for example recipes using tags for data management.
Listing all files in the project associated with a dataset:
$ renku dataset ls-files
DATASET NAME ADDED PATH LFS
------------------- ------------------- ----------------------------- ----
my-dataset 2020-02-28 16:48:09 data/my-dataset/add-me *
my-dataset 2020-02-28 16:49:02 data/my-dataset/weather/file1 *
my-dataset 2020-02-28 16:49:02 data/my-dataset/weather/file2
my-dataset 2020-02-28 16:49:02 data/my-dataset/weather/file3 *
You can select which columns to display by using --columns to pass a
comma-separated list of column names:
$ renku dataset ls-files --columns dataset_name,creators,path
DATASET NAME CREATORS PATH
------------------- --------- -----------------------------
my-dataset sam data/my-dataset/add-me
my-dataset sam data/my-dataset/weather/file1
my-dataset sam data/my-dataset/weather/file2
my-dataset sam data/my-dataset/weather/file3
Displayed results are sorted based on the value of the first column.
You can specify output formats by passing --format with a value of tabular,
json-ld or json.
Sometimes you want to filter the files. To do so, pass dataset names as
arguments or use the --include and --exclude flags:
$ renku dataset ls-files --include "file*" --exclude "file3"
DATASET NAME ADDED PATH LFS
------------------- ------------------- ----------------------------- ----
my-dataset 2020-02-28 16:49:02 data/my-dataset/weather/file1 *
my-dataset 2020-02-28 16:49:02 data/my-dataset/weather/file2 *
Dataset files can be listed for a specific version (tag) of a dataset using the
--tag option. In this case, files from datasets which have that specific
tag are displayed.
Unlink a file from a dataset:
$ renku dataset unlink my-dataset --include file1
OK
Unlink all files within a directory from a dataset:
$ renku dataset unlink my-dataset --include "weather/*"
OK
Unlink all files from a dataset:
$ renku dataset unlink my-dataset
Warning: You are about to remove following from "my-dataset" dataset.
.../my-dataset/weather/file1
.../my-dataset/weather/file2
.../my-dataset/weather/file3
Do you wish to continue? [y/N]:
Note
The unlink command does not delete files, only the dataset record.