Data in Renku
Renku uses git Large File Storage (LFS) for handling data. Using standard git for large files is difficult because the data itself is converted into git objects and placed in the repository. Git LFS lets you include large data files in your repository efficiently while using more or less just the standard set of git commands. In addition, git LFS places large files on the server and allows you to work with your repository without actually having a local copy of the data. You can pull the data from the server as it becomes needed, saving you time and resources.
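To see what this looks like in practice: when the content of an LFS-tracked file has not been downloaded, the file in your working copy is just a small text pointer. A sketch with a hypothetical file1.csv (the oid and size values are made up):

$ cat file1.csv
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a2146a5e2ca24d17e239314ab2935c943f9e0ff69d22eadbb8f32b1258daa
size 1073741824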
However, git LFS is not efficient when working with large amounts of data. We therefore strongly recommend that Renku users, especially those dealing with large amounts of data, either store data files locally outside the Renku project folder on a local filesystem, or, better yet, provision External Storage in Renku Sessions, such as S3 or Azure Blob storage.
Using git LFS responsibly
The default configuration of git LFS is to pull data from the server into the working copy of the repository. This is fine if the data is reasonably small (~GB scale), but as it becomes larger, the default behavior can start to pose problems.
Imagine a Renku project with 100GB of data in LFS. If a few collaborators all decide to work on the project at the same time and launch the JupyterLab environment to iterate over some changes, each of them might attempt to download the full 100GB of data into their JupyterLab session. Not only will this take a long time, but it might also eventually lead to resource starvation on the host node.
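If you need a working copy of such a repository outside of a Renku session but want to avoid downloading all of the LFS data up front, one option is to skip the automatic LFS download at clone time and then pull only the paths you need. A sketch, where the repository URL and the data path are placeholders:

$ GIT_LFS_SKIP_SMUDGE=1 git clone <repository-url>
$ cd <repository>
$ git lfs pull -I data/subset/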
Working with external cloud storage
If you set up external storage from the beginning of your project, you can share and handle your data more easily. Note that access rights to your external cloud storage data are controlled directly through your cloud storage provider.
The instructions to set up your cloud storage for a Renku project are in External Storage in Renku Sessions.
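As a rough, Renku-independent illustration: once your provider credentials are configured, you can inspect and fetch objects with the provider's standard tooling, for example the AWS CLI for S3 (the bucket and file names below are hypothetical):

$ aws s3 ls s3://my-project-data/
$ aws s3 cp s3://my-project-data/file1.csv .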
Data in JupyterLab sessions
Because of these resource concerns, we do not pull data into user JupyterLab sessions by default. As a result, you need to know how to handle data stored in LFS in order to use it efficiently in your work with Renku.
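For example, inside a running session you can fetch only the files you actually need by pulling specific paths from LFS (the path here is hypothetical):

$ git lfs pull -I data/file1.csv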
Uploading data to a RenkuLab session to create a dataset with the CLI
You can use the renku dataset CLI command to create a dataset with data that is already present in your JupyterLab or RStudio session or with data that is on your local computer. For example, you can drag and drop files from your computer into the JupyterLab window to upload them and then use the renku dataset command to create a dataset, add the files to the dataset and also check them into git with LFS. Assuming that you have uploaded three files at the root of your repository named file1.csv, file2.csv and file3.csv, you can run the following command to create a dataset from them:
$ renku dataset add --create my-new-dataset file1.csv file2.csv file3.csv
Besides creating a Renku dataset, the command automatically tracks the files with LFS and commits them. In addition, you can use shell-like wildcards (e.g. * and ?) when specifying paths to be added instead of explicitly naming every file, as shown below.
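For instance, assuming the three CSV files above, an equivalent of the earlier command using a wildcard would be:

$ renku dataset add --create my-new-dataset file*.csv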
Renku LFS configuration
Renku by default stores all files larger than 100kb in LFS to prevent slowing down git (and thus renku) with large files. This limit can be changed by running:
$ renku config set lfs_threshold <size>
where <size> is a file size formatted like 10b, 100kb, 0.5mb or 10gb.
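For example, to store only files larger than 500kb in LFS (the value is just an illustration):

$ renku config set lfs_threshold 500kb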
Additionally, paths can be excluded from LFS storage by Renku commands by editing the .renkulfsignore file in the project root folder. This file follows the .gitignore convention. Files matching a pattern in .renkulfsignore will never be added to git LFS by a Renku command like renku run or renku dataset add.
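As an illustration, a hypothetical .renkulfsignore that keeps notebooks and small JSON metadata files out of LFS might look like this:

$ cat .renkulfsignore
notebooks/
*.json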
Useful git LFS commands
- git lfs ls-files: shows all the files currently in LFS
- git lfs pull -I <path> [remote]: pulls a specific path from LFS; it can be a single file or an entire folder
- git lfs migrate import --fixup --include-ref=refs/heads/master: moves files into LFS; use this command if you accidentally committed large files to a repo
Note that you can also use wildcards, e.g. git lfs pull -I "data/records_201*.csv", but be sure to include quote characters (" or ') when you use wildcards.
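For example, a sketch of recovering from accidentally committing large files directly to git: the fixup migration rewrites history, so the rewritten branch has to be force-pushed afterwards (coordinate with collaborators before doing this):

$ git lfs migrate import --fixup --include-ref=refs/heads/master
$ git push --force-with-lease origin master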
See the git lfs tutorial for details.