Mount S3 Buckets in Renku Sessions
Renku has an optional feature that lets users mount and access data in any S3-compatible object storage. This feature is enabled at renkulab.io, but it may not be enabled on other Renku deployments. If your deployment does not have this feature enabled, contact your administrator.
Amazon Web Services (AWS) originally introduced the S3 API for object storage. Today, many cloud providers offer S3-compatible object storage. For example, Google Cloud Storage is S3 compatible and can be used in Renku. Please refer to your cloud provider's documentation for more details.
Step-by-step Instructions
Many public datasets are hosted by AWS or other cloud providers on S3-compatible object storage. One such dataset is the Genome in a Bottle (GIAB) dataset hosted on AWS. We will use the GIAB dataset to demonstrate how to bring data from S3 into a Renku interactive session:
1. Navigate to the Sessions tab in your Renku project.
2. Click the green New Session button on the right side of the page.
3. Click on the "Do you want to select the branch, commit, image or configure cloud storage" link.
4. Click on the Configure Cloud Storage button in the section that was revealed in the previous step.
5. In the form that opened, enter the Endpoint for the GIAB bucket as `http://s3.amazonaws.com`.
6. The only other field you have to fill in is the Bucket Name field. For the GIAB bucket this is simply `giab`. The Access Key and Secret Key fields can be left empty because this is a public bucket.
7. Click Save.
8. Click Start session.
9. Once your session is ready, the bucket will be available inside your session at `/cloudstorage/<bucket name>`. For the GIAB bucket, this is `/cloudstorage/giab`.
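Once the session is running, the mounted bucket should behave like an ordinary directory, so standard file APIs can be used to inspect it. The following is a minimal sketch, assuming the `/cloudstorage/giab` mount path from the example; the helper name `list_mount` is hypothetical and not part of Renku.

```python
from pathlib import Path


def list_mount(mount_point, limit=10):
    """Return up to `limit` entry names found directly under `mount_point`."""
    root = Path(mount_point)
    if not root.is_dir():
        # The directory only exists when cloud storage was configured
        # for this session.
        raise FileNotFoundError(f"{root} is not mounted in this session")
    names = sorted(entry.name for entry in root.iterdir())
    return names[:limit]


if __name__ == "__main__":
    # Mount path taken from the GIAB example.
    try:
        for name in list_mount("/cloudstorage/giab"):
            print(name)
    except FileNotFoundError as err:
        print(err)
```

Note that the sketch reads the full top-level listing before sorting it, which may take a moment for buckets with many entries.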
Limitations
There are a few limitations of this feature that users should be aware of. Please note that as we further develop this feature some of these limitations will be relaxed or fully lifted.
Accessing data in mounted buckets is potentially much slower than accessing data on disk, because the content of the bucket is fetched over the internet every time you access it. The benefit is that mounting a bucket does not require any additional disk space. For example, you could mount a bucket that holds 1 TB of data in your session without requesting 1 TB of storage from Renku. We are investigating caching and other means of improving performance.
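If a session reads the same file repeatedly, one possible workaround is to copy that file to local disk once and read the local copy afterwards. The following is a minimal sketch of that idea, not a Renku feature; the helper name `fetch_once` and the cache directory are hypothetical, and the copy does consume session disk space.

```python
import shutil
from pathlib import Path


def fetch_once(remote_file, cache_dir):
    """Copy `remote_file` into `cache_dir` unless a copy already exists.

    Returns the path of the local copy.
    """
    source = Path(remote_file)
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    local = cache / source.name
    if not local.exists():
        # Copy under a temporary name first so that an interrupted
        # transfer never leaves a partial file under the final name.
        partial = local.with_name(local.name + ".part")
        shutil.copyfile(source, partial)
        partial.replace(local)
    return local
```

The cache is keyed by file name only, so two files with the same name in different bucket folders would collide; a real workflow would need a more careful naming scheme.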
All mounted buckets have to have unique names. There is no limit to how many buckets can be mounted in a session, but because all buckets are mounted in the same folder (i.e. `/cloudstorage`), if you try to mount two buckets with the same name then one mount would overwrite the other. However, Renku will not let you get to this point and will prevent you from launching a session if there are duplicate bucket names (regardless of the endpoint) across all the buckets you are trying to mount.

Buckets can only be mounted in `read-only` mode regardless of whether your credentials (if provided) allow you to read and write in the bucket you are mounting. This is just a precaution that may eventually be removed as the feature is further developed.
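The unique-name rule can be checked before requesting a session. The sketch below only illustrates the rule as described here and is not Renku's actual validation code; the function name `duplicate_bucket_names` and the second endpoint are hypothetical.

```python
from collections import Counter


def duplicate_bucket_names(buckets):
    """Return bucket names that appear more than once, ignoring the endpoint.

    `buckets` is an iterable of (endpoint, bucket_name) pairs.
    """
    counts = Counter(name for _endpoint, name in buckets)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    requested = [
        ("http://s3.amazonaws.com", "giab"),
        ("https://storage.example.org", "giab"),  # hypothetical endpoint
        ("https://storage.example.org", "results"),
    ]
    print(duplicate_bucket_names(requested))  # ['giab']
```

Because every bucket is mounted under `/cloudstorage/<bucket name>`, the endpoint plays no part in the comparison: two buckets named `giab` conflict even when they live at different providers.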