Mount S3 Buckets in Renku Sessions

Renku has an optional feature that allows users to mount and access data in any S3-compatible object storage. This feature is enabled at renkulab.io but please note that on other Renku deployment this may not be the case. If your deployment does not have this feature enabled you should contact your administrator.

Amazon AWS initially came up with the S3 API for object storage. However, nowadays almost all cloud providers have S3-compatible object storage. For example, Google Cloud Storage is S3 compatible and can be used in Renku. Please refer to your cloud provider documentation for more details.

Step-by-step Instructions

There are many public datasets that are hosted by AWS or other cloud providers on S3-compatible object storage. One such dataset is the Genome in a Bottle (GIAB) dataset hosted on AWS. We will use the GIAB dataset to demonstrate how to bring data in S3 to a Renku interactive session:

  1. Navigate to the Sessions tab in your Renku project.

  2. Click the green New Session button on the right side of the page.

  3. Click on the Do you want to select the branch, commit, image or configure cloud storage link.

  4. Click on the Configure Cloud Storage button in the section that was revealed in the previous step.

  5. In the form that opened, enter the Endpoint for the GIAB bucket as http://s3.amazonaws.com.

  6. The only other field you have to fill in is the Bucket Name field. For the GIAB bucket this is simply giab. The Access Key and Secret Key fields can be left empty because this is a public bucket.

  7. Click Save.

  8. Click Start session.

  9. Once you session is ready the bucket will be available inside your session at /cloudstorage/<bucket name>. In this case for the GIAB bucket it will be available at /cloudstorage/giab.

Limitations

There are a few limitations of this feature that users should be aware of. Please note that as we further develop this feature some of these limitations will be relaxed or fully lifted.

  • Accessing data in mounted buckets is potentially much slower than accessing data on disk. All content that is available in the bucket is fetched over the internet every time you need to access it. Therefore, using this feature will be slower compared to when data is downloaded to disk. However, the benefit is that mounting a bucket does not require any additional disk space. So you could mount a bucket which has 1TB of data in your session and you would not need to request 1TB of storage from Renku. We are investigating methods to provide caching or other means to improve the performance.

  • All mounted buckets have to have unique names. There is no limit to how many buckets can be mounted in a session but because all buckets are mounted in the same folder (i.e. /cloudstorage) if you try to mount two buckets with the same name then one mount would overwrite the other. However, Renku will not let you get to this point and will prevent you from launching a session if there are duplicate bucket names (regardless of the endpoint) across all the buckets you are trying to mount.

  • Buckets can only be mounted in `read-only` mode regardless of whether your credentials (if provided) allow you to read and write in the bucket you are mounting. This is just a precaution that may eventually be removed as the feature is further developed.