File transfer design options

Based on a discussion between Jim, Anna, and Chris Johnson on 9/17/19 (see https://docs.google.com/document/d/1lieOItsg3lgjAaM5VYJlX5kXGZDRLRNusqcKqtB8Kaw/edit#heading=h.2adhpm8wvida)


Dataverse current design:

The normal Dataverse flow is for files to be uploaded via the Dataverse UI or an uploader, with bytes travelling through the Glassfish server to create a temporary local file. After this, Dataverse attempts to unzip any zip files, and (once the 'Save' button is hit in the UI) the temporary files (the set uploaded and any created through unzipping) are transferred to the final storage, which could be a local file system or an S3 store. If full-text indexing is enabled, Dataverse will retrieve file bytes from the store to process them (i.e. to index all the words it finds). For download, the default process is for the Dataverse app/Glassfish to read the file and stream the bytes to the user. When S3 is used, there is an option to have Dataverse generate a temporary S3 URL and send it to the user's browser, at which point the browser retrieves the file bytes directly from S3.

TRSA Requirements and Design Constraints:

The primary intent of the TRSA (as I currently understand it) is to enable a Dataverse instance to handle larger files (and therefore larger datasets) by leveraging remote storage (i.e. Corral at TACC). In addition to requiring Dataverse to be able to manage files stored at the remote location, this nominally requires avoiding the creation of temporary files on the Dataverse machine and avoiding streaming data bytes through the Dataverse/Glassfish app, since both of these add overhead (storage space and memory/processing power) and slow uploads. While use cases are still being refined, the nominal intent is to have Dataverse work the ~same way for larger data as it does today, with the caveat that, when data is already stored at TACC, it wouldn't make sense to transfer it away from TACC to a user's 'local' machine (wherever local is) just to send it back. 'Working the same way' implies that Dataverse is managing the files and is able to control access, restrict/unrestrict them, store metadata and fixity hashes for them (so the files shouldn't be editable directly by the end user once they have been sent to Dataverse), etc.
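
Because fixity hashes have to be produced for files that may be too large to hold in memory, a minimal sketch of computing one in fixed-size chunks may help. The function name `fixity_hash`, the 1 MiB chunk size, and the choice of SHA-256 are assumptions made for illustration only; the checksum algorithm a given Dataverse instance uses may differ.

```python
import hashlib
import io


def fixity_hash(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a binary stream, reading it chunk by chunk.

    Hypothetical helper: only one chunk is held in memory at a time,
    so the file size does not drive memory use.
    """
    digest = hashlib.new(algorithm)
    # iter() with a sentinel stops once read() returns b"" at end of stream.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Any binary file object works; BytesIO stands in for a large file here.
    print(fixity_hash(io.BytesIO(b"example file contents")))
```

If uploads bypass Dataverse, a hash like this would presumably have to be computed by the client or by the store and then recorded by Dataverse; which side does so is left open here.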

Discussed Options:

The online documentation states that Corral can be used as a local file system or via S3 or iRODS. Our discussion focused on these and a couple of variants (below). (Somewhat implicit in this discussion is the idea that the user experience is fairly independent of the choice of storage access design. This is due to Dataverse's design, which hides storage behind a storage-agnostic internal API.)

  • Local file system. A Dataverse instance could be run at TACC and configured to use Corral as a local file system. Nominally this requires no programming in the storage layer, but would require moving a/all Dataverse instance(s) to TACC and, if only one is moved there, figuring out how to federate it with those managed 'at' TDL. This option would also not address Dataverse creating a temporary copy and having bytes stream through the Dataverse app.
  • iRODS. iRODS could nominally be configured to serve as the file store for Dataverse, either through deploying it as a 'FUSE' file system, where it would look like a local file system to a Dataverse running at TDL but would internally manage transfer of data to iRODS running at TACC, or by calling the iRODS API directly. The latter would require developing an iRODS storage class for Dataverse. In discussion, the former was seen as nominally satisfying requirements but adding complexity and overhead that could affect performance and reliability. The latter approach would avoid these issues, but was still seen as 'overkill' since iRODS is also designed to handle metadata, replication, workflows over data, etc. that would probably not be used (at least initially - one could contemplate larger changes to Dataverse to use more iRODS functionality, but this would require changing Dataverse's internal storage API).
  • S3: S3 is a minimalist approach to file storage that is highly scalable. Since Dataverse manages access control and file metadata (including naming and folder path), it is able to use S3 for its storage layer. The primary challenge in using it for TRSA would be to have a mechanism to stream data to S3 without transferring it through Dataverse/using temporary files at TDL. The Secure Token Service (STS) looks like it provides such a mechanism and Chris confirmed that TACC's implementation of S3 can support STS. (The implicit challenge being addressed is that, since Dataverse controls access and decides who can upload files to which datasets, if the data bytes don't stream through Dataverse, Dataverse needs a way to delegate to the user (or software such as the Dataverse Uploader) the right to upload (but then not edit) a specific file(s) to S3. STS addresses this by allowing Dataverse to request a token that can be given to the user. The user (their browser or Dataverse Uploader, etc.) can then make the HTTP call to upload the file to S3 and include the token in the request. When the S3 store sees the token, it allows the upload, but the user can't upload any other files or delete/replace the file after upload, etc.) With STS, the programming required would involve modifying the Dataverse upload code and/or the Dataverse Uploader to use it, essentially making a call to Dataverse to 'register' the file and retrieve a token, and then making a separate call to send the datafile directly to S3. While this may still be challenging to do in the Dataverse UI, it should be significantly less work than writing a new storage service using iRODS (to avoid sending bytes through Dataverse when using iRODS, such a storage service would really need to do something analogous to the changes required for STS anyway). Since S3 is already supported in Dataverse, and is implemented by Amazon and many universities, the resulting code will probably be more generally useful as well.
  • HTTP read-only: In discussing other methods to avoid streaming data from TACC to Dataverse and back to TACC, we discussed the idea of having users do out-of-band transfers of data to the right storage location at TACC and then providing a read-only HTTP interface for them to be downloaded. (Nominally, Dataverse would record the URLs and just point to them for download rather than storing the files internally.) This could work reasonably well if all files were public and Dataverse didn't need to restrict/unrestrict them, but it would still require additional work somewhere to manage local file permissions at TACC (i.e. a user needs permission to upload the files to the location where they can be served over HTTP but should not have access to delete or modify them after they are registered in Dataverse, and the user should not have access to any other files managed by Dataverse). Managing these permissions adds complexity that negates the benefit of this approach being the simplest (access controls could be manually managed at TACC or we'd start reinventing the types of access control already in iRODS or re-creating STS, etc.).
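
The delegation pattern described under the S3 option (Dataverse 'registers' the file and hands back a token; the client then uploads directly to the store, which honours the token for that one file only) can be sketched as a toy model. This is not the STS API: a real STS issues temporary credentials whose scope is set by policy, whereas the sketch below stands in an HMAC-signed token to show the shape of the flow. All names here (`SECRET`, `issue_upload_token`, `ToyStore`, the storage keys) are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical secret shared by the token issuer and the store (toy model only).
SECRET = b"shared-secret-between-dataverse-and-store"


def issue_upload_token(dataset_id, storage_key, ttl_seconds=900, now=None):
    """'Dataverse' side: grant an upload-only, single-object, time-limited right."""
    now = time.time() if now is None else now
    claims = {"dataset": dataset_id, "key": storage_key,
              "action": "PUT", "expires": now + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


class ToyStore:
    """Stand-in for the object store; bytes arrive here without passing through Dataverse."""

    def __init__(self):
        self.objects = {}

    def put(self, storage_key, data, token, now=None):
        now = time.time() if now is None else now
        payload, _, signature = token.partition(".")
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        # Verify the signature before trusting anything inside the payload.
        if not hmac.compare_digest(signature, expected):
            raise PermissionError("invalid token")
        claims = json.loads(base64.urlsafe_b64decode(payload))
        if claims["action"] != "PUT" or claims["key"] != storage_key:
            raise PermissionError("token does not cover this object")
        if now > claims["expires"]:
            raise PermissionError("token expired")
        if storage_key in self.objects:
            raise PermissionError("object exists; replacing it is not allowed")
        self.objects[storage_key] = data


if __name__ == "__main__":
    store = ToyStore()
    # Step 1: the client asks Dataverse to register the file and gets a token.
    token = issue_upload_token("dataset-1", "dataset-1/file-1")
    # Step 2: the client sends the bytes straight to the store with the token.
    store.put("dataset-1/file-1", b"bytes sent directly to the store", token)
```

In this toy, a second `put` with the same token, or a `put` to any other key, raises `PermissionError`, mirroring the 'upload but then not edit' constraint. How an actual S3/STS deployment at TACC would enforce the no-replace rule is not something this sketch settles.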

Conclusions:

Given that Dataverse currently has an S3 implementation to start from, and nothing similar for iRODS, and with TACC being able to support S3 with STS, S3 with STS looks like the best approach:

  • It allows Dataverse to use TACC storage and avoids the overhead of streaming data through Dataverse/to temporary files and from TACC to TDL and back if the data started at TACC
  • While adapting the existing Dataverse upload UI to support it may still be challenging, it appears to be less work than other approaches (which would have to do something similar in the UI and handle downloads)

This approach has other advantages: it should improve the throughput to any S3 store supporting STS (i.e. TDL's existing storage and that of other institutions using S3 storage), and it is conceptually simpler than having to coordinate Dataverse's access controls with those of the underlying store. (If we did that, it's likely that tools such as the Uploader would also have to understand the storage's access controls, whereas STS looks like a normal HTTP upload to the Uploader once it gets the required URL-with-token from Dataverse.)

Other Potential Options:

Our discussion didn't cover the options below, but it's worth noting some of their potential strengths/weaknesses, both to indicate why they are not (at least in my opinion) as useful as S3 and to consider whether, if S3 with STS proves to be more challenging than expected, they might be viable alternatives.

Rsync: The current implementation with Dataverse uses the rsync protocol to allow users to directly move files from a machine (not necessarily the one they are running their browser on) to storage that can be accessed by Dataverse. While the ability to transfer from a machine different than the one the user is viewing Dataverse on is useful, as implemented the interface requires a fairly sophisticated user (one initiates a transfer in Dataverse, gets a script that can be used to do the rsync, at which point one goes back to Dataverse to indicate the transfer is complete), and there are limitations w.r.t. Dataverse functionality (I don't recall all the specifics but a Dataset cannot have some rsync and some normally uploaded files and Dataverse treats the whole set of files in a dataset as one package, which I think limits some of the ability to make changes to individual files.) The upshot is that, while one could potentially see rsync as analogous to S3 (without STS), the way it's been implemented in Dataverse provides a fairly different user experience for uploading (and doesn't provide a better starting point than leveraging the S3 storage implementation).

Odum's TRSA: Part of the discussion prior to this work starting was that Odum's implementation has been designed with significant additional requirements in mind (e.g. allowing secure direct access to sensitive files) and therefore the user experience is likely to be significantly different from standard Dataverse. It also involves additional TRSA-specific applications/services that have to be run, making it heavier-weight on the operations side as well. As with the rsync implementation, there may be design ideas that can be leveraged, but it also looks like the existing TRSA work is not directly useful in connecting Dataverse to TACC's storage.

GridFTP/GlobusTransfer: Globus has created an online transfer service leveraging the FTP protocol underneath. This solution is used in high-performance computing centers because the underlying protocol supports third-party transfers (move data from a computer at center A to one at B while starting the process from my local machine) and because it can use multiple streams to transfer files (enabling it to use more of the available bandwidth between sites than a simple HTTP transfer of the same file). I'm not aware of any analog of STS for Globus transfer, but aside from that, this could be a viable option. Since GlobusTransfer is a 'web app' on its own, it may be most useful if integration into the Dataverse UI proves problematic - i.e. with GT, Dataverse might, instead of providing a script as with the rsync implementation, just create a link to the GT web app where a user would drag/drop files to the right location for Dataverse. My sense is that, while S3 is more straightforward, Globus is mature enough and has a large enough user base among large-data/HPC users that it would probably be a better second choice than rsync or Odum's TRSA (probably in competition with iRODS for second place).