Based on a discussion between Jim, Anna, and Chris Johnson on 9/17/19


Dataverse's current design:

The normal Dataverse flow is for files to be uploaded via the Dataverse UI or the Uploader, with the bytes travelling through the Glassfish server to create a temporary local file. After this, Dataverse attempts to unzip any zip files, and (once the 'Save' button is hit in the UI) the temporary files (the set uploaded plus any created through unzipping) are transferred to the final storage, which can be a local file system or an S3 store. If full-text indexing is enabled, Dataverse retrieves the file bytes from the store to process them (i.e. to index all the words it finds). For download, the default process is for the Dataverse app/Glassfish to read the file and stream the bytes to the user. When S3 is used, there is an option to have Dataverse generate a temporary S3 URL and send that to the user's browser, at which point the browser retrieves the file bytes directly from S3.
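To make the temporary-URL mechanism concrete, here is a minimal sketch in Java (using the AWS SDK for Java v1, which fits the Glassfish-era stack) of generating a pre-signed S3 GET URL. The bucket and key names are hypothetical; this illustrates the general technique, not Dataverse's actual code:

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedDownloadExample {
    public static void main(String[] args) {
        // Hypothetical bucket and key, for illustration only.
        String bucket = "dataverse-files";
        String key = "10.5072/FK2/EXAMPLE/datafile.bin";

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // The URL expires shortly (10 minutes here); until then, whoever
        // holds it can fetch the object directly from S3, so the file
        // bytes never stream through Glassfish on download.
        Date expiration = new Date(System.currentTimeMillis() + 10 * 60 * 1000);

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, key)
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);

        URL url = s3.generatePresignedUrl(request);
        System.out.println("Send to the user's browser: " + url);
    }
}
```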

TRSA Requirements and Design Constraints:

The primary intent of the TRSA (as I currently understand it) is to enable a Dataverse instance to handle larger files (and therefore larger datasets) by leveraging remote storage (i.e. Corral at TACC). In addition to requiring Dataverse to be able to manage files stored at the remote location, this nominally requires avoiding the creation of temporary files on the Dataverse machine and avoiding streaming data bytes through the Dataverse/Glassfish app, since both add overhead (storage space and memory/processing power) and slow uploads. While use cases are still being refined, the nominal intent is to have Dataverse work the ~same way for larger data as it does today, with the caveat that, when data is already stored at TACC, it wouldn't make sense to transfer it away from TACC to a user's 'local' machine (wherever local is) just to send it back. 'Working the same way' implies that Dataverse is managing the files and is able to control access, restrict/unrestrict them, store metadata and fixity hashes for them (so the files shouldn't be editable directly by the end user once they have been sent to Dataverse), etc.

Discussed Options:

The online documentation states that Corral can be used as a local file system or via S3 or iRODS. Our discussion focused on these and a couple of variants (below). (Somewhat implicit in this discussion is the idea that the user experience will be fairly independent of which storage access design is chosen. This is due to Dataverse's design, which hides storage behind a storage-agnostic internal API; a sketch of that idea follows.)
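As an illustration of that last point, here is a hypothetical sketch of what a storage-agnostic interface can look like. The type and method names below are invented for this note; they are not Dataverse's actual internal API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical storage-agnostic interface: each backend (local file
// system, S3, iRODS, ...) supplies its own implementation.
interface StorageDriver {
    InputStream openForRead(String storageIdentifier) throws IOException;
    OutputStream openForWrite(String storageIdentifier) throws IOException;
    void delete(String storageIdentifier) throws IOException;
    long size(String storageIdentifier) throws IOException;
}

// Higher-level code (upload, download, full-text indexing) works only
// against the interface, so swapping the backend doesn't change the
// user-facing behavior.
class FileDownloadService {
    private final StorageDriver storage;

    FileDownloadService(StorageDriver storage) {
        this.storage = storage;
    }

    InputStream streamFile(String storageIdentifier) throws IOException {
        return storage.openForRead(storageIdentifier);
    }
}
```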

Conclusions:

Given that Dataverse currently has an S3 implementation to start from, and nothing similar for iRODS, and with TACC being able to support S3 with STS, it looks like the best approach is to use S3 with STS, extending the existing S3 store so that uploads and downloads move directly between the user's machine and the store rather than through Glassfish.

This approach has other advantages: it should improve throughput to any S3 store supporting STS (i.e. TDL's existing storage, as well as other institutions using S3 storage), and it is conceptually simpler than having to coordinate Dataverse's access controls with those of the underlying store. (If we did that, it's likely that tools such as the Uploader would also have to understand the storage's access controls, whereas STS looks like a normal HTTP upload to the Uploader once it gets the required URL-with-token from Dataverse; a sketch of that flow appears below.)
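To make the direct-upload flow concrete, here is a hedged sketch of how Dataverse could mint a short-lived upload URL: ask STS for temporary credentials, then pre-sign a PUT URL with them. The bucket/key names are hypothetical, and the exact STS mechanism (session token vs. federation token vs. an assumed role) would depend on how TACC's S3/STS endpoint is set up:

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.Credentials;
import com.amazonaws.services.securitytoken.model.GetSessionTokenRequest;

import java.net.URL;
import java.util.Date;

public class DirectUploadUrlExample {
    public static void main(String[] args) {
        // Hypothetical bucket/key; in practice these would come from the
        // dataset the user is adding files to.
        String bucket = "dataverse-files";
        String key = "10.5072/FK2/EXAMPLE/newfile.bin";

        // Ask STS for short-lived credentials (15 minutes here).
        AWSSecurityTokenService sts =
                AWSSecurityTokenServiceClientBuilder.defaultClient();
        Credentials temp = sts.getSessionToken(
                new GetSessionTokenRequest().withDurationSeconds(900))
                .getCredentials();

        // Build an S3 client on those temporary credentials and pre-sign
        // a PUT URL. To the Uploader this is just a normal HTTP PUT; the
        // file bytes go straight to the store, never through Glassfish.
        BasicSessionCredentials session = new BasicSessionCredentials(
                temp.getAccessKeyId(),
                temp.getSecretAccessKey(),
                temp.getSessionToken());
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(session))
                .build();

        URL uploadUrl = s3.generatePresignedUrl(
                new GeneratePresignedUrlRequest(bucket, key)
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(new Date(
                                System.currentTimeMillis() + 15 * 60 * 1000)));
        System.out.println("Hand to the Uploader: " + uploadUrl);
    }
}
```

Because the URLs expire quickly and only Dataverse holds the long-lived credentials, access control, restriction, and fixity checking stay entirely on the Dataverse side.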

Other Potential Options:

Our discussion didn't cover the options below, but it's worth noting some of their potential strengths and weaknesses, indicating why they are not (at least in my opinion) as useful as S3, and considering whether, if S3 with STS proves more challenging than expected, they might be viable alternatives.

Rsync: Dataverse's current rsync support uses the rsync protocol to allow users to move files directly from a machine (not necessarily the one they are running their browser on) to storage that Dataverse can access. While the ability to transfer from a machine different than the one the user is viewing Dataverse on is useful, the interface as implemented requires a fairly sophisticated user (one initiates a transfer in Dataverse, gets a script that can be used to do the rsync, and then goes back to Dataverse to indicate the transfer is complete), and there are limitations w.r.t. Dataverse functionality (I don't recall all the specifics, but a dataset cannot mix rsync'd and normally uploaded files, and Dataverse treats the whole set of files in such a dataset as one package, which I think limits some of the ability to make changes to individual files). The upshot is that, while one could potentially see rsync as analogous to S3 (without STS), the way it's been implemented in Dataverse provides a fairly different user experience for uploading and doesn't provide a better starting point than leveraging the S3 storage implementation.

Odum's TRSA: Part of the discussion prior to this work starting was that Odum's implementation has been designed with significant additional requirements in mind (e.g. allowing secure direct access to sensitive files), so the user experience is likely to be significantly different than standard Dataverse's. It also involves additional TRSA-specific applications/services that have to be run, making it heavier-weight on the operations side as well. As with the rsync implementation, there may be design ideas that can be leveraged, but it also looks like the existing TRSA work is not directly useful in connecting Dataverse to TACC's storage.

GridFTP/GlobusTransfer: Globus has created an online transfer service leveraging the GridFTP protocol (an extension of FTP) underneath. This solution is used in high-performance computing centers because the underlying protocol supports third-party transfers (moving data from a computer at center A to one at center B while starting the process from my local machine) and because it can use multiple streams to transfer files (enabling it to use more of the available bandwidth between sites than a simple HTTP transfer of the same file). I'm not aware of any analog of STS for Globus transfer, but aside from that, this could be a viable option. Since GlobusTransfer is a 'web app' on its own, it may be most useful if integration into the Dataverse UI proves problematic - i.e. with GT, Dataverse might, instead of providing a script as with the rsync implementation, just create a link to the GT web app where a user would drag/drop files to the right location for Dataverse. My sense is that, while S3 is more straightforward, Globus is mature enough, and has a large enough user base among large-data/HPC users, that it would probably be a better second choice than rsync or Odum's TRSA (probably in competition with iRODS for second place).