The initial development and conceptual-model validation work follows the design as shown. It requires changes to both the Dataverse software and the DVUploader.

Detailed Event Sequence

  1. From their local machine (where the data resides as files), the user runs the DVUploader.

  2. DVUploader scans the directories/files specified (as it normally does) and, for each file, requests a pre-signed upload URL from a Dataverse collection or repository, to be used to upload that file for a given Dataset.

  3. The Dataverse software, using the secret keys for its configured S3 storage, creates a short-lived URL that allows upload of one new file directly to the storage area in S3 specified for the Dataset.

  4. DVUploader uses the URL to do an HTTP PUT of the data directly to S3, which avoids streaming the data through Glassfish and into a temporary file on the Dataverse server. Transfer speed is then governed by the network speed between the local machine and the S3 store, not by the bandwidth to/from the Dataverse server or the disk read/write speed on that server.

  5. DVUploader calls the existing Dataverse API call /api/datasets/{dataset id}/add but, instead of sending the file bytes, it sends the ID of the file as stored in S3, along with its name, mimetype, MD5 hash, and any directoryLabel (path) that would normally be sent.

  6. The Dataverse software runs through its normal steps to add the file and its metadata to the Dataset, currently skipping steps that would require access to the file bytes (e.g. unzipping the file, inspecting it to infer a better mimetype, extracting metadata, creating derived files, etc.). The net result for a file that would not trigger such special processing is exactly the same as if the file had been uploaded via the web interface through the Dataverse software.
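As a rough illustration of steps 2 through 5 from the client's point of view, the following Python sketch runs the sequence for one file. It is not the DVUploader code: the three transport functions are supplied by the caller so that no endpoint details are assumed, and the record field names (storageIdentifier, fileName, mimeType, md5Hash) are hypothetical. Only directoryLabel is named in the design above.

```python
import hashlib
import mimetypes
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files are never held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def direct_upload(path, dataset_id, request_upload_url, put_to_store,
                  register_file, directory_label=None):
    """Run steps 2-5 for one file using caller-supplied transport functions."""
    path = Path(path)
    # Steps 2-3: Dataverse returns a short-lived URL plus the storage ID it chose.
    grant = request_upload_url(dataset_id)
    # Step 4: the bytes go straight to the S3 store, not through Dataverse.
    with open(path, "rb") as handle:
        put_to_store(grant["url"], handle)
    # Step 5: register the stored object by ID instead of sending the bytes.
    # Field names below are hypothetical, chosen only for this sketch.
    record = {
        "storageIdentifier": grant["storageIdentifier"],
        "fileName": path.name,
        "mimeType": mimetypes.guess_type(path.name)[0] or "application/octet-stream",
        "md5Hash": md5_of(path),
    }
    if directory_label:
        record["directoryLabel"] = directory_label
    return register_file(dataset_id, record)
```

Passing the transport functions in keeps the sketch runnable without a live Dataverse instance or S3 store; a real client would replace them with HTTP calls.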

Proof-of-Concept (POC) Achievements:

Direct Upload:

The work so far shows that it is possible to upload data directly from a local machine to an S3 store without going through Payara, without using temporary local storage on the Dataverse server, and without using the network between the local machine and the Dataverse server or between the Dataverse server and the S3 store. Performance testing still needs to be done, but previous testing showed that Payara and/or the temporary local storage add delays and server load, so bypassing them should make uploads faster. If the network between the data and the S3 store is faster (e.g. the data is local to the S3 store), additional performance gains would be expected.

The POC also shows that this design works with both Amazon’s S3 implementation and the Minio S3 implementation (which is in use at TACC). There are minor differences between the two, which are handled in the Dataverse software and the DVUploader.

The design itself was intended to allow direct upload without creating a security concern that a user could upload/edit/delete other files in S3. Unlike designs in which the S3 keys used by the Dataverse software, or derivative keys for a specific user, would have to be sent to the user’s machine, where they could potentially be misused or stolen, this design sends a presigned URL that only allows a PUT HTTP call to upload one file, with the location/id of that file specified by the Dataverse software. The S3 keys at the Dataverse instance are used to create a cryptographic signature that is included as a parameter in the URL. The S3 implementation can use that signature to verify that the PUT, for this specific file, was authorized by the Dataverse instance. Any change made to try reading/deleting/editing this or any other file would invalidate the signature. The signature is also set to be valid for a relatively short time (configurable, default is 60 minutes), further limiting opportunities for misuse.

Note that using the Dataverse software API requires having the user’s Access Key (generated via the Dataverse software GUI). That key allows the user to do anything via the API that they can do via the Dataverse software GUI. For the discussion here, the important point is that this access key, which is already required for using the DVUploader with the standard upload mechanism, is more powerful, and more important to keep safe, than the presigned URLs added by the new design. (FWIW: There are discussions at IQSS/GDCC about how to provide more limited API keys from the Dataverse instance that would mimic the presigned URL mechanism.)
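To illustrate why a presigned URL is only good for one operation on one object, here is a deliberately simplified Python sketch of the idea. It is not the actual S3 signing algorithm (real stores use a more elaborate scheme); the URL layout, parameter names, and the store.example host are assumptions made for the example. The point is only that the signature covers the HTTP method, the object key, and the expiry time, so changing any of them fails verification.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _sign(secret, method, object_key, expires_at):
    """HMAC over the three things the URL is meant to authorize."""
    message = f"{method}\n{object_key}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def presign(secret, method, object_key, expires_at):
    """Dataverse side: build a URL carrying the expiry and the signature."""
    query = urlencode({
        "expires": expires_at,
        "signature": _sign(secret, method, object_key, expires_at),
    })
    return f"https://store.example/{object_key}?{query}"


def verify(secret, method, url, now):
    """Store side: recompute the signature from the request as received."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    expires_at = int(params["expires"][0])
    object_key = parsed.path.lstrip("/")
    expected = _sign(secret, method, object_key, expires_at)
    # Reject expired URLs and any request whose method or key was altered.
    return now <= expires_at and hmac.compare_digest(expected, params["signature"][0])
```

In this sketch the secret never leaves the two servers; the client only ever holds the URL, which stops working once the expiry passes.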

In addition to validating the design, the POC involved working through the Dataverse software’s two-phase, ~10-step upload process and learning how to separate and, for now, turn off the steps that involve reading the file itself, while keeping the processing needed to add the file to the dataset, record its metadata, create a new dataset version if needed, etc. While this code will probably need further modification/clean-up, it’s a significant step to have the POC working.

Multiple Storage:

The POC now (as of Dec. 5th) supports the configuration of multiple file stores for a Dataverse instance. Implementing this involved code changes in classes related to file/s3/swift access and in other places where the storage location of a dataset is interpreted. It requires changing some Glassfish Java properties, but it does not affect Dataverse’s database structure and can be made backward compatible for current data.

In stock Dataverse today, the storageidentifier field for files includes a prefix representing the type of file store used (e.g. “s3://<long random id number>” indicating s3 storage), and the code assumes that all files are in the same store (because Dataverse assumes only one file path, one set of s3 or swift credentials, etc.). In the update, the storageidentifier has a prefix indicating a store identifier, such as s3:// or s3tacc://, and Dataverse looks at the properties associated with that store to determine its type (file, s3, or swift) and any type-specific options for that store. With these changes, one can have file1:// and file2:// stores sending files into different paths/file systems, or s3:// and s3tacc:// stores sending data to different s3 instances with different credentials, etc. Together, these changes allow a modified instance to access data from multiple places at once.
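The prefix-to-store lookup described above can be sketched in a few lines of Python. This is an illustration only, not the Dataverse implementation (which is Java): the STORES table, its property names (type, path, bucket, endpoint), and the example values are all hypothetical stand-ins for the per-store properties.

```python
# Hypothetical per-store properties, keyed by store identifier.
STORES = {
    "file1": {"type": "file", "path": "/data/primary"},
    "s3": {"type": "s3", "bucket": "dataverse-main"},
    "s3tacc": {"type": "s3", "bucket": "dataverse-tacc",
               "endpoint": "https://s3.tacc.example"},
}


def resolve_store(storage_identifier, stores=STORES):
    """Split '<store id>://<location>' and look up that store's properties."""
    store_id, separator, location = storage_identifier.partition("://")
    if not separator:
        raise ValueError(f"no store prefix in {storage_identifier!r}")
    if store_id not in stores:
        raise ValueError(f"unknown store {store_id!r}")
    # The store's type comes from its properties, not from the prefix text,
    # so 's3' and 's3tacc' can both be S3 stores with different settings.
    return store_id, stores[store_id], location
```

For example, resolve_store("s3tacc://16e3a4f0") would return the s3tacc properties with location "16e3a4f0", while an identifier with an unconfigured prefix raises ValueError.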

To decide where new data should be sent, the POC code includes a default location and, as of Jan. 2020, allows superusers to select a ‘storage driver’ from a drop-down list for a given Dataverse; that driver is then used for all datasets within that Dataverse. For example, a Dataverse with “TACC” as a storage driver would have files stored in the s3tacc:// store. (“TACC” is the label specified in the configuration for the s3tacc store - arbitrary strings can be used.)
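A minimal Python sketch of that selection rule follows, assuming a simple label-to-store table. The table contents, the default store, and the function names are hypothetical; only the “TACC” label and the s3tacc store come from the example above.

```python
# Hypothetical mapping from the label shown in the drop-down to a store id.
DRIVER_LABELS = {"TACC": "s3tacc", "Local": "file1"}
DEFAULT_STORE = "s3"


def store_for_new_files(driver_label=None, labels=DRIVER_LABELS,
                        default_store=DEFAULT_STORE):
    """Pick the store for a Dataverse: its chosen driver, else the default."""
    if driver_label is None:
        return default_store
    if driver_label not in labels:
        raise KeyError(f"no store configured for label {driver_label!r}")
    return labels[driver_label]


def new_storage_identifier(driver_label, object_id):
    """Build the prefixed storageidentifier for a newly uploaded file."""
    return f"{store_for_new_files(driver_label)}://{object_id}"
```

Because the label is only a display string, renaming it in the configuration would not change the store identifier already recorded for existing files.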

Next Steps:

There is additional functionality that will be important to creating a production capability. Some items are a ‘simple matter of programming’, where the functionality needed is probably not controversial, while others may need further requirements/design discussion.

Simply turning the skipped file-processing steps back on, which would involve Dataverse retrieving the entire file from S3 one or more times, would be relatively simple, though it would have performance impacts. It may make sense to add configuration options that would allow any of these steps to be turned on/off per store, or up to a given file size limit, etc. It would also be possible to shift more of this processing to the background (e.g. creating a .tab file is already done after the HTTP call to upload the file returns), although doing steps like unzipping this way would mean the Dataverse web interface could not show the list of files inside the zip during upload. More complex options, such as moving such processing to a machine local to the S3 store, are also possible (e.g. an app that would inspect the remote file and only send a new mimetype or extracted metadata to Dataverse, instead of Dataverse having to pull the entire file from S3 itself).
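One possible shape for the per-store on/off and size-limit options is sketched below in Python. Nothing here exists in the POC: the ProcessingPolicy class, the step names, and the example limits are hypothetical, intended only to make the configuration idea concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessingPolicy:
    """Which byte-reading steps a store allows, and up to what file size."""
    enabled_steps: frozenset = frozenset()
    max_bytes: Optional[int] = None  # None means no size limit


def should_run(step, store_id, file_size, policies):
    """Decide whether a step that needs the file bytes should run."""
    policy = policies.get(store_id)
    # Stores without a policy keep the current POC behaviour: skip the step.
    if policy is None or step not in policy.enabled_steps:
        return False
    return policy.max_bytes is None or file_size <= policy.max_bytes


# Hypothetical configuration: cheap checks everywhere, unzip only when small.
POLICIES = {
    "s3": ProcessingPolicy(frozenset({"detect_mimetype", "unzip"}),
                           max_bytes=100 * 1024 * 1024),
    "s3tacc": ProcessingPolicy(frozenset({"detect_mimetype"})),
}
```

With this configuration, should_run("unzip", "s3", 5_000_000, POLICIES) is true, while the same call for the s3tacc store, or for a file over the 100 MiB limit, is false.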