The initial development/conceptual model validation work follows the design as shown. It requires changes to both the Dataverse software and the DVUploader.
...
Detailed Event Sequence
From their local machine (where the data resides as files), the user runs the DVUploader.
DVUploader scans the specified directories/files (as it normally does) and, for each file, requests a presigned upload URL from a Dataverse collection or repository to upload that file for a given Dataset.
The Dataverse software, using the secret keys for its configured S3 storage, creates a short-lived URL that allows upload of one new file directly to the storage area in S3 specified for the Dataset.
DVUploader uses the URL to do an HTTP PUT of the data directly to S3, avoiding streaming the data through Glassfish and to a temporary file on the Dataverse server. Transfer speed is then governed by the network speed between the local machine and the S3 store, not by the bandwidth to/from the Dataverse server or the disk read/write speed on that server.
DVUploader calls the existing /api/datasets/{dataset id}/add call of the Dataverse software but, instead of sending the file bytes, it sends the ID of the file as stored in S3, along with its name, mimetype, MD5 hash, and any directoryLabel (path) that would normally be sent.
The Dataverse software runs through its normal steps to add the file and its metadata to the Dataset, currently skipping steps that would require access to the file bytes (e.g. unzipping the file, inspecting it to infer a better mimetype, extracting metadata, creating derived files, etc.). The net result for a file that would not trigger such special processing is exactly the same as if the file had been uploaded via the web interface through the Dataverse software.
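The client side of the sequence above can be sketched as follows. This is a minimal illustration, not the DVUploader code: the three callables stand in for the real HTTP calls (requesting the presigned URL, the PUT to S3, and the /add call), and the metadata key names (storageIdentifier, fileName, mimeType, md5Hash) are assumptions chosen for the example rather than names confirmed by this document. Only directoryLabel is taken from the text.

```python
import hashlib
from typing import Callable, Dict, Optional


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the file bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def direct_upload(
    data: bytes,
    file_name: str,
    mime_type: str,
    request_upload_url: Callable[[], Dict[str, str]],
    put_to_storage: Callable[[str, bytes], None],
    register_file: Callable[[Dict[str, str]], None],
    directory_label: Optional[str] = None,
) -> Dict[str, str]:
    """Run the three client-side steps for one file (hypothetical helper)."""
    # Step 1: ask the Dataverse software for a presigned URL. The response
    # shape (a URL plus the storage ID it chose) is an assumption.
    ticket = request_upload_url()

    # Step 2: send the bytes straight to S3; they never pass through
    # the Dataverse server.
    put_to_storage(ticket["url"], data)

    # Step 3: register the file with the dataset by sending its storage ID
    # and metadata instead of the file bytes. Key names are illustrative.
    metadata = {
        "storageIdentifier": ticket["storageIdentifier"],
        "fileName": file_name,
        "mimeType": mime_type,
        "md5Hash": md5_hex(data),
    }
    if directory_label:
        metadata["directoryLabel"] = directory_label
    register_file(metadata)
    return metadata
```

Passing the transport functions in as parameters keeps the sketch independent of any particular HTTP library and makes the ordering of the steps easy to check with stand-in functions.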
Proof-of-Concept (POC) Achievements:
...
The work so far shows that it is possible to upload data directly from a local machine to an S3 store without going through Payara, without using temporary local storage at the Dataverse software, and without using the network between the local machine and the Dataverse software or between the Dataverse software and the S3 store. Performance testing still needs to be done, but previous testing shows that Payara and/or the temporary local storage add delays, server load, etc., so this design should make uploads faster. If the network between the data and the S3 store is faster (e.g. the data is local to the S3 store), additional performance enhancement would be expected.
The POC also shows that this design works with both Amazon’s S3 implementation and the Minio S3 implementation (which is in use at TACC). (There are minor differences that are handled in the Dataverse software and DVUploader software.)
The design itself was intended to allow direct upload without creating a security concern that a user could upload/edit/delete other files in S3. Unlike designs in which the S3 keys used by the Dataverse software, or derivative keys for a specific user, would have to be sent to the user’s machine, where they could potentially be misused or stolen, this design sends a presigned URL that only allows an HTTP PUT call to upload one file, with the location/ID of that file specified by the Dataverse software. The S3 keys at the Dataverse instance are used to create a cryptographic signature that is included as a parameter in the URL. The S3 implementation can use that signature to verify that the PUT, for this specific file, was authorized by the Dataverse instance. Any change made to try reading/deleting/editing this or any other file would invalidate the signature. The signature is also set to be valid for a relatively short time (configurable, default is 60 minutes), further limiting opportunities for misuse.
Note that using the Dataverse software API requires the user’s Access Key (generated via the Dataverse software GUI). That key allows the user to do anything via the API that they can do via the Dataverse software GUI. For the discussion here, the important point is that this access key, which is already required for using the DVUploader with the standard upload mechanism, is more powerful, and more important to keep safe, than the presigned URLs added by the new design. (There are discussions at IQSS/GDCC about how to provide more limited API keys from the Dataverse instance that would mimic the presigned URL mechanism.)
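The following sketch illustrates why a signed URL of this kind cannot be repurposed for another method, another file, or a later time. It is a deliberately simplified HMAC scheme written for illustration and is not the actual signing algorithm used by S3 or Minio; the function names, the query parameter names (expires, signature), and the example path are all assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def presign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Server side: sign the method, object path, and expiry with the secret."""
    message = f"{method}\n{path}\n{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{path}?{query}"


def verify(secret: bytes, method: str, url: str, now: int) -> bool:
    """Storage side: recompute the signature for the request actually made."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    if now > expires_at:
        return False  # the URL has expired
    message = f"{method}\n{parts.path}\n{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected, signature)
```

With a URL produced by presign(secret, "PUT", "/bucket/ds1/file-abc", expiry), verify succeeds only for a PUT to that exact path before the expiry. Changing the method to GET or DELETE, changing the path, or presenting the URL after it expires makes the recomputed signature differ or the expiry check fail, so the request is rejected. The user's machine never sees the secret itself.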
In addition to validating the design, the POC involved working through the Dataverse software’s two-phase, ~10-step upload process and learning how to separate and, for now, turn off the steps that involve reading the file itself, while keeping the processing that adds the file to the dataset, records its metadata, creates a new dataset version if needed, etc. While this code will probably need further modification/clean-up, it’s a significant step to have the POC working.
...