Status 12/19: Still TBD depending on TDL requirements. (FWIW: One relatively simple option for ingest only would be to adopt the QDR changes that allow ingest to be done manually after upload.)
Status 1/3/20: Still TBD. As part of merging with the main Dataverse release, I’ll plan to add a size limit for each store, below which all ingest processing except unzipping will be done.
With the POC, an MD5 hash is created on the local machine as the file is streamed, and this is sent to Dataverse to store as metadata (thus allowing the file contents to be compared with the original MD5 hash in the future to validate its integrity). Dataverse currently allows other algorithms as well (e.g. SHA-1, SHA-512). It should be possible to create an MD5 hash during upload through the Dataverse web interface as well. Allowing the hash algorithm to change would require adapting the DVUploader and the new upload code for the Dataverse web interface to determine Dataverse’s selected algorithm and to generate the appropriate hash. (S3 also calculates a hash during upload, but its form varies depending on whether the upload was done in multiple parts. In theory, one could leverage that instead, but having a hash computed on the original machine seems like a stronger approach.) Dataverse also allows you to change the hash algorithm and then update the hashes for existing files. This requires retrieving each file and computing the hash locally, so it may be something that should not be done for large files, for files in some stores, etc.
Status 12/19: TBD
Status 1/3/20: Implemented MD5 hashing using a library that can create hashes using other algorithms as well.
Parallelism: The DVUploader currently sends one file at a time. Due to the way HTTP works, this may not use all of the available bandwidth. Sending multiple files in parallel would allow more bandwidth to be used. (Using more bandwidth for a single large file is harder and is one of the strengths of Globus/GridFTP.) It would not be too much work to enhance the DVUploader to send several files at once. There might still be a bottleneck at Dataverse, where the API call is for a single file and results in a database update for the entire dataset; that update has to complete before another API call can succeed. Adding an API call that registers multiple files at once could address that. It might also be possible to parallelize the upload in the web interface (it actually works more like this now, as it streams all of the files up and then only updates the dataset in the database when you ‘save’ all the changes). Whether these changes are worth the effort probably depends on the use cases and on how much performance is gained from the direct S3 upload design itself.
Status 12/19: TBD
Status 1/3/20: The web interface parallelizes upload (subject to the limits on connections managed by the browser).
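The multi-file parallelism idea above can be sketched as follows (Python rather than the DVUploader’s Java; `upload_one` is a hypothetical per-file callable, e.g. a direct-to-S3 PUT). The key point from the text is preserved in the structure: the file transfers run concurrently, while the per-dataset registration would still happen afterward, ideally in a single batched call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_files_in_parallel(paths, upload_one, max_workers=4):
    """Transfer several files concurrently to use more of the available
    bandwidth. upload_one(path) is assumed to stream one file and return
    an identifier for it; dataset registration is deliberately left out,
    since batching it into one call is what avoids the per-file
    database-update bottleneck described above."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, p): p for p in paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # re-raises upload errors
    return results
```

Browsers impose their own per-host connection limits, which is why the web-interface parallelism noted in the 1/3/20 status is bounded without any explicit worker pool.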