...
In the multistore version of the Dataverse software, multiple stores can be active at the same time. However, one still needs to be designated as the default. This is done with the same JVM option, but its value may now be the id of any configured store (versus just the three values above, which only define the store ‘type’), e.g.
...
In general, configurations for S3 and Swift stores can include all of the existing options described in Harvard’s official Dataverse software user manuals, modified analogously to those shown here so that the option name includes the store-specific identifier.
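For instance, a minimal sketch of how standard S3 options from the installation guide pick up the store id (the ‘s3tacc’ id matches the example further below; the label, bucket name, and endpoint are placeholders rather than values from an actual deployment):

./asadmin create-jvm-options "-Ddataverse.files.s3tacc.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3tacc.label=TACC"
./asadmin create-jvm-options "-Ddataverse.files.s3tacc.bucket-name=my-tacc-bucket"
./asadmin create-jvm-options "-Ddataverse.files.s3tacc.custom-endpoint-url=https\://s3.example.org"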
...
-Ddataverse.files.<id>.profile - this option specifies an AWS profile to use with the specific store. AWS configuration files allow multiple profiles to be defined, each with its own credentials. This option links a Dataverse S3 store with a specific AWS profile.
-Ddataverse.files.<id>.upload-redirect - specifying true configures an S3 store to support direct upload of files to the store, versus streaming files through the Dataverse (Glassfish) server to a temporary location before sending them to the specified store. As described elsewhere, this option affects upload via both the Dataverse software API and the Dataverse software web interface.
-Ddataverse.files.s3tacc.proxy-url - S3 stores can be configured to use a proxy (such as nginx) for direct file uploads when the S3 endpoint itself is firewalled. This option was created to support testing, but could be used in production if desired (potentially slowing uploads, since the proxy may create temporary files, may involve traffic over slower networks than those between the source and the S3 store, etc.). A combined sketch of these options is shown after this list.
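One way these three options might be set together (the ‘s3tacc’ id, the ‘tacc’ AWS profile name, and the proxy host are placeholders, not values from an actual deployment):

./asadmin create-jvm-options "-Ddataverse.files.s3tacc.profile=tacc"
./asadmin create-jvm-options "-Ddataverse.files.s3tacc.upload-redirect=true"
./asadmin create-jvm-options "-Ddataverse.files.s3tacc.proxy-url=https\://proxy.example.edu\:8888"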
...
To use multiple stores for new files, a superuser can configure Dataverse repositories and collections, as well as datasets, to use specific stores. When editing the ‘General Information’ for a Dataverse collection, superusers will see a new ‘Storage Driver’ entry that lists available stores.
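For scripted changes, the assignment can also be inspected or made via the admin API; the sketch below assumes the storageDriver admin endpoints introduced with the multistore work are available in this build, and the collection alias and driver id are placeholders:

# show the store currently assigned to a collection
curl http://localhost:8080/api/admin/dataverse/myCollectionAlias/storageDriver
# assign a different configured store by its id
curl -X PUT -d 's3tacc' http://localhost:8080/api/admin/dataverse/myCollectionAlias/storageDriver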
...
Note that since the ‘label’ of a store is what is shown in the web interface (i.e. for superusers to select a store for a given Dataverse repository or collection), this naming convention is not very restrictive. For example, in the image above, the ‘TDL’ store is one with an id='s3' and a label='TDL', and it is the original single store that was in use before the multistore configuration was set up.
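A sketch of the options behind that example, i.e. a store whose id is ‘s3’ but whose label shown to superusers is ‘TDL’ (other required options, such as the bucket name, are omitted):

./asadmin create-jvm-options "-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.label=TDL"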
Other Configuration Issues/Notes
The Dataverse software already supports direct download of files from S3 stores. This would usually be a reasonable default and would be even more valuable with large data. This is set via the
-Ddataverse.files.<id>.download-redirect=true
setting (the download analog of the upload-redirect option described above). One thing that is important for this setting: the S3 store must be configured to allow cross-origin (CORS) requests for data previewers (the Dataverse software’s external tools in general) to work.
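As an illustration, CORS could be enabled on an AWS S3 bucket with the AWS CLI roughly as follows (the bucket name, origins, and methods are placeholders to adjust for the installation; Swift or non-AWS S3 endpoints have their own mechanisms):

cat > cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedMethods": ["GET", "PUT", "HEAD"],
      "AllowedHeaders": ["*"],
      "ExposeHeaders": ["ETag"]
    }
  ]
}
EOF
aws s3api put-bucket-cors --bucket my-tacc-bucket --cors-configuration file://cors.json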
...
it would be recommended to leave the Dataverse software configured with the default MD5 fixity hash option. The Dataverse software itself can generate/verify several other hash types (SHA-1, SHA-256, SHA-512), and it should be possible for the direct upload implementation to support these in the future, but, at present, the web interface for upload is hardcoded to MD5. Nominally the Dataverse software can handle files using different algorithms, but it may be confusing to users if different files show different hash types.
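If it becomes useful to confirm (or, in the future, change) the algorithm, the standard :FileFixityChecksumAlgorithm database setting should apply; a rough sketch, with the server URL as a placeholder, and noting that switching away from MD5 is not recommended while the direct upload interface is hardcoded to MD5:

# show the current algorithm (unset means the MD5 default)
curl http://localhost:8080/api/admin/settings/:FileFixityChecksumAlgorithm
# example of switching, if a future direct upload implementation supports it
curl -X PUT -d 'SHA-256' http://localhost:8080/api/admin/settings/:FileFixityChecksumAlgorithm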
The current web interface for direct upload calculates the fixity hash of the file locally on the user's machine, as a separate step initiated after the file has been uploaded to the remote store. This can also take significant time, so the progress bar includes progress on the file upload (up to 50%) and the hash calculation (from 50% to 100%). Since the progress in each step depends on the user’s network, computer, and local disk speed, it is not possible to predict the relative speed of the two steps, so the bar may advance faster in its first or second half. Further, when a proxy is used, there can be a significant amount of time after the file upload, during which the proxy sends the file on to the remote store, that is not tracked in the progress bar. Thus the bar will appear to pause at the 50% mark when using a proxy, potentially for an amount of time similar to how long it took to reach 50%, before progress continues. (Accounting for this retransmission from the proxy in the progress bar would be a useful addition if a proxy were to be used in production.)
For large files uploaded directly via a proxy: since the proxy may create a temporary copy of the file(s) being uploaded (it does with nginx as configured on dataverse-dev.tdl.org), it is important to have local disk storage sufficient for the largest upload being performed. For dataverse-dev, this required shifting nginx’s temporary store from its default location to somewhere in the /ebs file system (which has more space). This allowed a 39GB upload.
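The kind of nginx configuration involved might look like the following (the directive values and the /ebs path are placeholders; the actual dataverse-dev settings may differ):

# in the http (or server) block of nginx.conf
client_max_body_size 0;                        # do not cap the size of uploaded request bodies
client_body_temp_path /ebs/nginx-client-temp;  # buffer uploads on the larger /ebs volume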
For testing with the current implementation of direct upload, it may be useful to watch the browser’s console. It shows various debugging and progress messages and will indicate, for example, whether the upload HTTP call is still in progress (e.g. while the proxy is uploading to the remote store) or whether a failure has occurred that would also cause the progress bar to stop.
Temporary files: since direct upload creates files in their final location rather than a temporary one, it will be somewhat harder to remove abandoned files when direct upload is used. Due to prior work to limit the circumstances under which the Dataverse software will abandon files, this should still be a relatively rare issue. However, since direct upload may be used for larger files, the impact may still be significant. In production, the primary way files are abandoned may be if users leave the upload page without clicking save or cancel. If/when problems occur, one can compare the database entries for the files in a given dataset with the list of objects whose S3 path corresponds to the matching dataset identifier plus file identifier, and delete any extra files (a sketch of this comparison is shown below). More simply, if a dataset is known to be empty, all files under the corresponding S3 path can be removed. Further work could be done to automate this type of clean-up, either within the Dataverse software or as a separate script.
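A sketch of that comparison, assuming the native API for listing a dataset’s files and an S3 key layout in which a dataset’s objects sit under a prefix derived from its persistent identifier (the server, DOI, bucket, and prefix form below are placeholders to verify against the actual store):

# files the Dataverse software knows about in the latest version of the dataset
curl -s -H "X-Dataverse-key:$API_TOKEN" \
  "https://dataverse.example.edu/api/datasets/:persistentId/versions/:latest/files?persistentId=doi:10.5072/FK2/ABC123"

# objects actually present in the store under that dataset's prefix
aws s3 ls --recursive "s3://my-tacc-bucket/10.5072/FK2/ABC123/"

# objects in the second list with no matching storageIdentifier in the first are candidates
# for deletion; if the dataset is known to be empty, the whole prefix can be removed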