Multiple Store and Direct Upload Configuration

Specifying a default Store:

In Dataverse v4.18.1, the storage option is configured through Glassfish jvm options. The primary choice is made by specifying the storage-driver-id, which can currently be file, s3, or swift, e.g.

-Ddataverse.files.storage-driver-id=s3

In the multistore version of the Dataverse software, multiple stores can be active at the same time. However, one still needs to be designated as the default. The same jvm option is used for this, but its value can now be the id of any configured store (rather than just one of the three values above, which only identify the store ‘type’), e.g.

-Ddataverse.files.storage-driver-id=s3tacc

Configuring a specific Store:

In Dataverse 4.18.1, the three types of stores are configured through several type-specific jvm options. The simplest of these is a file store that is defined by the path where files should be stored, e.g.

-Ddataverse.files.file.directory=/usr/local/glassfish4/glassfish/domains/domain1/files

In the multistore version, all stores must have an id, type, and label, as well as type-specific information. Thus a file store can be configured using:

-Ddataverse.files.file.type=file

-Ddataverse.files.file.label=LocalFileSystem

-Ddataverse.files.file.directory=/usr/local/glassfish4/glassfish/domains/domain1/files

where the id ('file' in this case) is implicit in the names of the jvm options. A second file store with the id ‘testing’ could be set up as:

-Ddataverse.files.testing.type=file

-Ddataverse.files.testing.label=TestingOnly

-Ddataverse.files.testing.directory=/tmp
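The naming pattern in these examples (store id embedded in every option name) can be sketched in a few lines of Python. This is only an illustration of the pattern, not part of the Dataverse software; the function name store_jvm_options is hypothetical.

```python
def store_jvm_options(store_id, store_type, label, **settings):
    """Return the jvm options that define one store in the multistore naming scheme."""
    prefix = f"-Ddataverse.files.{store_id}."
    options = [f"{prefix}type={store_type}", f"{prefix}label={label}"]
    # Python keyword names use underscores; the jvm option names use hyphens.
    for key, value in settings.items():
        options.append(f"{prefix}{key.replace('_', '-')}={value}")
    return options


if __name__ == "__main__":
    # Prints the three options for the 'testing' file store, one per line.
    for option in store_jvm_options("testing", "file", "TestingOnly", directory="/tmp"):
        print(option)
```

Type-specific settings are passed as keyword arguments, so an S3 bucket would be given as bucket_name=... and emitted as a bucket-name option under the store's id.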

Other types of stores are similarly configured. As with file store configuration, the existing jvm option choices are replaced with similar ones that specify the id of a specific store. For example, the s3 option to configure the bucket was:

-Ddataverse.files.s3-bucket-name=dataverse-dev-tacc-s3

It now includes the id, e.g.

-Ddataverse.files.s3tacc.bucket-name=dataverse-dev-tacc-s3

A complete s3 configuration would be:

-Ddataverse.files.s3.type=s3

-Ddataverse.files.s3.profile=s3

-Ddataverse.files.s3.label=TDL

-Ddataverse.files.s3.bucket-name=dataverse-dev-s3

A more complex configuration would be:

-Ddataverse.files.s3tacc.type=s3

-Ddataverse.files.s3tacc.label=TACC

-Ddataverse.files.s3tacc.custom-endpoint-url=http://129.114.52.102:9006

-Ddataverse.files.s3tacc.path-style-access=true

-Ddataverse.files.s3tacc.profile=s3tacc

-Ddataverse.files.s3tacc.upload-redirect=true

-Ddataverse.files.s3tacc.proxy-url=https://18.211.108.182:8888/s3tacc

In general, configurations for s3 and swift stores can include all of the existing options described in Harvard’s official Dataverse software user manuals, modified in the same way as shown above to include a store-specific identifier in the option name.

The complex example above includes 3 new options implemented as part of the TDL large data project:

-Ddataverse.files.<id>.profile - this option specifies an AWS profile to use with the specific store. AWS configuration files allow multiple profiles to be defined, each with its own username/password. This option links a Dataverse S3 store with a specific AWS profile.
-Ddataverse.files.<id>.upload-redirect - specifying true configures an S3 store to support direct upload of files to the store, versus streaming files through the Dataverse (Glassfish) server to a temporary location before sending them to the specified store. As described elsewhere, this option affects both upload via the Dataverse software API and the Dataverse software web interface.
-Ddataverse.files.<id>.proxy-url - s3 stores can be configured to use a proxy (such as nginx) for direct file uploads when the s3 endpoint itself is firewalled. This option was created to support testing, but could be used in production if desired (potentially slowing uploads, since the proxy may create a temporary file, may involve traffic over slower networks than those between the source and S3 store, etc.).
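As a rough illustration of the profile option, the sketch below parses an AWS credentials file with one section per profile and reports any store whose configured profile is not defined. The profile names match the examples above; the key values are placeholders, and the helper names (profiles_in, missing_profiles) are hypothetical.

```python
import configparser

# Hypothetical credentials file content; the key values are placeholders.
SAMPLE_CREDENTIALS = """\
[s3]
aws_access_key_id = EXAMPLE_KEY_ID_FOR_TDL
aws_secret_access_key = EXAMPLE_SECRET_FOR_TDL

[s3tacc]
aws_access_key_id = EXAMPLE_KEY_ID_FOR_TACC
aws_secret_access_key = EXAMPLE_SECRET_FOR_TACC
"""


def profiles_in(credentials_text):
    """Return the profile names defined in AWS credentials file text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


def missing_profiles(credentials_text, store_profiles):
    """Return ids of stores whose profile is not defined in the credentials text."""
    defined = set(profiles_in(credentials_text))
    return sorted(
        store_id
        for store_id, profile in store_profiles.items()
        if profile not in defined
    )


if __name__ == "__main__":
    # Maps store id -> value of its -Ddataverse.files.<id>.profile option.
    configured = {"s3": "s3", "s3tacc": "s3tacc"}
    print(missing_profiles(SAMPLE_CREDENTIALS, configured))
```

A check like this could be run before restarting Glassfish to catch a profile name that was mistyped in a jvm option.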

Specifying which store to use:

By default, any new files uploaded will use whichever store is specified as the default (via the

-Ddataverse.files.storage-driver-id setting). Previously uploaded files remain in the store where they were originally uploaded.

To use multiple stores for new files, a superuser can configure Dataverse repositories and collections, as well as datasets, to use specific stores. When editing the ‘General Information’ for a Dataverse collection, superusers will see a new ‘Storage Driver’ entry that lists available stores.

The image at right shows the ‘TACC’ store selected from a list including ‘TACC’ and ‘TDL’ stores.

As with the default case, this setting only affects new files (and it affects new files in existing datasets).

Implementing Multiple Stores for an existing instance

A multistore configuration can be made backwards compatible with an existing Dataverse instance if a naming convention is followed for the existing store: it must be named ‘file’, ‘s3’, or ‘swift’, consistent with the type of the single store that was in use. This is because the database entries for file location in Dataverse v4.18.1 include the store type as a prefix. The multistore version prefixes with the id of the store instead. If the original store has an id that matches its type, the new code will find the datafiles without any change to the database. If this convention is not followed, existing database entries will need to be updated to use the new id of the original store before the software can find those files.
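If the convention is not followed, the required update amounts to swapping the prefix on each stored location. The sketch below shows that transformation on plain strings. It assumes locations have the form prefix://rest, which is an assumption made for illustration; the exact format should be checked against the database before any real update, and the id 'tdl' is hypothetical.

```python
def rewrite_store_prefix(location, old_id, new_id):
    """Swap the store prefix of one stored location, leaving other entries untouched."""
    marker = f"{old_id}://"
    if location.startswith(marker):
        return f"{new_id}://{location[len(marker):]}"
    return location


if __name__ == "__main__":
    # Hypothetical entries: two from the original 's3' store, one from a file store.
    entries = [
        "s3://dataverse-dev-s3:file-1",
        "s3://dataverse-dev-s3:file-2",
        "file://file-3",
    ]
    for entry in entries:
        print(rewrite_store_prefix(entry, "s3", "tdl"))
```

Matching on the full prefix including the separator keeps an id such as 's3tacc' from being mistaken for 's3'.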

Note that since the ‘label’ of a store is what is shown in the web interface (i.e. for superusers to select a store for a given Dataverse repository or collection), this naming convention is not very restrictive. For example, in the image above, the ‘TDL’ store is one with an id='s3' and a label='TDL', and it is the original single store that was in use before the multistore configuration was set up.

Other Configuration Issues/Notes

The Dataverse software already supports direct download of files from S3 stores. This would usually be a reasonable default and would be even more valuable with large data. This is set via the

-Ddataverse.files.<id>.download-redirect=true setting. One thing that is important for this setting: the S3 store must be configured to allow cross-origin (CORS) requests for data previewers (and the Dataverse software’s external tools in general) to work.
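A CORS rule set for the bucket might look like the output of the sketch below, which builds the JSON structure S3 expects. The origin, the allowed methods and headers, and the function name are all assumptions to adapt to the actual site and to wherever the previewers are hosted; this is not a configuration taken from the TDL installation.

```python
import json


def cors_configuration(allowed_origins):
    """Build a CORS rule set letting browser code on the given origins reach the bucket."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": list(allowed_origins),
                # GET covers downloads and previews; PUT is assumed for direct uploads.
                "AllowedMethods": ["GET", "PUT"],
                "AllowedHeaders": ["*"],
                "ExposeHeaders": ["ETag"],
            }
        ]
    }


if __name__ == "__main__":
    # The https origin below is an assumed site URL for the dev server.
    print(json.dumps(cors_configuration(["https://dataverse-dev.tdl.org"]), indent=2))
```

The resulting JSON could be saved to a file and applied with the AWS CLI (aws s3api put-bucket-cors), adding --endpoint-url for a store that uses a custom endpoint.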

For the current multistore implementation with direct uploads in use for some stores:

It is recommended to leave the Dataverse software configured with the default MD5 fixity hash option. The Dataverse software itself can generate/verify several other hash types (SHA-1, SHA-256, SHA-512) and it should be possible for the direct upload implementation to support these in the future, but, at present, the web interface for upload is hardcoded to MD5. Nominally the Dataverse software can handle files using different algorithms, but it may be confusing if different files show different hash types to users.
The current web interface for direct upload calculates the fixity hash of the file locally on the user's machine, as a separate step initiated after the file has been uploaded to the remote store. This can also take significant time, so the progress bar includes progress on the file upload (up to 50%) and the hash calculation (from 50% to 100%). Since the progress in each step depends on the user’s network, computer, and local disk speed, it is not possible to predict the relative speed of the two steps, so the bar may move faster in either its first or its second half. Further, when a proxy is used, there can be a significant amount of time after the file upload, during which the proxy sends the file on to the remote store, that is not tracked in the progress bar. Thus the bar will appear to pause at the 50% mark when using a proxy, potentially for an amount of time similar to how long it took to reach 50%, before progress continues. (Accounting for this retransmission from the proxy in the progress bar would be a useful addition if a proxy were to be used in production.)
For large files direct uploaded via a proxy: since the proxy may (it does with nginx as configured on dataverse-dev.tdl.org) create a temporary copy of the file(s) being uploaded, it is important to have local disk storage sufficient for whatever upload size is being performed. For dataverse-dev, this required shifting nginx’s temporary store from its default location to somewhere in the /ebs file system (which has more space). This allowed a 39GB upload.
For testing with the current implementation of direct upload, it may be useful to watch the browser’s console. It shows various debugging info and progress messages and will indicate, for example, whether the upload HTTP call is continuing (e.g. while the proxy is uploading to the remote store) versus any failure that might also cause the progress bar to stop.
Temporary files: since the direct upload creates files in their final location rather than a temporary one, it will be somewhat harder to remove abandoned files when direct upload is used. Due to prior work to limit the circumstances under which the Dataverse software will abandon files, this should still be a relatively rare issue. However, since direct upload may be used for larger files, the impact may still be significant. In production, the primary way files are abandoned may be if users leave the upload page without clicking save or cancel. If/when problems occur, one can compare the database entries for the files in a given dataset with the list of files having an S3 path corresponding to the matching dataset identifier plus file identifier and delete any extra files. More simply, if a dataset is known to be empty, all files for the corresponding s3 path can be removed. Further work could be done to automate this type of clean-up, either within the Dataverse software or as a separate script.
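The comparison described in the last item can be sketched as a set difference between the keys found in the store and the keys the database knows about. How the two lists are obtained (an S3 listing and a database query) is left out here; the key layout of dataset identifier plus file identifier follows the description above, and the identifiers themselves are hypothetical.

```python
def find_abandoned_keys(stored_keys, database_keys):
    """Return keys present in the store but not referenced by any database entry."""
    return sorted(set(stored_keys) - set(database_keys))


if __name__ == "__main__":
    # Hypothetical keys for one dataset: <dataset identifier>/<file identifier>.
    in_store = [
        "10.5072/FK2/ABCDEF/file-1",
        "10.5072/FK2/ABCDEF/file-2",
        "10.5072/FK2/ABCDEF/file-3",
    ]
    in_database = [
        "10.5072/FK2/ABCDEF/file-1",
        "10.5072/FK2/ABCDEF/file-3",
    ]
    for key in find_abandoned_keys(in_store, in_database):
        print("candidate for deletion:", key)
```

Because an upload may still be in progress when the lists are compared, any automated version would probably want to review candidates (or skip recently created files) rather than delete them immediately.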