Specifying a default Store:

The storage option to use in Dataverse v.4.18.1 is configured through Glassfish jvm options. The primary choice is made through specifying the storage-driver-id, which can currently be file, s3, or swift, e.g.

-Ddataverse.files.storage-driver-id=s3

In the multistore version of the Dataverse software, multiple stores can be active at the same time, but one still must be designated as the default. The same jvm option is used for this, but its value may now be the id of any configured store (rather than just the three values above, which only identify the store 'type'), e.g.

-Ddataverse.files.storage-driver-id=s3tacc
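On a Glassfish-based installation, jvm options like the one above are usually managed with the asadmin tool rather than by editing domain.xml directly. A sketch of switching the default store (the asadmin path, domain name, and store ids are assumptions; adjust them for your installation):

```shell
# Sketch: change the default store to the 's3tacc' store on a
# Glassfish domain. Paths and ids below are illustrative assumptions.
ASADMIN=/usr/local/glassfish4/bin/asadmin

# Remove the old default first; asadmin requires the exact existing
# value when deleting a jvm option.
$ASADMIN delete-jvm-options "-Ddataverse.files.storage-driver-id=s3"

# Set the new default store id.
$ASADMIN create-jvm-options "-Ddataverse.files.storage-driver-id=s3tacc"

# jvm option changes only take effect after a restart.
$ASADMIN restart-domain domain1
```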

Configuring a specific Store:

In Dataverse 4.18.1, the three types of stores are configured through several type-specific jvm options. The simplest of these is a file store that is defined by the path where files should be stored, e.g.

-Ddataverse.files.file.directory=/usr/local/glassfish4/glassfish/domains/domain1/files

In the multistore version, all stores must have an id, type, and label, as well as type-specific information. Thus a file store can be configured using:

-Ddataverse.files.file.type=file

-Ddataverse.files.file.label=LocalFileSystem

-Ddataverse.files.file.directory=/usr/local/glassfish4/glassfish/domains/domain1/files

where the id ('file' in this case) is implicit in the names of the jvm options. A second file store with the id ‘testing’ could be set up as:

-Ddataverse.files.testing.type=file

-Ddataverse.files.testing.label=TestingOnly

-Ddataverse.files.testing.directory=/tmp
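The three options for the 'testing' store above can be installed the same way as any other jvm options; a sketch using asadmin (the asadmin path is an assumption):

```shell
# Sketch: define the second file store ('testing') from the example
# above. The store id is implicit in each option name.
ASADMIN=/usr/local/glassfish4/bin/asadmin

$ASADMIN create-jvm-options "-Ddataverse.files.testing.type=file"
$ASADMIN create-jvm-options "-Ddataverse.files.testing.label=TestingOnly"
$ASADMIN create-jvm-options "-Ddataverse.files.testing.directory=/tmp"
```

The directory must exist and be writable by the user running the application server.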

Other types of stores are configured similarly. As with file stores, the existing jvm options are replaced with analogous ones that include the id of the specific store. For example, the s3 option to configure the bucket was:

-Ddataverse.files.s3-bucket-name=dataverse-dev-tacc-s3

It now includes the id, e.g.

-Ddataverse.files.s3tacc.bucket-name=dataverse-dev-tacc-s3

A complete s3 configuration would be:

-Ddataverse.files.s3.type=s3

-Ddataverse.files.s3.profile=s3

-Ddataverse.files.s3.label=TDL

-Ddataverse.files.s3.bucket-name=dataverse-dev-s3

A more complex configuration would be:

-Ddataverse.files.s3tacc.type=s3

-Ddataverse.files.s3tacc.label=TACC

-Ddataverse.files.s3tacc.custom-endpoint-url=http://129.114.52.102:9006

-Ddataverse.files.s3tacc.path-style-access=true

-Ddataverse.files.s3tacc.profile=s3tacc

-Ddataverse.files.s3tacc.upload-redirect=true

-Ddataverse.files.s3tacc.proxy-url=https://18.211.108.182:8888/s3tacc
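The profile option in the examples above selects a named profile in the AWS credentials file of the user running the application server, which lets each S3 store use different access keys. A sketch of what that file might contain (the key values are placeholders, not real credentials):

```shell
# Sketch: one named credentials profile per S3 store, matching the
# 'profile' jvm option of each store. Key values are placeholders.
cat >> ~/.aws/credentials <<'EOF'
[s3]
aws_access_key_id = <TDL_ACCESS_KEY>
aws_secret_access_key = <TDL_SECRET_KEY>

[s3tacc]
aws_access_key_id = <TACC_ACCESS_KEY>
aws_secret_access_key = <TACC_SECRET_KEY>
EOF
```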

In general, configurations for s3 and swift stores can include all of the existing options described in Harvard's official Dataverse software user manuals, modified as shown above to include a store-specific identifier in the option name.

The complex example above includes three new options implemented as part of the TDL large data project.

Specifying which store to use:

By default, any new files uploaded will use whichever store is designated as the default (via the -Ddataverse.files.storage-driver-id setting described above). Previously uploaded files remain in the store where they were originally uploaded.

To use multiple stores for new files, a superuser can configure Dataverse repositories and collections, as well as datasets, to use specific stores. When editing the ‘General Information’ for a Dataverse collection, superusers will see a new ‘Storage Driver’ entry that lists available stores.

The image at right shows the ‘TACC’ store selected from a list including ‘TACC’ and ‘TDL’ stores.

As with the default case, this setting only affects new files (and it affects new files in existing datasets).
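Recent Dataverse releases also expose per-collection store assignment through the admin API; assuming a comparable endpoint is available in this multistore build, a sketch (the collection alias 'tacc-projects' is an illustrative assumption):

```shell
# Sketch: manage a collection's store from the command line via the
# admin API. Assumes the API is reachable on localhost and unblocked;
# the request body is the store's label as shown in the web interface.
curl -X PUT -d "TACC" \
  http://localhost:8080/api/admin/dataverse/tacc-projects/storageDriver

# Check which store the collection currently uses:
curl http://localhost:8080/api/admin/dataverse/tacc-projects/storageDriver

# Revert the collection to the installation default:
curl -X DELETE \
  http://localhost:8080/api/admin/dataverse/tacc-projects/storageDriver
```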

Implementing Multiple Stores for an existing instance:

A multistore configuration can be made backwards compatible with an existing Dataverse instance if a naming convention is followed for the existing store: it must be given the id 'file', 's3', or 'swift', consistent with the type of the single store that was in use. This is because the database entries for file location in Dataverse v.4.18.1 include the store type as a prefix, whereas the multistore version prefixes with the id of the store instead. If the original store has an id that matches its type, the new code will find the datafiles without any change to the database. If this convention is not followed, existing database entries will need to be updated to use the new id of the original store before the software can find those files.
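If the naming convention cannot be followed, the prefix rewrite described above might look like the following. The table and column names (dvobject.storageidentifier) are assumptions based on the Dataverse schema; back up the database and verify the actual identifier format against your schema before running anything like this:

```shell
# Sketch: rewrite stored file locations from the old type prefix ('s3://')
# to a new store id prefix ('s3tacc://'). Database name, table, and
# column are assumptions; take a full database backup first.
psql dataverse_db -c "
  UPDATE dvobject
  SET storageidentifier = regexp_replace(storageidentifier, '^s3://', 's3tacc://')
  WHERE storageidentifier LIKE 's3://%';
"
```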

Note that since the 'label' of a store is what is shown in the web interface (i.e. for superusers to select a store for a given Dataverse repository or collection), this naming convention is not very restrictive. For example, in the image above, the 'TDL' store is one with an id='s3' and a label='TDL', and it is the original single store that was in use before the multistore configuration was set up.

Other Configuration Issues/Notes:

The Dataverse software already supports direct download of files from S3 stores. This is usually a reasonable default and is even more valuable with large data. It is set via the

-Ddataverse.files.<id>.download-redirect=true

setting. One important caveat for this setting: the S3 store must be configured to allow cross-origin (CORS) requests for data previewers (and the Dataverse software's external tools in general) to work.
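For an S3-compatible store, CORS is enabled with a bucket CORS policy; a sketch using the AWS CLI against the TACC endpoint from the example above (the permissive origins are illustrative and should be narrowed in production):

```shell
# Sketch: allow cross-origin requests on the bucket so previewers and
# other external tools can fetch directly-downloaded files. Restrict
# AllowedOrigins to your installation's hostname in production; '*' is
# illustrative only.
cat > cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedMethods": ["GET", "PUT", "HEAD"],
      "AllowedHeaders": ["*"]
    }
  ]
}
EOF

aws s3api put-bucket-cors \
  --bucket dataverse-dev-tacc-s3 \
  --endpoint-url http://129.114.52.102:9006 \
  --cors-configuration file://cors.json
```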

For the current multistore implementation with direct uploads in use for some stores: