The Dataverse software supports file uploads through its web interface. However, that interface has a limit of 1000 files per upload session and, because it displays uploaded files in a single long list, it becomes difficult to use with far fewer files than that. The web interface's support for unzipping zip files is one way to simplify the process - files can be pre-zipped and uploaded as one larger zip file - but the interface still shows a long list of the included files.
The Dataverse software community has a number of initiatives underway to support upload of larger files (greater than a few GB) and/or large numbers of files. Many of these involve configuring external storage and/or data transfer software. One, whose development was supported by TDL, is a relatively simple application (DVUploader) that can be downloaded by users. It uses the existing Dataverse software application programming interface (API) to upload files from a specified directory into a specified Dataset. It can be a useful alternative to the web interface when:
- there are hundreds or thousands of files to upload,
- automatic verification of error-free and complete upload of files is desired,
- new files are being generated or added to a directory and the Dataverse collection needs to be updated with just the new files, or
- uploading needs to be automated, e.g. added to an instrument or analysis script or program.
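For the automation case, a script can shell out to the DVUploader directly. The sketch below (Python) builds the argument list used throughout this document; the key, DOI, and directory values are placeholders, and the jar name follows the examples later in this guide.

```python
def build_dvuploader_cmd(api_key, dataset_doi, server_url, upload_dir,
                         jar="DVUploader-v1.0.n.jar"):
    """Build the DVUploader command line as a list of arguments."""
    return [
        "java", "-jar", jar,
        f"-key={api_key}",
        f"-did={dataset_doi}",
        f"-server={server_url}",
        upload_dir,
    ]

# Placeholder values, not a real key or dataset:
cmd = build_dvuploader_cmd(
    api_key="8599b802-659e-49ef-823c-20abd8efc05c",
    dataset_doi="doi:10.5072/FK2/TUNNVE",
    server_url="https://dataverse.tdl.org",
    upload_dir="testdir",
)
print(" ".join(cmd))
# An analysis script could then run it with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one string) avoids shell-quoting problems if a directory name contains spaces.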
The DVUploader does need to be installed and, as a command-line tool, may not be as intuitive as the Dataverse software web interface. However, unlike other bulk tools being developed, it works with any Dataverse software installation without any server-side changes. (Since it uploads and stores data via the Dataverse software, it shares the basic performance characteristics and limitations of the Dataverse software's web interface. Other tools bypass the Dataverse software to handle larger data, or do not move data from remote locations at all and simply reference it in a Dataverse collection's dataset.) DVUploader can thus be a useful tool for individuals and for Dataverse installations interested in supporting larger numbers of files.
Note that while the DVUploader is not subject to the Dataverse software's upload limit of 1000 files, Dataverse software performance degrades with thousands of files per Dataset, and TDL recommends breaking larger Datasets into parts when possible.
The DVUploader is a Java application packaged as a single jar file. Visit the tool's official home on GitHub for the current version: https://github.com/IQSS/dataverse-uploader/wiki/DVUploader,-a-Command-line-Bulk-Uploader-for-Dataverse
To run it, you need to have Java and the DVUploader:
Step 1: Install Java if needed. The DVUploader requires Java version 8 or greater; downloads for most operating systems are available from https://java.com/en/download/.
Any warning about Java not being able to run in the user's browser (MS Edge is one where this warning is shown) can be ignored as the DVUploader does not run in the browser.
If you have trouble installing Java, you may need to reach out to your local IT for help with this part as it could be due to permissions on your machine.
Instructions for finding out what version, if any, of Java you have can be found here: https://java.com/en/download/help/version_manual.xml
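If you prefer to check from a script, `java -version` prints a line such as `java version "1.8.0_281"` (Java 8) or `openjdk version "11.0.2"` (Java 11). A small, hypothetical helper for interpreting that line, covering both numbering schemes:

```python
import re

def java_major_version(version_line):
    """Extract the major Java version from a `java -version` output line.
    Handles both the legacy "1.8.0_281" scheme (-> 8) and the newer
    "11.0.2" scheme (-> 11). Returns None if no version is found."""
    m = re.search(r'version "([^"]+)"', version_line)
    if not m:
        return None
    parts = m.group(1).split(".")
    major = int(parts[0])
    if major == 1 and len(parts) > 1:  # legacy 1.x numbering
        major = int(parts[1])
    return major

print(java_major_version('java version "1.8.0_281"'))  # 8
print(java_major_version('openjdk version "11.0.2"'))  # 11
```

The DVUploader needs the reported major version to be 8 or greater.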
Step 2: Download the DVUploader jar file (the current version is linked from https://github.com/IQSS/dataverse-uploader/wiki/DVUploader,-a-Command-line-Bulk-Uploader-for-Dataverse) to a directory on your computer.
Prepare for Upload
To prepare, log into a Dataverse repository and:
- find the DOI for the existing dataset you wish to add files to, and
- find or generate an API key for yourself in the Dataverse installation you are using (from the popup menu under your profile). The API key will be used on the machine where you run the DVUploader and where your data is. Users can generate a new key for themselves as explained in the Account Creation + Management section of the official Dataverse Software User Guide (scroll down to the API Token section).
- Hint: It might be helpful to copy and paste the DOI and API key into a note or text file on your desktop while you prepare your upload. You can also use this document to prepare your command prior to entering it in the command line interface.
- Hint: The simplest way to run the DVUploader is to place the jar file (under Installation, above) into the directory containing a subdirectory with the files intended for upload. (The DVUploader can be placed anywhere on disk and can upload files from any directory, but this requires adding these paths to the command line and/or configuration of Java's classpath.)
REQUIRED: Run the jar with the following command line. The four arguments shown are always required; see Optional Parameters below for other arguments.
java -jar DVUploader-v1.0.n.jar -key=<api key> -did=<dataset doi> -server=<server URL> <dir or file names>
<api key> is replaced with the API key generated by the user in the Dataverse software,
<dataset doi> is replaced with the DOI of the target Dataset. As in the example below, the DOI should start with "doi:" - leave off the "https://doi.org/" prefix that is included when you copy the DOI from the TDR Dataverse repository, and
<server URL> is replaced by the URL of the Dataverse server being used (with no trailing '/'; do not include any path to a specific Dataverse repository or collection on the server), and
<dir or file names> is replaced by the name of a directory and/or a list of individual files to upload.
The server URL for Texas Data Repository Dataverse repository is https://dataverse.tdl.org
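The argument conventions above (a DOI starting with "doi:", a server URL with no trailing slash) are easy to get wrong. A small, hypothetical pre-flight check along these lines can catch such mistakes before you run the uploader:

```python
def check_dvuploader_args(api_key, dataset_doi, server_url):
    """Return a list of problems with the required DVUploader arguments,
    based on the conventions described above (empty list = looks OK)."""
    problems = []
    if not api_key:
        problems.append("missing API key")
    if not dataset_doi.startswith("doi:"):
        problems.append('dataset DOI should start with "doi:" (no https:// prefix)')
    if server_url.endswith("/"):
        problems.append("server URL should have no trailing '/'")
    if not server_url.startswith("https://"):
        problems.append("server URL should be a full https:// URL")
    return problems

# A well-formed set of arguments produces no problems:
print(check_dvuploader_args("8599b802", "doi:10.5072/FK2/TUNNVE",
                            "https://dataverse.tdl.org"))  # []
# A copied-from-browser DOI and a trailing slash are both flagged:
print(check_dvuploader_args("8599b802", "https://doi.org/10.5072/FK2/TUNNVE",
                            "https://dataverse.tdl.org/"))
```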
Recommendation for testing during a first run, a cautionary note:
For a first test, adding -listonly is useful - it makes the DVUploader list what it would do without performing any uploads. This allows you to confirm that the DVUploader and your command are working properly before committing to the full upload.
java -jar DVUploader-v1.0.n.jar -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse.tdl.org testdir
would upload all of the files in the 'testdir' directory (relative to the current directory where the java command is run) to the Dataset at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.5072/FK2/TUNNVE (if it existed; the dataset in this example is not real).
The output from the Uploader looks like:
TTTTT DDD L Texas
T D D L Digital
T DDD LLL Library
DVUploader - a command-line application to upload files to any Dataverse collection dataset
Developed for the Dataverse Software Community
Using apiKey: 8599b802-659e-49ef-823c-20abd8efc05c
Adding content to: doi:10.5072/FK2/TUNNVE
Using server: https://dataverse.tdl.org
Request to upload: testdir
***Starting to Process Upload Requests:***
Found as: doi:10.5072/FK2/TUNNVE
Does not yet exist on server.
UPLOADED as: MD5:b2d8726f4ddba30705259143dbb283e3
CURRENT TOTAL: 1 files :9506 bytes
Does not yet exist on server.
UPLOADED as: MD5:3b9b536bd0abaf9c2677846f62d77ed9
CURRENT TOTAL: 2 files :23973 bytes
Does not yet exist on server.
UPLOADED as: MD5:ce26585c19bd1470b7229b2cfcc879f0
CURRENT TOTAL: 3 files :35448 bytes
(The same information is written into a log file.)
The full set of available command-line arguments is shown in the example below.
java -jar DVUploader-v1.0.n.jar -key=<api key> -did=<dataset doi> -server=<server URL> <-listonly> <-limit=<X>> <-ex=<ext>> <-verify> <-recurse> <-maxlockwait=<X>> <dir or file names>
(Note all combinations should work, but not all have been tested together.)
-listonly: write information about what would or would not be transferred without doing any uploads. Useful as a testing/debugging option and in combination with the -verify flag as discussed below.
-limit=X: limit this run to at most <X> data file uploads. Repeatedly running the uploader with, for example, -limit=5 will upload five more files each time. This can also be useful for testing, or as a way to break uploads into chunks as part of an automated workflow.
-ex=<ext>: exclude any file that matches the provided regular expression pattern, e.g. -ex=^\..* (exclude files whose names start with a period) or -ex=.*\.txt (exclude all files ending in .txt; note this is a regular expression, not the shell glob *.txt). This flag can be repeated to exclude files based on multiple patterns. A common use is to avoid uploading hidden resource files (whose names start with a period) on macOS.
-verify: use the cryptographic hash generated by the Dataverse software (usually MD5, but now configurable to SHA-1 and in the future to SHA-256 or SHA-512) and verify that the corresponding hash of the local file matches. This can be used to verify transfers as they occur or, combined with the -listonly flag, in a second pass to verify that all previously uploaded files match the current file system contents.
-recurse: upload files from subdirectories of the listed directory(ies). Note that since the Dataverse software does not support folders, your data files will be uploaded without path information into the Dataset. This could cause issues if, for example, you have files with the same name or content in different subdirectories.
-maxlockwait=<X>: the maximum time to wait (in seconds) for a Dataset lock (e.g. while the last file is being ingested) to expire (default 60 seconds).
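Because the -ex patterns are regular expressions, it is easy to write one that excludes more (or less) than intended. A quick way to sanity-check patterns before using them, sketched in Python (the filenames are made up, and full-string matching is an assumption about how a regex-based exclusion filter behaves):

```python
import re

def excluded(filename, patterns):
    """True if filename fully matches any exclusion pattern
    (mirroring how a regex-based -ex filter could behave)."""
    return any(re.fullmatch(p, filename) for p in patterns)

# Hidden files, and all .txt files:
patterns = [r"^\..*", r".*\.txt"]
for name in [".DS_Store", "notes.txt", "data.csv"]:
    print(name, "->", "excluded" if excluded(name, patterns) else "kept")
```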
Dataverse Software Requirements:
DVUploader uses the native API of the Dataverse software and will work with v4.8.4 through v4.9.4. Until the file upload API changes, this version of DVUploader should continue to work with newer Dataverse software versions. With 4.9+, DVUploader uses the new lock API to wait more robustly for the lock during ingest to expire.
Frequently Asked Questions:
Can I upload a whole directory tree?
Yes, using the -recurse flag described above. Because the Dataverse software currently doesn't support folders within datasets, files from subdirectories are uploaded into the Dataset as one flat list. If you supply a directory name (e.g. test) that contains a subdirectory, e.g. test/subdir, any files in test/subdir will be ignored by default. If having all of your files appear as one flat list in the Dataset is acceptable, run the uploader with the -recurse flag; to upload only some subdirectories, omit -recurse and instead provide a list, e.g.
java … testdir testdir/subdir1 testdir/subdir2
Note that if there are files with the same name or content in these directories, the Dataverse software may fail to upload them or may modify their names (e.g. file_1.txt). (The DVUploader behaves the same as if you had uploaded all of the files via the Dataverse web application.)
Java cannot find the DVUploaderClient / Can I put the jar file in one place and not move it to upload different directories?
Yes. The DVUploader is a standard Java application, so you can put the jar wherever you like and run it from any directory, as long as you provide its path and use paths when identifying the directories to upload. In the command-line examples given above, change -jar DVUploader-v1.0.1.jar to -jar <path>/DVUploader-v1.0.1.jar (using the Unix path separator in this example). You can also add the jar to your default Java classpath to avoid typing the path on the command line.
The DVUploader was stopped before it finished, what do I do?
The DVUploader can just be restarted. It will scan through the existing files and start uploading when it finds the first one that does not exist in the Dataset.
Is the DVUploader Open Source?
Yes. The DVUploader is distributed under the Apache 2.0 open source license. The source code has been posted to GitHub and is distributed as a Dataverse software community product.
I see 'waiting' messages or errors from the Dataverse software!
Problems with the API key, server URL, or Dataset DOI should be discovered and reported early. If a required argument is missing, the DVUploader will display usage information. Any problems occurring as files are uploaded should relate to the specifics of that file or, if you see 'Waiting' messages, to the file before it.
The DVUploader uses the Dataverse software API to upload files, so any problem that could occur using the web interface can occur with the DVUploader as well - for example, issues related to data size (upload size limits), network connections (failures, connections timing out), or Dataverse software-specific operations, such as two files with the same content not being allowed. When uploading files such as zip files or spreadsheets that are further processed by the Dataverse software, you may see errors such as the file already existing (e.g. if you upload an Excel file for which a .tab file has already been uploaded).
Further, when the Dataverse software ingests a file, it places a lock on the Dataset until the processing is done. The DVUploader attempts to wait for such a lock to be removed before uploading the next file, but it only waits 60 seconds by default. (On 4.8.x instances of the Dataverse software, it also cannot tell whether the Dataset is locked due to ingest or for another cause, such as being 'in review'.) If you see an error uploading a file after one where the DVUploader was 'waiting', try increasing the -maxlockwait setting. In all cases, it can be useful to try uploading any file for which the DVUploader reported an error through the Dataverse repository's web interface.
My Dataset has been published. Do I need to create a new Draft version?
No. When the DVUploader sends a file, the Dataverse software will automatically create a new Draft version of your dataset. Your published Dataset remains unchanged and you will have to publish your draft to create a new version if you want the files you are uploading to be included.
I'd like the DVUploader to do 'X'...
Great! Tell your Dataverse repository administrators, who can help communicate your request to the larger community, or help develop it yourself. (The DVUploader leverages code originally developed as part of the SEAD project, and a number of features - including the ability to create a new Dataset and to upload metadata - have not yet been ported, so they would not have to be built from scratch.)
DVUploader's source is available on GitHub at https://github.com/IQSS/dataverse-uploader and the jar file can be built using the instructions there. Unless you are interested in the DVUploader's inner workings, you can download the jar file noted in the Installation section above instead.