...

  • For testing, it will be useful to set up a proxy server on dev (S3 uploads from clients and Dataverse would go to the proxy and the proxy would forward to TACC). Per Chris@TACC, this should allow uploads from any IP addresses we want without TACC having to alter their firewall settings. To test the maximum performance we would need to avoid the proxy (which wouldn’t be used in production). It’s a question for TDL as to which proxy software to use (I found TinyProxy but have no preference).

    • Status 12/19: Plan to use TinyProxy, not yet done. Since the upload URL feature signs the unproxied URL, some code changes will be needed in Dataverse to allow a proxy with direct uploads.

    • Status 1/3/20: Nginx set up in Dec. Configuration adjusted to use /ebs file system for caching uploads to allow testing of larger files. Using the proxy requires allowing a given IP address. One additional step is required to test upload via the web interface: the user must navigate to the proxy address first and accept the self-signed certificate. Once this is done, the browser will allow the background (ajax) uploads to proceed. Also - nginx does not restart when the machine is rebooted, so nginx needs to be checked/started for testing.
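A minimal nginx configuration along the lines described above might look like the following. All hostnames, paths, and IP ranges are placeholders, not the actual dev setup:

```nginx
# Forward S3 uploads to TACC; buffer request bodies on the /ebs file
# system so large uploads can be tested. Placeholder values throughout.
server {
    listen 443 ssl;
    server_name proxy.example.edu;                      # placeholder hostname

    ssl_certificate     /etc/nginx/ssl/selfsigned.crt;  # self-signed; must be accepted in the browser first
    ssl_certificate_key /etc/nginx/ssl/selfsigned.key;

    client_max_body_size  0;                            # no upload size limit
    client_body_temp_path /ebs/nginx_body_temp;         # cache large bodies on /ebs

    # Using the proxy requires allowing a given IP address
    allow 203.0.113.0/24;                               # placeholder range
    deny  all;

    location / {
        proxy_pass https://s3.tacc.example.org;         # placeholder upstream
        proxy_request_buffering on;
    }
}
```

Since nginx does not come back after a reboot in the current setup, a `systemctl enable nginx` (or equivalent) would also be worth adding.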

  • While the API to add a file to a Dataset checks whether the user (as identified by their access token) has permission to add a file to the specified Dataset, the API call to retrieve a presigned S3 upload URL currently only checks that the user is a valid Dataverse user. It should deny the request unless the user has permission to add files to the dataset. (This is trivial to do, but until then, any valid Dataverse user could add files to S3 that would not be associated with any Dataverse entries.)

    • Status 12/19: Fixed. Code will only return an upload URL if the user has permission to add files to the dataset.
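The client side of the permission-gated call can be sketched as below. The endpoint path and response handling are assumptions for illustration, not the confirmed POC API; with the 12/19 fix, a token whose user lacks add-file permission should get an error rather than a usable URL:

```python
# Sketch of requesting a presigned S3 upload URL from Dataverse with an
# API token. The endpoint path below is illustrative, not confirmed.
import urllib.request

def build_upload_url_request(base_url: str, api_token: str,
                             dataset_pid: str) -> urllib.request.Request:
    """Build an authenticated GET request for a presigned S3 upload URL."""
    url = (f"{base_url}/api/datasets/:persistentId/uploadsid"
           f"?persistentId={dataset_pid}")  # hypothetical endpoint name
    req = urllib.request.Request(url)
    # Dataverse APIs authenticate via the X-Dataverse-key header.
    req.add_header("X-Dataverse-key", api_token)
    return req
```

Sending this request with `urllib.request.urlopen` would return the signed URL on success; the permission check now happens server-side before the URL is signed.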

  • Dataverse was originally designed to use one (configurable) store for files (could be a local file system, S3, Swift, etc.). The POC works with Dataverse configured with S3. However, as is, all files must be in the same store. To support sending some files to a different store, Dataverse will need to be modified to work with multiple stores. This is potentially useful in general, e.g. to support sending new data to a new store without having to move existing files, but, for the remote storage case, if the use case is to send only some files to the new store (specific datasets, only files larger than a cut-off size, as decided by an admin/user based on preference or knowledge of where the data initially exists, etc.), then additional work would be needed to implement that policy. Dataverse does already keep track of the store used for a given dataset, so some of the code required to identify which store a file is in already exists.

    • Status 12/19: Implemented - see Multiple Store section above for details.
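For reference, multiple stores are defined via per-store JVM options. The option keys below follow the multiple-store configuration pattern, but the exact names and values are examples and may differ by Dataverse version:

```shell
# Illustrative: define a local file store and an S3 store (names/values
# are examples, not the actual dev configuration)
./asadmin create-jvm-options "-Ddataverse.files.local1.type=file"
./asadmin create-jvm-options "-Ddataverse.files.local1.label=LocalFS"
./asadmin create-jvm-options "-Ddataverse.files.local1.directory=/usr/local/dvn/data"
./asadmin create-jvm-options "-Ddataverse.files.tacc.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.tacc.label=TACC-S3"
./asadmin create-jvm-options "-Ddataverse.files.tacc.bucket-name=dataverse-bucket"
# Default store used for new files:
./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=tacc"
```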

  • To support upload via the Dataverse web interface, additional work will be needed. This could be significant in that the current Dataverse upload is managed via a third-party library, and it may be difficult to replace just the upload step without impacting other aspects of the current upload process (e.g. showing previews, allowing editing of file names and metadata, providing warnings if/when files have the same content or colliding names). If this is too complex, it will be possible to create an alternate upload tab - Dataverse already provides a mechanism to add alternate upload mechanisms that has been used to support uploads from Dropbox, rsync, etc.

    • Status 12/19: I’ve started investigating the upload process in the web interface and ways to turn off the automatic direct upload to glassfish and to be able to trigger an alternate process to instead request upload URLs and perform direct uploads. So far, it still looks like it will be possible to do this without changing the upload user interface in any way visible to the user.

    • Status 1/3/20: Have implemented direct upload through the standard Dataverse upload interface. The method used is dynamically configured based on the choice for the current Dataverse. Have verified multiple sequential uses of the ‘Select Files To Add’ button as well as selecting multiple files at once. Have also tested files up to 39 GB. Error handling, file cleanup on cancel, and some style updates to the progress bar TBD.
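The flow the interface now drives can be sketched as three steps: request a presigned URL, PUT the bytes directly to S3, then register the stored file with the dataset. The `jsonData` field names below are assumptions based on the direct-upload design, not a confirmed wire format:

```python
# Sketch of the direct-upload flow: (1) obtain a presigned URL (see the
# upload-URL API above), (2) PUT the file body straight to S3, (3) tell
# Dataverse about the uploaded file. Field names are illustrative.
import json
import urllib.request

def build_s3_put(presigned_url: str, data: bytes) -> urllib.request.Request:
    """Step 2: upload the file body directly to the presigned S3 URL."""
    return urllib.request.Request(presigned_url, data=data, method="PUT")

def build_register_request(base_url, api_token, dataset_pid,
                           storage_identifier, file_name, md5_hex):
    """Step 3: register the already-uploaded file with the dataset."""
    json_data = {
        "storageIdentifier": storage_identifier,  # returned with the presigned URL
        "fileName": file_name,
        "md5Hash": md5_hex,  # client-computed, since Dataverse never sees the bytes
    }
    url = f"{base_url}/api/datasets/:persistentId/add?persistentId={dataset_pid}"
    req = urllib.request.Request(url, data=json.dumps(json_data).encode(),
                                 method="POST")
    req.add_header("X-Dataverse-key", api_token)
    req.add_header("Content-Type", "application/json")
    return req
```

One consequence of step 3 is that checksums must come from the client, since the server never handles the file content.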

  • Depending on whether the normal processing steps that Dataverse performs during upload are desirable for large files, additional work will be needed to reinstate those steps. These include:

    • thumbnails - shown during upload and in the dataset file list,

    • metadata extraction - currently limited to tabular files (DDI metadata) and FITS (astronomy, currently broken)

    • mimetype analysis - Dataverse can use what the client sends, check the file extension, and/or, in some cases, look at the file contents to determine mimetype. My sense is that most files will get a reasonable mimetype without having their content inspected. FWIW – most content inspection relies on the first few bytes of a file so it’s possible this could be done without retrieving the whole file.

    • Derived file creation - currently limited to deriving .tab files (which are viewable in TwoRavens and Data Explorer) from spreadsheet files

    • Unzipping - Dataverse automatically expands a top-level zip file into component files and stores those

    • Future - possibilities such as virus checks, etc.
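The mimetype point above can be made concrete: most magic-number checks only need the leading bytes of a file, which could be fetched from S3 with a ranged GET rather than retrieving the whole object. A minimal sketch, with the signature table abbreviated:

```python
# Detect a mimetype from the first few bytes of a file, without reading
# the rest. Against S3 these bytes could be fetched with a ranged GET
# ("Range: bytes=0-15") instead of downloading the whole object.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-",             "application/pdf"),
    (b"PK\x03\x04",        "application/zip"),
    (b"\x1f\x8b",          "application/gzip"),
    (b"SIMPLE  =",         "application/fits"),  # FITS files begin with this header keyword
]

def sniff_mimetype(head: bytes, default: str = "application/octet-stream") -> str:
    """Return a mimetype guess based on leading magic bytes."""
    for sig, mime in MAGIC:
        if head.startswith(sig):
            return mime
    return default
```

Files with no recognizable signature would fall back to the extension-based or client-supplied type, as Dataverse does today.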

  • There is also functionality that touches the files that can be triggered at other times. In the POC, these are still enabled. These include:

    • full-text indexing - if enabled, and if the file is under the configured size limit, this runs whenever indexing occurs, including after any dataset change or when triggered by an admin/cron job.

    • thumbnails - I think Dataverse tries to create these if they don’t exist whenever the data is displayed. I have not yet checked whether it is ‘smart’ - i.e. only reads the file if it’s a type for which previews can be created.

    • previews - for any file types for which a previewer is registered. Some previewers, such as the video viewer, are smart - the video plays as the file streams, so a preview would only download the whole file if a user watched the whole video. However, most would try to download the full file. There isn’t a mechanism now that would limit the size of files for which a preview is allowed. Note that in 4.18+, previews of published files can also be embedded in the file page.
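One way to add the missing size limit noted above would be a simple gate consulted before any of this post-upload processing runs. The limits and action names below are hypothetical, not existing Dataverse settings:

```python
# Hypothetical size gate for post-upload processing: skip full-text
# indexing or previews for files over a configured limit. The limit
# values and action names are illustrative assumptions.
DEFAULT_LIMITS = {
    "fulltext_index": 100 * 1024 * 1024,  # 100 MB
    "preview":        500 * 1024 * 1024,  # 500 MB
}

def allow_processing(action: str, file_size: int, limits=DEFAULT_LIMITS) -> bool:
    """Return True if `action` should run on a file of `file_size` bytes."""
    limit = limits.get(action)
    return limit is None or file_size <= limit
```

Actions with no configured limit (e.g. thumbnails here) would be allowed by default, matching current behavior.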

...