Policy

A. Purpose

It is the mission of the Texas Digital Library (TDL) to enable digital initiatives in support of research, scholarship, and learning in Texas. As a part of this mission, the TDL endeavors to collect, preserve, and disseminate scholarly materials for the benefit of both producers and consumers of academic research and scholarship. The Texas Data Repository, TDL’s instance of the Dataverse software, encompasses all of the Dataverse collections of its member institutions and is the digital resource intended to address a consortium-level need for publishing, managing, and providing access to research-generated data sets. The following Digital Preservation Policy describes the extent to which the TDL will support sustainable access to the digital research data and related content deposited in the Texas Data Repository.

The preservation objectives of the Texas Data Repository are:

Part of the TDL’s vision in establishing a consortium Dataverse repository installation is to make research materials freely available to anyone, anywhere, and at any time. The TDL is an advocate for open access to scholarly work including research data. The incentives to researchers for publishing and preserving their research data in the Texas Data Repository are:

B. Scope

The TDL accepts the responsibility to preserve and provide access to research data, including associated metadata and documentation, that is properly deposited in the Texas Data Repository. This responsibility includes the provision of digital means to preserve and ensure ongoing access to said content for a minimum period of ten years after it is deposited in the TDR Dataverse repository. Long-term preservation of TDR Dataverse repository content beyond the ten-year retention period is subject to the TDL’s selection criteria, appraisal of the content, and the budgetary and technical resources necessary to meet this goal. Metadata for content removed from the TDR Dataverse repository, regardless of reason or retention period, may be preserved for an undetermined period of time after said content’s removal.

The Texas Data Repository content will be selected and appraised according to the following preservation priorities and levels of commitment:

  1. Research data associated with publications – great effort will be made to ensure the long-term preservation of data associated with journal or scholarly publications, so long as the data meets the TDL collection policies and the Texas Data Repository remains the data’s hosted or cited repository.
  2. Stand-alone data publications with high research value – reasonable effort will be made to ensure the long-term preservation of data and metadata of stand-alone publications that library professionals identify as having high research value to the broader academic community.
  3. Other data files and materials – efforts may or may not be made to retain ephemeral materials considered to lack significant or long-term value, although particular files may be preserved on a select basis as appropriate.

Additionally, the Texas Data Repository will accept data submissions of any format, but provides full support (i.e., data exploration, analysis, and meta-analysis via the TwoRavens suite of statistical tools) only for tabular data, preferably in the following formats:

These files can be in compressed ZIP format at ingest; however, they may not exceed 2 GB in size. Any individual file uploaded to the repository must be under 4 GB, though any uploads over 2 GB, and some below that threshold, may be slow or stall due to variables outside of TDL's control. Please email support@tdl.org if you are having trouble uploading files. If you have files over 4 GB, we will consider support options on a case-by-case basis and in consultation with your institutional TDR liaison. Please see http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html and http://guides.dataverse.org/en/latest/user/dataset-management.html for more specific information on data set and metadata formats.
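As an illustration only, the size limits above could be checked before upload with a short script like the following. This is a sketch and not part of the Texas Data Repository software: the function name `classify_upload`, its return labels, and the use of binary gigabytes (1024³ bytes) are assumptions, since the policy does not specify which gigabyte definition applies.

```python
# Illustrative pre-upload check based on the limits stated in this policy.
# Assumption: 1 GB = 1024**3 bytes; the policy does not define the unit.
GB = 1024 ** 3
MAX_FILE_BYTES = 4 * GB      # individual files must be under 4 GB
SLOW_UPLOAD_BYTES = 2 * GB   # uploads over 2 GB may be slow or stall
MAX_ZIP_BYTES = 2 * GB       # compressed ZIP files may not exceed 2 GB


def classify_upload(size_bytes: int, is_zip: bool = False) -> str:
    """Return a hypothetical status label for a file of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes >= MAX_FILE_BYTES:
        # Handled case by case with the institutional TDR liaison.
        return "contact_support"
    if is_zip and size_bytes > MAX_ZIP_BYTES:
        return "zip_too_large"
    if size_bytes > SLOW_UPLOAD_BYTES:
        return "may_stall"
    return "ok"


if __name__ == "__main__":
    for size, zipped in [(1 * GB, False), (3 * GB, False), (3 * GB, True), (5 * GB, False)]:
        print(size // GB, "GB", "zip" if zipped else "file", "->", classify_upload(size, zipped))
```

Files below 2 GB are reported as "ok" here, although the policy notes that some uploads below that threshold may also be slow.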

Texas Data Repository provides basic, bit-level preservation through fixity checks and secure backup of deposited content (See also Information Security). Further and more in-depth digital preservation activities and services must be provided by a digital preservation program at the institution where the research data was originally generated.

Additionally, the TDR provides distributed digital preservation in Chronopolis via DuraCloud@TDL. That process is described in Procedures - Part B, below.

Any data stored outside the TDR (TACC external storage, for instance) is not covered by the TDR Preservation Policies, and is the responsibility of the researcher.

C. Strategic Plan

The TDL has an official backup strategy that requires all digital content to be:

The TDL systems also provide security services key to basic digital preservation, namely access control, network monitoring and protection, encryption, and system updates (see Information Security Policy). There are currently no institutional limitations on the overall quantity of data that can be stored on TDL servers, only a limit on the size of individual files (4 GB) uploaded via the Texas Data Repository application and a recommendation that datasets not exceed 10 GB.

Procedures

A. The Dataverse software's best practices for data management and preservation include:

The TDL systems infrastructure includes bit-level fixity checking via the Amazon EBS and S3 hosting services. For more information about security, backups, and integrity checking, see also Information Security.
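For readers unfamiliar with fixity checking, the general idea can be sketched as follows: a checksum is recorded when a file is deposited and is later recomputed and compared to detect any change at the bit level. The example below is a generic illustration using Python's standard `hashlib` module; it does not describe how the Amazon EBS and S3 services or the TDL systems perform these checks, and the function names are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a hex checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_checksum, algorithm="sha256"):
    """Return True if the file's current checksum matches the recorded value."""
    return file_checksum(path, algorithm) == expected_checksum.strip().lower()


if __name__ == "__main__":
    import tempfile

    # Record a checksum at "deposit" time, then verify it later.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
        tmp.write(b"id,value\n1,42\n")
    recorded = file_checksum(tmp.name)
    print("fixity intact:", verify_fixity(tmp.name, recorded))
```

A mismatch between the recorded and recomputed values indicates that the stored bits have changed and that the file should be restored from a backup copy.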

B. The distributed digital preservation process for datasets contained in the Texas Data Repository (TDR) addresses workflow elements conducted by Texas Digital Library (TDL) administrators and policy implications determined by the TDR Steering Committee (SC).

Policy

Workflow

Producing the SIP

Ingest

From TDR to DuraCloud

From DuraCloud to Chronopolis

Chronopolis Replication 

Snapshot Ingest Completion

Maintenance

Retrieving Content from Chronopolis

Resources

Definitions

Archival Information Package (AIP): AIPs consist of Content Information and the associated Preservation Description Information (PDI), which is preserved within the digital preservation repository.

Bag: Based on the concept of “bag it and tag it,” where a digital collection is packed into a directory (the bag) along with a machine-readable manifest file (the tag) that lists the contents.

BagIt: A set of hierarchical file layout conventions for storage and transfer of arbitrary digital content. A "bag" has just enough structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. The BagIt format is suitable for reliable storage and transfer. BagIt is the hierarchical file format used for SIPs.

Bridge: An application running in Chronopolis that facilitates ingest from DuraCloud into Chronopolis. It creates the snapshot directory, generates checksums, and activates Chronopolis by providing the snapshot ID and directory path.

Chronopolis: Geographically distributed digital preservation network that provides services for long-term preservation and curation of digital holdings. Partners include: UC San Diego Library, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies, and the Texas Digital Library. 

DuraCloud: The intake provider for Chronopolis. This serves as a staging ground to create the AIPs from TDR Dataverse SIPs.

Fixity Values: Checksums (MD5 and SHA-256) generated by DuraCloud and Bridge.

Snapshot: A set of files which have been selected by a user of DuraCloud to be moved as a group into Chronopolis. These files, along with a few system files (manifests, properties listing files, etc.), form a collection in Chronopolis. (This is the level of granularity used for deposits and restores.)

Snapshot Metadata: A set of metadata which is associated with a snapshot. This metadata is partially system generated and partially user defined. This information is stored as part of a snapshot, and may be used by DuraCloud and/or Chronopolis.

Submission Information Package (SIP): SIPs are delivered by the producer to the digital preservation repository for use in the construction of one or more AIPs. 
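To make the Bag and BagIt definitions above more concrete, the following sketch builds a minimal bag using only the Python standard library: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing a checksum for each payload file. It is an illustration of the BagIt layout only and is not the tool or workflow used to produce SIPs for the Texas Data Repository; the function names and the choice of SHA-256 for the manifest are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy source_dir into bag_dir/data and write the BagIt tag files."""
    bag = Path(bag_dir)
    payload = bag / "data"
    # copytree creates bag_dir and data/; it raises an error if data/ already exists.
    shutil.copytree(source_dir, payload)

    # The bag declaration required by the BagIt specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The payload manifest: one "<checksum>  <path relative to the bag>" line per file.
    lines = []
    for item in sorted(p for p in payload.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(item)}  {item.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as work:
        source = Path(work) / "dataset"
        source.mkdir()
        (source / "readme.txt").write_text("example payload\n", encoding="utf-8")
        bag_path = make_bag(str(source), str(Path(work) / "dataset_bag"))
        print((bag_path / "manifest-sha256.txt").read_text(encoding="utf-8"))
```

Because the manifest travels with the payload, a receiving system can recompute each checksum and confirm that the files arrived unchanged.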

References

The Dataverse Project, “Harvard Dataverse Preservation Policy,” http://best-practices.dataverse.org/harvard-policies/harvard-preservation-policy.html

Purdue University Research Repository (PURR), “PURR Digital Preservation Policy,” https://purr.purdue.edu/legal/digitalpreservation

Texas Digital Library, “Texas Library Vision 2020-2023,” https://www.tdl.org/governance/strategic-plan/

Preserving digital Objects With Restricted Resources, “Tool Grid,” http://digitalpowrr.niu.edu/tool-grid/

Digital Curation Centre, “DataVerse,” http://www.dcc.ac.uk/resources/external/dataverse

Harvard Dataverse, “UCLA Social Science Data Archive Dataverse,” http://dataarchives.ss.ucla.edu/archive%20tutorial/archivingdata.html

Harvard’s Institute for Quantitative Social Science (IQSS), “About TwoRavens,” http://datascience.iq.harvard.edu/about-tworavens

University of North Carolina – The Odum Institute, “Digital Preservation Policies,” http://www.irss.unc.edu/odum/contentSubpage.jsp?nodeid=629

Harvard Dataverse Project, “User Guide: Tabular Data File Ingest,” http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html

Elizabeth Quigley, IQSS-Harvard University, “The Expanding Dataverse,” http://dataverse.org/files/dataverseorg/files/introduction_to_dataverse.pdf?m=1447352697