Digital Preservation Policies and Procedures

Policy

A. Purpose

It is the mission of the Texas Digital Library (TDL) to enable digital initiatives in support of research, scholarship, and learning in Texas. As a part of this mission, the TDL endeavors to collect, preserve, and disseminate scholarly materials for the benefit of both producers and consumers of academic research and scholarship. The Texas Data Repository, TDL’s instance of the Dataverse software, encompassing all of the Dataverse collections of its member institutions, is the digital resource intended to address a consortium-level need for publishing, managing, and providing access to research-generated data sets. The following Digital Preservation Policy describes the extent to which the TDL will support sustainable access to the digital research data and related content deposited in the Texas Data Repository.

The preservation objectives of the Texas Data Repository are:

  • to collect, preserve, and disseminate the data sets and related information generated by researchers affiliated with any of the TDL’s member institutions who choose to deposit their content therein;
  • to enable researchers affiliated with any of the TDL’s member institutions to comply with the mandates of funding agencies to manage, preserve, and share their research data; and
  • to provide the means for users to discover and access the data sets and metadata generated by academics affiliated with any of the TDL’s member institutions over the long term.

Part of the TDL’s vision in establishing a consortium Dataverse repository installation is to make research materials freely available to anyone, anywhere, and at any time. The TDL is an advocate for open access to scholarly work including research data. The incentives to researchers for publishing and preserving their research data in the Texas Data Repository are:

  • data that might be precariously stored on fragile, random, or unsustainable storage devices can be securely preserved for the long term;
  • data that might otherwise become neglected over time can be preserved and made accessible for other interested researchers to use and cite, potentially providing wider visibility and impact for the research;
  • many funding agencies and scholarly journals require data management plans that detail how the data will be managed, made accessible, and preserved.

B. Scope

The TDL accepts the responsibility to preserve and provide access to research data, including associated metadata and documentation that is properly deposited in the Texas Data Repository. This responsibility includes the provision of digital means to preserve and ensure ongoing access to said content for a minimum period of ten years after it is deposited in the TDR Dataverse repository. Long-term preservation of TDR Dataverse repository content, beyond the ten-year retention period, is subject to the TDL’s selection criteria, appraisal of the content, and budgetary and technical support of resources necessary to meet this goal. Metadata for content removed from the TDR Dataverse repository, regardless of reason or retention period, may be preserved for an undetermined period of time after said content’s removal.

The Texas Data Repository content will be selected and appraised according to the following preservation priorities and levels of commitment:

  1. Research data associated with publications – great effort will be made to ensure the long-term preservation of data associated with journal or scholarly publications, so long as the data meets the TDL collection policies and the Texas Data Repository remains the data’s hosted or cited repository.
  2. Stand-alone data publications with high research value – reasonable effort will be made to ensure the long-term preservation of data and metadata of stand-alone publications that library professionals identify as having high research value to the broader academic community.
  3. Other data files and materials – efforts may or may not be made to retain ephemeral materials considered to lack significant or long-term value, although particular files may be preserved on a select basis as appropriate.

Additionally, the Texas Data Repository will accept data submissions of any format, but provides full support (i.e., data exploration, analysis, and meta-analysis via the TwoRavens suite of statistical tools) only to tabular data, preferably in the following formats:

  • SPSS (POR and SAV formats)
  • STATA
  • R data
  • CSV

These files can be in compressed ZIP format at ingest; however, they may not exceed 2 GB in size. Any individual file uploaded to the repository must be under 4 GB, though uploads over 2 GB, and some below that threshold, may be slow or stall due to variables outside of TDL's control. Please email support@tdl.org if you are having trouble uploading files. If you have files over 4 GB, we will consider support options on a case-by-case basis and in consultation with your institutional TDR liaison. Please see http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html and http://guides.dataverse.org/en/latest/user/dataset-management.html for more specific information on data set and metadata formats.
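The size thresholds above can be illustrated with a short sketch. It is not part of the Texas Data Repository software; the function name and return labels are hypothetical, and it assumes decimal gigabytes because the policy does not say whether GB means 10^9 or 2^30 bytes.

```python
GB = 10**9  # assumption: decimal gigabytes; the policy does not specify GB vs GiB

HARD_LIMIT = 4 * GB      # individual files must be under 4 GB
SLOW_THRESHOLD = 2 * GB  # uploads over 2 GB may be slow or stall


def classify_upload(size_bytes: int) -> str:
    """Return a hypothetical label describing how an upload of this size is treated."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes >= HARD_LIMIT:
        # Handled case by case in consultation with the institutional liaison.
        return "contact-support"
    if size_bytes > SLOW_THRESHOLD:
        return "allowed-may-stall"
    return "allowed"
```

A depositor-side script could call `classify_upload(os.path.getsize(path))` before attempting an upload and direct oversized files to support@tdl.org.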

Texas Data Repository provides basic, bit-level preservation through fixity checks and secure backup of deposited content (See also Information Security). Further and more in-depth digital preservation activities and services must be provided by a digital preservation program at the institution where the research data was originally generated.

Additionally, the TDR provides distributed digital preservation in Chronopolis via DuraCloud@TDL. That process is described in Procedures - Part B, below.

Any data stored outside the TDR (TACC external storage, for instance) is not covered by the TDR Preservation Policies, and is the responsibility of the researcher.

C. Strategic Plan

The TDL has an official backup strategy that requires all digital content to be maintained in the following copies:

  • the copy of the data residing on the production server, which is an Amazon Elastic Block Store (EBS) volume;
  • nightly snapshots that can be used to restore the entire service to a particular date within the preceding month; and
  • one snapshot from each month, retained for one year.
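The retention schedule for snapshots can be sketched as a selection rule. This is only an illustration under stated assumptions, not TDL's actual backup tooling: the function name is hypothetical, "the preceding month" is taken as 30 days, "one year" as 365 days, and the monthly snapshot is assumed to be the earliest one available in each calendar month.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, today):
    """Return the sorted snapshot dates retained under the assumed schedule."""
    nightly_cutoff = today - timedelta(days=30)   # assumption: "month" = 30 days
    yearly_cutoff = today - timedelta(days=365)   # assumption: "year" = 365 days
    keep = set()
    monthly = {}  # (year, month) -> earliest snapshot seen in that month
    for snap in snapshot_dates:
        if snap > today:
            continue  # ignore dates in the future
        if snap >= nightly_cutoff:
            keep.add(snap)  # every nightly snapshot from the last month
        if snap >= yearly_cutoff:
            key = (snap.year, snap.month)
            if key not in monthly or snap < monthly[key]:
                monthly[key] = snap
    keep.update(monthly.values())  # one snapshot per month for the last year
    return sorted(keep)
```

Any snapshot not returned by the function would be a candidate for deletion under these assumptions.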

The TDL systems also provide security services key to basic digital preservation, namely access control, network monitoring and protection, encryption, and system updates (see Information Security Policy). There are currently no institutional limitations to the overall quantity of data that can be stored on TDL servers, only limitations on the size of individual files (4 GB) uploaded via the Texas Data Repository application and a recommendation for datasets not to exceed 10 GB.

Procedures

A. The Dataverse software's best practices for data management and preservation include:

  • automatic extraction of metadata from tabular files and FITS;
  • standard descriptive metadata schemas such as OAI DC, DDI (for statistical and social science), ISA-Tab (for biomedical), FITS (for astronomy);
  • re-formatting of tabular data to simple open format text files;
  • data and metadata versioning;
  • database maintenance;
  • checksum generation upon ingest (UNF for tabular data, MD5 for other files);
  • persistent URL – DOI (minted by DataCite);
  • deaccessioning of data, but not citation metadata, if necessary.
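As an illustration of checksum generation upon ingest and later fixity checking, the following minimal sketch uses Python's standard `hashlib` module. It is not the Dataverse implementation; the function names are hypothetical, and it covers only the MD5 case (the UNF signature used for tabular data is not shown).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_md5):
    """Return True if the file still matches the checksum recorded at ingest."""
    return md5_of_file(path) == recorded_md5.lower()
```

The checksum returned by `md5_of_file` would be stored at deposit time; a later call to `fixity_ok` with the stored value reports whether the bits have changed.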

The TDL systems infrastructure includes bit-level fixity checking via Amazon EBS and S3 host service. For more information about security, backups and integrity checking, see also Information Security.

B. The distributed digital preservation process for datasets contained in the Texas Data Repository (TDR) addresses workflow elements conducted by Texas Digital Library (TDL) administrators and policy implications determined by the TDR Steering Committee (SC).

Policy

  • Chronopolis preservation for TDR content will be part of the Texas Data Repository Dataverse service at no additional cost to the members.
  • Only published content will be preserved.
  • TDL will ingest published datasets into Chronopolis two times per year.
  • Communication/reporting expectations:
    • TDL will notify TDR SC members when error notifications occur. 
    • TDL will inform TDR SC members of ongoing updates, issues, and programmatic changes as they occur.
    • TDL will report errors as they occur
  • If an item is permanently removed from an institution’s dataverse, and it has been preserved in Chronopolis, the TDR SC member will inform TDL of this removal. TDL will work with the institution to authorize deletion of preservation copies.

Workflow

  • Dataset must have citation metadata to be eligible for preservation.
  • File sizes must conform to TDR rules.

Producing the SIP

  • The Submission Information Package (SIP) will consist of one dataset (including all of the files and metadata (descriptive, administrative, and structural) at the dataset and file level) PUBLISHED in TDR.
  • Only the most recent version of a published dataset in a collection will serve as a SIP.
  • SIP will be packaged as a Research Data Alliance conformant zipped BagIt bag. The bag contains an OAI-ORE map file and a datacite.xml file. Additionally, a separate copy of the datacite.xml file is included. The SIP is submitted to Chronopolis via DuraCloud. 
  • TDL will submit previously published datasets using a manual admin API call. If not already generated by Dataverse, this workflow creates a JSON-LD serialized OAI-ORE map file, which is also available as a metadata export format in the Dataverse web interface.
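To make the zipped BagIt packaging concrete, the sketch below writes a minimal bag with the Python standard library and then zips it. It is an illustration only, not the Dataverse archiving code: the function name and payload file names are hypothetical, the layout follows the basic BagIt specification (RFC 8493), and the OAI-ORE map and datacite.xml files that a TDR bag also carries are not generated here.

```python
import hashlib
import shutil
from pathlib import Path


def write_zipped_bag(bag_dir, payload):
    """Write a minimal BagIt bag and zip it.

    payload maps relative file names to bytes (hypothetical example input).
    Returns (bag directory, path of the zip archive).
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    # Bag declaration required by the BagIt specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name in sorted(payload):
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload[name])
        # Payload manifest line: checksum, then the path relative to the bag root.
        lines.append(f"{hashlib.md5(payload[name]).hexdigest()}  data/{name}\n")
    (bag / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    archive = shutil.make_archive(
        str(bag), "zip", root_dir=str(bag.parent), base_dir=bag.name
    )
    return bag, Path(archive)
```

The payload manifest is what allows a receiving system to confirm that every file in the bag arrived intact.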

Ingest

From TDR to DuraCloud

  • TDL will establish and maintain one DuraCloud dashboard to move content from the TDR to Chronopolis. This includes all published content in the institutional dataverses within the TDR. 
  • All “published” datasets in the TDR will be automatically included in the ingest. TDL will do this twice a year. 
  • TDL will use the command line to automate dataset staging into DuraCloud.
  • SIPs (resultant Bags from Dataverse) will be generated consisting of one dataset each that includes all the files and the Dataverse metadata at the dataset and file levels. Each version of the dataset will become its own SIP.  

From DuraCloud to Chronopolis

  • SIPs in DuraCloud spaces are manually transferred into a staging space in DuraCloud called the snapshot space. 
  • DuraCloud creates a properties file and stores it in the snapshot space.
  • Based on the snapshot properties file, DuraCloud makes the snapshot space read-only.
  • The Bridge Ingest application, part of the Chronopolis platform, facilitates the upload from DuraCloud to Chronopolis.
    • The Bridge Ingest app:
      • Creates a directory for the snapshot on University of California San Diego’s file system (referred to as bridge storage in DuraSpace documentation)
      • Pulls all content from the space to /data directory under the snapshot directory
      • Moves the snapshot properties file from the data directory to the snapshot directory (which is one level above the data directory)
      • Stores the properties for all content items in a json file, also in the snapshot directory
      • Creates an MD5 manifest of all content items, adding each item after the MD5 has been verified to match the DuraCloud checksum. The initial MD5 value was generated upon upload from Dataverse to DuraCloud.
      • Creates a SHA-256 manifest of all content
  • The Chronopolis Intake Service sends requests to the Bridge Ingest App for new snapshot content.
    • Information provided on snapshot call:
      • DuraCloud host and port
      • Store ID
      • Space ID
      • Snapshot ID
  • Chronopolis pulls all content from bridge storage into Chronopolis preservation storage.
    • Chronopolis verifies content against manifest
    • Chronopolis creates Bags for the content
    • For specific steps, see Chronopolis’s ChronCore Processes document
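The manifest steps performed by the Bridge Ingest app can be sketched as follows. This is a simplified illustration, not the Bridge application's code: the function name and the `duracloud_md5` input (a mapping of relative paths to the MD5 values recorded by DuraCloud) are hypothetical, and files are read fully into memory for brevity.

```python
import hashlib
from pathlib import Path


def build_manifests(data_dir, duracloud_md5):
    """Build MD5 and SHA-256 manifests for the files under data_dir.

    An item is added to the MD5 manifest only after its computed MD5 matches
    the value recorded by DuraCloud; mismatched items are reported separately.
    """
    root = Path(data_dir)
    md5_manifest, sha256_manifest, mismatches = {}, {}, []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        content = path.read_bytes()
        # The SHA-256 manifest covers all content.
        sha256_manifest[rel] = hashlib.sha256(content).hexdigest()
        md5 = hashlib.md5(content).hexdigest()
        if duracloud_md5.get(rel) == md5:
            md5_manifest[rel] = md5
        else:
            mismatches.append(rel)
    return md5_manifest, sha256_manifest, mismatches
```

A non-empty `mismatches` list would indicate content that changed between upload to DuraCloud and transfer to bridge storage.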

Chronopolis Replication 

  • Chronopolis replicates copies of the AIP at the three partner sites (University of California, San Diego; Texas Digital Library (via the Texas Advanced Computing Center); and University of Maryland Institute for Advanced Computer Studies).
  • Replication consists of the following steps:
    • The Chronopolis Ingest Service creates replication requests for the appropriate Chronopolis nodes
    • Chronopolis transfers data to other sites
    • Chronopolis automatically verifies that transferred content matches checksum values
    • If content is valid, the properties are registered with Chronopolis’s Audit Manager and queried in future audits
    • If content is not valid, Chronopolis notifies the partner site with a message that a new request needs to be generated. The cause of failure is first manually reviewed by Chronopolis and TDL to determine if the cause was intermittent (network issues) or something more serious (bit flips).
    • Replication is completed
    • For specific steps, see Chronopolis’s ChronCore Processes document

Snapshot Ingest Completion

  • Chronopolis notifies DuraCloud that Snapshot creation is complete
  • DuraCloud removes all content from bridge storage and from staging storage
    • DuraCloud makes space in staging provider for new snapshots and ingests
  • DuraCloud displays the snapshot and the content it contains in a Chronopolis Storage Provider. The SIP now becomes the Archival Information Package (AIP). 

Maintenance

  • Once an object is in Chronopolis, TDL will inform TDR SC members of ingest, storage, or other errors.
  • TDR SC members are responsible for communicating any user concerns, requests, or errors to TDL for possible remedy.
  • Example: A researcher notifies a TDR SC member that files were corrupted in Dataverse. The TDR SC member can contact TDL and look into this to see if it needs to be replicated… etc.

Retrieving Content from Chronopolis

  • Retrieval is permitted for data recovery or other circumstances that might arise.
  • There will be a cost for data recovery equivalent to the AWS S3 Egress fees incurred upon recovery via TDL’s DuraCloud. 
  • The TDR SC member will initiate request via a TDL Helpdesk Ticket for retrieval with explanation of circumstances and clear description of needs. 
  • TDR will interact with both DuraCloud and the Bridge ingest app. Both of these components work with Snapshots of Spaces. 
  • To retrieve content, TDL:
    • Identifies the snapshot that contains the items to be retrieved. Note: If the snapshot was large enough on ingest to be placed in multiple bags in Chronopolis, the bridge app requires that all bags be retrieved back into DuraCloud upon retrieval.
    • Once everything is back in DuraCloud, TDR can access select objects through a Staging Space and pull these specific objects from DuraCloud into local storage.
  • At this point TDR contacts the TDR SC member to facilitate further handling or transfer depending on the specific needs of the retrieval request.

Resources

Definitions

Archival Information Package (AIP): AIPs consist of Content Information and the associated Preservation Description Information (PDI), which is preserved within the digital preservation repository 

Bag: Based on the concept of “bag it and tag it,” where a digital collection is packed into a directory (the bag) along with a machine-readable manifest file (the tag) that lists the contents.

BagIt: A set of hierarchical file layout conventions for storage and transfer of arbitrary digital content. A "bag" has just enough structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. This BagIt format is suitable for reliable storage and transfer. BagIt is the hierarchical file format used for SIPs.

Bridge: An application running in Chronopolis that facilitates ingest from DuraCloud into Chronopolis. It creates the snapshot directory, generates checksums, and activates Chronopolis by providing the snapshot ID and directory path.

Chronopolis: Geographically distributed digital preservation network that provides services for long-term preservation and curation of digital holdings. Partners include: UC San Diego Library, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies, and the Texas Digital Library. 

DuraCloud: The intake provider for Chronopolis. This serves as a staging ground to create the AIPs from TDR Dataverse SIPs.

Fixity Values: Checksums (MD5 and SHA-256) generated by DuraCloud and Bridge.

Snapshot: A set of files which have been selected by a user of DuraCloud to be moved as a group into Chronopolis. These files, along with a few system files (manifests, properties listing files, etc) form a collection in Chronopolis. (This is the level of granularity used for deposits and restores.)

Snapshot Metadata: A set of metadata which is associated with a snapshot. This metadata is partially system generated and partially user defined. This information is stored as part of a snapshot, and may be used by DuraCloud and/or Chronopolis

Submission Information Package (SIP): SIPs are delivered by the producer to the digital preservation repository for use in the construction of one or more AIPs. 

References

The Dataverse Project, “Harvard Dataverse Preservation Policy,” http://best-practices.dataverse.org/harvard-policies/harvard-preservation-policy.html

Purdue University Research Repository (PURR), “PURR Digital Preservation Policy,” https://purr.purdue.edu/legal/digitalpreservation

Texas Digital Library, “Our Mission and Vision,” https://www.tdl.org/strategic-plan/vision/

Preserving digital Objects With Restricted Resources, “Tool Grid,” http://digitalpowrr.niu.edu/tool-grid/

Digital Curation Centre, “DataVerse,” http://www.dcc.ac.uk/resources/external/dataverse

Harvard Dataverse, “UCLA Social Science Data Archive Dataverse,” http://dataarchives.ss.ucla.edu/archive%20tutorial/archivingdata.html

Harvard’s Institute for Quantitative Social Science (IQSS), “About TwoRavens,” http://datascience.iq.harvard.edu/about-tworavens

University of North Carolina – The Odum Institute, “Digital Preservation Policies,” http://www.irss.unc.edu/odum/contentSubpage.jsp?nodeid=629

Harvard Dataverse Project, “User Guide: Tabular Data File Ingest,” http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html

Elizabeth Quigley, IQSS-Harvard University, “The Expanding Dataverse,” http://dataverse.org/files/dataverseorg/files/introduction_to_dataverse.pdf?m=1447352697