Digital Object Management

(DRAFT) 7. Data Integrity and Authenticity

R7. The repository guarantees the integrity and authenticity of the data.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

“The TDL actively addresses the need to ensure the accuracy, integrity, authenticity, and permanence of the digital content that it manages, as well as the security of the services and platforms that it provides.”

https://texasdigitallibrary.atlassian.net/wiki/spaces/TDRUD/pages/291635428/Digital+Preservation+Policy
https://texasdigitallibrary.atlassian.net/wiki/spaces/TDRUD/pages/292159828

Data Integrity:

The Texas Data Repository provides basic, bit-level preservation through fixity checks and cyclic redundancy checks, both via the Amazon S3 hosting service, and through secure backup of deposited content. The TDL has an official backup strategy that requires all digital content to be stored in three distinct locations for all services, including the Texas Data Repository. The TDL will retain:

  1. the copy of the data residing on the production server (currently an Amazon S3 volume),

  2. nightly snapshots that can be used to restore the entire service to a particular date within the preceding month,

  3. a copy of all data files, made nightly with versioning and kept for one year, stored on Amazon S3 (https://aws.amazon.com/s3/); these copies can be used to restore individual files, but not the entire service.

Checksums are generated upon ingest (UNF for tabular data, MD5 for other files). After download, users can generate their own checksum and compare it against the published value.
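As an illustration, a depositor could verify a downloaded file against its published MD5 checksum with a short script such as the following (a minimal sketch; the function names and file path are our own, not part of the TDR service):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, published_md5):
    """Compare a local file's checksum with the value shown on the dataset page."""
    return md5_of_file(path) == published_md5
```

A mismatch would indicate the file was corrupted or altered in transit and should be downloaded again.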

Completeness of the Data and Metadata:

With respect to the data, the TDR does not approve user uploads before they are posted. As such, the TDR does not warrant that the content or user uploads are timely, accurate, complete, reliable or correct in their posted forms.

https://texasdigitallibrary.atlassian.net/wiki/spaces/TDRUD/pages/289079299

With respect to metadata, Dataverse requires nine metadata fields describing the dataset be completed before upload. Administrators of the various dataverses can require that additional fields be completed before dataset publication. The TDR User Guide provides instructions (Section 2.3 in the link below).

Description and Details of Version Control Strategy:

 

When first published, a dataset is automatically assigned “Version 1.” Any subsequent change to the dataset results in the creation of a new version. A small change (e.g., correcting a typo) creates “Version 1.1”; a large change (e.g., adding a new data column) creates “Version 2.0”. Adding a new file also automatically creates a “Version 2.0”. All versions can be made public.
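The major/minor numbering described above can be sketched as a simple rule (an illustrative sketch only, not TDR code; the function name and the treatment of the initial version as “1.0” are our assumptions):

```python
def next_version(current, change):
    """Illustrative sketch of Dataverse-style version numbering.

    A 'minor' change (e.g., correcting a typo) bumps the minor number;
    a 'major' change (e.g., adding a data column or a new file) bumps
    the major number and resets the minor number to zero.
    """
    major, minor = (int(part) for part in current.split("."))
    if change == "minor":
        return f"{major}.{minor + 1}"
    if change == "major":
        return f"{major + 1}.0"
    raise ValueError("change must be 'minor' or 'major'")
```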

The TDR gives users the ability to follow the audit trails related to data changes. Users can read information about additions/edits and can also compare versions to identify particular differences (Section 4.3 in the link below).

Usage of Appropriate International Standards and Conventions:

The TDR has created a Metadata Dictionary for users which includes citation and domain specific metadata fields.
http://data.tdl.org/wp-content/uploads/2016/09/TDR-Metadata-Dictionary.pdf

The TDR uses and/or recommends the following standards and conventions:

  • Dublin Core (DC) and Open Archives Initiative (OAI)

  • ISO 8601 for date entry

  • GeoNames for geospatial metadata

  • Data Documentation Initiative for social science/humanities metadata

  • SIMBAD astronomical database and FITS for astronomy and astrophysics metadata

  • NCBI Taxonomy & NCBO Bioportal for life sciences metadata

  • ISA-Tab for biomedical metadata

http://guides.dataverse.org/en/latest/user/appendix.html

Provenance:

Depositors must log in to the TDR service through their respective TDL member institution. For example, University of Texas at Austin affiliates must provide their UT EID and password via Shibboleth. When logging in for the first time, depositors must agree to the General Terms of Use before being allowed to create an account.

Users are required to provide the TDR with accurate and complete registration information. Depositors’ first and last names and affiliations are displayed alongside their uploads.

Each dataset is assigned a persistent identifier (a DOI), and the corresponding metadata is part of the complete digital object.

The TDR also tracks and records information when a registered or non-registered guest downloads a file: “When you download a file from Texas Data Repository, our software collects user account data such as your name, username, email, institution and position if provided (or the session ID data for guest users) and accompanying download data such as the time of the download.”

8. Appraisal

R8. The repository accepts data and metadata based on defined criteria to ensure relevance and understandability for data users.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

Collection Policies

The TDR does not have a specific collection development policy with respect to data deposit and “encourages data deposit from all disciplines and can accept any type of data file.”


The TDR does have a policy with respect to long-term preservation. “Long-term preservation of Dataverse content, beyond the ten-year retention period, is subject to the TDL’s selection criteria, appraisal of the content, and budgetary and technical support of resources necessary to meet this goal.” The appraisal criteria are provided in the Digital Preservation Policy.

Completeness of Data/Metadata:

The TDR does not currently have quality control checks in place to ensure the completeness and understandability of data deposited. The Terms of Use, signed by all users, state that the TDR “has no obligation to monitor the site, service, content, or user uploads.” The TDR service is provided “as is” and “as available”. As such, the TDR does not warrant that “the content or user uploads are timely, accurate, complete, reliable or correct in their posted forms on the service” or that “any defects or errors will be corrected.” Use of the service is solely at the user’s risk.

The TDR does have procedures in place with respect to metadata. At minimum, nine metadata fields must be completed before a dataset can be uploaded (Section 2.3 in the link below). Administrators of different sub-dataverses can require that additional metadata fields be completed.

The TDR will provide basic, bit-level preservation through fixity checks, cyclic redundancy checks (CRC), and secure backup of deposited content. The onus for providing metadata sufficient for long-term preservation, above and beyond the nine required fields, falls to the depositors and the administrators of the various sub-dataverses.

Non-Preferred/Preferred File Formats:

While the TDR does not publish a list of preferred formats, it does advise depositors to provide data in non-proprietary formats in order to ensure broader use for research, e.g., CSV or XML. Additional features and support for certain types of data files exist.


There are, however, no quality control checks in place to ensure that data producers adhere to preferred formats. Non-preferred formats undergo the same preservation and backup strategies as preferred formats.

Depositors grant all necessary permissions and required licenses to the TDR to make submitted or deposited content available for archiving, preservation and access, within the site. Among others this includes permission to “store, translate, copy or re-format the content in any way to ensure its future preservation and accessibility, and improve usability.”

9. Documented Storage Procedures

R9. The repository applies documented processes and procedures in managing archival storage of the data.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

User Documentation:

User documentation, processes, and procedures are available on the TDR’s Atlassian wikispace:

This wikispace tracks changes over time so users can access and compare process and procedures as the TDR evolves.

In addition, the Harvard Dataverse team maintains its own set of online user guides:

http://guides.dataverse.org/en/latest/

As Dataverse is open source software, its code is available on GitHub:

Security:

With respect to security, one must have a TDL/TDR Atlassian account to edit the TDR user documentation. Dataverse source code is open, so anyone in the community can contribute. Access to TDL services can occur via two channels: a Shibboleth-managed log-in or an SSH key log-in. The TDR is hosted by Amazon Web Services (AWS), and all Amazon employees who need access to data centers must be approved before gaining entry.

https://tdl.org/wp-content/uploads/downloads/2015/04/Texas-Digital-Library-Data-Security-Policy.pdf
https://aws.amazon.com/compliance/data-center/controls/

Data Storage:

Datasets (along with associated metadata and documentation) stored correctly in the TDR will be preserved for a minimum period of ten years. Storage beyond the ten years is subject to the TDL’s selection criteria, appraisal of content, and the budgetary/technical resources necessary to meet that goal. There is no charge to researchers for deposits, provided a total dataset is not larger than 10 GB (although member institutions can consult with the TDR on a case-by-case basis for files and/or datasets above the volume limits). And, as mentioned above, the TDR is hosted by Amazon Web Services.

Backup Strategy/Data Recovery

The TDL has an official backup strategy in which TDL retains:

  • the copy of the data residing on the production server, which is an Amazon S3 volume;

  • nightly snapshots that can be used to restore the entire service to a particular date within the preceding month;

  • one snapshot from each month, retained for one year.

Snapshot backups are stored as Amazon Elastic Block Store (EBS) snapshots, which is replicated storage with regular systematic data integrity checks.

Risk Management:

The TDL actively addresses the need to ensure the accuracy, integrity, authenticity, and permanence of the digital content that it manages, as well as the security of the services and platforms that it provides. The TDL ensures the security of its Dataverse instance as follows:

  • System Security

  • Data Integrity

  • Regulatory and Legal Considerations

Consistency across Archival Copies:

Checksums are generated for data files upon ingest (UNF for tabular data, MD5 for other files). Datasets are assigned persistent URLs and DOIs. Changes made to a dataset create new versions. The DOI always resolves to the most current published version.

Storage Media Monitoring:

Amazon Web Services is responsible for monitoring the status of the servers.

https://aws.amazon.com/compliance/data-center/controls/

10. Preservation Plan

R10. The repository assumes responsibility for long-term preservation and manages this function in a planned and documented way.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

The Texas Data Repository’s preservation plan is outlined in its Digital Preservation and Security section of the TDR’s policies. The Texas Digital Library “accepts the responsibility to preserve and provide access to research data, including associated metadata and documentation that is properly deposited in the Texas Data Repository. This responsibility includes the provision of digital means to preserve and ensure ongoing access to said content for a minimum of ten years after it is deposited in Dataverse.”

Parts of the plan are also outlined in the Service Level Agreement (SLA) paperwork signed by a member institution and the TDL. This contract helps to delineate the different responsibilities of the member institution, its designated TDR liaison and the Texas Digital Library.

Texas Data Repository provides basic, bit-level preservation through fixity checks, cyclic redundancy checks, and secure backup of deposited content. Further and more in-depth digital preservation activities and services must be provided by a digital preservation program at the institution where the research data was originally generated. The aforementioned SLA describes backup “as copying the bitstream and storing that copy in a separate storage space.”

The Terms of Use agreement required of an individual depositor provides for all the actions necessary for this bit-level preservation. By agreeing, users grant to TDR “all necessary permissions and required licenses to make the content [they] submit or deposit available for archiving, preservation and access, within the site.”

The TDR’s policies and documentation also cover Submission Information Packages/standards and Archival Information Packages/standards.

A SIP is the content and metadata received from an information producer by a preservation repository. As mentioned in an earlier requirement, the TDR requires a minimum of nine metadata fields to be completed before a dataset can be uploaded. Different dataverse administrators can require additional fields.

An AIP is the set of content and metadata managed by a preservation repository, organized in a way that allows the repository to perform preservation services. Upon ingest, the TDR uses the JHOVE tool to identify aspects of file formats. Provenance data is provided through the depositor and his/her affiliated institution. Amazon Web Services performs both fixity and cyclic redundancy checks. Backup plans and access rights information are all documented.

https://aws.amazon.com/s3/faqs/#Durability_.26_Data_Protection

11. Data Quality

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

The TDR does not review datasets before they are uploaded and subsequently published. As such, the TDR does not endorse, take responsibility for, or make any representations or warranties for any user uploads.

With respect to metadata, the repository requires at least nine metadata fields be completed before a dataset can be uploaded. Some of these fields have pre-set formats, and incorrect responses are rejected; e.g., dates must be entered in ISO 8601 format. Dataverse and sub-dataverse administrators can require additional metadata fields be completed before datasets are allowed to be published. Administrators can also ask depositors to include a README.txt file if a dataset requires special instructions, disclaimers, or definitions.
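The date check described above can be illustrated with a short validation sketch (our own illustration of ISO 8601 calendar-date checking, not the repository's actual validation code):

```python
import re
from datetime import date

# ISO 8601 complete calendar date: four-digit year, two-digit month and day.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid_iso_date(value):
    """True only for a complete, real ISO 8601 calendar date (YYYY-MM-DD)."""
    if not ISO_DATE.match(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2018-02-30
        return True
    except ValueError:
        return False
```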

Apart from the rejection of non-ISO dates noted above, there is no automated assessment of metadata. The quality of metadata is the responsibility of depositors and administrators. The TDR metadata dictionary is available for reference.

The repository’s designated community can comment on and/or rate data and metadata by citing and using datasets. Datasets are assigned DOIs expressly for citation purposes. “Good” datasets will be cited more frequently, as shown through different citation indices (Google Scholar, Web of Science, etc.).

And members can always provide feedback on any aspect of the TDR to their institution’s designated liaison.

12. Workflow

R12. Archiving takes place according to defined workflows from ingest to dissemination.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

Workflows/Process Descriptions:

The Texas Data Repository uses its Atlassian Confluence pages to communicate with depositors, users, and guests about its workflows and its handling of data.

For example, one can find, among others, the Service Level Agreement, Terms of Use, User Guide, Policies, and FAQs all in one location:

These Confluence pages include page-history functionality, so it is possible for a user to track changes to workflows over time. And the Service Level Agreement states that the TDL will “provide timely reporting to data repository liaisons regarding any system issues.”

Level of Security:

The TDR does not accept content that contains confidential or sensitive information, and requires that contributors remove, replace, or redact such information from datasets prior to upload.

Exceptions to this rule are explicitly stated. See the “Restrictions” section in the Terms of Use.

Depositors and administrators are authenticated via their respective institutions.

https://tdl.org/wp-content/uploads/downloads/2015/04/Texas-Digital-Library-Data-Security-Policy.pdf

The TDR is hosted by Amazon Web Services which is responsible for the security of the actual servers.

https://aws.amazon.com/security/

Ingest and Output of Data/Datasets:

The TDR does not appraise or select data upon ingest. Depositors from any discipline are encouraged to use the service. While all types of data and formats are accepted, the repository provides full support (i.e., data exploration, analysis, and meta-analysis via the TwoRavens suite of statistical tools) only for tabular data.

The repository does make appraisal decisions with respect to long-term preservation beyond the ten-year minimum period. The selection criteria are listed in the Digital Preservation Policy section.

Checksums are generated for data upon ingest. After download, users can generate their own checksums and compare.

13. Data Discovery and Identification

R13. The repository enables users to discover the data and refer to them in a persistent way through proper citation.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

The Texas Data Repository has a search box on the left side of the screen that searches within the dataverse. There is also an advanced search feature available that provides faceted metadata searching.

The TDR maintains on its Atlassian Confluence pages the “Texas Data Repository Metadata Dictionary” that contains citation and domain specific metadata fields.

There are also user guides created by the Harvard Dataverse team. These guides are fully available, and users can find in them more information about the citation and domain-specific metadata fields supported by Dataverse.
http://guides.dataverse.org/en/latest/user/index.html http://guides.dataverse.org/en/latest/user/appendix.html

Dataverse supports the OAI-PMH protocol, which facilitates machine harvesting of metadata from one system to another. http://guides.dataverse.org/en/latest/admin/integrations.html?highlight=oai%20pmh

According to the TDL Dataverse Implementation Working Group’s final report, TDR metadata may be aggregated by other systems using API applications or the OAI-PMH protocol. Permission from the TDL is not needed to harvest metadata into aggregated discovery or repository platforms unless aggregators intend to harvest on a permanent basis.
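Because OAI-PMH is a plain HTTP GET protocol, a harvester only needs to build query URLs against the repository's advertised endpoint. The sketch below constructs such a request (the base URL is a hypothetical placeholder, not the TDR's actual endpoint):

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; a real harvester would use the
# OAI-PMH base URL published by the repository.
OAI_ENDPOINT = "https://dataverse.example.edu/oai"

def list_records_url(metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    The verb and its arguments are ordinary query parameters;
    the response is XML.
    """
    if resumption_token:
        # When resuming a paged harvest, the token is the only argument allowed.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{OAI_ENDPOINT}?{urlencode(params)}"
```

Fetching the resulting URL and following `resumptionToken` elements in each XML response would walk the full set of records.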

The Working Group’s report can be found here: https://tdl-ir.tdl.org/handle/2249.1/76364

Metadata from the TDR can be exported in Dublin Core, DDI, and JSON.

The repository is included in the re3data.org registry, a registry of research data repositories.

https://www.re3data.org/repository/r3d100012385

Finally, the TDR offers both DOIs and recommended data citations. As an example, listed below is the citation for a dataset currently hosted in the University of Texas’s dataverse:

  • Dainer-Best, Justin, 2018, “Replication data and materials for Positive Imagery Training Increases Positive Self-Referent Cognition in Depression”, doi:10.18738/T8/RHEMGW, Texas Data Repository Dataverse, V3, UNF:6:FgY50+UEDA/95sPKids5WA==

14. Data Reuse

R14. The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data.

Compliance Level:

  • The guideline has been fully implemented for the needs of the repository.

The Texas Data Repository requires a minimum of nine metadata fields to be completed before a dataset can be uploaded and subsequently published. Those fields are:

  1. Title

  2. Name

  3. Contact with email (not displayed to user)

  4. Description

  5. Date (must be expressed in ISO format: YYYY-MM-DD)

  6. Subject (domain specific)

  7. Production date (must be expressed in ISO format)

  8. Production place

  9. Kind of Data

Other fields can be required by administrators of the different dataverses.
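A completeness check against the nine required fields could look like the following sketch (the field names are our paraphrases of the list above; the internal names Dataverse actually uses may differ):

```python
# Paraphrased names for the nine required fields listed above (assumed,
# for illustration only; not Dataverse's internal field identifiers).
REQUIRED_FIELDS = {
    "title", "author_name", "contact_email", "description",
    "deposit_date", "subject", "production_date", "production_place",
    "kind_of_data",
}

def missing_required_fields(metadata):
    """Return the set of required fields that are absent or blank."""
    return {field for field in REQUIRED_FIELDS
            if not str(metadata.get(field, "")).strip()}
```

Administrators of individual dataverses could extend `REQUIRED_FIELDS` to model their additional requirements.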

The repository maintains a metadata dictionary to help users with citation and content specific fields.

According to the Dataverse User Guides hosted by Harvard, many metadata schemas are supported by Dataverse, including Dublin Core’s DCMI. Harvard also provides a spreadsheet crosswalk for several schemas.
http://guides.dataverse.org/en/4.9.1/user/appendix.html

Data Formats:

Recall that the TDR is an institutional repository and the Designated Community is broad. The repository accepts datasets in a variety of formats. The onus for using a “correct” format lies with the particular researchers/depositors (assuming a “correct” format exists for members of a particular discipline). While virtually all formats are supported by the TDR, additional features/support are provided for only certain types of data. For example, the TwoRavens tool can be used to visualize CSV, Rdata, and dta files. Other software needed for opening and exploring the repository’s content is the responsibility of the end user.

Future Migrations:

TDR policies and the Terms of Use agreement require depositors to grant the repository all necessary permissions and required licenses to make the content they submit or deposit available for archiving, preservation, and access within the site. This includes permission to “store, translate, copy or re-format the content in any way to ensure its future preservation and accessibility, and improve usability and/or protect respondent confidentiality.”

The Service Level Agreement indicates that the Texas Digital Library shall “[b]e responsible for the stewardship, technological oversight, and upgrades of the data repository software infrastructure.”

Thus, the repository ensures that any and all datasets properly uploaded will be migrated forward when the dataverse software/platform is updated.

Data ‘Understandability’:

The TDR is provided “as is” and “as available” and without warranty of any kind, express or implied. However, the integrity of uploaded data is maintained by fixity checks, cyclic redundancy checks, secure backup, network monitoring and protection, and system updates.