Research Reproducibility: Definitions, Essential Components, Tools: Reproducibility Solutions

This guide is intended as a primer on research reproducibility.

Data

Reproducibility, as regards the data underlying a published report or study, requires adequate description and availability. Descriptive choices need to be made early in the research life cycle, ideally captured before the data is produced (e.g. data dictionaries) as well as throughout the data collection phase (e.g. as data objects are generated). Availability requires making the data accessible, ideally via a repository. 

 

Metadata:

Metadata--defined as structured information about data objects that allows a data user to meaningfully locate, understand, and use the data it describes--can be conceived as including two types of information: information that is embedded in the data files themselves, and information that is deliberately created by the data creator to give context to the data.

Embedded metadata includes information such as file properties automatically generated when files are created and modified, or DICOM metadata, which includes extensive information about the image as well as the instrument and image-capture procedure used to generate it. Although embedded metadata is often auto-generated and left untouched, it may be modified or enhanced by the data producer, either by providing further information in the file properties or by directly editing the metadata already captured.
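
As a rough illustration, the sketch below reads and lightly enhances embedded metadata in Python. The file names are placeholders, and the third-party pydicom package (not mentioned in this guide) is an assumption used here to read DICOM headers.

    import os
    import datetime

    import pydicom  # assumption: third-party package for reading DICOM headers

    # File-system properties are one form of embedded metadata
    # ("scan_001.dcm" is a placeholder file name).
    stats = os.stat("scan_001.dcm")
    print("Size (bytes):", stats.st_size)
    print("Last modified:", datetime.datetime.fromtimestamp(stats.st_mtime))

    # DICOM files carry richer embedded metadata in their headers.
    ds = pydicom.dcmread("scan_001.dcm")
    print("Modality:", ds.Modality)
    print("Manufacturer:", ds.Manufacturer)

    # The data producer can enhance embedded metadata before re-saving,
    # e.g. by adding a study description.
    ds.StudyDescription = "Baseline scan, reproducibility pilot"
    ds.save_as("scan_001_annotated.dcm")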

The level of information needed about data depends upon the study type and the discipline. Geospatial data, for example, requires an entirely different set of information for a secondary user to make sense of it than data from qualitative interviews. Many subject domains have standards for the level of metadata needed to make data understandable, as well as standardized, structured elements for describing a data object. One place to find metadata standards by subject domain is the Digital Curation Centre. The most commonly used discipline-agnostic metadata standard is Dublin Core.
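
As a sketch of what a minimal Dublin Core record might look like, the Python snippet below fills in a subset of the fifteen Dublin Core elements for a hypothetical data set and saves the record alongside the data; all values are placeholders.

    import json

    # A subset of the fifteen Dublin Core elements, with placeholder values.
    dublin_core_record = {
        "title": "Example Survey of Physical Activity, 2023",
        "creator": "Doe, Jane",
        "subject": "physical activity; survey data",
        "description": "De-identified responses from a 2023 activity survey.",
        "publisher": "Example University",
        "date": "2024-05-01",
        "type": "Dataset",
        "format": "text/csv",
        "identifier": "https://doi.org/10.0000/example",
        "language": "en",
        "rights": "CC BY 4.0",
    }

    # Saving the record as a file keeps the description with the data itself.
    with open("metadata.json", "w") as f:
        json.dump(dublin_core_record, f, indent=2)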

If you have questions about using these metadata standards, please contact either the Research Librarian for the Health Sciences (browderk@ecu.edu) or the Scholarly Communications department at Joyner Library (scholcomm@ecu.edu). 

Intentional metadata is equally important for understanding and working with study data. For quantitative data, it ideally includes the full set of variables collected, with each variable's name, definition, accepted values, and units of measurement. For qualitative interviews, it might include the full set of questions asked. The richer the level of documentation about the data, the better.
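
One common form of intentional metadata for quantitative data is a data dictionary. The Python sketch below writes a minimal, machine-readable data dictionary; the variables, definitions, and accepted values are hypothetical.

    import csv

    # One row per variable: name, definition, accepted values, and units.
    variables = [
        {"variable": "participant_id", "definition": "Unique study identifier",
         "accepted_values": "P0001-P9999", "units": ""},
        {"variable": "age", "definition": "Age at enrollment",
         "accepted_values": "18-90", "units": "years"},
        {"variable": "activity_level", "definition": "Self-reported activity level",
         "accepted_values": "1=low, 2=moderate, 3=high", "units": ""},
    ]

    with open("data_dictionary.csv", "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["variable", "definition", "accepted_values", "units"])
        writer.writeheader()
        writer.writerows(variables)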

For an example of rich documentation for a data set, consider the documentation accompanying the NHANES data, which includes the questionnaire instruments, codebooks, procedure manuals, and general information about the study.

In its most minimal form, metadata may be provided via ReadMe files that describe the data overall and give basic details about the individual data sets.
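
The sketch below writes a bare-bones ReadMe in Python; the fields shown are common suggestions rather than a required standard, and all values are placeholders.

    # Dataset-level context captured in a plain-text ReadMe.
    readme_text = """\
    Title: Example Survey of Physical Activity, 2023
    Authors / contact: Jane Doe (jdoe@example.edu)
    Description: De-identified survey responses collected January-June 2023.
    Files: survey_raw.csv, survey_clean.csv, data_dictionary.csv
    Collection methods: see the accompanying methods documentation
    Licensing / reuse: CC BY 4.0
    Related publication: (add citation or DOI when available)
    """

    with open("README.txt", "w") as f:
        f.write(readme_text)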

Data sharing:

Data sharing is a key component of reproducibility. Certain data cannot be shared widely, such as data protected under HIPAA or FERPA, proprietary data, or data deemed critical to national security. However, in many cases, at least some form of the data may be released (such as aggregated or anonymized data), or sharing may be restricted to specific cases or needs.

Additionally, basic descriptive information about the data can often be provided, such as the variables collected, the codebooks, the number of participants, and keywords describing the dataset. At a minimum, this level of contextual information about the data (metadata) allows other researchers to more easily discover the data and better understand what was done.

In considering how and where to share data, whether the full data set or some 'slice' or portion of it, data repositories are generally recommended. However, not all data repositories are of high quality; some have been referred to as 'data dumpsters' because they provide no long-term value or preservation of data sets, or lack key features that allow others to easily find and access data.

Finding data repositories:

The National Institutes of Health maintains dedicated data repositories for its various Institutes and Centers. Information about those repositories can be found on the NIH lists of domain-specific open data sharing repositories or its list of 'other NIH data resources', though the latter may have restrictions on use and access.

There are several cross-disciplinary research data repositories with broad use, support, and credibility across research domains. These include (among others not listed below):

  • Open Science Framework--free and open source. ECU is affiliated with OSF, allowing researchers to use their ECU credentials to sign in and create project spaces and upload data
  • Dataverse--an open source, international collaboration facilitating data sharing. ECU has an institutional affiliation with Dataverse, allowing researchers to use their ECU credentials to sign in and deposit data and data-related information
  • Dryad--fees apply to using Dryad since ECU is not a member institution. However, fees are low
  • Figshare--free accounts are available
  • Vivli--dedicated to sharing clinical research data, Vivli can assist researchers with sharing their data, as well as with anonymization of clinical data. Use of Vivli involves a cost, which can be written into funding applications early in the clinical trial planning process

re3data (the Registry of Research Data Repositories) is a searchable database of research data repositories that can assist researchers in identifying potential repositories for data sets. It is possible to search by keyword or browse by discipline to find potential repositories.
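
Beyond the web interface, re3data also offers a public REST API that returns XML. The Python sketch below queries it by keyword; the endpoint path and element names are assumptions and should be checked against the current re3data API documentation before use.

    import xml.etree.ElementTree as ET

    import requests  # assumption: third-party HTTP client

    # Keyword search against the re3data API (endpoint path is an assumption).
    response = requests.get(
        "https://www.re3data.org/api/beta/repositories",
        params={"query": "geospatial"},
        timeout=30,
    )
    response.raise_for_status()

    # Print the identifier and name of each matching repository.
    root = ET.fromstring(response.content)
    for repo in root.findall(".//repository"):
        print(repo.findtext("id"), repo.findtext("name"))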

Evaluating data repositories:

Due to the varying quality of data repositories, any data repository not listed by the NIH or in the (non-exhaustive) list of broadly accepted and trusted cross-disciplinary repositories above ought to be reviewed to ensure that any data deposited is not lost, made non-discoverable (i.e. not findable by others), or ceded to the repository owner/builder. Criteria for trustworthy data repositories are nicely delineated by the CoreTrustSeal organization, which certifies data repositories and maintains a list of repositories that have met its criteria for trustworthiness.

Choosing the data sets to share:

At present, there is little guidance on what data sets from a given study ought to be shared. This is likely due to the complexity of research designs, variability across study types and disciplines, as well as other factors related to storage and retention of research data products. Generally, it may be necessary to provide several versions of data sets (when feasible) to allow for full reproducibility--the versions may include the raw dataset, an interim dataset reflecting data cleaning or processing decisions, as well as the final dataset used for the analysis.
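
One way to keep those versions consistent is to produce each one from code rather than by hand. The Python sketch below assumes the third-party pandas package and uses hypothetical file names and cleaning rules; it writes a raw, an interim, and a final analytic data set from a single script.

    import os

    import pandas as pd  # assumption: third-party data analysis package

    # Tiny hypothetical raw data set standing in for real collected data.
    raw = pd.DataFrame({
        "participant_id": ["P0001", "P0002", None],
        "age": [34, 102, 47],
        "activity_level": [2, 3, 1],
    })

    for d in ("data/01_raw", "data/02_interim", "data/03_final"):
        os.makedirs(d, exist_ok=True)

    # Raw: an untouched copy of the data as collected.
    raw.to_csv("data/01_raw/survey_raw.csv", index=False)

    # Interim: cleaning decisions recorded in code, not made by hand-editing.
    interim = raw.dropna(subset=["participant_id"]).copy()
    interim["age"] = interim["age"].clip(lower=18, upper=90)
    interim.to_csv("data/02_interim/survey_interim.csv", index=False)

    # Final: only the variables used in the reported analysis.
    final = interim[["participant_id", "age", "activity_level"]]
    final.to_csv("data/03_final/survey_analytic.csv", index=False)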

Using 'gold standard' formats for data:

The accessibility of data depends as much on the format of the data as on its availability to others. Data shared in a file format that cannot be functionally read or used by another person is essentially inaccessible, negating the ‘sharing’ of the data. For this reason, when feasible, best practice is to share the data in a gold standard file format. Five of the most common gold standard file formats are listed below, followed by a short export sketch:

  • XML—eXtensible Markup Language—ideal for web-based content
  • CSV—Comma Separated Values—the universal file format for spreadsheet-formatted data
  • PDF—Portable Document Format—ideal for text-based documents as well as other files meant to be saved ‘as is’; note that PDFs are often difficult to edit or manipulate, so a document saved as PDF should be considered largely fixed
  • TIFF—Tagged Image File Format—the standard for image files
  • JSON—JavaScript Object Notation—a simple, text-based data interchange format that is readable across many different platforms and software packages
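
As a brief illustration of exporting to these formats, the Python sketch below writes a small, hypothetical tabular data set to CSV and a matching metadata record to JSON; pandas is assumed as a third-party dependency, and in practice the data would come from the statistical package or instrument software used in the study.

    import json

    import pandas as pd  # assumption: third-party data analysis package

    # Hypothetical analytic data set.
    df = pd.DataFrame({
        "participant_id": ["P0001", "P0002"],
        "age": [34, 57],
        "activity_level": [2, 3],
    })

    # CSV for the tabular data itself.
    df.to_csv("survey_analytic.csv", index=False)

    # JSON for structured, non-tabular context such as a small metadata record.
    record = {"title": "Example Survey of Physical Activity, 2023",
              "rows": len(df), "variables": list(df.columns)}
    with open("survey_analytic.json", "w") as f:
        json.dump(record, f, indent=2)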

For a complete listing of recommended file formats to facilitate sharing and long-term preservation of the products of research, see the Library of Congress list.

Methods

The methods sections of manuscripts are often poor records of the exact methods used for a given study. Data cleaning choices and exact computational methods used are typically not accurately reflected within the text-based description for a study. Richer information needs to be provided to allow others to reproduce the results and figures for a manuscript or report.

Providing full descriptions of methods:

If data is created for a study, a fully detailed account of the methods used to generate it ought to be provided.

A complete description of the methods used for analysis needs to be provided, along with the code or scripts used. Generally, computational analysis using a commonly used platform is preferred if the tools and operating systems cannot be mirrored or provided (see the section on tools and operating systems).

Repositories for data allow documentation of methods to be included with the data deposits. Additionally, online repositories for software and code, such as GitHub, allow for open sharing of analysis code and serve as a platform for distributing homegrown software.

It is recommended that the code be tested to ensure that it is fully executable and actually reproduces the results and figures in a report or manuscript.
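
One lightweight way to do this is a small test that re-runs the analysis and compares its output to the archived results. The Python sketch below is written in the style a test runner such as pytest could collect; the analysis module, function, and file names are hypothetical placeholders for a project's own entry point and outputs.

    import pandas as pd  # assumption: third-party data analysis package

    from analysis import run_analysis  # hypothetical project-specific module

    def test_results_reproduce():
        # Regenerate the main results table from the shared data and code.
        regenerated = run_analysis("data/03_final/survey_analytic.csv")

        # Compare against the table archived with the manuscript.
        published = pd.read_csv("results/table_1.csv")

        # check_exact=False tolerates harmless floating-point differences.
        pd.testing.assert_frame_equal(regenerated, published, check_exact=False)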

Tools and Operating Systems Details

In many cases, the ability to reproduce code or computational analysis depends upon the software and hardware used to create and run the code. Changes between versions of the same software or the operating system used can lead to code that does not reproduce the results and figures for a given study. Additionally, proprietary and homegrown software can prevent others from being able to reproduce results. Broadly, there are a few means of improving reproducibility as it relates to the tools and operating systems used:

  1. Provide a complete description of the tools and operating systems used for analysis, including version numbers. Information about software and hardware components should be included in the written methods section of a manuscript, but full details are best provided in appendices or in repositories recording information about the overall research process (a minimal sketch of capturing these details follows this list)
  2. Archive, preserve, and share homegrown software as able. Systems such as GitHub and Open Science Framework will track changes and versions of software throughout a research project
  3. Use shared computational notebooks to make the full process of analysis available, along with explanatory information. Jupyter notebooks are one such format with broad adoption
  4. Use technologies that allow for the packaging of all software, data, code, and other needed information, so that secondary data users can unpack the files and rerun the analyses in an emulated environment, either on a personal computer or via cloud-based platforms
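
As a minimal sketch of the first point above, the Python snippet below records the interpreter version, operating system, and installed package versions to a plain-text report that can be archived alongside the analysis; the output file name is a placeholder.

    import platform
    import sys
    from importlib import metadata

    # Gather interpreter, operating system, and package version details.
    lines = [
        f"Python: {sys.version}",
        f"OS: {platform.platform()}",
        "Packages:",
    ]
    for dist in sorted(metadata.distributions(),
                       key=lambda d: (d.metadata["Name"] or "").lower()):
        lines.append(f"  {dist.metadata['Name']}=={dist.version}")

    # Archive the report with the other research products.
    with open("environment_report.txt", "w") as f:
        f.write("\n".join(lines))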