SRA

Last updated on 2024-11-15

Overview

Questions

  • What is the SRA, and how is it used for data deposition?

Objectives

  • Get to know the SRA database
  • Understand the requirements for data deposition
  • Learn the required and recommended fields used for an RNA-sequencing experiment

Introduction


The Sequence Read Archive (SRA) is a public repository for high-throughput sequencing data. It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes the National Center for Biotechnology Information (NCBI) in the USA, the European Bioinformatics Institute (EBI) in Europe, and the DNA Data Bank of Japan (DDBJ) in Japan. The SRA includes data from all life forms, as well as from metagenomics and ecological studies. It stores raw sequencing data and, in some cases, processed alignment data. Researchers often deposit their data in the SRA because deposition is typically a requirement for publishing research papers.

The SRA aims to provide a central repository for next-generation sequencing data and to link to resources that reference or use these data. The repository tracks project metadata for studies and experiments, supports flexible submission and retrieval of ancillary data, and provides normalized data structures. Further objectives are to decouple the submission process from content and to lay the groundwork for interactive user submissions and data retrieval.

SRA concepts


The SRA metadata model separates experimental data from its metadata and organizes the metadata into a structured hierarchy:

  • Study: Represents a collection of experiments aimed at achieving a common goal, serving as an overarching project that encapsulates the purpose and scope of the research conducted. A study provides the context and rationale for the experiments.

  • Experiment: Refers to a specific set of laboratory procedures applied to input material, designed to achieve an anticipated result. Each experiment is a component of a study, contributing to its overall objective. It can also be interpreted as a series of experimental protocols or assays conducted to test a hypothesis or gather data.

  • Sample: The focus of an experiment, which can be a single sample or multiple samples grouped together. The results of an experiment are articulated based on these samples, detailing the outcomes for individual samples or their collective group as defined by the experimental setup. A sample is not equivalent to an individual, as one individual can provide multiple samples (e.g., right and left ventricle tissue samples).

  • Run: Denotes the actual outcomes or results of the experiment. These runs encompass the data collected from a sample or a group of samples linked to a specific experiment. Essentially, a run is the execution of the sequencing or analytical process that generates data.

  • Submission: A submission encompasses a bundle of metadata and data objects, coupled with instructions on how the submission should be handled. This package facilitates the organized and controlled entry of both experimental data and its corresponding metadata into the repository, ensuring that the data can be accurately classified, accessed, and utilized.
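The hierarchy above can be sketched as a handful of nested data classes. This is only an illustration of how the concepts relate to each other, not the SRA's actual schema: the class layout, the field names (`alias`, `individual`, `files`, `instructions`), and all example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Material an experiment is performed on; not the same as an individual."""
    alias: str
    individual: str  # one individual can provide several samples


@dataclass
class Run:
    """Data produced by one execution of the sequencing process."""
    alias: str
    files: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    """Laboratory procedures applied to one sample or a group of samples."""
    alias: str
    samples: List[Sample] = field(default_factory=list)
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    """Collection of experiments that share a common goal."""
    alias: str
    experiments: List[Experiment] = field(default_factory=list)

    def all_runs(self) -> List[Run]:
        """Flatten the runs of every experiment in the study."""
        return [run for exp in self.experiments for run in exp.runs]


@dataclass
class Submission:
    """Bundle of metadata and data objects plus handling instructions."""
    study: Study
    instructions: str  # free text here; the real submission format is not modelled


# Hypothetical example: two tissue samples taken from the same individual.
left = Sample(alias="LV-01", individual="donor-1")
right = Sample(alias="RV-01", individual="donor-1")
study = Study(
    alias="heart-rnaseq",
    experiments=[
        Experiment("exp-LV", samples=[left], runs=[Run("run-LV", ["LV-01_R1.fastq.gz"])]),
        Experiment("exp-RV", samples=[right], runs=[Run("run-RV", ["RV-01_R1.fastq.gz"])]),
    ],
)
submission = Submission(study=study, instructions="hold until publication")

print([run.alias for run in study.all_runs()])
```

Keeping `Sample` separate from the individual mirrors the point made above: the left and right ventricle samples are distinct objects even though they share one donor.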

Accessing the SRA


The SRA database can be accessed through the NCBI website.

Although discussing data access is outside the scope of this material, it is important to note that data access is a key aspect of the FAIR principles. This means that with the right metadata, data can be reused, generating more follow-up studies and citations.

Note that not all data in the SRA are public: some data are private and accessible only to authorized users, and data can also be embargoed for a certain period of time. Recently, the SRA started to use placeholders to keep track of data that were produced by large cohorts and require controlled access.

Data deposition


There are multiple ways to deposit data in the SRA; the most common is through the submission portal. This approach is detailed here. At the CRC, we plan to streamline this process using a programmatic approach for data submission, as detailed in this SOP. We plan to automate this process in the future.

Metadata for a transcriptomics experiment


As you may have noticed, the ISA framework and the SRA metadata model differ. To enable unambiguous interpretation and adhere to the FAIR principles, we adopt some pragmatic rules to define and create the Investigation, Study, and Assay files. The following tables show the required and recommended fields for an RNA-sequencing experiment; they can easily be adapted to other types of experiments.

Investigation metadata

| Metadata Field | Required? | Definition | Comment |
| --- | --- | --- | --- |
| Identifier | required | Unique identifier for the investigation | |
| Title | required | Title of the investigation | |
| Description | required | Brief description of the investigation | |
| Submission Date | required | Date the investigation was submitted | Standardized date format |
| Public Release Date | required | Date the investigation was publicly released | Standardized date format |
| Contacts | required | List of contacts associated with the investigation | Names, affiliations, roles, and contact details |
| Study Design Types | recommended | Types of study designs | |
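A check of the Investigation fields listed above could look like the following sketch. The table only asks for a "standardized date format"; treating that as ISO 8601 (`YYYY-MM-DD`) is an assumption made here, and the function name `check_investigation` and the example record are hypothetical.

```python
from datetime import date
from typing import Dict, List

# Field names taken from the Investigation table above.
REQUIRED_INVESTIGATION_FIELDS = [
    "Identifier",
    "Title",
    "Description",
    "Submission Date",
    "Public Release Date",
    "Contacts",
]
DATE_FIELDS = ["Submission Date", "Public Release Date"]


def check_investigation(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name in REQUIRED_INVESTIGATION_FIELDS:
        if not str(record.get(name, "")).strip():
            problems.append(f"missing required field: {name}")
    for name in DATE_FIELDS:
        value = str(record.get(name, "")).strip()
        if value:
            try:
                # Assumption: "standardized" means ISO 8601 calendar dates.
                date.fromisoformat(value)
            except ValueError:
                problems.append(f"{name} is not an ISO 8601 date: {value}")
    return problems


# Hypothetical record used only to exercise the check.
example = {
    "Identifier": "INV-001",
    "Title": "Cardiac RNA-sequencing pilot",
    "Description": "Bulk RNA-seq of left and right ventricle tissue.",
    "Submission Date": "2024-11-15",
    "Public Release Date": "2025-05-15",
    "Contacts": "Jane Doe, Example Institute, submitter, jane.doe@example.org",
}
print(check_investigation(example))
```

Returning a list of problems rather than raising on the first one lets a submitter fix every issue in a single pass.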

Sample metadata [1]

| Metadata Field | Required? | Definition | Comment |
| --- | --- | --- | --- |
| unique ID | required | Identifier for a sample that is at least unique within the project | |
| sample type | required | The type of the collected specimen, e.g., tissue biopsy, blood draw, or throat swab | Ontology field - e.g., OBI or EFO |
| species | required | The primary species of the specimen, preferably the taxonomic identifier | Ontology field - NCBI Taxonomy |
| tissue/organism part | required | The tissue from which the sample was taken | Ontology field - e.g., Uberon |
| disease | required | Any diseases that may affect the sample | This may not necessarily be the same as the host’s disease, e.g., healthy brain tissue might be collected from a host with type II diabetes while cirrhotic liver tissue might be collected from an otherwise healthy individual. Ontology field - e.g., MONDO or DO |
| sex | required | The biological/genetic sex of the sample | Ontology field - e.g., PATO |
| development stage | required | The developmental stage of the sample | Ontology field - e.g., Uberon or HsapDv; species dependent |
| collection date | required | The date on which the sample was collected, in a standardized format | Collection date in combination with other fields such as location and disease may be sufficient to de-anonymize a sample |
| external accessions | recommended | Accession numbers from any external resources to which the sample was submitted | e.g., BioSamples, BioStudies |
| ancestry/ethnicity | recommended | Ancestry or ethnic group of the individual from which the sample was collected | Ontology field - e.g., HANCESTRO |
| age | recommended | Age of the organism from which the sample was collected | |
| age unit | recommended | Unit of the value of the age field | Ontology field - e.g., UO |
| BMI | recommended | Body mass index of the individual from which the sample was collected | Only applies to human samples |
| treatment category | recommended | Treatments that the sample might have undergone after collection | Ontology field - e.g., OBI, NCIt, or OGMS |
| cell type | recommended | The cell type(s) known or selected to be present in the sample | Ontology field - e.g., CL |
| growth conditions | recommended | Features relating to the growth and/or maintenance of the sample | |
| genetic variation | recommended | Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, e.g., abnormal chromosome counts, major translocations, or indels | |
| sample collection technique | recommended | The technique used to collect the specimen, e.g., blood draw or surgical resection | Ontology field - e.g., EFO or OBI |
| phenotype | recommended | Any relevant (usually abnormal) phenotypes of the specimen or sample | Ontology field - e.g., HP or MP; species dependent |
| cell cycle | recommended | The cell cycle phase of the sample (for synchronized growing cells or a single-cell sample), if known | Ontology field - e.g., GO |
| cell location | recommended | The cell location from which genetic material was collected (usually either nucleus or mitochondria) | Ontology field - e.g., GO |
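For sample metadata kept as a tab-separated sheet, the required columns and the uniqueness of `unique ID` can be checked before submission. The sketch below assumes a TSV whose header uses the field names from the table above; the function name `check_sample_sheet` and the example rows are hypothetical, and ontology terms are not validated.

```python
import csv
import io
from collections import Counter
from typing import List

# Required field names taken from the Sample metadata table above.
REQUIRED_SAMPLE_FIELDS = [
    "unique ID",
    "sample type",
    "species",
    "tissue/organism part",
    "disease",
    "sex",
    "development stage",
    "collection date",
]


def check_sample_sheet(tsv_text: str) -> List[str]:
    """Report rows with empty required fields and duplicated sample IDs."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    problems = []
    # The header is line 1 of the sheet, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        missing = [f for f in REQUIRED_SAMPLE_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            problems.append(f"row {line_no}: missing {', '.join(missing)}")
    counts = Counter((row.get("unique ID") or "").strip() for row in rows)
    for sample_id, n in sorted(counts.items()):
        if sample_id and n > 1:
            problems.append(f"unique ID {sample_id!r} appears {n} times")
    return problems


# Hypothetical sheet: the second row reuses the ID and leaves "sex" empty.
header = "\t".join(REQUIRED_SAMPLE_FIELDS)
row_ok = "\t".join(
    ["S1", "tissue biopsy", "9606", "heart left ventricle", "normal", "male", "adult", "2024-01-15"]
)
row_bad = "\t".join(
    ["S1", "tissue biopsy", "9606", "heart right ventricle", "normal", "", "adult", "2024-01-16"]
)
sheet = "\n".join([header, row_ok, row_bad]) + "\n"

for problem in check_sample_sheet(sheet):
    print(problem)
```

Recommended fields are deliberately ignored here; a stricter variant could warn about them instead of reporting them as problems.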

Assay metadata

| Metadata Field | Required? | Definition | Comment |
| --- | --- | --- | --- |
| Identifier | required | Unique identifier for the assay | |
| File Name | required | Name of the file that contains the assay data | |
| Measurement Type | required | Type of measurement performed in the assay | Ontology field - e.g., OBI |
| Technology Type | required | Type of technology used in the assay | Ontology field - e.g., OBI |
| Technology Platform | recommended | Specific platform or instrument used in the assay | |
| Performer | required | Person or organization that performed the assay | Names, affiliations, roles, and contact details |
| Date | recommended | Date the assay was performed | Standardized date format |
| Parameter Values | recommended | Parameters and their values used in the assay | Names, types, descriptions, and values |
| Sample Name | required | Name of the sample used in the assay | |
| Raw Data File | recommended | File name of the raw data generated by the assay | |
| Processed Data File | recommended | File name of the processed data generated by the assay | |
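Because each assay row points to a sample through its `Sample Name`, it can be useful to confirm that every assay refers to a sample that actually exists in the sample metadata. The following is a minimal sketch of that cross-check; it assumes that `Sample Name` is meant to match the sample's `unique ID`, and the function name and all example values are hypothetical.

```python
from typing import Dict, Iterable, List


def assays_without_sample(
    sample_ids: Iterable[str], assay_rows: List[Dict[str, str]]
) -> List[str]:
    """Return identifiers of assays whose Sample Name matches no known sample."""
    known = set(sample_ids)
    return [
        row.get("Identifier", "")
        for row in assay_rows
        if row.get("Sample Name") not in known
    ]


# Hypothetical data: assay A3 refers to a sample that was never described.
sample_ids = ["S1", "S2"]
assay_rows = [
    {"Identifier": "A1", "Sample Name": "S1", "Raw Data File": "S1_R1.fastq.gz"},
    {"Identifier": "A2", "Sample Name": "S2", "Raw Data File": "S2_R1.fastq.gz"},
    {"Identifier": "A3", "Sample Name": "S9", "Raw Data File": "S9_R1.fastq.gz"},
]

print(assays_without_sample(sample_ids, assay_rows))
```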

Key Points

  • The SRA is a public repository for high-throughput sequencing data that supports a wide range of genomic research by providing access to raw sequencing data and, in some cases, alignments.
  • Metadata needs to be organized. Although ISA files don’t map directly to the SRA metadata model, there is a clear separation between required and recommended items. There is also some flexibility to adapt the SODAR system to a given experiment.
  • The SRA database is accessed through the NCBI website, but not all data sets are public: some are private, embargoed, or under controlled access.