SRA

Last updated on 2025-01-07 | Edit this page

Estimated time: 12 minutes

Overview

Questions

What is SRA and how to use it for data deposition?

Objectives

Get to know the SRA database
Understand the requirements for data deposition
Learn required optional field used for a RNA-sequencing experiment

Introduction

The Sequence Read Archive (SRA) is a public repository that contains high-throughput sequencing data. It is a part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes several major institutes: the National Center for Biotechnology Information (NCBI) in the USA, the European Bioinformatics Institute (EBI) in Europe, and the DNA Data Bank of Japan (DDBJ) in Japan. The SRA includes data from all life forms, as well as metagenomics and ecological studies. The SRA stores raw sequencing data and sometimes the processed alignment data. Researchers often use the SRA to deposit their data, which is typically a requirement for the publication of research papers. It aims to establish a central repository for next-generation sequencing data, linking to resources that reference or utilize this data. The repository allows in tracking project metadata for studies and experiments. There’s a focus on facilitating flexible submission and retrieval of ancillary data, providing normalized data structures. Additionally, the objectives include decoupling the submission process from content and laying the groundwork for interactive user submissions and data retrieval.

SRA concepts:

SRA metadata concept separates experimental data from its metadata, organizing the latter into a structured hierarchy:

Study: Represents a collection of experiments aimed at achieving a common goal, serving as an overarching project that encapsulates the purpose and scope of the research conducted. A study provides the context and rationale for the experiments.
Experiment: Refers to a specific set of laboratory procedures applied to input material, designed to achieve an anticipated result. Each experiment is a component of a study, contributing to its overall objective. It can also be interpreted as a series of experimental protocols or assays conducted to test a hypothesis or gather data.
Sample: The focus of an experiment, which can be a single sample or multiple samples grouped together. The results of an experiment are articulated based on these samples, detailing the outcomes for individual samples or their collective group as defined by the experimental setup. A sample is not equivalent to an individual, as one individual can provide multiple samples (e.g., right and left ventricle tissue samples).
Run: Denotes the actual outcomes or results of the experiment. These runs encompass the data collected from a sample or a group of samples linked to a specific experiment. Essentially, a run is the execution of the sequencing or analytical process that generates data.
Submission: A submission encompasses a bundle of metadata and data objects, coupled with instructions on how the submission should be handled. This package facilitates the organized and controlled entry of both experimental data and its corresponding metadata into the repository, ensuring that the data can be accurately classified, accessed, and utilized.

Accessing the SRA

The SRA database can be accessed through the NCBI website.

Although discussing data access is outside the scope of this material, it is important to note that data access is a key aspect of the FAIR principles. This means that with the right metadata, data can be reused, generating more follow-up studies and citations.

Significant to note that not all data on the SRA is public, some data is private and only accessible to authorized users. Data can also be embargoed for a certain period of time. Recently, SRA started to use placeholder as a way to keep track of data produced by large cohort and require controlled access.

Data deposition

There are multiple ways to deposit data on the SRA, the most common is through the submission portal. This approach is detailed here. At the CRC, we plan to streamline this process using a programmatic approach for data submission, as detailed in this SOP. We plan to automatize this process in the future.

Metadata for a transcriptomics experiment

As you can realize, there are differences between the ISA framework and SRA. To enable the unambiguous interpretation and adhere to the FAIR framework, we take some pragmatic rules to define and create the Investigation, Study, and Assay files. The following table shows the required and recommended fields for an RNA-sequencing experiment, and can be easily adapted to other types of experiments.

Investigation metadata

Metadata Field	Required?	Definition	Comment
Identifier	required	Unique identifier for the investigation
Title	required	Title of the investigation
Description	required	Brief description of the investigation
Submission Date	required	Date the investigation was submitted	Standardized date format
Public Release Date	required	Date the investigation was publicly released	Standardized date format
Contacts	required	List of contacts associated with the investigation	Names, affiliations, roles, and contact details
Study Design Types	recommended	Types of study designs

Sample metadata¹

Metadata Field	Required?	Definition	Comment
unique ID	required	Identifier for a sample that is at least unique within the project
sample type	required	The type of the collected specimen, e.g., tissue biopsy, blood draw, or throat swab	Ontology field - e.g., OBI or EFO
species	required	The primary species of the specimen, preferably the taxonomic identifier	Ontology field - NCBITaxonomy
tissue/organism part	required	The tissue from which the sample was taken	Ontology field - e.g., Uberon
disease	required	Any diseases that may affect the sample	This may not necessarily be the same as the host’s disease, e.g., healthy brain tissue might be collected from a host with type II diabetes while cirrhotic liver tissue might be collected from an otherwise healthy individual. Ontology field - e.g., MONDO or DO
sex	required	The biological/genetic sex of the sample	Ontology field - e.g., PATO
development stage	required	The developmental stage of the sample	Ontology field - e.g., Uberon or Hsadpdv; species dependent
collection date	required	The date on which the sample was collected, in a standardized format	Collection date in combination with other fields such as location and disease may be sufficient to de-anonymize a sample
external accessions	recommended	Accession numbers from any external resources to which the sample was submitted	e.g., Biosamples, Biostudies
ancestry/ethnicity	recommended	Ancestry or ethnic group of the individual from which the sample was collected	Ontology field - e.g., HANCESTRO
age	recommended	Age of the organism from which the sample was collected
age unit	recommended	Unit of the value of the age field	Ontology field - e.g., UO
BMI	recommended	Body mass index of the individual from which the sample was collected	Only applies to human samples
treatment category	recommended	Treatments that the sample might have undergone after collection	Ontology field - e.g., OBI, NCIt, or OGMS
cell type	recommended	The cell type(s) known or selected to be present in the sample	Ontology field - e.g., CL
growth conditions	recommended	Features relating to the growth and/or maintenance of the sample
genetic variation	recommended	Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, e.g., abnormal chromosome counts, major translocations, or indels
sample collection technique	recommended	The technique used to collect the specimen, e.g., blood draw or surgical resection	Ontology field - e.g., EFO or OBI
phenotype	recommended	Any relevant (usually abnormal) phenotypes of the specimen or sample	Ontology field - e.g., HP or MP; species dependent
cell cycle	recommended	The cell cycle phase of the sample (for synchronized growing cells or a single-cell sample), if known	Ontology field - e.g., GO
cell location	recommended	The cell location from which genetic material was collected (usually either nucleus or mitochondria)	Ontology field - e.g., GO

Assay metadata

Metadata Field	Required?	Definition	Comment
Identifier	required	Unique identifier for the assay
File Name	required	Name of the file that contains the assay data
Measurement Type	required	Type of measurement performed in the assay	Ontology field - e.g. OBI
Technology Type	required	Type of technology used in the assay	Ontology field - e.g. OBI
Technology Platform	recommended	Specific platform or instrument used in the assay
Performer	required	Person or organization that performed the assay	Names, affiliations, roles, and contact details
Date	recommended	Date the assay was performed	Standardized date format
Parameter Values	recommended	Parameters and their values used in the assay	Names, types, descriptions, and values
Sample Name	required	Name of the sample used in the assay
Raw Data File	recommended	File name of the raw data generated by the assay
Processed Data File	recommended	File name of the processed data generated by the assay

Key Points

The SRA serves is public repository for high-throughput sequencing data, supporting a wide range of genomic research by providing access to raw sequencing data, alignments and feature counts.
Metadata Organization needs to be organized. Although ISA-files don’t directly map to SRA metadata model, there is a clear separation between required and optional items. There is also some flexibility to adapt the SODAR system to a given experiment.
While access to the SRA database is facilitated through the NCBI website, the database also highlights the principles of controlled access and privacy for certain data sets.

Based on The faircookbook ↩︎

SRA