bedRMod format

The bedRMod format is described in the official EU format specifications website. Users should always refer to the latest version.

Attention

Sci-ModoM relies on the bedRMod format for data transfer and storage, but imposes a controlled vocabulary for selected header fields and a strict nomenclature for chromosomes. This documentation summarizes the bedRMod format, and introduces requirements for data upload to Sci-ModoM.

The bedRMod format specification

A bedRMod file is a tabulated count of base modifications from every sequencing read over each reference genomic position or modification site. It is a convenient representation of the information stored in the MM/ML tags in BAM alignment files.

Hint

Sci-ModoM requirements

A given dataset or bedRMod file can contain more than one modification, as reported in column 4 (MODOMICS short name), but all modifications must be for the same RNA type. Currently, the only supported RNA type is WTS (whole transcriptome sequencing). According to the official specification, a dataset or bedRMod file can only contain ONE RNA type and ONE organism (incl. cell type, tissue, or organ), and records from the same assembly. This stems from the bedRMod header, which describes information for one organism, one assembly and annotation, and one RNA type.

To upload data to Sci-ModoM, bedRMod files should in addition contain records for ONE technology only, i.e. ONE RNA type, ONE organism, and ONE technology per file. The best way to handle different technologies, treatments and/or conditions is to have as many bedRMod files as required to describe the experimental protocol, and to provide a meaningful title and metadata for each file.

The header section

The header contains metainformation about the source of the data. Only one reference to organism, assembly, annotation source, and annotation version is possible.

bedRMod header
Tag	Description
fileformat	Fileformat and version e.g. bedRModv1.8
organism	NCBI taxonomic identifier
modification_type	RNA
assembly	Assembly name using Ensembl nomenclature e.g. GRCh38 for human
annotation_source	Annotation source e.g. Ensembl
annotation_version	Annotation version e.g. 110
sequencing_platform	Sequencing platform e.g. Illumina NovaSeq 6000, etc.
basecalling	Basecalling model information where relevant
bioinformatics_workflow	Reference to bioinformatics workflow e.g. GitHub, or information relevant to score, coverage and/or frequency calculation
experiment	Information about experimental protocol, design, etc. or link to e.g. openBIS
external_source	Databank:ID of data e.g. GEO:GSEXXXXXX

While all tags are required, sequencing_platform, basecalling, bioinformatics_workflow, experiment or external_source can be left empty, although it is strongly advised to provide as much information as possible, e.g.

#fileformat=bedRModv1.8
#organism=9606
#modification_type=RNA
#assembly=GRCh38
#annotation_source=Ensembl
#annotation_version=110
#sequencing_platform=Illumina NovaSeq 6000
#basecalling=
#bioinformatics_workflow=https://github.com/...
#experiment=https://doi.org/...
#external_source=GEO;GSEXXXXXX,PMID;XXXXXXXX
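
As an illustration of how such a header can be read, the following is a minimal sketch, not part of the specification or of Sci-ModoM. The function name `parse_bedrmod_header` is hypothetical, and the sketch assumes that every header line has the form `#tag=value` and that the header ends at the first line that does not start with `#`.

```python
def parse_bedrmod_header(lines):
    """Collect '#tag=value' header lines into a dict (hypothetical helper)."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            break  # assumption: the header ends at the first data row
        # Split on the first '=' only, so values such as URLs stay intact.
        tag, sep, value = line[1:].partition("=")
        if sep:  # ignore '#' lines that are not tag=value pairs
            header[tag.strip()] = value.strip()
    return header


example = [
    "#fileformat=bedRModv1.8",
    "#organism=9606",
    "#modification_type=RNA",
    "#assembly=GRCh38",
    "#basecalling=",
    "1\t100\t101\tm6A\t10\t+\t100\t101\t0,0,0\t30\t50",
]
header = parse_bedrmod_header(example)
print(header["assembly"])
```

Tags that are left empty, such as basecalling above, are kept in the dictionary with an empty string as value.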

Hint

Sci-ModoM requirements

For data upload, fileformat, organism, and assembly are validated. Only the latest specifications are used, and only modification_type=RNA is currently allowed. The assembly tag entry must match exactly the assembly chosen from the dropdown menu during upload, and it must follow the Ensembl nomenclature (without patch number). Data does not have to be for a specific genome assembly: Sci-ModoM will take care of lifting over all records to the most recent assembly for each organism. Additional user-defined tags or extra header lines starting with # are ignored during upload.

The data section

The first nine columns generally follow the standard BED specification, but the name (4th column) must conform to the MODOMICS nomenclature for the modification short name, and the score (5th column) is a site-specific measure of confidence. The last two columns contain the coverage (10th column), i.e. the number of reads at this position, and the frequency (11th column), i.e. an integer value capped at 100 representing the percentage of reads that are modified at this position. Columns must be separated by tabs, and each row must have the same number of columns.

bedRMod data
Field	Description
chrom	Chromosome
chromStart	Modification start position (0-based numbering, included)
chromEnd	Modification end position (0-based numbering, excluded), e.g. chromStart + 1
name	MODOMICS short name
score	Modification confidence (see below for details)
strand	Strand
thickStart	Typically same as chromStart
thickEnd	Typically same as chromEnd
itemRgb	Color e.g. for genome browser
coverage	Number of reads at this position
frequency	Percentage (0,100] of modified reads, or stoichiometry at this position
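
To make the column layout concrete, here is a minimal sketch of how one data row could be split into named fields. The names `BedRModRecord` and `parse_bedrmod_row` are hypothetical, and the sketch assumes that positions, score, coverage, and frequency are all stored as integers. Columns beyond the 11th are simply ignored.

```python
from typing import NamedTuple


class BedRModRecord(NamedTuple):
    """One bedRMod data row (hypothetical container, first 11 columns only)."""
    chrom: str
    chrom_start: int  # 0-based, included
    chrom_end: int    # 0-based, excluded
    name: str         # MODOMICS short name
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str
    coverage: int
    frequency: int


def parse_bedrmod_row(line):
    """Split a tab-separated row and convert the numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return BedRModRecord(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], int(fields[6]), int(fields[7]), fields[8],
        int(fields[9]), int(fields[10]),
    )


record = parse_bedrmod_row("1\t100\t101\tm6A\t10\t+\t100\t101\t0,0,0\t30\t50")
print(record.name, record.chrom_start, record.frequency)
```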

Hint

Sci-ModoM requirements

To enable quantitative data comparison, the score (5th column) is defined as round(-log10(p value)), where p value is calculated from a statistical test. A value of 0 indicates missing data e.g. p values were not calculated. The coverage (10th column) can be 0, if the number of reads at this position is not available, but frequency (11th column) MUST always be present. Modification frequency or stoichiometry is a minimal requirement for quantitative data comparison.
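
The score definition above can be sketched as follows. This is only an illustration: the function name `pvalue_to_score` is hypothetical, and mapping a missing p value (`None`) to 0 is an assumption based on the convention described above.

```python
import math


def pvalue_to_score(p_value):
    """Return round(-log10(p value)), or 0 when no p value is available."""
    if p_value is None:
        return 0  # assumption: missing p values are encoded as score 0
    if not 0 < p_value <= 1:
        raise ValueError("p value must be in (0, 1]")
    # Python's round() returns an int when called with a single argument.
    return round(-math.log10(p_value))


scores = [pvalue_to_score(p) for p in (0.05, 0.001, 1e-10, None)]
print(scores)
```

Note that a p value of 0 has no finite score under this definition, so the sketch rejects it rather than guessing a cap.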

Caution

bedRMod is essentially a BED-formatted file: it uses a 0-based, half-open coordinate system. If you use 1-based positions, all your modification sites will be off by one!
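
As a short sketch of the conversion, assuming that your tool reports a single modified base as a 1-based position (the function name `one_based_to_bed` is hypothetical):

```python
def one_based_to_bed(position):
    """Convert a 1-based position to a 0-based, half-open (start, end) pair."""
    if position < 1:
        raise ValueError("1-based positions start at 1")
    # chromStart is included, chromEnd is excluded, so end - start == 1.
    return position - 1, position


chrom_start, chrom_end = one_based_to_bed(100)
print(chrom_start, chrom_end)
```

The 100th base of a chromosome is therefore written with chromStart 99 and chromEnd 100.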

Warning

For data upload to Sci-ModoM, chromosomes (1st column) must be formatted following the Ensembl short format e.g. 1 and not chr1, or MT and not chrM. Only chromosomes are considered, records from contigs/scaffolds are discarded. The modification name (4th column) must match exactly the chosen modifications, according to the MODOMICS nomenclature for the modification short name. Rows with out-of-range values for score (5th column) or frequency (11th column) are discarded.

File upload will fail if there are too many skipped records, e.g. due to wrong chromosome formatting, too many contigs, out-of-range values, etc.
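
If your records use UCSC-style names, they can be renamed before upload. The following is a minimal sketch under simple assumptions: the function name `to_ensembl_chrom` is hypothetical, only the `chr` prefix and the mitochondrial chromosome are handled, and contig or scaffold names are not translated.

```python
def to_ensembl_chrom(name):
    """Rename e.g. 'chr1' to '1' and 'chrM' to 'MT' (hypothetical helper)."""
    if name in ("chrM", "chrMT"):
        return "MT"
    if name.startswith("chr"):
        return name[len("chr"):]
    return name  # assumed to be in Ensembl short format already


renamed = [to_ensembl_chrom(c) for c in ("chr1", "chrX", "chrM", "MT", "2")]
print(renamed)
```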

Additional columns

Users can add any number of additional columns to suit their needs (same as for BED), but these are ignored in Sci-ModoM. Note however that a bedRMod file with exactly 12 columns may be implicitly assumed to be a BED12 file by some software (bedtools, genome browsers, …), which can result in unexpected behaviour.

Notes

Unmodified bases

bedRMod is a format to store modification data (site-specific or not), hence unmodified bases should not be recorded. Context can be recorded using chromStart/End + thickStart/End, additional columns, etc.

Download

A PDF version of the latest specification can be downloaded here.