Mapped reads - erhard-lab/gedi GitHub Wiki

CIT files

When it comes to mapped reads, there is a single standard file format that is used: SAM (or its more efficient binary brother BAM). However, it has been designed mainly with genome resequencing in mind, and for some applications it is a not very convenient way of working with reads. In Gedi, we have an alternative way of storing and indexing interval-based data (which also includes things like genes, transcripts, transcription factor binding sites, ...): CIT (centered interval trees). Its main advantages are:

It is based on principled data structures designed for efficient access to interval data (i.e. interval trees)
We use a variant that makes it particularly useful for non-memory based indices (i.e. it's centered)
We can store any payload data (e.g. mapped reads with or without information on mismatches, transcripts names and contained ORFs,...)
Data can be compressed
For read data, a single record contains all reads or read pairs that are mapped to a specific position in the genome (which is a very space efficient way of storing them)
For read data, a single CIT file can contain the information of several samples (e.g. replicates, conditions)

Of course all tools built on top of Gedi can utilize either CIT files or BAM files.

Specifying mapped reads to a Gedi program

Most Gedi programs have a parameter -reads. This parameter either accepts a single BAM file, a single CIT file (containing mapped reads), or a bamlist file (which contains, in each line, the name of or path to a bam file; lines starting with # are ignored; the extensions must be .bamlist ).

Example (test.bamlist):

# this is a bamlist file containing three replicates of two conditions
C1.A.bam
C1.B.bam
C1.C.bam
C2.A.bam
C2.B.bam
C2.C.bam

Creating a CIT file from bam files

Many of our tools run much faster based on cit files than on several bam files. To convert several bamfiles to a single cit file, create a bamlist file (see above), and call the bamlist2cit bash script like this:

bamlist2cit xyz.bamlist

Getting data out of CIT files

As CIT is a proprietary format of Gedi, it comes with a little tool to extract data from such files. Its full usage is

gedi -e ViewCIT <Options> <file>

Options:
 -l             Only list the number of elements per reference
 -q <location>          Only output elements overlapping the given query (i.e. whole reference or genomic region)
 -o <file>              Specify output file (Default: stdout)
 -m <mode>              output mode: Bed/Location/Cit (Default: location)
 -name <js>             javascript function body to generate name (variable d is current data, ref the reference, reg the region)
 -score <js>            javascript function body to generate score (variable d is current data, ref the reference, reg the region)

 -p                     Show progress
 -h                     Show this message
 -D                     Output debugging information

Examples:

gedi -e ViewCIT -l hek293.cit

List the number of entries per chromosome for the given file (containing mapped reads). The number of entries is not the number of reads, but the number of distinct genomic regions (defined by start and end position of a mapped reads and the positions of potential introns). The lesson here is that an entry may correspond to more than one read, either because the same sequence has been observed multiple times, or distinct sequences map to the same genomic region due to mismatches.
Output:

hek293.cit: gedi.core.data.reads.DefaultAlignedReadsData
1+      368315
1-      387836
2+      212989
2-      202875
3+      175598
3-      172153
4+      110494
4-      102068
5+      204240
...

gedi -e ViewCIT -q 1+:27322311-27322411 -m bed -name 'd.getId(0)+""' hek293.cit

Extract all reads mapped to the watson strand of chromosome 1, overlapping the region 27322311-27322411 in bed format. For the name column in the bed file, output the read id from the first of potentially many distinct sequences mapping at that location. The -name parameter can refer to any Javascript code executed using the Nashorn JS engine. It has full access to the Gedi library and can refer to the variables ref (the ReferenceSequence object), reg (the GenomicRegion object) and d (the data object, in this case an AlignedReadData object). Obviously, this is a very powerful mechanism but requires deep knowledge of the internals of Gedi. Output:

1+:27322293-27322315    132480
1+:27322345-27322367    132490
1+:27322329-27322347    132487
1+:27322320-27322346    132484
1+:27322290-27322312    132477
1+:27322359-27322382    132494
1+:27322380-27322403    132507
1+:27322364-27322386    132500
1+:27322288-27322316    132476
1+:27322353-27322380    132493
1+:27322328-27322353    132486
1+:27322291-27322315    132479
1+:27322365-27322391|27330719-27330720  132504
...

gedi -e ViewCIT -q 1+:27322311-27322411 price/hek293.merged.codons.cit

Extract inferred codons with their activity values in the locations format. Output:

1+:27322340-27322343    [4.0]
1+:27322392-27322395    [1.0]
1+:27322347-27322350    [1.0]
1+:27322326-27322329    [5.0]
1+:27322377-27322380    [6.956845]
1+:27322332-27322335    [3.0]
1+:27322386-27322389    [2.0]
1+:27322365-27322368    [1.0]
1+:27322317-27322320    [1.0]
1+:27322374-27322377    [13.043155]
1+:27322354-27322357    [1.204386]
1+:27322359-27322362    [2.795614]

gedi -e ViewCIT -m bed -q '8+:27491575-27528868' -name 'd.getType()' -score 'd.getTotalActivity()' price/hsv.orfs.cit

Extract inferred ORFs in bed format with the ORF type in the name column and the total activity of the first (and, here, only) experiment in the score column. This might give a further impression about how powerful the Javascript access is. Output:

8       27491717        27528868        CDS     978     +       0       0       0       6       7,99,120,99,1044,452,   0,15501,17307,22581,24295,36699,
8       27491575        27491617        uORF    20      +

gedi -e ViewCIT -q 1+:27322311-27322411 -name 'd.getTranscriptId()'  Homo_sapiens.GRCh38.86.gtf.index.cit

Extract transcripts from the index generated by IndexGenome and output them in location format. Output:

1+:27322159-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400  ENST00000374076
1+:27322253-27322391|27330719-27330804|27330984-27331137|27332069-27332101|27333957-27334054|27334150-27334281|27334666-27334754|27335378-27335648      ENST00000486082
1+:27322280-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334896    ENST00000464813
1+:27322253-27322391|27330719-27330804|27332069-27332101|27332446-27332559      ENST00000466759
1+:27322299-27322391|27330719-27330804|27332069-27332101|27333360-27333435|27333957-27334054|27334150-27334281|27335378-27335545        ENST00000498220
1+:27322246-27322391|27330719-27330804|27330984-27331137|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400        ENST00000464720
1+:27322247-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400  ENST00000608611
1+:27322284-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27334666-27334754|27335378-27335824        ENST00000478104
1+:27322144-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336399  ENST00000611517

Converting BAM to CIT

To produce CIT files containing read mappings, one or several BAM files can be merged by

gedi -e Bam2CIT -p mapped.cit exp1.bam exp2.bam exp3.bam