Drosophila Comparison Archive - McIntyre-Lab/TranD GitHub Wiki

Below is a description of the method used to compare the Drosophila species reference annotations of D. melanogaster FlyBase r6.17 and D. simulans FlyBase r2.02 and generate a transcript model map on D. melanogaster genome coordinates and on D. simulans genome coordinates.

We have also generated a reciprocal minimum transcript comparison file found here that classifies transcript pairs as putative transcript orthologs (TO), exon region overlaps (ERO), reciprocal minimum pairs (RMP), or no reciprocal minimum (NRM) in the other species. See below for definitions of each classification.

All steps described below are implemented as 2 Nextflow scripts: NF_get_ref_transcript_fasta.nf NF_compare_species_ref.nf

More detailed documentation and scripts are in the links provided.

For more information on how to run TranD, see the User Guide.

Prepare orthologous gene file

Reformat annotated ortholog file from FlyBase to contain only D. melanogaster to D. simulans, split coordinates into separate columns (FlyBase format = start..end), make strand value +/- (FlyBase format = 1/-1), remove “Dsim\” from the Ortholog_GeneSymbol elements, and flag one-to-one orthologs (one D. simulans gene is orthologous to only one D. melanogaster gene and vice versa).

python reformat_FB_orthologs.py \
    -i dmel_orthologs_in_drosophila_species_fb_2017_04.tsv \
    -s Dsim \
    -n sim \
    -o dmel_orthologs_dsim_fb_2017_04.csv

Get reference transcript FASTA sequences

Transcript sequences are extracted to be mapped to the reference genome of each species.

The single trans-spliced gene in D. melanogaster (mod(mdf4) - FBgn0002781) is first removed from the reference GTF annotation file using the TranD utility subset_gtf.py:

cat "FBgn0002781" > exclude_gene.txt

python subset_gtf.py \
    -g dmel-all-r6.17.gtf \
    -t gene_id \
    -e exclude_gene.txt \
    -o dmel-all-r6.17_mod.gtf

Using gffread transcript sequences are extracted using the reference annotation and genome sequences:

gffread --w-nocds -w dmel-all-r6.17_xcrpt_seq.fasta -g dmel-all-chromosome-r6.17.fasta dmel-all-r6.17_mod.gtf

gffread --w-nocds -w dsim-all-r2.02_xcrpt_seq.fasta -g dsim-all-chromosome-r2.02.fasta dsim-all-r2.02.gtf

These files

Map reference transcript sequences

Using minimap2 map the reference transcript sequences to both genome coordinates (mel2mel, mel2sim, sim2mel, sim2sim).

Mapping back to the associated reference genome will aid in identifying genes/transcripts that may be affected by the mapping algorithm and cannot be accurately compared across the species.

minimap2 --secondary=no -u f -ax splice -C 5 dmel-all-chromosome-r6.17.fasta dmel-all-r6.17_xcrpt_seq.fasta > mel2mel_minimap2.sam

minimap2 --secondary=no -u f -ax splice -C 5 dsim-all-chromosome-r2.02.fasta dmel-all-r6.17_xcrpt_seq.fasta > mel2sim_minimap2.sam

minimap2 --secondary=no -u f -ax splice -C 5 dmel-all-chromosome-r6.17.fasta dsim-all-r2.02_xcrpt_seq.fasta > sim2mel_minimap2.sam

minimap2 --secondary=no -u f -ax splice -C 5 dsim-all-chromosome-r2.02.fasta dsim-all-r2.02_xcrpt_seq.fasta > sim2sim_minimap2.sam

Convert SAM alignment files to GTF annotation files

SAM files are converted to BAM files with samtools view

samtools view -b -F 2048 mel2mel_minimap2.sam > mel2mel_minimap2.bam

samtools view -b -F 2048 mel2sim_minimap2.sam > mel2sim_minimap2.bam

samtools view -b -F 2048 sim2mel_minimap2.sam > sim2mel_minimap2.bam

samtools view -b -F 2048 sim2sim_minimap2.sam > sim2sim_minimap2.bam

BAM files are converted to BED12 files with bedtools bamtobed

bedtools bamtobed -split -i mel2mel_minimap2.bam > mel2mel_minimap2.bed

bedtools bamtobed -split -i mel2sim_minimap2.bam > mel2sim_minimap2.bed

bedtools bamtobed -split -i sim2mel_minimap2.bam > sim2mel_minimap2.bed

bedtools bamtobed -split -i sim2sim_minimap2.bam > sim2sim_minimap2.bed

BED12 files are converted to GTF files using information from the BED12:

awk -v name=mel2mel -F "\t" '{{print $1"\t"name"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\\"" $4 "\\"" "; " "transcript_id " "\\"" $4 "\\"" ";"}}' mel2mel_minimap2.bed > mel2mel_minimap2.gtf

awk -v name=mel2sim -F "\t" '{{print $1"\t"name"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\\"" $4 "\\"" "; " "transcript_id " "\\"" $4 "\\"" ";"}}' mel2sim_minimap2.bed > mel2sim_minimap2.gtf

awk -v name=sim2mel -F "\t" '{{print $1"\t"name"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\\"" $4 "\\"" "; " "transcript_id " "\\"" $4 "\\"" ";"}}' sim2mel_minimap2.bed > sim2mel_minimap2.gtf

awk -v name=sim2sim -F "\t" '{{print $1"\t"name"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\\"" $4 "\\"" "; " "transcript_id " "\\"" $4 "\\"" ";"}}' sim2sim_minimap2.bed > sim2sim_minimap2.gtf

Get common associated gene_id values

To compare splice junction chains across transcripts with TranD, genes with overlapping exons (on the same strand) are given a common gene_id GFFcompare is used to find overlapping exons:

gffcompare -r dmel-all-r6.17.gtf -o mel2mel mel2mel_minimap2.gtf

gffcompare -r dsim-all-r2.02.gtf -o mel2sim mel2sim_minimap2.gtf

gffcompare -r dmel-all-r6.17.gtf -o sim2mel sim2mel_minimap2.gtf

gffcompare -r dsim-all-r2.02.gtf -o sim2sim sim2sim_minimap2.gtf

To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:

python correct_gffcompare_GTF_gene_id.py \
    -a mel2mel.annotated.gtf \
    -g mel2mel_minimap2.gtf \
    -o mel2mel_corrected_associated_gene.gtf

python correct_gffcompare_GTF_gene_id.py \
    -a mel2sim.annotated.gtf \
    -g mel2sim_minimap2.gtf \
    -o mel2sim_corrected_associated_gene.gtf

python correct_gffcompare_GTF_gene_id.py \
    -a sim2mel.annotated.gtf \
    -g sim2mel_minimap2.gtf \
    -o sim2mel_corrected_associated_gene.gtf

python correct_gffcompare_GTF_gene_id.py \
    -a sim2sim.annotated.gtf \
    -g sim2sim_minimap2.gtf \
    -o sim2sim_corrected_associated_gene.gtf

NOTE: All gene_id values are given either a reference gene_id (if there is an exon overlap on the same strand) or a gffcompare XLOC identifier (if there is no exon overlap with a reference gene).

(Optional) Compare mapped transcripts to reference annotation

Compare the coordinates of mapped transcripts to the associated reference annotation within each species to identify any differences in annotation caused by mapping of the transcripts.

The below example is for the D. melanogaster transcripts mapped to the D. melanogaster genome. The same process will also be done for the D. simulans transcripts mapped to the D. simulans genome.

TD_OUT=TranD_mel2mel_consol_ref_map
    mkdir -p ${TD_OUT}

Add prefix of "map" to mapped GTF transcript_id values.
python add_prefix_gtf.py \
    -g mel2mel_corrected_associated_gene.gtf \
    -t transcript_id \
    -p map \
    -o ${TD_OUT}/mel2mel_map_prefix.gtf

cat dmel-all-r6.17.gtf ${TD_OUT}/mel2mel_map_prefix.gtf \
    > ${TD_OUT}/mel2mel_ref_map.gtf

python id_ujc.py \
    --gtf mel2mel_ref_map.gtf \
    --transcript-prefix mel \
    --skip-gtf
    -o ${TD_OUT}
    -p mel2mel_ref

Combine mapped reference genes and annotated orthologous gene information. Identify genes that do not reciprocally map back to the annotation in either species to (optionally) remove from further comparisons using the id_ujc.py utility.

NOTE: In the D. melanogaster vs. D. simulans comparison, we have retained all transcripts, but noted where there are differences in the mapping location compared to the respective annotation GTF.

The --gene2 and --gene2 arguments are the names used in the ortholog file for gene_id values

python make_mapped_ref_ortho_gene_list_03avn.py \
    --species1 /TranD_mel2mel_consol_ref_map/mel_ref_map_transcript_id_2_consolidation_id.csv \
    --species2 /TranD_sim2sim_consol_ref_map/sim_ref_map_transcript_id_2_consolidation_id.csv \
    -n1 mel \
    -n2 sim \
    --ortholog dmel_orthologs_dsim_fb_2017_04.csv \
    --gene1 mel_geneID \
    --gene2 sim_geneID \
    -o mel_sim_mapped_ortho_gene_info.csv \
    -list1 mel_mapped_retained_gene.txt \
    -list2 sim_mapped_retained_gene.txt \
    > mel_sim_ortho_map_counts.txt

Subset mapped GTF for transcripts that map back properly (OPTIONAL):

Subset GTF using sing the TranD utility subset_gtf.py for genes that "pass" in mapping and comparison to the annotation. Passing genes have all transcripts map back to the same reference junctions (FSM).

for NAME in mel sim; do
    python subset_gtf.py \
        -g !{OUTD}/${NAME}2${REF}_corrected_associated_gene.gtf \
        -t gene_id \
        -i !{SUBSET_GENE} \
        -o !{OUTD}/${NAME}2${REF}_subset_mapped_ref.gtf
done

Union reference

To make a union reference containing all unique splice junction chains across the two species, the mapped GTF of species each are concatenated together and consolidated using the id_ujc.py utility on D. melanogaster coordinates and on D. simulans coordinates:

The example below shows for just the D. melanogaster genome coordinates and the same process would be used for the D. simulans genome.

## Add species prefix to each transcript name
python add_prefix_gtf.py \
    -g mel2mel_corrected_associated_gene.gtf \
    -t transcript_id \
    -p mel \
    -o mel_subset_mapped_ref_prefix.gtf
python add_prefix_gtf.py \
    -g sim2mel_corrected_associated_gene.gtf \
    -t transcript_id \
    -p sim \
    -o sim_subset_mapped_ref_prefix.gtf

cat mel_subset_mapped_ref_prefix.gtf sim_subset_mapped_ref_prefix.gtf \
    > combine2mel_corrected_associated_gene.gtf

mkdir -p TranD_consol_union_2_mel

python id_ujc.py \
    -g combine2mel_corrected_associated_gene.gtf \
    -x tr \
    -o TranD_consol_union_2_mel \
    --prefix union2mel

Get counts of the number of genes/transcripts in one or both species

python count_gene_xcrpt_in_union.py \
    -i /TranD_consol_union_2_mel/union2mel_transcript_id_2_consolidation_id.csv \
    -n1 mel \
    -n2 sim \
    -g TranD_consol_union_2_mel/union2mel_gene_flags.csv \
    -t TranD_consol_union_2_mel/union2mel_UJC_flags.csv \
    -c TranD_consol_union_2_mel/union2mel_summary_counts.csv

The GTF of the union reference on the D. melanogaster coordinates is TranD_consol_union_2_mel/union2mel_consolidated_transcriptome.gtf in the example given and can be downloaded from here.

The GTF of the union reference on the D. simulans coordinates is TranD_consol_union_2_sim/union2sim_consolidated_transcriptome.gtf in the example given and can be downloaded from here.

Compare Unique Junction Chains (UJC)

Do splice-match consolidation of each species mapped transcript GTF file and compare between species on the same set of coordinates.

The example below is for the D. melanogaster genome coordinates, and the same process would be used for the D. simulans genome.

## Consolidate mel transcriptome
python id_ujc.py \
    -g mel2mel_corrected_associated_gene.gtf \
    -x mel_UJC \
    -o TranD_consolidate_mapped_mel_2_mel \
    -p mel

## Consolidate sim transcriptome
python id_ujc.py \
    -g sim2sim_corrected_associated_gene.gtf \
    -x sim_UJC \
    -o TranD_consolidate_mapped_sim_2_mel \
    -p sim

## TranD 2 GTF of consolidated mel vs sim
trand \
    /TranD_consolidate_mapped_mel_2_${SPECIES}/mel_UJC_consolidated_transcriptome.gtf \
    /TranD_consolidate_mapped_sim_2_${SPECIES}/sim_UJC_consolidated_transcriptome.gtf \
    -e pairwise \
    -k \
    -x mel_vs_sim \
    -1 mel \
    -2 sim \
    -p all \
    -o TranD_consol_mel_vs_sim_2_mel \
    -f \
    -n ${N_CPU}

Make D. melanogaster-D. simulans Reciprocal Minimum Transcript File

Merge UJC key files from individual species' consolidations with the union consolidation key file for each set of coordinates.

NOTE: The -i option is the input directory that contains the following: TranD_consolidate_mapped_[species1]2[species1], /TranD_consolidate_mapped_[species1]2[species2], TranD_consolidate_mapped_[species2]2[species2], /TranD_consolidate_mapped_[species2]2[species1], TranD_consol_union_2_[species1], and TranD_consol_union_2_[species2] directories.

python get_combined_union_consol_key_file_02avn.py \
    -i ${IND} \
    -s1 mel \
    -s2 sim \
    -g1 mFB617 \
    -g2 sFB202 \
    -o ${OUTD} \
    > ${IND}/mel_mFB617_sim_sFB202_UJC_key_log.txt

Combine:

comparison of D. melanogaster vs. D. simulans mapped transcripts onto each set of coordinates
Unique junction chains from each species mapped to each set of coordinates

To create a file that classifies transcript pairs across transcripts between the two species.

NOTE: The -i option is the input directory that contains the following: TranD_consol_[species1]vs[species2]2[species1] and TranD_consol_[species1]vs[species2]2[species2] directories, as well as union_UJC_key_file_2_[species1].csv and union_UJC_key_file_2_[species2].csv.

python make_transcript_ortholog_file_02avn.py \
    -i ${IND} \
    -s1 mel \
    -s2 sim \
    -g1 mFB617 \
    -g2 sFB202 \
    -o ${OUTD} \
    > mel_mFB617_sim_sFB202_ortholog_file_log.txt

We generated a reciprocal minimum transcript comparison file found here that classifies transcript pairs as putative transcript orthologs (TO), exon region overlaps (ERO), reciprocal minimum pairs (RMP), or no reciprocal minimum (NRM) in the other species. See below for definitions of each classification.

Definitions of transcript pair classifications

Transcript Orthologs (TO) are defined by the following (on both sets of coordinates):

are reciprocal minimums
do not contain an IR event relative to the other
have all overlapping exon regions
a total of ≤ 15 nt difference between the pair across all alternative donors/acceptors

Exon Region Overlaps (ERO) are defined similarly as TO but contain > 15 nt difference across all alternative donors/acceptors.

Reciprocal Minimum Pairs (RMP) are defined similarly as ERO but with IR or some sort of exon difference between the reciprocal minimum pairs.

No Reciprocal Minimum (NRM) are defined as having no reciprocal minimum on either set of coordinates.