Create within Species Transcript Model Map - McIntyre-Lab/TranD GitHub Wiki

Introduction
- Example Data Origin and Flowchart
Create Union Reference
Generate Minimum Pairwise Distance Calculations
Generate the Pair Classification File
Generate Pairwise Distance Calculations for Genes Unique to 1 GTF
Create the Union ERG Files
- Run id_ERG
- Concatenate ERG Output Files
Example Transcript Model Map

Below is a description of a pipeline used to create a Transcript Model Map within a single species and compare to two GTFs of the same genome.

This script is a generic bash script that describes each step in code form and allows for creation of a TMM for any two GTFs (as long as TranD is installed and the utilities have been downloaded).

Introduction

We define the Transcript Model Map as a set of 3 things:

Union Reference Files - which is a GTF file with one transcript model representing each unique junction chain (UJC) in the union of the two original GTF files, and a CSV key file.
Pair Classification File - associates and describes connected transcript models across the two GTF files, with flags describing and classifying their relationship.
Union Exon Region Group (ERG) Files - including 3 key files for the GTF unique on gene, transcript, and ERG. Also, a GTF with one transcript model representing each unique ERG, or set of overlapping exons, in the union of the two GTF files.

Example Data Origin and Flowchart

As an example, we will create a Transcript Model Map for Drosophila melanogaster (mel) transcript models and Drosophila simulans (sim) transcript models, which have been mapped to the mel genome. The example data has been subset to 22 genes, solely to be used as an explanation of the Transcript Model Map analysis pipeline. More information on cross-species comparison and GTF "mapping" can be found in Drosophila Species Comparison, which uses the whole genome of both species and displays the process necessary to make the mapped GTFs.

Download input files

mel2mel: GTF
sim2mel: GTF

Note: mel2mel refers to the GTF with D. melanogaster transcripts mapped onto the D. melanogaster genome. sim2mel refers to the GTF with D. simulans transcripts mapped onto the D. melanogaster genome.

Below is a flowchart that indicates the overall process of the pipeline, and how each piece of the Transcript Model Map is created:

Flowchart.

Create Union Reference

1. Identify UJC for first set of transcript models

As noted in the Utility Descriptions, all transcripts with the same junction chain will be grouped under one representative transcript from the group known as the ujc_id. Note that utilities are not part of the TranD environment, and require their own separate installation (via github cloning or downloading).

To identify unique junction chains (UJC), use the TranD utility id_ujc.py:

python id_ujc.py \
    -g /path/to/input/mel2mel.gtf \  # Input GTF file
    -o /path/to/output_directory \   # Output directory
    -p mel2mel                       # Output file prefix

The -p argument sets the prefix of the two output files that will be written to the directory provided in the -o argument.

The output of this step will be:

A GTF file of representative UJC transcript models (/path/to/output_directory/mel2mel_UJC.gtf)
A UJC key file of original mel2mel transcript model id values matched to the associated UJC id value (/path/to/output_directory/mel2mel_UJC_ID.csv)

Example Output: GTF and CSV.

2. Identify UJC for second set of transcript models

Repeat the above steps for the second set of transcript models.

python id_ujc.py \
    -g /path/to/input/sim2mel.gtf \  # Input GTF file
    -o /path/to/output_directory \   # Output directory
    -p sim2mel                       # Output file prefix

The -p argument sets the prefix of the two output files that will be written to the directory provided in the -o argument.

The output of this step will again be:

A GTF file of representative UJC transcript models (/path/to/output_directory/sim2mel_UJC.gtf)
A UJC key file of original mel2mel transcript model id values matched to the associated UJC id value (/path/to/output_directory/sim2mel_UJC_ID.csv)

Example Output: GTF and CSV.

3. Make union reference

To identify unique junction chains (UJC) across the two GTF files, concatenate the UJC GTF files produced from the original two GTF files. Then use the TranD utility id_ujc.py on it to create the Union Reference Files:

cat /path/to/output_directory/mel2mel_UJC.gtf \
    /path/to/output_directory/sim2mel_UJC.gtf \
    > /path/to/output_directory/mel2mel_sim2mel_UJC.gtf

python id_ujc.py \
    -g /path/to/output_directory/mel2mel_sim2mel_UJC.gtf \
    -o /path/to/output_directory \
    -p mel2mel_sim2mel_union

The output of this step will be the Union Reference Files: a GTF file of representative UJC transcript models and a UJC key file of mel2mel UJC and sim2mel UJC transcript model id values matched to the associated union UJC id value.

Example Output: Concatenated GTF

Union Reference: GTF and CSV.

Generate Minimum Pairwise Distance Calculations

Using the representative UJC transcript model GTF files, get the pairwise distance metrics between all transcript pairs.

trand \
    /path/to/output_directory/mel2mel_UJC.gtf \
    /path/to/output_directory/sim2mel_UJC.gtf \
    -o /path/to/output_directory/TranD_2GTF_output \
    -1 mel2mel \
    -2 sim2mel \
    -e pairwise \
    -p both \
    -f \
    -n 8      # Number of CPUs

Note: The order in which you input GTF files and their prefixes (-1 and -2) creates a convention followed throughout the rest of the pipeline. mel2mel will be 'GTF1' and sim2mel will be 'GTF2.' We recommend to use an alphabetical order for these GTFs wherever they appear together, to maintain consistency througout the pipeline.

TranD 2 GTF pairwise is parallelized by pair, so increasing the cpu number used (-n) can lower runtime.

For a general idea of how long it would take to run TranD 2GTF for an analysis, please see the Pre-Computed Files.

Example Output

The files output by TranD that are used in later steps are: the minimum pairwise distance (PD) file and the file listing genes that only appear in GTF1 or GTF2. These have been copied to a different folder in the example output for ease of use.

Generate the Pair Classification File

Input the PD file into the pair_classification.py utility. This analyzes the pairs of transcripts in the PD file and classifies them into 4-5 possible groups:

FSM: The transcripts are a complete match.
ERS_wIR: The two transcripts are in the same exon region group (all of their exon regions overlap). There is an intron retention event within at least one transcript.
ERS_noIR: The two transcripts are in the same exon region group and there is no intron retention.
- Large: There is a nucleotide number difference between the two transcripts greater than the threshold entered (less similar).
- Small: There is a nucleotide number difference between the two transcripts less than the threshold entered (more similar).
NRM: The transcript pair is a non-reciprocal match.

The utility can be run as follows:

python pair_classification.py \
    -d /path/to/TranD/output/mel2mel_vs_sim2mel_minimum_pairwise_transcript_distance.csv \ # Input PD file
    -s 15 \                                                                     # The nucleotide threshold
    -o /path/to/output_directory/mel2mel_sim2mel_pair_classification.csv        # Path to output file.

The threshold (-s) is not required but allows for the distinction between a small and large transcript distance, as noted above. This threshold can be calculated by multiplying the number of nucleotides per codon (3) by the average number of exons per gene of the species. The mean number of exons per gene can be found in TranD output in the file prefix_transcriptome_complexity_counts.csv for both GTFs under the column mean_exonPerGene. For Drosophila melanogaster this number is 5. Meaning the -s is 3*5, 15.

Please note that the output must be a path to a file, not a directory.

Example Output

Generate Pairwise Distance Calculations for Genes Unique to 1 GTF

Genes that only appear in one of the GTFs will not be included in the minimum PD file. In order to create a more complete map, it is necessary to include these genes, which can be found in the TranD output as prefix_gtf1_only.gtf and prefix_gtf2_only.gtf. To include them, all that is necessary is to run TranD on these GTF files separately.

Note: It can be possible that there are no unique genes in either file (as it is for GTF2 in the example). If so, the "gtf_only" file will be empty, and this step will not be necessary.

trand \
    /path/to/TranD/output/mel2mel_vs_sim2mel_gtf1_only.gtf \
    -o /path/to/output_directory/TranD_GTF1_only_output \
    -e pairwise \
    -f \
    -n 8      # Number of CPUs

trand \
    /path/to/TranD/output/mel2mel_vs_sim2mel_gtf2_only.gtf \
    -o /path/to/output_directory/TranD_GTF2_only_output \
    -e pairwise \
    -f \
    -n 8      # Number of CPUs

The only key output from this is the PD file: Example Output.

Create the Union ERG Files

1. Run id_ERG on the PD file(s)

Identify ERGs, or groups of transcripts with overlapping exon regions, across the two GTF files using the TranD utility id_ERG.py:

python id_ERG.py \
    -i /path/to/TranD/output/mel2mel_vs_sim2mel_minimum_pairwise_transcript_distance.csv \ ## TranD 2GTF pairwise distance output
    -g \                           ## Argument to output the GTF of representative ERG transcript models
    -p mel2mel_vs_sim2mel \        ## Prefix for output files
    -o /path/to/output_directory \ ## Path to output directory
    -ir Y                          ## Y: IR transcripts are included in groups, N: they are excluded

We recommend to keep the intron retention (-ir) command as Y but this may change depending on usage. For more information about id_ERG and -ir check the Utility Descriptions.

If necessary, identify ERGs in both 1GTF (unique genes) outputs:

python id_ERG.py \
    -i /path/to/TranD_GTF1_only_output/pairwise_transcript_distance.csv \ ## TranD 1GTF pairwise distance output
    -g \                              ## Argument to output the GTF of representative ERG transcript models
    -p mel2mel_vs_sim2mel_gtf1_only \ ## Prefix for output files
    -o /path/to/output_directory \    ## Path to output directory
    -ir Y \
    -w 1

python id_ERG.py \
    -i /path/to/TranD_GTF2_only_output/pairwise_transcript_distance.csv \ ## TranD 1GTF pairwise distance output
    -g \                              ## Argument to output the GTF of representative ERG transcript models
    -p mel2mel_vs_sim2mel_gtf2_only \ ## Prefix for output files
    -o /path/to/output_directory \    ## Path to output directory
    -ir Y \
    -w 2

When running 1GTF TranD output in id_ERG to create a Transcript Model Map, make sure to use -w #. Use -w 1 if using GTF1 only data and -w 2 if using GTF2 only data. The distinction between GTF1 and GTF2 was made earlier when running TranD 2GTF. The -w command creates the 'which_gtf' column which does not normally exist when using 1 GTF output; it is very useful to have this column when merging the ERG files together (more on this later).

The id_ERG utility will generate 4 files per PD file. They are as follows:

prefix_xscript_output.csv, an ERG key file that is unique on transcript.
prefix_gene_output.csv, an ERG key file that is unique on gene.
prefix_erg_output.csv, an ERG key file that is unique on ERG.
prefix_erg_gtf.gtf, a GTF with one transcript model representing each unique ERG.

Example Output: 2GTF, GTF1 Only

Again, the example output does not have any genes unique to GTF2.

2. Concatenate ERG Output Files

The only thing left to do is concatenate the output made using the 2GTF PD files and the output made using the "GTF Only" PD files to create the Union ERG Files:

# GTF
cat /path/to/ERG/output/2GTF/mel2mel_vs_sim2mel_erg_gtf.gtf \
    /path/to/ERG/output/GTF1_only/mel2mel_vs_sim2mel_gtf1_only_erg_gtf.gtf \
    /path/to/ERG/output/GTF2_only/mel2mel_vs_sim2mel_gtf2_only_erg_gtf.gtf \
    > path/to/ERG/output/union/mel2mel_vs_sim2mel_union_erg_gtf.gtf

# XSCRIPT
cat /path/to/ERG/output/2GTF/mel2mel_vs_sim2mel_xscript_output.csv \
    <(tail +2 path/to/ERG/output/2GTF/GTF1_only/mel2mel_vs_sim2mel_gtf1_only_xscript_output.csv) \
    <(tail +2 path/to/ERG/output/2GTF/GTF2_only/mel2mel_vs_sim2mel_gtf2_only_xscript_output.csv) \
    > path/to/ERG/output/union/mel2mel_vs_sim2mel_union_xscript_output.csv

# GENE
cat /path/to/ERG/output/2GTF/mel2mel_vs_sim2mel_gene_output.csv \
    <(tail +2 path/to/ERG/output/2GTF/GTF1_only/mel2mel_vs_sim2mel_gtf1_only_gene_output.csv) \
    <(tail +2 path/to/ERG/output/2GTF/GTF2_only/mel2mel_vs_sim2mel_gtf2_only_gene_output.csv) \
    > path/to/ERG/output/union/mel2mel_vs_sim2mel_union_gene_output.csv

# ERG
cat /path/to/ERG/output/2GTF/mel2mel_vs_sim2mel_erg_output.csv \
    <(tail +2 path/to/ERG/output/2GTF/GTF1_only/mel2mel_vs_sim2mel_gtf1_only_erg_output.csv) \
    <(tail +2 path/to/ERG/output/2GTF/GTF2_only/mel2mel_vs_sim2mel_gtf2_only_erg_output.csv) \
    > path/to/ERG/output/union/mel2mel_vs_sim2mel_union_erg_output.csv

If there were no genes unique to GTF1 or GTF2, remove those lines when concatenating. If there were no unique genes for both GTF1 and GTF2, the Union ERG Files are the output generated when running id_ERG on the 2GTF PD file.

This will finalize the last piece of the transcript model map.

Example Output

Example Transcript Model Map

The entire example Transcript Model Map created using the above process can be found here.

Create within Species Transcript Model Map - McIntyre-Lab/TranD GitHub Wiki

Table of Contents

Introduction

Example Data Origin and Flowchart

Download input files

Create Union Reference

1. Identify UJC for first set of transcript models

2. Identify UJC for second set of transcript models

3. Make union reference

Generate Minimum Pairwise Distance Calculations

Generate the Pair Classification File

Generate Pairwise Distance Calculations for Genes Unique to 1 GTF

Create the Union ERG Files

1. Run id_ERG on the PD file(s)

2. Concatenate ERG Output Files

Example Transcript Model Map