Human Annotation Archive - McIntyre-Lab/TranD GitHub Wiki

Below is a description of the method used to compare the Human hg38 RefSeq and Ensembl annotations using TranD and generate a union reference.

All steps are described below and more detailed documentation and scripts in the links provided.

For more information on how to run TranD, see the User Guide.

Convert chromosome IDs

Convert chromosome ID of RefSeq (formated NC_###.#) to Ensembl chromsome ID (formated 1, 2, 3, etc.) using cthreepo.

Clone and install cthreepo

$ git clone https://github.com/vkkodali/cthreepo.git
$ conda create -n cthreepo python=3.7
$ conda activate cthreepo
$ cd cthreepo
$ python setup.py install

Download input files

RefSeq Input GTF Assembly Report
Ensembl Input GTF

Convert chromosome IDs with cthreepo

For RefSeq to Ensembl chromosome IDs:

cthreepo -i <(zcat < GCF_000001405.39_GRCh38.p13_genomic.gtf.gz) -if rs -it ens -f gtf \
    -m GCF_000001405.39_GRCh38.p13_assembly_report.txt -o refseq_GCF_000001405.39_GRCh38.p13_genomic_converted_ens.gtf

For Ensembl scaffolds to be consistent with the converted RefSeq chromosome IDs:

cthreepo -i <(zcat < Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gtf.gz) -if ens -it ens -f gtf \
    -m GCF_000001405.39_GRCh38.p13_assembly_report.txt -o ensembl_Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff_converted_ens.gtf

Get common associated gene_id values

To compare splice junction chains across transcripts with TranD, genes with overlapping exons (on the same strand) are given a common gene_id GFFcompare is used to find overlapping exons (For simplicity, scaffolds were removed):

awk '$3=="transcript"||$3=="exon"' refseq_GCF_000001405.39_GRCh38.p13_genomic_converted_ens.gtf | \
    awk -v chr='1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT' \
    'index(","chr",",","$1",")' \
    > ${ROZ}/${RSGTF}
awk '$3=="transcript"||$3=="exon"' ensembl_Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff_converted_ens.gtf | \
    awk -v chr='1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT' \
    'index(","chr",",","$1",")' \
    > ${OUTD}/hg38_ensembl_xrpt_exon.gtf

## Run gffcompare on refseq GTF using ensembl GTF file for annotation
gffcompare -r ${OUTD}/hg38_ensembl_xrpt_exon.gtf -o ${OUTD}/hg38_refseq_2_ensembl ${ROZ}/${RSGTF}

To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:

python correct_gffcompare_GTF_gene_id.py \
    -a ${OUTD}/hg38_refseq_2_ensembl.annotated.gtf \
    -k \
    -g ${ROZ}/${RSGTF} \
    -o ${OUTD}/hg38_refseq_2_ensembl_corrected_associated_gene.gtf

NOTE: The -k argument used here ensures that if there is a RefSeq gene with no exon overlap to an Ensembl gene, the RefSeq gene retains the RefSeq gene_id value (without the -k option, all gene_id values are replaced with either a reference gene_id or a gffcompare XLOC identifier)

Identify unique junction chains within each annotation

Annotations may contain transcripts with identical splice junction chains, these are identified within each reference first. To identify unique junction chains (UJC), use the TranD utility id_ujc.py:

## ID UJC in Refseq
python id_ujc.py \
    -g /path/to/hg38_refseq_2_ensembl_corrected_associated_gene.gtf \    # Input GTF file
    -x rs \                                                          # Prefix to use for transcript_id values for the representative UJC transcript models
    -o /path/to/output_directory \                                       # Output directory
    -p rs                                                            # Output file prefix

## ID UJC in Ensemble
python id_ujc.py \
    -g /path/to/hg38_ensembl_xrpt_exon.gtf \    # Input GTF file
    -x ens \                                                         # Prefix to use for transcript_id values for the representative UJC transcript models
    -o /path/to/output_directory \                                       # Output directory
    -p ens

Note that the -x argument can be changed to any desired prefix for ujc_id transcript model names. In this example, we use either 'refseq' or 'ensembl' so all UJC names will be "refseq_[representative transcript]", for example. The -p argument sets the prefix of the two output files that will be written to the directory provided in the -o argument.

The output of this step will be:

A GTF file of representative UJC transcript models (/path/to/output_directory/ensembl_UJC.gtf)
A UJC key file of original ensembl transcript model id values matched to the associated UJC id value (/path/to/output_directory/ensembl_UJC_ID.csv)

Union reference

To make a union reference containing all unique splice junction chains across the two annotations, unique splice junctions of each are concatenated together and consolidated using the id_ujc.py utility:

cat ${OUTD}/TranD_consol_gene_hg38_ensembl/consolidated_transcriptome.gtf \
    ${OUTD}/TranD_consol_gene_hg38_refseq_2_ensembl/consolidated_transcriptome.gtf \
    > ${OUTD}/cat_prefix_consolidated.gtf

    python id_ujc.py \
        --gtf ${OUTD}/cat_prefix_consolidated.gtf \
        --transcript-prefix tr \
        -o ${OUTD}/TranD_ensembl_refseq_union
        --prefix ensembl_refseq_union

The GTF of the union reference is ${OUTD}/TranD_ensembl_refseq_union/ensembl_refseq_union_UJC.gtf in the example given and can be downloaded from nowhere, yet.

marking this link as an old file.

TranD 2 GTF comparison of annotations

To compare the unique splice junction chains across the two annotations TranD 2 GTF pairwise is used. Due to the large number of unique splice junction chains in each annotation, TranD on the full GTF files may a long time to run. To reduce the time to run the GTF files can be split by chromosome and run separately as array jobs on a cluster (NOTE: looping over the chromosomes on a single machine will not decrease runtime):

## Subset input GTF files for chromosome
awk -v chrom=${CHROM} '$1==chrom' \
    ${OUTD}/TranD_consol_gene_hg38_refseq_2_ensembl/consolidated_transcriptome.gtf \
    > ${ROZ}/hg38_refseq_2_ensembl.gtf
awk -v chrom=${CHROM} '$1==chrom' \
    ${OUTD}/TranD_consol_gene_hg38_ensembl/consolidated_transcriptome.gtf \
    > ${ROZ}/hg38_ensembl.gtf

    trand \
        ${ROZ}/hg38_refseq_2_ensembl.gtf \
        ${ROZ}/hg38_ensembl.gtf \
        -e pairwise \
        -1 hg38_refseq_2_ensembl \
        -2 hg38_ensembl \
        -p all \
        -l ${OUTD}/chrom_${CHROM}/TranD_consol_hg38_refseq_2_ensembl_vs_ensembl_${CHROM}.log \
        -d \
        -o ${OUTD}/chrom_${CHROM} \
        -n $SLURM_CPUS_ON_NODE \
        -f

Once all chromosomes of interest have completed TranD 2 GTF, the outputs can be combined and all data replotted to get the summarizing TranD figures across all chromosomes using the TranD plotting utility plot_trand_from_output_files.py:

## Concatenate all pairwise transcript distance output
awk 'FNR==1 && NR!=1{next;}{print}' \
    ${OUTD}/chrom_*/*_pairwise_transcript_distance.csv \
    > ${OUTD}/all_chrom_pairwise_transcript_distance.csv

## Concatenate all gtf1_only.gtf and gtf2_only.gtf files
cat ${OUTD}/chrom_*/*_gtf1_only.gtf \
    > ${OUTD}/all_chrom_gtf1_only.gtf
cat ${OUTD}/chrom_*/*_gtf2_only.gtf \
    > ${OUTD}/all_chrom_gtf2_only.gtf

## Plot
python plot_trand_from_output_files.py \
    -p2 ${OUTD}/all_chrom_pairwise_transcript_distance.csv \
    -n1 hg38_refseq_2_ensembl \
    -n2 hg38_ensembl \
    -g1 ${OUTD}/all_chrom_gtf1_only.gtf \
    -g2 ${OUTD}/all_chrom_gtf2_only.gtf \
    -f \
    -o ${OUTD}

TranD 2 GTF output for RefSeq vs. Ensembl can be found here.