Making sense of EDTA usage and outputs Q&A - oushujun/EDTA Wiki
If the page gets too busy, you may use the search function to find your keywords.
What's the difference between different GFF files?
You will get three GFF files when the --anno 1 parameter is used.
genome.fa.mod.EDTA.intact.gff3: This file contains only structurally intact TEs including LTRs, TIRs, and Helitrons in the genome. Entries in this file could be overlapping due to the nesting nature of TEs (a TE inserted into another TE). Of course, misannotations may also result in overlapping entries.
genome.fa.mod.EDTA.TEanno.gff3: This file contains both structurally intact TEs and fragmented TEs. Thus, all entries in the file genome.fa.mod.EDTA.intact.gff3 are included. To distinguish intact from fragmented TEs, look for the "Method" information in the 9th column of the file (see more in the next question). Similarly, entries in this file could be overlapping.
genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.split.gff3: This file also contains both structurally intact TEs and fragmented TEs, but each entry is not overlapping with other entries. This file is sometimes helpful to count TE length of each family. However, since overlapping entries are split up, it may cause over-counting for the number of annotation entries.
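To make the effect of overlapping entries concrete, here is a minimal Python sketch. It compares the naive sum of entry lengths with the number of bases actually covered, using two overlapping intervals taken from the sample GFF3 records shown later on this page. The function name `merged_length` is hypothetical and is not part of EDTA; for real reporting, the summary file produced by EDTA is the recommended source.

```python
def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlaps the running block: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


# A helitron (219456-221834) and a repeat_region (221830-233499) from the sample.
entries = [(219456, 221834), (221830, 233499)]
naive = sum(end - start + 1 for start, end in entries)
merged = merged_length(entries)
print(naive)   # 14049
print(merged)  # 14044
```

The 5-bp difference is the stretch shared by both entries, which is why summing lengths from a file with overlapping entries can overestimate TE content.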
How to interpret the annotation GFF3 file?
EDTA produces two GFF3 files: the intact-TE file (genome.fa.mod.EDTA.intact.gff3, described above) and genome.mod.EDTA.TEanno.gff3. The first file collects all structurally intact TEs. The second file contains both structurally intact and fragmented TEs and represents the whole-genome TE annotation. Thus, the first file is a subset of the second file. Here is a sample of the second file:
Tzi8_chr1 EDTA Gypsy_LTR_retrotransposon 217199 217303 468 - . ID=TE_homo_200;Name=hute_AC204317_8199;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.762;Method=homology
Tzi8_chr1 EDTA helitron 217381 219292 9230 + . ID=TE_homo_201;Name=TE_00004146;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.932;Method=homology
Tzi8_chr1 EDTA hAT_TIR_transposon 219293 219391 432 - . ID=TE_homo_202;Name=TE_00001127;Classification=MITE/DTA;Sequence_ontology=SO:0002279;Identity=0.752;Method=homology
Tzi8_chr1 EDTA helitron 219456 221834 20911 + . ID=TE_homo_203;Name=TE_00012917;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.97;Method=homology
Tzi8_chr1 EDTA repeat_region 221830 233499 . - . ID=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000657;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1 EDTA target_site_duplication 221830 221834 . - . ID=lTSD_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1 EDTA long_terminal_repeat 221835 224508 . - . ID=lLTR_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1 EDTA Gypsy_LTR_retrotransposon 221835 233494 . - . ID=LTRRT_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1 EDTA long_terminal_repeat 230798 233494 . - . ID=rLTR_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1 EDTA target_site_duplication 233495 233499 . - . ID=rTSD_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
...
Tzi8_chr1 EDTA Mutator_TIR_transposon 328718 329794 . . . ID=TE_struc_906;Name=Tzi8_chr1:328718..329794;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=1;Method=structural;TSD=AATCTTCTTT_AATCTTCTTT_100.0;TIR=GAGTAAAGTA_TACTTTACTC
The file follows this format:
seqid source sequence_ontology start end score strand phase attributes.
sequence_ontology: the sequence feature type as defined by http://www.sequenceontology.org/. The sequence ontology (SO) ID is provided in the attributes field. Here is the full list of SOs used by EDTA.
score: The Smith-Waterman score generated by RepeatMasker, so it's only available for homology entries. A score cutoff of 300 is used to filter out low-confidence matches. You may read more in the RepeatMasker Documentation.
phase: Phasing information required by the GFF3 format. This field is filled with "." as a placeholder, as in the sample above.
attributes: This field contains several pieces of information:
- ID: unique ID for each TE entry.
- Name: TE family name.
- Classification: Superfamily classification. It corresponds one-to-one to the Sequence_ontology term.
- Sequence_ontology: Sequence ontology.
- Identity: Ranging from 0 to 1 in the sample above (e.g., Identity=0.762). The meaning is context-dependent.
For Method=homology, this represents the similarity (1 minus divergence) between this sequence and the library sequence.
For Method=structural, this represents the similarity between the terminal inverted repeat sequences of this element. Only available for TIR elements.
- ltr_identity: Only available for structural LTR annotations. This is the similarity between the 5' and 3' LTRs of the element.
- Method: Indicates whether this entry was produced by structural or homology annotation (Method=structural or Method=homology).
- motif: Only available for structural LTR annotations. This is the di-nucleotide motif at the start and end of the LTR region. For example, TGCA means the LTR has TG at the start and CA at the end.
- tsd/TSD: Target site duplication, only available for structural LTR and TIR annotations. For TIRs, TSD contains the sequences from both the 5' and 3' ends of the element. For example, TSD=AATCTTCTTT_AATCTTCTTT_100.0 means AATCTTCTTT at the 5' end and AATCTTCTTT at the 3' end, and their identity is 100%.
- TIR: Terminal inverted repeat, only available for structural TIR annotations. It contains the sequences from both the 5' and 3' ends of the element. For example, TIR=GAGTAAAGTA_TACTTTACTC means GAGTAAAGTA at the 5' end and TACTTTACTC at the 3' end.
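The column layout and attribute keys described above can be read with a few lines of Python. This is a minimal sketch, not an EDTA utility: the helper names `parse_gff3_line` and `split_tsd` are hypothetical, and the code assumes tab-separated columns as the GFF3 format requires (the sample on this page is shown with spaces).

```python
def parse_gff3_line(line):
    """Split one GFF3 record into its nine columns plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, so_type, start, end, score, strand, phase, attr_text = cols
    # Attributes are key=value pairs separated by semicolons.
    attrs = dict(pair.split("=", 1) for pair in attr_text.split(";") if pair)
    return {
        "seqid": seqid, "source": source, "type": so_type,
        "start": int(start), "end": int(end),
        # The score is "." for structural entries, so map it to None.
        "score": None if score == "." else float(score),
        "strand": strand, "phase": phase, "attributes": attrs,
    }


def split_tsd(value):
    """Split a TIR-style TSD value such as 'AATCTTCTTT_AATCTTCTTT_100.0'."""
    five_prime, three_prime, identity = value.split("_")
    return five_prime, three_prime, float(identity)


# The structural TIR record from the sample, rebuilt with tab separators.
record = parse_gff3_line("\t".join([
    "Tzi8_chr1", "EDTA", "Mutator_TIR_transposon", "328718", "329794", ".", ".", ".",
    "ID=TE_struc_906;Name=Tzi8_chr1:328718..329794;Classification=DNA/DTM;"
    "Sequence_ontology=SO:0002280;Identity=1;Method=structural;"
    "TSD=AATCTTCTTT_AATCTTCTTT_100.0;TIR=GAGTAAAGTA_TACTTTACTC",
]))
print(record["attributes"]["Method"])          # structural
print(split_tsd(record["attributes"]["TSD"]))  # ('AATCTTCTTT', 'AATCTTCTTT', 100.0)
```

Filtering on `record["attributes"]["Method"]` is one way to separate intact (structural) from fragmented (homology) entries in the whole-genome annotation file.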
What do "TE_homo_xxx" and "TE_struc_xxx" mean?
"TE_homo_xxx" means this entry is annotated based on homology to the TE library. "TE_struc_xxx" means this entry is annotated by structural methods. Each TE annotation entry has a unique ID.
Why do some annotations have directions (+/-) and some do not?
All homology-based annotations ("TE_homo_xxx") should have directions as assigned by RepeatMasker. The TE library entry is assumed to be + when assigning the direction. However, this is not always the correct assignment because we don't know the direction of some TEs lacking coding information. Some structure-based LTR annotations ("TE_struc_xxx") may have directions. The direction is determined by the consensus direction of coding sequences in the element. For elements that lack coding sequences or have no consensus coding direction, the direction is labeled "?". Structure-based TIR transposons have no direction and are labeled ".". All structure-based Helitrons have directions.
Why are structurally intact LTRs and TIRs represented differently in the GFF3 file?
Structurally intact LTR elements contain the following structural features annotated by
repeat_region: The entire repeat region including TSD and the LTR element.
lTSD: Left target site duplication.
lLTR: Left long terminal repeat.
LTRRT: The LTR retrotransposon containing lLTR, internal region, and rLTR.
rLTR: Right long terminal repeat.
rTSD: Right target site duplication.
Structurally intact TIR elements have TSD and TIR information in the attributes field of the file.
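Because the LTR sub-features point to their repeat_region through the Parent attribute, they can be regrouped per element. The following Python sketch assumes only the ID and Parent keys seen in the sample records; the attribute strings are abbreviated copies of that sample, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(text):
    """Turn 'key=value;key=value' into a dict."""
    return dict(pair.split("=", 1) for pair in text.split(";") if pair)


def group_children(attribute_columns):
    """Map each parent ID to the IDs of the features that name it as Parent."""
    children = defaultdict(list)
    for text in attribute_columns:
        attrs = parse_attributes(text)
        if "Parent" in attrs:
            children[attrs["Parent"]].append(attrs["ID"])
    return dict(children)


# 9th-column values from the sample, shortened to the keys used here.
attribute_columns = [
    "ID=repeat_region_3;Method=structural",
    "ID=lTSD_3;Parent=repeat_region_3;Method=structural",
    "ID=lLTR_3;Parent=repeat_region_3;Method=structural",
    "ID=LTRRT_3;Parent=repeat_region_3;Method=structural",
    "ID=rLTR_3;Parent=repeat_region_3;Method=structural",
    "ID=rTSD_3;Parent=repeat_region_3;Method=structural",
]
grouped = group_children(attribute_columns)
print(grouped)
# {'repeat_region_3': ['lTSD_3', 'lLTR_3', 'LTRRT_3', 'rLTR_3', 'rTSD_3']}
```

When counting intact LTR elements, counting only one feature type per element (for example the LTRRT entries) avoids counting the same element six times.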
Why do some TE families have coordinates as their names? e.g., ID=TE_struc_37;Name=Chr2:800875..805238;Classification=DNA/DTT
First, these TEs are all structurally annotated ("TE_struc_37"). Further, they cannot be clustered with other structural TEs, so there's no need to use another name to represent them. They are likely single copies in the genome, and there is lower confidence that they are real TEs. Thus, their sequences are not used to construct the TE library, to avoid inflating false annotations. You may filter them out based on their naming structure.
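Filtering by naming structure can be done with a regular expression. This is a sketch under the assumption that coordinate-style names always look like the two examples on this page (sequence ID, a colon, then start..end); the pattern and function names are illustrative, not part of EDTA.

```python
import re

# Matches names such as "Chr2:800875..805238": <seqid>:<start>..<end>
COORD_NAME = re.compile(r".+:\d+\.\.\d+")


def is_coordinate_named(name):
    """True if a family name is just the genomic coordinates of the element."""
    return COORD_NAME.fullmatch(name) is not None


def drop_coordinate_named(names):
    """Keep only names that refer to a TE library family."""
    return [name for name in names if not is_coordinate_named(name)]


names = [
    "Chr2:800875..805238",
    "Tzi8_chr1:328718..329794",
    "TE_00004146",
    "xilon_diguus_AC203313_7774",
]
print(drop_coordinate_named(names))
# ['TE_00004146', 'xilon_diguus_AC203313_7774']
```

Removing these entries trades sensitivity for specificity, since some single-copy elements are genuine TEs.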
Why do TE families in the GFF3 file not have consistent classifications?
Ideally, each TE family (each sequence in the TE library represents a TE family) should have only one classification, but in the GFF3 file you may observe multiple classifications of the same family:
Count  Name             Sequence_ontology
13     TE_00000022      helitron
161    TE_00000022_LTR  LTR_retrotransposon
12     TE_00000023      helitron
25     TE_00000024      helitron
39     TE_00000024      CACTA_TIR_transposon
37     TE_00000025      helitron
444    TE_00000025      CACTA_TIR_transposon
19     TE_00000026      helitron
314    TE_00000026      CACTA_TIR_transposon
1      TE_00000027      Mutator_TIR_transposon
21     TE_00000027      helitron
1796   TE_00000027_INT  LTR_retrotransposon
This represents the inconsistency between the structural and homology annotations. During the structural annotation process, each element is identified and classified by structural features, and the non-redundant TE library is generated from these structurally intact TEs. To classify structurally intact elements at the family level, EDTA uses the TE library to annotate each intact TE, which is where inconsistency may occur. If an intact TE shares an extensive portion (passing the 80-80-80 threshold) with an LTR sequence in the library (e.g., TE_00000022_LTR), then this intact TE will be classified as a copy of that family. However, if this intact TE is not structurally annotated as an LTR (e.g., helitron), inconsistency occurs. In practice, we found that inconsistencies mostly involve helitron, for two main reasons:
1. Technically, helitrons are very difficult to annotate accurately, and thus we have relatively low confidence in helitron annotations (benchmarked in the EDTA paper).
2. Biologically, helitrons tend to capture sequences when they transpose, and many TEs could be captured during the process. If a helitron carries more captured non-helitron TE sequence than helitron sequence, it could pass the classification threshold and be classified as a non-helitron.
For example, 161 TE_00000022_LTR were correctly annotated as LTR retrotransposons but 13 were misclassified as helitrons. The classification inconsistency of this family is 7.5% (13 of 174). For the case of TE_00000024, 25 were classified as helitron, while 39 were CACTA_TIR_transposon. The inconsistency is pretty high. Overall, most of the inconsistencies are not too bad. This represents one of the most difficult challenges in TE annotation.
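The 7.5% figure can be reproduced from the counts in the table. The sketch below assumes that the most frequent classification of a family is the reference and that every other classification counts as inconsistent. That definition is an assumption made for illustration and is not necessarily how EDTA's evaluation step computes it. Following the example in the text, TE_00000022 and TE_00000022_LTR are treated as one family.

```python
def inconsistency(counts):
    """Fraction of copies whose classification differs from the majority one."""
    total = sum(counts.values())
    return 1 - max(counts.values()) / total


# Counts copied from the table above, grouped by family.
families = {
    "TE_00000022": {"LTR_retrotransposon": 161, "helitron": 13},
    "TE_00000024": {"CACTA_TIR_transposon": 39, "helitron": 25},
}
for name, counts in families.items():
    print(f"{name}: {inconsistency(counts):.1%}")
# TE_00000022: 7.5%
# TE_00000024: 39.1%
```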
How to summarize TE annotation in my genome?
It is recommended to use the summary file genome.fa.mod.EDTA.TEanno.sum produced by EDTA to report TE annotation. This report accounts for overlapping annotations and gives reasonable summaries of TE length and copy number. The file is produced by the script ./util/buildSummary.pl, which was originally developed by Robert M. Hubley as part of the RepeatMasker package and adapted to EDTA. See thread #169 for more on reproducing the summary file.
How to evaluate the quality of annotation?
Without prior knowledge and manual curation, it's difficult to know if the annotation is correct or wrong. One way to estimate the quality of the annotation is to see if the annotation is done consistently. For example, to check if a sequence and its similar sequences are consistently annotated as LTR or not. You may think consistent annotation is a basic requirement, but it's not! It's quite challenging to produce a consistent TE annotation due to the complexity of TE sequences such as nested insertions. In general, genomes annotated by a curated TE library have inconsistency lower than 3%. EDTA has inconsistency of 1% - 40% depending on the TE category, and mostly 10-20%. Other de novo tools may have inconsistency as high as 80%! The original paper has more descriptions and discussions.
You may use the --evaluation 1 parameter to obtain the annotation consistency. If you forgot to add this in your run, you may rerun EDTA with --step anno --anno 1 --overwrite 0 --evaluation 1, and it should skip most of the finished steps and start the evaluation. The evaluation step involves all-vs-all BLAST searches and could take quite some time. The evaluation step won't change the annotation at all.
How to interpret the evaluation results?
The files ".stat.all.sum", ".stat.nested.sum", and ".stat.redun.sum" describe the level of annotation inconsistency in the data. The inconsistency could be due to technical issues (EDTA fails to classify some TEs accurately) or biological reasons (some TEs are nested inside other TEs). Basically, the ".stat.redun.sum" file describes the technical inconsistency and the ".stat.nested.sum" file describes the biological inconsistency, while the ".stat.all.sum" file summarizes both. See more discussions in #260.
How to improve the annotation quality?
There are some ad hoc ways to improve the annotation quality. Because EDTA has low power to annotate SINEs and LINEs, getting this part right is challenging for genomes with high proportions of SINEs/LINEs (e.g., vertebrate genomes).
I recommend manually preparing a non-redundant SINE/LINE library for your species and giving it to EDTA (--curatedlib lib.fasta). This will significantly improve the annotation of both SINEs/LINEs and other TEs. As far as I know, there is a SINE database that collects some curated SINE sequences. NCBI is also a good place to look. A nice pipeline for de novo SINE annotation was recently published. You may also do a more generic Google search for your species.
For post hoc improvements, you may filter out those single-copy elements at the expense of sensitivity.
Can EDTA annotate non-TE repeats? For example, telomeric repeat, centromeric repeat, knob repeat
Yes, but not in a de novo fashion. You may identify or collect non-TE repeats manually and provide them to EDTA as a curated library (--curatedlib lib.fasta); EDTA will then use RepeatMasker to annotate sequences similar (≤40% divergence) to these repeats. If a provided repeat sequence is "too short", for example, a 7-bp telomeric repeat, it will not be used by RepeatMasker, and thus the final annotation won't include it.
Why are the results of multiple runs different?
Because the library-making process in EDTA is random (./util/cleanup_nested.pl). This reflects the uncertainty we face when trying to select representative sequences. Fortunately, the differences are not very big as far as we have observed. See more discussions in #246.