Making sense of EDTA usage and outputs Q&A - oushujun/EDTA GitHub Wiki

If the page gets too busy, you may use the search function to find your keywords.

How to prepare the CDS file?
Using nucleotide coding sequences (CDS) will help to remove gene-like sequences in the TE library. Please use only CDS and no more, because TEs are sometimes present in untranslated regions (UTRs) and introns, and using these sequences will also remove TE sequences in the TE library and reduce the sensitivity. CDS is optional in EDTA but required in panEDTA because the error could propagate when multiple genomes are annotated. To fulfill the CDS requirement in panEDTA, at least one CDS file is needed (See more discussions in #312). It's OK even if the CDS file contains some TE-like coding sequences because they will be cleaned by TEsorter and RepeatMasker before being used to clean the TE library. You may do one of the following:
1. use CDS from a closely related species.
2. use a first-pass gene prediction from one of the genomes.
3. use the Arabidopsis/Rice/Drosophila/Human/Yeast/other model species CDS file.
4. use the toy CDS file in the test folder.
Options 1 and 2 will give you power to remove gene-related sequences in the result; option 3 will give you only some power, limited to conserved genes, because model species could be distant from your species; option 4 is just a bypass solution (not recommended).
What's the difference between different GFF files?
You will get three GFF files when the --anno 1 parameter is used.
- genome.fa.mod.EDTA.intact.gff3: This file contains only structurally intact TEs including LTRs, TIRs, and Helitrons in the genome. Entries in this file could be overlapping due to the nesting nature of TEs (a TE inserted into another TE). Of course, misannotations may also result in overlapping entries.
- genome.fa.mod.EDTA.TEanno.gff3: This file contains both structurally intact TEs and fragmented TEs. Thus, all entries in the file genome.fa.mod.EDTA.intact.gff3 are included. To distinguish intact and fragmented TEs, please find the information in the 9th column of the file, and look for the "Method" information (see more in the next question). Similarly, entries in this file could be overlapping.
- genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.split.gff3: This file also contains both structurally intact TEs and fragmented TEs, but each entry is not overlapping with other entries. This file is sometimes helpful to count TE length of each family. However, since overlapping entries are split up, it may cause over-counting for the number of annotation entries.
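
As a rough illustration of why the non-overlapping split file suits length counting, here is a minimal Python sketch that sums feature lengths per Classification. The function name `length_by_classification` is hypothetical and not part of EDTA, and the whitespace-based column splitting is an assumption made so the sample lines shown on this page can be parsed; real GFF3 files are tab-separated.

```python
from collections import defaultdict

def length_by_classification(gff_lines):
    """Sum feature lengths (bp) per Classification attribute.

    Only meaningful on non-overlapping entries (e.g. the *.split.gff3 file);
    on overlapping files the same base would be counted more than once.
    """
    totals = defaultdict(int)
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: split on any whitespace, keeping the 9th field whole.
        cols = line.split(None, 8)
        if len(cols) < 9:
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        # GFF3 coordinates are 1-based and inclusive.
        totals[attrs.get("Classification", "unknown")] += end - start + 1
    return dict(totals)

# Three non-overlapping entries copied from the sample annotation on this page.
sample = [
    "Tzi8_chr1 EDTA helitron 217381 219292 9230 + . ID=TE_homo_201;Name=TE_00004146;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.932;Method=homology",
    "Tzi8_chr1 EDTA hAT_TIR_transposon 219293 219391 432 - . ID=TE_homo_202;Name=TE_00001127;Classification=MITE/DTA;Sequence_ontology=SO:0002279;Identity=0.752;Method=homology",
    "Tzi8_chr1 EDTA helitron 219456 221834 20911 + . ID=TE_homo_203;Name=TE_00012917;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.97;Method=homology",
]
print(length_by_classification(sample))
```

To run it on a real file, pass an open file handle instead of the sample list. For reported statistics, the summary file produced by EDTA remains the recommended source.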

How to interpret the annotation GFF3 file?
EDTA produces two GFF3 files: genome.mod.EDTA.intact.gff3 and genome.mod.EDTA.TEanno.gff3. The first file collects all structurally intact TEs. The second file contains both structurally intact and fragmented TEs and represents whole-genome TE annotation. Thus, the first file is a subset of the second file. Here is a sample of the genome.mod.EDTA.TEanno.gff3 file:

Tzi8_chr1       EDTA    Gypsy_LTR_retrotransposon       217199  217303  468     -       .       ID=TE_homo_200;Name=hute_AC204317_8199;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.762;Method=homology
Tzi8_chr1       EDTA    helitron        217381  219292  9230    +       .       ID=TE_homo_201;Name=TE_00004146;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.932;Method=homology
Tzi8_chr1       EDTA    hAT_TIR_transposon      219293  219391  432     -       .       ID=TE_homo_202;Name=TE_00001127;Classification=MITE/DTA;Sequence_ontology=SO:0002279;Identity=0.752;Method=homology
Tzi8_chr1       EDTA    helitron        219456  221834  20911   +       .       ID=TE_homo_203;Name=TE_00012917;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.97;Method=homology
Tzi8_chr1       EDTA    repeat_region   221830  233499  .       -       .       ID=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000657;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1       EDTA    target_site_duplication 221830  221834  .       -       .       ID=lTSD_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1       EDTA    long_terminal_repeat    221835  224508  .       -       .       ID=lLTR_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1       EDTA    Gypsy_LTR_retrotransposon       221835  233494  .       -       .       ID=LTRRT_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1       EDTA    long_terminal_repeat    230798  233494  .       -       .       ID=rLTR_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
Tzi8_chr1       EDTA    target_site_duplication 233495  233499  .       -       .       ID=rTSD_3;Parent=repeat_region_3;Name=xilon_diguus_AC203313_7774;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9910;Method=structural;motif=TGCA;tsd=TTGAT
...
Tzi8_chr1       EDTA    Mutator_TIR_transposon  328718  329794  .       .       .       ID=TE_struc_906;Name=Tzi8_chr1:328718..329794;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=1;Method=structural;TSD=AATCTTCTTT_AATCTTCTTT_100.0;TIR=GAGTAAAGTA_TACTTTACTC

The file follows this format: seqid source sequence_ontology start end score strand phase attributes.

sequence_ontology (the 3rd column): the definition of sequence features defined by http://www.sequenceontology.org/. The sequence ontology (SO) ID is provided in the attributes field. Here is the full list of SOs used by EDTA.
- Full-length LTR retrotransposons contain structural features, with the parent feature called repeat_region.
- Unknown repeats are named repeat_fragment in this column.
score (the 6th column): The Smith-Waterman score generated by RepeatMasker, so it's only available for homology entries. A cutoff of 300 is used to filter out low-confidence matches. You may read more in the RepeatMasker Documentation.
phase (the 8th column): Phasing information required by the GFF3 format. This field is always filled with ".".
attributes (the 9th column): This field contains several pieces of information:
- ID: unique ID for each TE entry.
- Name: TE family name.
- Classification: Superfamily classification. Corresponds one-to-one to sequence_ontology.
- Sequence_ontology: Sequence ontology.
- Identity: Ranging from 0 to 1 in the sample above (e.g., Identity=0.762). The meaning is context-dependent.
  - If Method=homology, this represents the sequence identity between this sequence and the library sequence.
  - If Method=structural, this represents the sequence identity between the terminal inverted repeat sequences of this element. Only available for TIR elements.
- ltr_identity: Only available for structural LTR annotations. This is the sequence identity between the 5' and 3' LTRs of the element.
- Method: Indicate if this entry is produced by structural annotation or homology annotation.
- motif: Only available for structural LTR annotations. This is the di-nucleotide motif at the start and end of the LTR region. For example, TGCA means the LTR has TG at the start and CA at the end.
- tsd/TSD: Target site duplication, only available for structural LTR and TIR annotations. For TIRs, TSD contains both sequences from the 5' and 3' end of the element. For example, AATCTTCTTT_AATCTTCTTT_100.0 has AATCTTCTTT at the 5' end and AATCTTCTTT at the 3' end, and their identity is 100%.
- TIR: Terminal inverted repeat, only available for structural TIR annotations. It contains both sequences from the 5' and 3' end of the element. For example, GAGTAAAGTA_TACTTTACTC has GAGTAAAGTA at the 5' end and TACTTTACTC at the 3' end.
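
The column and attribute layout described above can be sketched in Python as follows. This is only an illustrative parser, not code from EDTA: `parse_gff3_line` is a hypothetical helper, and splitting on arbitrary whitespace is an assumption made so the sample line from this page parses as shown (real GFF3 files are tab-separated).

```python
def parse_gff3_line(line):
    """Split one annotation line into the nine GFF3 columns plus an attribute dict."""
    cols = line.rstrip("\n").split(None, 8)
    keys = ["seqid", "source", "type", "start", "end", "score", "strand", "phase"]
    record = dict(zip(keys, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The score is only present for homology entries; "." means not available.
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["attributes"] = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return record

# The structural TIR entry from the sample above.
line = (
    "Tzi8_chr1 EDTA Mutator_TIR_transposon 328718 329794 . . . "
    "ID=TE_struc_906;Name=Tzi8_chr1:328718..329794;Classification=DNA/DTM;"
    "Sequence_ontology=SO:0002280;Identity=1;Method=structural;"
    "TSD=AATCTTCTTT_AATCTTCTTT_100.0;TIR=GAGTAAAGTA_TACTTTACTC"
)
rec = parse_gff3_line(line)

# For TIR elements, TSD holds the 5' sequence, the 3' sequence, and their percent identity.
tsd_5p, tsd_3p, tsd_identity = rec["attributes"]["TSD"].split("_")
# TIR holds the 5' and 3' terminal inverted repeat sequences.
tir_5p, tir_3p = rec["attributes"]["TIR"].split("_")

print(rec["type"], rec["attributes"]["Method"], tsd_5p, tsd_identity, tir_3p)
```

Keeping the attributes in a plain dict makes it easy to test for keys such as Method, ltr_identity, or TSD, which are only present for some entry types.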

What do "TE_homo_xxx" and "TE_struc_xxx" mean?
"TE_homo_xxx" means this entry is annotated based on homology to the TE library. "TE_struc_xxx" means this entry is annotated by structural methods. Each TE annotation entry has a unique ID.
Why do some annotations have directions (+/-) and some do not?
All homology-based annotations ("TE_homo_xxx") should have directions as assigned by RepeatMasker. The TE library entry is assumed to be + when assigning the direction. However, this is not always the correct assignment because we don't know the direction of some TEs lacking coding information. Some structural-based LTR annotations ("TE_struc_xxx") may have directions. The direction is determined by the consensus direction of coding sequences in the element. For elements that lack coding sequences or do not have a consensus coding direction, the direction is labeled "?". Structural-based TIR transposons never have directions and are labeled ".". All structural-based Helitrons have directions.
Why are structurally intact LTRs and TIRs represented differently in the GFF3 file?
Structurally intact LTR elements contain the following structural features annotated by LTR_retriever:
- repeat_region: The entire repeat region, including TSD and the LTR element.
- lTSD: Left target site duplication. This is actually not part of the LTR element but a 4-6bp short repeat created during the insertion of the LTR.
- lLTR: Left long terminal repeat.
- LTRRT: The whole LTR retrotransposon containing lLTR, internal region, and rLTR.
- rLTR: Right long terminal repeat.
- rTSD: Right target site duplication. This is actually not part of the LTR element but a 4-6bp short repeat created during the insertion of the LTR.
Structurally intact TIR elements have TSD and TIR information in the attributes field of the file.
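
The parent/child layout of structural LTR entries can be walked by grouping features on their Parent attribute. The sketch below is a hedged illustration: `group_ltr_features` is a hypothetical helper, the sample lines are the repeat_region_3 entries from this page with their attributes abbreviated, and the whitespace splitting is an assumption (real files are tab-separated).

```python
from collections import defaultdict

def group_ltr_features(gff_lines):
    """Map each parent ID to the (ID, type, start, end) of its child features."""
    children = defaultdict(list)
    for line in gff_lines:
        cols = line.split(None, 8)
        if len(cols) < 9:
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].strip().split(";") if "=" in p)
        if "Parent" in attrs:
            children[attrs["Parent"]].append(
                (attrs["ID"], cols[2], int(cols[3]), int(cols[4]))
            )
    return dict(children)

# Entries for repeat_region_3 from the sample above (attributes abbreviated).
sample = [
    "Tzi8_chr1 EDTA repeat_region 221830 233499 . - . ID=repeat_region_3;Method=structural",
    "Tzi8_chr1 EDTA target_site_duplication 221830 221834 . - . ID=lTSD_3;Parent=repeat_region_3",
    "Tzi8_chr1 EDTA long_terminal_repeat 221835 224508 . - . ID=lLTR_3;Parent=repeat_region_3",
    "Tzi8_chr1 EDTA Gypsy_LTR_retrotransposon 221835 233494 . - . ID=LTRRT_3;Parent=repeat_region_3",
    "Tzi8_chr1 EDTA long_terminal_repeat 230798 233494 . - . ID=rLTR_3;Parent=repeat_region_3",
    "Tzi8_chr1 EDTA target_site_duplication 233495 233499 . - . ID=rTSD_3;Parent=repeat_region_3",
]

features = group_ltr_features(sample)["repeat_region_3"]
by_id = {fid: (start, end) for fid, ftype, start, end in features}

# The internal region lies between the end of lLTR and the start of rLTR.
internal = (by_id["lLTR_3"][1] + 1, by_id["rLTR_3"][0] - 1)
print(internal)
```

The internal region is not a separate entry in the file, so deriving it from the two LTR coordinates, as done here, is one way to recover it.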
Why do some TE families have coordinates as their names? e.g., ID=TE_struc_37;Name=Chr2:800875..805238;Classification=DNA/DTT
First, these TEs are all structurally annotated ("TE_struc_37"). Further, they cannot be clustered with other structural TEs, so there's no need to use another name to represent them. They are likely single copies in the genome and are lower-confidence TE candidates. Thus, their sequences are not used to construct the TE library, to avoid inflating false annotations. You may filter them out based on their naming structure.
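
Filtering by naming structure could look like the Python sketch below. The regular expression is an assumption inferred from the single example name on this page (seqid, a colon, then start..end), and both function names are hypothetical; check the pattern against your own file before relying on it.

```python
import re

# Assumed naming pattern: <seqid>:<start>..<end>, as in Name=Chr2:800875..805238
COORD_NAME = re.compile(r"^.+:\d+\.\.\d+$")

def is_coordinate_named(name):
    """Return True if a family name looks like a genomic coordinate."""
    return COORD_NAME.match(name) is not None

def drop_coordinate_named(gff_lines):
    """Keep every line except annotation entries whose Name is a coordinate."""
    kept = []
    for line in gff_lines:
        cols = line.split(None, 8)
        if len(cols) == 9:
            attrs = dict(p.split("=", 1) for p in cols[8].strip().split(";") if "=" in p)
            if is_coordinate_named(attrs.get("Name", "")):
                continue
        kept.append(line)
    return kept

sample = [
    "Tzi8_chr1 EDTA helitron 217381 219292 9230 + . ID=TE_homo_201;Name=TE_00004146;Classification=DNA/Helitron;Method=homology",
    "Tzi8_chr1 EDTA Mutator_TIR_transposon 328718 329794 . . . ID=TE_struc_906;Name=Tzi8_chr1:328718..329794;Classification=DNA/DTM;Method=structural",
]
for line in drop_coordinate_named(sample):
    print(line)
```

Header and comment lines have fewer than nine fields and pass through unchanged. Removing these entries trades sensitivity for fewer potentially false annotations.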
Why do TE families in the GFF3 file not have consistent classifications?
Ideally, each TE family (each sequence in the TE library represents a TE family) should have only one classification, but in the GFF3 file you may observe multiple classifications of the same family:
```
  Count Name            Sequence_ontology
     13 TE_00000022     helitron
    161 TE_00000022_LTR LTR_retrotransposon
     12 TE_00000023     helitron
     25 TE_00000024     helitron
     39 TE_00000024     CACTA_TIR_transposon
     37 TE_00000025     helitron
    444 TE_00000025     CACTA_TIR_transposon
     19 TE_00000026     helitron
    314 TE_00000026     CACTA_TIR_transposon
      1 TE_00000027     Mutator_TIR_transposon
     21 TE_00000027     helitron
   1796 TE_00000027_INT LTR_retrotransposon
```
This represents the inconsistency between the structural and homology annotations. During the structural annotation process, each element is identified and classified by structural features, and the non-redundant TE library is generated from these structurally intact TEs. To classify structurally intact elements at the family level, EDTA uses the TE library to annotate each intact TE, which is where inconsistency may occur. If an intact TE shares an extensive portion (passing the 80-80-80 threshold) with an LTR sequence in the library (e.g., TE_00000022_LTR), then this intact TE will be classified as a copy of that family. However, if this intact TE was not structurally annotated as an LTR (e.g., it was annotated as a helitron), an inconsistency occurs.
In practice, we found that inconsistencies mostly involve helitrons, for two main reasons:
1. Technically, helitrons are very difficult to annotate accurately, and thus we have relatively low confidence in helitron annotations (benchmarked in the EDTA paper).
2. Biologically, helitrons tend to capture sequences when they transpose, and many TEs could be captured during the process. If a helitron has captured more non-helitron TE sequence than its own sequence, it could pass the classification threshold and be classified as a non-helitron.
For example, 161 copies of TE_00000022_LTR were correctly annotated as LTR retrotransposons but 13 were misclassified as helitrons, so the classification inconsistency of this family is 13/174 = 7.5%. For TE_00000024, 25 copies were classified as helitron and 39 as CACTA_TIR_transposon, so the inconsistency is much higher (25/64, about 39%). Overall, most inconsistencies are not severe. This represents one of the most difficult challenges in TE annotation.
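
The percentages in the example above can be reproduced with a short Python sketch. Two things here are assumptions for illustration rather than EDTA's own definition: inconsistency is taken as the fraction of copies outside the majority classification, and the `_LTR`/`_INT` suffixes are stripped so that library entries such as TE_00000022_LTR count toward family TE_00000022. Both helper names are hypothetical.

```python
from collections import defaultdict

def base_family(name):
    """Strip the _LTR/_INT suffix (assumed to mark parts of one LTR family)."""
    for suffix in ("_LTR", "_INT"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

def family_inconsistency(rows):
    """rows: (count, name, classification) tuples.

    Returns, per family, the fraction of copies that are NOT in the
    family's most frequent classification.
    """
    per_family = defaultdict(lambda: defaultdict(int))
    for count, name, classification in rows:
        per_family[base_family(name)][classification] += count
    result = {}
    for family, by_class in per_family.items():
        total = sum(by_class.values())
        result[family] = 1 - max(by_class.values()) / total
    return result

# Counts copied from the table above.
rows = [
    (13, "TE_00000022", "helitron"),
    (161, "TE_00000022_LTR", "LTR_retrotransposon"),
    (25, "TE_00000024", "helitron"),
    (39, "TE_00000024", "CACTA_TIR_transposon"),
]
for family, value in family_inconsistency(rows).items():
    print(family, f"{value:.1%}")
```

With these inputs the sketch gives 13/174 for TE_00000022 and 25/64 for TE_00000024, matching the figures discussed above.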
How to summarize TE annotation in my genome?
It is recommended to use the summary file genome.fa.mod.EDTA.TEanno.sum produced by EDTA to report TE annotation. This report accounts for overlapping annotations and has reasonable summaries for TE length and copy numbers. This file is produced by the script ./util/buildSummary.pl, modified from the RepeatMasker package. The script was originally developed by Robert M. Hubley and adapted to EDTA. Please read more in thread #169 on reproducing the summary file.
How to evaluate the quality of annotation?
Without prior knowledge and manual curation, it's difficult to know if the annotation is correct or wrong. One way to estimate the quality of the annotation is to see if the annotation is done consistently, for example, by checking whether a sequence and its similar sequences are consistently annotated as LTR. You may think consistent annotation is a basic requirement, but it's not! It's quite challenging to produce a consistent TE annotation due to the complexity of TE sequences such as nested insertions. In general, genomes annotated by a curated TE library have inconsistency lower than 3%. EDTA has inconsistency of 1%-40% depending on the TE category, and mostly 10%-20%. Other de novo tools may have inconsistency as high as 80%! The original paper has more descriptions and discussions.
You may use the --evaluation 1 parameter to obtain the annotation consistency. If you forget to add this in your run, you may rerun EDTA with --step anno --anno 1 --overwrite 0 --evaluation 1, and it should skip most of the finished steps and start the evaluation. The evaluation step involves all-vs-all blasts and could take quite some time. The evaluation step won't change the annotation at all.
How to interpret the evaluation results?
The files ".stat.all.sum", ".stat.nested.sum", and ".stat.redun.sum" describe the level of annotation inconsistency in the data. The inconsistency could be due to technical issues (EDTA fails to classify some TEs accurately) or biological reasons (some TEs are nested inside other TEs). Basically, the ".stat.redun.sum" file describes the technical inconsistency and the ".stat.nested.sum" file describes the biological inconsistency, while the ".stat.all.sum" file is the summary of both. See more discussions in #260.
How to improve the annotation quality?
There are some ad hoc ways to improve the annotation quality. Because EDTA has low power to annotate SINEs and LINEs, getting this part right is challenging for genomes with high proportions of SINEs/LINEs (e.g., vertebrate genomes).
I recommend manually preparing a non-redundant SINE/LINE library for your species and giving it to EDTA (--curatedlib lib.fasta). This will significantly improve the annotation of both SINEs/LINEs and other TEs. As far as I know, there is a SINE database that collects some curated SINE sequences. NCBI is also a good place to look. A nice pipeline for de novo SINE annotation was recently published. You may also do a more generic Google search for your species.
For post hoc improvements, you may filter out those single-copy elements at the expense of sensitivity.
Can EDTA annotate non-TE repeats? For example, telomeric repeat, centromeric repeat, knob repeat
Yes, but not in a de novo fashion. You may identify or collect non-TE repeats manually and provide them to EDTA as a curated library (--curatedlib lib.fasta); EDTA will then use RepeatMasker to annotate sequences similar (≤40% divergence) to these repeats. If a provided repeat sequence is "too short", for example, a 7-bp telomeric repeat, it will not be utilized by RepeatMasker, and thus the final annotation won't include it.
Why are the results of multiple runs slightly different?
The library-making process in EDTA is random (./util/cleanup_nested.pl), which reflects the uncertainty we face when selecting representative sequences. Fortunately, the differences are not very big as far as we have observed. See more discussions in #246.