How to prepare a curated library to maximize the efficacy of EDTA - oushujun/EDTA GitHub Wiki

Benefits

The --curatedlib parameter of EDTA lets you use an existing repeat curation to improve EDTA's annotation. This parameter brings several benefits:

  1. Preserve existing knowledge such as TE family names.
  2. Improve annotation consistency by classifying some of the EDTA-identified families as known families. The classified families will (very likely) not have misclassification issues, reducing the pool of potential misclassifications.
  3. Help to identify repeats that are missed by EDTA.

How this works

  1. The provided curated library will be used to mask the EDTA-generated library. The 80-80-60 (≥80% coverage, ≥80bp length, ≥60% identity) rule is used to replace sequences in the EDTA library with the curated library. The remaining sequences in the EDTA library are stored in the $genome.EDTA.TElib.novel.fa file, which is combined with the curated library file to form the $genome.EDTA.TElib.fa file.
  2. The provided curated library will be used to mask intact TEs identified by EDTA. The 80-80-80 (≥80% coverage, ≥80bp length, ≥80% identity) rule is used to rename intact TEs. The remaining intact TEs will keep the original TE family ID generated by EDTA.
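The two thresholds above can be written as a single predicate. This is a minimal sketch of the rule as stated in this section, not EDTA's actual implementation. The `Hit` class, its field names, and the choice to express coverage and identity as percentages are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One curated-library match against an EDTA sequence (hypothetical structure)."""
    coverage: float  # percent of the EDTA sequence covered by the match (assumed meaning)
    length: int      # length of the match in bp
    identity: float  # percent identity of the match


def passes_rule(hit, min_coverage=80.0, min_length=80, min_identity=60.0):
    """Return True if the hit meets all three thresholds.

    The defaults correspond to the 80-80-60 rule; pass min_identity=80.0
    for the 80-80-80 rule applied to intact TEs.
    """
    return (
        hit.coverage >= min_coverage
        and hit.length >= min_length
        and hit.identity >= min_identity
    )


hit = Hit(coverage=85.0, length=120, identity=65.0)
library_match = passes_rule(hit)                    # 80-80-60 rule
intact_match = passes_rule(hit, min_identity=80.0)  # 80-80-80 rule
```

A hit at 65% identity would satisfy the library rule but not the stricter rule for intact TEs, which is why a family can be replaced in the library while an intact copy keeps its EDTA-generated ID.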

Formatting requirements

Sequences in the curated library should follow the RepeatMasker naming convention, family#(sub)class/super_family, in which # and / are required separators. The family can be named by the user, but preserving existing names is one of the purposes of the --curatedlib parameter. If you are collecting sequences from NCBI, you may use the NCBI ID as the family ID if no better name is available. The (sub)class is the class- or subclass-level classification of the repeat, such as LTR, LINE, SINE, TIR, DNA (DNA TE), rDNA, Satellite, telomere, and Cent (centromere). The super_family classifies the repeat at a lower level, for example Copia, Ty3, Helitron, hAT, Mutator, or Tourist.

For the full list of superfamilies supported by EDTA, see EDTA/util/TE_Sequence_Ontology.txt. The Alias column lists every (sub)class/super_family that EDTA recognizes. If a classification is missing from this file, please open an issue. For examples, see the curated libraries in EDTA/database for rice (rice6.9.5.liban), maize (maizeTE11122019), and Arabidopsis (athrep.updated.nonredun.fasta).
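Before running EDTA, it can help to check that every header in the library has the family#(sub)class/super_family shape. The sketch below is not part of EDTA; the function names are hypothetical, and it only checks the separators. It does not verify that the classification appears in TE_Sequence_Ontology.txt.

```python
import re

# family#(sub)class/super_family: three non-empty parts with no
# whitespace, '#', or '/' inside any part (an assumption for this sketch).
NAME_RE = re.compile(r"([^#/\s]+)#([^#/\s]+)/([^#/\s]+)")


def parse_header(line):
    """Return (family, subclass, super_family), or None if the header is malformed."""
    if not line.startswith(">"):
        return None
    fields = line[1:].split()
    if not fields:
        return None
    # Only the first whitespace-delimited token is treated as the name.
    match = NAME_RE.fullmatch(fields[0])
    return match.groups() if match else None


def invalid_headers(fasta_text):
    """List every header line in the FASTA text that fails parse_header."""
    return [
        line
        for line in fasta_text.splitlines()
        if line.startswith(">") and parse_header(line) is None
    ]


example = ">Arabidopsis_telomere#telomere/telomere\nTTTAGGG\n>SolRep00002\nACGT\n"
bad = invalid_headers(example)
```

To check a real file, read it with `open(path).read()` and inspect the returned list; an empty list means every header has both separators.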

What to include

It's recommended to include all curated repeat sequences of your target species or sister species. Please also follow these rules:

  1. Be skeptical about formatted family names, such as SolRep00002, which suggest software-generated annotation and low confidence. Because curated sequences replace matching sequences in the EDTA library and rename intact TEs (see How this works above), you do not want any low-confidence sequences in the curated library.
  2. Do not include unclassified sequences, such as RandI-1#unknown/unknown, which will counteract the purpose of this parameter.
  3. If you study a particular kind of TE, such as Athila, be sure to manually curate these elements and include them in the curated library, even if you have only one or a few sequences.
  4. If there is no curated library for your species, you are still highly encouraged to collect some sequences yourself. You may want to focus on the repeats that could be missed by EDTA (see the next section).
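The rule about unclassified sequences is easy to apply automatically. The following sketch, which is not part of EDTA, drops entries whose classification is missing or contains "unknown". The function names and the Athila_1 entry are hypothetical. Judging whether a family name looks software-generated still needs a human.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA text; the header excludes '>'."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_unclassified(header):
    """True if the part after '#' is missing or mentions 'unknown'."""
    fields = header.split()
    name = fields[0] if fields else ""
    _, _, classification = name.partition("#")
    return classification == "" or "unknown" in classification.lower()


def drop_unclassified(text):
    """Return the (header, sequence) pairs that carry a classification."""
    return [(h, s) for h, s in read_fasta(text) if not is_unclassified(h)]


library = ">RandI-1#unknown/unknown\nACGT\n>Athila_1#LTR/Ty3\nGGCC\nTTAA\n"
kept = drop_unclassified(library)
```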

How to maximize your efforts in curation

The following items give the most return on the effort you spend on curation:

  1. The repeats highlighted in your study. You definitely want to personally make sure they are correct.
  2. rDNA, including 45S rDNA, 5S rDNA, rDNA spacer, rDNA intergenic spacer (IGS), etc.
  3. Telomeric repeats. Most plants have Arabidopsis-like telomeres (TTTAGGG or CCCTAAA). Because a single unit is too short to pass the RepeatMasker filter, you may want to manually increase the length by including multiple copies, for example:

>Arabidopsis_telomere#telomere/telomere
TTTAGGGTTTAGGGTTTAGGGTTTAGGG

  4. Tandem repeats that are known to be abundant in your species. If the repeat unit is shorter than 10bp, you may need to include multiple copies as well.
  5. Chloroplast and mitochondrion sequences. In general, these are highly repetitive sequences in your sequencing data. Think carefully about whether you want them in your repeat library. Because organelle genomes contain functional genes, some of which are homologous to nuclear genes, having organelle genomes in your repeat library means these sequences and genes will be masked. If your study only concerns the correctness of TE annotation, including them will be helpful; otherwise, masking them may affect your gene annotation.
  6. Centromere
  7. SINE
  8. LINE
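For the telomere and short tandem-repeat items in this list, the library entry can be generated rather than typed by hand. This is a minimal sketch; the function name is hypothetical, and the default of four copies only mirrors the Arabidopsis telomere example above. This page does not state the minimum length RepeatMasker requires, so choose the copy number for your own data.

```python
def tandem_entry(family, classification, unit, copies=4):
    """Build a FASTA entry holding `copies` back-to-back copies of a repeat unit.

    `classification` is the (sub)class/super_family part of the header,
    for example "telomere/telomere".
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    unit = unit.strip().upper()
    if not unit or set(unit) - set("ACGTN"):
        raise ValueError("unit must be a non-empty DNA string")
    return f">{family}#{classification}\n{unit * copies}"


entry = tandem_entry("Arabidopsis_telomere", "telomere/telomere", "TTTAGGG")
print(entry)
```

With the default settings this prints the same two-line telomere entry shown in the list above.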