Repeat masking benchmark

We compare two repeat masking algorithms: tantan masking [1] applied to both query and target sequences (the Diamond default), and the default BLASTP SEG masking (applied to target sequences only; window=10, locut=1.8, hicut=2.1).
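To give a feel for what the SEG parameters control, here is a deliberately simplified sketch of window-entropy masking. It is not the SEG implementation used by BLASTP (which includes a further refinement stage) and it is unrelated to tantan. The function names `window_entropy` and `seg_like_mask` are hypothetical. The sketch assumes that a window whose Shannon entropy (in bits) is at most `locut` triggers a low-complexity region, which is then extended over neighbouring windows whose entropy is at most `hicut`.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def seg_like_mask(seq: str, window: int = 10,
                  locut: float = 1.8, hicut: float = 2.1) -> str:
    """Mask low-complexity residues with 'X' (simplified, SEG-like).

    Assumption: trigger on windows with entropy <= locut, then extend
    left and right through contiguous windows with entropy <= hicut.
    """
    n = len(seq)
    if n < window:
        return seq
    # Entropy of every window start position.
    ent = [window_entropy(seq[i:i + window]) for i in range(n - window + 1)]
    masked = [False] * n
    i = 0
    while i < len(ent):
        if ent[i] <= locut:
            lo = i
            while lo > 0 and ent[lo - 1] <= hicut:
                lo -= 1
            hi = i
            while hi + 1 < len(ent) and ent[hi + 1] <= hicut:
                hi += 1
            # Mask every residue covered by windows lo..hi.
            for j in range(lo, hi + window):
                masked[j] = True
            i = hi + 1
        else:
            i += 1
    return "".join("X" if m else c for c, m in zip(seq, masked))


if __name__ == "__main__":
    print(seg_like_mask("ACDEFGHIKL" + "S" * 12 + "MNPQRTVWYA"))
```

Lower cutoffs mask less aggressively: only strongly biased windows trigger a region, and the extension stops sooner.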

We created a database of decoy sequences by applying DecoyPYrat [2] to the UniRef50 database (July 2021, 50,106,394 sequences). The queries are reference assemblies of several organisms, aligned against the decoy database using Diamond in very-sensitive mode. Because every database sequence is a decoy, any reported alignment counts as an error. The plots show the number of query proteins with at least one such error as a function of the e-value threshold.
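The error count described above could be computed roughly as follows. This is a minimal sketch, not the script used for the benchmark. It assumes the alignments are in Diamond's default BLAST tabular format, with the query ID in column 1 and the e-value in column 11. The function names `best_evalue_per_query` and `queries_with_error` and the sample rows are hypothetical.

```python
import io
from typing import Dict, Iterable, List, TextIO


def best_evalue_per_query(tsv: TextIO) -> Dict[str, float]:
    """Return the lowest e-value seen for each query.

    Every hit is against a decoy, so the lowest e-value decides at which
    thresholds the query counts as having an error.
    """
    best: Dict[str, float] = {}
    for line in tsv:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def queries_with_error(best: Dict[str, float],
                       thresholds: Iterable[float]) -> List[int]:
    """Count queries with at least one decoy hit at or below each threshold."""
    return [sum(1 for e in best.values() if e <= t) for t in thresholds]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = io.StringIO(
        "q1\td1\t30.0\t50\t35\t0\t1\t50\t1\t50\t1e-5\t45.0\n"
        "q1\td2\t25.0\t40\t30\t0\t1\t40\t1\t40\t1e-2\t30.0\n"
        "q2\td3\t22.0\t40\t31\t0\t1\t40\t1\t40\t0.5\t25.0\n"
    )
    best = best_evalue_per_query(sample)
    print(queries_with_error(best, [1e-3, 1.0]))
```

Keeping only the best e-value per query makes the count at each threshold a simple comparison, so the whole curve can be produced from a single pass over the alignment file.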


Assembly accessions: GCF_000008865.2, GCF_000001735.4, GCF_000471905.2, GCF_002880755.1

[1] Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011;39(4):e23. doi:10.1093/nar/gkq1212

[2] Wright JC, Choudhary JS. DecoyPyrat: Fast Non-redundant Hybrid Decoy Sequence Generation for Large Scale Proteomics. J Proteomics Bioinform. 2016;9(6):176-180. doi:10.4172/jpb.1000404