7 Masking

Repeat masking is the final(!) step we will take to prepare our reference sequence. The process involves searching the assembly for commonly repeated sequences, including telomeres and transposons, and marking them so they do not register during analyses of the genome. We take this step to speed up downstream processing and reduce the compute needed to analyze the genome, since a huge portion of a large, complex genome tends to be repeats.

There are two options for masking: we can either hard mask our sequences by replacing every base in a repeat with N (RepeatMasker's default) or X (its -x option), or we can soft mask by converting repeat bases to lowercase a, c, g, or t. I prefer soft masking: tools that recognize lowercase masking can skip repeats just as they would with hard masking, but the sequence information is preserved in case anyone in the future wants to study the repeats.
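To make the difference concrete, here is a minimal sketch on a made-up toy sequence (not taken from the assembly). The lowercase run stands in for a soft-masked repeat, and the hard-masked version is derived from it by replacing each lowercase base with N. The variable names are just for illustration.

```shell
# Toy sequence (hypothetical); the lowercase run represents a soft-masked repeat.
soft="ACGTacgtacgtTTGA"

# Hard masking: replace each masked (lowercase) base with N, discarding the sequence.
hard=$(printf '%s\n' "$soft" | tr 'acgtn' 'NNNNN')

echo "soft: $soft"
echo "hard: $hard"
```

Going from soft to hard masking is a one-way conversion like this one, while the reverse is impossible, which is the practical argument for keeping the soft-masked file as the reference.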

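To mask, I used the program RepeatMasker (https://www.repeatmasker.org/), run here with HMMER as its search engine (the -e hmmer option), which searches the assembly's nucleotide sequence against profile HMMs of known repeat families. It is fast, easy to use, and will produce both a masked FASTA and, with the -gff option, an annotation file (.gff) describing the repeats.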

RepeatMasker -noisy -dir . -a -xsmall -html -gff -e hmmer -pa 20 -species viridiplantae ../6.polishing/tetra.polished.fasta
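Once the run finishes, it can be useful to sanity-check how much of the assembly was actually masked. The sketch below is one way to do that, not part of RepeatMasker itself: masked_fraction is a hypothetical helper that counts lowercase a, c, g, t, and n in the sequence lines of a FASTA file. It is demonstrated on a tiny made-up file. On the real output I would expect to point it at tetra.polished.fasta.masked in the current directory, assuming RepeatMasker's usual convention of appending .masked to the input name.

```shell
# Hypothetical helper: report the fraction of soft-masked (lowercase) bases in a FASTA file.
masked_fraction() {
    awk '!/^>/ {
             total  += length($0)            # count bases before gsub edits the line
             masked += gsub(/[acgtn]/, "")   # gsub returns the number of lowercase bases removed
         }
         END {
             if (total > 0)
                 printf "%d of %d bases masked (%.2f%%)\n", masked, total, 100 * masked / total
         }' "$1"
}

# Demonstration on a tiny made-up FASTA file.
demo=$(mktemp)
printf '>chr1\nACGTacgt\nacGT\n' > "$demo"
masked_fraction "$demo"
rm -f "$demo"
```

RepeatMasker also writes its own summary table (the .tbl file) alongside the masked FASTA, so this check is mainly a quick cross-check that the lowercase masking made it into the file you plan to use downstream.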