20. Masking a genome for TEs - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Now that you've run a TE-discovery pipeline on the C. elegans genome, we need to see where those TEs are and how much of the genome they occupy. To do that, we use a second package, RepeatMasker. The nomenclature comes from a period early in the genomics era when people weren't quite sure what to do with TEs and even if they did, the computational power to deal with thousands and thousands of identical regions just wasn't there. Thus, many researchers would just hide (mask) them, in effect ignoring 50% or more of the genome when running their analyses. Fortunately, computation and our understanding of the importance of TEs has come a long way since then.
RepeatMasker will take a library of known TEs and identify where they exist in a genome. Conceptually, it's BLAST but specialized for TEs. But, keep in mind that the results will only be as good as the queries in the library you give it. More on that later.
RepeatMasker is easily installed using conda.
interactive -p nocona
NOTE: this is not working this year, so instead install repeatmasker using the HPCC module system:
module load gcc/9.2.0
module load repeatmasker/4.0.9-p2
Skip this part this year:
. ~/conda/etc/profile.d/conda.sh
conda activate
conda create -n repeatmasker
conda activate repeatmasker
conda install -c bioconda repeatmasker
Now that we've skipped the conda installation and installed RepeatMasker using HPCC, you should be good to go.
As always, you'll need to set up a directory structure, get your genome, and in this case, locate the library of TEs from the previous exercise. Fortunately, the first two of those were done as part of the previous exercise. All you'll need to do is get the library. Note that the text below assumes that you named your genome database 'cEle_db', as directed in the previous exercise.
cd /lustre/scratch/[eraider]/gge2022/te/repeatmasker
ln -s /lustre/scratch/daray/gge2022/te/repeatmodeler/cEle-families.fa
If you have difficulties accessing this file, use this one instead: /lustre/scratch/jkorstia/gge2022/te/cEle-families.fa
How long RepeatMasker runs is very dependent on two variables, the size of the genome and the size of the library, for what should be obvious reasons. The first variable is how much space there is to search, and the second is how many things to search for.
Because my lab often uses RepeatMasker to search very large genomes for very many TEs, if we run RepeatMasker the 'old-fashioned way' as described on the repeatmasker.org site, a single run can take days to weeks. Thus, we've worked with Robert Hubley and Arian Smit (the authors of RepeatMasker) to find shortcuts. You will be running RepeatMasker both ways to show you the difference.
There's absolutely nothing wrong with running RepeatMasker in this fashion, especially if you have a relatively small genome and/or relatively few repeats to identify. In fact, it's probably better to do it this way for technical reasons we can discuss. But, for a large genome and a large library, it can take a long time. Fortunately, the C. elegans genome is small. Our library is relatively large but it's not unmanageable.
Set up your directory structure.
mkdir /lustre/scratch/[eraider]/gge2022/te/repeatmasker/oldfashioned
Because we're working with a relatively small genome that has relatively few repeats, let's just do this on the command line and watch what happens.
interactive -p nocona
cd /lustre/scratch/[eraider]/gge2022/te/repeatmasker/oldfashioned
. ~/conda/etc/profile.d/conda.sh
conda activate repeatmasker
The basic command for RepeatMasker is:
RepeatMasker
There are many, many options to consider. Feel free to examine any of several sites that have those options listed. Here's one. For our purposes, we will run the command as follows:
RepeatMasker ../../assembly/cEle.fa -lib ../../repeatmodeler/cEle-families.fa -dir . -s -pa 12 -xsmall > of_log.txt &
Breaking it down:
'RepeatMasker' calls the program.
'../assembly/dMau.fa' is the location and name of the genome file.
'-lib ../../repeatmodeler/cEle_db-families.fa' is the name and location of the library of known TEs.
'-dir .' simply directs the output to the current directory.
'-s' sets the sensitivity. In this case, the most sensitive setting. Takes longer but is better at identification.
'-pa 12' uses 12 processors running in parallel to do the job. This speeds things up a bit.
'xsmall' tells the program to mask any TEs in the genome to lower-case letters rather than the default, which is N's.
'&' moves the job to the background, allowing you to use this terminal for other things while you wait.
Run this command and wait for the output. While you wait you can keep track of the progress with:
tail -f of_log.txt
. You'll see that the entire job consisting of over 2000 batches is being processed 12 batches at at time.
While you wait for the first run to finish, let's set up the second, more efficient method of running RepeatMasker. This will be important if you have a large genome and/or many repeats to identify. The process is essentially one that breaks the genome assembly into as many parts as you like and then runs each of them separately and in parallel. Each separate run is also run using any number of processors that parallelizes that portion, increasing the speed by several orders of magnitude, depending on how many processors you use, how many pieces you break the genome into, and how many processors are available to you on HPCC. Each piece of the genome is processed, generating all of the output files. After all pieces are done, a second script will combine all of the smaller output files into a single set of output files. To do all this, you will require a python script (sge_clusterrun.py) that my lab generated, with the help of the RepeatMasker guys.
Get a second terminal and log in as usual. Once logged in, get a single processor interactive session and follow the instructions below.
interactive -p nocona
mkdir /lustre/scratch/[eraider]/gge2022/te/repeatmasker/newfangled
To run our python script, you will need the necessary working environment.
We can just modify the existing repeatmasker environment.
. ~/conda/etc/profile.d/conda.sh
conda activate repeatmasker
conda install -c bioconda pyfaidx
conda install biopython pandas
Now you're ready to go.
cd /lustre/scratch/[eraider]/gge2022/te/repeatmasker/newfangled
cp ../../repeatmodeler/cEle-families.fa .
python /home/daray/gitrepositories/bioinfo_tools/sge_clusterrun.py -i ../../assembly/cEle.fa -b 50 -lib cEle-families.fa -dir . -xsmall
Notice that the only real difference is that we tell the program to break the genome assembly into 50 pieces (-b = batches).
Run that. We wrote this script to give you information on what's happening as it works. After a few seconds, you'll see a new directory pop up and three new files, cEle.2bit, qsub.sh and doLift.sh.
'cEle.2bit' is a compressed version of the genome .fa file. It just saves a little space but is necessary for some downstream steps.
'qsub.sh' is just a list of job submission commands that will run all 50 parts of the genome.
Run it with sbatch qsub.sh
and you'll see all 50 jobs get submitted to the scheduler. Here, you'll be dependent on the schedule to decide how quickly all of these jobs run. If you haven't been running much and/or the queue isn't very busy, the scheduler will move many of your jobs to the top of the queue. Otherwise, it may take some time to run all of the jobs.
'doLift.sh' is a script that will take all 50 output files and compile them into a single set of output files.
Keep track of the queue using squeue -u
. Once all jobs have finished running, run sbatch doLift.sh
.
After a little while, you will see several files appear in your directory.
If you compare the files in the 'oldfashioned' and 'newfangled' directories, you'll see significant similarities. One minor difference is that the output in the newfangled directory is all compressed.
Here's what all those files are:
File | Directory | Contents |
---|---|---|
cEle.fa.out | oldfashioned & newfangled (compressed) | The file most people are interested in. Coordinates of each TE hit in the genome plus some other information. |
cEle.fa.cat.gz | oldfashioned | A detailed file showing the same information as the .out file but including an alignment between the query and subject. |
cEle.fa.align.gz | newfangled | same as the .cat file. |
cEle.fa.masked | oldfashioned & newfangled (compressed) | A copy of the genome with TEs masked as N or lower-case, depending on your input command. |
cEle.fa.tbl | oldfashioned & newfangled | An old-fashioned table showing genome proportions from each TE class. |
cEle.summary.gz | newfangled | Similar to .tbl but with much more detailed information. |
cEle.divsum.gz | newfangled | Table with mean divergence values from the consensus for each query. Useful for estimating TE ages. |
"Garbage in, garbage out." The results of any analysis are only as good as the input you provide. That's true for any of the exercises you've done so far but we're going to illustrate that principle here.
The library of TEs you generated using RepeatModeler is a de novo library and simply gives the best automated estimate of what TEs might be present in the assembly used to identify them. In reality, TE libraries need quite a bit of curation to ensure that the TEs in the library are really what the computer says they are.
To illustrate the differences you might get from just using raw RepeatModeler output when compared to using a well-curated library, I want you to re-run your RepeatMasker analyses using a well-curated library I obtained from https://www.dfam.org/home, a database of TEs.
Re-run your RepeatMasker analysis in a new directory under the following conditions:
- The output should be in a new directory called 'repeatmasker-dfam'.
- Use the library I've provided here:
/lustre/scratch/daray/gge2022/te/repeatmasker-dfam/cEle_dfam_families.fa
. If you have difficulties accessing this file, use this one instead: /lustre/scratch/jkorstia/gge2022/te/cEle_dfam_families.fa - Run the analysis the newfangled way, using the sge_clusterrun.py script.
Complete the following tasks/answer the questions below.
Each of the RepeatMasker runs should have generated a 'landscape.html.gz' file. Unzip that file using
gunzip -c cEle-landscape.html.gz >cEle-landscape.html
.
Then, open the file using a web browser.
The one generated using the dfam library should look something like this:
Interpreting this graph is relatively easy. On the X-axis, there are divergence values. For our purposes those can equate to time. Let's talk about why that is.
In this case, the divergence is a number that represents how different from the consensus sequence each individual instance of a given TE in the genome is from the consensus. The consensus is a sequence representing what we thing the original element looked like and a bunch of those are what are in our TE library. If the consensus represents the original sequence that gave rise to all the copies, then when each of those copies looked exactly like the consensus when it was first deposited in the genome. Bu, as time goes by, mutation processes will degrade the individual insertions, making them 'diverge' from the consensus. The longer they sit there, the more diverged they get. Thus, divergence ~= time.
Getting back to the graph, the intersection at the Y-axis represents 'very recent' and more to the right takes use into the more distant past. The Y-axis is proportion of the assembly occupied by the various categories of TE observed. Now answer the following questions/follow the instructions and upload your responses to Assignment 19 - TE masking.
-
Contrast the two pie charts with regard to how much of the assembly is made up of Helitrons, DNA/TcMar elements, and unknown TEs?
-
How much of the genome consists of LINE/RTE elements that were deposited in the genome most recently?
-
Generally and in your own words, how would your understanding of genome content differ when using the uncurated library compared to using the curated library? I can see at least three ways immediately. Which annotation would you consider to be 'better' and why?
Unzip the 'cEle.fa.out.gz' in the 'repeatmasker-dfam' directory and use the information in the file to answer these questions:
- What are the names, types, and genome coordinates of the first two TEs identified in this genome? The last two? Provide me with evidence for your answers and the command you used to get that evidence.