19. Identifying TEs in a de novo assembly - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

IDENTIFYING TEs IN A DE NOVO ASSEMBLY

Because transposable elements (TEs) make up a substantial proportion of most eukaryotic genomes, they have significant impacts on the structure and function of those genomes. Thus, identifying which TEs are present and where they're located is a critical task in analyzing any new assembly. There are numerous pipelines for this task and they all have pros and cons. A good review of the topic is here. One of the more popular and useful pipelines is RepeatModeler.

We will be performing a de novo analysis of the C. elegans assembly that was generated in an earlier exercise.

SETTING UP

This exercise and the next are intimately related. RepeatModeler works well, but it's kind of a pain to install. There is a conda installation, but there are issues with some modules. So, I'll just get you to use the installation that I have. For future reference, that installation is at /lustre/work/daray/software/RepeatModeler-2.0. You'll need that path later. First, get an interactive session on the nocona partition.

interactive -p nocona

You'll need your genome assembly and all of the directories to hold your work.

mkdir -p /lustre/scratch/[eraider]/gge2022/te/assembly

cd /lustre/scratch/[eraider]/gge2022/te/assembly

ln -s /lustre/scratch/daray/gge2022/abyss/celegans/celegans-k96/cehybridk96-scaffolds.fa cEle.fa

mkdir ../repeatmodeler

mkdir ../bin

mkdir ../repeatmasker #for the next exercise

cd ../bin

FOR YOU TO DO

Your job is to create and run a submission script that will run a de novo RepeatModeler analysis on the C. elegans genome assembly you created with ABySS in an earlier exercise.

Here are the criteria for the assignment.

Create a submission script to run RepeatModeler on the best C. elegans genome assembly from ABySS. Save and run the script in your bin directory. (Note 1. RepeatModeler is already installed, see above. Note 2. Running RepeatModeler is a two-step process. Note 3. The documentation is at https://github.com/Dfam-consortium/RepeatModeler)
Your output should end up in the repeatmodeler directory.
You should use at least 36 processors for the job.
Make sure a log file is created for the job. It will be important later.
Refer to the genome database you build in the first step as 'cEle_db'. This will make the next exercise easier because that's the nomenclature I used.
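
If you get stuck, here is a rough sketch of how those criteria could map onto a script. Treat it as a starting point under stated assumptions, not the finished answer. The script name (rmodel_cEle.sh), the job name, and the log file name (cEle_rmodel.log) are placeholders I made up for illustration. The sketch also assumes that $USER matches your eraider ID, that the BuildDatabase and RepeatModeler executables sit directly in the installation directory, and that this version of RepeatModeler takes -pa for the processor count (newer releases use -threads, so check the help output with -h).

```shell
# Write a hypothetical submission script to the current directory (your bin).
cat > rmodel_cEle.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=rmodel_cEle
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36

# Assumption: your login name matches your eraider ID; otherwise hard-code it.
WORKDIR=/lustre/scratch/$USER/gge2022/te
RMODEL=/lustre/work/daray/software/RepeatModeler-2.0

# Run from the repeatmodeler directory so all output lands there.
cd "$WORKDIR/repeatmodeler" || exit 1

# Step 1: build the genome database, named cEle_db as the assignment asks.
"$RMODEL/BuildDatabase" -name cEle_db "$WORKDIR/assembly/cEle.fa"

# Step 2: run the de novo search and capture the screen output in a log file.
# Assumption: this version takes -pa for the processor count; confirm with -h.
"$RMODEL/RepeatModeler" -database cEle_db -pa 36 > cEle_rmodel.log 2>&1
EOF
```

Once the file looks right, you would submit it from your bin directory with sbatch rmodel_cEle.sh and keep an eye on it with squeue.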

Do not try to run this without a submission script. Creating that is an important part of the assignment.

Upload your working script to Assignment 18 - TE discovery.

ALSO FOR YOU TO DO

Locate the output directory and the log file. Using those, complete/answer the following. Upload your answers to Assignment 18 - TE discovery.

In your own words, what does the histogram in the log file tell you?
How many rounds of repeat identification did RepeatModeler complete? How do you know this?
What is the name of the output file that contains the consensus sequences of all identified potential TEs?
Use grep and wc to determine how many consensus sequences are in that file. Tell me the command that you used to answer this question.
Use the same tools to determine how many potential LINE elements are in that file. How many DNA elements?
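
If grep and wc are new to you, here is a small practice run on a toy file so you can see the pattern before trying it on the real output. The file name (toy-families.fa), the sequences, and the family names are all made up. The header layout, name#Class/Family, is my assumption about what RepeatModeler consensus headers look like, so check a few headers in your own file with head before trusting the search patterns.

```shell
# Build a tiny stand-in FASTA file (made-up names and sequences).
cat > toy-families.fa <<'EOF'
>rnd-1_family-0#LINE/L1
ACGTACGTAC
>rnd-1_family-1#DNA/TcMar-Tc1
TTGACCGGTA
>rnd-1_family-2#LINE/CR1
GGCATTACGA
>rnd-1_family-3#Unknown
CATGCATGCA
EOF

# Every FASTA record starts with '>', so counting header lines counts sequences.
total=$(grep '^>' toy-families.fa | wc -l)

# Narrow the header lines to one class label (the text right after '#').
lines=$(grep '^>' toy-families.fa | grep '#LINE' | wc -l)
dna=$(grep '^>' toy-families.fa | grep '#DNA' | wc -l)

echo "total: $total  LINE: $lines  DNA: $dna"
```

The same two-stage pipe (grab the headers, then filter on the class label) should carry over to the real consensus file once you swap in its name.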