Ageing LTR insertions - SIWLab/Lab_Info GitHub Wiki

Why do we want to age the insertions of LTRs?

LTR-RTs are a case where we can exploit biology to provide insight into larger processes. LTR-RTs are a class of retrotransposon with long terminal repeats, which are identical upon insertion at the beginning and end of the TE's sequence. Over time, mutations accumulate in these terminal repeats, roughly following a molecular clock. This allows us to more or less count the differences between the two LTRs of a LTR-RT, and calculate the time (in generations ago) it was inserted. If we do this for all LTR-RTs in a genome, we get a distribution of insertion times, and shifts in this distribution can tell us about the history of TE activity and/or selective pressures against TEs in our favourite species or between species.

The general steps

I'll go through the steps and why we do them, following the methods of Hu et al. 2011 and Slotte et al. 2013.

The pipeline is fairly simple: find the LTR-RTs, get the sequence of the LTR pairs for each element, locally align the pairs to each other, trim the hanging ends, count the differences and calculate the date. The detailed steps are as follows. Note that all of the examples assume you have the tools, e.g. gt or muscle, in your $PATH--if you don't, make sure to use the full path to the program.

Obtaining the LTR-RTs

Luckily there's a bit of software that does this for us, with high accuracy: LTRharvest, which you'll need to have installed the GenomeTools package to use.

Unfortunately, LTRharvest has an annoying habit of not using the scaffold/chromosome names you give it (it outputs scaffolds as "seqX" where X is the 0-base number of your scaffold), so I remove all scaffolds in my fasta file, remove newline ("\n") characters, and add a single scaffold name so that getting the fasta sequence for regions in the output gff3 is painless. You can get this with some variation of:

grep -v '>' <path-to-fasta> | tr -d "\n" | cat <(echo "> fake-scaffold-name") - > <path-to-output-fasta>

This is useful if you just want the distribution of insertion times of your LTRs, but if you need location information you'll either have to rerun LTRharvest with the regular fasta, or use the length of each scaffold to figure out the real scaffold from your concatenated fasta. The pipeline below assumes you have concatenated the scaffolds in your fasta.

LTRharvest works in two steps. The first is to generate a series of lookup tables which it uses to run the meat of the program. You can run this with

gt suffixerator -db <path-to-fasta> -indexname <name-of-species> -tis -suf -lcp -des -sds -dna

This will create 8 files for your species note: these will be written to the directory in which you run the script

Once you have the lookup tables, you can run LTRharvest with default parameters:

gt ltrharvest -index <name-of-species> -v -gff3 <path-to-write-gff3> -out <path-to-write-output> > <path-to-write-stdout>

You should make sure to save the stdout to a file, as this has useful results. The output file will have useful summaries of the location and size of the LTR-RTs and each of their LTRs, but the gff3 output will be the most useful for getting the sequence of LTR pairs into a usable format.

Because of the annoying re-nameing LTRharvest does to the scaffold names, I then convert the scaffold name in the gff3 output to the fake scaffold name I used in my concatenated fasta with

sp='fake-fasta-name'
sed -ie "s/seq0/$sp/g" <path-to-output-gff3>

Note that this will edit the file in place, so double check your names are right.

Getting LTRs into pairs

The next step is a bit of an intermediate step. We need to realign the LTR pairs to each other, which means we need a separate gff3 file for each. This should be as easy as splitting the gff3 into every two lines, but it's good to have a safety check to make sure the parent LTR-RT is the same for both LTRs in a pair. I wrote a script to do this that's linked at the end...

Align LTR pairs

Once you have gff3 files with pairs of LTRs, you can extract the sequence of each into a multiple fasta and align with muscle. To extract the sequence, I use bedtools:

bedtools getfasta -fi <path-to-fasta> -bed <path-to-gff3> > <path-to-output-for-pair>

Note that you'll have to do this for every pair (likely thousands), so you'll want to loop over those.

Now you can align these fasta pairs with muscle. Muscle takes multiple fastas as input (in this case a fasta file with 2 sequences corresponding to the LTR pairs) and outputs a multiple fasta local alignment--this will make sure we're comparing the right bases between the LTR pairs.

muscle3.8.31_i86linux64 -in <path-to-LTR-pairs-fasta> -out <path-to-output>.afa

Again, you need to run this for every pair.

Trim alignments and calculate distance

Your muscle alignments will have gaps on the ends of one or both sequences, so you'll have to trim these off. Once that's done, you want to calculate the distance between the two LTRs. Simply this would the the number of differences between the two, but mutation happens at different rates at different types of sites, for example transitions vs transversions. Therefore, convention is to calculate the Kimura 2-parameter model: k2pm, where P is the fraction of sites that are transitions, and Q is the fraction of sites that are transversions. If you calculate this for each of the LTRs in your species, you will get a distribution of K substitutions/site. You can then get the insertion time for each K, by calculating T, where mu is your mutation rate (e.g. 7E-9 in A. thaliana).

I wrote a python script to do all of this last part and output a K value, from which it is easy to convert to T (it's also easy to edit the script to do this). Just note that if your species' generation time isn't 1 year, the T estimate you get won't be in years, it's in generations.

Full pipeline

I wrote a full pipeline for going from a concatenated fasta file to a file of K values and put it in this repo under Scripts/LTRharvest. The script Age_LTRs.sh has the pipeline steps, but the other scripts are needed as they're referenced here. You should only have to edit the species list and paths in the Age_LTRs.sh script to run the full pipeline.

⚠️ **GitHub.com Fallback** ⚠️