Module 8 get expression from RNA Seq data - HRGV/Marmics_Metagenomics GitHub Wiki

Set up your environment

You can simply use the environment kallisto for this module

conda activate /bioinf/transfer/Marmics_SPMG2021/SPMG_software_env/kallisto

match with the right Thiosymbion is important!!

##Metagenomes A,D-F are Leptonemella aphanotecae B Leptonemella vicina C Leptonemella sp.

##Metatranscriptomes **All samples are Leptonemella aphanotecae ** A – C Oxic D – F Anoxic

If you have annotated with RAST

Download the annotated genome as a genbank file

In the job details of a RAST annotated genome you can download a Genbank formatted file

OPTIONAL convert the genbank file into a fasta file of all coding sequences

To be able to quantify expression we need a gene database - a fasta file with all the genes as separate fasta sequences. We also want annotations for each gene in the description for each fasta sequence. This can be done with a little script called genbank_to_fasta.py

genbank_to_fasta.py -i YOUR_GENOME.gbk --out YOUR_GENOME_CDS.fasta -s nt -f CDS -q gene,product

-i gives your .gbk file
--out your output file
-s sequence type nt - nucleotide
-f features to be written, in our case CDS
-q descriptions to be added to the header, gene and gene product

In your case, we have generated the files for you already, look for the feature_dna files in the folder bioinf/transfer/Marmics_SPMG2019/annotation/

Fix your fasta headers

The RAST annotations have an annoying structure, you can fix the headers with a simple substitution using sed

sed 's/fig/XYZ/' YOUR_GENOME_CDS.fasta > YOUR_GENOME_CDS_fixed.fasta

Replace the XYZ with a shortcut for your symbiont - e.g. LVs for Leptonemella vicina symbiont

If you have annotated with prokka

use the .ffn file in the annotation folder

Call expression with kallisto

Kallisto will quantify the transcription for the genes in the database using the total-RNA-Seq reads we generated. Kallisto is superfast, anything you do from now on can be run on a 5-year old Laptop...

Index for kallisto

To be able to use the gene database with kallisto, you need to index it. That only needs to be done once

kallisto index -i YOUR_INDEX_FILE YOUR_GENOME_CDS.ffn

-i indicates the name of the index to be generated from the .fasta file

get your reads

The RNA-Seq reads are in bioinf/transfer/Marmics_SPMG2019/metatranscriptomic_reads/Leptonemella/. Please link the libraries you want to analyze into your folder using ln -s. As always you can simplify your link names...

ln -s /bioinf/transfer/Marmics_SPMG2019/metatranscriptomic_reads/Leptonemella/3421_B_run472_TCCAGTCG_S54_L005_R1_001.fastq.gz LA_3421_B_rna.fq.gz

Call kallisto with the reference and the reads

kallisto quant -i YOUR_INDEX_FILE -o RNA-Seq-lib_on_YOUR_INDEX_FILE -b 100 read1.fq.gz read2.fq.gz -t 6

only necessary for single reads -l gives fragment length of the library - you can get this from your sequencing center, in our case about 300
only necessary for single reads -s standard deviation of fragment length - in our case about 50
-t threads to be used, 6 is more than enough \

In your output folder, you will find a file called abundances.tsv that are your expression profiles. Sadly the file misses the annotations, that would make the file hard to interpret

Add annotations to expression files with a little command line magic

First lets collect the headers from our fasta files that contain the annotations

grep '>' YOUR_GENOME_CDS.ffn | sed 's/>//g' >  YOUR_GENOME_CDS_fixed.headers.txt

The headers and the abundances files are sorted the same way, this allows us to merge the two files. But wait, there is a header line in the abundances file, that needs to be kept and amended!

paste <(head -n 1 ./RNA-Seq-lib_on_YOUR_GENOME_CDS/abundance.tsv) <(echo "annotation") > transcription_header.txt

paste glues together two inputs, line by line - here we call two commands, head and echo and paste the line into the new header and write it into a file

Now lets do the same for the data

paste <(tail -n +2 ./RNA-Seq-lib_on_YOUR_GENOME_CDS/abundance.tsv) <(cut YOUR_GENOME_CDS_fixed.headers.txt -d' ' -f2-) > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_noheader.tsv

We now pasted everything but the first line of the abundance file with the description part of the fasta headers. Let's add the header using cat and we are done:

cat transcription_header.txt RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_noheader.tsv > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv

Sort the expression profiles from high to low expression

Hmm, but that last list was awful, it is sorted by the genes and not the transcription! To change this you can finally use &&!

(head -n 1 RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv && tail -n +2 RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv | sort -k5 -rV) > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_sorted.tsv

Have fun analyzing what your bug was doing when you killed it!

The overall collected TPM values are in

/bioinf/transfer/Marmics_SPMG2021/expression_all_libraries/4466_AtoF-on_TS_La_expression_final.csv

#Alternative environment in case conda breaks or does not work for you

edit your ~/.bashrc

nano ~/.bashrc

add the following PATH:

export PATH=/bioinf/transfer/Marmics_SPMG2019/software/kallisto:$PATH

To make it work, source the ~/.bashrc:

source ~/.bashrc