Module 8 get expression from RNA Seq data - HRGV/Marmics_Metagenomics GitHub Wiki
Set up your environment
You can simply use the environment kallisto for this module
conda activate /bioinf/transfer/Marmics_SPMG2021/SPMG_software_env/kallisto
match with the right Thiosymbion is important!!
##Metagenomes A,D-F are Leptonemella aphanotecae B Leptonemella vicina C Leptonemella sp.
##Metatranscriptomes **All samples are Leptonemella aphanotecae ** A – C Oxic D – F Anoxic
If you have annotated with RAST
Download the annotated genome as a genbank file
In the job details of a RAST annotated genome you can download a Genbank formatted file
OPTIONAL convert the genbank file into a fasta file of all coding sequences
To be able to quantify expression we need a gene database - a fasta file with all the genes as separate fasta sequences. We also want annotations for each gene in the description for each fasta sequence. This can be done with a little script called genbank_to_fasta.py
genbank_to_fasta.py -i YOUR_GENOME.gbk --out YOUR_GENOME_CDS.fasta -s nt -f CDS -q gene,product
-i
gives your .gbk file
--out
your output file
-s
sequence type nt - nucleotide
-f
features to be written, in our case CDS
-q
descriptions to be added to the header, gene and gene product
In your case, we have generated the files for you already, look for the feature_dna files in the folder bioinf/transfer/Marmics_SPMG2019/annotation/
Fix your fasta headers
The RAST annotations have an annoying structure, you can fix the headers with a simple substitution using sed
sed 's/fig/XYZ/' YOUR_GENOME_CDS.fasta > YOUR_GENOME_CDS_fixed.fasta
Replace the XYZ with a shortcut for your symbiont - e.g. LVs for Leptonemella vicina symbiont
If you have annotated with prokka
use the .ffn
file in the annotation folder
Call expression with kallisto
Kallisto will quantify the transcription for the genes in the database using the total-RNA-Seq reads we generated. Kallisto is superfast, anything you do from now on can be run on a 5-year old Laptop...
Index for kallisto
To be able to use the gene database with kallisto, you need to index it. That only needs to be done once
kallisto index -i YOUR_INDEX_FILE YOUR_GENOME_CDS.ffn
-i
indicates the name of the index to be generated from the .fasta file
get your reads
The RNA-Seq reads are in bioinf/transfer/Marmics_SPMG2019/metatranscriptomic_reads/Leptonemella/
. Please link the libraries you want to analyze into your folder using ln -s
. As always you can simplify your link names...
ln -s /bioinf/transfer/Marmics_SPMG2019/metatranscriptomic_reads/Leptonemella/3421_B_run472_TCCAGTCG_S54_L005_R1_001.fastq.gz LA_3421_B_rna.fq.gz
Call kallisto with the reference and the reads
kallisto quant -i YOUR_INDEX_FILE -o RNA-Seq-lib_on_YOUR_INDEX_FILE -b 100 read1.fq.gz read2.fq.gz -t 6
only necessary for single reads -l
gives fragment length of the library - you can get this from your sequencing center, in our case about 300
only necessary for single reads -s
standard deviation of fragment length - in our case about 50
-t
threads to be used, 6 is more than enough \
In your output folder, you will find a file called abundances.tsv
that are your expression profiles. Sadly the file misses the annotations, that would make the file hard to interpret
Add annotations to expression files with a little command line magic
First lets collect the headers from our fasta files that contain the annotations
grep '>' YOUR_GENOME_CDS.ffn | sed 's/>//g' > YOUR_GENOME_CDS_fixed.headers.txt
The headers and the abundances files are sorted the same way, this allows us to merge the two files. But wait, there is a header line in the abundances file, that needs to be kept and amended!
paste <(head -n 1 ./RNA-Seq-lib_on_YOUR_GENOME_CDS/abundance.tsv) <(echo "annotation") > transcription_header.txt
paste glues together two inputs, line by line - here we call two commands, head and echo and paste the line into the new header and write it into a file
Now lets do the same for the data
paste <(tail -n +2 ./RNA-Seq-lib_on_YOUR_GENOME_CDS/abundance.tsv) <(cut YOUR_GENOME_CDS_fixed.headers.txt -d' ' -f2-) > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_noheader.tsv
We now pasted everything but the first line of the abundance file with the description part of the fasta headers. Let's add the header using cat and we are done:
cat transcription_header.txt RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_noheader.tsv > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv
Sort the expression profiles from high to low expression
Hmm, but that last list was awful, it is sorted by the genes and not the transcription! To change this you can finally use &&
!
(head -n 1 RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv && tail -n +2 RNA-Seq-lib_on_YOUR_GENOME_CDS_expression.tsv | sort -k5 -rV) > RNA-Seq-lib_on_YOUR_GENOME_CDS_expression_sorted.tsv
Have fun analyzing what your bug was doing when you killed it!
The overall collected TPM values are in
/bioinf/transfer/Marmics_SPMG2021/expression_all_libraries/4466_AtoF-on_TS_La_expression_final.csv
#Alternative environment in case conda breaks or does not work for you
edit your ~/.bashrc
nano ~/.bashrc
add the following PATH:
export PATH=/bioinf/transfer/Marmics_SPMG2019/software/kallisto:$PATH
To make it work, source the ~/.bashrc
:
source ~/.bashrc