4B. Day 4. identification Long Coding RNA - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Long Non coding RNA

Long non-coding RNAs (long ncRNAs, lncRNA) are one among another types of RNA, generally defined as transcripts more than 200 nucleotides that are NOT translated into protein. They are involved in different physiological conditions like disease, biotic and abiotic stress, development etc. Identification and analysis of the expression profiles of lncRNAs will be of great value in understanding their key roles in specific conditions. This session will end up with the holistic view of the lncRNAs. This training will show insights of the interplay of key molecular players specifically lncRNAs which may play role in resistance and susceptibility pattern against pathogens. This will surely contribute in the better understanding of different physiological conditions at genomic level.

Pipeline for identification of lncRNAs

1. QUALIY FILTERING – FASTP

This is a tool which functions for quality control, trimming of adapters, filtering by quality.

Command: fastp -i ../Dushyant/SRR12426364_R1.fastq -I ../Dushyant/SRR12426364_R2.fastq -o sample_R1.fastq -O sample_R2.fastq -j sample.json -h sample.html

Screenshot showing Fastq file format

2. MAPPING READS TO THE GENOME - HISAT2:

Note:- To build genome reference file for hisat2 mapping we have to use -build flag. The output file of -build is the input genome file for further hisat2 step.

command: hisat2-build ../Dushyant/ARS-UCD2.0_genomic.fna Bos_taurus

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population.

Command: hisat2 -q --dta --summary-file summary.txt -x Bos_taurus -1 sample_R1.fastq -2 sample_R2.fastq -S sample.sam

Note:- -x hisat2 build output without extention

Screenshot showing output file of mapping by Hisat 2 : summary file and sam file

3. CONVERSION OF SAM FILE TO BAM

Command: samtools sort -O BAM -o sample.bam sammple.sam

Screenshot showing output of samtool

4. Assembly of transcripts using Stringtie

Command: stringtie sample.bam -G ../Dushyant/ARS-UCD2.0_genomic.gff -o sample.gtf

Screenshot showing output of assembly by Stringtie

5. TRANSCRIPT ANNOTATION - GFFCompare

GFFcompare: Tool used for compaeing transcripts with reference genome file (.fasta and .gff). Where we will get gffcmp.combined.gtf file from which we are going to extract class codes.

NOTE:- need to save the path of .gtf file into a .txt file

Command: gffcompare -r ../Dushyant/ARS-UCD2.0_genomic.gff -s ../Dushyant/ARS-UCD2.0_genomic.fna -T -V -i gtf_path.txt

Screenshot showing annotated File

Annotates the transcripts into different class codes, of which 3 class codes represent lncRNAs

Class_code “i” - intronic lncRNAs
Class_code “o” - intergenic lncRNAs
Class_code “x” - Antisense lncRNAs

6. EXTRACTION OF SEQUENCE OF i,u AND x CLASS_CODES:

transcript code

Screenshot showing different types of class codes

for extraction of class_codes:

grep 'class_code \"i\"' gffcmp.annotated.gtf | cut -f1,4,5,9 | cut -f1 -d ";" > ixo.bed

grep 'class_code \"x\"' gffcmp.annotated.gtf | cut -f1,4,5,9 | cut -f1 -d ";" >> ixo.bed

grep 'class_code \"o\"' gffcmp.annotated.gtf | cut -f1,4,5,9 | cut -f1 -d ";" >> ixo.bed

here "samtools" we are using for preparation of chromosome file, in which information about position of chromosome

samtools faidx ../Dushyant/ARS-UCD2.0_genomic.fna

we will extract first two columns, as they contain chromosome id and length. Output file will be the input for bedtools

cut -f1,2 ARS-UCD2.0_genomic.fna.fai > Bos_taurus.txt

7. EXTRACTION OF FASTA SEQUENCE USING BEDTOOL

Report whether each alignment overlaps one or more genes. If not, the alignment is not reported. Report the number of genes that each alignment overlaps.

Command: bedtools slop -b 50 -i ixo.bed -g Bos_taurus.txt > output-file.slop.bed

using slop.bed going to extract fasta sequence

Command: bedtools getfasta -fi ../Dushyant/ARS-UCD2.0_genomic.fna -bed output_file.slop.bed -fo ixo_transcript.fasta -name

Screenshot (128)

Screenshot showing bed file

8. LENGTH AND ORF-FILTER:

Sequence length filter:

Long non-coding RNAs are generally of length greater than 200. So we remove sequences with length less than 200.

python ../Dushyant/lenfilter200.py

ORF filter:

OrfPredictor is a web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences.

Command: OrfPredictor.pl ixo_lenfilter.fasta empty.txt 0 both 0.001 ixo_orf.fasta

Sequences with ORFs of length greater than 300 are generally considered as protein coding, thus we remove sequences with ORF greater than 300 and keeping <100 amino acid length sequence.

python ../Dushyant/lenfileter100.py

9. PROTEIN FAMILY FILTER – PFAM - RPSBLAST

RPSBlast command: rpsblast -db path_of_database -query ixo_ORFfilter_gt100.fasta -out ixo_ORF100_pfam_rps.txt -evalue 0.001

In result we can see 'hit' and '0 hit'. 0 hit means this could be potentially non coding to anything.

Screenshot showing RPSBlast Output

10. CODING POTENTIAL ANALYSIS – CPC2

Command: CPC2.py -i input.fasta -o output_file

Screenshot showing CPC2 output file