3A. Day 2: Unix essential for genomics data processing and analysis (Hands on Session) - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Lecture_Advances_in_sequencing_technology_and_its_application.pdf image

image

Objective: In this exercise, we will learn routine tasks in Genomics data processing and analysis

I. Check file integrity and file type

Creation of the first file
   1. $nano sandeep.txt
   2. write something “My name is Sandeep"
   3. <Control> + o (To save the file )
   4. Enter
   5. <Control> + x (Exit the file )

Creation of the second file
   1. $nano sandeep1.txt
   2. write something “My name is Sandeep."
   3. <Control> + o (To save the file )
   4. Enter
   5. <Control> + x (Exit the file )

Check the content of the file
   $md5sum sandeep
   $md5sum sandeep1

Check the file type 
   $file sandeep
   $file ../Day2/Test.fq.gz

II. Sequence counting

$cp ../Day2/Test.fq.gz ./
$zcat Test.fq.gz

(a) Sequence Counting
	$zcat Test.fq.gz | wc -l
	$zcat Test.fq.gz | grep -c "^@ERR"

III. Nucleotide sequence extraction and counting

cut : used to cutting out the sections from each line of files for data processing

Paste : Use to merge file content

Wc : to count Words and line

cut : used to cutting out the sections from each line of files for data processing
Syntax : cut OPTION... [FILE]...
Options Description
	-b : To cut by bytes, with the list of byte numbers separated by comma or range by -
	-c : To cut by character, with the list of byte numbers separated by comma or range by -
	-f : To cut by field, with the list of byte numbers separated by comma or range by -
	-d : cut by delimiter
       
        $cp ../Day2/id.txt ./
	$cat id.txt
	$less id.txt |cut -b 1-3		$less id.txt |cut -b 3-6
	$less id.txt |cut -b -3			$less id.txt |cut -b 3-
	$less id.txt |cut -d “_” -f 1		$less id.txt |cut -d “_” -f 2
	$less id.txt |cut -d “_” -f 1,2		$less id.txt |cut -d “_” -f 1-2



Paste : Use to merge file content
Syntax : paste [-d/s] file1 file2
Options Description
	-d : To define delimiter for joining 
	-s : To join file content horizontally

        $cp ../Day2/family.txt ./
	$cat id.txt			 $cat family.txt
	$paste id.txt family.txt	 $paste –d “|” id.txt family.txt
	$paste -s id.txt family.txt 	 $cat id.txt family.txt |paste - - -



Wc : to count Words and line
Syntax : wc [-l/w/c]... [FILES]...
Options Description
	-l : To get line number only
	-w : To get words only
	-c : To get character only
	-L: to print out the length of the longest line

        $less id.txt|wc 
        $less id.txt|wc -l
        $less id.txt|wc -c
	


Back to the original Objective ....

b) Extraction of the nucleotide sequence and count
	$zless Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|wc
        $cp ../Day2/Test.fa ./
	$less Test.fa| grep -c \>
        $less Test.fa| grep -v \>|wc -l

grep : global regular expression print

sort: Use to sort file content line by line

uniq : used to reports or filter out repeated lines in a file line by line

sed: used to report or filter out repeated lines in a file line by line

grep : global regular expression print
Syntax:    grep [options] pattern [files]
Options Description
	-c : Prints only a count of the lines that match a pattern
	-n : Display the matched lines and their line numbers.
	-w : Match whole word
	-v : Prints out all the lines that do not matches the pattern
	-A : Print number of line after pattern matching
	-B : Print number of line before pattern matching
	-f : Takes patterns from file, one per line.
	-i : Ignores, case for matching (for non-case sensitivity )

	$less Test.fa|grep -c "Seq“
	$less Test.fa|grep -ci "Seq“
	$less Test.fa|grep -in "Seq“
	$less Test.fa|grep -inw "Seq“
		$less Test.fa|grep  "Seq" -A 1 (Troubleshooting)
		$less Test.fa|grep  "Seq" -B 1  (Troubleshooting)





sort: Use to sort file content line by line
Syntax : sort [-n/r/k] <file_name>
Options Description
	-r: reverse the result of comparisons
	-k: KEYDEF gives location and type
	-n: compare according to string numerical value
	-b: ignore leading blanks

         $cp ../Day2/Features.txt ./
	 $cat Features.txt		 $cat Exon.txt 
	 $cat Features.txt|sort		 $cat Exon.txt|sort 
	 $cat Exon.txt|sort -n 		 $cat Exon.txt|sort -n -k 2 
	 $cat Exon.txt|sort -n -k 2 -r 

	

uniq : used to reports or filter out repeated lines in a file line by line
Syntax : uniq [options] [files]
Options Description
	-c : prints how many times a line was repeated by displaying a number
	-d : prints the repeated lines and not the lines which aren’t repeated
	-i : ignore case.
	-u : print only unique lines. 	

          $cp ../Day2/Exon.txt ./
          $cat Exon.txt			 	         $cat Exon.txt|uniq
          $cat Exon.txt|cut -d " " -f 1 |uniq		 $cat DATA/Exon.txt|cut -d " " -f 2 |uniq
          $cat Exon.txt|cut -d " " -f 2 |sort|uniq	 $cat Exon.txt|cut -d " " -f 1 |sort|uniq –c
          $cat Exon.txt|cut -d " " -f 1 |sort|uniq –u	 $cat Exon.txt|cut -d " " -f 1 |sort|uniq -d




Fold (to break lines)
Syntax: fold [options] [files]
Options Description
  	-s: break at spaces
  	-w: use columns WIDTH
	
        $less Exon.txt|fold -w 1|sort|uniq -c 




Syntax : sed "s/<pattern to match>/<pattern to replace with>/<options>"
Options Description
	g- Global match - Replace more than one hit.
	i- Case-insensitive match

        $cat Filename|sed  ‘s/Patternfor matching/,/Replacement Pattern/p' 
        $cat Filename|sed -n '/PatternStart/,/PatternEnd/p' 

        $cat Exon.txt
        $cat Exon.txt|sed 's/exon/mRNA/'



back to original objective .........


c) Composition of nucleotides of A/T/G/C
	$zless Test.fq.gz|paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c 
	$less Test.fq |paste - - - -|cut -f 2|tr -d "\n"|sed 's/.../&\n/g'

Its time for some mathematics

d) AT/GC % ;Tm?

$zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/  //g'|cut -d " " -f 1

Saving output in a variable by using backtick (symbol on Esc key)
 
$A=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/  //g'|cut -d " " -f 1`
$T=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep T|sed 's/  //g'|cut -d " " -f 1`

$echo A T  ...... is it worked

$echo $A $T

echo "$A+$T"|bc

Assignment-1:

  1. Fastq to Fasta sequence
  2. Splitting of fasta sequence file
  3. Conversion of single line fasta sequence into multiple line fasta sequence or vice-versa
  4. Random dataset generation for fastq and fasta file
  5. Extraction of the gene sequence of interest from genome file
  6. Creation of non-redundant dataset
⚠️ **GitHub.com Fallback** ⚠️