3A. Day 2: Unix essential for genomics data processing and analysis (Hands on Session) - bioinfokushwaha/Livestock

Lecture_Advances_in_sequencing_technology_and_its_application.pdf

Objective: In this exercise, we will learn routine tasks in Genomics data processing and analysis

I. Check file integrity and file type

Creation of the first file
   1. $nano sandeep.txt
   2. write something “My name is Sandeep"
   3. <Control> + o (To save the file )
   4. Enter
   5. <Control> + x (Exit the file )

Creation of the second file
   1. $nano sandeep1.txt
   2. write something “My name is Sandeep."
   3. <Control> + o (To save the file )
   4. Enter
   5. <Control> + x (Exit the file )

Check the content of the file
   $md5sum sandeep
   $md5sum sandeep1

Check the file type 
   $file sandeep
   $file ../Day2/Test.fq.gz

II. Sequence counting

$cp ../Day2/Test.fq.gz ./
$zcat Test.fq.gz

(a) Sequence Counting
	$zcat Test.fq.gz | wc -l
	$zcat Test.fq.gz | grep -c "^@ERR"

III. Nucleotide sequence extraction and counting

cut : used to cutting out the sections from each line of files for data processing

Paste : Use to merge file content

Wc : to count Words and line

cut : used to cutting out the sections from each line of files for data processing
Syntax : cut OPTION... [FILE]...
Options Description
	-b : To cut by bytes, with the list of byte numbers separated by comma or range by -
	-c : To cut by character, with the list of byte numbers separated by comma or range by -
	-f : To cut by field, with the list of byte numbers separated by comma or range by -
	-d : cut by delimiter
       
        $cp ../Day2/id.txt ./
	$cat id.txt
	$less id.txt |cut -b 1-3		$less id.txt |cut -b 3-6
	$less id.txt |cut -b -3			$less id.txt |cut -b 3-
	$less id.txt |cut -d “_” -f 1		$less id.txt |cut -d “_” -f 2
	$less id.txt |cut -d “_” -f 1,2		$less id.txt |cut -d “_” -f 1-2



Paste : Use to merge file content
Syntax : paste [-d/s] file1 file2
Options Description
	-d : To define delimiter for joining 
	-s : To join file content horizontally

        $cp ../Day2/family.txt ./
	$cat id.txt			 $cat family.txt
	$paste id.txt family.txt	 $paste –d “|” id.txt family.txt
	$paste -s id.txt family.txt 	 $cat id.txt family.txt |paste - - -



Wc : to count Words and line
Syntax : wc [-l/w/c]... [FILES]...
Options Description
	-l : To get line number only
	-w : To get words only
	-c : To get character only
	-L: to print out the length of the longest line

        $less id.txt|wc 
        $less id.txt|wc -l
        $less id.txt|wc -c

Back to the original Objective ....

b) Extraction of the nucleotide sequence and count
	$zless Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|wc
        $cp ../Day2/Test.fa ./
	$less Test.fa| grep -c \>
        $less Test.fa| grep -v \>|wc -l

grep : global regular expression print

sort: Use to sort file content line by line

uniq : used to reports or filter out repeated lines in a file line by line

sed: used to report or filter out repeated lines in a file line by line

grep : global regular expression print
Syntax:    grep [options] pattern [files]
Options Description
	-c : Prints only a count of the lines that match a pattern
	-n : Display the matched lines and their line numbers.
	-w : Match whole word
	-v : Prints out all the lines that do not matches the pattern
	-A : Print number of line after pattern matching
	-B : Print number of line before pattern matching
	-f : Takes patterns from file, one per line.
	-i : Ignores, case for matching (for non-case sensitivity )

	$less Test.fa|grep -c "Seq“
	$less Test.fa|grep -ci "Seq“
	$less Test.fa|grep -in "Seq“
	$less Test.fa|grep -inw "Seq“
		$less Test.fa|grep  "Seq" -A 1 (Troubleshooting)
		$less Test.fa|grep  "Seq" -B 1  (Troubleshooting)





sort: Use to sort file content line by line
Syntax : sort [-n/r/k] <file_name>
Options Description
	-r: reverse the result of comparisons
	-k: KEYDEF gives location and type
	-n: compare according to string numerical value
	-b: ignore leading blanks

         $cp ../Day2/Features.txt ./
	 $cat Features.txt		 $cat Exon.txt 
	 $cat Features.txt|sort		 $cat Exon.txt|sort 
	 $cat Exon.txt|sort -n 		 $cat Exon.txt|sort -n -k 2 
	 $cat Exon.txt|sort -n -k 2 -r 

	

uniq : used to reports or filter out repeated lines in a file line by line
Syntax : uniq [options] [files]
Options Description
	-c : prints how many times a line was repeated by displaying a number
	-d : prints the repeated lines and not the lines which aren’t repeated
	-i : ignore case.
	-u : print only unique lines. 	

          $cp ../Day2/Exon.txt ./
          $cat Exon.txt			 	         $cat Exon.txt|uniq
          $cat Exon.txt|cut -d " " -f 1 |uniq		 $cat DATA/Exon.txt|cut -d " " -f 2 |uniq
          $cat Exon.txt|cut -d " " -f 2 |sort|uniq	 $cat Exon.txt|cut -d " " -f 1 |sort|uniq –c
          $cat Exon.txt|cut -d " " -f 1 |sort|uniq –u	 $cat Exon.txt|cut -d " " -f 1 |sort|uniq -d




Fold (to break lines)
Syntax: fold [options] [files]
Options Description
  	-s: break at spaces
  	-w: use columns WIDTH
	
        $less Exon.txt|fold -w 1|sort|uniq -c 




Syntax : sed "s/<pattern to match>/<pattern to replace with>/<options>"
Options Description
	g- Global match - Replace more than one hit.
	i- Case-insensitive match

        $cat Filename|sed  ‘s/Patternfor matching/,/Replacement Pattern/p' 
        $cat Filename|sed -n '/PatternStart/,/PatternEnd/p' 

        $cat Exon.txt
        $cat Exon.txt|sed 's/exon/mRNA/'

back to original objective .........


c) Composition of nucleotides of A/T/G/C
	$zless Test.fq.gz|paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c 
	$less Test.fq |paste - - - -|cut -f 2|tr -d "\n"|sed 's/.../&\n/g'

Its time for some mathematics

d) AT/GC % ;Tm?

$zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/  //g'|cut -d " " -f 1

Saving output in a variable by using backtick (symbol on Esc key)
 
$A=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/  //g'|cut -d " " -f 1`
$T=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep T|sed 's/  //g'|cut -d " " -f 1`

$echo A T  ...... is it worked

$echo $A $T

echo "$A+$T"|bc

Assignment-1:

Fastq to Fasta sequence
Splitting of fasta sequence file
Conversion of single line fasta sequence into multiple line fasta sequence or vice-versa
Random dataset generation for fastq and fasta file
Extraction of the gene sequence of interest from genome file
Creation of non-redundant dataset

3A. Day 2: Unix essential for genomics data processing and analysis (Hands on Session) - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Objective: In this exercise, we will learn routine tasks in Genomics data processing and analysis

I. Check file integrity and file type

II. Sequence counting

III. Nucleotide sequence extraction and counting

Its time for some mathematics

Assignment-1:

Use the following credentials to access the internet and NIAB server

⚠️ GitHub.com Fallback ⚠️

3A. Day 2: Unix essential for genomics data processing and analysis (Hands on Session) - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Objective: In this exercise, we will learn routine tasks in Genomics data processing and analysis

I. Check file integrity and file type

II. Sequence counting

III. Nucleotide sequence extraction and counting

Its time for some mathematics

Assignment-1:

Use the following credentials to access the internet and NIAB server

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️