3A. Day 2: Unix essential for genomics data processing and analysis (Hands on Session) - bioinfokushwaha/Livestock_Genomics GitHub Wiki
Lecture_Advances_in_sequencing_technology_and_its_application.pdf
Creation of the first file
1. $nano sandeep.txt
2. write something “My name is Sandeep"
3. <Control> + o (To save the file )
4. Enter
5. <Control> + x (Exit the file )
Creation of the second file
1. $nano sandeep1.txt
2. write something “My name is Sandeep."
3. <Control> + o (To save the file )
4. Enter
5. <Control> + x (Exit the file )
Check the content of the file
$md5sum sandeep
$md5sum sandeep1
Check the file type
$file sandeep
$file ../Day2/Test.fq.gz
$cp ../Day2/Test.fq.gz ./
$zcat Test.fq.gz
(a) Sequence Counting
$zcat Test.fq.gz | wc -l
$zcat Test.fq.gz | grep -c "^@ERR"
cut : used to cutting out the sections from each line of files for data processing
Paste : Use to merge file content
Wc : to count Words and line
cut : used to cutting out the sections from each line of files for data processing
Syntax : cut OPTION... [FILE]...
Options Description
-b : To cut by bytes, with the list of byte numbers separated by comma or range by -
-c : To cut by character, with the list of byte numbers separated by comma or range by -
-f : To cut by field, with the list of byte numbers separated by comma or range by -
-d : cut by delimiter
$cp ../Day2/id.txt ./
$cat id.txt
$less id.txt |cut -b 1-3 $less id.txt |cut -b 3-6
$less id.txt |cut -b -3 $less id.txt |cut -b 3-
$less id.txt |cut -d “_” -f 1 $less id.txt |cut -d “_” -f 2
$less id.txt |cut -d “_” -f 1,2 $less id.txt |cut -d “_” -f 1-2
Paste : Use to merge file content
Syntax : paste [-d/s] file1 file2
Options Description
-d : To define delimiter for joining
-s : To join file content horizontally
$cp ../Day2/family.txt ./
$cat id.txt $cat family.txt
$paste id.txt family.txt $paste –d “|” id.txt family.txt
$paste -s id.txt family.txt $cat id.txt family.txt |paste - - -
Wc : to count Words and line
Syntax : wc [-l/w/c]... [FILES]...
Options Description
-l : To get line number only
-w : To get words only
-c : To get character only
-L: to print out the length of the longest line
$less id.txt|wc
$less id.txt|wc -l
$less id.txt|wc -c
Back to the original Objective ....
b) Extraction of the nucleotide sequence and count
$zless Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|wc
$cp ../Day2/Test.fa ./
$less Test.fa| grep -c \>
$less Test.fa| grep -v \>|wc -l
grep : global regular expression print
sort: Use to sort file content line by line
uniq : used to reports or filter out repeated lines in a file line by line
sed: used to report or filter out repeated lines in a file line by line
grep : global regular expression print
Syntax: grep [options] pattern [files]
Options Description
-c : Prints only a count of the lines that match a pattern
-n : Display the matched lines and their line numbers.
-w : Match whole word
-v : Prints out all the lines that do not matches the pattern
-A : Print number of line after pattern matching
-B : Print number of line before pattern matching
-f : Takes patterns from file, one per line.
-i : Ignores, case for matching (for non-case sensitivity )
$less Test.fa|grep -c "Seq“
$less Test.fa|grep -ci "Seq“
$less Test.fa|grep -in "Seq“
$less Test.fa|grep -inw "Seq“
$less Test.fa|grep "Seq" -A 1 (Troubleshooting)
$less Test.fa|grep "Seq" -B 1 (Troubleshooting)
sort: Use to sort file content line by line
Syntax : sort [-n/r/k] <file_name>
Options Description
-r: reverse the result of comparisons
-k: KEYDEF gives location and type
-n: compare according to string numerical value
-b: ignore leading blanks
$cp ../Day2/Features.txt ./
$cat Features.txt $cat Exon.txt
$cat Features.txt|sort $cat Exon.txt|sort
$cat Exon.txt|sort -n $cat Exon.txt|sort -n -k 2
$cat Exon.txt|sort -n -k 2 -r
uniq : used to reports or filter out repeated lines in a file line by line
Syntax : uniq [options] [files]
Options Description
-c : prints how many times a line was repeated by displaying a number
-d : prints the repeated lines and not the lines which aren’t repeated
-i : ignore case.
-u : print only unique lines.
$cp ../Day2/Exon.txt ./
$cat Exon.txt $cat Exon.txt|uniq
$cat Exon.txt|cut -d " " -f 1 |uniq $cat DATA/Exon.txt|cut -d " " -f 2 |uniq
$cat Exon.txt|cut -d " " -f 2 |sort|uniq $cat Exon.txt|cut -d " " -f 1 |sort|uniq –c
$cat Exon.txt|cut -d " " -f 1 |sort|uniq –u $cat Exon.txt|cut -d " " -f 1 |sort|uniq -d
Fold (to break lines)
Syntax: fold [options] [files]
Options Description
-s: break at spaces
-w: use columns WIDTH
$less Exon.txt|fold -w 1|sort|uniq -c
Syntax : sed "s/<pattern to match>/<pattern to replace with>/<options>"
Options Description
g- Global match - Replace more than one hit.
i- Case-insensitive match
$cat Filename|sed ‘s/Patternfor matching/,/Replacement Pattern/p'
$cat Filename|sed -n '/PatternStart/,/PatternEnd/p'
$cat Exon.txt
$cat Exon.txt|sed 's/exon/mRNA/'
back to original objective .........
c) Composition of nucleotides of A/T/G/C
$zless Test.fq.gz|paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c
$less Test.fq |paste - - - -|cut -f 2|tr -d "\n"|sed 's/.../&\n/g'
d) AT/GC % ;Tm?
$zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/ //g'|cut -d " " -f 1
Saving output in a variable by using backtick (symbol on Esc key)
$A=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep A|sed 's/ //g'|cut -d " " -f 1`
$T=`zcat ../Day2/Test.fq.gz |paste - - - -|cut -f 2|tr -d '\n'|fold -w 1|sort|uniq -c |grep T|sed 's/ //g'|cut -d " " -f 1`
$echo A T ...... is it worked
$echo $A $T
echo "$A+$T"|bc
- Fastq to Fasta sequence
- Splitting of fasta sequence file
- Conversion of single line fasta sequence into multiple line fasta sequence or vice-versa
- Random dataset generation for fastq and fasta file
- Extraction of the gene sequence of interest from genome file
- Creation of non-redundant dataset