text_processing
To save the output of any of these commands to a file, append `> file.txt` after the command.
grep "search_term" filename
- prints occurences of "search_term" in filename
-
-i
for case insensitive search -
-n
to show line numbers -
-A N
show lines afer match -
-B N
show lines before match -
-C N
show lines before and after match
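As a quick illustration (the file name and search term here are hypothetical), the flags can be combined, and the output can be saved with a redirect as described above:

```bash
# Case-insensitive search for "error", printing line numbers and 2 lines of
# context around each match, with the results saved to matches.txt.
grep -i -n -C 2 "error" log.txt > matches.txt
```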
Awk is extremely useful, albeit challenging to work with. Its basic syntax is `awk 'pattern { action }' filename`.
Useful Awk Variables
- `$0` - the whole line
- `$N` - the Nth field (fields start at 1)
- `NR` - current line number across all input
- `FNR` - line number within the current file
- `NF` - number of fields in the current line
- `&&` - logical AND
- `||` - logical OR
- `~ /pattern/` - grep-like pattern match
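A minimal sketch combining several of these, assuming a hypothetical whitespace-delimited file annotations.txt whose third column holds a description:

```bash
# Skip the header line (NR > 1) and, whenever the third field contains "kinase",
# print the line number, the first field, and the total number of fields.
awk 'NR > 1 && $3 ~ /kinase/ {print NR, $1, NF}' annotations.txt
```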
Example one-liners:
- `awk '{print $1, $3}' filename` - print columns 1 and 3
- `awk '$2 == "text" {print $0}' filename` - print the whole line where the second column matches "text"
- `awk '{print $2 + $3}' filename` - print the sum of the second and third columns
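For example, on a hypothetical three-column file the last command behaves like this:

```bash
$ cat values.txt
a 1 10
b 2 20
$ awk '{print $2 + $3}' values.txt
11
22
```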
Advanced Awk
Advanced awk setup:
`awk 'BEGIN { setup } condition { action } END { wrap-up }' file`
awk runs the BEGIN block before any input is read, then the condition { action } block for each line in the file, and finally the END block once after the last line.
Calculate the average of the second column:
`awk 'BEGIN {sum = 0; n = 0} {sum += $2; n++} END {print "Average:", sum/n}' file.txt`
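Run against a hypothetical two-column file where the second column is numeric, this produces:

```bash
$ cat scores.txt
sampleA 10
sampleB 20
sampleC 30
$ awk 'BEGIN {sum = 0; n = 0} {sum += $2; n++} END {print "Average:", sum/n}' scores.txt
Average: 20
```

Note that an empty input file would make `sum/n` divide by zero, so this assumes at least one data line.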
Multiple Files
As an example, let's join two files on their first column:
`awk 'FNR==NR {dict[$1]=$2; next} $1 in dict {print $1, dict[$1], $2}' file1 file2`
Explanation: `FNR==NR` is only true while reading the first file. During that time, awk builds a dictionary (`dict`) mapping the first column to the second. The `next` statement then skips the rest of the script for the current line, so the second block never runs on lines from the first file.
When awk starts reading the second file, the second block runs: if the first column of the second file matches a key in the dictionary, it prints the key, the value from the dictionary, and the value from the second file.
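A worked example with two small hypothetical files makes the flow concrete:

```bash
$ cat file1
geneA 12
geneB 7
$ cat file2
geneA 0.5
geneC 0.9
$ awk 'FNR==NR {dict[$1]=$2; next} $1 in dict {print $1, dict[$1], $2}' file1 file2
geneA 12 0.5
```

geneC is dropped because it never appears in file1, and lines from file1 with no match in file2 (like geneB) are also dropped, so this behaves like an inner join.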
`head -n N filename` - print the first N lines
`tail -n N filename` - print the last N lines
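The two combine nicely to pull out an arbitrary line range (file name hypothetical):

```bash
# Print lines 11-20 of results.txt: head keeps the first 20 lines,
# then tail keeps the last 10 of those.
head -n 20 results.txt | tail -n 10
```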
`sort filename` - sort the file alphabetically
- `-r` - reverse sort
- `-n` - numerical sort
- `-k N,N` - sort by column N
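For instance, to sort a hypothetical counts.txt by its third column, numerically and from largest to smallest:

```bash
# -k 3,3 restricts the sort key to column 3 only; -n compares numerically; -r reverses.
sort -k 3,3 -n -r counts.txt
```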
`uniq sorted_file` - takes a sorted file and removes adjacent duplicate lines
- `-c` - counts occurrences
- `-d` - keeps only duplicated lines
- `-u` - keeps only unique lines
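Because uniq only collapses adjacent duplicates, the input is usually sorted first. A common pattern (file name hypothetical) is counting how often each line appears:

```bash
# Count occurrences of each line, then list the most frequent lines first.
sort samples.txt | uniq -c | sort -n -r
```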
`wc -l filename` - count lines
`wc -w filename` - count words
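wc also reads from standard input, so it pipes naturally from the commands above (file name and search term hypothetical):

```bash
# Count how many lines in data.txt contain "term".
grep "term" data.txt | wc -l
```

`grep -c "term" data.txt` gives the same count in a single command.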