text_processing - Karthikeyan-Lab-Caltech/Wiki GitHub Wiki

To save the output of a command to a file instead of printing it to the terminal, add > file.txt after the command.
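As a minimal sketch of redirection (assuming a POSIX shell; the scratch directory, file names, and contents are made up for the demo), > replaces the contents of the target file, while >> appends to it:

```shell
# Work in a scratch directory so no real file is overwritten.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/input.txt"

# > writes the command's output to file.txt, replacing any old contents.
grep "ph" "$tmpdir/input.txt" > "$tmpdir/file.txt"

# >> appends to file.txt instead of replacing it.
grep "et" "$tmpdir/input.txt" >> "$tmpdir/file.txt"

saved=$(cat "$tmpdir/file.txt")
echo "$saved"
rm -rf "$tmpdir"
```

Here file.txt ends up holding the line from the first grep followed by the line from the second.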

grep - find text in files

grep "search_term" filename - prints the lines in filename that contain "search_term"

  • -i for case insensitive search
  • -n to show line numbers
  • -A N to show N lines after each match
  • -B N to show N lines before each match
  • -C N to show N lines before and after each match
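The options above can be tried on a small sample file. This is only a sketch: log.txt and its contents are invented for the demo.

```shell
# Build a small sample file in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/log.txt" <<'EOF'
starting job
reading input
Error: missing file
retrying
job finished
EOF

# -i ignores case, -n prefixes each matching line with its line number.
grep_in=$(grep -i -n "error" "$tmpdir/log.txt")
echo "$grep_in"

# -A 1 also prints one line after each match.
grep_after=$(grep -i -A 1 "error" "$tmpdir/log.txt")
echo "$grep_after"

# -C 1 prints one line on both sides of each match.
grep_context=$(grep -i -C 1 "error" "$tmpdir/log.txt")
echo "$grep_context"

rm -rf "$tmpdir"
```

The first command prints 3:Error: missing file; the other two add the neighbouring lines around that match.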

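awk - advanced line-by-line text processing

Awk is extremely useful, albeit challenging to work with. Its basic syntax is: awk 'pattern { action }' filename. The action runs on every line that matches the pattern; if the pattern is omitted, the action runs on every line.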

Useful Awk Variables

$0 - the whole line

$N - Nth field (column) of the line, numbered from 1

NR - line number, counted across all input files

FNR - line number within the current file

NF - number of fields in line

&& - AND

|| - OR

~ /pattern/ - grep-like regular expression match, for example $1 ~ /pattern/
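To see how these variables behave, here is a small sketch run on two throwaway files (one.txt and two.txt are made-up names). Note how NR keeps counting across files while FNR restarts at 1 for the second file:

```shell
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f\ng h i j\n' > "$tmpdir/two.txt"

# For every line print NR, FNR, NF and the last field ($NF).
awk_vars=$(awk '{print NR, FNR, NF, $NF}' "$tmpdir/one.txt" "$tmpdir/two.txt")
echo "$awk_vars"

# Combine conditions: at least two fields AND first field matching a regex.
awk_match=$(awk 'NF >= 2 && $1 ~ /^[a-d]/ {print $0}' "$tmpdir/one.txt" "$tmpdir/two.txt")
echo "$awk_match"

rm -rf "$tmpdir"
```

The first command prints one row per input line (1 1 3 c, 2 2 2 e, 3 1 1 f, 4 2 4 j); the second keeps only the two lines of one.txt.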

Examples of awk:
  • awk '{print $1, $3}' filename - Print columns 1 and 3
  • awk '$2 == "text" {print $0}' filename - Print the whole line where the second column equals 'text'
  • awk '{print $2 + $3}' filename - print the sum of the second and third columns
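The three examples above can be checked on sample data. The files samples.txt and numbers.txt below are hypothetical and exist only for this sketch:

```shell
tmpdir=$(mktemp -d)
printf 'geneA text 12\ngeneB other 7\ngeneC text 30\n' > "$tmpdir/samples.txt"
printf 'x 1 2\ny 10 5\n' > "$tmpdir/numbers.txt"

# Columns 1 and 3 of every line.
cols=$(awk '{print $1, $3}' "$tmpdir/samples.txt")
echo "$cols"

# Whole lines whose second column is exactly "text".
rows=$(awk '$2 == "text" {print $0}' "$tmpdir/samples.txt")
echo "$rows"

# Sum of columns 2 and 3, one result per line.
sums=$(awk '{print $2 + $3}' "$tmpdir/numbers.txt")
echo "$sums"

rm -rf "$tmpdir"
```

On this data the last command prints 3 and 15.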
Advanced Awk

The general structure of a full awk program:

awk 'BEGIN { setup } condition { action } END { wrap-up }' file

awk runs the BEGIN block once before any input is read, then the condition { action } block for each line of the file that satisfies the condition, and finally the END block once after the last line.

Calculate the average of the second column

awk 'BEGIN {sum = 0; n = 0} {sum += $2; n++} END {print "Average:", sum/n}' file.txt
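A quick sketch of the average on made-up data (the contents of file.txt are invented). The second command is an optional variant, not part of the original recipe, that guards against an empty input, where sum/n would otherwise divide by zero:

```shell
tmpdir=$(mktemp -d)
printf 'run1 10\nrun2 20\nrun3 60\n' > "$tmpdir/file.txt"

# Accumulate column 2 on every line, report once at the end.
avg=$(awk 'BEGIN {sum = 0; n = 0} {sum += $2; n++} END {print "Average:", sum/n}' "$tmpdir/file.txt")
echo "$avg"

# Variant: only divide when at least one line was read.
: > "$tmpdir/empty.txt"
avg_empty=$(awk '{sum += $2; n++} END {if (n > 0) print "Average:", sum/n; else print "Average: n/a"}' "$tmpdir/empty.txt")
echo "$avg_empty"

rm -rf "$tmpdir"
```

The BEGIN block is optional here, since awk treats unset variables as 0 in arithmetic, but it makes the intent explicit.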

Multiple Files

As an example, let's join two files on their shared first column:

awk 'FNR==NR {dict[$1]=$2; next} $1 in dict {print $1, dict[$1], $2}' file1 file2

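Explanation: FNR==NR is only true while reading the first file, because FNR restarts at 1 for each file while NR keeps counting (this assumes the first file is not empty). During that time, awk builds a dictionary (dict) mapping the first column to the second. The next command skips the remaining rules for the current line, so the second block never runs on lines of the first file.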

When awk starts reading the second file, the second block runs. If the first column of the second file matches a key in the dictionary, it prints the key, the value from the dictionary, and the value from the second file.
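The join can be tried on two small files. The contents of file1 and file2 below are invented for this sketch; only keys present in both files appear in the output:

```shell
tmpdir=$(mktemp -d)
printf 'id1 apple\nid2 banana\nid3 cherry\n' > "$tmpdir/file1"
printf 'id2 yellow\nid3 red\nid4 purple\n' > "$tmpdir/file2"

# First file: fill dict. Second file: print rows whose key is in dict.
joined=$(awk 'FNR==NR {dict[$1]=$2; next} $1 in dict {print $1, dict[$1], $2}' "$tmpdir/file1" "$tmpdir/file2")
echo "$joined"

rm -rf "$tmpdir"
```

Here id1 (only in file1) and id4 (only in file2) are dropped, leaving the rows for id2 and id3.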

Other Useful Commands

head -n N filename - print first N lines

tail -n N filename - print last N lines

sort filename - sort file alphabetically

  • -r - reverse sort
  • -n - numerical sort
  • -kN,N - sort by the Nth column
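A short sketch of the sort options on a made-up two-column file (data.txt is a hypothetical name). Combining -k2,2 with -n sorts by the second column as numbers rather than as text:

```shell
tmpdir=$(mktemp -d)
printf 'b 2\na 10\nc 1\n' > "$tmpdir/data.txt"

# Default: alphabetical by whole line.
sorted_alpha=$(sort "$tmpdir/data.txt")
echo "$sorted_alpha"

# -r reverses the order.
sorted_rev=$(sort -r "$tmpdir/data.txt")
echo "$sorted_rev"

# -k2,2 restricts the key to column 2; -n compares it numerically.
sorted_num=$(sort -k2,2 -n "$tmpdir/data.txt")
echo "$sorted_num"

rm -rf "$tmpdir"
```

Without -n, the values in column 2 would be compared as text, so 10 would sort before 2.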

uniq sorted_file - removes adjacent duplicate lines, so the input should be sorted first

  • -c - counts occurrences of each line
  • -d - prints one copy of each line that is duplicated
  • -u - prints only lines that are not duplicated
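Since uniq only compares neighbouring lines, it is usually paired with sort. The sketch below uses an invented word list; the final awk step is only there to strip the padding that uniq -c puts in front of each count:

```shell
tmpdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\npear\n' > "$tmpdir/words.txt"
sort "$tmpdir/words.txt" > "$tmpdir/sorted.txt"

# One copy of every distinct line.
distinct=$(uniq "$tmpdir/sorted.txt")
echo "$distinct"

# Lines that occur more than once, and lines that occur exactly once.
dups=$(uniq -d "$tmpdir/sorted.txt")
singles=$(uniq -u "$tmpdir/sorted.txt")
echo "$dups"
echo "$singles"

# Frequency table, most common first.
counts=$(uniq -c "$tmpdir/sorted.txt" | sort -rn | awk '{print $1, $2}')
echo "$counts"

rm -rf "$tmpdir"
```

The sort | uniq -c | sort -rn pipeline is a common way to build a quick frequency table.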

wc -l filename - Count Lines

wc -w filename - Count Words
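A brief sketch of head, tail and wc on a made-up file (notes.txt is a hypothetical name). Reading the file through < makes wc print only the number, and tr removes the leading spaces that some wc implementations add:

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo words\nthree more words\nfour\nfive\n' > "$tmpdir/notes.txt"

# First two lines and last line.
first_two=$(head -n 2 "$tmpdir/notes.txt")
last_one=$(tail -n 1 "$tmpdir/notes.txt")
echo "$first_two"
echo "$last_one"

# Line and word counts as bare numbers.
n_lines=$(wc -l < "$tmpdir/notes.txt" | tr -d ' ')
n_words=$(wc -w < "$tmpdir/notes.txt" | tr -d ' ')
echo "$n_lines lines, $n_words words"

rm -rf "$tmpdir"
```

This prints the first two lines, the last line, and then 5 lines, 8 words.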
