Phase 3: Text Processing
Previous: Phase 2 — Command Line | Next: Phase 4 — Users and Permissions
3.1 Why Text Processing Matters
In Linux, almost everything is a text file: configuration files, log files, CSV data, source code. Mastering text processing tools means you can inspect, transform, and extract information from any of these without writing a program.
The tools in this phase are composable: you pipe the output of one into the next.
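A tiny example of that composability (assuming your distribution logs to /var/log/syslog):
grep -i "error" /var/log/syslog | wc -l
# grep filters the lines, wc -l counts them: two small tools joined by one pipe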
3.2 grep — Search for Patterns
grep searches each line of input for a pattern and prints lines that match.
# Basic search
grep "error" /var/log/syslog
# Prints every line containing the string "error" (it matches substrings, not whole words)
# Case-insensitive search
grep -i "error" /var/log/syslog
# Show line numbers
grep -n "error" /var/log/syslog
# Print lines that do NOT match
grep -v "debug" /var/log/syslog
# Count matching lines
grep -c "error" /var/log/syslog
# Search recursively through directories
grep -r "password" /etc/
grep -r "TODO" ~/projects/ --include="*.py"
# Show context: 3 lines before and after each match
grep -B3 -A3 "critical" /var/log/syslog
# Extended regular expressions (more powerful patterns)
grep -E "error|warning|critical" /var/log/syslog
grep -E "^root" /etc/passwd # lines starting with "root"
grep -E "[0-9]{3}-[0-9]{4}" contacts.txt # phone number pattern
# Print only the matching part, not the whole line
grep -o "192\.[0-9.]*" /var/log/nginx/access.log # extract IP addresses
3.3 cut — Extract Fields from Lines
# Cut by delimiter (-d) and select field (-f)
cut -d: -f1 /etc/passwd # extract usernames (field 1)
cut -d: -f1,3 /etc/passwd # fields 1 and 3
cut -d, -f2 data.csv # second column of a CSV
# Cut by character position
cut -c1-10 file.txt # first 10 characters of each line
cut -c5- file.txt # from character 5 to end of line
# Example: extract all email addresses from a file (this one uses grep -o, not cut)
grep -o "[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]*" contacts.txt
3.4 sort and uniq — Sorting and Deduplication
# Sort alphabetically
sort names.txt
# Sort in reverse
sort -r names.txt
# Sort numerically (important: alphabetic sort puts 10 before 2)
sort -n numbers.txt
# Sort by specific field (tab-delimited)
sort -t$'\t' -k2 data.tsv # sort by second column
# Sort by file size (useful with ls)
ls -l | sort -k5 -n # sort ls output by size column
# Remove duplicate lines (uniq only drops adjacent duplicates, so sort first)
sort names.txt | uniq
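sort can do the deduplication itself with -u:
sort -u names.txt # same result as: sort names.txt | uniq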
# Count occurrences of each unique line
sort names.txt | uniq -c
# Sort by frequency (most common first)
sort access.log | uniq -c | sort -rn | head -20
# Real use case: find the most common IP addresses in a web log
# (awk, covered in 3.6, grabs the first field here)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
3.5 sed — Stream Editor (Search and Replace)
sed edits text as it flows through, applying transformations to each line.
# Substitute: replace first occurrence per line
sed 's/old/new/' file.txt
# Substitute: replace ALL occurrences per line (g = global)
sed 's/old/new/g' file.txt
# Case-insensitive substitute
sed 's/old/new/gI' file.txt
# Edit file in place (-i flag)
sed -i 's/localhost/192.168.1.100/g' /etc/myapp/config.conf
# Edit in place but keep a backup
sed -i.bak 's/old/new/g' config.conf
# Original saved as config.conf.bak
# Delete lines matching a pattern
sed '/^#/d' config.conf # delete comment lines
sed '/^$/d' file.txt # delete blank lines
# Print only matching lines (like grep)
sed -n '/error/p' log.txt
# Print specific line numbers
sed -n '20,30p' file.txt # print lines 20 to 30
# Add a line after a match
sed '/\[database\]/a host = localhost' config.conf
# Real use case: change a hostname in multiple config files
find /etc/myapp/ -name "*.conf" -exec sed -i 's/old\.host/new\.host/g' {} \;
# Escape the dots: an unescaped . matches any character in a regex
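sed can also capture part of what it matches and reuse it in the replacement. A minimal sketch with GNU sed's -E (extended regex) flag and a hypothetical contacts.txt:
sed -E 's/([a-zA-Z0-9._%+-]+)@example\.com/\1@example.org/g' contacts.txt
# \1 refers to the first parenthesised group: keep the local part, swap the domain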
3.6 awk — Pattern Scanning and Processing
awk is a full programming language for processing tabular text. It reads input line by line, splits each line into fields, and lets you perform actions on them.
# Basic syntax: awk 'pattern { action }' file
# Print specific fields (NF = number of fields, NR = line number)
awk '{print $1}' access.log # first field
awk '{print $1, $4}' access.log # first and fourth fields
awk '{print NR, $0}' file.txt # line number + whole line
# Delimiter (-F flag)
awk -F: '{print $1}' /etc/passwd # colon-delimited, field 1
awk -F, '{print $2}' data.csv # comma-delimited, field 2
# Pattern matching
awk '/error/' log.txt # print lines containing "error"
awk '!/debug/' log.txt # print lines NOT containing "debug"
awk '$3 > 1000' data.txt # lines where field 3 is greater than 1000
# Arithmetic
awk '{sum += $5} END {print "Total:", sum}' sales.csv # sum column 5
awk 'NR > 1 {total += $2} END {print total/(NR-1)}' data.txt # average, skipping the header (NR counts the header line too, hence NR-1)
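awk also has a BEGIN block that runs before any input is read, and printf for formatted output. A short sketch against /etc/passwd:
awk -F: 'BEGIN {print "USER            SHELL"} {printf "%-16s %s\n", $1, $7}' /etc/passwd
# BEGIN prints the header once; %-16s left-aligns the username in a 16-character column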
# Real use case: summarise disk usage from df output
df -h | awk 'NR>1 {print $5, $6}' | sort -rn
# Prints usage percentage and mount point, sorted by usage
# Real use case: extract failed login attempts from auth log
awk '/Failed password/ {print $11}' /var/log/auth.log | sort | uniq -c | sort -rn
# $11 is the source IP in the common "Failed password for USER from IP ..." format;
# "invalid user" lines shift the fields, so check your log's layout first
3.7 tr — Translate Characters
# Replace lowercase with uppercase
echo "hello world" | tr 'a-z' 'A-Z'
# HELLO WORLD
# Delete specific characters
echo "hello123" | tr -d '0-9'
# hello
# Squeeze repeated characters into one
echo "hello world" | tr -s ' '
# hello world
# Convert Windows line endings to Unix
tr -d '\r' < windows.txt > unix.txt
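A handy trick on Linux: pair tr with /dev/urandom to generate a random string (a sketch; -c complements the set, so everything except letters and digits is deleted):
tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 16; echo
# head -c 16 stops after 16 characters; echo adds the final newline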
3.8 Practical Pipelines
These examples show how to combine tools to solve real problems.
Find the most frequently appearing words in a file:
tr ' ' '\n' < book.txt | tr -d '.,;:!?' | sort | uniq -c | sort -rn | head -20
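The naive split above is case-sensitive and counts "The" and "the" separately. A refined sketch that folds case and squeezes runs of spaces first:
tr 'A-Z' 'a-z' < book.txt | tr -s ' ' '\n' | tr -d '.,;:!?' | sort | uniq -c | sort -rn | head -20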
Parse an Apache/Nginx access log for the most-visited pages:
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
Find how many 404 errors occurred in each hour of the day:
grep " 404 " /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f2 | sort | uniq -c
Extract all email addresses from a directory of files:
grep -roh "[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]*" /var/mail/ | sort | uniq
Phase 3 Exercises
Exercise 1: Use grep to count how many user accounts in /etc/passwd use /bin/bash as their shell.
Exercise 2: Extract just the usernames and home directories from /etc/passwd using cut. Fields are colon-separated, username is field 1, home is field 6.
Exercise 3: Find the five most common words in any text file using the pipeline: tr, sort, uniq -c, sort -rn, head.
Exercise 4: Use sed to delete all comment lines (starting with #) and all blank lines from /etc/ssh/sshd_config. Print the result to screen without modifying the file (do not use -i).
Exercise 5: Use awk to calculate the total size (in bytes) of all files in /var/log by summing field 5 of ls -l output.
Previous: Phase 2 — Command Line | Next: Phase 4 — Users and Permissions