Phase 3: Text Processing - mishraxharshit/harshitxmishra.github.io GitHub Wiki

Previous: Phase 2 — Command Line | Next: Phase 4 — Users and Permissions


3.1 Why Text Processing Matters

In Linux, almost everything is a text file: configuration files, log files, CSV data, source code. Mastering text processing tools means you can inspect, transform, and extract information from any of these without writing a program.

The tools in this phase are composable: you pipe the output of one into the next.
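As a quick taste of that composability, the sketch below chains three tools to rank lines by frequency. It uses a small inline dataset via printf, so nothing here depends on files from your system:

```shell
# Rank lines by how often they occur: sort groups duplicates together,
# uniq -c counts each group, and sort -rn puts the biggest counts first.
printf 'apple\nbanana\napple\ncherry\napple\n' \
  | sort \
  | uniq -c \
  | sort -rn
# The most frequent line ("apple", seen 3 times) floats to the top.
```

This sort | uniq -c | sort -rn chain reappears throughout the rest of this phase.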


3.2 grep — Search for Patterns

grep searches each line of input for a pattern and prints lines that match.

# Basic search
grep "error" /var/log/syslog
# Prints every line containing the word "error"
 
# Case-insensitive search
grep -i "error" /var/log/syslog
 
# Show line numbers
grep -n "error" /var/log/syslog
 
# Print lines that do NOT match
grep -v "debug" /var/log/syslog
 
# Count matching lines
grep -c "error" /var/log/syslog
 
# Search recursively through directories
grep -r "password" /etc/
grep -r "TODO" ~/projects/ --include="*.py"
 
# Show context: 3 lines before and after each match
grep -B3 -A3 "critical" /var/log/syslog
 
# Extended regular expressions (more powerful patterns)
grep -E "error|warning|critical" /var/log/syslog
grep -E "^root" /etc/passwd         # lines starting with "root"
grep -E "[0-9]{3}-[0-9]{4}" contacts.txt  # phone number pattern
 
# Print only the matching part, not the whole line
grep -o "192\.[0-9.]*" /var/log/nginx/access.log   # extract IP addresses

3.3 cut — Extract Fields from Lines

# Cut by delimiter (-d) and select field (-f)
cut -d: -f1 /etc/passwd          # extract usernames (field 1)
cut -d: -f1,3 /etc/passwd        # fields 1 and 3
cut -d, -f2 data.csv             # second column of a CSV
 
# Cut by character position
cut -c1-10 file.txt              # first 10 characters of each line
cut -c5- file.txt                # from character 5 to end of line
 
# Example: extract all email addresses from a file
grep -o "[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]*" contacts.txt

3.4 sort and uniq — Sorting and Deduplication

# Sort alphabetically
sort names.txt
 
# Sort in reverse
sort -r names.txt
 
# Sort numerically (important: alphabetic sort puts 10 before 2)
sort -n numbers.txt
 
# Sort by specific field (tab-delimited)
sort -t$'\t' -k2 data.tsv        # sort by second column
 
# Sort by file size (useful with ls)
ls -l | sort -k5 -n              # sort ls output by size column
 
# Remove duplicate lines (input must be sorted)
sort names.txt | uniq
 
# Count occurrences of each unique line
sort names.txt | uniq -c
 
# Sort by frequency (most common first)
sort names.txt | uniq -c | sort -rn | head -20
 
# Real use case: find the most common IP addresses in a web log
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

3.5 sed — Stream Editor (Search and Replace)

sed edits text as it flows through, applying transformations to each line.

# Substitute: replace first occurrence per line
sed 's/old/new/' file.txt
 
# Substitute: replace ALL occurrences per line (g = global)
sed 's/old/new/g' file.txt
 
# Case-insensitive substitute (the I flag is a GNU sed extension)
sed 's/old/new/gI' file.txt
 
# Edit file in place (-i flag)
sed -i 's/localhost/192.168.1.100/g' /etc/myapp/config.conf
 
# Edit in place but keep a backup
sed -i.bak 's/old/new/g' config.conf
# Original saved as config.conf.bak
 
# Delete lines matching a pattern
sed '/^#/d' config.conf        # delete comment lines
sed '/^$/d' file.txt           # delete blank lines
 
# Print only matching lines (like grep)
sed -n '/error/p' log.txt
 
# Print specific line numbers
sed -n '20,30p' file.txt       # print lines 20 to 30
 
# Add a line after a match
sed '/\[database\]/a host = localhost' config.conf
 
# Real use case: change a hostname in multiple config files
find /etc/myapp/ -name "*.conf" -exec sed -i 's/old.host/new.host/g' {} \;

3.6 awk — Pattern Scanning and Processing

awk is a full programming language for processing tabular text. It reads input line by line, splits each line into fields, and lets you perform actions on them.

# Basic syntax: awk 'pattern { action }' file
 
# Print specific fields (NF = number of fields, NR = line number)
awk '{print $1}' access.log           # first field
awk '{print $1, $4}' access.log       # first and fourth fields
awk '{print NR, $0}' file.txt         # line number + whole line
 
# Delimiter (-F flag)
awk -F: '{print $1}' /etc/passwd      # colon-delimited, field 1
awk -F, '{print $2}' data.csv         # comma-delimited, field 2
 
# Pattern matching
awk '/error/' log.txt                 # print lines containing "error"
awk '!/debug/' log.txt                # print lines NOT containing "debug"
awk '$3 > 1000' data.txt              # lines where field 3 is greater than 1000
 
# Arithmetic
awk '{sum += $5} END {print "Total:", sum}' sales.csv   # sum column 5
awk 'NR > 1 {total += $2} END {print total/(NR-1)}' data.txt  # average of column 2, excluding the header line
 
# Real use case: summarise disk usage from df output
df -h | awk 'NR>1 {print $5, $6}' | sort -rn
# Prints usage percentage and mount point, sorted by usage
 
# Real use case: extract failed login attempts from auth log
awk '/Failed password/ {print $11}' /var/log/auth.log | sort | uniq -c | sort -rn
# Note: $11 is the source IP for normal entries; "invalid user" lines shift the fields by two

3.7 tr — Translate Characters

# Replace lowercase with uppercase
echo "hello world" | tr 'a-z' 'A-Z'
# HELLO WORLD
 
# Delete specific characters
echo "hello123" | tr -d '0-9'
# hello
 
# Squeeze repeated characters into one
echo "hello   world" | tr -s ' '
# hello world
 
# Convert Windows line endings to Unix
tr -d '\r' < windows.txt > unix.txt

3.8 Practical Pipelines

These examples show how to combine tools to solve real problems.

Find the most frequently appearing words in a file:

tr -s ' ' '\n' < book.txt | tr -d '.,;:!?' | sort | uniq -c | sort -rn | head -20

Parse an Apache/Nginx access log for the most-visited pages:

awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

Find how many 404 errors occurred per hour:

grep " 404 " /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f2 | sort | uniq -c

Extract all email addresses from a directory of files:

grep -roh "[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*\.[a-zA-Z]*" /var/mail/ | sort | uniq

Phase 3 Exercises

Exercise 1: Use grep to count how many user accounts in /etc/passwd use /bin/bash as their shell.

Exercise 2: Extract just the usernames and home directories from /etc/passwd using cut. Fields are colon-separated, username is field 1, home is field 6.

Exercise 3: Find the five most common words in any text file using the pipeline: tr, sort, uniq -c, sort -rn, head.

Exercise 4: Use sed to delete all comment lines (starting with #) and all blank lines from /etc/ssh/sshd_config. Print the result to the screen without modifying the file (do not use -i).

Exercise 5: Use awk to calculate the total size (in bytes) of all files in /var/log by summing field 5 of ls -l output.
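One possible set of solutions is sketched below for self-checking. The field positions assume the standard colon-separated /etc/passwd layout and GNU coreutils ls -l output, and book.txt stands in for whatever text file you choose:

```shell
# Exercise 1: count accounts whose shell is /bin/bash (anchor on the final field)
grep -c ':/bin/bash$' /etc/passwd

# Exercise 2: username (field 1) and home directory (field 6)
cut -d: -f1,6 /etc/passwd

# Exercise 3: five most common words in a text file
tr -s ' ' '\n' < book.txt | sort | uniq -c | sort -rn | head -5

# Exercise 4: strip comment lines and blank lines, printing to the screen
sed -e '/^#/d' -e '/^$/d' /etc/ssh/sshd_config

# Exercise 5: total bytes in /var/log (NR > 1 skips the "total" header line)
ls -l /var/log | awk 'NR > 1 {sum += $5} END {print sum}'
```

There are usually several valid pipelines for each exercise; as long as your version produces the same result, it counts.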

