Awk Crash Course - BDC-training/VT25 GitHub Wiki
The aim with this exercise is to learn how to use awk for table visualization and simple calculations.
The data used in this task are counts from a Drosophila melanogaster (the fruitfly), obtained using RNA-seq (Fly_B.counts
, you can find it at /home/courses/Unix/files/
). It consists of 6 columns:
- Chromosome
- Start
- End
- Strand
- Gene
- Counts
Click for results
$ head Fly_B.counts
chr2L 8387891 8388804 + CG13384 24
chr2L 16776850 16777439 + Mhc 25
chr2L 3145528 3145848 - CG12400 7
chr2L 6291245 6291684 + Ddr 0
chr2L 9436929 9437227 + RpS28-like 0
chr2L 10485538 10485790 - CG7456 102
chr2L 10485594 10485857 + Grip75 119
chr2L 11533877 11534547 - CR42743 0
chr2L 12582402 12583550 + CR44595 0
chr2L 15727581 15729156 - CycE 1090
awk
is a tool and a programming language within unix. awk
was deveoped by Alfred Aho, Peter Weinbergeer and Brian Kerninghan, thereby its name. It was developed to simplify data manipulation.
Start by looking at the first column in the file Fly_B.counts.
Column number in awk is written as $n
, meaning first column is called using $1
.
As there are many rows in this table pipe it to head.
awk '{print $1}' Fly_B.counts | head
Click for results
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
$0
in awk stands for the entire rows. Lets try it out. Keep the pipe to head
for only displaying the first 10 rows.
awk '{print $0}' Fly_B.counts | head
Click for results
chr2L 8387891 8388804 + CG13384 24
chr2L 16776850 16777439 + Mhc 25
chr2L 3145528 3145848 - CG12400 7
chr2L 6291245 6291684 + Ddr 0
chr2L 9436929 9437227 + RpS28-like 0
chr2L 10485538 10485790 - CG7456 102
chr2L 10485594 10485857 + Grip75 119
chr2L 11533877 11534547 - CR42743 0
chr2L 12582402 12583550 + CR44595 0
chr2L 15727581 15729156 - CycE 1090
You define the field separator using FS
, by default AWK understands space and tab. You can also change the output field separator using OFS
. Let's print all 6 columns with the output separator "_"
awk 'OFS = "_" {print $1,$2,$3,$4,$5,$6}' Fly_B.counts | head
Click for results
chr2L_8387891_8388804_+_CG13384_24
chr2L_16776850_16777439_+_Mhc_25
chr2L_3145528_3145848_-_CG12400_7
chr2L_6291245_6291684_+_Ddr_0
chr2L_9436929_9437227_+_RpS28-like_0
chr2L_10485538_10485790_-_CG7456_102
chr2L_10485594_10485857_+_Grip75_119
chr2L_11533877_11534547_-_CR42743_0
chr2L_12582402_12583550_+_CR44595_0
chr2L_15727581_15729156_-_CycE_1090
Maybe you only want to separate two specific columns with a different separator and use another one for the rest. Below we separate column 1 and 2 with "_" the rest are separated with a tab.
awk '{print $1"_"$2"\t"$3"\t"$4"\t"$5"\t"$6}' Fly_B.counts | head
Click for results
chr2L_8387891 8388804 + CG13384 24
chr2L_16776850 16777439 + Mhc 25
chr2L_3145528 3145848 - CG12400 7
chr2L_6291245 6291684 + Ddr 0
chr2L_9436929 9437227 + RpS28-like 0
chr2L_10485538 10485790 - CG7456 102
chr2L_10485594 10485857 + Grip75 119
chr2L_11533877 11534547 - CR42743 0
chr2L_12582402 12583550 + CR44595 0
chr2L_15727581 15729156 - CycE 1090
I think we are getting familiar with the awk essentials. Let's continue to perform some calculations. Column 2 and Column 3 in this count file corresponds the the start and the end coordinate of that gene. We can use awk to calculate the length of each gene by subtracting the Start Coord
from the End Coord
of the gene. Let's perform this calculation, keeping the Gene Name
in the output.
awk '{print $3-$2"\t"$5}' Fly_B.counts | head
Click for results
913 CG13384
589 Mhc
320 CG12400
439 Ddr
298 RpS28-like
252 CG7456
263 Grip75
670 CR42743
1148 CR44595
1575 CycE
You can also calculate the sum of an entire column. The amount of reads mapped to the gene is displayed in column 6. We can use awk to calculate the total counts in these samples by summing up column 6. This is done by creating a variable, we name it SUM
and then for each row we are adding the count from column 6 to the SUM variable
.
awk '{SUM+=$6; print SUM}' Fly_B.counts | head
Click for results
24
49
56
56
56
158
277
277
277
1367
The issue with the code snippet above is that the sum is printed each time awk loops over a row. It is probably more handy to print the sum after the entire calculation is performed. You can use END
for this. Also let's add a text string to add what we are printing to standard output.
awk '{SUM+=$6} END {print "Sum of counts:\t"SUM}' Fly_B.counts
Click for results
Sum of counts: 21874597A way of normalizing data is to divide each gene count with the sum of gene counts for the same sample. With this we are taking care of differences in sequencing depth. This will result in a fraction of that gene being expressed compared to the rest of the genes. The sum of these genes will be 1. As these numbers generally are very low one way to handle them is to multiply fraction values with 1 000 000. Lets try this normalization method using awk.
First we need to explain NR
and FNR
-
NR
stands for number of records being processed or put it simple the line number. -
FNR
is the number of records relative to the current input file.
As in this case it is the same file we just say when the line number of the first input file is the same as the second sum up column 6. When awk is done looping the first file the NR
will still increment but FNR
will reset. The next variable
means that when we are not reaching the rule for executing the record we are leaving the record. This means that we leave the first part summing up column 6 when we are done looping through file 6 and with the next file (the same file read a second time) we use the new sum variable
of column 6 to divide each gene count in column 6 with this sum. Finally we also redirect the standard output to a file called Fly_B_Normalized.counts
awk 'NR==FNR{SUM+=$6; next} {print $5"\t"$6/SUM*1000000}' Fly_B.counts Fly_B.counts > Fly_B_Normalized.counts
Click for results
$ head Fly_B_Normalized.counts CG13384 1.09716 Mhc 1.14288 CG12400 0.320006 Ddr 0 RpS28-like 0 CG7456 4.66294 Grip75 5.4401 CR42743 0 CR44595 0 CycE 49.8295$ awk '{SUM+=$2} END {print "Sum of counts:\t"SUM}' Fly_B_Normalized.counts Sum of counts: 1e+06
You can also perform pattern matching in awk. Say that we are only interested in chr2L we can specify only printing rows with the pattern chr2L in the first column. You use the character :~
_ for this. Lets pipe to uniq
to make sure that we did not get any other chromosomes using this command.
awk '$1 ~ /chr2L/ {print $1}' Fly_B.counts | uniq
Click for results
chr2L
You can add conditions to print a row. Say that you are only interested in genes with at least 10 counts. Then you can make sure that you only print rows where the genes on column 6 is larger then 9.
awk '$6 > 9 {print $0}' Fly_B.counts | head
Click for results
chr2L 8387891 8388804 + CG13384 24
chr2L 16776850 16777439 + Mhc 25
chr2L 10485538 10485790 - CG7456 102
chr2L 10485594 10485857 + Grip75 119
chr2L 15727581 15729156 - CycE 1090
chr2L 15727784 15729156 - CycE 1090
chr2L 259949 267505 - CG17075 12
chr2L 1695649 1704682 + chinmo 22
chr2L 2359142 2360393 - Rab5 351
chr2L 2488197 2490678 - Slh 30
Q.Lets try out everything you learnt in a final task. Try to calculate the sum of counts on chr2R using only genes with at least 10 reads into account using only awk. Add a string so the final output looks like the example below
Total geneCount on chr2R is: n reads
Hint: you can add multiple condition using the character &&
. If you find it easier you can also of course pipe the command.
Developed by Sanna Abrahamsson, 2021.