Awk Crash Course - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)


Aim

The aim with this exercise is to learn how to use awk for table visualization and simple calculations.

The data

The data used in this task are counts from a Drosophila melanogaster (the fruitfly), obtained using RNA-seq (Fly_B.counts, you can find it at /home/courses/Unix/files/). It consists of 6 columns:

  1. Chromosome
  2. Start
  3. End
  4. Strand
  5. Gene
  6. Counts
Click for results
$ head Fly_B.counts

chr2L   8387891 8388804 +       CG13384 24
chr2L   16776850        16777439        +       Mhc     25
chr2L   3145528 3145848 -       CG12400 7
chr2L   6291245 6291684 +       Ddr     0
chr2L   9436929 9437227 +       RpS28-like      0
chr2L   10485538        10485790        -       CG7456  102
chr2L   10485594        10485857        +       Grip75  119
chr2L   11533877        11534547        -       CR42743 0
chr2L   12582402        12583550        +       CR44595 0
chr2L   15727581        15729156        -       CycE    1090

awk is a tool and a programming language within unix. awk was deveoped by Alfred Aho, Peter Weinbergeer and Brian Kerninghan, thereby its name. It was developed to simplify data manipulation.

Basics

Start by looking at the first column in the file Fly_B.counts. Column number in awk is written as $n, meaning first column is called using $1.
As there are many rows in this table pipe it to head.

awk '{print $1}' Fly_B.counts | head
Click for results
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L
chr2L

$0 in awk stands for the entire rows. Lets try it out. Keep the pipe to head for only displaying the first 10 rows.

awk '{print $0}' Fly_B.counts | head
Click for results
chr2L   8387891 8388804 +   CG13384 24
chr2L   16776850    16777439    +   Mhc 25
chr2L   3145528 3145848 -   CG12400 7
chr2L   6291245 6291684 +   Ddr 0
chr2L   9436929 9437227 +   RpS28-like  0
chr2L   10485538    10485790    -   CG7456  102
chr2L   10485594    10485857    +   Grip75  119
chr2L   11533877    11534547    -   CR42743 0
chr2L   12582402    12583550    +   CR44595 0
chr2L   15727581    15729156    -   CycE    1090

You define the field separator using FS, by default AWK understands space and tab. You can also change the output field separator using OFS. Let's print all 6 columns with the output separator "_"

awk 'OFS = "_" {print $1,$2,$3,$4,$5,$6}' Fly_B.counts | head
Click for results
chr2L_8387891_8388804_+_CG13384_24
chr2L_16776850_16777439_+_Mhc_25
chr2L_3145528_3145848_-_CG12400_7
chr2L_6291245_6291684_+_Ddr_0
chr2L_9436929_9437227_+_RpS28-like_0
chr2L_10485538_10485790_-_CG7456_102
chr2L_10485594_10485857_+_Grip75_119
chr2L_11533877_11534547_-_CR42743_0
chr2L_12582402_12583550_+_CR44595_0
chr2L_15727581_15729156_-_CycE_1090

Maybe you only want to separate two specific columns with a different separator and use another one for the rest. Below we separate column 1 and 2 with "_" the rest are separated with a tab.

awk '{print $1"_"$2"\t"$3"\t"$4"\t"$5"\t"$6}' Fly_B.counts | head
Click for results
chr2L_8387891   8388804 +   CG13384 24
chr2L_16776850  16777439    +   Mhc 25
chr2L_3145528   3145848 -   CG12400 7
chr2L_6291245   6291684 +   Ddr 0
chr2L_9436929   9437227 +   RpS28-like  0
chr2L_10485538  10485790    -   CG7456  102
chr2L_10485594  10485857    +   Grip75  119
chr2L_11533877  11534547    -   CR42743 0
chr2L_12582402  12583550    +   CR44595 0
chr2L_15727581  15729156    -   CycE    1090

Calculations

I think we are getting familiar with the awk essentials. Let's continue to perform some calculations. Column 2 and Column 3 in this count file corresponds the the start and the end coordinate of that gene. We can use awk to calculate the length of each gene by subtracting the Start Coord from the End Coord of the gene. Let's perform this calculation, keeping the Gene Name in the output.

awk '{print $3-$2"\t"$5}' Fly_B.counts | head
Click for results
913 CG13384
589 Mhc
320 CG12400
439 Ddr
298 RpS28-like
252 CG7456
263 Grip75
670 CR42743
1148    CR44595
1575    CycE

You can also calculate the sum of an entire column. The amount of reads mapped to the gene is displayed in column 6. We can use awk to calculate the total counts in these samples by summing up column 6. This is done by creating a variable, we name it SUM and then for each row we are adding the count from column 6 to the SUM variable.

awk '{SUM+=$6; print SUM}' Fly_B.counts | head
Click for results
24
49
56
56
56
158
277
277
277
1367

The issue with the code snippet above is that the sum is printed each time awk loops over a row. It is probably more handy to print the sum after the entire calculation is performed. You can use END for this. Also let's add a text string to add what we are printing to standard output.

awk '{SUM+=$6} END {print "Sum of counts:\t"SUM}' Fly_B.counts
Click for results Sum of counts: 21874597

A way of normalizing data is to divide each gene count with the sum of gene counts for the same sample. With this we are taking care of differences in sequencing depth. This will result in a fraction of that gene being expressed compared to the rest of the genes. The sum of these genes will be 1. As these numbers generally are very low one way to handle them is to multiply fraction values with 1 000 000. Lets try this normalization method using awk.

First we need to explain NR and FNR

  • NR stands for number of records being processed or put it simple the line number.
  • FNR is the number of records relative to the current input file.

As in this case it is the same file we just say when the line number of the first input file is the same as the second sum up column 6. When awk is done looping the first file the NR will still increment but FNR will reset. The next variable means that when we are not reaching the rule for executing the record we are leaving the record. This means that we leave the first part summing up column 6 when we are done looping through file 6 and with the next file (the same file read a second time) we use the new sum variable of column 6 to divide each gene count in column 6 with this sum. Finally we also redirect the standard output to a file called Fly_B_Normalized.counts

awk 'NR==FNR{SUM+=$6; next} {print $5"\t"$6/SUM*1000000}' Fly_B.counts Fly_B.counts > Fly_B_Normalized.counts
Click for results $ head Fly_B_Normalized.counts CG13384 1.09716 Mhc 1.14288 CG12400 0.320006 Ddr 0 RpS28-like 0 CG7456 4.66294 Grip75 5.4401 CR42743 0 CR44595 0 CycE 49.8295

$ awk '{SUM+=$2} END {print "Sum of counts:\t"SUM}' Fly_B_Normalized.counts Sum of counts: 1e+06

Conditional selection

You can also perform pattern matching in awk. Say that we are only interested in chr2L we can specify only printing rows with the pattern chr2L in the first column. You use the character :~_ for this. Lets pipe to uniq to make sure that we did not get any other chromosomes using this command.

awk '$1 ~ /chr2L/ {print $1}' Fly_B.counts | uniq
Click for results
chr2L

You can add conditions to print a row. Say that you are only interested in genes with at least 10 counts. Then you can make sure that you only print rows where the genes on column 6 is larger then 9.

awk '$6 > 9 {print $0}' Fly_B.counts | head
Click for results
chr2L   8387891 8388804 +   CG13384 24
chr2L   16776850    16777439    +   Mhc 25
chr2L   10485538    10485790    -   CG7456  102
chr2L   10485594    10485857    +   Grip75  119
chr2L   15727581    15729156    -   CycE    1090
chr2L   15727784    15729156    -   CycE    1090
chr2L   259949  267505  -   CG17075 12
chr2L   1695649 1704682 +   chinmo  22
chr2L   2359142 2360393 -   Rab5    351
chr2L   2488197 2490678 -   Slh 30

Q.Lets try out everything you learnt in a final task. Try to calculate the sum of counts on chr2R using only genes with at least 10 reads into account using only awk. Add a string so the final output looks like the example below

Total geneCount on chr2R is: n reads

Hint: you can add multiple condition using the character &&. If you find it easier you can also of course pipe the command.



Developed by Sanna Abrahamsson, 2021.

⚠️ **GitHub.com Fallback** ⚠️