Unix III: Warm up exercises

Course: VT25 Unix applied to genomic data (SC00036)

Let's warm up!!

Programs often expect a very specific input file format, so most of the time we need to reformat our files before we can use them.

1) Convert the file from BED to CSV format

Copy refGene.bed from /home/courses/Unix/files/. Reformat the file so that:

it contains only genes located on canonical chromosomes
genes are sorted by position
chromosome names read 13 instead of chr13
columns are separated by commas rather than tabs

For example, the original file looks like:

 chr13   50571142        50592603        NM_213590
 chr19   50180408        50191707        NM_198318
 chr6    90341942        90348474        NM_020466
 chr16   58497548        58547523        NM_020465
 chr5    180581942       180582890       NM_206880
 chr6    138743180       138893668       NM_020464

and your output should look like:

1,11873,14409,NR_046018
1,14361,29370,NR_024540
1,34610,36081,NR_026818
1,34610,36081,NR_026820
1,69090,70008,NM_001005484
1,134772,140566,NR_039983
1,323891,328581,NR_028322
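One possible way to approach this, sketched under a few assumptions: the file is tab-separated with four columns (chromosome, start, end, name), and "canonical" means chr1 to chr22 plus chrX and chrY. The function name `bed_to_csv` and the helper rank column are hypothetical, introduced only for this illustration. The rank column is there so that chromosomes sort as 1, 2, ..., 22, X, Y with a plain numeric sort.

```shell
# Hypothetical helper, not part of the course material.
# Usage (file names are examples): bed_to_csv refGene.bed > refGene.csv
bed_to_csv() {
    # $1: tab-separated BED file with columns chrom, start, end, name
    awk 'BEGIN { FS = "\t"; OFS = "," }
         # Assumption: canonical means chr1..chr22, chrX and chrY.
         $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ {
             chrom = substr($1, 4)            # chr13 -> 13
             rank = chrom                     # numeric sort key
             if (chrom == "X") rank = 23
             if (chrom == "Y") rank = 24
             print rank, chrom, $2, $3, $4    # joined with commas
         }' "$1" |
    sort -t, -k1,1n -k3,3n -k4,4n |           # by chromosome rank, start, end
    cut -d, -f2-                              # drop the helper rank column
}
```

If your definition of canonical chromosomes also includes chrM, the regular expression and the rank mapping would need one more case.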

2) Create a table from a VCF file

Copy ADK.snp.vcf from /home/courses/Unix/files/. This file contains a list of SNPs in VCF format, which is somewhat difficult to read.

To make it easier, reformat the list so that it has only the following columns:

 CHR     POS             ID            REF     ALT       GT       DP 
 chr10   74015787        rs78186808      T       C       1/1      6 
 chr10   74019676        rs3998474       A       C       1/1      4 
 chr10   74034179        .               G       A       0/1     10 
 chr10   74056770        rs118045036     A       G       1/1      4 
 chr10   74058471        rs10762507      C       G       1/1      4
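A minimal sketch of one way to build such a table, assuming a single-sample VCF in which column 9 (FORMAT) lists keys such as GT and DP and column 10 holds the matching values for the sample. The function name `vcf_to_table` is hypothetical. If DP is stored only in the INFO column of your file, the lookup would have to be adapted.

```shell
# Hypothetical helper, not part of the course material.
# Usage (file names are examples): vcf_to_table ADK.snp.vcf > ADK.snp.table
vcf_to_table() {
    # $1: single-sample VCF file (assumed layout, see the text above)
    awk 'BEGIN { FS = OFS = "\t"
                 print "CHR", "POS", "ID", "REF", "ALT", "GT", "DP" }
         /^#/ { next }                        # skip meta and header lines
         {
             n = split($9, key, ":")          # FORMAT keys, e.g. GT:AD:DP
             split($10, val, ":")             # values for the first sample
             gt = dp = "."                    # "." when a key is absent
             for (i = 1; i <= n; i++) {
                 if (key[i] == "GT") gt = val[i]
                 if (key[i] == "DP") dp = val[i]
             }
             print $1, $2, $3, $4, $5, gt, dp
         }' "$1"
}
```

Looking up GT and DP by name rather than by fixed position keeps the sketch working when the FORMAT keys come in a different order from line to line.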

3) Filter and reformat a TSV file

Copy occurrenceTable.txt from /home/courses/Unix/files/. This table contains mutation counts for different samples.

Filter and reformat the file so we display a table with mutations that:

are from the CDR1 region (look at the column names and reformat)
have an amino acid change (look at the data from the first column and reformat)
have at least one count in any of the samples

so it looks like:

 Mutation VH5_IgG1_CDR1  VH5_IgG3_CDR1   VH5_IgG8_CDR1   VH5_IgM1_CDR1   VH5_IgM3_CDR1   VH5_IgM8_CDR1
 S35>H   1       0       0       0       0       0
 S35>P   0       0       1       1       0       0
 S35>R   0       0       1       0       0       1
 S35>A   0       0       0       1       0       0
 S35>G   1       0       0       0       4       0
 S35>V   1       0       0       0       0       0
 S35>C   0       1       1       0       0       0
 S36>H   0       0       0       0       1       0
 S36>R   0       1       3       0       0       0
 S36>A   0       0       0       1       0       0
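The layout of occurrenceTable.txt is not shown here, so the following is only a rough sketch under strong assumptions: the file is tab-separated with one header row, the CDR1 columns can be recognized by "CDR1" in their name, and the first column already holds the mutation as reference letter, position, ">" and new letter (for example S35>H). If the real file encodes any of this differently, the reformatting hinted at in the task still has to be added. The function name `filter_cdr1` is hypothetical.

```shell
# Hypothetical helper, not part of the course material.
# Usage (file names are examples): filter_cdr1 occurrenceTable.txt
filter_cdr1() {
    # $1: tab-separated table, header in row 1, mutation in column 1
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 {
             # Remember which columns belong to the CDR1 region.
             for (i = 2; i <= NF; i++) if ($i ~ /CDR1/) keep[++n] = i
             line = "Mutation"
             for (j = 1; j <= n; j++) line = line OFS $(keep[j])
             print line
             next
         }
         {
             # Assumed mutation format: <ref><position>><new>, e.g. S35>H.
             split($1, part, ">")
             if (substr(part[1], 1, 1) == part[2]) next   # same residue, skip
             total = 0
             line = $1
             for (j = 1; j <= n; j++) {
                 total += $(keep[j])          # sum counts over CDR1 samples
                 line = line OFS $(keep[j])
             }
             if (total > 0) print line        # at least one count somewhere
         }' "$1"
}
```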

4) Summarize data

Copy targetRegions.txt from /home/courses/Unix/files/. These are the positions that were targeted for sequencing in a research project. How many regions per chromosome were targeted? Make sure to sort the output so that the chromosome with the most regions is at the top, like:

 8       chr1
 7       chr4
 6       chr3
 5       chr7
 5       chr22
 4       chrX
 4       chr5
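A counting task like this one is commonly handled with the sort, uniq -c, sort pattern. The sketch below assumes that the chromosome name is the first whitespace-separated column of targetRegions.txt and that any header or comment lines start with "#". The function name `regions_per_chrom` is hypothetical.

```shell
# Hypothetical helper, not part of the course material.
# Usage (file name is an example): regions_per_chrom targetRegions.txt
regions_per_chrom() {
    # $1: file with one region per line, chromosome in column 1 (assumed)
    awk '!/^#/ && NF { print $1 }' "$1" |     # keep only the chromosome name
    sort |                                    # group identical names together
    uniq -c |                                 # count each group
    sort -nr                                  # largest count first
}
```

The first sort is needed because uniq only collapses adjacent identical lines.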

5) Generate a count matrix from a VCF file

Copy S1.vcf, S2.vcf and S3.vcf from /home/courses/Unix/files/. These files show mutations detected in three different patients. Let's summarize them.

Generate a summary of the number of PASS mutations per chromosome per sample (use a for loop)
Save each sample summary in a file called SX.summary
Create a matrix, where each row represents a chromosome and each column is a sample (use awk)

The output should look like:

chr     S1      S2      S3
chr1    10160   9439    9986
chr2    10201   9657    10408
chr3    8063    7495    8221
chr4    7246    6932    7368
chr5    6885    6454    6629
chr6    6717    6345    6570
chr7    7857    7546    7822
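The three steps above could be sketched as follows. This assumes standard VCF columns (chromosome in column 1, FILTER in column 7), VCF files named after the sample in the current directory, and chromosomes listed in the order in which they first appear in the files. Both function names, `summarize_samples` and `build_matrix`, are hypothetical.

```shell
# Hypothetical helpers, not part of the course material.
# Usage (sample names are examples):
#   summarize_samples S1 S2 S3
#   build_matrix S1.summary S2.summary S3.summary > matrix.txt

summarize_samples() {
    # $@: sample names; reads <sample>.vcf, writes <sample>.summary
    for s in "$@"; do
        awk -F'\t' '!/^#/ && $7 == "PASS" {
                        if (!($1 in n)) order[++k] = $1   # first appearance
                        n[$1]++
                    }
                    END { for (i = 1; i <= k; i++) print order[i] "\t" n[order[i]] }
                   ' "$s.vcf" > "$s.summary"
    done
}

build_matrix() {
    # $@: summary files (chromosome <TAB> count), one per sample
    awk 'BEGIN { FS = OFS = "\t" }
         FNR == 1 {                           # a new summary file starts
             sample[++s] = FILENAME
             sub(/\.summary$/, "", sample[s]) # S1.summary -> S1
         }
         {
             if (!($1 in seen)) { seen[$1] = 1; chrom[++k] = $1 }
             count[$1, s] = $2
         }
         END {
             line = "chr"
             for (j = 1; j <= s; j++) line = line OFS sample[j]
             print line
             for (i = 1; i <= k; i++) {
                 line = chrom[i]
                 for (j = 1; j <= s; j++)     # 0 when a sample lacks the chromosome
                     line = line OFS (((chrom[i], j) in count) ? count[chrom[i], j] : 0)
                 print line
             }
         }' "$@"
}
```

Keeping the per-sample summaries as separate files mirrors the task description, and the matrix step then only has to join small two-column tables.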


Developed by Marcela Dávila, 2018.