Unix III: Warm up exercises - BDC-training/VT25 GitHub Wiki
Course: VT25 Unix applied to genomic data (SC00036)
Let's warm up!!
Programs often expect a very specific input file format, so most of the time we need to reformat our files before we can use them.
1) Convert the file from BED to CSV format
Copy refGene.bed from /home/courses/Unix/files/.
Reformat the file so that:
- only genes located on canonical chromosomes are kept
- genes are sorted by position
- the chromosome name reads 13 instead of chr13
- the columns are separated by a comma rather than a tab
For example, the original file looks like:
chr13 50571142 50592603 NM_213590
chr19 50180408 50191707 NM_198318
chr6 90341942 90348474 NM_020466
chr16 58497548 58547523 NM_020465
chr5 180581942 180582890 NM_206880
chr6 138743180 138893668 NM_020464
and your output should look like:
1,11873,14409,NR_046018
1,14361,29370,NR_024540
1,34610,36081,NR_026818
1,34610,36081,NR_026820
1,69090,70008,NM_001005484
1,134772,140566,NR_039983
1,323891,328581,NR_028322
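One possible approach to exercise 1, as a minimal sketch rather than the official solution. It assumes that "canonical" means chr1 to chr22 plus chrX and chrY, and that "sorted by position" means by chromosome and then by start coordinate. The input is a tiny stand-in file in a scratch directory; the chrUn_gl000220 and chrX rows and their gene names are made up for illustration.

```shell
# Scratch directory with a small tab-separated stand-in for refGene.bed.
# The last two rows are hypothetical and only there to exercise the filter.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr13 50571142 50592603 NM_213590 \
  chr19 50180408 50191707 NM_198318 \
  chr6 138743180 138893668 NM_020464 \
  chr6 90341942 90348474 NM_020466 \
  chrUn_gl000220 100 200 HYPOTHETICAL_1 \
  chrX 1000 2000 HYPOTHETICAL_2 > "$tmpdir/refGene.bed"

# 1. keep canonical chromosomes (assumed: chr1..chr22, chrX, chrY)
# 2. strip the "chr" prefix; editing $1 rebuilds the line with OFS (commas)
# 3. prepend a numeric rank so X and Y sort after the autosomes
# 4. sort by rank, then by start position, and drop the rank again
awk -F'\t' -v OFS=',' '
  $1 ~ /^chr([0-9]+|X|Y)$/ {
    sub(/^chr/, "", $1)
    rank = ($1 == "X") ? 23 : (($1 == "Y") ? 24 : $1)
    print rank, $0
  }' "$tmpdir/refGene.bed" |
  sort -t, -k1,1n -k3,3n |
  cut -d, -f2- > "$tmpdir/refGene.csv"

cat "$tmpdir/refGene.csv"
```

The rank column is a design choice that keeps the sketch independent of sort extensions; with GNU sort, a version sort on the chromosome column would be a shorter alternative.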
2) Create a table from a VCF file
Copy ADK.snp.vcf from /home/courses/Unix/files/.
This file contains a list of SNPs in VCF format, which is somewhat difficult to read.
To make it easier, reformat the list so that it has only the following columns:
CHR POS ID REF ALT GT DP
chr10 74015787 rs78186808 T C 1/1 6
chr10 74019676 rs3998474 A C 1/1 4
chr10 74034179 . G A 0/1 10
chr10 74056770 rs118045036 A G 1/1 4
chr10 74058471 rs10762507 C G 1/1 4
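A hedged sketch of how exercise 2 could be approached. It assumes a single-sample VCF in which both GT and DP are listed in the FORMAT column (column 9) with their values in the sample column (column 10); if DP only appears in the INFO column of ADK.snp.vcf, the lookup would need to change. The QUAL, FILTER, INFO and AD values in the stand-in file are invented for illustration.

```shell
# Tiny stand-in for ADK.snp.vcf; only CHROM, POS, ID, REF, ALT, GT and DP
# mirror the example table, every other value is hypothetical.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf 'chr10\t74015787\trs78186808\tT\tC\t50\tPASS\t.\tGT:AD:DP\t1/1:0,6:6\n'
  printf 'chr10\t74034179\t.\tG\tA\t50\tPASS\t.\tGT:AD:DP\t0/1:6,4:10\n'
} > "$tmpdir/ADK.snp.vcf"

# Skip header lines, then look up GT and DP by name in the FORMAT column
# so the result does not depend on the order of the FORMAT keys.
awk -F'\t' -v OFS='\t' '
  BEGIN { print "CHR", "POS", "ID", "REF", "ALT", "GT", "DP" }
  /^#/  { next }
  {
    n = split($9, keys, ":")
    split($10, vals, ":")
    gt = "."; dp = "."
    for (i = 1; i <= n; i++) {
      if (keys[i] == "GT") gt = vals[i]
      if (keys[i] == "DP") dp = vals[i]
    }
    print $1, $2, $3, $4, $5, gt, dp
  }' "$tmpdir/ADK.snp.vcf" > "$tmpdir/ADK.snp.tsv"

cat "$tmpdir/ADK.snp.tsv"
```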
3) Filter and reformat a TSV file
Copy occurrenceTable.txt from /home/courses/Unix/files/.
This table contains mutation counts for different samples.
Filter and reformat the file so we display a table with mutations that:
- are from the CDR1 region (look at the column names and reformat)
- have an amino acid change (look at the data in the first column and reformat)
- have at least one count in any of the samples
so it looks like:
Mutation VH5_IgG1_CDR1 VH5_IgG3_CDR1 VH5_IgG8_CDR1 VH5_IgM1_CDR1 VH5_IgM3_CDR1 VH5_IgM8_CDR1
S35>H 1 0 0 0 0 0
S35>P 0 0 1 1 0 0
S35>R 0 0 1 0 0 1
S35>A 0 0 0 1 0 0
S35>G 1 0 0 0 4 0
S35>V 1 0 0 0 0 0
S35>C 0 1 1 0 0 0
S36>H 0 0 0 0 1 0
S36>R 0 1 3 0 0 0
S36>A 0 0 0 1 0 0
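The three filters can be sketched in a single awk pass. Since the layout of occurrenceTable.txt is not shown here, everything about the input below is an assumption: a tab-separated table with a header row, column names that contain the region name (for example VH5_IgG1_CDR1), and a first column written as reference residue, position, new residue (for example S35H, with S35S meaning no change). Adjust the patterns after inspecting the real file; the real column names may also need reformatting, which this sketch does not attempt.

```shell
# Hypothetical stand-in for occurrenceTable.txt (layout and values assumed).
tmpdir=$(mktemp -d)
{
  printf 'Mutation\tVH5_IgG1_CDR1\tVH5_IgG1_CDR2\tVH5_IgM1_CDR1\n'
  printf 'S35H\t1\t0\t0\n'
  printf 'S35S\t5\t2\t3\n'
  printf 'S36R\t0\t4\t0\n'
  printf 'S36A\t0\t0\t1\n'
} > "$tmpdir/occurrenceTable.txt"

awk -F'\t' -v OFS='\t' '
  NR == 1 {
    # remember which columns belong to the CDR1 region
    out = "Mutation"
    for (i = 2; i <= NF; i++)
      if ($i ~ /CDR1/) { keep[++n] = i; out = out OFS $i }
    print out
    next
  }
  {
    ref = substr($1, 1, 1)              # first character: original residue
    alt = substr($1, length($1), 1)     # last character: new residue
    if (ref == alt) next                # no amino acid change
    total = 0; row = ""
    for (j = 1; j <= n; j++) {
      total += $(keep[j])
      row = row OFS $(keep[j])
    }
    # keep rows with at least one count; rewrite S35H as S35>H
    if (total > 0) print substr($1, 1, length($1) - 1) ">" alt row
  }' "$tmpdir/occurrenceTable.txt" > "$tmpdir/occurrenceTable.CDR1.txt"

cat "$tmpdir/occurrenceTable.CDR1.txt"
```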
4) Summarize data
Copy targetRegions.txt from /home/courses/Unix/files/. These are the positions that were targeted for sequencing in a research project.
How many regions per chromosome were targeted? Order the output so that the chromosome with the most regions is at the top, like:
8 chr1
7 chr4
6 chr3
5 chr7
5 chr22
4 chrX
4 chr5
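This kind of summary is a typical cut, sort, uniq pipeline. The sketch below assumes that targetRegions.txt is tab-separated, has no header line, and holds the chromosome name in its first column; the regions in the stand-in file are made up.

```shell
# Hypothetical stand-in for targetRegions.txt: chromosome, start, end.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  chr1 100 200 \
  chr4 100 200 \
  chr1 300 400 \
  chrX 100 200 \
  chr1 500 600 \
  chr4 300 400 > "$tmpdir/targetRegions.txt"

# Extract the chromosome column, group identical names together,
# count each group, then sort by count in descending numeric order.
cut -f1 "$tmpdir/targetRegions.txt" |
  sort |
  uniq -c |
  sort -k1,1nr > "$tmpdir/regions.counts"

cat "$tmpdir/regions.counts"
```

The first sort is needed because uniq only merges adjacent identical lines.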
5) Generate a count matrix from a VCF file
Copy S1.vcf, S2.vcf and S3.vcf from /home/courses/Unix/files/.
These files show mutations detected in three different patients; let's summarize them.
- Generate a summary of the number of PASS mutations per chromosome per sample (use a for loop)
- Save each sample summary in a file called SX.summary
- Create a matrix where each row represents a chromosome and each column is a sample (use awk)
The output should look like:
chr S1 S2 S3
chr1 10160 9439 9986
chr2 10201 9657 10408
chr3 8063 7495 8221
chr4 7246 6932 7368
chr5 6885 6454 6629
chr6 6717 6345 6570
chr7 7857 7546 7822
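A minimal sketch of the for loop and the awk matrix step, not the official solution. It assumes that the FILTER value is in column 7 of each VCF, as in the standard layout, and that a summary line is written as chromosome followed by count. The three stand-in VCFs, the row helper and the LowQual filter label are hypothetical and only there to make the example runnable.

```shell
# Hypothetical stand-ins for S1.vcf, S2.vcf and S3.vcf.
tmpdir=$(mktemp -d)
row() { printf '%s\t1\t.\tA\tG\t50\t%s\n' "$1" "$2"; }   # chromosome, FILTER
{ printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n'
  row chr1 PASS; row chr1 PASS; row chr1 LowQual; row chr2 PASS
} > "$tmpdir/S1.vcf"
{ row chr1 PASS; row chr2 PASS; row chr2 PASS; } > "$tmpdir/S2.vcf"
{ row chr2 PASS; } > "$tmpdir/S3.vcf"

# One summary per sample: chromosome <tab> number of PASS records.
for s in S1 S2 S3; do
  awk -F'\t' '!/^#/ && $7 == "PASS" { print $1 }' "$tmpdir/$s.vcf" |
    sort | uniq -c |
    awk '{ print $2 "\t" $1 }' > "$tmpdir/$s.summary"
done

# Combine the summaries: counts are stored per (chromosome, file) pair,
# chromosomes are printed in the order they were first seen, and a
# chromosome missing from a sample is printed as 0.
awk '
  {
    count[$1, FILENAME] = $2
    if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
  }
  END {
    printf "chr\tS1\tS2\tS3\n"
    for (i = 1; i <= n; i++) {
      printf "%s", order[i]
      for (k = 1; k < ARGC; k++) printf "\t%d", count[order[i], ARGV[k]]
      printf "\n"
    }
  }' "$tmpdir/S1.summary" "$tmpdir/S2.summary" "$tmpdir/S3.summary" > "$tmpdir/matrix.tsv"

cat "$tmpdir/matrix.tsv"
```

Because the rows follow the order of the first summary, chromosomes come out in plain lexical order (chr1, chr10, chr11, ...); an extra sort step would be needed to get chr1, chr2, chr3 on the real data.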
Developed by Marcela Dávila, 2018.