Unix II: Some basic commands

Course: VT24 Unix applied to genomic data (SC00036)

Let's practice some commands

First copy the files used for these exercises:

cp /home/courses/Unix/files/day2.tar.gz .

Q1. What is the size of the file?

Extract the files in the tarball into a directory called Some_exercises_2. You can first create the directory and then use tar to extract the files into it.


mkdir Some_exercises_2
tar -xvzf day2.tar.gz -C Some_exercises_2

Q2. What is the size of the uncompressed data? Hint: you can use ls or try du -sh [Directory_name]
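As a rough illustration of the hint, the sketch below builds a small throwaway directory (demo_dir and its files are made up for this example, not part of the course data) and inspects it with both commands. The exact sizes reported depend on your file system.

```shell
# Hypothetical demo directory, created in a temporary location
tmpdir=$(mktemp -d)
mkdir "$tmpdir/demo_dir"
printf 'some text\n' > "$tmpdir/demo_dir/a.txt"
printf 'more text\n' > "$tmpdir/demo_dir/b.txt"

# ls -lh lists each file with a human-readable size
ls -lh "$tmpdir/demo_dir"

# du -sh prints one summarized, human-readable total for the whole directory
dir_size=$(du -sh "$tmpdir/demo_dir")
echo "$dir_size"
```

Note that ls reports the size of each file, while du -s reports the disk space used by the directory as a whole, so the two numbers need not add up exactly.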

cat and redirection

Q3. What is the command to use if you were to combine the contents of the file S1.vcf and the file S2.vcf into one new file with the name all.txt?

Q4. What is the command to use if you want to append to the file all.txt the contents of the file S3.vcf?

grep

Q5. You want to know if the SNP rs17098198 is in the file ADK.snp.vcf. What is the command to use?

Q6. How many header lines (those starting with #) are there in the file ADK.snp.vcf?

Q7. In how many lines does the word PASS (which means the variant passed a set quality filter) occur in the file S1.vcf?

Q8. What files in your directory contain the position 862389? How do you write the command?

cut

Q9. Extract the 4th and 5th fields of the file S2.vcf and save these in a new file. What is the command used?

Q10. Suggest a command to extract the first column from the comma-separated file dat2.txt

sort

Q11. What is the command to sort the file Fly_B.counts numerically by the positions in column 2?

Q12. What is the command to sort the lines of the Homo_sapiens.chr10.gtf file, based on the information in the column showing strand +/- ? Save the sorted lines in a new file.

uniq

Q13. See Q12. For the exons in the file created, extract the strand column and use it to answer the question: how many exons are on the plus and minus strands, respectively?

find

Q14. Suggest a command to find all files in your directory with the extension .txt, using find.

comm

Q15. Write a command that outputs the gene-names that are found in both the files genelist.txt and genelist2.txt

Q16. Write a command that outputs the gene-names that are unique to genelist2.txt

sed

Q17. What is the command to change all + to pos and - to neg from Fly_B.counts?

Q18. What is the command to change every instance of science in science.txt, regardless of case, to SCIENCE?

A combination of commands and piping

Q19. Examine again the file Homo_sapiens.chr10.gtf. How many different types of transcripts are there (field 2)? How many lines are annotated as lincRNA?

Q20. What is the command to extract the accession codes (like AY156735.1 in the line ">gi|24209941|gb|AY156735.1| HIV-1 clone P2.BCM.RT from USA reverse transcriptase (pol) gene, partial cds") from the file rt.fa?

Q21. In S1.vcf, rename the chromosomes of all mutations on chrX and chrY to chr23 and chr24, respectively, and save the result to a new file. What command line did you use?

Q22. Add a "#" to every empty line in rt.fa. How many did you add?

Q23. Select all UTR regions from the ADK gene that are in Homo_sapiens.chr10.gtf. Modify the output so you only display the chromosome, start, end, gene_id and transcript_name. Append the output to the file you created under Q20

Q24. What is the most used word in science.txt?

awk

First we will practice awk using the GTF file Homo_sapiens.chr10.gtf. A GTF file is a common annotation format used in bioinformatics. Read more about the format here. Try to use awk as much as possible; sometimes you might need to combine awk with wc -l or sort.
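To show what combining awk with wc -l or sort can look like, here is a minimal sketch on a made-up three-line, tab-separated file in GTF-like layout (toy.gtf, the gene ids and the coordinates are assumptions for illustration, not taken from Homo_sapiens.chr10.gtf). It assumes the feature type is in column 3 and start and end are in columns 4 and 5.

```shell
# Hypothetical toy annotation file with 9 tab-separated columns
tmpdir=$(mktemp -d)
printf '10\tprotein_coding\texon\t100\t250\t.\t+\t.\tgene_id "G1";\n'  > "$tmpdir/toy.gtf"
printf '10\tprotein_coding\tCDS\t120\t250\t.\t+\t.\tgene_id "G1";\n'  >> "$tmpdir/toy.gtf"
printf '10\tlincRNA\texon\t900\t950\t.\t-\t.\tgene_id "G2";\n'        >> "$tmpdir/toy.gtf"

# awk selects the lines whose third column is "exon"; wc -l counts them
n_exons=$(awk -F'\t' '$3 == "exon"' "$tmpdir/toy.gtf" | wc -l | tr -d ' ')
echo "$n_exons"

# awk computes each feature length (end - start + 1, coordinates are inclusive);
# sort orders the lengths numerically, largest first; head keeps the top line
longest=$(awk -F'\t' -v OFS='\t' '{ print $5 - $4 + 1, $3 }' "$tmpdir/toy.gtf" | sort -k1,1nr | head -n 1)
echo "$longest"
```

The same pattern (filter or compute with awk, then count or rank with another tool) is one possible way to approach the questions below, though other combinations work as well.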

Q25. How many protein coding genes are there on chromosome 10? How many are there on the first half of chromosome 10? Chromosome 10 is 133797422 nucleotides long.

Q26. PTEN is a well known tumor suppressor gene located on chromosome 10. How many exons of PTEN are larger than 100 base pairs?

Q27. How many exons does PTEN have that belong to transcript id ENST00000371953?

Q28. Which is the longest protein coding gene on chr10?

Q29. How many nucleotides are covered by exonic lincRNAs?

For the following exercises you will be using organisms_mtx.tsv. Inspect the file: each row is an organism, and each column shows "how much" of it has been found in a given sample. Try to use awk as much as you can to answer the following:

Q30. What are the dimensions of the table?

Q31. Display the name of the organism and samples G_261 and S118. NOTE: the names of the organisms are separated by spaces, so don't forget to specify the input separator as TAB, otherwise awk will also take the spaces as separators and the columns will be shifted.
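The effect described in the note can be seen on a small made-up table (toy.tsv, its sample names and its values are hypothetical, not taken from organisms_mtx.tsv). This sketch prints the second field of the first data row with and without TAB as the input separator.

```shell
# Hypothetical toy table: the organism names contain spaces, columns are separated by tabs
tmpdir=$(mktemp -d)
printf 'Organism\tG_1\tS_1\n'            > "$tmpdir/toy.tsv"
printf 'Lactobacillus iners\t10\t4\n'   >> "$tmpdir/toy.tsv"
printf 'Gardnerella vaginalis\t0\t7\n'  >> "$tmpdir/toy.tsv"

# Default separator (any whitespace): the name is split in two, so $2 is "iners"
default_out=$(awk 'NR == 2 { print $2 }' "$tmpdir/toy.tsv")

# TAB as input separator: the name stays in $1, so $2 is the first sample value
tab_out=$(awk -F'\t' 'NR == 2 { print $2 }' "$tmpdir/toy.tsv")

echo "default: $default_out"   # default: iners
echo "tab:     $tab_out"       # tab:     10
```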

Q32. From the previous output, calculate the ratio between these two samples. First add 1 to each column, so that we never divide by zero when calculating ratios (this is the pseudocount method). Then divide these values and display the result as a fourth column. Separate the columns with tabs and save the result to a file.
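The pseudocount idea can be sketched on a made-up three-column table (toy3.tsv, ratios.tsv and the values are assumptions for illustration). This version prints the incremented values in columns 2 and 3, which is one possible reading of the task; adapt the column numbers and file names to your own output.

```shell
# Hypothetical input: name, sample 1, sample 2 (tab-separated)
tmpdir=$(mktemp -d)
printf 'org A\t9\t4\n'  > "$tmpdir/toy3.tsv"
printf 'org B\t0\t0\n' >> "$tmpdir/toy3.tsv"
printf 'org C\t0\t3\n' >> "$tmpdir/toy3.tsv"

# Add a pseudocount of 1 to both samples, then divide; OFS makes the output tab-separated
awk -F'\t' -v OFS='\t' '{ a = $2 + 1; b = $3 + 1; print $1, a, b, a / b }' \
    "$tmpdir/toy3.tsv" > "$tmpdir/ratios.tsv"

cat "$tmpdir/ratios.tsv"
# org A   10   5   2
# org B   1    1   1
# org C   1    4   0.25
```

Without the pseudocount, the second row would require dividing 0 by 0, which awk reports as an error.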

Using the file you just created answer the following:

Q33. How many organisms are equally expressed in G_261 and S118?

Q34. How many organisms have a ratio of 30 or more?

Q35. How many Lactobacillus have a ratio of 0.5 or less?

Now inspect IgG_mutations.txt. This is a file where mutations (rows) of different samples in different regions of Ig (CDR1, CDR2, FR2 and FR3) are summarized. Try to use awk as much as possible to answer the following:

Q36. How many mutations in FR2 are found?

Q37. Are there any positions where the Serine is changed to Glycine?

Q38. Inspecting the previous result, there are some positions that do not have any mutations. What is the command (using awk) to remove those lines?

Sometimes the expressions can be really long, especially when handling a lot of columns. To avoid typing a lot of code, we can use more complex expressions. In this case, an alternative approach would be to sum up the number of mutations for each row and remove the rows that sum up to zero. If you google a specific task, in this case "awk sum rows", you will get suggestions on how to achieve this. One of these suggestions is:

awk '{ for(i=1; i<=NF;i++) j+=$i; print j; j=0 }' data

It may be difficult to read the code in one line, so we can re-arrange it vertically:

awk '{ 
     for(i=1; i<=NF;i++) 
         j+=$i; 
     print j; 
     j=0 }' data

The for loop lets us repeat an action: in this case, adding up the values of a row, for each one of the rows. It starts adding at position i (in this case column one, i=1) and continues until the last column (in this case 7, i<=NF). Every time a column is added, the running total is stored in the variable j (j+=$i). The final sum for each row is then printed (print j). Once the sum has been printed, j is set to zero (j=0), so we do not carry its value over when we calculate the sum for the next row.
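To see the loop in action, here is a minimal sketch on a made-up file with three numeric columns (the contents of data are hypothetical and much simpler than IgG_mutations.txt).

```shell
# Hypothetical input: three rows of whitespace-separated numbers
tmpdir=$(mktemp -d)
printf '1 2 3\n0 0 0\n4 0 1\n' > "$tmpdir/data"

# For every row: add up all fields into j, print the sum, then reset j for the next row
sums=$(awk '{ for (i = 1; i <= NF; i++) j += $i; print j; j = 0 }' "$tmpdir/data")
echo "$sums"
# 6
# 0
# 5
```

Keep in mind that awk treats a field that does not look like a number as 0 in arithmetic, so text columns do not break the sum, but they do not contribute to it either.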

Q39. Try to use this expression to filter out the rows where the sum of the mutations equals zero.


Developed by Katarina Truvé, 2018. Modified by Marcela Dávila, 2019. Modified by Sanna Abrahamsson and Marcela Dávila, 2021.