Unix II: Some basic commands - bcfgothenburg/VT24 GitHub Wiki
Let's practice some commands
First copy the files used for these exercises:
cp /home/courses/Unix/files/day2.tar.gz .
Q1. What is the size of the file?
Extract the files in the tarball to a directory called Some_exercises_2. You can first create the directory and then use tar to extract the files:
mkdir Some_exercises_2
tar -xvzf day2.tar.gz -C Some_exercises_2
Q2. What is the size of the uncompressed data? Hint: you can use ls or try du -sh [Directory_name]
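To illustrate the difference between the two commands in the hint, here is a minimal sketch on a throwaway directory. The directory and file names are made up for the illustration and are not part of the course files.

```shell
# Hypothetical scratch directory, created only for this illustration
demo_dir=$(mktemp -d)
printf 'some text\n' > "$demo_dir/a.txt"
printf 'more text\n' > "$demo_dir/b.txt"

# ls -lh lists every file with its own human-readable size
ls -lh "$demo_dir"

# du -sh prints one summarized total for the whole directory
du_total=$(du -sh "$demo_dir")
echo "$du_total"
```

The exact sizes reported depend on the file system, so ls and du may not agree to the byte.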
Q3. What is the command to use if you were to combine the contents of the file S1.vcf and the file S2.vcf into one new file with the name all.txt?
Q4. What is the command to use if you want to append to the file all.txt the contents of the file S3.vcf?
Q5. You want to know if the SNP rs17098198 is in the file ADK.snp.vcf. What is the command to use?
Q6. How many header lines (those starting with #) are there in the file ADK.snp.vcf?
Q7. In how many lines does the word PASS (which means the variant passed a set quality filter) occur in the file S1.vcf?
Q8. What files in your directory contain the position 862389? How do you write the command?
Q9. Extract the 4th and 5th fields of the file S2.vcf and save these in a new file. What is the command used?
Q10. Suggest a command to extract the first column from the comma-separated file dat2.txt
Q11. What is the command to sort the file Fly_B.counts numerically by the positions in column 2?
Q12. What is the command to sort the lines of the Homo_sapiens.chr10.gtf file, based on the information in the column showing strand +/- ? Save the sorted lines in a new file.
Q13. See Q12. For the exons in the file created there, extract the strand column and use it to answer the question: how many exons are on the plus and minus strands, respectively?
Q14. Suggest a command to find all files in your directory with the extension .txt, using find.
Q15. Write a command that outputs the gene names that are found in both the files genelist.txt and genelist2.txt
Q16. Write a command that outputs the gene names that are unique to genelist2.txt
Q17. What is the command to change all + to pos and - to neg from Fly_B.counts?
Q18. What is the command to change all instances of science in science.txt, regardless of case, to SCIENCE?
Q19. Examine again the file Homo_sapiens.chr10.gtf. How many different types of transcripts are there (field 2)? How many lines are annotated as lincRNA?
Q20. What is the command to extract the accession codes (like AY156735.1 in the line ">gi|24209941|gb|AY156735.1| HIV-1 clone P2.BCM.RT from USA reverse transcriptase (pol) gene, partial cds") from the file rt.fa?
Q21. Change all mutations on chrX and chrY in S1.vcf to chr23 and chr24, respectively, and save the result to a new file. What command line did you use?
Q22. Add a "#" to every empty line in rt.fa. How many did you add?
Q23. Select all UTR regions from the ADK gene that are in Homo_sapiens.chr10.gtf. Modify the output so you only display the chromosome, start, end, gene_id and transcript_name. Append the output to the file you created under Q20
Q24. What is the most used word in science.txt?
First we will practice awk using the GTF file Homo_sapiens.chr10.gtf. A GTF file is a common annotation format used in bioinformatics. Read more about the format here. Try to use awk as much as possible; sometimes you might need to combine awk with wc -l or sort.
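As a rough sketch of what combining awk with wc -l or sort can look like, the example below builds three made-up GTF-style lines (tab-separated, nine fields) and runs two small pipelines on them. The lines and the gene identifiers G1 and G2 are invented for the illustration and do not come from Homo_sapiens.chr10.gtf.

```shell
# Three made-up GTF-style lines (tab-separated, 9 fields); not from the course file
gtf_demo=$(mktemp)
printf '10\tprotein_coding\texon\t100\t250\t.\t+\t.\tgene_id "G1";\n' > "$gtf_demo"
printf '10\tlincRNA\texon\t400\t450\t.\t-\t.\tgene_id "G2";\n' >> "$gtf_demo"
printf '10\tprotein_coding\tCDS\t120\t250\t.\t+\t.\tgene_id "G1";\n' >> "$gtf_demo"

# awk selects the lines whose third field is "exon"; wc -l counts them
n_exons=$(awk -F'\t' '$3 == "exon"' "$gtf_demo" | wc -l)
echo "$n_exons"

# awk computes each exon length (end - start + 1); sort puts the longest first
exon_lengths=$(awk -F'\t' '$3 == "exon" { print $5 - $4 + 1, $9 }' "$gtf_demo" | sort -k1,1nr)
echo "$exon_lengths"
```

The same pattern (an awk condition, an optional action, then a pipe into wc -l or sort) can be adapted to the questions that follow.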
Q25. How many protein coding genes are there on chromosome 10? How many are there on the first half of chromosome 10? Chromosome 10 is 133797422 nucleotides long.
Q26. PTEN is a well known tumor suppressor gene located on chromosome 10. How many exons of PTEN are larger than 100 base pairs?
Q27. How many exons does PTEN have that belong to transcript id ENST00000371953?
Q28. Which is the longest protein coding gene on chr10?
Q29. How many nucleotides are covered by exonic lincRNAs?
For the following exercises you will be using organisms_mtx.tsv. Inspect the file: each row is an organism, and the values show "how much" of it has been found in the different samples (columns). Try to use awk as much as you can to answer the following:
Q30. What are the dimensions of the table?
Q31. Display the name of the organism and samples G_261 and S118. NOTE: the names of the organisms are separated by spaces, so don't forget to specify the input separator as TAB, otherwise awk will also take the spaces as separators and the columns will be shifted.
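The effect described in the note can be seen on a tiny made-up table. The column names S_A and S_B and the values are hypothetical and are not taken from organisms_mtx.tsv.

```shell
# Made-up two-row table; the organism name contains a space
tsv_demo=$(mktemp)
printf 'Organism\tS_A\tS_B\n' > "$tsv_demo"
printf 'Homo sapiens\t12\t3\n' >> "$tsv_demo"

# Default splitting: any whitespace separates fields, so the name is cut in two
default_split=$(awk 'NR == 2 { print $2 }' "$tsv_demo")
echo "$default_split"

# TAB as the input separator: field 2 is now the first sample column
tab_split=$(awk -F'\t' 'NR == 2 { print $2 }' "$tsv_demo")
echo "$tab_split"
```

With the default separator the second field is the second word of the organism name; with -F'\t' it is the first sample value.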
Q32. From the previous output, calculate the ratio between these 2 samples. First add "1" to each column, so that we never divide by zero when calculating ratios (this is the pseudocount method). Divide these values and display the ratio as a fourth column. Separate each column by a tab and save the result to a file.
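The pseudocount idea can be sketched on a small made-up table with a name column and two count columns. The organism names and counts below are invented, and the example assumes the counts sit in fields 2 and 3.

```shell
# Made-up input: name, count in sample 1, count in sample 2 (tab-separated)
ratio_demo=$(mktemp)
printf 'Org one\t9\t0\n' > "$ratio_demo"
printf 'Org two\t4\t4\n' >> "$ratio_demo"

# Add 1 to both counts, then print name, both shifted counts and their ratio,
# using TAB for both the input (-F) and the output (OFS) separator
ratio_out=$(mktemp)
awk -F'\t' -v OFS='\t' '{ a = $2 + 1; b = $3 + 1; print $1, a, b, a / b }' "$ratio_demo" > "$ratio_out"
cat "$ratio_out"
```

Without the added 1, the first row would divide 9 by 0 and awk would stop with a division-by-zero error.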
Using the file you just created answer the following:
Q33. How many organisms are equally expressed in G_261 and S118?
Q34. How many organisms have a ratio of 30 or more?
Q35. How many Lactobacillus have a ratio of 0.5 or less?
Now inspect IgG_mutations.txt. This is a file where mutations (rows) of different samples in different regions of Ig (CDR1, CDR2, FR2 and FR3) are summarized. Try to use awk as much as possible to answer the following:
Q36. How many mutations in FR2 are found?
Q37. Are there any positions where the Serine is changed to Glycine?
Q38. Inspecting the previous result, there are some positions that do not have any mutations. What is the command (using awk) to remove those lines?
Sometimes the expressions can be really long, especially when handling a lot of columns. To avoid typing a lot of code, we can use more complex expressions. In this case, an alternative approach would be to sum up the number of mutations for each row and remove the rows that sum up to zero. If you google a specific task, in this case "awk sum rows", you will get suggestions on how to achieve this. One of these suggestions is:
awk '{ for(i=1; i<=NF;i++) j+=$i; print j; j=0 }' data
It may be difficult to read the code in one line, so we can re-arrange it vertically:
awk '{
  for(i=1; i<=NF;i++)
    j+=$i;
  print j;
  j=0 }' data
The for loop will help us to repeat an action: in this case adding the values of a row, for each one of the rows.
It will start adding at position i (in this case column one, i=1) and continue until the last column (in this case 7, i<=NF).
Every time a column is visited, its value is added to the running total stored in the variable j (j+=$i).
The final sum for each row will then be printed (print j).
Once it is done adding each value and printing the result, j is set to zero (j=0), so we do not carry its value over when we calculate the sum for the next row.
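To see the loop described above in action, the sketch below runs the same expression on a small made-up file in which every field is a number. The values are invented; in a real table a non-numeric field, such as a row label, would simply count as 0 in the sum.

```shell
# Made-up input: three rows of whitespace-separated numbers
sum_demo=$(mktemp)
printf '1 0 2\n0 0 0\n3 1 1\n' > "$sum_demo"

# For each row: add every field to j, print the total, then reset j
row_sums=$(awk '{ for (i = 1; i <= NF; i++) j += $i; print j; j = 0 }' "$sum_demo")
echo "$row_sums"
```

This prints one total per input row. Dropping the j=0 reset would make each total include the totals of all earlier rows.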
Q39. Try to use this expression to filter out the rows where the sum of the mutations equals zero.
Developed by Katarina Truvé, 2018. Modified by Marcela Dávila, 2019. Modified by Sanna Abrahamsson and Marcela Dávila, 2021.