# Exam 3 2019 Take home

Instructions. Read Carefully.

We will be using the honor system. You are expected to work primarily on your own. However, I'll be happy to allow some level of discussion with your classmates and with me on the 'general' Slack channel. You may use your notes, slides, tutorials, websites, and any other resources.

You have until 10 am Friday (12/6/2019) to complete this exam. I will need an e-mail with your answers to the questions below in my inbox on or before that time for you to receive full credit ([email protected]). For every 10 minutes after that time, you will lose 10% of your final grade on the exam. I will use the time of receipt in my inbox as the final determination of arrival time. Return your answers as a Word document with any figures or tables embedded. Include your name in the filename; I suggest you name your file something like JSmith.Genomics.exam3.doc. You must turn in an electronic version of your answers rather than a hard copy. It will be examined using a plagiarism detector.

You are free to use any resources as necessary. I also encourage you to communicate with one another using the 'general' Slack channel. I see these exams as a learning experience just as much as an evaluation. That being said, you must turn in your own work and you may not simply copy files/text among the group.

### Special instructions for using HPCC

Because of the shutdown, the quanah nodes that we typically use are unavailable. You will be using my own nodes, which requires a slightly different login process. Instead of logging in to quanah.hpcc.ttu.edu, use ivy.hpcc.ttu.edu. Storage and navigation to /home/, /lustre/work/ (where you've done most of your work), and /lustre/scratch/ will not be impacted. However, qlogins will. Whereas you would normally use `qlogin -pe sm <some number of processors> -P quanah -q omni`, that option is not available during the HPCC shutdown. Instead, you can use any of three sets of nodes that I own:

- `qlogin -pe sm <some number of processors> -P communitycluster -q Chewie`
- `qlogin -pe sm <some number of processors> -P communitycluster -q Yoda`
- `qlogin -pe sm <some number of processors> -P communitycluster -q R2D2`

Unfortunately, the usual 30,000 processors are not available. Instead, Chewie offers 180 processors, R2D2 offers 20, and Yoda offers 40. While you should use the number of processors you need for any of your jobs, keep in mind that others will also be using them, so don't go crazy hogging processors. Also, don't forget to exit any qlogins you create, freeing up the processors for others. Finally, ALWAYS use a qlogin when doing any interactive work. Interactive work is where you submit a command and then have to wait, with your terminal open, for it to finish.
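For example, a minimal interactive session on one of these queues might look like the following (the processor count and queue are only placeholders; request what your task actually needs):

```bash
# Request a small interactive session on Chewie (4 processors here is just an example)
qlogin -pe sm 4 -P communitycluster -q Chewie

# ...do your interactive work...

# Free the processors for others when you're finished
exit
```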

PLEASE keep in mind that some of these tasks may take substantial amounts of time. I was able to complete all of the HPCC-based tasks on Friday morning in about 6 hours, mostly using simple qlogins. However, having written the exam and having done all of this before, I have an advantage. I was also running several analyses simultaneously because I'm just that good. Get started as soon as possible and plan ahead for keeping your terminal open and connected for significant amounts of time in some cases. If you feel comfortable, it may be worth your while to run many of these by submitting the jobs to the queue using a qsub script. If you choose to do so, be aware that you will need to have the following header on that script (it's different from the one we used for our abyss and canu runs because of the switch to ivy/hrothgar). I've put my version of the abyss_qsub.sh script in /lustre/work/daray/exam3/gge_scripts.

```bash
#!/bin/bash
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N <your job name>
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q <your choice of nodes: Chewie, Yoda, or R2D2>
#$ -pe sm <your choice of processor number>
#$ -P communitycluster
```
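Once the placeholders above are filled in, submission works the same way it did on quanah. As a sketch (the script name here is just an example):

```bash
qsub abyss_qsub.sh        # submit the script to the scheduler
qstat -u <eraider>        # check whether your job is queued, running, or finished
```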

Pay close attention to the instructions for each section. Make sure to address all parts of each.

Graduate students are responsible for addressing all questions. Undergraduate students may choose not to respond to any one question. Please identify which question you're skipping.

1. (10 pts)

In class, three major sequencing platforms that make use of sequencing by synthesis were discussed. In other words, they use DNA polymerase to build a new DNA strand as part of the sequencing process. A) Assuming you already have a sequencing library (a set of fragments that's ready to be put on the machine and subjected to sequencing), describe the way each of those platforms works to identify the order of bases in any given fragment from that library. B) A fourth platform was also described that does not use sequencing by synthesis. How does that one work?

2. (10 pts)

Same rules as before with regard to my grandmother. In 300 words or fewer, explain what RADSeq is and how it works in terms I would understand. In your explanation, make sure to mention how sequencing depth makes a difference in interpreting the data.

3. (10 pts)

Create a directory at /lustre/work/<eraider>/exam3/question3. I've used the SRA toolkit to retrieve the raw data and save it to one of my directories, /lustre/work/daray/exam3/bacteroides_data. Copy the raw read data into the directory you just created. Use jellyfish to estimate the genome size for this organism. Do this using three different k-mer sizes (k=27, k=33, and k=65). A) Provide the answers/results in the Word file you will turn in on Friday. Include screenshots of the histograms for each k-mer. B) Investigate the sequencing reads in the .fq files you copied. How long are the sequencing reads? What does that suggest as far as the sequencing technology and the sequencing machine used? Explain your answers and how you got them. C) The data are from Bacteroides fragilis. Use NCBI to determine how your estimates compare to the actual genome size. NOTE that on the GenomeScope page there are boxes that need to be filled out properly to get the correct results. All other results from your work should be visible to me in the directory you created.
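As a rough sketch of one way to approach this (the read file names below are placeholders; use the actual names of the files you copied, and repeat for k=33 and k=65):

```bash
# Count canonical 27-mers and build the histogram GenomeScope needs
jellyfish count -m 27 -s 1G -t 8 -C -o k27.jf reads_R1.fq reads_R2.fq
jellyfish histo -t 8 k27.jf > k27.histo   # upload this .histo file to GenomeScope

# Quick check of read length for part B (prints the length of the first read)
head -4 reads_R1.fq | awk 'NR==2 {print length($0)}'
```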

4. (10 pts)

For question 4, raw sequence read data and a genome assembly are available from this paper (https://www.g3journal.org/content/9/7/2051) describing an apple genome. Create a directory at /lustre/work/<eraider>/exam3/question4. Copy the genome assembly to that directory from where I've stored it, /lustre/work/daray/exam3/apple_data/Malus_baccata_v1.0_scaffolds.fa. Enter that directory and use bowtie2 to index the genome. Afterward, map the reads from /lustre/work/daray/exam3/apple_data to the genome. Your output should be a SAM-formatted file called Malus_baccata.sam. A) Record the successful commands you used for these steps. B) Use abyss-fac to get basic genome statistics for the genome. What is the assembly size? What is the N50? C) Use the 'head' and/or 'tail' commands (http://www.linfo.org/head.html, http://www.linfo.org/tail.html) and your knowledge of the SAM format to determine how far apart, generally, the paired reads mapped with respect to each other on the genome. D) The following three sentences were taken from the paper. Reword them so that my grandmother would understand. "Nineteen paired-end libraries were prepared for sequencing the M. baccata genome. These included nine paired-end libraries with an insert of 200, 500, and 800 nt and 10 mate-pair libraries with insert sizes of 2, 5, 10, and 20 kb. All libraries were constructed following the instructions provided by Illumina." Provide the answers/results for A, B, C, and D in the Word file you will turn in on Friday. All of the files you generated should be visible when I examine the directory you created for this problem.
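A possible outline of the commands, assuming the read files in apple_data are named apple_R1.fq and apple_R2.fq (check the actual names before running):

```bash
bowtie2-build Malus_baccata_v1.0_scaffolds.fa Malus_baccata        # index the assembly
bowtie2 -p 8 -x Malus_baccata -1 apple_R1.fq -2 apple_R2.fq -S Malus_baccata.sam
abyss-fac Malus_baccata_v1.0_scaffolds.fa                          # assembly size and N50 for part B
head -n 100 Malus_baccata.sam    # for part C, column 9 (TLEN) reports the observed insert size
```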

5. (10 pts)

Create a directory at /lustre/work/<eraider>/exam3/question5. Copy the genome assembly to that directory from where I stored it, /lustre/work/daray/exam3/bat_genome/mMyo.fa. Get some RADSeq data from another species of this genus by copying files 194091_mAus_L003_R1.fq and 194091_mAus_L003_R2.fq to the question5 directory from /lustre/work/daray/exam3/bat_data. Now, use freebayes to identify variants in individual 194091 when its sequencing reads are mapped to only the first scaffold of the mMyo.fa genome assembly. That scaffold is called 'scaffold_m19_p_1'. There will be many files generated along the way, but the final output should be a .vcf file. Generate basic variant statistics for this .vcf file. A) How many SNPs with a quality score better than 40 were identified? B) How many indels with quality scores better than 40 were identified? C) I specifically asked you to only identify variants associated with the first scaffold. Why do you think I did that, and how did you accomplish it? Provide the answers/results for A, B, and C in the Word file you will turn in on Friday. All of the files you generated should be visible when I examine the directory you created for this problem.
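One possible route, sketched here with bowtie2 as the mapper and samtools/bcftools for the bookkeeping (the tool choices and output file names are assumptions, not requirements):

```bash
samtools faidx mMyo.fa scaffold_m19_p_1 > scaffold1.fa          # pull out just the first scaffold
bowtie2-build scaffold1.fa scaffold1
bowtie2 -p 8 -x scaffold1 -1 194091_mAus_L003_R1.fq -2 194091_mAus_L003_R2.fq | \
    samtools sort -@ 8 -o 194091.sorted.bam -
samtools index 194091.sorted.bam
freebayes -f scaffold1.fa 194091.sorted.bam > 194091.vcf
bcftools stats 194091.vcf > 194091.stats.txt                    # basic variant statistics
bcftools view -i 'QUAL>40' -v snps 194091.vcf | grep -vc '^#'   # SNPs with quality > 40
bcftools view -i 'QUAL>40' -v indels 194091.vcf | grep -vc '^#' # indels with quality > 40
```

Alternatively, you could map the reads to the full assembly and restrict freebayes to the scaffold with its --region option; either route answers part C.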

6. (10 pts)

Create a directory at /lustre/work/<eraider>/exam3/question6. Obtain the raw sequence data for this organism directly from the Sequence Read Archive (SRA) at NCBI using tools from earlier exercises in the class. The SRA IDs for the data are SRR8482575 and SRR8494924. The data should be deposited in the question6 directory to use for the rest of this question. Now, use Unicycler to assemble the genome of this organism. Take note: some of the data are paired-end Illumina data; other data are long-read Nanopore sequence reads. You'll need to know the difference to get it to work. A) In your Word document, identify the organism whose genome you just assembled and explain how you got your answer. All of the files you generated should be visible when I examine the directory you created for this problem. NOTE: Because of the same issues we encountered previously when assembling with Unicycler, you will likely get more fragments than you should when assembling this genome, and you may see some 'core' files pop up. This is OK. I just want to see if you know how to do the work.
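A sketch of one way this could go, using the SRA toolkit and Unicycler. Which accession holds the Illumina reads and which holds the Nanopore reads is an assumption here, so verify that on the SRA run pages before assembling:

```bash
prefetch SRR8482575 SRR8494924
fasterq-dump --split-files SRR8482575   # paired-end Illumina reads: SRR8482575_1.fastq, SRR8482575_2.fastq
fasterq-dump SRR8494924                 # long Nanopore reads (single file)

unicycler -1 SRR8482575_1.fastq -2 SRR8482575_2.fastq \
          -l SRR8494924.fastq -o unicycler_out -t 8
```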

7. (10 pts)

Create a directory at /lustre/work/<eraider>/exam3/question7. I've used the SRA toolkit to retrieve the raw data and save it to one of my directories, /lustre/work/daray/exam3/azospirillum. Copy the raw read data into the directory you just created. Now, use Abyss to assemble the genome of this organism. Use the k-mer size of your choice for your assembly, but explain why you chose the one you did. Take note: all of the reads are paired-end Illumina data. A) In your Word document, paste the genome statistics from abyss-fac for this assembly. Should you plan to modify the qsub script we used for the abyss assemblies in class, you'll need to modify the first several lines of the script as described in the instructions above. All of the files you generated should be visible when I examine the directory you created for this problem.
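A minimal sketch, assuming the copied read files are named azo_R1.fq and azo_R2.fq and that you settled on k=63 (both are placeholders for your own choices):

```bash
abyss-pe k=63 name=azospirillum in='azo_R1.fq azo_R2.fq' j=8
abyss-fac azospirillum-scaffolds.fa      # statistics to paste into your Word document
```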

8. (10 pts)

A .vcf file from 23andMe can be found in /lustre/work/daray/exam3/snpdata. Examine that file and answer the following questions. A) For this individual, what is the ID of the SNP at position 231963491 on chromosome 1? What is the genotype at this position in the human reference assembly, and what is this person's genotype? Use https://www.snpedia.com/index.php/SNPedia to investigate the SNP at this position and tell me what relationship this SNP has to human biology. B) Find SNP rs4402960. What is this person's genotype at this position? Where does this SNP lie in the human genome? What does this genotype suggest about this person's likelihood of developing diabetes?
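Plain-text searches are enough here. For example (the file name is a placeholder, and the chromosome may be labeled '1' or 'chr1' depending on the file):

```bash
awk '$1 == "1" && $2 == 231963491' genome.vcf   # the SNP at chromosome 1, position 231963491
grep -w rs4402960 genome.vcf                    # look up rs4402960 by its ID
```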

9. (10 pts)

You have just assembled a previously unexamined eukaryotic genome. How would you go about identifying all protein-coding genes that exist in that genome? Assume you have no limit on funding. Be complete but concise in your answer.

10. (10 pts)

There is a spreadsheet linked here (https://github.com/davidaray/test/blob/master/fragment_lengths.xlsx; click the 'download' button) that lists the 171 contigs from a genome assembly and their lengths. Using the data in the spreadsheet, calculate the N50 for this genome assembly. Explain how you got your answer.
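If you want to check your arithmetic, here is one possible sketch, assuming you've exported the 171 lengths to a one-column text file called lengths.txt:

```bash
# Sort lengths largest-first, then report the length at which the cumulative
# sum first reaches half of the total assembly size (i.e., the N50).
sort -rn lengths.txt | awk '{ total += $1; len[NR] = $1 }
    END { half = total / 2; run = 0
          for (i = 1; i <= NR; i++) { run += len[i]
              if (run >= half) { print "N50 =", len[i]; exit } } }'
```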
