Exam 2 2019 Take home

Instructions. Read Carefully.

We will be using the honor system. You are expected to work on your own, without the help of any of your classmates. You may use your notes, slides, tutorials, websites, and any other resources, just not other people.

You have 47 hours to complete this exam. I will need to have an e-mail with your answers to the questions below in my inbox on or before 10 am on October 30, 2019 for you to receive full credit ([email protected]). For every 10 minutes after that time, you will lose 10% of your final grade on the exam. I will use the time of receipt in my inbox as the final determination of arrival time. Return your answers as a Word document with any figures or tables embedded. Include your name in the filename. I suggest you name your files something like JSmith.Genomics.exam2.doc. You must turn in an electronic version of your answers rather than a hard copy. It will be examined using a plagiarism detector.

Pay close attention to the instructions for each section. Make sure to address all parts of each.

Graduate students are responsible for addressing all but one of the *'d questions. Undergraduate students may choose not to respond to any two of the parts labeled with a *.

1. (10 pts)

AGAGATACCTTAGCTACACGTTACCCAGACTTGGAATCCCAGACTTCGCTAGGTCCAGGCATGATTTGTCACGGAGCGAACATAAGACGGTTCTACCACCGCGAGGGTA. The preceding sequence is a very simple genome that you have been asked to reconstruct using sequencing reads. Answer the following questions and/or complete the tasks. k=20.

A. List all possible k-mers and give their multiplicity.
B. With a k of 6, what part(s) of this genome will be the hardest to assemble?
C. Explain why you answered B the way you did.
D. Describe how your answer would be different if k=10.
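
For reference, "k-mers and their multiplicity" can be illustrated with a minimal sketch. It assumes k-mers are counted on the forward strand only and uses a short toy sequence rather than the genome above; the function name `kmer_multiplicities` and the toy sequence are hypothetical and serve only as illustration.

```python
from collections import Counter


def kmer_multiplicities(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand of `sequence`."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence (assumption: NOT the exam genome).
toy = "ACGTACGTTG"
counts = kmer_multiplicities(toy, 4)

# Print each distinct k-mer with the number of times it occurs.
for kmer, n in sorted(counts.items()):
    print(kmer, n)

# k-mers that occur more than once are the ones that create ambiguity.
repeated = sorted(kmer for kmer, n in counts.items() if n > 1)
print("repeated:", repeated)
```

Changing `k` in the call shows how the set of repeated k-mers grows or shrinks for the same sequence.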

2. (10 pts)

Answer the following questions about genomic libraries.

A. Explain the difference between unpaired shotgun reads, paired-end reads, and mate-pair reads from Illumina sequencing.
B. What’s different about the three sequencing strategies and how are the reads used differently in a genome assembly?

3. (15 pts)

Design a project to sequence and assemble a previously unsequenced eukaryotic genome. This will be a de novo assembly. In other words, you have no genome to which you can map your reads. Begin with the ‘hows’ and ‘whys’ of the rationale behind selecting your organism(s) to sequence and end with how you would determine the quality of your final assembly. Include details on what technologies you might employ for each step along the way. That is, how deeply would you sequence, what technology would you use and why, and how would you determine the quality of your assembly? Relate all of these decisions to your rationale for the project if relevant. Obviously, there is no single correct answer. I’m trying to determine how much of the general process you’ve absorbed and how deeply you can think about projects such as this. Hint: it’s important that you explain why you make each choice along the way (e.g., why this particular organism? Why this sequencing platform vs. another? Why this coverage level vs. another? What features of the genome would influence your choices?).

4. (5 pts)

Explain how you can use sequencing reads to determine if there’s a sequencing error in a read or if there is a sequence polymorphism such as a heterozygous SNP in a genome.

5. (10 pts)

This next set of activities will require a connection to HPCC.

A. Input the following commands and answer the associated questions.

cd /lustre/work/<eraider>

cp -r /lustre/work/daray/exam2 .

cd exam2/genome2/data

head SRX5287350_1.fastq

i. What is the quality score of the fifth base in the first read of the file you just examined?

ii. Using the website linked here (https://www.drive5.com/usearch/manual/quality_score.html), determine the probability of that base call being erroneous.
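
As a cross-check on a table lookup, the standard Phred relationship P = 10^(-Q/10) can be sketched in a few lines. This assumes the FASTQ file uses the Phred+33 (Sanger / Illumina 1.8+) encoding, which should be confirmed for the data at hand. The function name `phred_to_error_probability` is hypothetical, and the example character is arbitrary, not taken from the exam file.

```python
import math


def phred_to_error_probability(quality_char: str, offset: int = 33) -> float:
    """Convert one FASTQ quality character to the probability of a wrong base call.

    Assumes Phred+33 encoding by default: Q = ASCII code - 33, P = 10 ** (-Q / 10).
    """
    q = ord(quality_char) - offset
    if q < 0:
        raise ValueError("character is below the encoding offset")
    return 10 ** (-q / 10)


# Arbitrary example character (assumption: not from SRX5287350_1.fastq).
example = "I"                      # ASCII 73, so Q = 40 under Phred+33
p = phred_to_error_probability(example)
print(example, ord(example) - 33, p)
```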

B. Use the following commands.

head SRX5299451.fastq

head SRX5299461.fastq

i. Of all of the read files present in your /lustre/work/<eraider>/exam2/genome2/data directory, which one(s) are likely to be Illumina and which one(s) are likely to be PacBio? Explain how you know this.

C. Assemble genome1 using Unicycler and short reads only in a directory called “genome1_short_only”.

i. Provide the path to that assembly directory so I can find it.
D. Assemble genome1 using Unicycler and short reads and one set of long reads in a directory called “genome1_short_long”.

i. Provide the path to that assembly directory so I can find it.
E. Do the same for genome2.

i. Provide the paths to all assemblies so I can find them.
F. For each genome, address the following:

i. How many fragments are present in the final assembly file?

ii. Explain why there are so many more fragments for one type of assembly (short_only) vs. the other (short_long) for each genome.

iii. What is the range of insert lengths for the paired-end libraries for each data set (1st – 99th percentiles)?

iv. Use the first 3000 bp of the longest fragment from each genome assembly and Nucleotide BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify each genome to a likely species.

v. Use Nucleotide BLAST to identify the smallest fragment from each ‘short_long’ assembly.fasta. What are these smaller fragments?
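
One possible way to get fragment counts and BLAST queries out of an assembly FASTA is sketched below. It is a minimal illustration on an in-memory toy FASTA; the helper `read_fasta` and the toy records are hypothetical, and the layout of the real assembly.fasta is assumed to be ordinary FASTA with one header line per fragment.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy stand-in for an assembly file (assumption: not real exam data).
toy_fasta = io.StringIO(">1 length=12\nACGTACGTACGT\n>2 length=4\nGGCC\n")
records = list(read_fasta(toy_fasta))

longest = max(records, key=lambda r: len(r[1]))
shortest = min(records, key=lambda r: len(r[1]))
query = longest[1][:3000]          # first 3000 bp (or fewer if the fragment is shorter)

print("fragments:", len(records))
print("longest:", longest[0], "shortest:", shortest[0], "query length:", len(query))
```

For a real file, the `io.StringIO` object would be replaced with something like `open("assembly.fasta")` inside a `with` block, run from the relevant assembly directory.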

6*. (10 pts)

I’m my grandmother. She was a telephone operator when you used to have to manually switch lines from one slot to another like in this photo.

Not my actual grandmother.

As my grandmother, I know almost nothing about genomes or genome analysis. Explain to me in 200 words or fewer what a genome is and why it’s difficult to assemble a mammal genome like our own.

7*. (10 pts)

Same rules with regard to my grandmother. In 300 words or fewer, explain what RADSeq is and how it works in terms I would understand. During that second explanation, make sure to include a mention of how sequencing depth makes a difference in interpreting the data.

8*. (10 pts)

Read this paper that was published a couple of months ago: https://myweb.ttu.edu/daray/Genomes/evz159.pdf. I’m my grandmother again. In answering all queries below, speak to me as though I’m her, with a near complete lack of knowledge of genomics.

A. In fewer than 10 sentences, explain to me what the authors did and why they did it.
B. What does Table 1 describe?
C. If you had to choose one row of Table 1 and describe it as the ‘best’ assembly, which one would you choose and why? (Note: It’s perfectly valid to have a different answer from my own. The trick to getting a good score is explaining WHY you answered the way you did. If you can do that, I don’t care if your answer is different from mine.)

9*. (10 pts)

Explain how Illumina sequencing reads can be used to identify a 500 bp insertion in one organism (with a reference genome assembly) vs. a second individual of the same species that doesn’t have that insertion. No more than 100 words. Pictures are helpful but only if you explain them with text.