Accuracy vs depth: methods - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

R10.4+Illumina assembly and assessment

Preparing the Nanopore reads

To keep things simple, I only used the chromosome for this test:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
head -n2 ../S_aureus_JKD6159.fasta > reference.fasta

And I used the reads that Trycycler assigned to the chromosome, so the assemblies won't have the plasmid or circular phage:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
cp ../S_aureus_JKD6159_assembly_R10.4/trycycler/cluster_001/4_reads.fastq nanopore.fastq

This read set had 131438 reads and 1759841108 bp, which equates to 624x depth over the 2818670 bp chromosome.

Preparing the Illumina reads

To get chromosomal Illumina reads, I first aligned the Illumina reads (post-fastp) to the full reference (with phage and plasmid included):

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
cp ../S_aureus_JKD6159.fasta full_reference.fasta
bwa index full_reference.fasta
bwa mem -t 32 full_reference.fasta ../sequencing_data/illumina_reads/illumina_1.fastq.gz ../sequencing_data/illumina_reads/illumina_2.fastq.gz > full_reference_alignments.sam

Many reads aligned equally well to both chromosomal and phage, but BWA should randomly assign them to one or the other, so this shouldn't be a problem.

I then looked for and extract read pairs where both reads aligned to the chromosome:

cat full_reference_alignments.sam | samtools view -F 2048 | grep "chromosome" | cut -f1 | sort | uniq > chromosomal_read_names
cat chromosomal_read_names | sed -e 's/$/ 1/' > chromosomal_read_names_1
cat chromosomal_read_names | sed -e 's/$/ 2/' > chromosomal_read_names_2
zcat ../sequencing_data/illumina_reads/illumina_1.fastq.gz | paste - - - - | grep -f chromosomal_read_names_1 | tr '\t' '\n' > illumina_1.fastq
zcat ../sequencing_data/illumina_reads/illumina_2.fastq.gz | paste - - - - | grep -f chromosomal_read_names_2 | tr '\t' '\n' > illumina_2.fastq

And as a sanity check, I did a Unicycler Illumina-only assembly of these reads to make sure I got the expected result (an assembly graph minus the plasmid):

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
unicycler -1 reads_1.fastq -2 reads_2.fastq -o unicycler_illumina_only --threads 32

Looked good, so cleaning up:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
rm full_reference*
rm chromosomal_read_names*
rm -r unicycler_illumina_only

Subsampled read counts

This Python code produces read counts for 320 subsamples of Nanopore reads up to 256x depth, evenly spaced on a square-root scale:

genome_size = 2818670
read_count = 131438
total_read_size = 1759841108
mean_read_length = total_read_size / read_count
depth_sqrt = 0
read_counts = []
while depth_sqrt < 16:
    depth_sqrt += 0.05
    depth = depth_sqrt**2
    base_count = genome_size * depth
    read_count = int(round(base_count / mean_read_length))
    read_counts.append(f'{read_count}')
print(' '.join(read_counts))

This Python code does the same but for the Illumina reads:

genome_size = 2818670
read_pair_count = 3289163
total_read_size = 957926684
mean_read_pair_length = total_read_size / read_pair_count
depth_sqrt = 0
read_pair_counts = []
while depth_sqrt < 16:
    depth_sqrt += 0.05
    depth = depth_sqrt**2
    base_count = genome_size * depth
    read_pair_count = int(round(base_count / mean_read_pair_length))
    read_pair_counts.append(f'{read_pair_count}')
print(' '.join(read_pair_counts))

I used a square-root scale because a linear scale wouldn't show enough of the interesting low-depth region, while a log-transform would be too extreme. A square-root transform is nicely intermediate between linear and log.

Note that these read counts weren't exactly the target depth, because they have a wide length distribution and are randomly selected. E.g. the 24648-read subset might have ended up being a bit bigger (in bases) than the 24879-read subset. But they should have given a reasonably even distribution over the 0-256x depth range.

Flye assembly + Medaka polishing + Polypolish polishing + POLCA polishing

depths=(000.0025 000.0100 000.0225 000.0400 000.0625 000.0900 000.1225 000.1600 000.2025 000.2500 000.3025 000.3600 000.4225 000.4900 000.5625 000.6400 000.7225 000.8100 000.9025 001.0000 001.1025 001.2100 001.3225 001.4400 001.5625 001.6900 001.8225 001.9600 002.1025 002.2500 002.4025 002.5600 002.7225 002.8900 003.0625 003.2400 003.4225 003.6100 003.8025 004.0000 004.2025 004.4100 004.6225 004.8400 005.0625 005.2900 005.5225 005.7600 006.0025 006.2500 006.5025 006.7600 007.0225 007.2900 007.5625 007.8400 008.1225 008.4100 008.7025 009.0000 009.3025 009.6100 009.9225 010.2400 010.5625 010.8900 011.2225 011.5600 011.9025 012.2500 012.6025 012.9600 013.3225 013.6900 014.0625 014.4400 014.8225 015.2100 015.6025 016.0000 016.4025 016.8100 017.2225 017.6400 018.0625 018.4900 018.9225 019.3600 019.8025 020.2500 020.7025 021.1600 021.6225 022.0900 022.5625 023.0400 023.5225 024.0100 024.5025 025.0000 025.5025 026.0100 026.5225 027.0400 027.5625 028.0900 028.6225 029.1600 029.7025 030.2500 030.8025 031.3600 031.9225 032.4900 033.0625 033.6400 034.2225 034.8100 035.4025 036.0000 036.6025 037.2100 037.8225 038.4400 039.0625 039.6900 040.3225 040.9600 041.6025 042.2500 042.9025 043.5600 044.2225 044.8900 045.5625 046.2400 046.9225 047.6100 048.3025 049.0000 049.7025 050.4100 051.1225 051.8400 052.5625 053.2900 054.0225 054.7600 055.5025 056.2500 057.0025 057.7600 058.5225 059.2900 060.0625 060.8400 061.6225 062.4100 063.2025 064.0000 064.8025 065.6100 066.4225 067.2400 068.0625 068.8900 069.7225 070.5600 071.4025 072.2500 073.1025 073.9600 074.8225 075.6900 076.5625 077.4400 078.3225 079.2100 080.1025 081.0000 081.9025 082.8100 083.7225 084.6400 085.5625 086.4900 087.4225 088.3600 089.3025 090.2500 091.2025 092.1600 093.1225 094.0900 095.0625 096.0400 097.0225 098.0100 099.0025 100.0000 101.0025 102.0100 103.0225 104.0400 105.0625 106.0900 107.1225 108.1600 109.2025 110.2500 111.3025 112.3600 113.4225 114.4900 115.5625 116.6400 117.7225 118.8100 119.9025 121.0000 122.1025 123.2100 124.3225 125.4400 126.5625 127.6900 128.8225 129.9600 131.1025 132.2500 133.4025 134.5600 135.7225 136.8900 138.0625 139.2400 140.4225 141.6100 142.8025 144.0000 145.2025 146.4100 147.6225 148.8400 150.0625 151.2900 152.5225 153.7600 155.0025 156.2500 157.5025 158.7600 160.0225 161.2900 162.5625 163.8400 165.1225 166.4100 167.7025 169.0000 170.3025 171.6100 172.9225 174.2400 175.5625 176.8900 178.2225 179.5600 180.9025 182.2500 183.6025 184.9600 186.3225 187.6900 189.0625 190.4400 191.8225 193.2100 194.6025 196.0000 197.4025 198.8100 200.2225 201.6400 203.0625 204.4900 205.9225 207.3600 208.8025 210.2500 211.7025 213.1600 214.6225 216.0900 217.5625 219.0400 220.5225 222.0100 223.5025 225.0000 226.5025 228.0100 229.5225 231.0400 232.5625 234.0900 235.6225 237.1600 238.7025 240.2500 241.8025 243.3600 244.9225 246.4900 248.0625 249.6400 251.2225 252.8100 254.4025 256.0000)

nanopore_read_counts=(1 2 5 8 13 19 26 34 43 53 64 76 89 103 118 135 152 171 190 211 232 255 278 303 329 356 384 413 443 474 506 539 573 608 645 682 721 760 800 842 885 928 973 1019 1066 1114 1163 1213 1264 1316 1369 1423 1478 1535 1592 1650 1710 1770 1832 1895 1958 2023 2089 2156 2224 2293 2363 2434 2506 2579 2653 2728 2805 2882 2960 3040 3120 3202 3285 3368 3453 3539 3626 3714 3803 3893 3984 4076 4169 4263 4358 4455 4552 4650 4750 4850 4952 5055 5158 5263 5369 5476 5583 5692 5802 5913 6026 6139 6253 6368 6485 6602 6720 6840 6960 7082 7204 7328 7453 7579 7706 7833 7962 8092 8223 8356 8489 8623 8758 8894 9032 9170 9310 9450 9592 9734 9878 10023 10169 10315 10463 10612 10762 10913 11065 11219 11373 11528 11684 11842 12000 12160 12320 12482 12644 12808 12973 13139 13305 13473 13642 13812 13983 14155 14328 14503 14678 14854 15032 15210 15389 15570 15752 15934 16118 16303 16488 16675 16863 17052 17242 17433 17625 17818 18013 18208 18404 18601 18800 18999 19200 19401 19604 19808 20012 20218 20425 20633 20842 21052 21263 21475 21688 21902 22118 22334 22551 22770 22989 23210 23431 23654 23878 24102 24328 24555 24783 25012 25242 25473 25705 25938 26172 26408 26644 26881 27120 27359 27600 27841 28084 28327 28572 28818 29065 29313 29562 29812 30063 30315 30568 30822 31077 31334 31591 31849 32109 32369 32631 32894 33157 33422 33688 33955 34223 34491 34761 35033 35305 35578 35852 36127 36404 36681 36959 37239 37519 37801 38083 38367 38652 38938 39224 39512 39801 40091 40382 40674 40968 41262 41557 41853 42151 42449 42749 43049 43351 43653 43957 44262 44567 44874 45182 45491 45801 46112 46424 46737 47052 47367 47683 48000 48319 48638 48959 49280 49603 49927 50251 50577 50904 51232 51561 51891 52222 52554 52887 53221 53557 53893)

illumina_read_counts=(24 97 218 387 605 871 1186 1549 1960 2420 2928 3484 4089 4742 5444 6194 6993 7839 8735 9678 10670 11711 12800 13937 15122 16356 17639 18969 20349 21776 23252 24776 26349 27970 29640 31358 33124 34939 36802 38713 40673 42681 44738 46843 48996 51198 53448 55747 58094 60489 62933 65425 67966 70555 73192 75878 78612 81394 84225 87104 90032 93008 96033 99105 102227 105396 108614 111881 115196 118559 121970 125430 128939 132495 136101 139754 143456 147206 151005 154852 158748 162692 166684 170725 174814 178951 183137 187371 191654 195985 200364 204792 209268 213793 218366 222987 227657 232375 237142 241957 246820 251732 256692 261700 266757 271862 277016 282218 287469 292767 298115 303510 308954 314447 319988 325577 331214 336900 342635 348417 354249 360128 366056 372032 378057 384130 390252 396422 402640 408907 415222 421585 427997 434457 440966 447523 454128 460782 467484 474235 481034 487881 494777 501721 508714 515755 522844 529982 537168 544402 551685 559016 566396 573824 581301 588825 596399 604020 611690 619409 627176 634991 642854 650766 658727 666735 674793 682898 691052 699254 707505 715804 724152 732548 740992 749485 758026 766615 775253 783939 792674 801457 810288 819168 828096 837073 846098 855171 864293 873463 882682 891949 901264 910628 920040 929500 939009 948566 958172 967826 977529 987279 997079 1006926 1016822 1026767 1036760 1046801 1056890 1067028 1077215 1087449 1097733 1108064 1118444 1128872 1139349 1149874 1160448 1171070 1181740 1192459 1203226 1214041 1224905 1235817 1246778 1257787 1268844 1279950 1291104 1302307 1313558 1324857 1336205 1347601 1359046 1370539 1382080 1393670 1405308 1416994 1428729 1440512 1452344 1464224 1476153 1488130 1500155 1512228 1524350 1536521 1548740 1561007 1573322 1585686 1598099 1610560 1623069 1635626 1648232 1660886 1673589 1686340 1699140 1711988 1724884 1737829 1750822 1763863 1776953 1790091 1803278 1816513 1829796 1843128 1856508 1869937 1883414 1896939 1910513 1924135 1937806 1951525 1965292 1979108 1992972 2006884 2020845 2034855 2048912 2063018 2077173 2091376 2105627 2119926 2134274 2148671 2163116 2177609 2192150 2206740 2221379 2236066 2250801 2265584 2280416 2295297 2310225 2325202 2340228 2355302 2370424 2385595 2400814 2416081 2431397 2446761 2462174 2477635)

conda activate medaka

for i in {1..320}; do
    depth="$depths[i]"
    nanopore_read_count="$nanopore_read_counts[i]"
    illumina_read_count="$illumina_read_counts[i]"

    cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
    mkdir "$depth"
    cd "$depth"

    seqtk sample -s "$nanopore_read_count" ../nanopore.fastq "$nanopore_read_count" > nanopore.fastq
    fast_count nanopore.fastq > nanopore.reads

    seqtk sample -s "$nanopore_read_count" ../illumina_1.fastq "$illumina_read_count" > illumina_1.fastq
    seqtk sample -s "$nanopore_read_count" ../illumina_2.fastq "$illumina_read_count" > illumina_2.fastq
    fast_count illumina_1.fastq illumina_2.fastq > illumina.reads

    # Assemble with Flye:
    /home/ubuntu/programs/Flye/bin/flye -t 32 --nano-hq nanopore.fastq --out-dir flye 2> flye.out

    if [ -f flye/assembly_graph.gfa ]; then  # If Flye produced an assembly graph...
        cp flye/assembly_graph.gfa flye.gfa

        # Try to extract a single chromosomal contig from the Flye assembly:
        ../scripts/get_chromosome.py --min 2813669 --max 2823669 --start ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAA flye.gfa 1> flye.fasta 2> flye.info

        if [ -s flye.fasta ]; then  # If we got a Flye chromosome...

            # Polish with Medaka:
            medaka_consensus -i nanopore.fastq -d flye.fasta -o medaka -m r104_e81_sup_g5015 -t 12 &> medaka.out
            mv medaka/consensus.fasta flye_medaka.fasta

            # Polish with Polypolish:
            bwa index flye_medaka.fasta
            bwa mem -t 32 -a flye_medaka.fasta illumina_1.fastq > alignments_1.sam
            bwa mem -t 32 -a flye_medaka.fasta illumina_2.fastq > alignments_2.sam
            polypolish_insert_filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam &> polypolish.out
            polypolish flye_medaka.fasta filtered_1.sam filtered_2.sam 1> flye_medaka_polypolish.fasta 2>> polypolish.out

            # Polish with POLCA:
            /home/ubuntu/programs/MaSuRCA-4.0.9/bin/polca.sh -a flye_medaka_polypolish.fasta -r illumina_1.fastq" "illumina_2.fastq -t 32 -m 1G &> polca.out
            mv *.PolcaCorrected.fa flye_medaka_polypolish_polca.fasta
        fi
    else
        echo "Flye assembly failed" > flye.info
    fi

    rm -rf flye medaka
    rm *.fastq
    rm *.fai
    rm *.mmi
    rm *.sam
    rm *.amb *.ann *.bwt *.pac *.sa
    rm *.bam *.bam.bai
    rm *.err
    rm *.success *.batches *.names *.report *.vcf
done

Some notes:

  • For the seqtk sample command, I used the read count as the seed. This made the result reproducible but also avoided the overlap that would occur if I used the same seed for each subset.
  • The get_chromosome.py script requires an exact match for the starting sequence, which isn't very robust. However, I didn't encounter any cases where the script quit with a starting sequence in neither strand error, so it wasn't a problem. I.e. the starting sequence seems to have assembled perfectly each time.
  • I used a buffer of 5 kbp for the min/max size with the get_chromosome.py script. While this seems strict, it was to exclude any misassemblies, and the correctly assembled contigs fell right in the middle of this range, so I don't think it's a problem.

Quantify assembly accuracy

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4

scripts/assess_assembly.py --header > results_flye.tsv
scripts/assess_assembly.py --header > results_flye_medaka.tsv
scripts/assess_assembly.py --header > results_flye_medaka_polypolish.tsv
scripts/assess_assembly.py --header > results_flye_medaka_polypolish_polca.tsv

for flye_medaka_polypolish_polca in */flye_medaka_polypolish_polca.fasta; do
    flye_medaka_polypolish="${flye_medaka_polypolish_polca/_polca/}"
    flye_medaka="${flye_medaka_polypolish/_polypolish/}"
    flye="${flye_medaka/_medaka/}"
    scripts/assess_assembly.py -a "$flye" -r reference.fasta >> results_flye.tsv
    scripts/assess_assembly.py -a "$flye_medaka" -r reference.fasta >> results_flye_medaka.tsv
    scripts/assess_assembly.py -a "$flye_medaka_polypolish" -r reference.fasta >> results_flye_medaka_polypolish.tsv
    scripts/assess_assembly.py -a "$flye_medaka_polypolish_polca" -r reference.fasta >> results_flye_medaka_polypolish_polca.tsv
done

Check out homopolymer error

Here I wanted to investigate the Tx9 homopolymer error that was often the only error left in Trycycler+Medaka+Polypolish assemblies.

I used assemblies from 64x depth and up, as this seemed to be about where the accuracy plateaus.

First checked Flye assemblies:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
grep --no-filename -oP "GCATTGAGACCGCAAGGTCTCT+ATGTCTAAAACGTCAAAATAAAAAGC" 06[456789]*/flye.fasta 0[789]*/flye.fasta [12]*/flye.fasta | sort | uniq -c

Which produced this:

    150 GCATTGAGACCGCAAGGTCTCTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC
     11 GCATTGAGACCGCAAGGTCTCTTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC

So most Flye assemblies contained the incorrect Tx8 sequence. This makes sense, as that sequence was slightly in the majority in the reads.

And checked Flye+Medaka assemblies:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
grep --no-filename -oP "GCATTGAGACCGCAAGGTCTCT+ATGTCTAAAACGTCAAAATAAAAAGC" 06[456789]*/flye_medaka.fasta 0[789]*/flye_medaka.fasta [12]*/flye_medaka.fasta | sort | uniq -c

Which produced this:

     50 GCATTGAGACCGCAAGGTCTCTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC
    111 GCATTGAGACCGCAAGGTCTCTTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC

For the 11 Flye assemblies that got it right, Medaka always kept the correct sequence. For the 150 Flye assemblies that got it wrong, Medaka fixed it 2/3 of the time, leaving 50 with the error.

And checked Flye+Medaka+Polypolish assemblies:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R10.4
grep --no-filename -oP "GCATTGAGACCGCAAGGTCTCT+ATGTCTAAAACGTCAAAATAAAAAGC" 06[456789]*/flye_medaka_polypolish.fasta 0[789]*/flye_medaka_polypolish.fasta [12]*/flye_medaka_polypolish.fasta | sort | uniq -c

Which produced this:

     50 GCATTGAGACCGCAAGGTCTCTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC
    111 GCATTGAGACCGCAAGGTCTCTTTTTTTTTATGTCTAAAACGTCAAAATAAAAAGC

So the same results as Flye+Medaka. Due to it being in an inexact repeat, Polypolish couldn't fix this error.

R9.4.1+Illumina assembly and assessment

These methods follow mostly the same procedure as the R10.4+Illumina analysis described above.

Preparing the Nanopore reads

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R9.4.1
head -n2 ../S_aureus_JKD6159.fasta > reference.fasta
cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R9.4.1
cp ../S_aureus_JKD6159_assembly_R9.4.1/trycycler/cluster_001/4_reads.fastq nanopore.fastq

This read set had 77880 reads and 1613880227 bp, which equates to 573x depth over the 2818670 bp chromosome.

Preparing the Illumina reads

I already did this for the R10.4+Illumina analysis, so I just took those reads:

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R9.4.1
cp ../accuracy_vs_depth_R10.4/illumina_*.fastq .

Subsampled read counts

This Python code produces read counts for 320 subsamples of Nanopore reads up to 256x depth, evenly spaced on a square-root scale:

genome_size = 2818670
read_count = 77880
total_read_size = 1613880227
mean_read_length = total_read_size / read_count
depth_sqrt = 0
read_counts = []
while depth_sqrt < 16:
    depth_sqrt += 0.05
    depth = depth_sqrt**2
    base_count = genome_size * depth
    read_count = int(round(base_count / mean_read_length))
    read_counts.append(f'{read_count}')
print(' '.join(read_counts))

Since I used the same Illumina reads as in the R10.4-accuracy-vs-depth analysis, I used the same Illumina read counts I used there.

Flye assembly + Medaka polishing + Polypolish polishing + POLCA polishing

depths=(000.0025 000.0100 000.0225 000.0400 000.0625 000.0900 000.1225 000.1600 000.2025 000.2500 000.3025 000.3600 000.4225 000.4900 000.5625 000.6400 000.7225 000.8100 000.9025 001.0000 001.1025 001.2100 001.3225 001.4400 001.5625 001.6900 001.8225 001.9600 002.1025 002.2500 002.4025 002.5600 002.7225 002.8900 003.0625 003.2400 003.4225 003.6100 003.8025 004.0000 004.2025 004.4100 004.6225 004.8400 005.0625 005.2900 005.5225 005.7600 006.0025 006.2500 006.5025 006.7600 007.0225 007.2900 007.5625 007.8400 008.1225 008.4100 008.7025 009.0000 009.3025 009.6100 009.9225 010.2400 010.5625 010.8900 011.2225 011.5600 011.9025 012.2500 012.6025 012.9600 013.3225 013.6900 014.0625 014.4400 014.8225 015.2100 015.6025 016.0000 016.4025 016.8100 017.2225 017.6400 018.0625 018.4900 018.9225 019.3600 019.8025 020.2500 020.7025 021.1600 021.6225 022.0900 022.5625 023.0400 023.5225 024.0100 024.5025 025.0000 025.5025 026.0100 026.5225 027.0400 027.5625 028.0900 028.6225 029.1600 029.7025 030.2500 030.8025 031.3600 031.9225 032.4900 033.0625 033.6400 034.2225 034.8100 035.4025 036.0000 036.6025 037.2100 037.8225 038.4400 039.0625 039.6900 040.3225 040.9600 041.6025 042.2500 042.9025 043.5600 044.2225 044.8900 045.5625 046.2400 046.9225 047.6100 048.3025 049.0000 049.7025 050.4100 051.1225 051.8400 052.5625 053.2900 054.0225 054.7600 055.5025 056.2500 057.0025 057.7600 058.5225 059.2900 060.0625 060.8400 061.6225 062.4100 063.2025 064.0000 064.8025 065.6100 066.4225 067.2400 068.0625 068.8900 069.7225 070.5600 071.4025 072.2500 073.1025 073.9600 074.8225 075.6900 076.5625 077.4400 078.3225 079.2100 080.1025 081.0000 081.9025 082.8100 083.7225 084.6400 085.5625 086.4900 087.4225 088.3600 089.3025 090.2500 091.2025 092.1600 093.1225 094.0900 095.0625 096.0400 097.0225 098.0100 099.0025 100.0000 101.0025 102.0100 103.0225 104.0400 105.0625 106.0900 107.1225 108.1600 109.2025 110.2500 111.3025 112.3600 113.4225 114.4900 115.5625 116.6400 117.7225 118.8100 119.9025 121.0000 122.1025 123.2100 124.3225 125.4400 126.5625 127.6900 128.8225 129.9600 131.1025 132.2500 133.4025 134.5600 135.7225 136.8900 138.0625 139.2400 140.4225 141.6100 142.8025 144.0000 145.2025 146.4100 147.6225 148.8400 150.0625 151.2900 152.5225 153.7600 155.0025 156.2500 157.5025 158.7600 160.0225 161.2900 162.5625 163.8400 165.1225 166.4100 167.7025 169.0000 170.3025 171.6100 172.9225 174.2400 175.5625 176.8900 178.2225 179.5600 180.9025 182.2500 183.6025 184.9600 186.3225 187.6900 189.0625 190.4400 191.8225 193.2100 194.6025 196.0000 197.4025 198.8100 200.2225 201.6400 203.0625 204.4900 205.9225 207.3600 208.8025 210.2500 211.7025 213.1600 214.6225 216.0900 217.5625 219.0400 220.5225 222.0100 223.5025 225.0000 226.5025 228.0100 229.5225 231.0400 232.5625 234.0900 235.6225 237.1600 238.7025 240.2500 241.8025 243.3600 244.9225 246.4900 248.0625 249.6400 251.2225 252.8100 254.4025 256.0000)

nanopore_read_counts=(0 1 3 5 9 12 17 22 28 34 41 49 57 67 77 87 98 110 123 136 150 165 180 196 213 230 248 267 286 306 327 348 370 393 417 441 466 491 517 544 572 600 629 658 689 720 751 783 816 850 884 919 955 992 1029 1066 1105 1144 1184 1224 1265 1307 1350 1393 1437 1481 1526 1572 1619 1666 1714 1763 1812 1862 1913 1964 2016 2069 2122 2176 2231 2286 2343 2399 2457 2515 2574 2633 2694 2754 2816 2878 2941 3005 3069 3134 3200 3266 3333 3400 3469 3538 3608 3678 3749 3821 3893 3966 4040 4115 4190 4266 4342 4419 4497 4576 4655 4735 4815 4897 4979 5061 5145 5229 5313 5399 5485 5571 5659 5747 5836 5925 6015 6106 6197 6290 6382 6476 6570 6665 6760 6857 6954 7051 7149 7248 7348 7448 7549 7651 7753 7856 7960 8065 8170 8275 8382 8489 8597 8705 8814 8924 9035 9146 9258 9370 9484 9597 9712 9827 9943 10060 10177 10295 10414 10533 10653 10774 10895 11018 11140 11264 11388 11513 11638 11764 11891 12019 12147 12276 12405 12535 12666 12798 12930 13063 13197 13331 13466 13602 13738 13875 14013 14151 14290 14430 14571 14712 14854 14996 15139 15283 15428 15573 15719 15865 16012 16160 16309 16458 16608 16759 16910 17062 17215 17368 17522 17677 17832 17988 18145 18303 18461 18620 18779 18939 19100 19262 19424 19587 19750 19915 20079 20245 20411 20578 20746 20914 21083 21253 21423 21594 21766 21938 22112 22285 22460 22635 22811 22987 23164 23342 23521 23700 23880 24060 24242 24424 24606 24789 24973 25158 25343 25529 25716 25903 26091 26280 26470 26660 26850 27042 27234 27427 27620 27814 28009 28205 28401 28598 28796 28994 29193 29392 29593 29794 29995 30198 30401 30604 30809 31014 31219 31426 31633 31841 32049 32258 32468 32679 32890 33102 33314 33527 33741 33956 34171 34387 34604 34821)

illumina_read_counts=(24 97 218 387 605 871 1186 1549 1960 2420 2928 3484 4089 4742 5444 6194 6993 7839 8735 9678 10670 11711 12800 13937 15122 16356 17639 18969 20349 21776 23252 24776 26349 27970 29640 31358 33124 34939 36802 38713 40673 42681 44738 46843 48996 51198 53448 55747 58094 60489 62933 65425 67966 70555 73192 75878 78612 81394 84225 87104 90032 93008 96033 99105 102227 105396 108614 111881 115196 118559 121970 125430 128939 132495 136101 139754 143456 147206 151005 154852 158748 162692 166684 170725 174814 178951 183137 187371 191654 195985 200364 204792 209268 213793 218366 222987 227657 232375 237142 241957 246820 251732 256692 261700 266757 271862 277016 282218 287469 292767 298115 303510 308954 314447 319988 325577 331214 336900 342635 348417 354249 360128 366056 372032 378057 384130 390252 396422 402640 408907 415222 421585 427997 434457 440966 447523 454128 460782 467484 474235 481034 487881 494777 501721 508714 515755 522844 529982 537168 544402 551685 559016 566396 573824 581301 588825 596399 604020 611690 619409 627176 634991 642854 650766 658727 666735 674793 682898 691052 699254 707505 715804 724152 732548 740992 749485 758026 766615 775253 783939 792674 801457 810288 819168 828096 837073 846098 855171 864293 873463 882682 891949 901264 910628 920040 929500 939009 948566 958172 967826 977529 987279 997079 1006926 1016822 1026767 1036760 1046801 1056890 1067028 1077215 1087449 1097733 1108064 1118444 1128872 1139349 1149874 1160448 1171070 1181740 1192459 1203226 1214041 1224905 1235817 1246778 1257787 1268844 1279950 1291104 1302307 1313558 1324857 1336205 1347601 1359046 1370539 1382080 1393670 1405308 1416994 1428729 1440512 1452344 1464224 1476153 1488130 1500155 1512228 1524350 1536521 1548740 1561007 1573322 1585686 1598099 1610560 1623069 1635626 1648232 1660886 1673589 1686340 1699140 1711988 1724884 1737829 1750822 1763863 1776953 1790091 1803278 1816513 1829796 1843128 1856508 1869937 1883414 1896939 1910513 1924135 1937806 1951525 1965292 1979108 1992972 2006884 2020845 2034855 2048912 2063018 2077173 2091376 2105627 2119926 2134274 2148671 2163116 2177609 2192150 2206740 2221379 2236066 2250801 2265584 2280416 2295297 2310225 2325202 2340228 2355302 2370424 2385595 2400814 2416081 2431397 2446761 2462174 2477635)

conda activate medaka

for i in {1..320}; do
    depth="$depths[i]"
    nanopore_read_count="$nanopore_read_counts[i]"
    illumina_read_count="$illumina_read_counts[i]"

    cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R9.4.1
    mkdir "$depth"
    cd "$depth"

    seqtk sample -s "$nanopore_read_count" ../nanopore.fastq "$nanopore_read_count" > nanopore.fastq
    fast_count nanopore.fastq > nanopore.reads

    seqtk sample -s "$nanopore_read_count" ../illumina_1.fastq "$illumina_read_count" > illumina_1.fastq
    seqtk sample -s "$nanopore_read_count" ../illumina_2.fastq "$illumina_read_count" > illumina_2.fastq
    fast_count illumina_1.fastq illumina_2.fastq > illumina.reads

    # Assemble with Flye:
    /home/ubuntu/programs/Flye/bin/flye -t 32 --nano-hq nanopore.fastq --out-dir flye 2> flye.out

    if [ -f flye/assembly_graph.gfa ]; then  # If Flye produced an assembly graph...
        cp flye/assembly_graph.gfa flye.gfa

        # Try to extract a single chromosomal contig from the Flye assembly:
        ../scripts/get_chromosome.py --min 2813669 --max 2823669 --start ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAA flye.gfa 1> flye.fasta 2> flye.info

        if [ -s flye.fasta ]; then  # If we got a Flye chromosome...

            # Polish with Medaka:
            medaka_consensus -i nanopore.fastq -d flye.fasta -o medaka -m r104_e81_sup_g5015 -t 12 &> medaka.out
            mv medaka/consensus.fasta flye_medaka.fasta

            # Polish with Polypolish:
            bwa index flye_medaka.fasta
            bwa mem -t 32 -a flye_medaka.fasta illumina_1.fastq > alignments_1.sam
            bwa mem -t 32 -a flye_medaka.fasta illumina_2.fastq > alignments_2.sam
            polypolish_insert_filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam &> polypolish.out
            polypolish flye_medaka.fasta filtered_1.sam filtered_2.sam 1> flye_medaka_polypolish.fasta 2>> polypolish.out

            # Polish with POLCA:
            /home/ubuntu/programs/MaSuRCA-4.0.9/bin/polca.sh -a flye_medaka_polypolish.fasta -r illumina_1.fastq" "illumina_2.fastq -t 32 -m 1G &> polca.out
            mv *.PolcaCorrected.fa flye_medaka_polypolish_polca.fasta
        fi
    else
        echo "Flye assembly failed" > flye.info
    fi

    rm -rf flye medaka
    rm *.fastq
    rm *.fai
    rm *.mmi
    rm *.sam
    rm *.amb *.ann *.bwt *.pac *.sa
    rm *.bam *.bam.bai
    rm *.err
    rm *.success *.batches *.names *.report *.vcf
done

One note:

  • I encountered a single assembly (at 008.1225 depth) where the get_chromosome.py script failed to find an exact match for the starting sequence. In that case I ran the get_chromosome.py script again but with ATGTCGGAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAA (Ax5 -> Ax4).

Quantify assembly accuracy

cd /home/ubuntu/Perfect_bacterial_genome_paper/accuracy_vs_depth/R9.4.1

scripts/assess_assembly.py --header > results_flye.tsv
scripts/assess_assembly.py --header > results_flye_medaka.tsv
scripts/assess_assembly.py --header > results_flye_medaka_polypolish.tsv
scripts/assess_assembly.py --header > results_flye_medaka_polypolish_polca.tsv

for flye_medaka_polypolish_polca in */flye_medaka_polypolish_polca.fasta; do
    flye_medaka_polypolish="${flye_medaka_polypolish_polca/_polca/}"
    flye_medaka="${flye_medaka_polypolish/_polypolish/}"
    flye="${flye_medaka/_medaka/}"
    scripts/assess_assembly.py -a "$flye" -r reference.fasta >> results_flye.tsv
    scripts/assess_assembly.py -a "$flye_medaka" -r reference.fasta >> results_flye_medaka.tsv
    scripts/assess_assembly.py -a "$flye_medaka_polypolish" -r reference.fasta >> results_flye_medaka_polypolish.tsv
    scripts/assess_assembly.py -a "$flye_medaka_polypolish_polca" -r reference.fasta >> results_flye_medaka_polypolish_polca.tsv
done