Contig breaking - mahulchak/quickmerge GitHub Wiki

Contig breaking: why and how?

Why should you break a nicely assembled contiguous stretch of sequence and why is this included in quickmerge wiki? If you read our NAR paper that describes quickmerge, you must have noticed that the merged assembly sometimes inherits assembly errors from the component assemblies. These situations, counter-intuitively,can reduce the N50 of your merged assembly (they almost never artificially inflate the N50 of the merged assembly).

So if you can identify where such discordance between the two assemblies are located, you can break such contigs at the points where the discordance begins. Then you can use my contig breaker program to break the contig at the mis-assembly junction.

What you need for contig breaking?

You'll need two utilities from my Assembly-utils repo, called xtrac-contig.cpp and editSeq.cpp.

Steps for contig breaking:

Contig extraction

Add the name of the contig you want to break in a text file ("ctg_name.txt"). Include the '>' character. E.g. >chrName Then run xtrac-contig -

xtrac-contig ctg_name.txt assembly.fasta

The extracted contig will be in a file called "asm.new.fasta". xtrac-contig can extract any number of contigs you want but here we will stick with just one.

Break the contig in parts

Lets say the misassembly occurs at position 1001 of your contig. Then you can break it using the editSeq utility in the following way -

editSeq asm.new.fasta 1001 1001 N

editSeq will generate three files, with your broken contig parts in p1 and p2 files. editSeq can do a lot of things, including removal/extraction of a chunk of sequence from the middle of a contig. What I described above is only one of its functions. Currently editSeq works on a single contig/scaffold at a time but I will update it in near future so that it can work on multiple sequences.