Reconstructing contigs - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki
Introduction:
This module aligns the cleaned reads of each individual (produced by cleanpe) to that individual's in-target reference (generated by intarget) using novoalign. The aim is to retain only reads that map uniquely to the reference. GATK (McKenna et al. 2010) is then used to perform re-alignment around indels. Finally, the module uses SAMtools/BCFtools (Li et al. 2009) to generate individual consensus sequences by calling genotypes and incorporating ambiguous sites into the individual-specific assemblies. Certain filters are applied during this process; see below for details.
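As a rough illustration of what "incorporating ambiguous sites" can mean for a diploid consensus, the sketch below encodes each called genotype as a single IUPAC nucleotide code (for example, a heterozygous A/G site becomes R). This is only a minimal sketch: the function names `consensus_base` and `consensus_sequence` are hypothetical and are not taken from the seqCapture code, which relies on SAMtools/BCFtools for this step.

```python
# Illustrative sketch only; not taken from the seqCapture source code.
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the IUPAC code for a diploid genotype, or 'N' if unrecognized."""
    key = frozenset((allele1.upper(), allele2.upper()))
    return IUPAC.get(key, "N")


def consensus_sequence(genotypes):
    """Build a consensus string from an iterable of (allele1, allele2) pairs."""
    return "".join(consensus_base(a, b) for a, b in genotypes)


if __name__ == "__main__":
    # Homozygous A, heterozygous C/T, homozygous G
    print(consensus_sequence([("A", "A"), ("C", "T"), ("G", "G")]))
```

Homozygous sites keep their base, while heterozygous sites are written with the corresponding two-base ambiguity code.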
Command and options:
(seqCapture) $ seqCapture assemble
Usage: seqCap buildcontigs [options]
Basic Options:
-t INT Target sequences could be one of the following:
1=individual exons;
2=cDNA (no UTR);
3=transcripts (including UTR);
4=random (such as UCEs, no need for exon identification)
[3]
-a DIR Path to a folder with all intarget assemblies
(AAA_targetedRegionAndFlanking.fasta);
-f DIR Path to a folder with all bed files
-b DIR A folder with all cleaned reads (AAA_1_final.fq,
AAA_2_final.fq, AAA_u_final.fq...)
-i INT Avg. Insert size [200];
-m INT memory limit (in MB) for the program, default 800;
0 for unlimited [0]
-n INT number of threads [10]
-d INT Minimum depth to keep a site, otherwise masked as an "N" [5]
-D INT Maximum depth to keep a site, otherwise masked as an "N" [100000]
-N INT INDEL filtering window [5]
-M FLOAT Discard a locus if M percent bases are Ns [0.8]
-c INT only keep concordant mapping for PE reads?
1 = yes
0 = no [1]
-r INT read length (bp) of original raw reads [100]
-s INT repeat Masking?
1 = yes
0 = no [0]
RepeatMasking Options: (use T or R)
-R CHAR Species used for repeatmasking. some examples are: human, mouse, rattus,
"ciona savignyi",arabidopsis, mammal, carnivore, rodentia, rat, cow, pig,
cat, dog, chicken, fugu, danio, "ciona intestinalis", drosophila,
anopheles, elegans,diatoaea, artiodactyl, rice, wheat, maize,
"vertebrata metazoa"
-T CHAR Use a custom-build repetitive library (full path) for repeat masking,
in this case do not use -R
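The depth and missing-data filters (-d, -D and -M) can be pictured with the following minimal sketch. It assumes per-site depths are already available as a list of integers, and that a locus is discarded when its fraction of Ns reaches the -M threshold; whether the real program treats that boundary as inclusive is an assumption here. The function names `mask_by_depth` and `keep_locus` are hypothetical and do not come from seqCapture.

```python
# Illustrative sketch only; the actual filtering is done inside seqCapture.
def mask_by_depth(seq, depths, min_depth=5, max_depth=100000):
    """Replace sites whose depth is outside [min_depth, max_depth] with 'N'.

    The defaults mirror the -d and -D defaults listed in the help text.
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth list must have the same length")
    return "".join(
        base if min_depth <= depth <= max_depth else "N"
        for base, depth in zip(seq, depths)
    )


def keep_locus(seq, max_n_fraction=0.8):
    """Return True if the fraction of Ns is below max_n_fraction (cf. -M).

    Assumption: a locus whose N fraction equals the threshold is discarded.
    """
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < max_n_fraction


if __name__ == "__main__":
    masked = mask_by_depth("ACGTA", [10, 3, 7, 200000, 5])
    print(masked, keep_locus(masked))
```

In this sketch a site with depth exactly equal to the minimum is kept, which matches the wording "Minimum depth to keep a site".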
Prepare input for the run:
- After cleanpe finishes, the cleaned reads of each sample are stored in the directory "cleaned_reads_dir", which is the input for this step.
(seqCapture) $ ls cleaned_reads_dir/
Sample1_1_final.fq Sample1_2_final.fq Sample1_u_final.fq Sample1.contam.out
Sample2_1_final.fq Sample2_2_final.fq Sample2_u_final.fq Sample2.contam.out
......
SampleN_1_final.fq SampleN_2_final.fq SampleN_u_final.fq SampleN.contam.out
- Intarget assemblies generated by the intarget step:
(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta
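Before launching the run, it can help to confirm that every in-target assembly has its matching cleaned-read files. The following is a minimal sketch of such a check, based only on the file naming shown above; the function name `find_incomplete_samples` is hypothetical and the script is not part of seqCapture.

```python
# Illustrative pre-flight check; not part of seqCapture.
from pathlib import Path

# File name suffixes as they appear in the directory listings above.
READ_SUFFIXES = ("_1_final.fq", "_2_final.fq", "_u_final.fq")
REF_SUFFIX = "_targetedRegionAndFlanking.fasta"


def find_incomplete_samples(reads_dir, fasta_dir):
    """Map each sample with missing read files to the list of missing names.

    Samples are discovered from the in-target assemblies in fasta_dir.
    """
    reads_dir, fasta_dir = Path(reads_dir), Path(fasta_dir)
    samples = sorted(
        p.name[: -len(REF_SUFFIX)] for p in fasta_dir.glob("*" + REF_SUFFIX)
    )
    problems = {}
    for sample in samples:
        missing = [
            sample + suffix
            for suffix in READ_SUFFIXES
            if not (reads_dir / (sample + suffix)).is_file()
        ]
        if missing:
            problems[sample] = missing
    return problems


if __name__ == "__main__":
    for sample, missing in find_incomplete_samples("cleaned_reads_dir", "fasta").items():
        print(sample, "is missing:", ", ".join(missing))
```

An empty result means every sample found in the assembly folder has all three read files.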
Usage examples:
Assemble every sample in "cleaned_reads_dir" and store the raw assemblies for each sample in "raw_assemblies_dir", choosing kmer lengths of 21, 33, 55, 77, 99, and 127 (-kmer 21,33,55,77,99,127) and allocating 10 CPUs (-n 10) for the SPAdes assemblies.
(seqCapture) $ seqCapture assemble -reads /path/to/cleaned_reads_dir/ -kmer 21,33,55,77,99,127 -out raw_assemblies_dir -np 10
Output:
In "raw_assemblies_dir" individual assemblies are stored:
(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta
Sample2.fasta
......
SampleN.fasta