# 2022 Biol 320 MIG-seq data processing commands on the myco server

Processing raw fastq reads from MIG-seq using the ISSRseq pipeline.

Useful references:

- ISSRseq wiki
- A great resource for population genomics in R by Tom Jenkins

1. Rename the fastq.gz files:

```bash
rename 's/^Barrett-320-//' Barrett-320-*
rename 's/_001.fastq.gz/.fastq.gz/' *.fastq.gz
rename 's/_L001//' *.fastq.gz
```
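Taken together, these three renames strip the run prefix and the lane/chunk suffixes: a raw file named, for example, `Barrett-320-WV-Mon-LF-2_S210_L001_R1_001.fastq.gz` ends up as `WV-Mon-LF-2_S210_R1.fastq.gz`.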
2. Make a samples list:

```bash
ls *R1.fastq.gz | sed 's/_R1.fastq.gz//g' > ../samples.txt
```
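The resulting `samples.txt` holds one sample prefix per line (e.g., `WV-Mon-LF-2_S210`), matching the renamed fastq files minus the `_R1.fastq.gz` suffix. A quick check:

```bash
# confirm the list looks sane and count the samples
head samples.txt
wc -l samples.txt
```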
3. Assemble a pseudoreference with the ISSRseq pipeline
```bash
## arguments:
-O [desired prefix of output directory]
-I [path to directory containing sequences]
-S [path to the samples file]
-R [verbatim name of the sample used to create the reference assembly]
-T [number of parallel processing threads -- I recommend not exceeding the number of virtualized cores]
-M [minimum post-trim read length]
-H [number of bases to hard trim from the end of reads]
-P [fasta file of ISSR motifs used]
-K [kmer choice for ABYSS assembly]
-L [minimum assembled contig length]
-N [negative reference in fasta format to filter reads against, e.g. a sequenced plastome or a specific contaminant]
-X [bbduk trimming kmer, equal to or longer than the shortest primer used]
```
4. Start the 'AssembleReference' pipeline to trim reads, assemble the pseudoreference, and blast contaminant contigs:

```bash
nohup ISSRseq_AssembleReference.sh -O test_320_mig -I reads -S samples.txt -R WV-Mon-LF-2_S210 -T 30 -M 50 -H 0 -P issr_primers.txt -K 31 -L 100 -X 50 -N NC_026839_L_japonica_pt.fasta &

### Output will be:
### test_320_mig_2022_03_20_13_06
### Use this name for subsequent steps
```
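Because the job is backgrounded with `nohup`, stdout and stderr are captured in `nohup.out` in the working directory, so progress can be followed with:

```bash
tail -f nohup.out
```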

5. Create BAM files with bbmap:

```bash
nohup ISSRseq_CreateBAMs.sh -O test_320_mig_2022_03_20_13_06 -T 30 &
```
6. Call SNPs with the GATK best-practices pipeline:

```bash
## arguments:
-O [desired prefix of output directory]
-T [number of parallel processing threads -- I recommend not exceeding the number of virtualized cores]
-P [ploidy of organism]

### Command
nohup ISSRseq_AnalyzeBAMs.sh -O test_320_mig_2022_03_20_13_06 -T 30 -P 2 &

# output will be a VCF file called 'filtered_snps.vcf'
```
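As a quick sanity check on the call set (assuming `bcftools` is available on the server; it is not part of the ISSRseq pipeline itself), the summary counts can be printed with:

```bash
# 'SN' lines in bcftools stats output hold summary counts (records, SNPs, indels, samples)
bcftools stats filtered_snps.vcf | grep "^SN"
```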

7. Use ANGSD to estimate genotype likelihoods, filtering on base and mapping quality:

```bash
nohup angsd -out genolike -nThreads 30 -GL 2 -minQ 20 -minMapQ 20 -doGlf 2 -doMaf 2 -b bamlist.txt -doMajorMinor 1 -SNP_pval 1e-6 -doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1 &
```
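With these flags, ANGSD writes several files under the `genolike` prefix: `genolike.beagle.gz` (genotype likelihoods in Beagle format, the input for NGSAdmix in the next step), `genolike.mafs.gz` (minor allele frequencies), `genolike.covMat` (a covariance matrix usable for PCA), and `genolike.ibsMat` (a pairwise identity-by-state distance matrix). Note that `bamlist.txt` is simply a plain-text file listing the paths of the BAM files from step 5, one per line.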
8. Estimate ancestry coefficients and admixture proportions with NGSAdmix, as sketched below.
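A minimal sketch, assuming the `NGSadmix` binary is on the PATH and using the `genolike.beagle.gz` file produced in step 7; the choice of K and the output prefix here are illustrative, and in practice one would run several values of K with multiple replicates each:

```bash
# estimate admixture proportions for K=2 ancestral populations from genotype likelihoods
nohup NGSadmix -likes genolike.beagle.gz -K 2 -P 30 -o ngsadmix_K2 -minMaf 0.05 &
```

The resulting `ngsadmix_K2.qopt` file holds per-individual admixture proportions, one row per sample in `bamlist.txt` order.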

9. Alternatively, because we have ~75% missing data due to low coverage, we can impute the missing genotypes with Beagle:

```bash
# Beagle treats 'out=' as a filename prefix, so this writes imputed.vcf.gz
java -jar beagle.r1399.jar gt=filtered_variants.vcf out=imputed
```
10. For phylogenetic analyses, we can convert our imputed vcf file to fasta, nexus, and phylip formats:

```bash
/usr/local/bin/vcf2phylip --input imputed.vcf.gz --fasta --nexus --nexus-binary
```