Assemblies - ASBioinfo/Utils-hub GitHub Wiki

1. Pangenome using Minigraph

minigraph -cxggs -t45 GCF_003369695.1_UOA_Brahman_withY_sline.fna LGP01_arcs_gapc_sline.fa > ./final_1/ref_LGP01.gfa
  • Mash distance calculation
minigraph --inv no -cxggs -t 80 ../GCF_003369695.1_UOA_Brahman_withY_sline.fna ../LGP02_supernova_N.fa ../LGP01_supernova_N.fa ../LGP04_supernova_N.fa ../LGP05_supernova_N.fa ../LGP03_supernova_N.fa >mash_genetic_distance.gfa
  • Run script.sh for each assembly, including the backbone
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../GCF_003369695.1_UOA_Brahman_withY_sline.fna > graph_rev_brahman.gfa
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../LGP01_supernova_N.fa > graph_rev_LGP01.gfa
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../LGP02_supernova_N.fa > graph_rev_LGP02.gfa
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../LGP03_supernova_N.fa > graph_rev_LGP03.gfa
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../LGP04_supernova_N.fa > graph_rev_LGP04.gfa
minigraph -t 110 --cov -x asm mash_genetic_distance.gfa ../LGP05_supernova_N.fa > graph_rev_LGP05.gfa
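The per-assembly calls above differ only in the input FASTA, so they could be reduced to a loop. The following is a minimal sketch, assuming the file layout used above. The function name call_paths and the way output names are derived from the input file name are illustrative assumptions, not part of minigraph or of the original script.sh.

```shell
#!/usr/bin/env bash
# Sketch: map each assembly back to the graph with --cov, one output GFA per assembly.
# call_paths is a hypothetical helper name; flags are copied from the commands above.
call_paths() {
    local graph=$1 threads=$2
    shift 2
    local fasta label
    for fasta in "$@"; do
        # Derive a short label from the file name,
        # e.g. ../LGP01_supernova_N.fa -> LGP01
        label=$(basename "$fasta")
        label=${label%%_*}
        minigraph -t "$threads" --cov -x asm "$graph" "$fasta" > "graph_rev_${label}.gfa"
    done
}

# Usage (not run here):
# call_paths mash_genetic_distance.gfa 110 ../LGP0{1..5}_supernova_N.fa
```

With this naming rule the Brahman backbone would be written to graph_rev_GCF.gfa rather than graph_rev_brahman.gfa, so it may be simpler to keep that one call separate.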

2. STAR alignment (for novel sequences)

  • Indexing
STAR --runMode genomeGenerate --runThreadN 32 --genomeDir ./STAR_index --genomeFastaFiles ./cd_hit_out_contig_name_changed.fa --genomeSAindexNbases 9 --sjdbOverhang 100
  • Aligning
STAR --runThreadN 32 --genomeDir ./STAR_index --outFileNamePrefix ./supercontig --readFilesIn unmapped_2ndUnmapped.out.mate1  unmapped_2ndUnmapped.out.mate2
  • Filter the SAM file after aligning the unmapped reads to the NUI
samtools view -@ 40 -h -F12 supercontigAligned.out.sam -o final_filterd.sam
awk '!/^@/{if ($5==255) print $0}' final_filterd.sam|awk '{print "#"$(NF-1)"\t"$0}'|awk -F "#AS:i:" '{print $2}'|awk '{if ($1>=140) print $0}'|cut -f2- >filtered_sam_file
awk '/^@/{print $0}' final_filterd.sam|less
awk '/^@/{print $0}' final_filterd.sam >header.txt
cat header.txt filtered_sam_file >filtered_sam_file.sam
samtools sort -O BAM -@ 60 -o filtered_sam_file_sort.bam filtered_sam_file.sam
stringtie -p 12 -o supercontig.gtf filtered_sam_file_sort.bam
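The header extraction and the chained awk filters above (MAPQ equal to 255, alignment score of at least 140) could be collapsed into a single pass. This is a minimal sketch under stated assumptions: filter_sam is a hypothetical function name, and unlike the original pipeline, which reads the score from the second-to-last column, it searches the optional tags for AS:i: so it does not depend on STAR's attribute order.

```shell
#!/usr/bin/env bash
# Sketch: keep header lines plus uniquely mapped reads (MAPQ 255) whose
# AS:i: alignment score is at least min_as (default 140).
# filter_sam is a hypothetical helper name; it reads SAM on stdin.
filter_sam() {
    local min_as=${1:-140}
    awk -F'\t' -v min_as="$min_as" '
        /^@/ { print; next }              # pass header lines through unchanged
        $5 == 255 {
            for (i = 12; i <= NF; i++) {  # optional tags start at column 12
                if ($i ~ /^AS:i:/) {
                    split($i, tag, ":")
                    if (tag[3] + 0 >= min_as) print
                    break
                }
            }
        }
    '
}

# Usage (not run here):
# filter_sam 140 < final_filterd.sam > filtered_sam_file.sam
```

The output already contains the header, so it can go straight to samtools sort without the separate header.txt step.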

3. RepeatMasker

RepeatMasker -nolow -species cow -xsmall cd_hit_out_contig_name_changed.fa  > repeatmasker.log

4. ARCS

arcs-make arks draft=LGP04_supernova_N reads=LGP_04 k=60 m=50-10000 t=85

5. GapCloser

GapCloser -a LGP05_arcs.fa -b LGP05.config -l 127 -t 80 -o LGP05_arcs_gapc.fa
  • config file
# maximal read length
max_rd_len=127

[LIB]
# average insert size
avg_ins=500

# if sequence needs to be reversed
reverse_seq=0

# in which part(s) the reads are used
asm_flags=4

# use only first 127 bps of each read
rd_len_cutoff=127

# in which order the reads are used while scaffolding
rank=1

# cutoff of pair number for a reliable connection (default 3)
pair_num_cutoff=3

# minimum aligned length to contigs for a reliable read location (default 32)
map_len=32

# fastq file for read 1
q1=/home/livestock/gap_closer/LGP_05/barcode_trimmed_data/LGP05_S1_L001_R1_001_trim.fastq.gz

# fastq file for read 2 always follows fastq file for read 1
q2=/home/livestock/gap_closer/LGP_05/barcode_trimmed_data/LGP05_S1_L001_R2_001_trim.fastq.gz
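
Because q1 and q2 are long, near-identical paths, it is easy to point both at the same file. The check below is a minimal sketch, assuming a config with a single [LIB] block in the format shown above. check_gapcloser_config is a hypothetical helper name and is not part of GapCloser.

```shell
#!/usr/bin/env bash
# Sketch: verify that q1 and q2 in a GapCloser config are set, differ,
# and point at readable files. Assumes exactly one [LIB] block.
# check_gapcloser_config is a hypothetical helper name.
check_gapcloser_config() {
    local config=$1 q1 q2 f
    q1=$(sed -n 's/^q1=//p' "$config")
    q2=$(sed -n 's/^q2=//p' "$config")
    if [ -z "$q1" ] || [ -z "$q2" ]; then
        echo "q1 or q2 is missing in $config" >&2
        return 1
    fi
    if [ "$q1" = "$q2" ]; then
        echo "q1 and q2 point at the same file: $q1" >&2
        return 1
    fi
    for f in "$q1" "$q2"; do
        if [ ! -r "$f" ]; then
            echo "cannot read $f" >&2
            return 1
        fi
    done
}

# Usage (not run here):
# check_gapcloser_config LGP05.config && GapCloser -a LGP05_arcs.fa -b LGP05.config -l 127 -t 80 -o LGP05_arcs_gapc.fa
```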

6. Minimap2

minimap2 -ax asm5 -t 16 --cs your_reference.fasta your_query.fasta > assembly_alignment.sam
  • asm5 is for closely related species; use asm20 for more divergent species