Pilon - NBISweden/workshop-genome_assembly GitHub Wiki
Pilon: Polishing assemblies
Notes:
- Memory requirements: Pilon needs an estimated 1 GB of memory for every Mb of sequence. Use java -d64 -Xmx2t to set the maximum heap size of the JVM to 2 TB.
- Pilon is optimised for Illumina data, but the BAM can contain any sequences, e.g. from PacBio.
- Pilon can be used in the same way with a BAM file from LongRanger to polish with 10X genomics data.
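The memory rule of thumb above can be turned into a heap size with a little arithmetic. The following is a minimal sketch, not part of Pilon itself: the function names assembly_bases and heap_gb are hypothetical, and it assumes that rounding up to one whole GB per started Mb is an acceptable reading of the estimate.

```bash
#!/usr/bin/env bash
# Sketch: derive a heap size from the assembly size, using the rule of thumb
# of roughly 1 GB of memory per Mb of sequence. Function names are hypothetical.

# Print the total number of bases in a FASTA file (header lines are skipped).
assembly_bases () {
    awk '!/^>/ { gsub(/[[:space:]]/, ""); total += length($0) }
         END   { printf "%.0f\n", total }' "$1"
}

# Print a heap size in whole GB: one GB per started Mb, and at least 1 GB.
heap_gb () {
    local bases=$1
    local gb=$(( (bases + 999999) / 1000000 ))
    if (( gb < 1 )); then
        gb=1
    fi
    echo "$gb"
}

# Hypothetical usage:
#   java -Xmx"$(heap_gb "$(assembly_bases assembly.fasta)")"g -jar "$PILON_HOME/pilon.jar" ...
```

The result is only a starting point; the memory actually needed also depends on the read coverage in the BAM.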
Polishing with Pilon:
See below for code snippets to align Illumina Paired-end data or 10X genomics data.
Polishing Command:
#!/usr/bin/env bash
module load bioinfo-tools Pilon
CPUS="${SLURM_NPROCS:-4}"
JOB="${SLURM_ARRAY_TASK_ID:-0}"   # Zero-based array index; defaults to the first BAM outside an array job
DATA_DIR=/path/to/bams
FILES=( "$DATA_DIR"/*_bwa_alignment.bam )
apply_pilon () {
    BAM=$1
    PREFIX=$( basename "${BAM%_bwa_alignment.bam}" )   # Sample name without directory or suffix
    FASTA="${BAM%_bwa_alignment.bam}.fasta"            # Assembly that sits next to the BAM
    java -jar "$PILON_HOME/pilon.jar" --threads "$CPUS" \
        --genome "$FASTA" --frags "$BAM" \
        --outdir "${PREFIX}-polished_spades_assembly" \
        --output "${PREFIX}-polished_spades_assembly" \
        --tracks --changes --vcf
}
ALIGNMENT="${FILES[$JOB]}"
apply_pilon "$ALIGNMENT"
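The polishing script picks its input with a zero-based array index, so the SLURM job array has to run from 0 to the number of BAM files minus one. A small sketch of how that range could be computed; array_range is a hypothetical helper and polish.sh is an assumed file name for the script above.

```bash
#!/usr/bin/env bash
# Sketch: print the zero-based SLURM array range that covers every file
# passed as an argument (hypothetical helper).
array_range () {
    if (( $# == 0 )); then
        echo "no input files" >&2
        return 1
    fi
    echo "0-$(( $# - 1 ))"
}

# Hypothetical usage:
#   shopt -s nullglob
#   FILES=( /path/to/bams/*_bwa_alignment.bam )
#   sbatch --array="$(array_range "${FILES[@]}")" polish.sh
```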
Summarizing Pilon changes output
An awk script summarising the changes from a Pilon ${PREFIX}.changes file.
summarize_pilon_changes.awk:
# USAGE:
# awk -f summarize_pilon_changes.awk ${PREFIX}.changes
{
    if ($3 == "." || $4 == ".") {
        # Indel
        indels++
        if ($3 == ".") {
            insertions[insertionslen++] = length($4)
        } else {
            deletions[deletionslen++] = length($3)
        }
    } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
        # SNP
        snps++
    } else {
        # larger change
        svs++
        structural[structurallen++] = length($3) " -> " length($4)
    }
}
END {
    printf("Total number of changes: %d\n", NR)
    printf("Number of single nucleotide changes: %d\n", snps)
    printf("Number of indels: %d\n", indels)
    printf("Number of segmental changes: %d\n", svs)
    for (i = 0; i < insertionslen; i++) {
        print insertions[i] > "insertion_size.txt"
    }
    for (i = 0; i < deletionslen; i++) {
        print deletions[i] > "deletion_size.txt"
    }
    for (i = 0; i < structurallen; i++) {
        print structural[i] > "segment_size.txt"
    }
}
The file insertion_size.txt contains the size of each insertion, one per line, and deletion_size.txt contains the size of each
deletion in the same way. The file segment_size.txt contains, for each larger change, the length of the original segment and the
length of its replacement, written as "original -> replacement". The insertion and deletion sizes can be plotted as histograms using a tool such as R.
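For a quick look before plotting, the size files can also be tabulated on the command line. This is a minimal sketch with a hypothetical function name, size_histogram; it assumes one integer per line, so it applies to insertion_size.txt and deletion_size.txt but not to segment_size.txt.

```bash
#!/usr/bin/env bash
# Sketch: print "size<TAB>count" for a file with one size per line,
# ordered from the smallest size to the largest (hypothetical helper).
size_histogram () {
    sort -n "$1" | uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Hypothetical usage:
#   size_histogram insertion_size.txt
```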
Count genes affected by Pilon changes.
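Here is an awk / bash one-liner that counts how many genes in the GFF are affected by changes made by Pilon. This command uses bedtools. The coordinates are read from the second column of the changes file, so the annotation must use the same sequence names and coordinates as that column. They are converted to 0-based, half-open BED intervals before intersecting.
awk '{ split($2,a,":"); if(a[2] ~ /-/){ split(a[2],b,"-"); printf("%s\t%d\t%d\n",a[1],b[1]-1,b[2]) } else { printf("%s\t%d\t%d\n",a[1],a[2]-1,a[2]) } }' "${PREFIX}.changes" | bedtools intersect -wa -a Gene_annotation.gff -b "stdin" | sort -u | wc -l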
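If Gene_annotation.gff also holds mRNA, exon or CDS rows, the count includes those features as well as genes. A possible way around this, sketched here under the assumption that genes are labelled "gene" in column 3, is to filter the annotation first. The function name gene_features and the file changes.bed are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: print only the rows of a GFF file whose feature type (column 3)
# is "gene". Comment and directive lines starting with "#" are dropped.
gene_features () {
    awk -F '\t' '!/^#/ && $3 == "gene"' "$1"
}

# Hypothetical usage, with changes.bed holding the BED intervals
# built from the changes file:
#   gene_features Gene_annotation.gff | bedtools intersect -u -a stdin -b changes.bed | wc -l
```

With -u, bedtools reports each gene once however many changes overlap it, so no extra deduplication step is needed.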
Convert Pilon changes file to GFF3.
Here is an awk script that will take the information in the Pilon Changes file and convert it to GFF3.
pilon_2_GFF3.awk:
# USAGE:
# awk -f pilon_2_GFF3.awk ${PREFIX}.changes
{
    split($2,a,":")
    if (a[2] ~ /-/) {
        split(a[2],b,"-")
        a[2]=b[1]
        a[3]=b[2]
    } else {
        a[3]=a[2]
    }
    if ($3 == "." || $4 == ".") {
        # Indel
        type="Indel"
    } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
        # SNP
        type="SNP"
    } else {
        # larger change
        type="Change"
    }
    # Chr, Source, Feature, Start, End, Score, Strand, Phase, Attributes
    printf("%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\t%s\n",a[1],"Pilon",type,a[2],a[3],".",".",".","ID=" type "_" NR ";OriginalSeq=" $3 ";ChangeSeq=" $4)
    type="*"
}
Making an Illumina read bam:
#!/usr/bin/env bash
module load bioinfo-tools bwa samtools
CPUS="${SLURM_NPROCS:-8}"
JOB="${SLURM_ARRAY_TASK_ID:-0}"   # Zero-based array index; defaults to the first assembly outside an array job
DATA_DIR=/path/to/sequences
FASTA_DIR=/path/to/assemblies
FILES=( "$FASTA_DIR"/*-spades_assembly/scaffolds.fasta )
align_reads () {
    ASSEMBLY="$1"           # The assembly is the first parameter to this function
    READ1="$2"              # The first read of each pair is the second parameter
    READ2="$3"              # The second read of each pair is the third parameter
    PREFIX=$SAMPLE_PREFIX   # Sample prefix, set before this function is called
    ln -s -T "$ASSEMBLY" "${PREFIX}.fasta"   # Link the assembly under the sample name
    bwa index "${PREFIX}.fasta"              # Index the assembly prior to alignment
    bwa mem -t "$CPUS" "${PREFIX}.fasta" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
    samtools index "${PREFIX}_bwa_alignment.bam"
    samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.stats"
}
FASTA="${FILES[$JOB]}"
SAMPLE_PREFIX=$(basename "${FASTA%-spades_assembly*}" )
mkdir -p bams
cd bams
align_reads "$FASTA" "$DATA_DIR/${SAMPLE_PREFIX}_R"{1,2}.fastq.gz
cd ..
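The alignment script writes a samtools flagstat report for each BAM, which can be checked before polishing. The following is a minimal sketch that pulls the mapped-read percentage out of such a report. The function name mapped_percent is hypothetical, and it assumes the usual "N + M mapped (P% : ...)" line layout.

```bash
#!/usr/bin/env bash
# Sketch: print the percentage of mapped reads from a samtools flagstat
# report. Only the line whose fourth field is exactly "mapped" is used,
# so a "primary mapped" line is ignored (hypothetical helper).
mapped_percent () {
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5; exit }' "$1"
}

# Hypothetical usage:
#   mapped_percent bams/sample_bwa_alignment.stats
```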
Polishing with 10X genomics data:
Notes:
- 10X's documentation states that the reference should have 500 contigs or fewer. In practice this does not matter, and the tool will not complain if there are more contigs.
- If you want to follow 10X's documentation, a bin-packing tool that packs and unpacks the assembly is available here: https://github.com/NBISweden/GAAS/tree/master/assembly/utilities/binpack
Create an alignment using 10X's own tools:
#!/usr/bin/env bash
PREFIX=species
ASSEMBLY=/path/to/assembly
FASTQDIR=/path/to/fastq/files/directory
ALN_REF="refdata-$(basename "$ASSEMBLY" .fasta)"
longranger mkref "$ASSEMBLY"
longranger align --id="$PREFIX" --fastqs="$FASTQDIR" --reference="$ALN_REF"
Then run Pilon as normal, using the BAM file $PREFIX/outs/possorted_bam.bam.