Pilon - NBISweden/workshop-genome_assembly GitHub Wiki

Pilon: Polishing assemblies

Notes:

Memory requirements: Estimated 1 Gb of Memory for every Mb of sequence. Use java -d64 -Xmx2t for allocating 2 Tb to a Java JVM as the maximum heap size.
Pilon is optimised for Illumina data, but the BAM can contain any sequences, e.g. from PacBio.
Pilon can be used in the same way with a BAM file from LongRanger to polish with 10X genomics data.

Polishing with Pilon:

See below for code snippets to align Illumina Paired-end data or 10X genomics data.

Polishing Command:

#!/usr/bin/env bash

module load bioinfo-tools Pilon

CPUS="${SLURM_NPROCS:-4}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/bams
FILES=( "$DATA_DIR"/*_bwa_alignment.bam )

apply_pilon () {
	BAM=$1
	PREFIX=$( basename "${BAM%_bwa_alignment.bam}" )
	FASTA="${BAM%_bwa_alignment.bam}.fasta"
	java -jar "$PILON_HOME/pilon.jar" --threads "$CPUS" --genome "$FASTA" --frags "$BAM" --outdir "${PREFIX}-polished_spades_assembly" --output "${PREFIX}-polished_spades_assembly" --tracks --changes --vcf
}

ALIGNMENT="${FILES[$JOB]}"
apply_pilon "$ALIGNMENT"

Summarizing Pilon changes output

An awk script summarising the changes from a Pilon ${PREFIX}.changes file.

summarize_pilon_changes.awk:

# USAGE:
#    awk -f summarize_pilon_changes.awk ${PREFIX}.changes

{
        if ($3 == "." || $4 == ".") {
                # Indel
                indels++
                if($3 == "." ){
                        insertions[insertionslen++] = length($4)
                } else {
                        deletions[deletionslen++] = length($3)
                }
        } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
                # SNP
                snps++
        } else {
                # larger change
                svs++
                structural[structurallen++] = length($3) " -> " length($4)
        }
}
END {
        printf("Total number of changes:             %d\n", NR)
        printf("Number of single nucleotide changes: %d\n", snps)
        printf("Number of indels:                    %d\n", indels)
        printf("Number of segmental changes:         %d\n", svs)
        for( i = 0; i < insertionslen; i++ ){
                print insertions[i] > "insertion_size.txt"
        }
        for( i = 0; i < deletionslen; i++ ){
                print deletions[i] > "deletion_size.txt"
        }
        for( i = 0; i < structurallen; i++ ){
                print structural[i] > "segment_size.txt"
        }
}

The file insertion_size.txt contains the size of each insertion, line by line. The file deletion_size.txt contains the size of each deletion, line by line. The file segment_size.txt contains the size of starting segment and size of end segment for each large change made. These can be plotted as histograms using a tool such as R.

Count genes affected by Pilon changes.

Here is an awk / bash one-liner that will count how many genes in the GFF are affected by changes made in Pilon. This command uses bedtools.

awk '{ split($2,a,":"); if(a[2] ~ /-/){ split(a[2],b,"-"); printf("%s\t%s\t%s\n",a[1],b[1],b[2]) } else { printf("%s\t%s\t%s\n",a[1],a[2],a[2]) } }' ${PREFIX}.changes | bedtools intersect -wa -a Gene_annotation.gff -b "stdin" | sort -u | wc -l

Convert Pilon changes file to GFF3.

Here is an awk script that will take the information in the Pilon Changes file and convert it to GFF3.

pilon_2_GFF3.awk:

# USAGE:
#     awk -f pilon_2_GFF3.awk ${PREFIX}.changes

{
        split($2,a,":")
        if(a[2] ~ /-/){
                split(a[2],b,"-")
                a[2]=b[1]
                a[3]=b[2]
        } else {
                a[3]=a[2]
        }
        if ($3 == "." || $4 == ".") {
                # Indel
                type="Indel"
        } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
                # SNP
                type="SNP"
        } else {
                # larger change
                type="Change"
        }
        # Chr, Source, Feature, Start, End, Score, Strand, Phase, Attributes
        printf("%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\t%s\n",a[1],"Pilon",type,a[2],a[3],".",".",".","ID=" type "_" NR ";OriginalSeq=" $3 ";ChangeSeq=" $4)
        type="*"
}

Making an Illumina read bam:

#!/usr/bin/env bash

module load bioinfo-tools bwa samtools

CPUS="${SLURM_NPROCS:-8}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/sequences
FASTA_DIR=/path/to/assemblies
FILES=( "$FASTA_DIR"/*-spades_assembly/scaffolds.fasta )

align_reads () {
	ASSEMBLY="$1" 					# The assembly is the first parameter to this function
	READ1="$2" 					# The first read pair is the second parameter to this function
	READ2="$3" 					# The second read pair is the third parameter to this function
	PREFIX=$SAMPLE_PREFIX				# Make a prefix from the file names
	ln -s -T "$1" "${PREFIX}.fasta"
	bwa index "${PREFIX}.fasta"				# Index the assembly prior to alignment
	bwa mem -t "$CPUS" "${PREFIX}.fasta" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
	samtools index "${PREFIX}_bwa_alignment.bam"
	samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.stats"
}

FASTA="${FILES[$JOB]}"
SAMPLE_PREFIX=$(basename "${FASTA%-spades_assembly*}" )
mkdir -p bams
cd bams
align_reads "$FASTA" "$DATA_DIR/${SAMPLE_PREFIX}_R"{1,2}.fastq.gz
cd ..

Polishing with 10X genomics data:

Notes:

10X's documentation states the reference should have 500 contigs or less. This does not actually matter and the tool will not complain if there are more contigs.
If you want to follow 10X's documentation, a bin-packing tool was written which will pack and unpack the assembly here: https://github.com/NBISweden/GAAS/tree/master/assembly/utilities/binpack

Create an alignment using 10X's own tools:

#!/usr/bin/env bash

PREFIX=species
ASSEMBLY=/path/to/assembly
FASTQDIR=/path/to/fastq/files/directory
ALN_REF="refdata-$(basename $ASSEMBLY .fasta)"

longranger mkref "$ASSEMBLY"
longranger align --id="$PREFIX" --fastqs="$FASTQDIR" --reference="$ALN_REF"

Then run Pilon as normal with the bam in $PREFIX/outs/possorted_bam.bam.