Pilon - NBISweden/workshop-genome_assembly GitHub Wiki

Pilon: Polishing assemblies

Notes:

  • Memory requirements: Estimated 1 Gb of Memory for every Mb of sequence. Use java -d64 -Xmx2t for allocating 2 Tb to a Java JVM as the maximum heap size.
  • Pilon is optimised for Illumina data, but the BAM can contain any sequences, e.g. from PacBio.
  • Pilon can be used in the same way with a BAM file from LongRanger to polish with 10X genomics data.

Polishing with Pilon:

See below for code snippets to align Illumina Paired-end data or 10X genomics data.

Polishing Command:

#!/usr/bin/env bash

module load bioinfo-tools Pilon

CPUS="${SLURM_NPROCS:-4}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/bams
FILES=( "$DATA_DIR"/*_bwa_alignment.bam )

apply_pilon () {
	BAM=$1
	PREFIX=$( basename "${BAM%_bwa_alignment.bam}" )
	FASTA="${BAM%_bwa_alignment.bam}.fasta"
	java -jar "$PILON_HOME/pilon.jar" --threads "$CPUS" --genome "$FASTA" --frags "$BAM" --outdir "${PREFIX}-polished_spades_assembly" --output "${PREFIX}-polished_spades_assembly" --tracks --changes --vcf
}

ALIGNMENT="${FILES[$JOB]}"
apply_pilon "$ALIGNMENT"

Summarizing Pilon changes output

An awk script summarising the changes from a Pilon ${PREFIX}.changes file.

summarize_pilon_changes.awk:

# USAGE:
#    awk -f summarize_pilon_changes.awk ${PREFIX}.changes

{
        if ($3 == "." || $4 == ".") {
                # Indel
                indels++
                if($3 == "." ){
                        insertions[insertionslen++] = length($4)
                } else {
                        deletions[deletionslen++] = length($3)
                }
        } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
                # SNP
                snps++
        } else {
                # larger change
                svs++
                structural[structurallen++] = length($3) " -> " length($4)
        }
}
END {
        printf("Total number of changes:             %d\n", NR)
        printf("Number of single nucleotide changes: %d\n", snps)
        printf("Number of indels:                    %d\n", indels)
        printf("Number of segmental changes:         %d\n", svs)
        for( i = 0; i < insertionslen; i++ ){
                print insertions[i] > "insertion_size.txt"
        }
        for( i = 0; i < deletionslen; i++ ){
                print deletions[i] > "deletion_size.txt"
        }
        for( i = 0; i < structurallen; i++ ){
                print structural[i] > "segment_size.txt"
        }
}

The file insertion_size.txt contains the size of each insertion, line by line. The file deletion_size.txt contains the size of each deletion, line by line. The file segment_size.txt contains the size of starting segment and size of end segment for each large change made. These can be plotted as histograms using a tool such as R.

Count genes affected by Pilon changes.

Here is an awk / bash one-liner that will count how many genes in the GFF are affected by changes made in Pilon. This command uses bedtools.

awk '{ split($2,a,":"); if(a[2] ~ /-/){ split(a[2],b,"-"); printf("%s\t%s\t%s\n",a[1],b[1],b[2]) } else { printf("%s\t%s\t%s\n",a[1],a[2],a[2]) } }' ${PREFIX}.changes | bedtools intersect -wa -a Gene_annotation.gff -b "stdin" | sort -u | wc -l

Convert Pilon changes file to GFF3.

Here is an awk script that will take the information in the Pilon Changes file and convert it to GFF3.

pilon_2_GFF3.awk:

# USAGE:
#     awk -f pilon_2_GFF3.awk ${PREFIX}.changes

{
        split($2,a,":")
        if(a[2] ~ /-/){
                split(a[2],b,"-")
                a[2]=b[1]
                a[3]=b[2]
        } else {
                a[3]=a[2]
        }
        if ($3 == "." || $4 == ".") {
                # Indel
                type="Indel"
        } else if ($3 ~ /^[ACGT]$/ && $4 ~ /^[ACGT]$/) {
                # SNP
                type="SNP"
        } else {
                # larger change
                type="Change"
        }
        # Chr, Source, Feature, Start, End, Score, Strand, Phase, Attributes
        printf("%s\t%s\t%s\t%d\t%d\t%s\t%s\t%s\t%s\n",a[1],"Pilon",type,a[2],a[3],".",".",".","ID=" type "_" NR ";OriginalSeq=" $3 ";ChangeSeq=" $4)
        type="*"
}

Making an Illumina read bam:

#!/usr/bin/env bash

module load bioinfo-tools bwa samtools

CPUS="${SLURM_NPROCS:-8}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/sequences
FASTA_DIR=/path/to/assemblies
FILES=( "$FASTA_DIR"/*-spades_assembly/scaffolds.fasta )

align_reads () {
	ASSEMBLY="$1" 					# The assembly is the first parameter to this function
	READ1="$2" 					# The first read pair is the second parameter to this function
	READ2="$3" 					# The second read pair is the third parameter to this function
	PREFIX=$SAMPLE_PREFIX				# Make a prefix from the file names
	ln -s -T "$1" "${PREFIX}.fasta"
	bwa index "${PREFIX}.fasta"				# Index the assembly prior to alignment
	bwa mem -t "$CPUS" "${PREFIX}.fasta" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
	samtools index "${PREFIX}_bwa_alignment.bam"
	samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.stats"
}

FASTA="${FILES[$JOB]}"
SAMPLE_PREFIX=$(basename "${FASTA%-spades_assembly*}" )
mkdir -p bams
cd bams
align_reads "$FASTA" "$DATA_DIR/${SAMPLE_PREFIX}_R"{1,2}.fastq.gz
cd ..

Polishing with 10X genomics data:

Notes:

Create an alignment using 10X's own tools:

#!/usr/bin/env bash

PREFIX=species
ASSEMBLY=/path/to/assembly
FASTQDIR=/path/to/fastq/files/directory
ALN_REF="refdata-$(basename $ASSEMBLY .fasta)"

longranger mkref "$ASSEMBLY"
longranger align --id="$PREFIX" --fastqs="$FASTQDIR" --reference="$ALN_REF"

Then run Pilon as normal with the bam in $PREFIX/outs/possorted_bam.bam.