Spades - NBISweden/workshop-genome_assembly GitHub Wiki

SPAdes: Small genome assembler

Notes:

  • Sequences stored in assembly_graph.fastg correspond to contigs before repeat resolution (edges of the assembly graph). Paths corresponding to contigs after repeat resolution (scaffolding) are stored in contigs.paths for viewing in Bandage.
  • Check that your reads are long enough for the -k values: each k-mer size must be odd, less than 128, and should be shorter than the reads.
  • In the GFA file, paths correspond to the contigs, and not the segments.
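
As a rough way to check the note about -k values, the sketch below measures the longest read in a FASTQ stream and flags k values that are even, 128 or larger, or not shorter than the reads. The function names max_read_length and check_kmers are made up for this example and are not part of SPAdes, and the 100 bp read is synthetic.

```shell
#!/usr/bin/env bash

# Longest read in a FASTQ stream (sequence lines are lines 2, 6, 10, ...).
max_read_length () {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Flag k values that are even, >= 128, or not shorter than the reads.
check_kmers () {
    local klist="$1" readlen="$2" k
    for k in ${klist//,/ }; do
        if (( k % 2 == 0 || k >= 128 || k >= readlen )); then
            echo "k=$k: not usable"
        else
            echo "k=$k: ok"
        fi
    done
}

# Synthetic 100 bp read, used only for illustration.
# On real data one could use, for example:
#   READLEN=$(zcat /path/to/reads/sample_R1.fastq.gz | max_read_length)
SEQ=$(printf 'A%.0s' {1..100})
QUAL=$(printf 'I%.0s' {1..100})
READLEN=$(printf '@read1\n%s\n+\n%s\n' "$SEQ" "$QUAL" | max_read_length)

check_kmers 21,33,55,77,99,127 "$READLEN"
```

With 100 bp reads this reports k=127 as not usable, so the -k list would need to stop at 99.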

Running SPAdes on all read pairs (all read pairs from the same sample)

Command:

#!/usr/bin/env bash

module load bioinfo-tools spades

CPUS="${SLURM_NPROCS:-20}"
#JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
R1READS=( "$DATA_DIR"/*_R1.fastq.gz )
R2READS=( "$DATA_DIR"/*_R2.fastq.gz )

# Assumes the base name of the data directory is suitable for a prefix (e.g. a sample name)
PREFIX=$( basename "$DATA_DIR" )
CONFIG=my_dataset.yaml

printf -v R1LIST '"%s",\n' "${R1READS[@]}"
printf -v R2LIST '"%s",\n' "${R2READS[@]}"
R1LIST=${R1LIST%,*}
R2LIST=${R2LIST%,*}
cat > "$CONFIG" <<-EOF
  [
    {
      orientation: "fr",
      type: "paired-end",
      left reads: [
        $R1LIST
      ],
      right reads: [
        $R2LIST
      ]
    }
  ]
EOF

spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --dataset "$CONFIG" -o "${PREFIX}-spades_assembly"
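
The file lists in the script above are built with printf -v and a suffix strip. The sketch below runs that step on its own so the result can be inspected before it is written into the YAML file; the file names are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical file names, used only to show the string handling.
R1READS=( "sampleA_R1.fastq.gz" "sampleB_R1.fastq.gz" )

# printf reuses the format for every array element, and -v stores the result
# in R1LIST instead of printing it: one quoted name per line, each followed by a comma.
printf -v R1LIST '"%s",\n' "${R1READS[@]}"

# Remove the shortest suffix that starts with a comma,
# i.e. the final comma and the newline after it.
R1LIST=${R1LIST%,*}

echo "$R1LIST"
```

This prints each quoted file name on its own line, separated by commas and with no trailing comma, which is the form needed inside the square brackets of the dataset file.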

Running SPAdes on read pairs separately (each read pair is a different sample)

Command:

#!/usr/bin/env bash
module load bioinfo-tools spades
CPUS="${SLURM_NPROCS:-16}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*_R1.fastq.gz )

apply_spades () {
	READ1="$1" 			# The first read file of the pair (R1) is the first parameter to this function
	READ2="$2" 			# The second read file of the pair (R2) is the second parameter to this function
	PREFIX=$(basename "${READ1%_R1*}")	# Sample name: the R1 file name with everything from "_R1" onwards removed
	spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --pe1-1 "$READ1" --pe1-2 "$READ2" -o "${PREFIX}-spades_assembly"
}

FASTQ="${FILES[$JOB]}"
apply_spades "$FASTQ" "${FASTQ/_R1./_R2.}"
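
Because the script above indexes FILES with SLURM_ARRAY_TASK_ID, and Bash arrays start at 0, the array task IDs need to run from 0 to one less than the number of R1 files. The sketch below is a dry run that prints the matching array range and shows how each task derives its R2 file and output prefix. The file names and the script name spades_per_sample.sh are hypothetical, and nothing is submitted.

```shell
#!/usr/bin/env bash

# Hypothetical file list standing in for FILES=( "$DATA_DIR"/*_R1.fastq.gz )
FILES=(
    "/path/to/reads/s1_R1.fastq.gz"
    "/path/to/reads/s2_R1.fastq.gz"
    "/path/to/reads/s3_R1.fastq.gz"
)

# Bash arrays are zero-based, so task IDs must cover 0 .. N-1.
ARRAY_RANGE="0-$(( ${#FILES[@]} - 1 ))"
echo "sbatch --array=$ARRAY_RANGE spades_per_sample.sh"   # hypothetical script name; only printed, not run

# Show what each array task would work on.
for JOB in "${!FILES[@]}"; do
    READ1="${FILES[$JOB]}"
    READ2="${READ1/_R1./_R2.}"            # replace the first "_R1." with "_R2."
    PREFIX=$(basename "${READ1%_R1*}")    # drop everything from the last "_R1" onwards
    echo "task $JOB: $READ1 + $READ2 -> ${PREFIX}-spades_assembly"
done
```

Submitting with a range that starts at 1 would skip the first sample and leave the last task with an empty file name.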