Spades - NBISweden/workshop-genome_assembly GitHub Wiki

Spades: Small genome assembler

Notes:

Sequences stored in assembly_graph.fastg correspond to contigs before repeat resolution (edges of the assembly graph). Paths corresponding to contigs after repeat resolution (scaffolding) are stored in contigs.paths for viewing in Bandage.
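Check that your reads are longer than the largest -k value (127 in the commands below). SPAdes requires k-mer sizes to be odd and less than 128, and a read shorter than k contributes no k-mers at that k.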
In the GFA file, paths correspond to the contigs, and not the segments.
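
To make the read-length check above concrete, here is a minimal sketch that reports the shortest and longest read in a gzipped FASTQ file. The function name `read_length_range` is a hypothetical helper and not part of SPAdes; it assumes standard four-line FASTQ records.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of SPAdes): print the shortest and longest
# read length in a gzipped FASTQ file. Assumes four lines per record.
read_length_range () {
    gzip -dc "$1" | awk 'NR % 4 == 2 {
            # Line 2 of every 4-line record holds the sequence
            n = length($0)
            if (min == "" || n < min) min = n
            if (n > max) max = n
        }
        END { printf "min=%d max=%d\n", min, max }'
}
```

Calling `read_length_range` on one of your R1 files and comparing `min` with the largest value passed to -k should show whether a shorter k-mer list is needed.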

Running Spades on all read pairs (All read pairs from the same sample)

Command:

#!/usr/bin/env bash

module load bioinfo-tools spades

CPUS="${SLURM_NPROCS:-20}"
#JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
R1READS=( "$DATA_DIR"/*_R1.fastq.gz )
R2READS=( "$DATA_DIR"/*_R2.fastq.gz )

# Assumes the base name of the data directory is suitable for a prefix (e.g. a sample name)
PREFIX=$( basename "$DATA_DIR" )
CONFIG=my_dataset.yaml

printf -v R1LIST '"%s",\n' "${R1READS[@]}"
printf -v R2LIST '"%s",\n' "${R2READS[@]}"
R1LIST=${R1LIST%,*}
R2LIST=${R2LIST%,*}
cat > "$CONFIG" <<-EOF
  [
    {
      orientation: "fr",
      type: "paired-end",
      right reads: [
        $R2LIST
      ],
      left reads: [
        $R1LIST
      ]
    }
  ]
EOF

# Pass the allocated core count to SPAdes; otherwise it uses its default thread count
spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --dataset "$CONFIG" -o "${PREFIX}-spades_assembly"
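
The script above relies on the two globs returning R1 and R2 files in matching order. As a hedged sketch, a check like the following could be run before writing the YAML file. The name `check_pairs` is a hypothetical helper, and the `_R1.fastq.gz` / `_R2.fastq.gz` suffixes are taken from the script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if every *_R1.fastq.gz file in the
# directory has a matching *_R2.fastq.gz file next to it.
check_pairs () {
    local dir="$1" r1 r2 status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        # An unmatched glob stays literal, so a missing file means no R1 reads
        [[ -e "$r1" ]] || { echo "No R1 files in $dir" >&2; return 1; }
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [[ ! -e "$r2" ]]; then
            echo "Missing mate for $r1" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_pairs "$DATA_DIR" || exit 1` placed before the `printf` lines would stop the job when a mate file is missing.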

Running Spades on read pairs separately (Each read pair is a different sample)

Command:

#!/usr/bin/env bash
module load bioinfo-tools spades
CPUS="${SLURM_NPROCS:-16}"
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*_R1.fastq.gz )

apply_spades () {
	READ1="$1" 			# Forward reads (R1 file) of the pair
	READ2="$2" 			# Reverse reads (R2 file) of the pair
	PREFIX=$(basename "${READ1%_R1*}")
	spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --pe1-1 "$READ1" --pe1-2 "$READ2" -o "${PREFIX}-spades_assembly"
}

# Bash arrays are 0-indexed, so the job array must run from 0 to (number of samples - 1)
FASTQ="${FILES[$JOB]}"
apply_spades "$FASTQ" "${FASTQ/_R1./_R2.}"
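
Because the script indexes `FILES` with the Slurm array task ID, the array range has to match the number of R1 files. A minimal sketch for deriving that range follows. The function name `array_range` is a hypothetical helper, not something provided by Slurm or SPAdes.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a 0-based Slurm array range that covers
# every *_R1.fastq.gz file in the given directory.
array_range () {
    local dir="$1"
    local files=( "$dir"/*_R1.fastq.gz )
    # Without nullglob an unmatched pattern stays literal, so test the first entry
    if [[ ! -e "${files[0]}" ]]; then
        echo "No *_R1.fastq.gz files found in $dir" >&2
        return 1
    fi
    echo "0-$(( ${#files[@]} - 1 ))"
}
```

Assuming the script above is saved as `run_spades.sh` (a placeholder name), it could then be submitted with `sbatch --array="$(array_range /path/to/reads)" run_spades.sh`.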