Spades - NBISweden/workshop-genome_assembly GitHub Wiki
Spades: Small genome assembler
Notes:
- Sequences stored in `assembly_graph.fastg` correspond to contigs before repeat resolution (edges of the assembly graph). Paths corresponding to contigs after repeat resolution (scaffolding) are stored in `contigs.paths` for viewing in Bandage.
- Check your reads are long enough for the `-k` values.
- In the GFA file, paths correspond to the contigs, and not the segments.
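One way to check the read lengths against the `-k` values is to count how many reads are shorter than the largest k, since a read shorter than k contains no k-mer of that size. The function below is a minimal sketch, not part of Spades: `count_short_reads` is a hypothetical helper name, and it assumes gzipped FASTQ files with exactly four lines per record (no wrapped sequences).

```bash
#!/usr/bin/env bash
# Sketch (assumption: 4-line FASTQ records, gzip-compressed input).
# count_short_reads is a hypothetical helper, not a Spades tool.

# Print the number of reads in FASTQ ($1) that are shorter than K ($2).
count_short_reads () {
    local fastq="$1" k="$2"
    # The sequence is line 2 of every 4-line record.
    gzip -cd "$fastq" |
        awk -v k="$k" 'NR % 4 == 2 && length($0) < k { n++ } END { print n + 0 }'
}

# Example call (the file name is a placeholder):
# count_short_reads /path/to/reads/sample_R1.fastq.gz 127
```

If a large share of the reads falls below the largest k, it may be worth dropping that value from the `-k` list.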
Running Spades on all read pairs (All read pairs from the same sample)
Command:
#!/usr/bin/env bash
module load bioinfo-tools spades
CPUS="${SLURM_NPROCS:-20}"
#JOB=$SLURM_ARRAY_TASK_ID
DATA_DIR=/path/to/reads
R1READS=( "$DATA_DIR"/*_R1.fastq.gz )
R2READS=( "$DATA_DIR"/*_R2.fastq.gz )
# Assumes the base name of the data directory is suitable for a prefix (e.g. a sample name)
PREFIX=$( basename "$DATA_DIR" )
CONFIG=my_dataset.yaml
printf -v R1LIST '"%s",\n' "${R1READS[@]}"
printf -v R2LIST '"%s",\n' "${R2READS[@]}"
R1LIST=${R1LIST%,*}
R2LIST=${R2LIST%,*}
cat > "$CONFIG" <<-EOF
[
{
orientation: "fr",
type: "paired-end",
left reads: [
$R1LIST
],
right reads: [
$R2LIST
]
}
]
EOF
spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --dataset "$CONFIG" -o "${PREFIX}-spades_assembly"
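The script above builds the quoted, comma-separated file lists with `printf -v` followed by a suffix trim. The sketch below isolates that idiom so it can be tried on its own. `quote_list` is a hypothetical function name and the file names in the example call are placeholders. It needs bash, because `printf -v` is a bash extension.

```bash
#!/usr/bin/env bash
# Sketch of the list-building idiom; quote_list is a hypothetical helper.

# Print each argument as a double-quoted entry, one per line,
# separated by commas, with no comma after the last entry.
quote_list () {
    local list
    # printf reuses its format for every argument: one '"name",' line per file.
    printf -v list '"%s",\n' "$@"
    # Remove the final comma and the newline that follows it.
    printf '%s\n' "${list%,*}"
}

# Example call (placeholder file names):
# quote_list sampleA_R1.fastq.gz sampleB_R1.fastq.gz
```

The trailing comma is trimmed so that the last entry of each list in the generated dataset file is not followed by a stray separator.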
Running Spades on read pairs separately (Each read pair is a different sample)
Command:
#!/usr/bin/env bash
module load bioinfo-tools spades
CPUS="${SLURM_NPROCS:-16}"
JOB="${SLURM_ARRAY_TASK_ID:?must be run as a Slurm array job}" # Zero-based index into FILES
DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*_R1.fastq.gz )
apply_spades () {
READ1="$1" # The first read file of the pair (R1) is the first parameter to this function
READ2="$2" # The second read file of the pair (R2) is the second parameter to this function
PREFIX=$(basename "${READ1%_R1*}")
spades.py -t "$CPUS" -k 21,33,55,77,99,127 --careful --pe1-1 "$READ1" --pe1-2 "$READ2" -o "${PREFIX}-spades_assembly"
}
FASTQ="${FILES[$JOB]}"
apply_spades "$FASTQ" "${FASTQ/_R1./_R2.}"
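Because the script above picks its input with `${FILES[$JOB]}` and bash arrays are indexed from 0, the Slurm array indices need to run from 0 to one less than the number of R1 files. The sketch below computes that range. `array_range` and the script name `run_spades.sh` are hypothetical, and the `sbatch` line is shown only as a comment.

```bash
#!/usr/bin/env bash
# Sketch: compute the --array range for N input files indexed from 0.
# array_range is a hypothetical helper name.

array_range () {
    local n="$1"
    if [ "$n" -lt 1 ]; then
        echo "no input files" >&2
        return 1
    fi
    # Indices run from 0 to N-1, matching ${FILES[$JOB]}.
    printf '0-%d\n' "$(( n - 1 ))"
}

# Example submission (run_spades.sh is a placeholder for the script above):
# FILES=( /path/to/reads/*_R1.fastq.gz )
# sbatch --array="$(array_range "${#FILES[@]}")" run_spades.sh
```

With four R1 files this gives the range `0-3`, so each array task assembles one read pair.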