Data Quantity - NBISweden/workshop-genome_assembly GitHub Wiki

Notes:

  • gzip reports an error when a file is incomplete (i.e. the compressed stream ends before its end-of-file trailer), so a truncated file shows up as an error from zcat.
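
A file can also be checked for completeness before any counting is done. The sketch below relies on `gzip -t`, which tests the integrity of a compressed file without writing any output. The function name `is_complete_gzip` and the temporary demonstration files are illustrative assumptions, not part of the workshop material.

```shell
#!/usr/bin/env bash

# Return 0 if the gzip stream is complete, non-zero otherwise.
is_complete_gzip() {
    gzip -t -- "$1" 2>/dev/null
}

# Demonstration on temporary files (hypothetical data).
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmpdir/ok.fastq.gz"
# Simulate an interrupted transfer by keeping only the first 20 bytes.
head -c 20 "$tmpdir/ok.fastq.gz" > "$tmpdir/truncated.fastq.gz"

for f in "$tmpdir"/*.fastq.gz; do
    if is_complete_gzip "$f"; then
        echo "OK: ${f##*/}"
    else
        echo "INCOMPLETE: ${f##*/}"
    fi
done
```

Running the check first separates transfer problems from genuinely small data sets, since a truncated file would otherwise just yield a lower base count alongside an error message.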

Command (Illumina Paired End):

#!/usr/bin/env bash

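CPUS="${SLURM_NPROCS:-1}"
# Zero-based index of this array task; selects one read pair.
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*_R1.fastq.gz )

# Count bases in the R1 file, then in its R2 mate.
FASTQ=${FILES[$JOB]}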
printf "%s: %d\n" "$FASTQ" "$(zcat "$FASTQ" | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
printf "%s: %d\n" "${FASTQ/_R1./_R2.}" "$(zcat "${FASTQ/_R1./_R2.}" | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
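
To make the counting pipeline easier to follow, here is a minimal sketch that wraps it in a function and runs it on a tiny made-up file. The function name `count_bases` and the demonstration reads are assumptions for illustration only. The pipeline itself is the one used in the command above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count bases in a gzipped FASTQ file.
#   awk 'FNR % 4 == 2'   keeps line 2 of each 4-line record (the sequence)
#   tr -dc "ACGTNacgtn"  deletes newlines and every other character
#   wc -m                counts the characters that remain
count_bases() {
    zcat "$1" | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m
}

# Hypothetical example: two reads of 4 and 5 bases.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTN\n+\nIIIII\n' \
    | gzip -c > "$tmpdir/demo_R1.fastq.gz"

total=$(count_bases "$tmpdir/demo_R1.fastq.gz")
printf "%s: %d\n" "demo_R1.fastq.gz" "$total"
```

Because `tr` deletes everything outside the listed characters, header, separator and quality lines never contribute to the total, even if a record were mis-selected by `awk`.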

Command (General):

#!/usr/bin/env bash

set -ueo pipefail
module load bioinfo-tools seqtk

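CPUS="${SLURM_NPROCS:-1}"
# Zero-based index of this array task; selects one file.
JOB=$SLURM_ARRAY_TASK_ID

DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*.fastq.gz )
# Reads shorter than this many bases are dropped by seqtk before counting.
MIN_LENGTH=10000

FASTQ=${FILES[$JOB]}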
printf "%s: %d\n" "$FASTQ" "$(zcat "$FASTQ" | seqtk seq -L "$MIN_LENGTH" - | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
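
Both scripts use SLURM_ARRAY_TASK_ID as a zero-based index into FILES, so the array range passed to `sbatch --array` would need to run from 0 to the number of files minus one. The following is a sketch of how that range might be derived. The function name `array_range` and the script name `count_bases.sh` are assumptions, and the `sbatch` command is only printed, not executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the zero-based SLURM array range covering every file in a
# directory that ends with the given suffix.
array_range() {
    local dir=$1 suffix=$2
    local files=( "$dir"/*"$suffix" )
    echo "0-$(( ${#files[@]} - 1 ))"
}

# Hypothetical demonstration with three empty placeholder files.
tmpdir=$(mktemp -d)
touch "$tmpdir"/sample{A,B,C}.fastq.gz

range=$(array_range "$tmpdir" ".fastq.gz")
# count_bases.sh is an assumed file name for one of the scripts above.
echo "sbatch --array=$range count_bases.sh"
```

Note that when nothing matches, bash leaves the glob pattern unexpanded, so this sketch would report a range of 0-0. Checking that the directory actually contains reads beforehand avoids submitting a job for a non-existent file.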