Data Quantity - NBISweden/workshop-genome_assembly GitHub Wiki
Notes:
- gzip reports an error (unexpected end of file) when a compressed file is incomplete, so a truncated file shows up while the reads are being counted.
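The note above can also be turned into an explicit check before any counting is done. The following is a minimal sketch, assuming the standard `gzip -t` (test integrity) option is available; the function name `check_gzip` is hypothetical and not part of the workshop material.

```shell
#!/usr/bin/env bash
# check_gzip is a hypothetical helper, not part of the original page.
# gzip -t reads the whole compressed stream without writing output and
# exits non-zero if the stream is corrupt or ends prematurely.
check_gzip() {
    local file=$1
    if gzip -t "$file" 2>/dev/null; then
        printf '%s: OK\n' "$file"
    else
        printf '%s: FAILED gzip integrity test\n' "$file" >&2
        return 1
    fi
}
```

It could be called as `check_gzip "$FASTQ"` before the counting commands, so that a truncated file is reported on its own rather than in the middle of a pipeline.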
Command (Illumina Paired End):
#!/usr/bin/env bash
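# Number of CPUs allocated by Slurm (defaults to 1 outside a job)
CPUS="${SLURM_NPROCS:-1}"
# The array task ID selects which read pair this task processes (bash array indices start at 0)
JOB=$SLURM_ARRAY_TASK_ID
DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*_R1.fastq.gz )
FASTQ=${FILES[$JOB]}
# Count the bases in R1, then in the matching R2 file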
printf "%s: %d\n" "$FASTQ" "$(zcat "$FASTQ" | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
printf "%s: %d\n" "${FASTQ/_R1./_R2.}" "$(zcat "${FASTQ/_R1./_R2.}" | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
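To show what the counting pipeline does at each stage, here is a minimal sketch that wraps the same `awk | tr | wc` steps in a function and runs it on two toy records. The name `count_bases` is hypothetical, and the sketch assumes four-line FASTQ records (one sequence line per read), as the commands above do.

```shell
#!/usr/bin/env bash
# count_bases is a hypothetical helper wrapping the pipeline used above.
# It reads uncompressed FASTQ on stdin and prints the number of bases.
count_bases() {
    awk 'FNR % 4 == 2' |    # keep line 2 of every 4-line record (the sequence)
    tr -dc 'ACGTNacgtn' |   # delete everything except base characters, newlines included
    wc -m |                 # count the characters that remain
    tr -d ' '               # strip the padding some wc implementations add
}

# Two toy records with 4 + 6 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTNN\n+\nIIIIII\n' | count_bases
# prints 10
```

For real data the function would sit behind `zcat`, for example `zcat "$FASTQ" | count_bases`.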
Command (General):
#!/usr/bin/env bash
set -ueo pipefail
module load bioinfo-tools seqtk
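# Number of CPUs allocated by Slurm (defaults to 1 outside a job)
CPUS="${SLURM_NPROCS:-1}"
# The array task ID selects which file this task processes (bash array indices start at 0)
JOB=$SLURM_ARRAY_TASK_ID
DATA_DIR=/path/to/reads
FILES=( "$DATA_DIR"/*.fastq.gz )
# Reads shorter than this are dropped before counting
MIN_LENGTH=10000
FASTQ=${FILES[$JOB]}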
printf "%s: %d\n" "$FASTQ" "$(zcat "$FASTQ" | seqtk seq -L "$MIN_LENGTH" - | awk 'FNR % 4 == 2' | tr -dc "ACGTNacgtn" | wc -m)"
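To make the effect of the length filter concrete, the sketch below reproduces the idea without seqtk: it sums the lengths of only those reads that are at least a given number of bases long. This is an illustration under stated assumptions, not a replacement for the seqtk command above. It assumes four-line FASTQ records, it counts every character on the sequence line rather than only `ACGTN`, and the name `count_bases_min_length` is hypothetical.

```shell
#!/usr/bin/env bash
# count_bases_min_length is a hypothetical helper for illustration only.
# Usage: count_bases_min_length MIN < reads.fastq
# Sums the lengths of sequence lines that are at least MIN characters long.
count_bases_min_length() {
    local min=$1
    awk -v min="$min" '
        FNR % 4 == 2 && length($0) >= min { total += length($0) }
        END { print total + 0 }   # "+ 0" prints 0 when no read passes the filter
    '
}

# One 4-base read and one 10-base read; only the second passes a minimum of 5.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' | count_bases_min_length 5
# prints 10
```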