Trimming - SabaLab/RNASeq_Scripts GitHub Wiki

Usage

trimBatch.py

Runs an entire folder of rawReads through trimming with cutadapt. It can also summarize the average read length and number of reads in the rawRead files and, when trimming finishes, in the trimmedRead files.
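
The read summary described above can be sketched as follows. This is only an illustration of counting reads and averaging read length in a FASTQ file, not the code trimBatch.py actually uses. The function names `open_fastq` and `summarize_fastq` are hypothetical, and the sketch assumes the usual 4-lines-per-record FASTQ layout.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz compression transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def summarize_fastq(path):
    """Return (number of reads, average read length) for one FASTQ file.

    Assumption: each record is exactly 4 lines (header, sequence, '+', quality).
    """
    n_reads = 0
    total_bases = 0
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                n_reads += 1
                total_bases += len(line.strip())
    avg_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, avg_length
```

Calling `summarize_fastq` on a file before and after trimming gives the two numbers to compare: read count and average read length.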

Usage: trimBatch.py [options] inputPath

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c, --count-raw       When set script will count/summarize the raw files
                        first
  --cutadapt-path=CUTADAPTPATH
                        set cutadapt path
  --cutadapt-q=CUTADAPTQ
                        set cutadapt -q to trim based on quality score
  --cutadapt-m=CUTADAPTM
                        set cutadapt -m to set the minimum read length.
  --cutadapt-M=CUTADAPTM
                        set cutadapt -M to set the maximum read length.
  -a ADAPTA, --cutadapt-a=ADAPTA
                        set cutadapt -a set 3' adapter for reads or read1 if
                        paired
  -A ADAPTA, --cutadapt-A=ADAPTA
                        set cutadapt -A set 3' adapter for reads or read2 if
                        paired
  -b ADAPTB, --cutadapt-b=ADAPTB
                        set cutadapt -b set 5' adapter for reads or read1 if
                        paired
  -B ADAPTB, --cutadapt-B=ADAPTB
                        set cutadapt -B set 5' adapter for reads or read2 if
                        paired
  -g ADAPTG, --cutadapt-g=ADAPTG
                        set cutadapt -g set 3' or 5' adapter for reads or
                        read1 if paired
  -G ADAPTG, --cutadapt-G=ADAPTG
                        set cutadapt -G set 3' or 5' adapter for reads or
                        read2 if paired
  -P, --paired          pass rsem the --paired-end parameter and look for
                        paired end files to pass appropriate paired end files
                        to rsem
  -U, --unpaired        pass rsem appropriate unpaired files/parameters
  -d SAMPLEDELIM, --delim=SAMPLEDELIM
                        A delimiter to detect the end of the sample label,
                        default is _L00 to parse everything before the lane as
                        the sample name.
  -p MAXP               set number of processes to run at once.
  --pair-prefix=PAIRPREFIX
                        A prefix before the paired label part of the file name
                        for paired reads, defaults to _R and assumes _R1 -
                        first read and _R2 is second read
  -o OUTPUT, --output=OUTPUT
                        The output folder.  Output will go to a folder with
                        the extracted Sample Name in this location.
  -i INSUFFIX, --input-suffix=INSUFFIX
                        The suffix to look for in the input files.  ex .fq.gz
                        or .fastq
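
To make the interaction of -i, -d and --pair-prefix more concrete, here is a minimal sketch of how file names could be grouped into samples and read pairs. It is an assumption about the general approach, not the script's actual logic. The function name `group_paired_files` and the file names in the usage note are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_paired_files(file_names, in_suffix=".fastq.gz",
                       sample_delim="_L00", pair_prefix="_R"):
    """Group FASTQ file names by sample name and read number.

    Returns {sample_name: {"1": [read1 files], "2": [read2 files]}}.
    Simplification: the read number is found with a plain substring test,
    so a sample name that itself contains e.g. "_R1" would confuse it.
    """
    samples = defaultdict(lambda: {"1": [], "2": []})
    for name in sorted(file_names):
        base = Path(name).name
        if not base.endswith(in_suffix):
            continue  # not an input file according to the suffix (-i)
        if sample_delim not in base:
            continue  # no delimiter, so no sample name can be extracted
        # Everything before the first delimiter is the sample name (-d).
        sample = base.split(sample_delim, 1)[0]
        for read in ("1", "2"):
            if pair_prefix + read in base:  # e.g. "_R1" or "_R2" (--pair-prefix)
                samples[sample][read].append(name)
                break
    return dict(samples)
```

With the default delimiter `_L00`, hypothetical files `A_L001_R1_001.fastq.gz` and `A_L001_R2_001.fastq.gz` both map to sample `A`. With `sample_delim="_R"` they map to sample `A_L001`, so each lane is treated as its own sample.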

Example

/usr/local/scripts/trimBatch.py -P -c -p 6 -o /data/hi-seq/HRDP.Liver.totalRNA.2018-01-10/trimmedReads/v2 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -i .fastq.gz -d _R /data/hi-seq/HRDP.Liver.totalRNA.2018-01-10/rawReads

-P -- paired-end reads
-c -- count rawReads
-p 6 -- run 6 processes at once, so 6 samples go through cutadapt at the same time
-o /data/hi-seq/HRDP.Liver.totalRNA.2018-01-10/trimmedReads/v2 -- output path
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG -- pass the adapter sequence for trimming with the cutadapt -a parameter for read1.
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -- pass the adapter sequence for trimming with the cutadapt -A parameter for read2.
-i .fastq.gz -- process files in the inputDir that end in .fastq.gz
-d _R -- use _R as the sample name delimiter, so everything before _R in the file name becomes the sample name
/data/hi-seq/HRDP.Liver.totalRNA.2018-01-10/rawReads -- the input directory (inputPath) containing the rawReads to count and trim.
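
For orientation, the example above roughly corresponds to running one paired-end cutadapt command per sample, several at a time. The sketch below shows one way this could be done. It is not taken from trimBatch.py: the function names `build_cutadapt_cmd` and `run_all` are hypothetical, and the output file naming is left to the caller. The cutadapt flags used (-a, -A, -q, -m, -o, -p) are standard cutadapt options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_cutadapt_cmd(read1, read2, out1, out2, adapter_a, adapter_A,
                       cutadapt_path="cutadapt", quality=None, min_length=None):
    """Build the argument list for one paired-end cutadapt run."""
    cmd = [cutadapt_path, "-a", adapter_a, "-A", adapter_A]
    if quality is not None:
        cmd += ["-q", str(quality)]      # quality trimming cutoff
    if min_length is not None:
        cmd += ["-m", str(min_length)]   # discard reads shorter than this
    # -o is the read1 output, -p the read2 output, followed by the two inputs.
    cmd += ["-o", out1, "-p", out2, read1, read2]
    return cmd


def run_all(commands, max_processes=6):
    """Run commands with at most max_processes child processes alive at once.

    Each worker thread just waits on one subprocess, so the thread count
    caps the number of concurrent cutadapt processes. Returns exit codes
    in the same order as the commands.
    """
    def run_one(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_processes) as pool:
        return list(pool.map(run_one, commands))
```

A non-zero exit code in the returned list marks a sample whose trimming failed and should be rerun or inspected.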