Introduction
Installation
In order to get a working environment, we recommend cloning the git repository from the command line:
git clone https://github.com/gcorre/GNT_GuideSeq
Your folder architecture should look similar to:
Path/to/guideseq/
├── 00-pipeline/
├── 01-envs/
├── 02-resources/
├── test/
Conda environments
Install the miniconda environment manager:
## https://www.anaconda.com/docs/getting-started/miniconda/install
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda update -n base -c conda-forge conda
conda install mamba -c conda-forge # faster package manager
conda install snakemake # workflow manager
Create the environment, at a path of your choice, from the environment.yaml file:
mamba env create -p path/to/env/ -f environment.yaml
You should now have an environment named 'guideseq' containing all the required programs, which you can check by running:
mamba env list
mamba list -n guideseq # versions details
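If you need to use the installed tools interactively (outside of `snakemake --use-conda`), activate the environment. A minimal sketch, assuming the path used with `-p` above:

```bash
## Activate the environment created above (adjust to the path given to -p)
conda activate path/to/env/
```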
Reference genomes
The pipeline uses the bowtie2 program to align reads to the reference genome (@langmead2018).
For efficient genome alignment, you can use a pre-built index for Bowtie2. These indices can be downloaded from the Bowtie2 website. Using a pre-built index saves computational resources and time, as it eliminates the need to build the index from scratch.
If you decide to build the index yourself, use the bowtie2 command line:
bowtie2-build --threads {threads} {fasta_file} {index_prefix}
We recommend not including haplotypes, scaffolds and unplaced sequences, to avoid unnecessary multi-hit alignments. Instead, use the "primary assembly" version from Ensembl or GENCODE, for example. If necessary, manually remove unwanted chromosomes using Linux grep functions or the seqkit program:
seqkit grep -r -p '^(chr)?([0-9]+|X|Y|MT?)$' Homo_sapiens.GRCh38.dna.primary_assembly.fa > your_clean_fasta
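An optional sanity check, listing the sequence names kept in the cleaned FASTA (file name taken from the command above):

```bash
## List the sequence names remaining after filtering
grep '^>' your_clean_fasta | cut -d ' ' -f1 | sort
```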
The index prefix path will be used in the configuration file.
Annotation
Off-target site annotation is performed from a GTF file that can be downloaded from any source (Ensembl, GENCODE, ...). The GTF file is processed to keep only gene and exon features for annotation.
Annotation and reference genome must use the same chromosome nomenclature ("chr1" or "1").
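A quick way to compare chromosome naming between the two files; a sketch, assuming the cleaned FASTA from above and a GENCODE GTF as in the example configuration:

```bash
## Extract chromosome names from the FASTA and the GTF, then show names unique to either file
grep '^>' your_clean_fasta | cut -d ' ' -f1 | tr -d '>' | sort -u > fasta_chroms.txt
zcat gencode.v46.annotation.gtf.gz | grep -v '^#' | cut -f1 | sort -u > gtf_chroms.txt
comm -3 fasta_chroms.txt gtf_chroms.txt   # lines listed here are present in only one of the two files
```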
Prediction tool
The pipeline uses the SWOffinder program (@yaish2024) to predict gRNA off-targets on the fly.
- Install it from the GitHub repository: https://github.com/OrensteinLab/SWOffinder
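A minimal installation sketch; /opt/SWOffinder is only an example location, matching the SWoffFinder `path` key shown later in the configuration file:

```bash
## Clone SWOffinder to a location of your choice (you may need write permission or sudo for /opt)
git clone https://github.com/OrensteinLab/SWOffinder /opt/SWOffinder
## Then follow the SWOffinder README for build/usage instructions
```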
Running the pipeline
In order to run the analysis, 4 elements are mandatory:

- The conda environment (see above for installation)
- The sample data sheet
- The configuration file
- The input un-demultiplexed reads in fastq.gz format (R1, R2, I1, I2). If working with libraries that were already demultiplexed, please read section .....
Prepare the sample data sheet (SDS)
The sample data sheet (SDS) is a simple semicolon-delimited (;) file that contains information about each sample to process in the run. An example is provided in /test/sampleInfo.csv.
Mandatory columns are:

- sampleName: sample name to use. It will be used to name output files and folders. Sample names must not include '-' as this symbol has a special purpose in the pipeline; use "_" or any other separator of your choice instead.
- Genome: used to define the reference genome. This must be one of the values present in the configuration file under the genome key.
- gRNA_name: name of the gRNA used.
- gRNA_sequence: sequence of the gRNA, without the PAM sequence.
- orientation: which PCR orientation was chosen ["positive", "negative", or "mixed" if both share the same indexes]. This information is only used for metadata purposes; both PCR orientations are automatically processed by the pipeline.
- Cas: name of the Cas protein used.
- PAM_sequence: sequence of the PAM ["NGG"].
- Cut_Offset: distance from the gRNA end where the cut occurs [-4].
- type: type of experiment ["guideseq" or "iguideseq"]. This value defines which sequences are trimmed.
- index1: sequence of index 1.
- index2: sequence of index 2.
Additional columns can be added for metadata annotation purposes.
The SDS format is validated using the snakemake.utils validate function.
Example of SDS:
sampleName | CellType | Genome | gRNA_name | gRNA_sequence | orientation | Cas | PAM_sequence | Cut_Offset | Protocole | index1 | index2 |
---|---|---|---|---|---|---|---|---|---|---|---|
VEGFAs1_HEK293T_pos | HEK293T | GRCh38 | VEGFAs1 | GGGTGGGGGGAGTTTGCTCC | positive | Cas9 | NGG | -4 | guideseq | ATCGATCG | AATTCCAA |
VEGFAs1_HEK293T_neg | HEK293T | GRCh38 | VEGFAs1 | GGGTGGGGGGAGTTTGCTCC | negative | Cas9 | NGG | -4 | guideseq | AATTCCAA | ATCGATCG |
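For reference, here is how the same two rows might look in the raw semicolon-delimited file (a hypothetical rendering of /test/sampleInfo.csv):

```
sampleName;CellType;Genome;gRNA_name;gRNA_sequence;orientation;Cas;PAM_sequence;Cut_Offset;Protocole;index1;index2
VEGFAs1_HEK293T_pos;HEK293T;GRCh38;VEGFAs1;GGGTGGGGGGAGTTTGCTCC;positive;Cas9;NGG;-4;guideseq;ATCGATCG;AATTCCAA
VEGFAs1_HEK293T_neg;HEK293T;GRCh38;VEGFAs1;GGGTGGGGGGAGTTTGCTCC;negative;Cas9;NGG;-4;guideseq;AATTCCAA;ATCGATCG
```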
Merging samples
In certain scenarios, it may be beneficial to merge different samples from a library during the read-processing stage. This can be particularly useful when dealing with multiple replicates or +/- PCR libraries.
To achieve this, use the same sample name for multiple rows in the sample data sheet (SDS). Samples that share the same name will be merged during read processing, provided they meet the following criteria:
- Reference genome: the samples must use the same reference genome.
- gRNA: the samples must have the same gRNA, both in sequence and in name.
- Cas protein: the samples must use the same Cas protein, with identical PAM (protospacer adjacent motif) and offset values.
- Protocole: the samples must use the same ODN (oligodeoxynucleotide), defined as guideseq or iguideseq.
If any of the above conditions is not met, an error is raised and the pipeline stops. This ensures that only compatible samples are merged, maintaining the integrity of the data processing workflow.
Using the example SDS above, if you want to merge the positive and negative libraries, give the same sample name to both rows. As they use the same genome, gRNA, Cas and method, they will be aggregated into a single library.
sampleName | CellType | Genome | gRNA_name | gRNA_sequence | orientation | Cas | PAM_sequence | Cut_Offset | Protocole | index1 | index2 |
---|---|---|---|---|---|---|---|---|---|---|---|
VEGFAs1_HEK293T | HEK293T | GRCh38 | VEGFAs1 | GGGTGGGGGGAGTTTGCTCC | positive | Cas9 | NGG | -4 | guideseq | ATCGATCG | AATTCCAA |
VEGFAs1_HEK293T | HEK293T | GRCh38 | VEGFAs1 | GGGTGGGGGGAGTTTGCTCC | negative | Cas9 | NGG | -4 | guideseq | AATTCCAA | ATCGATCG |
Demultiplexed libraries
To be documented
Prepare configuration file
The configuration file is a YAML-formatted file of key-value pairs used to fine-tune the pipeline behavior. Settings apply to all samples of the run.
An example configuration file is provided in ./test/guideSeq.conf.yaml.
Metadata
Author and affiliation will be printed on the final report.
author: "Guillaume CORRE, PhD"
affiliation: "Therapeutic Gene Editing - GENETHON, INSERM U951, Paris-Saclay University, Evry, France"
version: 0.1
Path to Sample Data Sheet and read files
The SDS path is relative to the current folder (from where the pipeline is started); an absolute path can also be used.
## Library information
sampleInfo_path: "sampleInfo.csv"
read_path: "/media/Data/common/guideseq_gnt_dev/margaux_5"
R1: "Undetermined_S0_L001_R1_001.fastq.gz"
R2: "Undetermined_S0_L001_R2_001.fastq.gz"
I1: "Undetermined_S0_L001_I1_001.fastq.gz"
I2: "Undetermined_S0_L001_I2_001.fastq.gz"
Reference genome
Genome name is user-defined but must be referenced using the exact same name in the Sample Data Sheet.
'index' corresponds to the prefix used when the bowtie2 index was built.
## path to references
genome:
GRCh37:
fasta: "/PATH_TO_REFERENCE/GRCh37/Sequences/GRCh37.primary_assembly.genome.fa"
index: "/PATH_TO_REFERENCE/GRCh37/Indexes/Bowtie2/GRCh37.primary_assembly.genome"
annotation: "/PATH_TO_REFERENCE/GRCh37/Annotations/gencode.v19.annotation.gtf.gz"
GRCh38:
fasta: "/PATH_TO_REFERENCE/GRCh38/Sequences/GRCh38.primary_assembly.genome.fa"
index: "/PATH_TO_REFERENCE/GRch38/Indexes/Bowtie2/GRCh38.primary_assembly.genome"
annotation: "/PATH_TO_REFERENCE/GRCh38/Annotations/gencode.v46.annotation.gtf.gz"
GRCm39:
fasta: "/PATH_TO_REFERENCE/GRCm39/Sequences/GRCm39.primary_assembly.genome.fa"
index: "/PATH_TO_REFERENCE/GRCm39/Indexes/Bowtie2/GRCm39.primary_assembly.genome"
annotation: "/PATH_TO_REFERENCE/GRCm39/Annotations/gencode.vM36.annotation.gtf.gz"
Reads filtering
After adaptor & ODN trimming and before alignment to the reference genome, reads that are too short are discarded. A read pair is discarded if either mate is shorter than this threshold.
################################################
minLength: 25 ## Minimal read length after trimming, before alignment
################################################
Alignment to reference genome
Define which aligner to use and the range of fragment size.
## Alignment
################################################
aligner: "bowtie2" ## Aligner to use (bowtie2 only for now)
minFragLength: 100 # Minimal fragment length after alignment
maxFragLength: 1500 # Maximal fragment length after alignment
################################################
Insertion sites calling
After alignment to the genome, reads are collapsed to single insertion points and aggregated if they cluster within a distance smaller than the ISbinWindow defined here. Filters can be applied to exclude UMIs supported by few reads (minReadsPerUMI) or insertion sites supported by few UMIs (minUMIPerIS).
Here, you can also define whether to tolerate bulges in the alignment between the gRNA and gDNA.
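For example, with ISbinWindow: 100, two insertion sites detected 60 bp apart on the same chromosome will be reported as a single cluster.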
################################################
## Off targets calling
tolerate_bulges: "FALSE" # whether to include gaps in the gRNA alignment (this will change the gap penalty during SW pairwise alignment)
max_edits_crRNA: 6 # keep clusters with at most n edits in the crRNA sequence (edits = substitutions + INDELs)
ISbinWindow: 100 # insertion sites closer than 'ISbinWindow' will be clustered together
minReadsPerUMI: 0 # 0 to keep all UMIs, otherwise min number of reads per UMIs
minUMIPerIS: 0 # 0 to keep all IS, otherwise min number of UMI per IS
slopSize: 50 # window size (bp) around IS (both directions) to identify gRNA sequence (ie 50bp = -50bp to +50bp)
################################################
UMI correction
Due to potential sequencing errors, additionnal UMIs may be detected and a correction step is required.
For each cluster of Off Targets, a similarity matrix between all UMIs detected is calculated and similar UMI collapsed together if the editing distance is smaller than UMI_hamming_distance
. The Adjacency method described in UMI-tools (@Smith2017) is used by default (see https://umi-tools.readthedocs.io/en/latest/the_methods.html for details).
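For example, with UMI_hamming_distance: 1, the UMIs AATTCGTT (120 reads) and AATTCGTA (3 reads) supporting the same insertion site differ by a single base and will be collapsed into one corrected UMI (the values here are purely illustrative).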
################################################
# post alignment
minMAPQ: 1 # Min MAPQ score to keep alignments -> !!! multi-hits have a MAPQ close to 1; a value greater than 1 will discard off-targets that share the exact same sequence.
UMI_hamming_distance: 1 # min distance to cluster UMI using network-based deduplication, use [0] to keep raw UMIs
UMI_deduplication: "Adjacency" # method to correct UMI (cluster or Adjacency)
UMI_pattern: "NNWNNWNN"
UMI_filter: "FALSE" # If TRUE, remove UMIs that do not match the expected pattern [FALSE or TRUE]
################################################
Reporting
################################################
# reporting
max_clusters: 100 # max number of cluster alignments to report
minUMI_alignments_figure: 1 # only clusters with more than n UMIs are drawn in the report alignment figure (set to 0 to keep all clusters -> can be slow)
################################################
Prediction
# Prediction
################################################
SWoffFinder:
path: "/opt/SWOffinder" ## Path to SWoffinder on your server (downloaded from https://github.com/OrensteinLab/SWOffinder)
maxE: 6 # Max edits allowed (integer).
maxM: 6 # Max mismatches allowed without bulges (integer).
maxMB: 6 # Max mismatches allowed with bulges (integer).
maxB: 3 # Max bulges allowed (integer).
window_size: 100
min_predicted_distance: 100 # distance between cut site and predicted cut site to consider as predicted
################################################
Reads adapter & ODN trimming sequences
Indicate which sequences will be trimmed from the R1 & R2 read ends, depending on the PCR orientation and the ODN used.
################################################
# Sequences for the trimming steps
guideseq:
positive:
R1_trailing: "GTTTAATTGAGTTGTCATATGT"
R2_leading: "ACATATGACAACTCAATTAAAC"
R2_trailing: "AGATCGGAAGAGCGTCGTGT"
negative:
R1_trailing: "ATACCGTTATTAACATATGACAACTCAA"
R2_leading: "TTGAGTTGTCATATGTTAATAACGGTAT"
R2_trailing: "AGATCGGAAGAGCGTCGTGT"
iguideseq:
positive:
R2_leading: "ACATATGACAACTCAATTAAACGCGAGC"
R2_trailing: "AGATCGGAAGAGCGTCGTGT"
R1_trailing: "GCTCGCGTTTAATTGAGTTGTCATATGT"
negative:
R1_trailing: "TCGCGTATACCGTTATTAACATATGACAACTCAA"
R2_leading: "TTGAGTTGTCATATGTTAATAACGGTATACGCGA"
R2_trailing: "AGATCGGAAGAGCGTCGTGT"
################################################
Folder structure
In order to start a run:

- Create a new directory in the installation folder and cd into it. Then:
  - move the sample data sheet
  - move the configuration file
  - move the Illumina sequencing undetermined_R1/R2/I1/I2.fastq.gz files (undemultiplexed)

Input fastq files should respect the following structure from the original paper:

- R1: contains the fragment sequence, starting in gDNA
- R2: starts with the ODN sequence, followed by gDNA sequence and a potential adaptor sequence
- I1: contains barcode 1 (usually 8 nucleotides)
- I2: contains barcode 2 and the UMI sequence (usually 8 + 8 nucleotides)
Your folder architecture should look similar to :
Path/to/guideseq/
├── 00-pipeline/
├── 01-envs/
├── 02-resources/
├── test/
└── My/folder/
    ├── guideSeq_GNT.yml
    ├── sampleInfo.csv
    ├── undetermined_R1.fastq.gz
    ├── undetermined_R2.fastq.gz
    ├── undetermined_I1.fastq.gz
    └── undetermined_I2.fastq.gz
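A minimal setup sketch (folder and file names are examples):

```bash
cd Path/to/guideseq/
mkdir My_run && cd My_run
## copy (or move) the configuration file, the sample data sheet and the undemultiplexed reads
cp /path/to/guideSeq_GNT.yml /path/to/sampleInfo.csv .
cp /path/to/run_folder/Undetermined_S0_L001_*_001.fastq.gz .
```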
From inside your analysis folder, run the command below after adjusting the number of CPUs (-j):
snakemake -s ../00-pipeline/guideseq_gnt.smk \
    -j 24 \
    -k \
    --use-conda \
    -n

## -j : number of threads to use
## -k : keep running independent jobs when a rule fails
## --use-conda : use the conda environments
## -n : dry-run, prints the rules without running them; remove this argument if no error is reported
Each rule uses a maximum of 6 threads (alignment, trimming) to speed up data processing. Set -j to a multiple of 6 to process 2 (12 threads), 3 (18 threads), ... samples in parallel.
Output
Workflow finished, no error
 __________________
< You rock baby !! >
 ------------------
        \   ^__^
         \  (♥♥)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
Upon pipeline completion, your folder should now look like:
.
├── 00-demultiplexing
│ ├── demultiplexing_R1.log
├── 01-trimming
│ ├── VEGFAs1.odn.log
│ ├── VEGFAs1.trailing.log
├── 02-filtering
│ ├── VEGFAs1.filter.log
│ ├── VEGFAs1_R1.UMI.ODN.trimmed.filtered.fastq.gz
│ ├── VEGFAs1_R2.UMI.ODN.trimmed.filtered.fastq.gz
├── 03-align
│ ├── VEGFAs1_multi.txt
│ ├── VEGFAs1_R1.UMI.ODN.trimmed.unmapped.fastq.gz
│ ├── VEGFAs1_R2.UMI.ODN.trimmed.unmapped.fastq.gz
│ ├── VEGFAs1.UMI.ODN.trimmed.filtered.align.log
│ ├── VEGFAs1.UMI.ODN.trimmed.filtered.sorted.filtered.bam
│ ├── VEGFAs1.UMI.ODN.trimmed.filtered.sorted.filtered.bam.bai
├── 04-IScalling
│ ├── VEGFAs1.cluster_slop.bed
│ ├── VEGFAs1.cluster_slop.fa
│ ├── VEGFAs1.reads_per_UMI_per_IS.bed
│ ├── VEGFAs1.reads_per_UMI_per_IS_corrected.bed
│ ├── VEGFAs1.UMIs_per_IS_in_Cluster.bed
├── 05-Report
│ ├── VEGFAs1.rdata
│ ├── VEGFAs1.stat
│ ├── VEGFAs1_summary.tsv
│ ├── VEGFAs1_summary.xlsx
│ ├── report-files
│ │ ├── VEGFAs1_offtargets_dynamic_files/
│ │ ├── VEGFAs1_offtargets_dynamic.html
│ │ ├── VEGFAs1_offtargets.html
│ ├── report.html
│ └── report.rdata
├── 06-offPredict
│ ├── GRCh38_GGGTGGGGGGAGTTTGCTCCNGG.csv
│ └── GRCh38_GGGTGGGGGGAGTTTGCTCCNGG.txt
├── guideSeq_GNT.yml
├── sampleInfo.csv
├── undetermined_R1.fastq.gz
├── undetermined_R2.fastq.gz
├── undetermined_I1.fastq.gz
└── undetermined_I2.fastq.gz
Report
A general report is generated. It summarizes the main QC metrics and key results of the run using graphical representations and tables.
Off-targets files
For each sample, an Excel file with the complete off-target (OT) information is generated. This file has many columns, among which several are of particular interest:
Pipeline step-by-step explanations
Demultiplexing
Tool: cutadapt

Undetermined fastq files are demultiplexed into per-sample (sampleName) fastq files according to the barcodes present in the sample data sheet.

- Barcode1 and barcode2 sequences are concatenated to build the demultiplexing index.
- I1 and I2 fastq files are concatenated into a single fastq file (I3).
- R1 and R2 fastq files are demultiplexed according to the I1+I2 sequence present in the I3 fastq file.
- The UMI sequence is added to the R1 and R2 read names for later UMI quantification.
  - The UMI is extracted from the first nucleotides of I3 according to the length of the UMI_pattern variable in the configuration file.
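For example, after demultiplexing, a hypothetical R1 read name such as @M01234:123:1:1101:15589:1331 becomes @M01234:123:1:1101:15589:1331_AATTCGTT, where the suffix after the underscore is the extracted UMI (the exact naming scheme shown here is illustrative).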
ODN trimming:
Tool: cutadapt

The leading ODN sequence is removed from R2 reads according to the method and PCR orientation defined in the sample data sheet for each sample.
Reads that do not start with the ODN sequence are discarded.
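A conceptual sketch of this step, using the guideseq "positive" R2_leading sequence from the configuration file and hypothetical file names (the pipeline's actual cutadapt call may differ):

```bash
## Require R2 to start with the ODN leading sequence; discard pairs where it is absent
cutadapt \
    -G "^ACATATGACAACTCAATTAAAC" \
    --discard-untrimmed \
    -o sample_R1.odn.fastq.gz -p sample_R2.odn.fastq.gz \
    sample_R1.fastq.gz sample_R2.fastq.gz
```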
Adaptor trimming:
Tool: cutadapt

The ODN and adaptor trailing sequences are trimmed from the R1 and R2 reads respectively, when present. If the DNA fragment is long compared to the R1/R2 read lengths, those sequences may not be present.
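A conceptual sketch, again with the guideseq "positive" trailing sequences from the configuration file and hypothetical file names:

```bash
## Trim the trailing ODN from R1 (-a) and the trailing sequencing adaptor from R2 (-A), when present
cutadapt \
    -a "GTTTAATTGAGTTGTCATATGT" \
    -A "AGATCGGAAGAGCGTCGTGT" \
    -o sample_R1.trimmed.fastq.gz -p sample_R2.trimmed.fastq.gz \
    sample_R1.odn.fastq.gz sample_R2.odn.fastq.gz
```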
Read filtering:
Tool: cutadapt

After trimming, only read pairs with both mates longer than the minLength defined in the configuration file are selected for alignment.
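Conceptually (with hypothetical file names), using the default minLength of 25 from the configuration file:

```bash
## Discard a pair when either mate is shorter than minLength after trimming
cutadapt \
    --minimum-length 25 --pair-filter=any \
    -o sample_R1.filtered.fastq.gz -p sample_R2.filtered.fastq.gz \
    sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz
```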
Genome alignment:
Reference sequences are specified in the configuration file. The genome index is built if it does not already exist.
## path to references
genome:
human:
fasta: "/media/References/Human/Genecode/GRch38/Sequences/GRCh38.primary_assembly.genome.fa"
index: "/media/References/Human/Genecode/GRch38/Indexes/Bowtie2/GRCh38.primary_assembly.genome" # path to index created during the run if not existing yet
annotation: "/media/References/Human/Genecode/GRch38/Annotations/gencode.v46.annotation.gtf.gz"
Trimmed read pairs that pass the filtering steps are then aligned to the reference genome specified in the sample data sheet for each sample, using the aligner specified in the configuration file.
## Alignment
################################################
aligner: "bowtie2" ## Aligner to use (bowtie2 or bwa)
minFragLength: 100 # Minimal fragment length after alignment
maxFragLength: 1500 # Maximal fragment length after alignment
################################################
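A conceptual sketch of the alignment step (not the pipeline's exact call), using the fragment-size bounds from the configuration file, an example index path and hypothetical file names:

```bash
bowtie2 -p 6 \
    -x /PATH_TO_REFERENCE/GRCh38/Indexes/Bowtie2/GRCh38.primary_assembly.genome \
    -1 sample_R1.filtered.fastq.gz -2 sample_R2.filtered.fastq.gz \
    --minins 100 --maxins 1500 |
    samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
```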
Multi-hit alignments with a low MAPQ score are discarded, except when the alignment score (AS tag) is equal to the second-best alignment score (XS tag).
Multi-hits management:
Multi-hits are reads with multiple equally good alignment positions in the genome. We choose to keep a single, randomly selected alignment for each read instead of reporting all possible positions. In the long run, reads arising from the same cutting site but mapping ambiguously are distributed randomly across all equivalent positions, so each candidate position receives only a fraction of that signal rather than the full read count being duplicated at every position.
Cutting site calling:
Following alignment to the reference genome, nuclease cutting sites are extracted from the start position of the R2 read alignment.
Reads that align at the same cutting site with the same UMI are aggregated, keeping track of the total read count.
UMI correction:
UMI sequences are corrected for potential sequencing errors using the parameters defined in the configuration file:
UMI_hamming_distance: 1 # min distance to cluster UMI using network-based deduplication, use [0] to keep raw UMIs
UMI_deduplication: "Adjacency" # method to correct UMI (cluster or Adjacency)
UMI_pattern: "NNWNNWNN"
UMI_filter: "FALSE" # If TRUE, remove UMIs that do not match the expected pattern [FALSE or TRUE]
Cut site clustering:
Cut sites that fall within the same window of ISbinWindow bp, defined in the configuration file, are clustered together.
gRNA match:
For each cluster of cutting sites, the gRNA sequence defined in the sample data sheet for each sample is searched within a window of +/- slopSize bp using the Smith-Waterman algorithm.
Gaps can be tolerated, to detect bulges in the gDNA or gRNA, if the tolerate_bulges variable is set.
Annotation of clusters:
Clusters of cutting sites are annotated using the GTF file specified in the configuration file for each organism.
A first step prepares the annotation file by extracting only gene and exon features.
A second step annotates clusters with the overlapping gene(s) and the position relative to those genes (exon/intron). Multiple annotations may be reported for each cluster.
Off target prediction:
For each gRNA sequence and each organism specified in the sample data sheet, off-target sites are predicted using the SWOffinder tool with the parameters defined in the configuration file:
# Prediction
################################################
SWoffFinder:
path: "/opt/SWOffinder" ## Path to SWoffinder on your server (downloaded from https://github.com/OrensteinLab/SWOffinder)
maxE: 6 # Max edits allowed (integer).
maxM: 6 # Max mismatches allowed without bulges (integer).
maxMB: 6 # Max mismatches allowed with bulges (integer).
maxB: 3 # Max bulges allowed (integer).
window_size: 100
################################################
Reporting:
A short report is generated with main tables and graphical representations to better understand pipeline results.