WIKI single‐cell‐DNA - gustaveroussy/single-cell-DNA GitHub Wiki

Welcome to the single-cell-DNA wiki!

Pipeline Goal:

Run single-cell DNA-seq analysis of Mission Bio Tapestri data, from FASTQ files to the final figures.

Steps available:

Alignment
Preprocessing (filtering of bad-quality variants, CNVs and cells)
SNV_CNV (normalization, dimension reduction and clustering)
PROTEIN (normalization, dimension reduction and clustering)
ALL (combining DNA-seq analysis & proteomic analysis)
Phylogeny (reconstruction of mutation events)

Usage

Usage on Flamingo, Gustave Roussy's computing cluster

:heavy_exclamation_mark: If you have already used the single-cell RNA-seq pipeline, the usage is identical.

Create the parameter file according to your needs (see the Configuration section below).
Set the path_to_configfile variable to the path of this file.
Run the snakemake command.

module load singularity

path_to_configfile="<path/to/your_configfile.yaml>"
path_to_pipeline="<path/to/single-cell-dna-seq>"

snakemake --profile ${path_to_pipeline}/profiles/local -s ${path_to_pipeline}/Snakefile --configfile ${path_to_configfile}

Configuration

1. steps & alignment: choose the steps to run

name	description	example	default value	possible value
steps	steps to run	[Aligment,preprocessing,SNV_CNV,ALL,phylogeny]	NA	Aligment,preprocessing,SNV_CNV,PROTEIN,ALL,phylogeny
tmp	temporary directory	/tmp	NA	NA
sample	sample(s) to run	[sample_1,sample_2]	NA	NA
reference_genome_path	path of the reference genome	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/hg19/ucsc_hg19.fa"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/hg19/ucsc_hg19.fa"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/hg19/ucsc_hg19.fa","/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v3/hg19/ucsc_hg19.fa"
reference_genome	reference genome release	"hg19"	"hg19"	"hg19"
type_analysis	type of analysis to run: dna or dna+protein	"dna+protein"	NA	"dna","dna+protein"
panel_path	path of your panel of variants	"</your/path/to/panel/file/location>"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/panels/Myeloid"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/panels/Myeloid","<your/path/panel/location>"
panel_protein_path	path of the reference fasta for protein	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/panels/protein"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/panels/protein"	"/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v2/panels/protein/","/mnt/beegfs/pipelines/single-cell_dna/tapestri_database/v3/panels/protein/"
design_file	path to the design file used to create the YAML file for alignment	"/your/path/to/panel/file/location"	NA	NA
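
As an illustration of the table above, here is a minimal sketch that checks a few section 1 values before launching the pipeline. The helper `check_config` is hypothetical and not part of the pipeline, and the sketch assumes these parameters are top-level keys of the config file. Step names are copied verbatim from the "possible value" column.

```python
# Step names copied verbatim (including spelling) from the "possible value"
# column of the `steps` row above.
ALLOWED_STEPS = ["Aligment", "preprocessing", "SNV_CNV", "PROTEIN", "ALL", "phylogeny"]
ALLOWED_ANALYSIS = ["dna", "dna+protein"]


def check_config(config):
    """Hypothetical helper: return a list of human-readable problems."""
    problems = []
    for step in config.get("steps", []):
        if step not in ALLOWED_STEPS:
            problems.append(f"unknown step: {step}")
    if config.get("type_analysis") not in ALLOWED_ANALYSIS:
        problems.append(f"unknown type_analysis: {config.get('type_analysis')}")
    if not config.get("sample"):
        problems.append("no sample listed")
    return problems


# Values taken from the "example" column of the table
# (assumption: they are top-level keys of the config file).
example_config = {
    "steps": ["Aligment", "preprocessing", "SNV_CNV", "ALL", "phylogeny"],
    "tmp": "/tmp",
    "sample": ["sample_1", "sample_2"],
    "reference_genome": "hg19",
    "type_analysis": "dna+protein",
}

print(check_config(example_config))
```

An empty list means none of the three checks found a problem. The pipeline itself may perform different or stricter validation.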

2. filtering: filtering - remove bad-quality variants & preprocess your data

name	description	example	default value	possible value
filter_na	filter out variants with too many missing values	True	False	True/False
filter_na_percent	remove variants whose percentage of missing values is greater than or equal to this threshold	35	25	any integer
predict_missing_value	predict missing variant values with KNN	True	False	True/False
filtering_variants	set of filters that remove bad-quality variants & cells (see 2.1)	NA	NA	NA
max_vaf_percent	remove variants whose mean VAF is greater than or equal to this value	95	95	any integer
whitelist	variants that must be kept (even if their quality is poor)	["chr20:33868702:T/C"]	NA	NA
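
The effect of filter_na_percent can be sketched on a toy genotype table. This is an illustration only, not the pipeline's code. It assumes that the percentage is computed per variant across cells, and that a missing value is a no-call coded as NGT = 3, as in the filtering_variants table. The function name and the toy data are hypothetical.

```python
MISSING = 3  # NGT code for a no-call (assumption, see lead-in)


def variants_passing_na_filter(ngt_by_variant, filter_na_percent=25):
    """Keep variants whose percentage of missing calls is below the threshold."""
    kept = []
    for name, calls in ngt_by_variant.items():
        pct_missing = 100.0 * sum(1 for c in calls if c == MISSING) / len(calls)
        # "greater than or equal to the threshold" means removed
        if pct_missing < filter_na_percent:
            kept.append(name)
    return kept


# Hypothetical variants, one genotype call per cell (4 cells).
toy = {
    "var_a": [0, 1, 3, 0],  # 25% missing
    "var_b": [0, 1, 1, 0],  # 0% missing
    "var_c": [3, 3, 3, 1],  # 75% missing
}

print(variants_passing_na_filter(toy, filter_na_percent=25))
print(variants_passing_na_filter(toy, filter_na_percent=35))
```

With the default threshold of 25, var_a sits exactly on the threshold and is removed. Raising the threshold to 35 keeps it.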

2.1. filtering: filtering_variants

name	description	example	default value	possible value
min_dp	The minimum depth (DP) for the call to be considered	10	10	any integer
min_gq	The minimum genotype quality (GQ) for the call to be considered	30	30	any integer
vaf_ref	All reference calls (NGT = 0) with VAF > vaf_ref are converted to no calls (NGT = 3)	5	5	any integer
vaf_het	All heterozygous calls (NGT = 1) with VAF < vaf_het are converted to no calls (NGT = 3)	35	35	any integer
vaf_hom	All homozygous calls (NGT = 2) with VAF < vaf_hom are converted to no calls (NGT = 3)	95	95	any integer
min_mut_prct_cells	The minimum percent of the total cells in which the variant should be mutated	1	1	any integer
min_prct_cells	The minimum percent of total cells in which the variant should be present	50	50	any integer
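
The rules in this table can be sketched for a single call and a single variant. This is a simplified illustration with hypothetical function names, not the pipeline's implementation. It assumes that a call failing min_dp or min_gq becomes a no-call, that a cell counts as "mutated" when NGT is 1 or 2 and as "present" when NGT is not 3, and that both cell thresholds are inclusive.

```python
NO_CALL = 3  # NGT code for a no-call, as stated in the table


def filter_call(ngt, dp, gq, vaf, min_dp=10, min_gq=30,
                vaf_ref=5, vaf_het=35, vaf_hom=95):
    """Return the genotype after applying the per-call thresholds."""
    if dp < min_dp or gq < min_gq:
        return NO_CALL  # assumption: low-quality calls become no-calls
    if ngt == 0 and vaf > vaf_ref:
        return NO_CALL  # reference call with too much alternate signal
    if ngt == 1 and vaf < vaf_het:
        return NO_CALL  # heterozygous call with too little alternate signal
    if ngt == 2 and vaf < vaf_hom:
        return NO_CALL  # homozygous call with too little alternate signal
    return ngt


def keep_variant(calls, min_mut_prct_cells=1, min_prct_cells=50):
    """Apply the two per-variant cell thresholds to filtered calls."""
    n_cells = len(calls)
    present = 100.0 * sum(1 for c in calls if c != NO_CALL) / n_cells
    mutated = 100.0 * sum(1 for c in calls if c in (1, 2)) / n_cells
    return present >= min_prct_cells and mutated >= min_mut_prct_cells


# Hypothetical calls for one variant: (NGT, DP, GQ, VAF in percent) per cell.
raw = [(1, 20, 40, 50), (1, 20, 40, 30), (0, 20, 40, 2), (2, 5, 99, 100)]
filtered = [filter_call(*call) for call in raw]
print(filtered, keep_variant(filtered))
```

In the toy data, the second call fails vaf_het and the fourth fails min_dp, so both become no-calls. The variant is still kept because half of the cells have a call and one of them is mutated.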

3. SNV: snv_norm_dimred

name	description	example	default value	possible value
method_dimred	dimension reduction method for the variant matrix	pca	pca	fa,pca
max_dims	maximum dimensions for the dimension reduction	6	6	any integer
clustering_method	clustering method to use	leiden	dbscan	graph-community,leiden,dbscan,hdbscan

3. CNV: cnv_norm_dimred

name	description	example	default value	possible value
max_dims	maximum dimensions for the dimension reduction	6	6	any integer
clustering_method	clustering method to use	leiden	dbscan	graph-community,leiden,dbscan,hdbscan

4. PROTEIN: prot_norm_dimred

name	description	example	default value	possible value
normalization	normalization method to correct noise	DSB	CLR	CLR,DSB,asinh,NSP
clustering_method	clustering method to use	leiden	dbscan	graph-community,leiden,dbscan,hdbscan

5. ALL: all_norm_dimred

name	description	example	default value	possible value
snv	SNV parameters to keep in order to combine multi-omics data	NA	NA	NA
cnv	CNV parameters to keep in order to combine multi-omics data	NA	NA	NA
prot	Protein parameters to keep in order to combine multi-omics data	NA	NA	NA
variants_of_interest	list of variants of interest used to label the data	["EIF6:20:33868702:T:C","TP53:17:7577559:G:T"]	NA	NA
chr_of_interest	list of chromosomes to focus on	["5","17","7"]	NA	any list of chromosome numbers

5.1. all_norm_dimred - snv

name	description	example	default value	possible value
method_dimred	reduction method to keep	pca	pca	fa,pca
dims	number of dimensions to keep	6	6	any integer
clustering_method	clustering method to keep	leiden	dbscan	graph-community,leiden,dbscan,hdbscan
res	resolution for clustering to keep	NA	NA	depends on the algorithm: leiden and dbscan take a float; graph-community and hdbscan take an integer
predict_missing_value	whether to predict missing values with the KNN method	True	False	True/False

5.2. all_norm_dimred - cnv

name	description	example	default value	possible value
method_dimred	reduction method to keep	pca	pca	pca
dims	number of dimensions to keep	6	6	any integer
clustering_method	clustering method to keep	leiden	dbscan	graph-community,leiden,dbscan,hdbscan
res	resolution for clustering to keep	NA	NA	depends on the algorithm: leiden and dbscan take a float; graph-community and hdbscan take an integer

5.3. all_norm_dimred - prot

name	description	example	default value	possible value
normalization	normalization method to keep	CLR	CLR	CLR,DSB,asinh,NSP
method_dimred	reduction method to keep	pca	pca	fa,pca
dims	number of dimensions to keep	6	6	any integer
clustering_method	clustering method to keep	leiden	dbscan	graph-community,leiden,dbscan,hdbscan
res	resolution for clustering to keep	NA	NA	depends on the algorithm: leiden and dbscan take a float; graph-community and hdbscan take an integer
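
The res rule shared by sections 5.1 to 5.3 can be checked with a small sketch. The helper `res_is_valid` is hypothetical. As an assumption, it accepts any real number where a float is expected, since a value written as `1` in a YAML file is read as an integer; the pipeline may be stricter.

```python
# Expected type of `res` per clustering method, from the tables above.
RES_TYPE = {
    "leiden": float,
    "dbscan": float,
    "graph-community": int,
    "hdbscan": int,
}


def res_is_valid(clustering_method, res):
    """Return True if `res` has a type compatible with the clustering method."""
    if isinstance(res, bool):
        return False  # bool is a subclass of int in Python, so reject it first
    if RES_TYPE[clustering_method] is float:
        # Assumption: integers are accepted where a float is expected.
        return isinstance(res, (int, float))
    return isinstance(res, int)


print(res_is_valid("leiden", 0.5))   # float for leiden
print(res_is_valid("hdbscan", 0.5))  # float where an integer is expected
print(res_is_valid("hdbscan", 10))   # integer for hdbscan
```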

6. phylogeny

name	description	example	default value	possible value
phylogeny_method	list of methods to use for reconstructing mutation events	["COMPASS","infSCITE"]	NA	COMPASS,infSCITE,BiTSC2

6.1 phylogeny - COMPASS

name	description	example	default value	possible value
bool_cnv	include CNVs in the reconstruction of mutation events	1	0	0,1

What's coming next?

infSCITE and BiTSC2 are not implemented yet, but they will be added soon.
The pipeline currently uses mosaic 2.4.1; it will be updated to 3.0.1.

Questions

Don't hesitate to contact the bioinformatics platform at [email protected] or [email protected] if you have any questions or suggestions.