scRAFIKI usage - Mye-InfoBank/scRAFIKI GitHub Wiki

Before starting this section, please make sure that your data is preprocessed as described in the Dataset preparation section of this wiki.

Pipeline execution

The pipeline can be used with the following command:

nextflow run Mye-InfoBank/scRAFIKI -resume -profile <YOUR_PROFILE> --samplesheet "samplesheet.csv"

Samplesheet

The samplesheet is a CSV file whose header row uses the following column names:

Column	Description	Default
`id`	The dataset identifier, needs to be unique across datasets	required
`input_adata`	Path to the input AnnData (`.h5ad`) file	required
`min_counts`	Minimum number of counts per cell	`0`
`max_counts`	Maximum number of counts per cell	`Infinity`
`min_genes`	Minimum number of genes per cell	`0`
`max_genes`	Maximum number of genes per cell	`Infinity`
`max_pct_mito`	Maximum percentage of mitochondrial genes per cell	`100`
`no_symbols`	Set to `true` to enable conversion of `var_names` to gene symbols. If enabled, Ensembl and Entrez IDs can be processed.	`false`
`transfer`	Perform integration using the transfer learning method `scArches`	`false`

Cell filtering will be handled by the pipeline if the respective columns are specified. If filtering has been done before, the columns can be omitted.
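
As an illustration, the shell snippet below writes a small samplesheet with two datasets. The dataset identifiers, file paths, and threshold values are hypothetical placeholders; only the column names are taken from the table above. Columns that are left out (here `min_counts`, `max_counts`, `max_genes`, and `no_symbols`) are expected to fall back to their defaults.

```shell
# Write an example samplesheet (ids, paths and thresholds are made-up placeholders).
cat > samplesheet.csv <<'EOF'
id,input_adata,min_genes,max_pct_mito,transfer
dataset_a,data/dataset_a.h5ad,200,20,false
dataset_b,data/dataset_b.h5ad,500,10,false
EOF

# Quick sanity check: show the header and count the dataset rows.
head -n 1 samplesheet.csv
echo "datasets: $(($(wc -l < samplesheet.csv) - 1))"
```

Each `id` must be unique, and each `input_adata` path must point to an existing `.h5ad` file before the pipeline is launched.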

Parameters

Parameter	Description	Default
`samplesheet`	Path to the samplesheet	required
`outdir`	Path to the output directory	`./output`
`celltypist_model`	CellTypist model to use for annotation; a list of possible values can be found here. Make sure to add the `.pkl` extension, e.g. `Cells_Intestinal_Tract.pkl`. If the parameter is `null`, CellTypist will not be executed.	`null`
`leiden_resolutions`	List of resolutions for clustering	`[0.25, 0.5, 0.75, 1, 1.5, 2]`
`integration_methods`	List of integration methods to use	`["scvi", "scanvi", "harmony", "scgen", "scanorama", "bbknn", "desc", "combat", "trvaep"]`
`scshc`	Run sc-SHC clustering and QC	`false`
`entropy`	Calculate entropy of clusterings compared to batches	`false`
`cell_cycle`	Run cell cycle analysis	`true`
`benchmark_hvgs`	Number of highly variable genes to use for scIB benchmarking. Benchmarking will only be executed if a positive value is set.	`0`
`integration_hvgs`	Number of highly variable genes to use for integrations.	`6000`
`normalization_method`	Method to use for normalizing count values. One of: `raw` (no normalization), `scTransform`, `log_total`, `pearson_residuals`	`log_total`
`min_cells`	Minimum number of cells that a gene needs to be expressed in so that it will be included in the pipeline output	`10`
`decontX`	Perform ambient RNA removal using decontX	`true`
`has_extended`	Set to `true` if any of the datasets has the `transfer` option activated in the samplesheet	`false`
`has_celltypes`	Set to `false` if none of the input datasets have annotations in the `cell_type` metadata field	`true`
`custom_metadata`	Custom metadata fields that should be passed through the pipeline - if not present in all datasets, `Unknown` will be used as a filler	`[]`
`custom_hvgs`	Can be used to manually add genes to the automated selection of highly variable genes	`[]`
`publish_mode`	Defines if the output files are copied (`copy`) or symlinked (`symlink`)	`symlink`
`max_cpus`	Maximum number of CPUs to use	`60`
`max_memory`	Maximum amount of memory to use	`300.GB`
`max_time`	Maximum runtime of a job	`48.h`

Parameters can be set as described in the Nextflow documentation.

Note that the integration methods `scvi` and `scanvi` will be executed regardless of the specified integration methods, as they are required for some downstream analysis steps.
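
One way to keep several parameters together is a params file passed with Nextflow's standard `-params-file` option. The sketch below assumes that the pipeline accepts the parameters from the table above as ordinary Nextflow `params`; the file name `params.yml` and all chosen values are illustrative only. The final command is printed rather than executed, so it can be reviewed first.

```shell
# Write an example params file (values are illustrative, not recommendations).
cat > params.yml <<'EOF'
samplesheet: "samplesheet.csv"
outdir: "./output"
integration_methods: ["scvi", "scanvi", "harmony"]
leiden_resolutions: [0.5, 1]
benchmark_hvgs: 2000
publish_mode: "copy"
EOF

# Preview the launch command; remove the leading `echo` to start the run.
echo nextflow run Mye-InfoBank/scRAFIKI -resume -profile docker -params-file params.yml
```

Single values can also be given directly on the command line with a double dash, in the same way as `--samplesheet`, for example `--outdir ./results`.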

Profiles

The following profiles are available:

standard: Run the pipeline on your local machine without containerization, requires all dependencies to be installed (not recommended)
docker (if running on an ARM-based machine, use `docker,arm`)
singularity
podman
shifter
charliecloud
apptainer
gitpod
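
For Docker users, the choice between `docker` and `docker,arm` can be scripted. The following is a minimal sketch that assumes `uname -m` reports `arm64`, `aarch64`, or another `arm*` value on ARM-based machines; the variable name `PROFILE` is a placeholder. The launch command is printed instead of executed.

```shell
# Select the Docker profile, adding the arm profile on ARM-based machines.
case "$(uname -m)" in
  arm*|aarch64) PROFILE="docker,arm" ;;
  *)            PROFILE="docker" ;;
esac
echo "Using profile: ${PROFILE}"

# Preview the launch command; remove the leading `echo` to start the run.
echo nextflow run Mye-InfoBank/scRAFIKI -resume -profile "${PROFILE}" --samplesheet "samplesheet.csv"
```

Multiple profiles are combined with a comma and no spaces, as in `docker,arm`.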