scRAFIKI usage - Mye-InfoBank/scRAFIKI GitHub Wiki

Before starting this section, please make sure that your data is preprocessed as described in the Dataset preparation section of this wiki.

Pipeline execution

Run the pipeline with the following command, replacing <YOUR_PROFILE> with one of the profiles listed in the Profiles section below:

nextflow run Mye-InfoBank/scRAFIKI -resume -profile <YOUR_PROFILE> --samplesheet "samplesheet.csv"

Samplesheet

The samplesheet is a CSV file whose header row uses the following column names:

| Column | Description | Default |
| --- | --- | --- |
| id | The dataset identifier; must be unique across datasets | required |
| input_adata | Path to the input AnnData (.h5ad) file | required |
| min_counts | Minimum number of counts per cell | 0 |
| max_counts | Maximum number of counts per cell | Infinity |
| min_genes | Minimum number of genes per cell | 0 |
| max_genes | Maximum number of genes per cell | Infinity |
| max_pct_mito | Maximum percentage of mitochondrial genes per cell | 100 |
| no_symbols | Set to true to enable conversion of var_names to gene symbols. If enabled, Ensembl and Entrez IDs can be processed. | false |
| transfer | Perform integration using the transfer learning method scArches | false |
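
As an illustration of the layout described above, the following minimal sketch writes a two-row samplesheet with Python's csv module. The column names come from the table; the dataset identifiers, file paths, and threshold values are hypothetical placeholders. Only the columns in use are written, so the remaining ones are left to their defaults.

```python
import csv
from pathlib import Path

# Column names are taken from the samplesheet table; everything else
# (ids, paths, thresholds) is a hypothetical placeholder.
COLUMNS = ["id", "input_adata", "min_counts", "min_genes", "max_pct_mito"]

ROWS = [
    {"id": "dataset_a", "input_adata": "data/dataset_a.h5ad",
     "min_counts": 500, "min_genes": 200, "max_pct_mito": 20},
    {"id": "dataset_b", "input_adata": "data/dataset_b.h5ad",
     "min_counts": 1000, "min_genes": 300, "max_pct_mito": 15},
]


def write_samplesheet(rows, path="samplesheet.csv"):
    """Write the rows as a CSV samplesheet and return its path."""
    ids = [row["id"] for row in rows]
    if len(ids) != len(set(ids)):
        # The id column must be unique across datasets.
        raise ValueError("dataset ids must be unique")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return Path(path)


if __name__ == "__main__":
    print(write_samplesheet(ROWS).read_text())
```

The uniqueness check mirrors the requirement on the id column, so a duplicated identifier is caught before the pipeline is launched.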

Cell filtering will be handled by the pipeline if the respective columns are specified. If filtering has been done before, the columns can be omitted.
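
To make the meaning of the filtering columns concrete, here is a small sketch of how the thresholds could be read for a single cell. It is not the pipeline's implementation: the function name keep_cell is hypothetical, and treating the bounds as inclusive is an assumption. The defaults match the samplesheet table, so with no thresholds given every cell is kept.

```python
import math

# Defaults as listed in the samplesheet table.
DEFAULTS = {
    "min_counts": 0,
    "max_counts": math.inf,
    "min_genes": 0,
    "max_genes": math.inf,
    "max_pct_mito": 100,
}


def keep_cell(n_counts, n_genes, pct_mito, **thresholds):
    """Return True if a cell satisfies all thresholds.

    Assumption: bounds are inclusive; the pipeline may handle the
    boundary values differently.
    """
    t = {**DEFAULTS, **thresholds}
    return (
        t["min_counts"] <= n_counts <= t["max_counts"]
        and t["min_genes"] <= n_genes <= t["max_genes"]
        and pct_mito <= t["max_pct_mito"]
    )


if __name__ == "__main__":
    # A cell with 300 counts, 150 genes and 5% mitochondrial reads.
    print(keep_cell(300, 150, 5.0))                 # no thresholds set
    print(keep_cell(300, 150, 5.0, min_genes=200))  # fails min_genes
```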

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| samplesheet | Path to the samplesheet | required |
| outdir | Path to the output directory | ./output |
| celltypist_model | CellTypist model to use for annotation; a list of possible values is available from the CellTypist project. Make sure to include the .pkl extension, e.g. Cells_Intestinal_Tract.pkl. If the parameter is null, CellTypist will not be executed. | null |
| leiden_resolutions | List of resolutions for clustering | [0.25, 0.5, 0.75, 1, 1.5, 2] |
| integration_methods | List of integration methods to use | ["scvi", "scanvi", "harmony", "scgen", "scanorama", "bbknn", "desc", "combat", "trvaep"] |
| scshc | Run sc-SHC clustering and QC | false |
| entropy | Calculate entropy of clusterings compared to batches | false |
| cell_cycle | Run cell cycle analysis | true |
| benchmark_hvgs | Number of highly variable genes to use for scIB benchmarking. Benchmarking is only executed if a positive value is set. | 0 |
| integration_hvgs | Number of highly variable genes to use for integrations | 6000 |
| normalization_method | Method to use for normalizing count values. One of: raw (no normalization), scTransform, log_total, pearson_residuals | log_total |
| min_cells | Minimum number of cells in which a gene must be expressed to be included in the pipeline output | 10 |
| decontX | Perform ambient RNA removal using decontX | true |
| has_extended | Set to true if any of the datasets has the transfer option activated in the samplesheet | false |
| has_celltypes | Set to false if none of the input datasets has annotations in the cell_type metadata field | true |
| custom_metadata | Custom metadata fields to pass through the pipeline. If a field is not present in all datasets, Unknown is used as a filler. | [] |
| custom_hvgs | Genes to add manually to the automated selection of highly variable genes | [] |
| publish_mode | Defines whether the output files are copied (copy) or symlinked (symlink) | symlink |
| max_cpus | Maximum number of CPUs to use | 60 |
| max_memory | Maximum amount of memory to use | 300.GB |
| max_time | Maximum runtime of a job | 48.h |

Parameters can be set as described in the Nextflow documentation.

Note that the integration methods scvi and scanvi will be executed regardless of the specified integration methods, as they are required for some downstream analysis steps.
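
Nextflow accepts pipeline parameters either as --name value on the command line or collected in a JSON or YAML file passed with -params-file. The sketch below writes such a JSON file. The parameter names come from the table above; the chosen values, the file names, and the helper functions are hypothetical. It also derives has_extended from the samplesheet, assuming the transfer column holds the literal text true or false.

```python
import csv
import json
from pathlib import Path


def any_transfer(samplesheet_path):
    """Return True if any samplesheet row has transfer set to true.

    Assumption: the value is written as the text "true"
    (case-insensitive); a missing column counts as false.
    """
    with open(samplesheet_path, newline="") as handle:
        return any(
            (row.get("transfer") or "").strip().lower() == "true"
            for row in csv.DictReader(handle)
        )


def write_params(samplesheet_path, params_path="params.json"):
    """Write a params file for use with -params-file and return the dict."""
    params = {
        "samplesheet": str(samplesheet_path),
        "outdir": "./output",
        "leiden_resolutions": [0.5, 1],
        # scvi and scanvi run regardless of this list (see the note above).
        "integration_methods": ["scvi", "scanvi", "harmony"],
        "benchmark_hvgs": 2000,  # positive value enables benchmarking
        "has_extended": any_transfer(samplesheet_path),
    }
    Path(params_path).write_text(json.dumps(params, indent=2) + "\n")
    return params


if __name__ == "__main__" and Path("samplesheet.csv").exists():
    write_params("samplesheet.csv")
    print("nextflow run Mye-InfoBank/scRAFIKI -resume "
          "-profile docker -params-file params.json")
```

Keeping the settings in a file makes a run easier to repeat than a long command line, and deriving has_extended from the samplesheet avoids the two getting out of sync.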

Profiles

The following profiles are available:

  • standard: Run the pipeline on your local machine without containerization, requires all dependencies to be installed (not recommended)
  • docker (if running on an ARM-based machine, use docker,arm)
  • singularity
  • podman
  • shifter
  • charliecloud
  • apptainer
  • gitpod
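
The ARM hint for the docker profile can be automated. The following sketch picks docker,arm when Python reports an ARM machine type and assembles the launch command from the Pipeline execution section. The helper names are hypothetical, and the check on platform.machine() is a heuristic rather than something the pipeline prescribes.

```python
import platform
import shlex


def docker_profile(machine=None):
    """Return "docker,arm" on ARM machines and "docker" otherwise.

    Heuristic: platform.machine() typically reports "arm64" or
    "aarch64" on ARM-based systems.
    """
    machine = (machine or platform.machine()).lower()
    is_arm = machine.startswith(("arm", "aarch64"))
    return "docker,arm" if is_arm else "docker"


def build_command(profile, samplesheet="samplesheet.csv"):
    """Assemble the nextflow command as an argument list."""
    return [
        "nextflow", "run", "Mye-InfoBank/scRAFIKI",
        "-resume",
        "-profile", profile,
        "--samplesheet", samplesheet,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command(docker_profile())))
```

The argument list can be passed directly to subprocess.run once the printed command looks right.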