scRAFIKI usage - Mye-InfoBank/scRAFIKI GitHub Wiki
Before starting this section, please make sure that your data is preprocessed as described in the Dataset preparation section of this wiki.
Pipeline execution
The pipeline can be used with the following command:
nextflow run Mye-InfoBank/scRAFIKI -resume -profile <YOUR_PROFILE> --samplesheet "samplesheet.csv"
Samplesheet
The samplesheet is a CSV file whose header row uses the following column names:
Column | Description | Default |
---|---|---|
id | The dataset identifier, needs to be unique across datasets | required |
input_adata | Path to the input AnnData (.h5ad) file | required |
min_counts | Minimum number of counts per cell | 0 |
max_counts | Maximum number of counts per cell | Infinity |
min_genes | Minimum number of genes per cell | 0 |
max_genes | Maximum number of genes per cell | Infinity |
max_pct_mito | Maximum percentage of mitochondrial genes per cell | 100 |
no_symbols | Set to true to enable conversion of var_names to gene symbols. If enabled, Ensembl and Entrez IDs can be processed. | false |
transfer | Perform integration using the transfer learning method scArches | false |
Cell filtering will be handled by the pipeline if the respective columns are specified. If filtering has been done before, the columns can be omitted.
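As an illustration of the format described above, the following sketch writes a small samplesheet with two datasets. The dataset identifiers, file paths, and threshold values are hypothetical placeholders, and only a subset of the optional filtering columns is included; adjust them to your own data.

```shell
# Write a minimal samplesheet with two datasets.
# The ids, paths and thresholds below are made-up placeholders.
# Only some optional columns are used; the others are omitted entirely.
cat > samplesheet.csv <<'EOF'
id,input_adata,min_counts,min_genes,max_pct_mito
dataset_a,/path/to/dataset_a.h5ad,500,200,20
dataset_b,/path/to/dataset_b.h5ad,1000,300,15
EOF
```

Each row describes one dataset, and the `id` values must differ between rows.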
Parameters
Parameter | Description | Default |
---|---|---|
samplesheet | Path to the samplesheet | required |
outdir | Path to the output directory | ./output |
celltypist_model | Celltypist model to use for annotation; a list of possible values is available from the CellTypist project. Make sure to add the .pkl extension, e.g. Cells_Intestinal_Tract.pkl. If the parameter is null, celltypist will not be executed. | null |
leiden_resolutions | List of resolutions for clustering | [0.25, 0.5, 0.75, 1, 1.5, 2] |
integration_methods | List of integration methods to use | ["scvi", "scanvi", "harmony", "scgen", "scanorama", "bbknn", "desc", "combat", "trvaep"] |
scshc | Run sc-SHC clustering and QC | false |
entropy | Calculate entropy of clusterings compared to batches | false |
cell_cycle | Run cell cycle analysis | true |
benchmark_hvgs | Number of highly variable genes to use for scIB benchmarking. Benchmarking is only executed if a positive value is set. | 0 |
integration_hvgs | Number of highly variable genes to use for integrations | 6000 |
normalization_method | Method to use for normalizing count values. One of: raw (no normalization), scTransform, log_total, pearson_residuals | log_total |
min_cells | Minimum number of cells in which a gene must be expressed to be included in the pipeline output | 10 |
decontX | Perform ambient RNA removal using decontX | true |
has_extended | Set to true if any of the datasets has the transfer option activated in the samplesheet | false |
has_celltypes | Set to false if none of the input datasets has annotations in the cell_type metadata field | true |
custom_metadata | Custom metadata fields that should be passed through the pipeline. If a field is not present in all datasets, Unknown will be used as a filler. | [] |
custom_hvgs | Can be used to manually add genes to the automated selection of highly variable genes | [] |
publish_mode | Defines whether the output files are copied (copy) or symlinked (symlink) | symlink |
max_cpus | Maximum number of CPUs to use | 60 |
max_memory | Maximum amount of memory to use | 300.GB |
max_time | Maximum runtime of a job | 48.h |
Parameters can be set as described in the Nextflow documentation.
Note that the integration methods scvi and scanvi will be executed regardless of the specified integration methods, as they are required for some downstream analysis steps.
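One way to set several parameters at once is a Nextflow params file. The sketch below writes a `params.yaml` using parameter names from the table above; the chosen values (output directory, resolutions, method subset) are illustrative assumptions, not recommended settings.

```shell
# Collect pipeline parameters in a YAML params file.
# Parameter names come from the table above; the values are examples only.
cat > params.yaml <<'EOF'
samplesheet: "samplesheet.csv"
outdir: "./results"
leiden_resolutions: [0.5, 1]
integration_methods: ["scvi", "scanvi", "harmony"]
celltypist_model: "Cells_Intestinal_Tract.pkl"
EOF
```

The file can then be passed to the pipeline by adding Nextflow's `-params-file params.yaml` option to the `nextflow run` command. Single values can alternatively be given on the command line with a double dash, as done for `--samplesheet`.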
Profiles
The following profiles are available:
- standard: Run the pipeline on your local machine without containerization; requires all dependencies to be installed (not recommended)
- docker (if running on an ARM-based machine, use docker,arm)
- singularity
- podman
- shifter
- charliecloud
- apptainer
- gitpod
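The choice between `docker` and `docker,arm` can be scripted. The following is a minimal sketch that assumes Docker is the container engine and that `uname -m` reports `arm64` or `aarch64` on ARM-based machines; it only prints the resulting command so it can be inspected before running.

```shell
# Select the profile from the CPU architecture (assumes Docker is used).
case "$(uname -m)" in
  arm64|aarch64) PROFILE="docker,arm" ;;  # ARM-based machine
  *)             PROFILE="docker" ;;      # any other architecture
esac

# Print the command instead of executing it.
echo "nextflow run Mye-InfoBank/scRAFIKI -resume -profile ${PROFILE} --samplesheet samplesheet.csv"
```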