003‐ Getting Started - rezakj/iCellR GitHub Wiki

How to use iCellR for analyzing scRNA-seq data

Load data

Load from sparse matrix format

# Step 1: Load the required library
library("iCellR")

# Step 2: Load the PBMC dataset from 10x Genomics processed files
# Specify the directory containing 10x Genomics files (barcodes.tsv, genes.tsv/features.tsv, and matrix.mtx)
my.data <- load10x(data.dir = "filtered_gene_bc_matrices/hg19/")

# Notes:
# - The directory ("filtered_gene_bc_matrices/hg19/") should include the following:
#   - `barcodes.tsv` (cell barcodes)
#   - `genes.tsv` or `features.tsv` (gene names or features)
#   - `matrix.mtx` (sparse expression matrix)
# - The data can be zipped or unzipped; iCellR handles both.

Alternative Formats for Data Loading:

If your data is in .csv or .tsv format

# Read the dataset directly from a .tsv.gz file
my.data <- read.delim("my_sample_RNA.tsv.gz", header = TRUE)

# For uncompressed .csv or .tsv files:
# my.data <- read.csv("my_sample_RNA.csv", header = TRUE)

If your data is in .h5 format:

# Load the hdf5r library to work with .h5 files
library(hdf5r)

# Load the dataset from an h5 file
my.data <- load.h5(file = "filtered_feature_bc_matrix.h5")

If your data is in S3 or S4 object format types like (Seurat or iCellR objects, etc.)

Here we use a Seurat object as an exacple:

# my.Seurat.object is your Seurat 5 object name

# get the raw data from your Seurat object slots 

my.data <- as.data.frame(as.matrix(my.Seurat.object@assays$RNA@layers$counts))
rownames(my.data) <- rownames(my.Seurat.object@assays$RNA@[email protected])
colnames(my.data) <- rownames(my.Seurat.object@assays$RNA@[email protected])

If you want to see the help page for any function in R, simply use a question mark (?) followed by the function name. Here's an example:

?load10x

Aggregate data

Conditions in iCellR are defined or displayed in the column names of the data and are separated by an underscore (_) sign. If you want to merge multiple datasets (data frames/matrices) into one file and run iCellR in aggregate mode (combining all samples together), you can accomplish this using the data.aggregation function.

Example: Suppose you have divided your sample into four datasets and need to aggregate them into a single matrix. Let's say the samples are WT, KO, Ctrl, and KD. After aggregating these datasets into one matrix, iCellR will recognize the presence of four distinct samples for further analyses, such as batch alignment, plotting, differential expression (DE), and more. Here, I have divided this sample into four datasets for a test run.


# Check the dimensions of the dataset
dim(my.data)
# Output: [1] 32738  2700

# Divide your dataset into four separate samples for this example
sample1 <- my.data[1:900]
sample2 <- my.data[901:1800]
sample3 <- my.data[1801:2300]
sample4 <- my.data[2301:2700]

# Merge all samples into a single aggregated file
my.data <- data.aggregation(samples = c("sample1", "sample2", "sample3", "sample4"),
                            condition.names = c("WT", "KO", "Ctrl", "KD"))

To check the head (the first few rows) of your file or dataset in R, use the head() function. Here's how you can do it.

This snippet shows how to inspect the header (column names) and the data for the first two cells in an aggregated data file:

# Display the head of your aggregated data and extract the first 2 columns (cells)
head(my.data)[, 1:2]

# Example Output:
#         WT_AAACATACAACCAC-1 WT_AAACATTGAGCTAC-1
# A1BG                       0                   0
# A1BG.AS1                   0                   0
# A1CF                       0                   0
# A2M                        0                   0
# A2M.AS1                    0                   0

# as you see the header (column names) have the condition names added to the UMIs

Make an object of class iCellR.

my.obj <- make.obj(my.data)
my.obj
###################################
,--. ,-----.       ,--.,--.,------.
`--''  .--./ ,---. |  ||  ||  .--. '
,--.|  |    | .-. :|  ||  ||  '--'.'
|  |'  '--'\   --. |  ||  ||  |
`--' `-----' `----'`--'`--'`--' '--'
###################################
An object of class iCellR version: 1.6.0
Raw/original data dimentions (rows,columns): 32738,2700
Data conditions in raw data: Ctrl,KD,KO,WT (500,400,900,900)
Row names: A1BG,A1BG.AS1,A1CF ...
Columns names: WT_AAACATACAACCAC.1,WT_AAACATTGAGCTAC.1,WT_AAACATTGATCAGC.1 ...
###################################
   QC stats performed:FALSE, PCA performed:FALSE
   Clustering performed:FALSE, Number of clusters:0
   tSNE performed:FALSE, UMAP performed:FALSE, DiffMap performed:FALSE
   Main data dimensions (rows,columns): 0,0
   Normalization factors:,...
   Imputed data dimensions (rows,columns):0,0
############## scVDJ-seq ###########
VDJ data dimentions (rows,columns):0,0
############## CITE-seq ############
   ADT raw data  dimensions (rows,columns):0,0
   ADT main data  dimensions (rows,columns):0,0
   ADT columns names:...
   ADT row names:...
############## scATAC-seq ############
   ATAC raw data  dimensions (rows,columns):0,0
   ATAC main data  dimensions (rows,columns):0,0
   ATAC columns names:...
   ATAC row names:...
############## Spatial ###########
Spatial data dimentions (rows,columns):0,0
########### iCellR object ##########

QC and filtering

my.obj <- qc.stats(my.obj)

Plot QC

Default Behavior of Plotting Functions:

In iCellR, all plotting functions generate interactive HTML files by default. These interactive plots are useful for exploring data visually in web browsers. If you prefer static plots (e.g., for quick visualization or embedding in reports), you can disable interactivity by setting the parameter interactive = FALSE.

# plot UMIs, genes and percent mito all at once and in one plot. 
# you can make them individually as well, see the arguments ?stats.plot.
stats.plot(my.obj,
	plot.type = "three.in.one",
	out.name = "UMI-plot",
	interactive = FALSE,
	cell.color = "slategray3", 
	cell.size = 1, 
	cell.transparency = 0.5,
	box.color = "red",
	box.line.col = "green")

# Scatter plots
stats.plot(my.obj, plot.type = "point.mito.umi", out.name = "mito-umi-plot")
stats.plot(my.obj, plot.type = "point.gene.umi", out.name = "gene-umi-plot")

Filtering Options in iCellR

The iCellR package provides flexibility to filter single-cell RNA-seq datasets based on various metrics, helping improve data quality and remove unwanted cells or genes from the analysis. You can filter your data using the following criteria:

Library Sizes (UMIs): Filter cells based on the total library size (number of UMIs per cell). This can help exclude cells with very low UMI counts, which might indicate doublets or empty droplets.

Number of Genes per Cell: Filter cells by the number of detected genes. For example, you can remove cells with fewer than a certain threshold of expressed genes to exclude low-quality cells.

Percent Mitochondrial Content: Filter cells by mitochondrial content. Typically, cells with excessively high mitochondrial expression (e.g., >10% of total expression) may indicate stressed or dying cells.

Based on One or More Genes: Select cells whose expression levels meet criteria for one or more specific genes (e.g., filter based on marker gene expression).

Cell IDs: Filter cells by specific cell IDs. This allows for targeted removal or selection of cells identified in previous analyses or metadata.

# Apply multiple filters to the iCellR object
my.obj <- cell.filter(
  my.obj,
  min.mito = 0,   # Minimum fraction of mitochondrial content allowed
  max.mito = 0.05, # Maximum fraction of mitochondrial content allowed
  min.genes = 200, # Minimum number of detected genes per cell
  max.genes = 2400, # Maximum number of detected genes per cell
  min.umis = 0,    # Minimum UMI count allowed
  max.umis = Inf   # Maximum UMI count (set to infinite to not limit)
)

# Example Output:
#[1] "cells with min mito ratio of 0 and max mito ratio of 0.05 were filtered."
#[1] "cells with min genes of 200 and max genes of 2400 were filtered."
#[1] "No UMI number filter"
#[1] "No cell filter by provided gene/genes"
#[1] "No cell id filter"
#[1] "filters_set.txt file has beed generated and includes the filters set for this experiment."

You can add filters for specific genes. For example, the following line filters out cells that do not have counts for the genes "RPL13" or "RPL10":

# my.obj <- cell.filter(my.obj, filter.by.gene = c("RPL13","RPL10")) # filter our cell having no counts for these genes

This removes the cell WT_AAACATACAACCAC.1 from your dataset.

# my.obj <- cell.filter(my.obj, filter.by.cell.id = c("WT_AAACATACAACCAC.1")) # filter our cell cell by their cell ids.

Check the dimensions after filtering

dim([email protected])
# Output: [1] 32738  2637

Down sampling

What is Down-Sampling?

Purpose: Down-sampling ensures that each condition (e.g., treatment groups like WT, KO, Ctrl, etc.) has the same number of cells for balanced comparisons.

Why?:

Prevent bias in downstream analyses caused by unequal cell counts across conditions.

Down-Sampling: Important Considerations

This step is optional and should be used with caution, as it is generally not recommended. Down-sampling may result in the loss of important or rare cell populations, which can impact the accuracy of your analysis, especially for heterogeneous datasets with diverse cell types.

However, there are cases where down-sampling can be useful, such as:

When working with datasets containing an extremely large number of cells. To speed up computational analysis for complex workflows or resource-limited environments. For testing purposes or when uniform cell counts are needed across conditions. Ultimately, the decision to down-sample should be made based on your specific experimental goals and the nature of your data. This option is available if necessary, but its use should be carefully weighed against the potential impact on downstream analyses.

# optional
# Perform down-sampling to equalize cells across conditions (optional)
# my.obj <- down.sample(my.obj)

#Before Down-Sampling:
#The dataset initially contains cells as follows:
#Ctrl: 877 cells
#KO: 877 cells
#WT: 883 cells
#After Down-Sampling:
#Down-sampling has equalized the number of cells across all conditions at 877 cells each.

Data Normalization

Normalization is an essential step in single-cell RNA sequencing analysis. iCellR provides several options for normalization, and you can choose the best approach depending on your study objectives and dataset characteristics.

Options for Normalization:

Use iCellR’s Built-in Normalization Methods:

iCellR offers multiple normalization techniques tailored to single-cell experiments. One highly recommended method is ranked.glsf.

External Tools for Normalization:

You can normalize your data using external tools, like:

DESeq2 (Geometric Normalization): Popular for bulk RNA-seq but adaptable for single-cell studies.

Scran: Computes size factors using clustering-based normalization for single-cell datasets.

After normalization, you can import the externally normalized data into iCellR for further analysis.

"Ranked GLSF" Normalization: What is it?

Ranked Geometric Library Size Factor (ranked.glsf) is inspired by DESeq2's Geometric Mean Size Factor normalization, but adapted for single-cell challenges. The ranked component makes it better suited for sparse datasets by focusing on highly expressed genes.

Designed to handle single-cell datasets with lots of zeros (dropouts) in the matrix.
Because it uses a geometric approach, this normalization reduces batch-wise differences caused by variable library sizes.

my.obj <- norm.data(my.obj, 
     norm.method = "ranked.glsf",
     top.rank = 500) # best for scRNA-Seq
# This focuses on the top 500 most highly expressed genes for calculating library size normalization.

# more examples
#my.obj <- norm.data(my.obj, norm.method = "ranked.deseq", top.rank = 500)
#my.obj <- norm.data(my.obj, norm.method = "deseq") # best for bulk RNA-Seq 
#my.obj <- norm.data(my.obj, norm.method = "global.glsf") # best for bulk RNA-Seq 
#my.obj <- norm.data(my.obj, norm.method = "rpm", rpm.factor = 100000) # best for bulk RNA-Seq
#my.obj <- norm.data(my.obj, norm.method = "spike.in", spike.in.factors = NULL)
#my.obj <- norm.data(my.obj, norm.method = "no.norm") # if the data is already normalized

Perform second QC (optioal) After initial filtering and normalization, a second QC step can be performed to further refine the dataset.

#my.obj <- qc.stats(my.obj,which.data = "main.data")

#stats.plot(my.obj,
#	plot.type = "all.in.one",
#	out.name = "UMI-plot",
#	interactive = F,
#	cell.color = "slategray3", 
#	cell.size = 1, 
#	cell.transparency = 0.5,
#	box.color = "red",
#	box.line.col = "green",
#	back.col = "white")

Scale data (optional) Why Scaling is Not Required in iCellR

In iCellR, scaling the data is handled dynamically or "on the fly" during tasks such as plotting or running dimensionality reduction methods like PCA.

Here's why this design is beneficial:

To save storage: this eliminates the need for permanently scaling your main dataset beforehand. If you do choose to scale your data manually, scaling does not overwrite the main dataset. Instead, scaled data is saved into a separate slot in your iCellR object. iCellR automatically scales the data as needed during specific functions like plot.tsne() or run.pca().
Untransformed Data for Differential Expression Analysis: Untransformed data is used for generating accurate fold-change values during Differential Expression (DE) Analysis.

# my.obj <- data.scale(my.obj)

Gene stats Gene statistics typically involve summarizing the behavior or characteristics of genes across cells in your scRNA-seq dataset. iCellR provides tools to calculate and explore gene-level info, such as:

Gene Expression Levels: Aggregate counts or normalized values for specific genes across all cells or conditions.

Gene Detection Frequency: Proportion of cells in which each gene is expressed (non-zero counts).

Gene Variance: Variability in expression levels across all cells or conditions to identify highly variable genes.

Top-Expressed Genes: Identify genes that are most highly expressed across the dataset or within a condition.

my.obj <- gene.stats(my.obj, which.data = "main.data")

head([email protected][order([email protected]$numberOfCells, decreasing = T),])
#       genes numberOfCells totalNumberOfCells percentOfCells  meanExp
#30303 TMSB4X          2637               2637      100.00000 38.55948
#3633     B2M          2636               2637       99.96208 45.07327
#14403 MALAT1          2636               2637       99.96208 70.95452
#27191 RPL13A          2635               2637       99.92416 32.29009
#27185  RPL10          2632               2637       99.81039 35.43002
#27190  RPL13          2630               2637       99.73455 32.32106
#               SDs condition
#30303 7.545968e-15       all
#3633  2.893940e+01       all
#14403 7.996407e+01       all
#27191 2.783799e+01       all
#27185 2.599067e+01       all
#27190 2.661361e+01       all

Make a gene model for clustering Creating a gene model for clustering in single-cell RNA-seq analysis involves selecting a subset of genes (e.g., highly variable genes, marker genes, or genes of interest) that are most informative for identifying cell clusters. This process helps reduce noise and focus on biologically relevant features for unsupervised clustering.

This function will help you find a good number of genes to use for running PCA.

# See model plot 
make.gene.model(my.obj, my.out.put = "plot",
	dispersion.limit = 1.5, 
	base.mean.rank = 1500, 
	no.mito.model = T, 
	mark.mito = T, 
	interactive = F,
	out.name = "gene.model")
	
# Write the gene model data into the object

my.obj <- make.gene.model(my.obj, my.out.put = "data",
	dispersion.limit = 1.5, 
	base.mean.rank = 1500, 
	no.mito.model = T, 
	mark.mito = T, 
	interactive = F,
	out.name = "gene.model")

head([email protected])
# "ACTB"  "ACTG1" "ACTR3" "AES"   "AIF1"  "ALDOA"

# get html plot (optional)
#make.gene.model(my.obj, my.out.put = "plot",
#	dispersion.limit = 1.5, 
#	base.mean.rank = 1500, 
#	no.mito.model = T, 
#	mark.mito = T, 
#	interactive = T,
#	out.name = "plot4_gene.model")

To view an the html interactive plot click on this links: Dispersion plot

Run PCA

Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique often used in single-cell RNA-seq analysis to represent high-dimensional gene expression data in a lower-dimensional space. However, PCA does not harmonize or integrate or batch align the data.

Skip the PCA step if you plan to perform batch correction, which typically realigns data across batches and conditions. For batch correction (sample alignment/harmonization/integration) see the sections; CPCA, CCCA, MNN or anchor alignment.

# When you run run.pca, iCellR uses raw or normalized data directly without correcting for batch-related artifacts. This results in principal components that may reflect technical variations (batch effects) rather than true biological signals.

# run PCA 
my.obj <- run.pca(my.obj, method = "gene.model", gene.list = [email protected],data.type = "main")

opt.pcs.plot(my.obj)

2 round PCA (optional)

For finding top genes in the top principal components (PCs) and re-running PCA to achieve better segregation of cell populations. This is optional and not recommended except in certain cases.


#my.obj <- find.dim.genes(my.obj, dims = 1:10, top.pos = 20, top.neg = 20) # (optional)

#length([email protected])
# 211

# second round PC
#my.obj <- run.pca(my.obj, method = "gene.model", gene.list = [email protected],data.type = "main")

Perform tSNE, UMAP, KNetL, PHATE, destiny, diffusion maps and more

Run tSNE

- tSNE
my.obj <- run.pc.tsne(my.obj, dims = 1:10)

Run UMAP

# UMAP
my.obj <- run.umap(my.obj, dims = 1:10)

Run KNetL map

Don't forget to set the zoom in the right range.

my.obj <- run.knetl(my.obj, dims = 1:20, zoom = 110)

KNetL works with a higher resolution; therfore using dims = 20 (2 times the number of PCs used for UMAP) usually produces the best results for most datasets.

Recommended Zoom Settings:

< 1,000 cells: Zoom range 5–50

1,000–5,000 cells: Zoom range 50–200

5,000–10,000 cells: Zoom range 100–300

10,000–30,000 cells: Zoom range 200–500

> 30,000 cells: Zoom range 400–600

Additional Notes: A zoom of 400 generally works well for large datasets, but adjustments might be needed for your desired resolution.

Remember:

Lower zoom numbers = zoom in.

Higher zoom numbers = zoom out (reverse logic).

Run diffusion map

# diffusion map
# this requires python packge phate or bioconductor R package destiny
# How to install destiny
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("destiny")
# How to install phate
# pip install --user phate
# Install phateR version 2.9
# wget https://cran.r-project.org/src/contrib/Archive/phateR/phateR_0.2.9.tar.gz
# install.packages('phateR/', repos = NULL, type="source")
# or 
# library(devtools)
# install_version("phateR", version = "0.2.9", repos = "http://cran.us.r-project.org")


# optional 
# library(destiny)
# my.obj <- run.diffusion.map(my.obj, dims = 1:10)
# or 
# library(phateR)
# my.obj <- run.diffusion.map(my.obj, dims = 1:10, method = "phate")

Visualizing the results of dimensionality reductions before clustering (optional)

# Generate cluster plots with different methods
A <- cluster.plot(my.obj, plot.type = "pca", interactive = FALSE)
B <- cluster.plot(my.obj, plot.type = "umap", interactive = FALSE)
C <- cluster.plot(my.obj, plot.type = "tsne", interactive = FALSE)
D <- cluster.plot(my.obj, plot.type = "knetl", interactive = FALSE)

# Load the gridExtra library for arranging multiple plots
library(gridExtra)

# Combine and arrange the plots (PCA, UMAP, t-SNE, and KNetL) in a grid layout
grid.arrange(A, B, C, D)

Save your iCellR object

Saving your iCellR object allows you to preserve your analysis results and reload them later without having to rerun the pipeline.

Option 1:

# Save the iCellR object to a file
save(my.obj, file = "my.obj.Robj")

# To load the object later
load("my.obj.Robj")

Option 2:

# Save the iCellR object to a file
saveRDS(my.obj, file = "my_iCellR_object.rds")

# To load the object later
my.obj <- readRDS("my_iCellR_object.rds")