COMPARE Suite - morchalabi/COMPARE-suite GitHub Wiki

COMPARE-suite is a novel ultra-fast software suite (pipeline) to analyze small-scale to high-throughput multiparametric screenings as in drug discovery and personalized medicine. The suite consists of several modules for experiment quality control, bias correction, similarity measurement, clustering and visualization. It can process the output of any screening technology like mass cytometer, flow cytometer and high-content microscope in both text and binary data format. It is data-size independent, can effectively circumvent the batch effect, needs neither dimension reduction nor subsampling. Equipped with multithreading mechanism for parallel computing, the suite can process hundreds of samples with numerous markers containing millions of events in a few hours rather than days. It can also run through graphical user interface on desktop machines or command-line interface on computer servers.

Contents
1 Introduction
2 Acquisition and installation
3 Modules Import Spillover compensation Signal drift correction Cell viability correction Similarity matrix generator Similarity matrix heatmap Negative control outlier detection Clustering
4 GUI
5 Troubleshooting

Introduction

Multiparametric phenotypic screening of cells, assessing for example their responses to small molecules or knockdown/knockout of specific genes, is a powerful approach to understanding cellular systems and identifying potential new therapeutic strategies. However, automated tools for analyzing similarities and differences between a large number of tested conditions have not been readily available. Methods designed for clustering cells cannot identify the differences between samples effectively. COMPARE-suite was developed for ultra-fast and robust analysis of multiparametric screening data. Applying a mass-aware gridding algorithm using dynamic programming, it performs automatic and effective similarity comparison for hundreds to thousands of tests and provides information about the treatment effect.

Benchmarking tests show that the suite can circumvent batch effects and perform a similarity analysis substantially faster than other available analysis tools. Applying it to high-throughput flow cytometry screening data (one of the most challenging and complex type), it was able to distinguish subtle phenotypic drug responses in a human sample and a genetically engineered mouse model with acute myeloid leukemia (AML). It revealed groups of drugs with similar responses even though their mechanisms are distinct from each other. In another screening, COMPARE effectively removed batch effects and effectively grouped samples from AML and myelodysplastic syndrome (MDS) patients using clinical flow cytometry data.

The suite consists of several modules for experiment quality control, bias correction, similarity measurement, clustering and visualization. This wiki page briefly introduces the modules and demonstrates how to run them. The video tutorial goes well with the text manuals and explains all the read-outs output by the suite such as plots, tables, etc.

Acquisition and installation

To acquire the latest release of COMPARE-suite, download one of the zip files from Releases. The downloaded suite has three subfolders named data: containing the assay data files, out: containing the readouts of the suit after each step, and src which contains the module scripts. To run the suite, you need to run these modules individually staged 1 to 8. COMPARE-suite is a self-organizing application meaning that it has dedicated folders for input, output and source files. However, it is still possible to keep input data or readouts in different locations and redirect the suit to them locations.

COMPARE-suite is purely written in R and therefore doesn't need "installation" in its classical sense. However, it relies on some R packages that must be already installed on your machine. To install the suite, first, install the latest release of R. Then, get the latest version of Bioconductor by starting R and entering the commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12")

At the time of writing, the latest versions of R and bioconductor were 4.0.2 and 3.12 respectively. Replace "3.12" with the latest one. To find out the latest versions of R and Bioconductor already installed on your machine type in the commands:

R.version
BiocManager::version()

If you already have the latest versions of R and Bioconductor, it is highly recommended to update old packages. Type in the following commands and follow any question possibly asked. You may skip over troublesome packages as they won't probably interfere with COMPARE-suite's functionality. You may need to contact your system administrator if it crashes all the time.

update.packages(ask = FALSE)
BiocManager::install(ask = FALSE)

You also need to make sure Rscript command is available in the command line. Open the terminal window (not R):

$ Rscript --version
R scripting front-end version 4.0.2 (2020-06-22)

Having successfully installed R and Bioconductor, you may now proceed with installing the packages COMPARE-suite relies on. As these are basically prevalent ones, there should be no installation issues. You may open INSTALL.R script in src to check out the packages or even install them individually. Like before, make sure to follow any question possibly asked. Type the command below in terminal (not R):

$ Rscript INSTALL.R

To test if the installation went well, you may again source INSTALL.R. We expect to see all the packages load successfully. If there are messages complaining that some packages don’t exist, then you need to install them individually or reach out to your system administrator for assistance. Also please refer to the Troubleshooting section.

The suit needs an annotation file complying with a specific layout. The annotation file provides the suite with enough information on the assay, such as how many plates there are, types of samples and their locations. You may find an example one in the data folder of the suite. If some of the columns are irrelevant to your assay, just leave them blank (NA is also taken as blank) or populate them by something similar.

The data needed to follow this tutorial can be downloaded from FLOWRepository with the repository ID FR-FCM-Z357. The data set contains the sample files of an AML mouse model analyzed with a high-throughput flow cytometer (a plate reader technology). The files are in Flow Cytometry Standard (FCS) format. The investigator should replace these data files with their screening data files.

Modules

COMPARE-suite (for simplicity, COMPARE) is comprised of several modules for experiment quality control (QC), bias correction, similarity measurement, clustering, and visualization. The figure below shows the steps in the downstream analysis of a high-throughput flow cytometry screening of mouse AML cells treated with hundreds of antineoplastic agents. During quality control, sources of bias and errors in high-throughput flow cytometry data such as autofluorescence, bioluminescence, carryover effect, batch effect, edge effect, drift in the marker fluorescence intensity signal (signal drift) and drift in the number of live cells (cell viability drift) are identified. A user at this point can decide to discard samples or even (microtiter) plates associated with worst affected wells. As the other sources of bias are not usually possible to be removed in silico, the bias correction modules correct for only signal and cell viability drifts (including the batch effect) using regression analysis.

Based on similarity, COMPARE performs clustering using a weighted graph algorithm. Initially, all nodes (samples) are connected forming a weighted complete graph wherein weights represent measured similarity values. The graph is then pruned to remove potential false positive edges using a similarity threshold inferred automatically from negative controls. After constructing a linked graph, clustering is tantamount to finding maximal cliques (complete subgraphs that cannot be extended) each containing the treatments with similar responses.

COMPARE carries out the analysis using parallel computing. The modules can also run independently of each other. The similarity and clustering modules of COMPARE may be applied autonomously to any relevant problem in data mining with tabular data (column: variable, row: observation).

Import

Import is the first module of the suit. Its function is to discard broken files or those with too few events like cells or beads. It creates a folder called REMOVED and then moves undesirable files to it. This lets you later have a look at those files and delete them, if you wish. This module can import tab-delimited tsv, comma-delimited csv and fcs files. In case tsv and csv text files are input, they are coerced to binary FCS files for the sake of storage and programming.

All modules can be invoked by typing Rescript and then the name of the module. If you type such a command with no argument, you will see the full description of the command-line options, if any. The usage shows this module takes 2 arguments. --min-events specifies the minimum number of events (rows) a file needs to be called valid. Use -inURL to redirect the module to read in data files from a different location than the data folder. If you don't use this argument, then by default the module looks for data files in the data folder.

To invoke this module, first redirect current directory to src folder using cd command. To find out your current directory type in:

$ pwd
/Users/bnk107/Documents/workspace/GitHub/COMPARE-suite/src

pwd command prints the current directory which is already src here.pwd and cd on POSIX systems (macOS, Linux, etc.) are equivalent to cd in Windows. Also, in Windows, you need to use \ instead of / for directory paths. Here for the AML mouse assay, we set the minimum number of live cells to 1000. Several FCS files should be moved to REMOVED.

$ Rscript run_1_import.R 

Minimum number of events/cells to process a fcs file was not passed in. Usage:
Rscript run_1_import.R \
--min-events 1000 \
-inURL ../data

$ Rscript run_1_import.R --min-events 1000 -inURL ../data

import

If any file has been moved to REMOVED, the annotation file is updated to include information on valid files only. If the input data files are text files, you can now delete them along with those in RMEOVED.

Spillover compensation

The second module compensates for fluorescence spillover in FCS files if the plate reader was actually a high-throughput flow cytometer. In brief, in flow cytometry, spectra of different color lasers may overlap. This causes the colors to spillover into unrelated channels (detectors) each measuring one specific color. In case the FCS files have not been compensated yet, this preliminary step applies the correction matrix called spillover matrix to the channels, provided the matrix is embedded in them (FCS version 3+). If you are sure that the FCS files contain only compensated channels, you may skip over this step, otherwise they must be compensated before moving forward.

To invoke this module type in the following command; it doesn’t have specific arguments except for optional -inURL for redirecting it to the input files:

$ Rscript run_2_spillover_compensation.R

Please note that the spillover matrix is computed by the plate reader and written to the FCS file. An FCS file normally contains only original uncompensated channel values unless the investigator has already applied the matrix to the channels and saved only the compensated values in the file. You might be tricked into thinking that your FCS files already contain compensated values when you open them in FCS file parsers such as FlowJo or FCS Express which automatically apply the matrix before visualization. Therefore, for the downstream analyses of FCS files like those carried out by COMPARE, this process must be done by the investigator using this module. Please also note that this module will modify FCS files by replacing the spillover matrix with the identity matrix and adding "compensated" postfix to $FIL keyword. Therefore, it is a good idea to keep a backup of your screening data.

Signal drift correction

The third module checks if the assay suffers from some sources of bias such as edge effect, signal drift, autofluorescence, bioluminescence, carryover effect and batch effect in favor of specific regions of the plates. However, this module only corrects the signal drift bias and batch effect, as the other types should be naturally avoided during an experiment.

Signal drift and cell viability drift could be the main sources of bias within and between the plates of an assay. Depending on the protocol by which wells are processed, time may become a major concern so that specific wells may have lower or higher values than expected. To correct for these drifts, COMPARE employs a two-step correction: intra-plate drift correction and inter-plate drift (batch effect) correction. For a given plate, COMPARE first fits a linear regression model and then vertically translates well values with respect to the learned line as it rotates to slope zero. After correcting for the intra-plate bias, the inter-plate bias is corrected by aligning medians of the plates, that is, translating to a common baseline. The following figure shows signal (a) and cell viability drifts (b) in the mouse assay before and after correction.

This module has several arguments:

-chnl: the list of channels as in the data section of a FCS file or the columns of a csv/tsv file.

-correct: if signal drift should be corrected. To reveal potential bias (quality control mode), it should be set to FALSE, because, initially, we don’t want to correct anything, maybe correction is not necessary at all.

-drctn: the direction of bias, either column or row. Basically, we need to assume a direction for the bias, either column-wise or row-wise. We should try both the directions, but one at a time. COMPARE supports only these two orthodox directions.

--fit-plot and --heat-plot: if the regression plots and plate heatmaps should be output respectively.

To invoke this module to reveal any bias column-wise, type in the following command:

$ Rscript run_3_signal_drift_correction.R -chnl 'VL6-H,BL5-H,RL1-H,BL1-H' -correct FALSE -drctn column --fit-plot TRUE --heat-plot TRUE

After this run, we discarded wells in the columns 8-9 of plate 1 and the entire plate 3 ending up with 465 wells in total. We also scrapped the BL1-H channel owing to its instability. Typing the same command with the modified channel list and -correct TRUE will correct the column-wise bias:

$ Rscript run_3_signal_drift_correction.R -chnl 'VL6-H,BL5-H,RL1-H' -correct TRUE -drctn column --fit-plot TRUE --heat-plot TRUE

Like the previous module, this one also alters the FCS files and adds "corrected" postfix to $FIL after correction. It is worth mentioning that the randomization step in the experimental process in which chemical compounds are scattered over plates randomly is crucially important for being able to distinguish between noise (bias) and signal (biological meaning).

Cell viability correction

The fourth module checks if the module suffers from the cell viability bias and corrects for it. Like the signal drift correction module, to reveal this type of bias, you should initially set -correct to FALSE. The argument list remains pretty much the same but with no channel list this time. Like before, you initially need to assume a direction for the bias, either column-wise or row-wise.

To invoke this module for column-wise correction:

$ Rscript run_4_viability_correction.R -correct TRUE -drctn column --fit-plot TRUE --heat-plot TRUE

Similarity matrix generator

The 5th module generates the similarity matrix. To measure the similarity between two datasets, COMPARE divides each dimension into n subsets for each dataset individually. Therefore, a dataset with d dimensions (variables) will be gridded into at most n^d spatial units called hypercubes. COMPARE grids only the part of the space encompassing data points, avoiding empty regions. It then measures the proportion of data points for either dataset within each of the corresponding hypercubes. The difference between the two proportions is indicative of the similarity within that relative spatial position represented by the hypercube. The similarity in the exclusive hypercubes is considered 0. Averaging these differences across all the hypercubes indicates the amount of similarity between the two datasets. In this way, COMPARE captures the morphology of a sample's data enabling it to measure similarity even in the presence of moderate signal shift. To generate a similarity matrix of multiple input samples, COMPARE runs in parallel. The similarity matrix could then be used for identifying clusters of samples such as drugs with similar dose responses like predicting the mechanism of action of drugs in development.

The algorithm is illustrated in the following figure. It first forms an abstract square grid (red) encompassing all the data points within the range (1.1, 9.6). At the top level, all the cells (table rows) are in the region number (RN) 0. First iteration divides the first dimension formed by CD1 marker (a variable) into 3 (= n) subsets. Assuming a left-first numbering rule, the RN column is dynamically updated (blue column) for each subset using some information such as current RN (grey column), current dimension and possible number of families and siblings behind. For instance, child node 12 has parent node 3, could have 2 siblings (node 6, node 9) and 2 families (parent 1, parent 2) behind, although children 11 and 9 were never born as marked by X. Final leaves are called hypercubes (HCs). The corresponding grid on the biplot demonstrates that two regions devoid of data points have not been assigned any hypercube. For comparing two samples, they are first jointly normalized between a range like [-1,1]. The tree graph is just for better visualization and won't be implemented.

Like before this module needs a channel list. Additionally, it needs -n which is a positive integer by which each dimension is divided. Try something between 3 to 5 to get the most out of your assay. In the clustering module, it is explained how to figure out what an appropriate -n could be. Set it to 5 for this assay. -cpu argument specifies the number of CPU cores to run the module on in parallel. On a desktop machine with 4 cores, you may set it to 3, leaving at least one core free. If you are on a computer server, you can use many more CPU cores like 10, 20, 30 or something for this size of assay but always be careful with the available memory and consider that other folks may jump in to use the server. Depending on the size of the assay, number of CPU cores used and the amount of memory available to the module, this process may take a while.

To invoke this module, issue the following command:

$ Rscript run_5_similarity_matrix_generator.R -chnl 'VL6-H,BL5-H,RL1-H' -n 3 -cpu 3

The good news is since the modules are independent, you can run all the modules up to the 4th one on a desktop, and then continue on a server with many more cores and much more memory.

The similarity matrix of this assay is a huge matrix with hundreds of thousands of elements. Each value is the amount of spatial similarity between a pair of samples. In general, the similarity between two samples can be measured from many aspects like across their dimension distributions, common sub-clusters, etc. COMPARE measures spatial similarity meaning how much two things are morphologically similar in a high dimensional space. Speaking more about the effect of -n, consider two monozygotic twins. As you compare them across more features, they become less similar. This is what -n controls: the level of smoothing. Also, consider three samples A, B and C. It is possible while B is more similar to A across a subset of features, it turns out that B is more similar to C across another subset of features. So, by changing -n, closely similar samples may change places. Regardless of its intermediate applications, the similarity matrix is ultimately intended for clustering.

Similarity matrix heatmap

The 6th module is an optional stage, and you may skip to the 7th one. This module generates a heatmap of the similarity matrix which comes in handy for finding potent (biologically active agents) and impotent samples (inactive agents) very quickly as well as gives an overview of the assay. The heatmap is normalized between -1 and 1 so that the highest similarity is 1 and the lowest one is -1 and then transformed by a power function to emphasize the extremes of the range. The size and color of each orb represent the amount of similarity, blue: 1, red: -1. The focus of this heatmap is on the highest and lowest values and their neighboring ones not those between. Using this map, you may find hits (highly potent samples) in the top-left corner with agility, as the map is sorted into ascending order of total similarity score.

To invoke this module (no arguments needed):

$ Rscript run_6_similarity_matrix_heatmap.R

Negative control outlier detection

Since the clustering module infers the similarity cutoff from the similarities between negative controls, possible outliers among the negative controls should be removed beforehand. This is what the 7th module assists with. Also be advised that if you have followed this tutorial using our assay data, there is no outliers among the negative controls.

To invoke the module (no arguments needed):

$ Rscript run_7_negative_control_outlier_detector.R

In the heatmap, outlier samples appear less similar to the rest of the ensemble, a property observed in outliers. Such samples should be removed at this point. To do that, simply delete their corresponding entries from the annotation file. You don’t need to modify the similarity matrix or delete the FCS files directly. In parentheses, if you have enough negative controls, for example in this assay DMSO samples, well spread within and between the plates, one or two outliers wouldn’t really affect the inferred similarity threshold, because their effect would be basically manifested in the 3rd or 4th digit after the decimal point. The good practice, however, is to remove them to be on the safe side. This is because, in the clustering module, negative controls act like a magnet attracting impotent samples, so if there are negative control outliers, they can falsely attract potent samples creating false negatives.

Clustering

The last module is clustering. Clusters provide a lot of information. For example, by clustering the samples of a drug assay, you can identify different dose-specific drugs showing similar responses across the markers used. A direct application is when there are experimental drugs or those still in trials in your drug panel which have been clustered with known drugs, you can predict their response and subsequently mechanism of action across the markers used before taking decisive steps.

COMPARE uses a graphical clustering algorithm in which initially all nodes (samples) are connected forming a weighted complete graph wherein edges represent the similarity between the nodes. This graph is then pruned to remove potential false positive edges for a given cutoff inferred from negative controls. The optimal cutoff turns out to be the minimum weight in the maximum spanning tree of the negative control nodes. After pruning, some samples may end up being connected to the negative controls (impotent) and some disconnected (potent). After constructing this graph, clustering is tantamount to finding maximal cliques among potent agents. In addition to maximal cliques, it also reports disconnected components known as communities (a clique is actually a subset of a community). Communities can be seen as loose clusters. In a community, unlike a clique, similarity is not necessarily transitive meaning that if A is similar to B and B is similar to C, A is not necessarily similar to C. If these were three drugs within a community, we could not necessarily say they had an equal response unless they would form a clique.

Like before, this module takes a list of channels as well as a new argument -nn which stands for nearest neighbors used by UMAP. UMAP represents a dispersion map of the clusters. Set -nn to a big value like 1000 to let the module automatically use all the clusters for this assay.

To invoke this module:

$ Rscript run_8_clustering.R -chnl 'VL6-H,BL5-H,RL1-H' -nn 1000

-nn is actually the size of the local neighborhood (the number of neighboring points, here clusters) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. The developers recommend values in the range 2 to 100. Like other 2D projectors, depending on the perplexity of the high-dimensional data, you may get a different dispersion map for the same call to UMAP. Therefore, you had better try different values for -nn, several times each, to get a sufficiently informative map. In an informative map, you would be able to see different groups by coloring the clusters. These groups are mostly the identified communities the cliques come from. If you are not happy with the map, you can use the Centroids.txt file to generate your own map; you may get an idea from the corresponding snippet (look up # STEP 8: Dispersion map) in the module's script.

Having clustered the samples, you can weigh up different choices of -n from the 5th module. As a rule of thumb, if you experience one of the following events, you need to change -n and/or remove more negative controls less similar to others:

there are few clusters
there are too many singleton clusters
all samples are within the control community, i.e. they all have been called impotent
there is one or two very large communities of potent samples

In general, a normal sample graph for a good choice of -n should look like that of this assay.

GUI

All the modules have been integrated into the COMPARE's GUI. Check boxes related to the modules you would like to run and set their options accordingly. When you hover the cursor over the items, the descriptions of them appear. To start the GUI, type in the command:

$ Rscript COMPARE-suite.R

Since the GUI is a web application, you should copy-paste the IP:PORT address to which the web server is listening into a web browser like Firefox as shown below:

GUI_start

Every time you run a web application, the web server may assign it a different IP:PORT; note 127.0.0.1:6708 in the address bar of the Firefox browser. If the GUI doesn't start, it could be due to the firewall blocking the port. Contact your system administrator for assistance.

The GUI can also run in Rstudio which automatically uses its own integrated browser.

Troubleshooting

Depending on the configuration of your OS, particularly Linux distributions, you may encounter problems when installing R packages. For example, if you have difficulties installing inlmisc R package, consider installing gdal and geos first. On Red Hat Enterprise Linux (often abbreviated to RHEL distribution), you may type in the following commands:

$ sudo yum install gdal.x86_64 gdal-devel.x86_64 gdal-libs.x86_64 
$ sudo yum install proj.x86_64 proj-devel.x86_64 
$ sudo yum install proj-epsg.x86_64 proj-nad.x86_64
$ sudo yum install geos geos-devel

Please always Google these libraries to make sure you are installing the latest packages. Report any bug/issue on the issue page.