Getting Started - nolanlab/vortex GitHub Wiki

VorteX Clustering Environment

ATTENTION: If you are experiencing 'Serialization errors' while trying to run clustering, this is likely due to the latest Java update on Mac. To fix it, a) download the latest release (below) and b) create a new blank database and re-import the data that you are trying to cluster into the fresh database. Your old data is safe and you will still be able to access it, but you may not be able to create new clustering results in the old database (see wiki/Managing-Database-Connections).

Latest Release: Vortex 29-Jun-2017

Algorithm description and Citation


Overview

VorteX is a graphical tool for cluster analysis of multiparametric datasets in biology, especially single-cell data. It provides multithreaded implementations of clustering algorithms, including nonparametric density-based X-shift, hierarchical clustering, Mean-shift and K-medoids. VorteX is designed to help researchers explore biological data by providing an easy-to-use environment for cluster analysis and rich visualization of clustering results. Visualization tools include plotting of cluster profiles, biaxial scatterplots, 3D PCA, Minimum Spanning Trees (MST), Divisive Marker Trees (DMT), single-cell Force-Directed Layouts (scFDL) and ModulMap.

Installation and Usage

  • Make sure you have Java 8 installed. Windows users often have 32-bit Java installed by default. If so, remove it and install the latest 64-bit Java. You can get it from here: http://www.java.com/en/download/manual.jsp
  • Check how much RAM your computer has (Windows: Start->Computer->Right Click->Properties, Mac: Apple Button->About This Mac)
  • Download and unzip the latest VorteX: https://github.com/nolanlab/vortex/releases
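  • Double-click on launch.jar (Windows) or on one of the launch_XXGB.command files (macOS). Alternatively, on any operating system, start VorteX from the command line by typing java -Xmx16G -cp "lib/*" vortex.gui2.frmMain in the unzipped folder. The -Xmx16G option sets the maximum Java heap size to 16 GB; adjust it to the amount of RAM available on your computer.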
  • Ignore or override any security blocks/warnings
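
The command-line launch described above can also be scripted. The following is a minimal sketch, not part of VorteX: the helper name `vortex_command` is hypothetical, and it only assembles the argument list from the command shown in the list, with the heap size made configurable.

```python
import shlex


def vortex_command(heap_gb):
    """Build the argument list for launching VorteX (hypothetical helper).

    heap_gb is the maximum Java heap size in gigabytes, passed to -Xmx.
    The class path and main class are copied from the command in this guide.
    """
    if not isinstance(heap_gb, int) or heap_gb <= 0:
        raise ValueError("heap_gb must be a positive whole number of gigabytes")
    return ["java", f"-Xmx{heap_gb}G", "-cp", "lib/*", "vortex.gui2.frmMain"]


if __name__ == "__main__":
    # Print a shell-safe version of the command for a 16 GB heap.
    print(shlex.join(vortex_command(16)))
```

Passing the returned list to `subprocess.run(..., cwd=...)` with the unzipped VorteX folder as the working directory should start the application. The `lib/*` wildcard is expanded by Java itself rather than by the shell, which is why it is quoted on the command line and can be passed here unchanged.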

Storage configuration

  • On the first startup of VorteX, you have to specify the path where the data will be stored and the name of the database that will contain all the clustering data.
  • Click "Done". The message saying "Host successfully added" should appear.

Data import

  • Press the "New Dataset" button, then "Select Source Files". Navigate to the directory where your FCS and CSV files are stored and select one or multiple files. If multiple files are selected, events from all files will be concatenated into one dataset, but the source file identity of each event will be preserved using a special Annotation object. You will be able to use this annotation during visualization and statistics table computation, enabling you to overlay samples and compare multiple samples.
  • Click on the color bars to toggle the color and define which parameters will be used for clustering (Feature variables - blue), which will be imported but not used for clustering (Side variables - yellow), and which will be skipped during import (gray).
  • Select the numerical transformation; asinh(x/5) is recommended for CyTOF and FACS data. The noise threshold removes low-level noise from the clustering variables: all values (raw, before numerical transformation) that are below the threshold will be set to zero. This improves the separation of clusters in multidimensional datasets because it increases the sparseness of expression vectors and thus alleviates the so-called "curse of dimensionality".

  • Apply row filtering. The Euclidean noise filter removes any vector whose Euclidean norm (the square root of the sum of squares of all feature variables) falls below the specified threshold. This removes the "junk" measurements that don't have any signal in any clustering channel.

  • Limit the number of measurements that are taken from every file. This helps to keep the size of the dataset reasonable. The ballpark estimate of optimal dataset size is 500K events for a two-core laptop, 1M for a quad-core laptop with 16GB of RAM, and 2-3M for a 16-core workstation with 64GB of RAM.

  • Click "Finalize Import". The application may become unresponsive for a minute or two until the import is complete.
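
To make the import settings above more concrete, here is a minimal sketch of the noise threshold, the asinh(x/5) transformation and the Euclidean noise filter. It is an illustration only, not VorteX's actual code. The function names are hypothetical, and two details are assumptions that this guide does not specify: values exactly equal to the threshold are kept, and the Euclidean filter is applied to the transformed values.

```python
import math


def preprocess_event(raw, noise_threshold, cofactor=5.0):
    """Zero raw values below the noise threshold, then apply asinh(x / cofactor).

    Assumption: values exactly equal to the threshold are kept.
    """
    return [
        math.asinh((x if x >= noise_threshold else 0.0) / cofactor)
        for x in raw
    ]


def euclidean_norm(vector):
    """Square root of the sum of squares of all feature variables."""
    return math.sqrt(sum(v * v for v in vector))


def filter_events(raw_events, noise_threshold, min_norm):
    """Keep only events whose transformed feature vector has norm >= min_norm.

    Assumption: the filter is applied after the numerical transformation.
    """
    kept = []
    for raw in raw_events:
        features = preprocess_event(raw, noise_threshold)
        if euclidean_norm(features) >= min_norm:
            kept.append(features)
    return kept


if __name__ == "__main__":
    events = [[0.5, 0.2], [10.0, 0.5], [100.0, 50.0]]
    for features in filter_events(events, noise_threshold=1.0, min_norm=1.0):
        print([round(v, 3) for v in features])
```

In this toy input, the first event has no value above the noise threshold, so it becomes an all-zero vector and is dropped by the Euclidean filter. This is the "junk" measurement case described above.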

Clustering the Data

  • Press the "New Clustering" button.

  • Then, select the distance measure. Angular distance (arccosine of uncentered Pearson correlation) is fast and works well for multidimensional data, but it assumes that the data is zero-centered, i.e. the values of the negative population on every marker (feature variable) are distributed around zero. Euclidean distance doesn't require this assumption but takes longer to compute. Note: CyTOF data is typically zero-centered, while FACS data is often not, unless the mean of the negative population has been specifically subtracted from the values on every channel.

  • Choose the clustering algorithm. X-shift is the recommended default method.

  • Specify the algorithm parameters. Typically, clustering is done in batches, whereby a series of clustering results is produced with varying values of the free parameter. For instance, setting K from 150 to 5 in 30 steps will run clustering with K=150, 145, ..., 10, 5.

  • Press "Go". Clustering will start running and the progress notifications will be printed in real time into the console. Once ready, the window will close and the clustering results will appear in the lower left pane.

  • Select all the clustering results that were produced by this clustering run. You can use Ctrl+A (Cmd+A on Mac) to select all rows. Then right-click on any selected row and choose Validation->Find Elbow point for cluster number.

  • A message will appear that indicates the K that corresponds to the optimal clustering.

  • Now, select the clustering with the corresponding K. The list of clusters will appear in the upper right pane.

  • Select all clusters in the list (Ctrl+A on both Windows and Mac). Right-click on any cluster and select Create Graph->Force-Directed Layout.

  • Choose the maximum number of cells to be selected from each cluster and the number of nearest neighbors (N). The total number of cells should typically stay within 30,000 in order to maintain reasonable speed. The recommended N is 10. A window with a progress bar will be displayed. Once the progress bar disappears, the layout will begin automatically.
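
The two distance measures and the batch of K values mentioned in the steps above can be sketched as follows. This is an illustration under stated assumptions, not VorteX's implementation. The function names are hypothetical. Angular distance is computed as the arccosine of the uncentered correlation (cosine similarity), and the K values are assumed to be evenly spaced between the two end points, which reproduces the 150, 145, ..., 10, 5 example.

```python
import math


def angular_distance(a, b):
    """Arccosine of the uncentered Pearson correlation (cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angular distance is undefined for all-zero vectors")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_schedule(k_from, k_to, steps):
    """Evenly spaced K values from k_from to k_to (assumed spacing rule)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    stride = (k_to - k_from) / (steps - 1)
    return [round(k_from + i * stride) for i in range(steps)]


if __name__ == "__main__":
    dim, bright = [1.0, 2.0], [2.0, 4.0]  # same marker profile, different intensity
    print(angular_distance(dim, bright), euclidean_distance(dim, bright))
    print(k_schedule(150, 5, 30))
```

The two example vectors point in the same direction, so their angular distance is close to zero even though their Euclidean distance is not. Because the angle is measured from the origin, the result depends on where zero lies, which is why angular distance assumes zero-centered data.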