Phenotyping Notebook - KoellenspergerLab/MeXpose GitHub Wiki
Analysis Notebook
This wiki briefly explains the function and/or requirements of each cell of the Analysis Notebook. Note that the cell numbers in this wiki must not be confused with the execution count (shown in brackets to the left of a cell), which indicates the cell's position in the execution order.
Cell 1: Import Libraries and Set Up Environment
Functionality
This cell imports all the necessary Python libraries required for the notebook to function. These libraries support various operations like data manipulation, statistical analysis, image processing, and interactive widgets.
Libraries Imported
- Standard Libraries: `codecs`, `io`, `os` for basic Python operations.
- Data Manipulation: `pandas` (as `pd`) and `numpy` (as `np`) for data manipulation and mathematical operations.
- Data Visualization: `matplotlib.pyplot` (as `plt`), `plotly.express` (as `px`), and `seaborn` (as `sns`) for plotting and data visualization.
- Image Processing: `cv2` for image processing tasks.
- Interactive Widgets: `ipywidgets` (as `widgets`), `IPython.display` for interactive user interfaces.
- Dimensionality Reduction and Clustering: `umap`, `phenograph`, `sklearn.preprocessing` for machine learning tasks.
- Statistical Analysis: `scipy.stats` (as `stats`) for statistical functions.
- Others: `warnings` for suppressing warning messages, `RobustScaler`, `StandardScaler`, `MinMaxScaler` for data scaling, `numba.jit` for just-in-time compilation.
Special Commands
- `%matplotlib widget`: This magic command enables the interactive Matplotlib backend for Jupyter notebooks.
Additional Notes
Warning messages are optionally suppressed using the `warnings` library to make the notebook output cleaner.
Cell 2: Data Upload and Initialization
Functionality
Initializes widgets for data upload and sets up a function to read and format the uploaded CSV file.
Widgets
- `input_file`: Upload widget for single-cell CSV files.
  - Type: File (CSV)
- `working_directory`: Text box for the working directory path.
  - Type: Text
on_value_change(change)
Function: Updates the global variables `singlecell_df` and `PATH` and briefly displays the uploaded DataFrame.
Output
Displays the file upload and directory path widgets, along with a temporary view of the uploaded DataFrame.
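The reading step can be sketched as follows. This is a minimal sketch, assuming the upload widget hands over the file contents as raw bytes; `read_uploaded_csv` and the example column names are hypothetical and not taken from the notebook.

```python
import io

import pandas as pd


def read_uploaded_csv(content: bytes) -> pd.DataFrame:
    """Parse the raw bytes of an uploaded CSV file into a DataFrame."""
    # Wrap the bytes in an in-memory buffer so pandas can read them like a file.
    return pd.read_csv(io.BytesIO(content))


# Made-up single-cell data; the column names are illustrative only.
example_bytes = b"Object,Area,Channel_A\n1,120,3.5\n2,95,1.2\n"
example_df = read_uploaded_csv(example_bytes)
print(example_df.head())
```

In the notebook, the resulting DataFrame would be assigned to `singlecell_df` inside the widget callback.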
Cells 3 & 4: Raw Histogram Plotting and Saving
Functionality
Generates a raw histogram plot for a selected channel from the uploaded single-cell data. Allows saving of the plot.
Widgets
- `histogram_select`: Radio buttons to select a channel for the histogram.
  - Type: Radio Buttons (Options are DataFrame columns)
Functions
- `update_histogram(change, data, channel, cluster)`:
  - Updates and displays the histogram based on the selected channel.
  - Optional parameters `channel` and `cluster` for further filtering.
- Cell 4 contains a line of code for saving the histogram plot. It is commented out by default.
Output
Displays radio buttons for channel selection and the histogram plot side by side.
Additional Notes
To save the histogram, uncomment the line in Cell 4 and execute it.
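Plotting and saving a channel histogram might look roughly like the sketch below. It is not the notebook's `update_histogram` function: `plot_channel_histogram`, the bin count, the demo data, and the output file name are all assumptions, and the non-interactive `Agg` backend is used here only so the example runs outside Jupyter.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; the notebook uses %matplotlib widget
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_channel_histogram(data: pd.DataFrame, channel: str, bins: int = 50):
    """Draw a histogram of one channel and return the figure."""
    fig, ax = plt.subplots()
    ax.hist(data[channel].dropna(), bins=bins)
    ax.set_xlabel(channel)
    ax.set_ylabel("Number of cells")
    ax.set_title(f"Raw histogram: {channel}")
    return fig


# Made-up intensities for a hypothetical channel.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"Channel_A": rng.lognormal(size=500)})
fig = plot_channel_histogram(demo_df, "Channel_A")

# Saving corresponds to the commented-out line in Cell 4 (path chosen for the demo).
out_path = os.path.join(tempfile.mkdtemp(), "histogram_Channel_A.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```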
Cells 5 & 6: Channel Normalization and DataFrame Display
Functionality
Cell 5 sets up widgets for entering pixel size and selecting channels for normalization. Cell 6 performs the normalization based on the selected options.
Widgets
- `pixelsize_select`: Text box for entering the pixel size in µm.
  - Type: Text
- `normalise_channels`: Multiple select box for channels to normalize.
  - Type: Multiple Select (Options are DataFrame columns)
- `binary_select_norm`: Radio buttons to decide whether to display the modified DataFrame.
  - Type: Radio Buttons ("Yes", "No")
Functions
Cell 6 contains code that normalizes selected columns by the area of cell objects, also adjusting for pixel size if entered. It uses a copy of the original DataFrame, `normalised_df`, to store the normalized data.
Output
Displays an interface to select pixel size, channels for normalization, and an option to display the modified DataFrame.
Additional Notes
- If no pixel size is entered for normalization, a warning message will appear, and the original data will be used.
- To display the modified DataFrame, select "Yes" on the `binary_select_norm` widget.
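The area normalization can be illustrated with a short sketch. The exact formula used in Cell 6 is not spelled out here, so the following assumes that the area is converted from pixels to µm² with the squared pixel size and that each selected channel is divided by that area; `normalise_by_area` and the `Area` column name are assumptions.

```python
import pandas as pd


def normalise_by_area(df, channels, pixel_size_um, area_column="Area"):
    """Return a copy of df with the selected channels divided by cell area in µm²."""
    normalised = df.copy()  # leave the original DataFrame untouched
    # Assumed conversion: area in pixels times (pixel edge length in µm) squared.
    area_um2 = normalised[area_column] * pixel_size_um ** 2
    for channel in channels:
        normalised[channel] = normalised[channel] / area_um2
    return normalised


# Made-up data: two cells with equal signal but different size.
demo_df = pd.DataFrame({"Area": [100, 200], "Channel_A": [50.0, 50.0]})
normalised_demo = normalise_by_area(demo_df, ["Channel_A"], pixel_size_um=0.5)
print(normalised_demo)
```

The larger cell ends up with half the normalized signal of the smaller one, which is the point of normalizing by size.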
Cells 7, 8 & 9: Outlier Filtering and Data Display
Functionality
Cells 7 and 8 contain functions for outlier removal based on percentiles or Z-scores. Widgets for selecting the filtering method and channels are also provided. Cell 9 includes a line for saving the filtered DataFrame.
Widgets
- `filter_channels`: Multiple select box to choose channels for outlier filtering.
  - Type: Multiple Select (Options are DataFrame columns)
- `outlier_method`: Radio buttons to select the outlier filtering method.
  - Type: Radio Buttons ("Percentiles", "Z-Score")
- `binary_select_filter`: Radio buttons to choose whether to display the modified DataFrame.
  - Type: Radio Buttons ("Yes", "No")
Functions
- `remove_outliers_percentiles(df, columns)`: Removes outliers based on percentile values.
- `remove_outliers_zscore(df, columns, n_std)`: Removes outliers based on Z-scores.
Output
Displays widgets for channel selection, outlier method, and an option to display the modified DataFrame.
Additional Notes
- Cell 8 checks if the data was previously normalized and uses the appropriate DataFrame (`normalised_df` or `original_df`) for filtering.
- The number of filtered cell events will be printed.
- To save the filtered DataFrame to a CSV file, uncomment the line in Cell 9 and execute it.
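The two filtering strategies can be sketched as below. The function names follow the signatures listed above, but the bodies are assumptions rather than the notebook's code: the 1st/99th percentile cut-offs, the extra `lower`/`upper` keyword arguments, the default of three standard deviations, and the use of the population standard deviation are all illustrative choices.

```python
import pandas as pd


def remove_outliers_percentiles(df, columns, lower=0.01, upper=0.99):
    """Keep rows whose values lie within the [lower, upper] quantiles of every column."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        low, high = df[column].quantile([lower, upper])
        keep &= df[column].between(low, high)  # inclusive on both ends
    return df[keep]


def remove_outliers_zscore(df, columns, n_std=3.0):
    """Keep rows whose absolute z-score is at most n_std in every column."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        z = (df[column] - df[column].mean()) / df[column].std(ddof=0)
        keep &= z.abs() <= n_std
    return df[keep]


# Made-up channel with one extreme value.
demo_df = pd.DataFrame({"Channel_A": list(range(10, 30)) + [1000]})
by_z = remove_outliers_zscore(demo_df, ["Channel_A"])
by_pct = remove_outliers_percentiles(demo_df, ["Channel_A"])
print(f"Z-score removed {len(demo_df) - len(by_z)} cell events")
print(f"Percentiles removed {len(demo_df) - len(by_pct)} cell events")
```

Note that percentile filtering always trims both tails, whereas z-score filtering only removes values that are far from the mean.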
Cells 10 & 11: Plotting Histograms for Modified Data
Functionality
Cell 10 sets up widgets for selecting a data version (normalized or filtered) and a channel for histogram plotting. It then plots the selected histogram. Cell 11 includes a line for saving the histogram.
Widgets
- `histogram_select`: Radio buttons for selecting a channel for the histogram.
  - Type: Radio Buttons (Options are DataFrame columns)
- `hist_version_select`: Radio buttons for choosing the version of data (normalized or filtered) to use for plotting.
  - Type: Radio Buttons (Options depend on available data)
Functions
- `update_hist_version(change)`: Updates the histogram based on selected data version and channel.
Output
Displays radio buttons for data version and channel selection, along with the histogram plot.
Additional Notes
- The cell checks which versions of the data are available (normalized, filtered) and adjusts the widget options accordingly.
- To save the histogram, uncomment the line in Cell 11 and execute it.
Cell 12: Data Selection for Downstream Analysis
Functionality
Determines the version of data (filtered, normalized, or original) to be used for downstream analysis and informs the user.
Output
Prints a message indicating which version of the data will be used for downstream analysis.
Additional Notes
- If outlier filtering has been performed, `filtered_df` will be used.
- If only size normalization has been done, `normalised_df` will be used.
- If neither has been performed, the original `singlecell_df` will be used.
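The precedence described above can be written as a small helper. This is only a sketch: `select_downstream_data` is a hypothetical name, and representing a step that has not been run as `None` is an assumption.

```python
import pandas as pd


def select_downstream_data(singlecell_df, normalised_df=None, filtered_df=None):
    """Return (name, DataFrame) for the most processed data version available."""
    if filtered_df is not None:
        return "filtered_df", filtered_df
    if normalised_df is not None:
        return "normalised_df", normalised_df
    return "singlecell_df", singlecell_df


# Made-up data where only normalization has been performed.
raw = pd.DataFrame({"Channel_A": [1.0, 2.0, 3.0]})
name, analysis_df = select_downstream_data(raw, normalised_df=raw / 2)
print(f"Using {name} for downstream analysis.")
```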
Cell 13: Dynamic Heatmap Channel Visualization on Raw Image
Functionality
This cell creates a dynamic heatmap channel overlay on a raw image. The heatmap colors indicate the intensity of the selected channel. The user uploads a .png segmentation mask and a raw or processed sample image (.tiff format).
Widgets
- `mask_upload`: Upload widget for the segmentation mask.
  - Type: File (PNG)
- `overlay_image_upload`: Upload widget for the raw or processed image.
  - Type: File (TIFF)
- `column_select`: Dropdown menu for selecting the channel to visualize.
  - Type: Dropdown (Options are DataFrame columns)
- `load_data_button`: Button to load the uploaded data and visualize the heatmap.
  - Type: Button
Functions
- `read_uploaded_file(upload_widget)`: Reads uploaded file into a NumPy array.
- `update_color_mask_numba(img, color_mask, cell_ids, cell_colors)`: Updates the color mask based on selected channel intensities.
- `plot_heatmap(...)`: Plots the heatmap based on various parameters like transparency and channel intensity limits.
- `load_data(btn)`: Main function that initializes the heatmap plotting.
Sliders
- `Limits`: Slider to adjust the upper and lower thresholds for channel intensity.
- `Transparency`: Slider to adjust the transparency of the overlaid channel intensities.
Output
Displays the upload widgets, the dropdown for channel selection, and the 'Load data and visualize' button. Once the button is clicked, the heatmap overlay on the raw image will be displayed along with sliders to adjust transparency and intensity limits.
Additional Notes
- Move the 'Limits' slider to adjust upper and lower thresholds for channel intensity.
- Move the 'Transparency' slider to adjust the transparency of the overlaid cell intensities.
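How per-cell intensities could be turned into a colored overlay is sketched below. This is not the notebook's `update_color_mask_numba`: instead of a numba-compiled loop it uses a plain NumPy lookup table, a simple grey ramp stands in for the actual colormap, and both function names are hypothetical. The sketch assumes the segmentation mask is an integer label image in which 0 is background.

```python
import numpy as np


def intensities_to_rgba(values, vmin, vmax, alpha):
    """Clip intensities to [vmin, vmax], rescale to [0, 1], and build RGBA rows."""
    values = np.asarray(values, dtype=np.float32)
    scaled = np.clip((values - vmin) / (vmax - vmin), 0.0, 1.0)
    # Grey ramp as a stand-in colormap; alpha plays the role of the transparency slider.
    return np.stack([scaled, scaled, scaled, np.full_like(scaled, alpha)], axis=1)


def build_color_mask(mask, cell_ids, cell_colors):
    """Map every labelled pixel to its cell's RGBA color; background stays transparent."""
    size = int(max(mask.max(), np.max(cell_ids))) + 1
    lut = np.zeros((size, 4), dtype=np.float32)  # row 0 (background) stays all zeros
    lut[cell_ids] = cell_colors
    return lut[mask]  # fancy indexing expands (H, W) labels to (H, W, 4) colors


# Made-up 2x2 label image with two cells.
mask = np.array([[0, 1], [2, 2]])
cell_ids = np.array([1, 2])
cell_colors = intensities_to_rgba([5.0, 20.0], vmin=0.0, vmax=10.0, alpha=0.5)
color_mask = build_color_mask(mask, cell_ids, cell_colors)
print(color_mask.shape)
```

Changing `vmin`/`vmax` corresponds to moving the 'Limits' slider: intensities outside the range saturate at the ends of the color scale.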
Cells 14 & 15: Channel Selection and Data Scaling for Clustering
Functionality
- Cell 14: Provides a multiple select box to choose channels for clustering.
- Cell 15: Scales the selected data to prepare it for clustering. Two scaling methods are available: Robust scaling and MinMax scaling.
Widgets
- `cluster_channels`: Multiple select box for channels to include in clustering.
  - Type: Multiple Select (Options are DataFrame columns)
Functions
Cell 15 contains code that scales the data using either Robust scaling or MinMax scaling. The scaling method to use can be selected by commenting/uncommenting the respective lines.
Output
- Cell 14: Displays the multiple select box for channel selection.
- Cell 15: Scales the data based on the selected method and channels.
Additional Notes
- Robust scaling is more resilient to outliers and is the default scaling method. To switch to MinMax scaling, uncomment the relevant lines in Cell 15.
- For more information on data scaling methods, refer to the scikit-learn documentation.
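The difference between the two scaling options can be seen on a toy example. This is a sketch with made-up data and hypothetical channel names; it uses scikit-learn's `RobustScaler` and `MinMaxScaler` with their default settings, which may differ from the settings in Cell 15.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up data; Channel_A contains one extreme value.
demo_df = pd.DataFrame(
    {
        "Channel_A": [1.0, 2.0, 3.0, 4.0, 100.0],
        "Channel_B": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)
selected_channels = ["Channel_A", "Channel_B"]

# Robust scaling: subtract the median and divide by the interquartile range.
robust_df = pd.DataFrame(
    RobustScaler().fit_transform(demo_df[selected_channels]),
    columns=selected_channels,
    index=demo_df.index,
)

# MinMax scaling: rescale each channel to the range [0, 1].
minmax_df = pd.DataFrame(
    MinMaxScaler().fit_transform(demo_df[selected_channels]),
    columns=selected_channels,
    index=demo_df.index,
)

print(robust_df)
print(minmax_df)
```

With MinMax scaling the extreme value in `Channel_A` squeezes the remaining cells into a narrow band near 0, while robust scaling keeps them spread out around the median.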
Cell 16: PhenoGraph Clustering
Functionality
Runs PhenoGraph clustering on the scaled data based on selected parameters. Provides widgets to input the clustering parameters.
Widgets
- `k_text`: Text box to input the k-value for PhenoGraph.
  - Type: Text (Integer, Placeholder: 30)
- `resolution_parameter_text`: Text box to input the resolution parameter for Leiden clustering.
  - Type: Text (Float, Placeholder: 1.0)
- `seed_text`: Text box to input the initial seed for randomization.
  - Type: Text (Integer or empty, Placeholder: 42)
- `update_button`: Button to execute the clustering algorithm.
  - Type: Button
Functions
- `run_clustering(button)`: Callback function that runs the PhenoGraph clustering algorithm with the given parameters.
Output
Displays text boxes for inputting k-value, resolution parameter, and initial seed, along with a button to run the clustering algorithm.
Additional Notes
- For more information on PhenoGraph and its parameters, refer to the PhenoGraph GitHub repository and the corresponding publications: Phenograph and Leiden algorithm.
- Leaving the 'Initial seed' parameter unset will result in a random starting seed, leading to slightly different results each run.
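Because the parameters arrive as text, they have to be converted before clustering. The sketch below shows one way to do this; `parse_clustering_params` is a hypothetical helper, and falling back to 30 and 1.0 for empty boxes is an assumption. The commented-out call reflects the keyword arguments of `phenograph.cluster` as I understand them and should be checked against the installed PhenoGraph version.

```python
def parse_clustering_params(k_text, resolution_text, seed_text):
    """Convert text-box contents to typed values for the clustering call."""
    k = int(k_text) if k_text.strip() else 30  # assumed fallback
    resolution = float(resolution_text) if resolution_text.strip() else 1.0  # assumed fallback
    # An empty seed box means no fixed seed, so results vary slightly between runs.
    seed = int(seed_text) if seed_text.strip() else None
    return k, resolution, seed


k, resolution, seed = parse_clustering_params("30", "1.0", "")
print(k, resolution, seed)

# Assumed usage with PhenoGraph (not executed here):
# communities, graph, Q = phenograph.cluster(
#     scaled_data, k=k, clustering_algo="leiden",
#     resolution_parameter=resolution, seed=seed,
# )
```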
Cell 17: UMAP Dimensionality Reduction and Visualization
Functionality
Executes Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction on the scaled and clustered data. It then visualizes the UMAP embedding, coloring the points based on cluster assignments.
Functions
- Uses the UMAP algorithm from the `umap-learn` library to perform dimensionality reduction.
- The UMAP embedding is converted to a DataFrame and visualized using Plotly Express.
Output
Displays a scatter plot of the UMAP embedding, where each point represents a cell and the color indicates the cluster assignment.
Additional Notes
- Default settings for UMAP's hyperparameters are used. For custom settings, you may uncomment the cell below.
- For more information on UMAP and its parameters, refer to the UMAP documentation and the corresponding publication.
- Leaving the 'random initiation state' parameter unset will result in a random starting state, causing slightly different results each run.
Cell 18: Generate Cluster Heatmap Data
Functionality
Prepares the data for generating a cluster heatmap by calculating z-scores for each channel's median (or mean) values for every cluster.
Functions
- Creates a list of DataFrames, each representing a unique cluster.
- Calculates the median (or mean, if uncommented) of each channel for every cluster.
- Applies z-score scaling to these median (or mean) values for better heatmap visualization.
Output
Creates a DataFrame `heatmap_clusters` containing the z-scored median (or mean) values for each channel and cluster.
Additional Notes
Uncomment the section for using mean values instead of median for heatmap calculation.
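The computation can be sketched with pandas alone. This is a minimal sketch, assuming the cluster labels live in a column named `cluster` and that the z-score is taken per channel across the cluster summaries using the population standard deviation; `cluster_heatmap_data` is a hypothetical helper, not the notebook's code.

```python
import pandas as pd


def cluster_heatmap_data(df, channels, cluster_column="cluster", use_mean=False):
    """Summarize each channel per cluster, then z-score each channel across clusters."""
    grouped = df.groupby(cluster_column)[channels]
    summary = grouped.mean() if use_mean else grouped.median()
    # Column-wise z-score: how far each cluster lies from the average cluster.
    return (summary - summary.mean()) / summary.std(ddof=0)


# Made-up data with three clusters of two cells each.
demo_df = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "Channel_A": [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
    }
)
heatmap_clusters = cluster_heatmap_data(demo_df, ["Channel_A"])
print(heatmap_clusters)
```

Rows of the result are clusters and columns are channels, which is the layout a Seaborn heatmap expects.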
Cell 19: Interactive Cluster Heatmap
Functionality
Displays an interactive widget to select channels and update the cluster heatmap. The heatmap shows how each cluster's median (or mean) values differ from the global median (or mean) for each selected channel.
Functions
- Uses Seaborn to plot the heatmap.
- Allows users to select channels to include in the heatmap and to specify a title.
Output
Displays an interactive widget that enables users to select channels, enter a title, and generate an updated heatmap.
Additional Notes
Click the "Update Heatmap" button to refresh the heatmap based on the selected channels.
Cell 20: Save Cluster Heatmap
Functionality
Code for saving the cluster heatmap to a PNG file is provided.
Functions
Utilizes Matplotlib's `savefig` method to save the heatmap.
Additional Notes
The line for saving the heatmap is commented out by default. Uncomment it to execute the save operation.
Cell 21: Prepare non-Normalised Data for Cluster and Channel-based Histograms
Functionality
Prepares a DataFrame named `export_df` to be used for plotting histograms based on selected clusters and channels. It reverts the size normalization if pixel size was initially set.
Functions
- Copies `filtered_df` into `export_df`.
- Divides the 'Area' and 'EquivalentDiameter' columns by the square and the first power of the pixel size, respectively, if pixel size was provided.
Additional Notes
This code is commented out by default. Uncomment it to execute.
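The reversal described above amounts to two divisions. In this sketch `revert_size_units` is a hypothetical helper and the demo values are made up; only the column names 'Area' and 'EquivalentDiameter' come from the description above.

```python
import pandas as pd


def revert_size_units(df, pixel_size_um=None):
    """Return a copy of df with size columns converted back from µm to pixel units."""
    export = df.copy()
    if pixel_size_um:  # skip the conversion when no pixel size was provided
        export["Area"] = export["Area"] / pixel_size_um ** 2
        export["EquivalentDiameter"] = export["EquivalentDiameter"] / pixel_size_um
    return export


# Made-up filtered data with sizes in µm² and µm.
filtered_demo = pd.DataFrame({"Area": [25.0, 50.0], "EquivalentDiameter": [5.0, 8.0]})
export_df = revert_size_units(filtered_demo, pixel_size_um=0.5)
print(export_df)
```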
Cell 22: Interactive Cluster and Channel-based Histograms
Functionality
Displays an interactive widget that allows users to choose a specific cluster and channel to visualize in a histogram.
Functions
- Creates RadioButtons for selecting a cluster and a channel.
- Observes changes in the widgets and updates the histogram accordingly.
Output
Displays an interactive widget for cluster and channel selection. A histogram is generated based on the selected cluster and channel.
Cell 23: Save Cluster and Channel-based Histogram
Functionality
Provides a code snippet to save the generated histogram to a PNG file.
Functions
Utilizes Matplotlib's `savefig` method to save the histogram.
Additional Notes
The line for saving the histogram is commented out by default. Uncomment it to execute the save operation.
Cell 24: Select Clusters for CSV Export
Functionality
- Displays an interactive widget that allows users to select multiple clusters that will be saved as CSV files.
Functions
- Creates a `SelectMultiple` widget with options populated from the unique clusters in `export_df`.
Output
- Displays an interactive widget for cluster selection.
Cell 25: Save Selected Clusters as CSV Files
Functionality
- Provides code snippets to save the data of selected clusters to individual CSV files.
Functions
- Iterates through the selected clusters and filters `export_df` to only contain rows corresponding to each selected cluster.
- Utilizes Pandas' `to_csv` method to save each filtered DataFrame to a CSV file.
Additional Notes
- The lines for saving the data to CSV files are commented out by default. Uncomment them to execute the save operation.
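The export loop can be sketched as follows. The `cluster` column name, the file naming scheme, and the temporary output directory are assumptions made for the example; in the notebook the selection would come from the `SelectMultiple` widget and the files would go to the working directory.

```python
import os
import tempfile

import pandas as pd

# Made-up export data with a hypothetical cluster column.
export_demo = pd.DataFrame(
    {"cluster": [0, 0, 1, 2], "Channel_A": [1.0, 2.0, 3.0, 4.0]}
)
selected_clusters = [0, 2]  # stands in for the widget's selected values
output_dir = tempfile.mkdtemp()

saved_paths = []
for cluster in selected_clusters:
    # Keep only the rows that belong to the current cluster.
    cluster_df = export_demo[export_demo["cluster"] == cluster]
    path = os.path.join(output_dir, f"cluster_{cluster}.csv")
    cluster_df.to_csv(path, index=False)
    saved_paths.append(path)

print(saved_paths)
```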
Cell 26: Dynamic Heatmap Channel Visualization on Raw Image (for uploaded cluster data)
Functionality
This cell creates a dynamic heatmap channel overlay on a raw image. The heatmap colors indicate the intensity of the selected channel. The user uploads a .png segmentation mask and a raw or processed sample image (.tiff format).
Widgets
- `mask_upload`: Upload widget for the segmentation mask.
  - Type: File (PNG)
- `overlay_image_upload`: Upload widget for the raw or processed image.
  - Type: File (TIFF)
- `column_select`: Dropdown menu for selecting the channel to visualize.
  - Type: Dropdown (Options are DataFrame columns)
- `load_data_button`: Button to load the uploaded data and visualize the heatmap.
  - Type: Button
Functions
- `read_uploaded_file(upload_widget)`: Reads uploaded file into a NumPy array.
- `update_color_mask_numba(img, color_mask, cell_ids, cell_colors)`: Updates the color mask based on selected channel intensities.
- `plot_heatmap(...)`: Plots the heatmap based on various parameters like transparency and channel intensity limits.
- `load_data(btn)`: Main function that initializes the heatmap plotting.
Sliders
- `Limits`: Slider to adjust the upper and lower thresholds for channel intensity.
- `Transparency`: Slider to adjust the transparency of the overlaid channel intensities.
Output
Displays the upload widgets, the dropdown for channel selection, and the 'Load data and visualize' button. Once the button is clicked, the heatmap overlay on the raw image will be displayed along with sliders to adjust transparency and intensity limits.
Additional Notes
- Move the 'Limits' slider to adjust upper and lower thresholds for channel intensity.
- Move the 'Transparency' slider to adjust the transparency of the overlaid cell intensities.