MLflow in Bioinformatics

Leveraging MLflow for Experiment Tracking in Bioinformatics





Image credit: mlflow.org


Workshop Schedule

Date: October 29 | Time: 3:00 PM – 4:00 PM

Important

🕐 Schedule

  • 3:00pm-3:10pm: Welcome and introduction to MLflow
  • 3:10pm-3:20pm: The importance of experiment tracking in bioinformatics
  • 3:20pm-3:50pm: Hands-on session with MLflow in Jupyter Notebook
  • 3:50pm-4:00pm: Resources and closing remarks

Important

Requirements

  • Basic knowledge of Python and machine learning concepts
  • Familiarity with Jupyter Notebook
  • Access to Google Colab, CyVerse, or your local machine
  • GitHub account (optional for accessing the notebook)
  • Anaconda or Python 3.8+ installed (if using a local machine)

Important

Expected Outcomes

  • Understand the role of MLflow in managing machine learning experiments
  • Learn how to set up MLflow for experiment tracking
  • Gain hands-on experience logging parameters, metrics, and artifacts
  • Apply MLflow to a bioinformatics machine learning task
  • Explore ways to improve reproducibility and collaboration in research projects



Topic Overview

In this workshop, we'll look at how machine learning (ML) lifecycle managers can speed up our work. Data science (DS) projects are, by definition, interdisciplinary, comprising teams with diverse levels of computing competence and experience. Managing the data and analysis lifecycle is crucial to the reproducibility and sustainability of any DS effort: a project must keep track of the results of exhaustive testing, along with the associated parameters, metrics, artifacts, source code, and package dependencies. Individuals or teams of data scientists can use tools like MLflow to create robust and reproducible machine learning pipelines. Because it supports a wide range of ML frameworks and languages, including Python, R, and Java, MLflow is designed to be easily incorporated into existing systems. Its tracking feature lets developers record every aspect of an experiment or model, from the code version to the model's parameters and metrics, and it works with widely used environment management tools such as Conda and Docker. Using the tracking UI, a data scientist can readily inspect and reproduce tracked runs. Finally, MLflow is an open-source project, which helps ensure the tool's continued availability.


Detailed Introduction

Data science (DS) projects are intrinsically interdisciplinary, involving teams with varied degrees of computational knowledge and experience with data management. The underlying Machine Learning (ML) methods and analysis workflows used in DS projects are frequently built from constantly evolving open-source software stacks, with analysis tasks executed on diverse computational infrastructure (workstations, HPC, Cloud, etc.). Managing the data and analysis lifecycle is critical for the reproducibility and long-term sustainability of any DS project: it must maintain a record of the results of thorough testing, as well as the parameters, metrics, artifacts, source code, and package dependencies connected with them. Many team members must also be able to navigate this data, which calls for a platform-independent framework and a robust model provenance system (storage, versioning, reproducibility). Most of us have had to rebuild or retrieve an older model to answer a colleague's or reviewer's question, only to find we no longer know the parameters it was trained with. Model development, iterative experimentation, and deployment can be accelerated with ML lifecycle management tools such as MLflow. They enable individuals or teams of data scientists to build robust and repeatable machine learning pipelines with platform-agnostic model packaging, deployment, versioning, and quality assurance, all without giving up their preferred programming language or library, which is critical to the productivity of any DS project.

Challenges in (Traditional) ML Development

Typical machine learning projects need to track a diverse set of inputs and results. We frequently run a large number of experiments and must keep track of not only the outputs but also the parameters, code, models, and artifacts. In addition, most machine learning projects involve several team members who need access to both the experiments and the most recent code versions, which demands a significant degree of coordination among everyone involved.

How MLflow Can Help

Custom machine learning platforms, such as Facebook's FBLearner and Uber's Michelangelo, exist to address these challenges, but they are not publicly available and are tailored to those companies' unique requirements. MLflow's project component provides a platform-independent environment that lets any team member, regardless of the target system, access the project and, if necessary, replicate any experiment. Because of its extensive support for a range of machine learning frameworks and languages, including Python, R, and Java, MLflow is designed to be readily integrated into existing applications, and the project component's user-friendly design and accessibility suit a diverse range of teams. The tracking function enables developers to save every piece of information associated with an experiment or model, from the code version to the model's parameters and metrics. The model registry simplifies deploying models or accessing models already in production, and developers can use it to monitor the performance of deployed models over time. MLflow is also compatible with widely used environment management frameworks such as Conda and Docker, the latter of which can be quickly deployed to multi-cluster production sites with a few Dockerfile changes.


Welcome everyone to today's workshop on leveraging MLflow for experiment tracking in bioinformatics. Over the next hour, we'll dive into how MLflow can enhance your bioinformatics workflows by improving reproducibility and collaboration.

What is MLflow?

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. It addresses four key needs through four components:

  1. Experiment Tracking (MLflow Tracking): Recording and querying experiments, including code, data, parameters, metrics, and results.
  2. Reproducible Code Packaging (MLflow Projects): Packaging ML code in a reusable, reproducible form to share with others or transfer to production.
  3. Model Packaging and Deployment (MLflow Models): A standard format for packaging and deploying models from a variety of ML libraries.
  4. Model Registry (MLflow Model Registry): A collaborative hub to share, version, and manage ML models.
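
As a quick preview of the tracking component, here is a minimal sketch of what logging a run looks like; the parameter values and file name are placeholders, and the hands-on session below walks through a real example.

import mlflow

# Everything inside start_run() is recorded as a single run in the active experiment
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)          # a hyperparameter
    mlflow.log_metric("accuracy", 0.95)            # a result
    mlflow.log_artifact("confusion_matrix.png")    # any output file (must exist on disk)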

Why MLflow in Bioinformatics?

In bioinformatics, experiments can become quite complex, involving large datasets and intricate pipelines. Reproducibility is critical. MLflow helps in:

  • Reproducibility: By tracking parameters, versions, and outputs, ensuring experiments can be replicated.
  • Collaboration: Facilitating sharing of results and models among team members.
  • Scalability: Managing multiple experiments efficiently.
  • Integration: Working seamlessly with popular ML libraries and existing workflows.

Getting Started

To participate in the hands-on session, please access the Jupyter Notebook we'll be working through, using one of the options below.

Setup Environment

Ensure you have all the required libraries installed: pandas, numpy, scikit-learn, matplotlib, seaborn, and mlflow.

Option 1: Google Colab

  • Open the notebook directly in Google Colab:

    Open in Colab

  • Ensure you have a Google account to access Colab.

  • Install MLflow and Required Libraries by running the following cell at the beginning of your notebook:

    !pip install mlflow pandas scikit-learn numpy matplotlib seaborn

Option 2: CyVerse

  • Log in to your CyVerse account.

  • Launch the JupyterLab application from the CyVerse Discovery Environment.

  • Upload the notebook file MLflow_Bioinformatics_Workshop.ipynb to your workspace.

  • Open the notebook in JupyterLab.

  • Install MLflow and Required Libraries by running the following cell at the beginning of your notebook:

    !pip install mlflow pandas scikit-learn numpy matplotlib seaborn

Option 3: Local Machine

If you're using your local machine, make sure you have Python 3.8+ and Jupyter Notebook or JupyterLab installed. You can use Anaconda to manage your Python environments.

Steps:
  1. Install Anaconda (if not already installed):

  2. Create a Conda Environment:

    conda create -n mlflow_env python=3.8
  3. Activate the Environment:

    conda activate mlflow_env
  4. Install Required Libraries:

    pip install mlflow pandas scikit-learn numpy matplotlib seaborn
  5. Launch Jupyter Notebook:

    jupyter notebook
  6. Open the Notebook:

    • Navigate to the directory containing MLflow_Bioinformatics_Workshop.ipynb and open it.
  7. Ensure the Kernel is Set Correctly:

    • In the notebook, select the kernel associated with mlflow_env.

Hands-On Session: Experiment Tracking with MLflow

Overview

In this hands-on session, we'll be working with a real dataset to build a machine learning model and use MLflow to track our experiments. This will give you practical experience with how MLflow can be integrated into your bioinformatics workflows.

Dataset

We will use the Breast Cancer Wisconsin Dataset, which is a classic dataset for binary classification tasks in bioinformatics.

Step-by-Step Guide

1. Import Libraries

First, let's import the necessary libraries.

import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, classification_report
import matplotlib.pyplot as plt
import seaborn as sns  # used later for the confusion matrix and performance plots
%matplotlib inline

2. Load and Explore the Dataset

We'll load the dataset and take a quick look at it.

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display the first few rows
df.head()

Explanation:

  • The dataset contains features computed from digitized images of breast masses.
  • The target variable indicates whether the tumor is malignant or benign (in scikit-learn's encoding, 0 = malignant and 1 = benign).

3. Data Preprocessing

Let's prepare the data for modeling. We'll separate the features and the target, and then split the data into training and testing sets.

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

  • We perform standard scaling to normalize the feature values, fitting the scaler on the training set only and applying the same transformation to the test set to avoid data leakage.

4. Set Up MLflow Experiment

Now, let's set up an MLflow experiment to start tracking our runs.

mlflow.set_experiment("Breast_Cancer_Classification")
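
By default, MLflow stores runs in a local ./mlruns directory next to the notebook, which is all we need for this workshop. If you later want to log to a shared tracking server instead, you can point MLflow at it before calling set_experiment; this is a minimal sketch, and the URL is a placeholder for whatever server you run.

# Optional: log to a shared tracking server instead of the default local ./mlruns folder.
# Skip this cell to keep the local file store used in this workshop.
mlflow.set_tracking_uri("http://localhost:5000")  # placeholder URL for your own server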

5. Define and Run Experiments

We'll define a function to train the model and log the parameters and metrics using MLflow. We'll run experiments with different hyperparameters.

def run_experiment(n_estimators, max_depth):
    with mlflow.start_run():
        # Log parameters
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)

        # Initialize and train the model
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]

        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_proba)

        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("roc_auc", roc_auc)

        # Log model
        mlflow.sklearn.log_model(model, "random_forest_model")
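        # Optional (not required for this workshop): a model signature recording the
        # expected input/output schema can also be attached when logging, e.g.
        #   from sklearn.metrics import confusion_matrix  # (already imported below)
        #   from mlflow.models.signature import infer_signature
        #   signature = infer_signature(X_train, model.predict(X_train))
        #   mlflow.sklearn.log_model(model, "random_forest_model", signature=signature)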

        # Log confusion matrix as an artifact
        from sklearn.metrics import confusion_matrix
        import seaborn as sns
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(6,4))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title('Confusion Matrix')
        plt.ylabel('Actual')
        plt.xlabel('Predicted')
        plt.savefig("confusion_matrix.png")
        mlflow.log_artifact("confusion_matrix.png")
        plt.close()

        # Print metrics
        print(f"Parameters: n_estimators={n_estimators}, max_depth={max_depth}")
        print(f"Accuracy: {accuracy:.4f}, ROC AUC: {roc_auc:.4f}")

# Run experiments with different hyperparameters
n_estimators_list = [50, 100, 150]
max_depth_list = [5, 10, 15]

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        run_experiment(n_estimators, max_depth)

In this loop, we're testing different combinations of n_estimators and max_depth to see how they affect the model's performance. MLflow will track each run separately.
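
If you're running locally, you can also browse these runs in MLflow's web UI; launching it from a terminal (or a notebook cell, as sketched below) serves the interface at http://localhost:5000 by default.

# Launch the MLflow tracking UI (reads the local ./mlruns directory, served on port 5000).
# Note: on Google Colab this requires extra tunneling to view, so the API approach below works everywhere.
!mlflow ui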

6. Viewing Experiment Results

Since we're running in a notebook environment, we'll use MLflow's API to retrieve and display the logged results.

# Retrieve experiment ID
experiment_id = mlflow.get_experiment_by_name("Breast_Cancer_Classification").experiment_id

# Fetch all runs from the experiment (search_runs expects a list of experiment IDs)
runs = mlflow.search_runs(experiment_ids=[experiment_id])

# Keep the columns of interest and sort by ROC AUC
runs_df = runs[["run_id", "params.n_estimators", "params.max_depth", "metrics.accuracy", "metrics.roc_auc"]].copy()
runs_df = runs_df.sort_values(by="metrics.roc_auc", ascending=False).reset_index(drop=True)
runs_df

Explanation:

  • We create a DataFrame to organize and sort the experiment runs based on the ROC AUC score.

7. Analyze Results

Let's analyze the results to identify the best-performing model based on ROC AUC score.

# Convert parameter columns to numeric
runs_df['params.n_estimators'] = runs_df['params.n_estimators'].astype(int)
runs_df['params.max_depth'] = runs_df['params.max_depth'].astype(int)

# Plot performance
plt.figure(figsize=(10,6))
sns.lineplot(data=runs_df, x='params.n_estimators', y='metrics.roc_auc', hue='params.max_depth', marker='o')
plt.title('Model Performance by Hyperparameters')
plt.xlabel('Number of Estimators')
plt.ylabel('ROC AUC Score')
plt.legend(title='Max Depth')
plt.show()

Explanation:

  • The plot helps visualize how different hyperparameters affect the model's performance.

8. Load and Evaluate the Best Model

We'll load the best model and evaluate it further.

# Get the best run ID
best_run_id = runs_df.iloc[0]['run_id']

# Load the model
best_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/random_forest_model")

# Make predictions
y_pred_best = best_model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred_best))

Explanation:

  • The classification report provides detailed performance metrics like precision, recall, and F1-score.

9. Interpret Results

Discuss the performance metrics and what they mean in the context of our classification task.

  • Accuracy: The proportion of correct predictions.
  • Precision: The accuracy of positive predictions.
  • Recall: The ability of the model to find all positive instances.
  • F1-Score: The harmonic mean of precision and recall.
  • ROC AUC: Measures the model's ability to distinguish between classes.
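
To connect these definitions back to the code, here is a small sketch computing the individual metrics for the best model, reusing y_test and y_pred_best from the previous step.

from sklearn.metrics import precision_score, recall_score, f1_score

# Per-metric scores for the best model; with this dataset's encoding, class 1 (benign) is the positive class
precision = precision_score(y_test, y_pred_best)
recall = recall_score(y_test, y_pred_best)
f1 = f1_score(y_test, y_pred_best)

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")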

Closing Remarks and Resources

We've seen how MLflow can be integrated into a machine learning workflow to track experiments, log parameters and metrics, and manage models. This not only enhances reproducibility but also makes collaboration easier.

Additional Resources


Next Steps

To continue your journey with MLflow and experiment tracking in bioinformatics:

  • Experiment with Different Models: Try other algorithms like Support Vector Machines, Gradient Boosting, etc.
  • Incorporate Data Preprocessing Steps: Log data preprocessing and feature engineering steps.
  • Use MLflow Projects: Package your code for reproducibility.
  • Set Up MLflow Tracking Server: For collaborative work and centralized tracking (see the sketch after this list).
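
As a starting point for the last item, a shared tracking server can be launched with the mlflow CLI and then targeted from your notebooks. This is a minimal sketch: the SQLite backend, artifact directory, host, port, and server address below are illustrative placeholders, not requirements.

# 1) Start a shared tracking server (run this in a terminal; it is a long-running process).
#    Backend store and artifact location here are placeholder choices:
#      mlflow server --backend-store-uri sqlite:///mlflow.db \
#                    --default-artifact-root ./mlruns-artifacts \
#                    --host 0.0.0.0 --port 5000

# 2) In each notebook, point MLflow at the shared server before logging runs:
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")  # replace with your server's address
mlflow.set_experiment("Breast_Cancer_Classification")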

Notebook Access


Common Issues and Troubleshooting

Before we wrap up, let's discuss some common issues you might encounter:

  • Environment Conflicts: Ensure that MLflow and other libraries are installed in the correct environment.
  • File Permissions: Check read/write permissions when logging artifacts.
  • Version Mismatch: Use compatible versions of MLflow and other dependencies.
  • Large Artifacts: Be cautious when logging large files; consider using remote storage solutions if needed.
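
If you suspect an environment conflict, a quick sanity check is to confirm which Python interpreter and MLflow version the notebook is actually using; a small sketch:

import sys
import mlflow

# Print the active interpreter path and the installed MLflow version
print(sys.executable)
print(mlflow.__version__)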

Q&A and Closing

Thank you for participating in today's workshop. I hope you found it informative and that you're excited to incorporate MLflow into your bioinformatics projects. Let's open the floor for any questions, and feel free to reach out afterwards with questions or feedback.


Remember, MLflow is a powerful tool that can significantly improve the way we manage machine learning experiments in bioinformatics. By incorporating experiment tracking into your workflows, you'll enhance the reproducibility and reliability of your research.


Additional Note:

Please make sure to save your work and close any running sessions. If you're interested in further workshops or resources, check out our GitHub page or reach out to the Data Lab team.
