Google Summer of Code 2024 Wrap-Up
The cBioPortal for Cancer Genomics is a resource that provides broad community access to cancer genomic data. The user-friendly and biology-centric interface helps to make genomic data more easily accessible and interpretable by translational scientists, biologists, and clinicians. The public instance of cBioPortal is one of the most popular online resources for cancer genomics data and attracts more than 3,000 unique visitors daily (cancer researchers and clinicians). In addition, there are dozens of local instances installed in medical centers, universities, government institutions, and pharmaceutical companies around the globe.
The cBioPortal's open-source community is central to its success, fostering collaboration and innovation. The Google Summer of Code (GSoC) program is key in nurturing this community by attracting talented contributors who leverage their skills and perspectives to enhance cBioPortal. GSoC provides a valuable opportunity for knowledge exchange, mentorship, and developing new features that benefit cancer research.
We are thrilled to share the achievements of our Google Summer of Code (GSoC) 2024 contributors, who dedicated their summer to advancing cBioPortal’s functionality. These ten talented individuals embarked on projects that spanned visualization enhancements, data integration, and pipeline development.
Project 1: Automated Curation and Harmonization of cBioPortal Clinical Metadata using Sentence Transformers
- Student: Abhilash Dhal
- Mentors: Sehyun Oh, Jonathan Davenport, Michele Waters
- Project Summary: This project aims to develop automated tools for harmonizing and standardizing clinical metadata in cBioPortal. We established the framework for the tool's two main components: schema mapping and ontology mapping. The schema mapper harmonizes column names across different studies using frequency- and transformer-based approaches. The frequency-based methods achieved over 80% accuracy on test data. For ontology mapping, a three-stage framework (Figure) was implemented, incorporating exact matching, language model (LM)-based matching, and large language model (LLM)-based matching. We tested various BERT models, with SAP-BERT showing the best performance across different categories (treatment_name, body site, and disease), achieving over 80% accuracy for the top 5 matches. These outcomes represent significant progress in automating metadata harmonization for cBioPortal, with potential for further improvement through fine-tuning models and implementing more advanced NLP techniques, ultimately enhancing the FAIRness and AI/ML-readiness of cBioPortal data across studies.
- Code Link
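The staged matching and the top-5 accuracy measure described in this summary can be illustrated with a minimal sketch. It is not the project's code: the `similarity` function uses the standard library's `difflib` as a stand-in for embedding similarity from a BERT-style model, the LLM stage is left out, and the `ONTOLOGY` and `QUERIES` values are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for LM embedding similarity (assumption: plain string similarity)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(term: str, ontology_terms: list[str], k: int = 5) -> list[str]:
    """Stage 1: exact match. Stage 2: similarity-ranked top-k candidates."""
    normalized = term.strip().lower()
    exact = [t for t in ontology_terms if t.lower() == normalized]
    if exact:
        return exact[:k]
    ranked = sorted(ontology_terms, key=lambda t: similarity(normalized, t), reverse=True)
    return ranked[:k]


def top_k_accuracy(queries: dict[str, str], ontology_terms: list[str], k: int = 5) -> float:
    """Fraction of queries whose expected term appears among the top-k candidates."""
    hits = sum(
        expected in rank_candidates(raw, ontology_terms, k)
        for raw, expected in queries.items()
    )
    return hits / len(queries)


# Hypothetical ontology terms and raw metadata values (with typos) for illustration.
ONTOLOGY = ["Lung", "Liver", "Breast", "Colon", "Kidney", "Skin"]
QUERIES = {"lung": "Lung", "brest": "Breast", "kidny": "Kidney"}

if __name__ == "__main__":
    print(rank_candidates("brest", ONTOLOGY, k=3))
    print(top_k_accuracy(QUERIES, ONTOLOGY, k=5))
```

Trying the cheap exact match first means the more expensive model-based stages only run on values that the simple lookup cannot resolve.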
Project 2: Visualize OncoKB Annotation and Patient Report Generation
- Student: Aishika Nandi
- Mentors: Hongxin Zhang, Ben Preiser
- Project Summary: OncoKB provides an endpoint for programmatic annotation of genomic data; however, the results are returned as JSON and may not be easily digestible. Aishika therefore developed a module that visualizes OncoKB annotations, taking the data returned from our endpoints as input. Soon, this package will be available on npm, and users will be able to easily leverage this interface. We expect to see this package incorporated into the cBioPortal and OncoKB ecosystems in the near future.
- Code Link
Project 3: Improve Navigability of the HTAN Data Standards
- Student: Ankita Sahu
- Mentors: Onur Sumer, Ino de Bruijn, Jennifer Altreuter, Aditi Gopalan
- Project Summary: This project enhanced the Data Standards Page for the Human Tumor Atlas Network (HTAN), a cBioPortal-adjacent project. HTAN is a collaborative NCI-funded initiative dedicated to mapping the cellular complexities of human cancers to improve diagnosis and treatment. The Data Standards Page helps submitters and reusers of HTAN data understand the underlying data model. It empowers researchers to effectively leverage HTAN's rich public datasets, ultimately advancing our comprehension of cancer biology and facilitating the development of targeted therapies. Ankita implemented improvements that make it easier to find specific attributes across a variety of data modalities.
- Code Link (In production 🚀)
Project 4: Add the Ability to Spawn Code Notebooks from cBioPortal Queries
- Student: Gautam Sarawagi
- Mentors: Aaron Lisman
- Project Summary: Users may want to perform custom analysis on data queried in cBioPortal. This project introduces the ability to spawn a browser-based JupyterLite code notebook populated with data exported from the Oncoprint. A sample Python script renders a visualization in the Oncoprint. This feature could easily be extended to other export points in the portal, as well as to other third-party analytics tools.
- Code Link (In production 🚀)
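As a rough illustration of what "a notebook populated with exported data" can look like, the sketch below builds a minimal Jupyter notebook document (nbformat 4) whose first code cell embeds a tab-separated export. This is an assumption-laden example, not the portal's implementation: how the actual feature hands data to JupyterLite is not described here, and the `build_notebook` helper and the column names in `EXPORT` are hypothetical.

```python
import json


def build_notebook(tsv_data: str) -> dict:
    """Return a minimal nbformat-4 notebook whose first cell embeds exported data."""
    source = (
        "import io\n"
        "import csv\n"
        f"raw = {tsv_data!r}\n"
        "rows = list(csv.DictReader(io.StringIO(raw), delimiter='\\t'))\n"
        "print(len(rows), 'rows loaded')\n"
    )
    code_cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }
    return {"cells": [code_cell], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Hypothetical export: column names and values are illustrative only.
EXPORT = "sample_id\tgene\talteration\nS1\tTP53\tMUT\nS2\tKRAS\tAMP\n"
notebook = build_notebook(EXPORT)
notebook_json = json.dumps(notebook, indent=1)  # contents of an .ipynb file

if __name__ == "__main__":
    print(notebook_json)
```

Embedding the data in the first cell keeps the notebook self-contained, so it can run entirely in the browser without a second request for the exported data.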
Project 5: Integration of AlphaMissense Pathogenicity Predictions into Genome Nexus and cBioPortal
- Student: Ivy Zou
- Mentors: Onur Sumer, Xiang Li
- Project Summary: AlphaMissense is an AI model developed by Google DeepMind that predicts the pathogenicity of missense variants. This model offers highly accurate predictions by classifying these variants as either benign or pathogenic, which is crucial for understanding genetic diseases. The project aims to integrate AlphaMissense data into the Genome Nexus API to provide precise pathogenicity predictions for missense mutations programmatically, and to display them on the cBioPortal and Genome Nexus websites for on-site analysis. This enhances cBioPortal’s ability to provide actionable insights into cancer genomics.
- Code Link (In production 🚀)
Project 6: Extend Chart Types in Study View
- Student: mukayevolzhas
- Mentors: Aaron Lisman, Bryan Lai
- Project Summary: The Study View Page in cBioPortal lets users view and generate charts for clinical, genomic, and other types of data for a set of studies. Currently, the Study View Page supports only a few chart types, such as pie charts, bar charts, and tables, with limited functionality. This project enhances the Study View Page by implementing new chart functionality, such as a zoom preview for bar charts, and by adding new chart types, such as line charts. These features enhance the user experience of cBioPortal, giving users more freedom and tools to customize charts on the Study View Page for visualization and research needs.
- Code Link (Partially in production 🚀)
Project 7: Frontend Visualization and Incorporation of Single Cell Data in cBioPortal
- Student: Suraj Sharma
- Mentors: Zeynep Karagöz, Sowmiyaa Kumar, Anika Bongaarts
- Project Summary: Researchers often aim to combine insights from various omics techniques, and single-cell gene expression data provides an additional layer of depth to cancer genomic analyses. By integrating single-cell data at the cell type and sample level, researchers can compare gene expression between cell types within or across groups, uncovering tumor heterogeneity and distinct gene expression profiles. In this project, a new single-cell tab has been integrated into the cBioPortal frontend for analyzing data at the cell type-patient level, along with enhanced functionality for stacked bar plots in the portal.
- Code Link
Project 8: Chatbot Trained on Documentation Site and Conversations
- Student: Xinling Wang
- Mentors: Augustin Luna, Meysam Ghaffari, Ruslan Forostianov
- Project Summary: This project makes it easier for users to find the information they need about the cBioPortal project through a customized chatbot based on GPT-4. The chatbot uses retrieval-augmented generation paired with a routing mechanism to match user questions to cBioPortal content, including Google Group conversations and documentation. Additionally, it can address basic questions using content returned from the cBioPortal API. There are ongoing plans to make this prototype available more broadly in a maintainable manner.
- Code Link
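The idea of routing a question to a content source and then retrieving context for the prompt can be sketched in a few lines. This toy example is not the chatbot's code: it replaces embedding-based retrieval with simple word overlap, makes no model call, and the `SOURCES` passages and `ROUTE_KEYWORDS` are invented for illustration.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation removed."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


# Hypothetical content sources; the passages are illustrative only.
SOURCES = {
    "documentation": [
        "Deploy a local cBioPortal instance with Docker.",
        "Study files are validated before import.",
    ],
    "google_group": [
        "A user asked why the import failed with a validation error.",
        "Discussion about slow queries on large studies.",
    ],
}

# Hypothetical routing keywords for each source.
ROUTE_KEYWORDS = {
    "documentation": {"deploy", "install", "docker", "configure"},
    "google_group": {"error", "failed", "slow", "why"},
}


def route(question: str) -> str:
    """Pick the source whose keywords overlap most with the question."""
    words = tokenize(question)
    return max(ROUTE_KEYWORDS, key=lambda name: len(words & ROUTE_KEYWORDS[name]))


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question."""
    words = tokenize(question)
    return sorted(passages, key=lambda p: len(words & tokenize(p)), reverse=True)[:k]


def build_prompt(question: str) -> str:
    """Assemble the retrieved context and the question into a prompt string."""
    source = route(question)
    context = "\n".join(retrieve(question, SOURCES[source]))
    return f"Context ({source}):\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Why has my import failed?"))
```

In a real retrieval-augmented setup the prompt built here would be sent to the language model, which answers using the retrieved context rather than its training data alone.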
Project 9: Create Pipeline/Interface to Prioritize Variants for OncoKB Curation
- Student: Yameng Ge
- Mentors: John Konecny, Hongxin Zhang
- Project Summary: OncoKB’s curation team needs to analyze gigabytes of real patient genomic data, which are available in cBioPortal. The curation team has limited resources and needs to prioritize the genomic variants most likely to enhance our understanding of targeted therapy options in oncology. In this project, Yameng created a new webpage in OncoKB’s curation platform that shows curators statistical information about the frequencies of variants in specific tumor types. If a specific variant occurs at high frequency in a tumor type, the chances of our curators finding something potentially important to share on OncoKB’s website are much higher.
- Code Link
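The prioritization idea, ranking variants by how often they occur within a tumor type, can be shown with a small sketch. It is illustrative only: the `RECORDS` data are invented, the `variant_frequencies` helper is hypothetical, and the denominator assumes every sample of a tumor type appears in the records at least once.

```python
from collections import defaultdict

# Hypothetical (sample_id, tumor_type, variant) records for illustration.
RECORDS = [
    ("S1", "Melanoma", "BRAF V600E"),
    ("S2", "Melanoma", "BRAF V600E"),
    ("S3", "Melanoma", "NRAS Q61R"),
    ("S4", "Colorectal Cancer", "KRAS G12D"),
    ("S5", "Colorectal Cancer", "BRAF V600E"),
    ("S6", "Colorectal Cancer", "KRAS G12D"),
    ("S7", "Colorectal Cancer", "KRAS G12D"),
]


def variant_frequencies(records):
    """Return (tumor_type, variant, frequency) rows, most frequent first.

    Frequency is the share of samples of that tumor type carrying the variant.
    """
    samples_per_type = defaultdict(set)
    samples_with_variant = defaultdict(set)
    for sample_id, tumor_type, variant in records:
        samples_per_type[tumor_type].add(sample_id)
        samples_with_variant[(tumor_type, variant)].add(sample_id)
    rows = [
        (tumor_type, variant, len(samples) / len(samples_per_type[tumor_type]))
        for (tumor_type, variant), samples in samples_with_variant.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for tumor_type, variant, frequency in variant_frequencies(RECORDS):
        print(f"{tumor_type}\t{variant}\t{frequency:.2f}")
```

Counting distinct samples rather than raw records avoids inflating a variant's frequency when the same sample is listed more than once.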
Project 10: Migrate end-to-end tests to WebdriverIO Async style
- Student: Jonathan Atiene
- Mentors: Aaron Lisman
- Project Summary: For years, the portal’s frontend build infrastructure has been stuck on an outdated version of Node.js because upgrading would require a substantial migration of our end-to-end (e2e) tests. Over the summer, Jonathan successfully migrated all the e2e tests.
- Code Link (In production 🚀)