Journal 3: Annotation Sources - bcb420-2025/Keren_Zhang GitHub Wiki

The Cancer Genome Atlas (TCGA)

Time

Date: Feb 20th - Feb 24th, 2025

Estimated Time: 10 hours

Time Taken: 7 hours

1. What sort of data is it? What sort of information does it offer?

TCGA provides comprehensive genomic profiles, spanning many data types, for each of more than 11,000 tumors from 33 of the most prevalent forms of cancer. Each step in the characterization pipeline generated numerous data points, such as:

Clinical Data:
- Includes comprehensive patient information such as demographic details, treatment histories, and survival data.
Biospecimen Data:
- Information about how samples were processed and handled, which is crucial for understanding the quality and characteristics of the molecular data derived from these samples.
Pathology Reports:
- Detailed reports for select cases that provide insights into the histological findings and the pathologist's interpretations, which are key for validating molecular research findings.
Molecular Characterization Data: This encompasses a wide range of genomic and molecular data types, including:
- Genomic DNA Sequencing: Both low-pass and whole genome sequencing of tumors, providing insights into genetic alterations.
- Exome Sequencing: Focused sequencing of the coding regions of the genome, which are rich with mutations that can alter protein function.
- Copy Number Variation: Data from SNP microarrays that help in identifying genomic regions that have gained or lost DNA segments, impacting gene dosage.
- Methylation: Information on DNA methylation patterns that affect gene expression without altering the DNA sequence.
- miRNA and mRNA Expression: Sequencing data that reveal the levels of miRNA and mRNA, important for understanding the regulation of gene expression and its impact on cancer.
- Protein Expression: Data from reverse-phase protein arrays that provide protein levels in tumor samples, essential for linking genetic changes to functional protein changes.
Imaging Data:
- Includes diagnostic images from tissue slides and radiological scans like MRI, CT, and PET scans, providing a visual assessment of the cancer’s physical characteristics.
Derived Data:
- Includes mutation calls, gene expression levels, and methylation profiles, which are processed from raw sequencing data and are crucial for bioinformatics analyses.

2. When and where was it published? Was it published?

The TCGA project was launched in 2006, and data collection and publication have been ongoing.
The TCGA Pan-Cancer Atlas findings were published in a collection of 27 papers in 2018, primarily in Cell and its associated journals.
- The entire collection of papers comprising the PanCancer Atlas is available through a portal on cell.com.
Three summary papers, published on April 5, 2018, recap the core findings and mark the culmination of the TCGA program, which was formally completed in 2018.

3. Is this annotation set updated regularly or is it a static source?

Since TCGA has officially concluded, the Pan-Cancer Atlas itself is a static source. However, the data continues to be used and referenced in new research. Additionally, the datasets are maintained in databases like the Genomic Data Commons (GDC) where they can be combined with new data for ongoing research.

4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

The TCGA data, including the Pan-Cancer Atlas, is not hosted on an FTP site for direct download due to the size and complexity of the data, as well as the need for controlled access to sensitive patient information. Instead, the data is available through the Genomic Data Commons (GDC) Data Portal.

Genomic Data Commons (GDC) Portal:
- Data can be accessed via the GDC Portal, which provides user-friendly access to both open and controlled data. The portal offers tools for searching, viewing, and downloading the comprehensive datasets collected by TCGA.
GDC Data Transfer Tool:
- For downloading large datasets, the GDC Data Transfer Tool can be used. It is designed to support high-volume data transfer, providing a more efficient way to download data than a web browser.
GDC APIs:
- For programmatic access, GDC provides APIs (Application Programming Interfaces) that allow researchers to query and download data based on specific parameters and needs.
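
As an illustration of programmatic access, here is a minimal sketch of how a search against the GDC files endpoint could be assembled in Python. It only builds the request and does not contact the server. The helper names (`build_files_query`, `build_files_url`), the chosen project (`TCGA-BRCA`), and the chosen data type are assumptions made for this example and are not part of the journal's analysis.

```python
import json
from urllib.parse import urlencode

# Public GDC REST endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=10):
    """Build query-string parameters for a GDC /files search.

    Hypothetical helper for illustration: it restricts the search to
    one project and one data type.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type", "value": [data_type]}},
        ],
    }
    return {
        "filters": json.dumps(filters),           # filters are sent as a JSON string
        "fields": "file_id,file_name,data_type",  # metadata fields to return
        "format": "JSON",
        "size": str(size),                        # maximum number of hits
    }


def build_files_url(project_id, data_type, size=10):
    """Return the full GET URL for the search."""
    query = build_files_query(project_id, data_type, size)
    return GDC_FILES_ENDPOINT + "?" + urlencode(query)


url = build_files_url("TCGA-BRCA", "Gene Expression Quantification")
print(url)

# To actually run the search (requires network access):
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       hits = json.load(response)["data"]["hits"]
```

Keeping the request construction separate from the network call makes the filter logic easy to inspect before any data is downloaded.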

5. How is the data formatted and released? Does it exist in some sort of standard file format?

The data from the Pan-Cancer Atlas is available in standard bioinformatics file formats, such as sequencing data (BAM), variant data (VCF), gene expression tables (TXT, CSV), protein expression data, and images (SVS for slide images).
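
To show what a "standard file format" means in practice, the following sketch parses a single data line of a VCF file, whose first eight tab-separated columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO. The `VcfRecord` type, the `parse_vcf_line` helper, and the example line are made up for illustration; real analyses would more likely rely on an established VCF library.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: list
    info: dict


def parse_vcf_line(line):
    """Parse one tab-delimited VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # "." marks a missing INFO column
        key, sep, value = entry.partition("=")
        # Flag-style entries have no "=" and are stored as True.
        info_dict[key] = value if sep else True
    # ALT may list several alleles separated by commas.
    return VcfRecord(chrom, int(pos), variant_id, ref, alt.split(","), info_dict)


# Made-up example line, not taken from any TCGA file.
example = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=100;SOMATIC"
record = parse_vcf_line(example)
print(record)
```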

6. What identifiers are associated with these annotations?

The TCGA Pan-Cancer Atlas annotations are associated with a variety of identifiers that help track and organize the extensive data collected for each sample. These identifiers ensure precise linkage of genomic data with corresponding clinical information, biospecimen details, and other relevant data types. For instance:

TCGA Barcode: Unique identifier for each sample incorporating project, site, and sample details.
UUID (Universally Unique Identifier): 128-bit identifier used across the GDC to track files and samples.
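
The two identifier types above can be handled with a few lines of Python. This is a minimal sketch, assuming the commonly documented barcode layout in which hyphen-separated fields give the project, tissue source site, participant, and then sample-level details, and in which sample type codes 01-09 denote tumor, 10-19 normal, and 20-29 control samples. The function names `parse_tcga_barcode` and `is_valid_uuid` are hypothetical helpers written for this example.

```python
import uuid


def parse_tcga_barcode(barcode):
    """Split a TCGA barcode into its leading components."""
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    parsed = {
        "project": parts[0],
        "tissue_source_site": parts[1],
        "participant": parts[2],
    }
    if len(parts) > 3:
        sample_code = parts[3][:2]     # two-digit sample type code
        parsed["sample_type_code"] = sample_code
        parsed["vial"] = parts[3][2:]  # optional vial letter
        code = int(sample_code) if sample_code.isdigit() else -1
        if 1 <= code <= 9:
            parsed["sample_class"] = "tumor"
        elif 10 <= code <= 19:
            parsed["sample_class"] = "normal"
        elif 20 <= code <= 29:
            parsed["sample_class"] = "control"
        else:
            parsed["sample_class"] = "unknown"
    return parsed


def is_valid_uuid(text):
    """Return True if text parses as a UUID.

    Note: uuid.UUID is lenient and also accepts forms without hyphens.
    """
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


print(parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01"))
print(is_valid_uuid(str(uuid.uuid4())))  # a freshly generated UUID
```

Parsing the barcode this way makes it straightforward to separate tumor from normal samples before linking molecular data to clinical records.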