Literature review

Dynamic gene regulatory networks of human myeloid differentiation

Take home message: Gene expression changes rapidly early in macrophage differentiation, while significant chromatin accessibility changes occur later, revealing a dynamic and staged process of cellular specification with distinct temporal patterns of genetic regulation

  • Techniques

RNA-sequencing (RNA-seq)

measures gene expression levels

    • method:
      • extract RNA from cells
      • convert RNA to complementary DNA (cDNA)
      • sequence the cDNA using high-throughput sequencing technologies
      • analyse the number of reads for each gene to determine expression levels
    • advantages:
      • quantifies how active genes are at a specific moment
      • can detect subtle changes in gene expression
      • provides a comprehensive view of cellular gene activity
    • in this study:
      • used to track gene expression changes during myeloid cell differentiation
      • identified which genes were upregulated or downregulated at different time points
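The last step of the method (turning read counts into expression levels and comparing time points) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the counts are made up, and the function names, the counts-per-million scaling and the pseudocount of 1 are assumptions chosen for simplicity.

```python
import math

def counts_per_million(counts):
    """Scale raw per-gene read counts by library size (counts per million)."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

def log2_fold_change(before, after, pseudocount=1.0):
    """log2 ratio of normalized expression; the pseudocount avoids log(0)."""
    return {
        gene: math.log2((after[gene] + pseudocount) / (before[gene] + pseudocount))
        for gene in before
    }

# Made-up read counts at two time points; the second library is twice as deep.
counts_0h = {"SPI1": 100, "EGR1": 50, "GAPDH": 850}
counts_3h = {"SPI1": 200, "EGR1": 400, "GAPDH": 1400}

lfc = log2_fold_change(counts_per_million(counts_0h), counts_per_million(counts_3h))
for gene, value in lfc.items():
    print(f"{gene}: log2FC = {value:+.2f}")
```

In this toy example the raw SPI1 count doubles, but so does the sequencing depth, so its normalized expression is unchanged; only EGR1 is called as upregulated. Real analyses would use replicates and a dedicated statistical package.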

  • ATAC-sequencing (ATAC-seq)

technique used to map chromatin accessibility

    • method:
      • uses a hyperactive Tn5 transposase enzyme
      • the enzyme cuts open DNA regions that aren’t tightly packed → accessible chromatin
      • open regions are tagged with sequencing adapters
      • regions are sequenced and mapped to understand chromatin structure
    • advantages:
      • reveals which parts of the genome are “open” and potentially active
      • helps understand gene regulation at the chromatin level
      • requires fewer cells compared to older techniques
    • in this study:
      • tracked changes in chromatin accessibility during cell differentiation
      • found 8 907 differential chromatin elements

  • Results

Gene expression dynamics

    • macrophage differentiation showed rapid gene expression changes early (3 hours)
    • neutrophil and monocyte differentiation had more significant changes at 6 hours post-differentiation

Chromatin accessibility

    • minimal chromatin landscape changes in early stages
    • major accessibility changes occurred in middle to late differentiation stages

Transcription factor regulation

    • identification of key regulators like PU.1 and the EGR family

PU.1

    • critical myeloid fate specification factor
    • unique expression patterns across cell types
    • PU.1 knockdown experiment: altered chromatin accessibility

EGR

    • dramatic expression changes in macrophage cell types
    • significant role in cell differentiation

discovered 23 transcriptional regulators with 158 inferred differentiation stages

ATAC-seq: A method for assaying chromatin accessibility genome-wide

Take home message: ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is a genome-wide chromatin accessibility technique that uses hyperactive Tn5 transposase to simultaneously cut DNA and ligate sequencing adapters, enabling rapid, sensitive and low-input epigenomic profiling with unprecedented efficiency across multiple cell types and species

  • Background and significance

Chromatin packaging challenge

    • approximately 2 metres of DNA are packed into a 5-micron nucleus
    • DNA is folded hierarchically around histone proteins
    • this creates a complex chromatin structure that:
      • sequesters inactive genome regions
      • leaves biologically active regions accessible
      • operates with dynamic epigenetic mechanisms

  • Epigenetic genome-wide analysis

former techniques: DNase-seq, MNase-seq, ChIP-seq

    • require tens to hundreds of millions of cells
    • slow, complex workflow

  • ATAC-seq

advantages

    • low cell input requirements
      • only 50 000 cells
      • enables analysis of rare and important cellular subtypes that are difficult to acquire in large quantities
      • compatible with multiple cell types and species
    • speed and efficiency
      • rapid protocol, total time of 3 h
      • simplified workflow compared to other chromatin mapping techniques
    • comprehensive insights
      • provides multidimensional epigenomic profiling
      • separates reads into sub-nucleosomal and nucleosomal length fragments

Multiscale footprints reveal the organization of cis-regulatory elements

Take home message: This study introduces a computational method called PRINT, combined with deep learning, to map the dynamic organization of proteins at cis-regulatory elements, revealing insights into gene regulation across cell differentiation and aging

  • Fundamental concepts

CREs (cis-regulatory elements): genomic regions controlling gene expression

    • characteristics:
      • dynamic in structure and function
      • integrate diverse regulatory proteins
      • determine cellular potential and function
    • scale: approximately 1 million candidate CREs in humans

  • Key innovations

PRINT method: protein-regulatory element interactions at nucleotide resolution

    • detects footprints of DNA-binding proteins from ATAC-seq and scATAC-seq data
    • corrects for Tn5 sequence bias using a convolutional neural network

seq2PRINT

    • a deep learning model that predicts protein-DNA binding footprints using only DNA sequence
    • enables high-resolution prediction of TF and nucleosome binding
    • identifies de novo motifs and models combinatorial TF interactions

  • Biological applications

haematopoiesis analysis

    • methodology: single-cell ATAC and RNA sequencing
    • showed how TF (= transcription factor) binding at CREs changes during cell differentiation
    • stepwise activation of erythroid (red blood cells) and lymphoid (T cells, B cells and NK cells) CREs, expanding outward from pioneer TFs

aging hematopoietic stem cells (HSCs) in mice

    • detected altered nucleosome footprints and rewiring of TF binding
    • identified age-associated changes (including increased activity of Ets/Runx composite motifs and loss of Yy1, Nrf1, Ctcf)
    • structural predictions (using AlphaFold3) suggested physical interactions at these composite motifs

  • Major insights

CRE dynamics:

    • sequential establishment and widening of CREs centred on pioneer factors during haematopoiesis

age-associated changes:

    • age-related alterations in CRE structure in hematopoietic stem cells, including nucleosome footprint reduction and novel Ets composite motifs (= DNA sequences where Ets (E26) transcription factors bind in combination with other transcription factors)

TF binding dynamics:

    • CREs switch transcription factors during differentiation, which is not always reflected by overall accessibility
    • TFs don’t always bind simultaneously; binding is sequential and layered, often beginning at CRE centres and expanding outward

From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis

Systematic discussion of all major steps in an ATAC-seq analysis pipeline, from raw sequencing reads to the end-point of biologically meaningful interpretation

  • ATAC-seq

    • interrogates chromatin accessibility (heterochromatin vs. euchromatin)
    • identifies nucleosome positions using fragments representing nucleosome monomers and multimers

ATAC-seq incorporates a genetically engineered hyperactive Tn5 transposase that simultaneously cuts open chromatin leaving a 9-bp staggered nick and ligates high-throughput sequencing adapters to these regions. During this process, the nick is repaired, leaving a 9-bp duplication. Paired-end sequencing is then performed to facilitate higher unique alignment rates of these open regions.

    • 500-50,000 cells (here 10.000 bp)
    • sensitivity and specificity similar to DNase-seq and FAIRE-seq, but with less input material and an easier protocol
    • few bioinformatic analysis tools developed specifically for ATAC-seq

  • Pre-analysis: quality control and alignment

Steps: pre-alignment QC, read alignment to a reference genome, post-alignment QC and processing

Recommendations: FastQC, trimmomatic and BWA-MEM

    • Pre-alignment QC

Steps are standard for most high-throughput sequencing technologies. Read trimming tools perform comparably at removing low-quality bases and contaminating adapter sequences. Tools: FastQC for quality reports, trimmomatic for trimming.

    • Alignment

    • Alignment through overlapping 9-bp duplications
    • FastQC can be performed again to check the successful removal of adapter and low-quality bases (?)
    • BWA-MEM and Bowtie2 aligners are memory-efficient and fast for short paired-end reads

    • Post alignment processing and QC

Basic metrics of the BAM file can be collected using Picard and SAMtools:

    • unique mapping reads/rates
    • duplicated read percentages
    • fragment size distribution

Improving the power of open chromatin detection and ensuring fewer false positives:

    • Reads should be removed if they are improperly paired or of low mapping quality.
    • The mitochondrial genome and ENCODE blacklisted regions often have extremely high read coverage, so reads mapping to them should be discarded.
    • Duplicated reads (usually PCR artifacts) should also be removed.

Additional quality metrics:

    • Check that the fragment size distribution plot has decreasing, periodic peaks corresponding to the nucleosome-free region (NFR, <100 bp) and to mono-, di- and tri-nucleosomes (around 200, 400 and 600 bp).
    • NFR fragments are expected to be enriched around the TSS of genes, while nucleosome-bound fragments are expected to be depleted at the TSS with a slight enrichment in the flanking regions.
    • Reads should be shifted +4 bp and -5 bp for the positive and negative strand respectively, to account for the 9-bp duplication created during tagmentation.
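The filtering rules, the Tn5 shift and the fragment size classes above can be sketched on a toy fragment record. This is only an illustration under stated assumptions: the `Fragment` class, the MAPQ cut-off of 30, the mitochondrial contig name `chrM` and the size bin boundaries are all hypothetical choices, and real pipelines work on BAM files with tools such as SAMtools and Picard.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    chrom: str
    start: int          # 0-based leftmost base (end of the positive-strand read)
    end: int            # exclusive rightmost coordinate (end of the negative-strand read)
    mapq: int
    proper_pair: bool
    duplicate: bool

def passes_filters(frag, blacklist=(), min_mapq=30, mito="chrM"):
    """Drop improperly paired, low-MAPQ, duplicate, mitochondrial and blacklisted fragments."""
    if not frag.proper_pair or frag.duplicate or frag.mapq < min_mapq:
        return False
    if frag.chrom == mito:
        return False
    # Reject any overlap with a blacklisted half-open interval on the same chromosome.
    return not any(
        chrom == frag.chrom and frag.start < b_end and b_start < frag.end
        for chrom, b_start, b_end in blacklist
    )

def shift_tn5(frag):
    """Shift the positive-strand end by +4 bp and the negative-strand end by -5 bp."""
    return frag.start + 4, frag.end - 5

def size_class(length):
    """Rough bins around the expected <100, ~200, ~400 and ~600 bp peaks (cut-offs assumed)."""
    if length < 100:
        return "NFR"
    if length < 300:
        return "mono-nucleosome"
    if length < 500:
        return "di-nucleosome"
    if length < 700:
        return "tri-nucleosome"
    return "other"

fragments = [
    Fragment("chr1", 1000, 1080, 42, True, False),   # kept, nucleosome-free
    Fragment("chr1", 5000, 5210, 42, True, False),   # kept, mono-nucleosome
    Fragment("chrM", 100, 190, 42, True, False),     # mitochondrial, dropped
    Fragment("chr2", 300, 480, 5, True, False),      # low mapping quality, dropped
]
for frag in filter(passes_filters, fragments):
    print(frag.chrom, shift_tn5(frag), size_class(frag.end - frag.start))
```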

  • Core analysis: peak calling

Identification of accessible regions of chromatin (peak calling) is the basis for advanced analysis

Should account for Tn5 cleavage bias

Popular peak callers are divided into two major categories: shape-based and count-based.

    • Count-based

Employ different statistical methods to compare the shape of the read distribution in a candidate region to a random background.

      • MACS2, HOMER and SICER/epic2 assume a Poisson distribution
      • ZINBA assumes a zero-inflated negative binomial distribution
      • F-seq and PeakDEck use kernel density estimation to profile the fragment distribution
      • SPP makes no assumption about the fragment distribution, but uses a sliding window to calculate scores based on fragment counts from up- and downstream flanking windows
      • F-seq and ZINBA are not actively maintained -> use with caution
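To make the Poisson idea concrete, here is a minimal sketch of testing fixed windows against a single background rate. It is not how MACS2, HOMER or SICER are implemented; the window counts, the background rate of 2 reads per window and the significance threshold are made-up assumptions, and no multiple-testing correction is applied.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)      # P(X = 0)
    cdf = 0.0
    for i in range(k):         # accumulate P(X <= k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def enriched_windows(counts, background_rate, alpha=1e-5):
    """Indices of windows whose read count is unlikely under the background rate."""
    return [i for i, c in enumerate(counts) if poisson_sf(c, background_rate) < alpha]

# Made-up read counts per window; windows 4 and 5 look like a peak.
window_counts = [2, 1, 3, 2, 25, 30, 2, 1]
print(enriched_windows(window_counts, background_rate=2.0))
```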

    • Shape-based

Shape-based peak callers are not currently used in ATAC-seq, but they utilize read density profile information directly or indirectly and are believed to improve peak calling in ChIP-seq.

    • HMMRATAC

Only peak caller designed exclusively for ATAC-seq.

Employs a three-state semi-supervised hidden Markov model (HMM) to simultaneously segment the genome into open chromatin regions with high signal, nucleosomal regions with moderate signal, and background regions with low signal. Computationally more intensive, but reported to perform better than MACS2 and F-seq. Also provides additional nucleosome position information.
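The three-state segmentation idea can be illustrated with a toy Viterbi decoder. This is a strong simplification and not HMMRATAC's model: the Poisson emissions, their means, the uniform start probabilities and the fixed probability of staying in a state are all assumptions made for this sketch, and nothing is learned from the data.

```python
import math

STATES = ("background", "nucleosome", "open")
MEANS = {"background": 1.0, "nucleosome": 5.0, "open": 15.0}  # assumed signal per bin
STAY = 0.9                                                    # assumed self-transition probability
LOG_STAY = math.log(STAY)
LOG_MOVE = math.log((1 - STAY) / (len(STATES) - 1))

def log_poisson(k, lam):
    """Log probability of observing k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_transition(prev, cur):
    return LOG_STAY if prev == cur else LOG_MOVE

def viterbi(signal):
    """Most likely state path for a non-empty list of per-bin signal counts."""
    score = {s: -math.log(len(STATES)) + log_poisson(signal[0], MEANS[s]) for s in STATES}
    backpointers = []
    for k in signal[1:]:
        step, new_score = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            step[cur] = prev
            new_score[cur] = score[prev] + log_transition(prev, cur) + log_poisson(k, MEANS[cur])
        backpointers.append(step)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]

signal = [1, 0, 2, 1, 5, 6, 4, 16, 14, 18, 15, 5, 4, 6, 1, 0, 1]
for value, state in zip(signal, viterbi(signal)):
    print(f"{value:>3}  {state}")
```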

  • Advanced analysis
    • Peaks

Interpretation at four different levels: peak, motif, nucleosome, and TF footprint. However, only a few tools are designed specifically for ATAC-seq.

Peak differential analysis

Currently, no differential peak analysis tools have been specifically developed for ATAC-seq data analysis. A straightforward approach is to define candidate regions (consensus peaks or a binned genome), count and normalize the fragments in these regions, and compare the counts between conditions statistically. This can be done manually or with:

    • Consensus-peak tools: HOMER, DBChIP, and DiffBind. They assume a negative binomial distribution and require biological replicates to estimate dispersion.
    • Sliding-window approaches: PePr, DiffReps, ChIPDiff, csaw. They tend to yield more false positives and require stringent filtering and false discovery rate (FDR) control, but are thought to give more unbiased estimates of read counts across the genome.
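The manual route (define consensus regions, then count fragments in them) can be sketched as below. This is an assumed, simplified version: peaks are `(chrom, start, end)` tuples with half-open coordinates, fragments are represented only by their midpoints, and the function names are hypothetical. The statistical comparison itself is left to packages that model dispersion from replicates.

```python
def merge_peaks(*peak_sets):
    """Union overlapping or touching peaks from several samples into consensus peaks."""
    intervals = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous consensus peak.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def count_fragments(peaks, midpoints):
    """Count fragment midpoints (chrom, position) falling inside each half-open peak."""
    return [
        sum(1 for c, pos in midpoints if c == chrom and start <= pos < end)
        for chrom, start, end in peaks
    ]

# Made-up peak calls from two conditions and fragment midpoints from one sample.
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_b = [("chr1", 150, 250), ("chr2", 100, 200)]
consensus = merge_peaks(peaks_a, peaks_b)
midpoints = [("chr1", 120), ("chr1", 240), ("chr1", 250), ("chr2", 150)]
print(consensus)
print(count_fragments(consensus, midpoints))
```

The resulting count vectors (one per sample) are what would then be normalized and compared between conditions.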

Peak annotation

Generates biologically and functionally meaningful results for further investigation. HOMER, ChIPseeker, and ChIPpeakAnno are widely used to assign peaks to the nearest or overlapping gene, exon, intron, promoter, 5'- and 3'-UTR and other genomic features.
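Nearest-gene assignment, the simplest form of peak annotation, can be sketched with a binary search over sorted TSS positions. This is an illustrative assumption rather than how HOMER or ChIPseeker work internally: strand and feature types are ignored, and the gene names and coordinates are made up.

```python
import bisect

def annotate_nearest_tss(peaks, tss_by_chrom):
    """Assign each peak (chrom, start, end) to the TSS closest to its midpoint.

    tss_by_chrom maps a chromosome to a sorted list of (position, gene) tuples.
    Returns (chrom, start, end, gene, offset), with offset = TSS position - peak midpoint.
    """
    annotated = []
    for chrom, start, end in peaks:
        tss = tss_by_chrom.get(chrom, [])
        if not tss:
            annotated.append((chrom, start, end, None, None))
            continue
        mid = (start + end) // 2
        i = bisect.bisect_left(tss, (mid,))
        # Only the TSS just left and just right of the midpoint can be nearest.
        candidates = tss[max(0, i - 1): i + 1]
        pos, gene = min(candidates, key=lambda t: abs(t[0] - mid))
        annotated.append((chrom, start, end, gene, pos - mid))
    return annotated

tss_by_chrom = {"chr1": [(1000, "geneA"), (5000, "geneB")]}
peaks = [("chr1", 900, 1100), ("chr1", 3900, 4300), ("chr2", 1, 10)]
for row in annotate_nearest_tss(peaks, tss_by_chrom):
    print(row)
```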

    • Motifs

Understanding motif usage or activity change may help decipher the underlying regulatory networks, as well as identify key regulators

Motif database scan

There is no single database that contains comprehensive and consistent motif information.

    • CIS-BP and TRANSFAC for eukaryotic TF motifs
    • HOCOMOCO for human and mouse TF motifs
    • RegulonDB for E. coli

Motif enrichment and activity analysis

Position and frequency of motifs in each peak region can be obtained and compared to a random background or another condition. This predicts putative TFBSs (transcription factor binding sites) indirectly from the sequences found within peak regions.

Tools: HOMER, MEME-AME, MEME-CentriMo, DAStk, ChromVAR, DiffTF
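Comparing motif frequency in peaks against a background can be sketched with an exact-match scan and a hypergeometric test. This is a deliberately simple assumption: the listed tools score position weight matrices rather than exact strings, and the 7-mer and the sequences below are purely illustrative.

```python
from math import comb

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def contains_motif(seq, motif):
    """True if the motif occurs on either strand (exact match only)."""
    seq = seq.upper()
    return motif in seq or reverse_complement(motif) in seq

def hypergeom_sf(k, total, with_motif, drawn):
    """P(X >= k) when drawing `drawn` sequences from `total`, of which `with_motif` carry the motif."""
    top = sum(
        comb(with_motif, i) * comb(total - with_motif, drawn - i)
        for i in range(k, min(with_motif, drawn) + 1)
    )
    return top / comb(total, drawn)

def motif_enrichment(foreground, background, motif):
    """Hypergeometric p-value for motif over-representation in the foreground peaks."""
    fg_hits = sum(contains_motif(s, motif) for s in foreground)
    bg_hits = sum(contains_motif(s, motif) for s in background)
    total = len(foreground) + len(background)
    return hypergeom_sf(fg_hits, total, fg_hits + bg_hits, len(foreground))

# Made-up peak sequences; the motif is an illustrative 7-mer.
foreground = ["AATGACTCAGG", "CCTGAGTCACC", "TTTGACTCATT"]
background = ["ACGTACGTACGT", "GGGGCCCCAAAA", "TTTTAAAACCCC",
              "ACACACACACAC", "GTGTGTGTGTGT", "AATGACTCAGG"]
print(motif_enrichment(foreground, background, "TGACTCA"))
```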

    • Footprints

A footprint is a pattern where an active TF binds to DNA and prevents Tn5 cleavage within the binding site, leaving a relative depletion of cuts within the open chromatin region. Footprints are useful to understand TF regulation and to reconstruct cell-specific regulatory networks. Footprint detection is both experimentally and computationally difficult.

Footprints of actively bound TFs can be used to reconstruct a regulatory network specifically for certain samples. Two methods: de novo and motif-centric tools

De novo

Predict all footprint sites across peaks. These putative sites are then used to match known motifs or identify novel ones.

Motif-centric

Focus on a priori TFBSs and consider TF-specific footprint profiles
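The footprint pattern itself (fewer Tn5 cuts inside a bound site than in its flanks) can be quantified with a simple depth score. This is a hypothetical score for illustration only: it is not the statistic used by any of the footprinting tools, the flank width of 10 bp is an assumption, and no Tn5 bias correction is applied.

```python
def footprint_depth(cuts, site_start, site_end, flank=10):
    """Mean cuts per base in the flanks minus mean cuts per base inside the site.

    cuts is a list of per-base Tn5 cut counts; the site is the half-open
    interval [site_start, site_end). Positive values indicate protection.
    """
    left = cuts[max(0, site_start - flank):site_start]
    right = cuts[site_end:site_end + flank]
    flanks = left + right
    centre = cuts[site_start:site_end]
    if not flanks or not centre:
        raise ValueError("site and flanks must lie within the cut profile")
    return sum(flanks) / len(flanks) - sum(centre) / len(centre)

# Made-up cut profile: high cleavage on both sides of a protected 6-bp site.
protected = [8] * 10 + [1] * 6 + [8] * 10
unbound = [5] * 26
print(footprint_depth(protected, 10, 16))
print(footprint_depth(unbound, 10, 16))
```

A motif-centric tool would evaluate such a profile at known motif positions, while a de novo tool would scan all positions within peaks.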

Comments on footprinting analysis

    • The generalizability of these tools to ATAC-seq still requires extensive evaluation.
    • Bias correction is important in both DNase-seq and ATAC-seq footprint detection.
    • There is no general guideline for the minimal ATAC-seq sequencing depth needed for effective footprinting.
    • De novo methods still have the advantage for low-quality and novel motifs.

    • Nucleosome positioning

The nucleosome consists of a histone octamer complex with approximately 147 bp of DNA and affects TF binding by altering chromatin accessibility.

HMMRATAC and NucleoATAC are two useful and specific tools for ATAC-seq nucleosome detection.

  • Integration with multi-omics data to reconstruct regulatory networks

Integration of ATAC-seq with other high-throughput sequencing technologies such as RNA-seq and ChIP-seq is gaining increasing interest to understand gene regulation.

    • Integration with ChIP-seq

Integrating ChIP-seq and ATAC-seq helps to understand TF- and histone-facilitated chromatin accessibility changes. TF ChIP-seq and ATAC-seq can mutually validate the quality and reliability of each other within the same experimental system.

    • Integration with RNA-seq

Coupled clustering combining scATAC-seq and scRNA-seq was shown to improve accuracy in subpopulation detection. Integration of ATAC-seq with RNA-seq helps to decipher gene regulation and cellular heterogeneity.

    • Pipelines for ATAC-seq data

esATAC and CIPHER focus on peak annotation. GUAVA, a graphic user interface (GUI) tool, focuses on differential peak detection as well as functional annotation. ATAC2GRN is another pipeline specifically optimized for footprinting.

  • Single-cell ATAC-seq

scATAC-seq is now able to measure the chromatin accessibility of thousands of cells with an easy protocol at a low cost.

scATAC-seq data are sparse because a diploid cell carries only two copies of each DNA locus. Combining window-based genome binning, binarization of the accessibility, coverage bias correction, and dimension reduction using principal component analysis helps to handle the sparse scATAC-seq data.
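Window-based binning and binarization can be sketched on a single toy chromosome. This is a minimal illustration under assumptions: cells are given as lists of cut-site positions that all lie within the chromosome length, the bin size is arbitrary, and coverage bias correction and PCA are left out. The Jaccard similarity at the end is just one hypothetical way to compare the resulting binary profiles.

```python
def binarized_bins(cut_sites_by_cell, chrom_length, bin_size):
    """Build a cell x bin matrix with 1 where a cell has at least one cut site in the bin."""
    n_bins = -(-chrom_length // bin_size)   # ceiling division
    matrix = {}
    for cell, cut_sites in cut_sites_by_cell.items():
        row = [0] * n_bins
        for pos in cut_sites:
            row[pos // bin_size] = 1        # binarize: presence, not count
        matrix[cell] = row
    return matrix

def jaccard(row_a, row_b):
    """Shared accessible bins divided by bins accessible in either cell."""
    both = sum(a & b for a, b in zip(row_a, row_b))
    either = sum(a | b for a, b in zip(row_a, row_b))
    return both / either if either else 0.0

# Made-up cut sites for two cells on a 10 kb chromosome, binned at 1 kb.
cells = {"cell1": [10, 20, 5200], "cell2": [15, 9100]}
matrix = binarized_bins(cells, chrom_length=10_000, bin_size=1_000)
print(matrix["cell1"])
print(matrix["cell2"])
print(jaccard(matrix["cell1"], matrix["cell2"]))
```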

  • Summary
    • Pre-analysis: FastQC, trimmomatic, BWA-MEM
    • Peak calling: MACS2
    • Differential peak analysis: csaw
    • Motif detection and enrichment: MEME suite
    • Annotation and visualization: ChIPseeker
    • Nucleosome detection: HMMRATAC
    • Footprint analysis: HINT-ATAC

The chromatin accessibility landscape of primary human cancers

Take home message: This study maps the chromatin accessibility landscape across various human cancers, pinpointing regulatory elements, transcription factor activities and non-coding mutations to improve cancer diagnosis and therapy

  • Key findings

diverse regulatory landscape

    • identification of diverse regulatory landscapes across 23 cancer types using ATAC-seq data from 410 tumour samples

DNA regulatory elements

    • correlation-based linking of ATAC-seq peaks to genes
    • androgen receptor (AR) in prostate cancer, FOXA1 in non-basal breast cancer and MITF in melanoma were enriched in cluster-specific peak sets
    • pinpointed 562 709 DNA regulatory elements, expanding the catalogue of known cis-regulatory elements
    • identification of regulatory interactions involving key oncogenes MYC, SRC, BCL2, PDL1

transcription factor activities

    • identification of distinct TF activities in cancer based on patterns of TF-DNA interaction and gene expression

regulatory interactions

    • genome-wide correlation of gene expression and chromatin accessibility predicted thousands of interactions between distal regulatory elements and gene promoters

non-coding mutations

    • integration of whole-genome sequencing with ATAC-seq identified cancer-relevant non-coding mutations associated with altered gene expression
    • highlights a mutation upstream of the FGD4 gene in bladder cancer which significantly increased its expression

immune infiltration

    • identification of DNA regulatory elements related to the immunological response to cancer
    • key target is the PDL1 gene and immune evasion mechanisms
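Correlation-based linking of peaks to genes can be sketched as follows. This is only a toy version of the idea, not the study's procedure: the accessibility and expression values are made up, the Pearson threshold of 0.8 is an assumption, and no distance restriction or significance testing against a null is included.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long value lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def link_peaks_to_genes(accessibility, expression, min_r=0.8):
    """Return (peak, gene, r) for every pair correlated at or above min_r across samples."""
    links = []
    for peak, acc in accessibility.items():
        for gene, expr in expression.items():
            r = pearson(acc, expr)
            if r >= min_r:
                links.append((peak, gene, round(r, 3)))
    return links

# Made-up values across four tumour samples.
accessibility = {"peak1": [1, 2, 3, 4], "peak2": [4, 3, 2, 1]}
expression = {"geneA": [2, 4, 6, 8]}
print(link_peaks_to_genes(accessibility, expression))
```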

  • Broader implications

comprehensive regulatory landscape

    • helps in identifying cancer subtypes
    • helps in understanding gene regulatory networks
    • identified that distal regulatory elements affect the expression of PDL1, a key target for cancer immunotherapy → new avenues for modulating immune evasion mechanisms

improved diagnostics

    • by identification of specific regulatory elements and interactions
    • the MECOM gene in kidney renal papillary cell carcinoma (KIRP) is overexpressed in a subgroup of patients with adverse outcomes → potential prognostic marker

development of more targeted therapies

The cis-regulatory dynamics of embryonic development at single-cell resolution

  • Investigation of the dynamics of chromatin regulatory landscapes during embryogenesis at single-cell resolution
    • Profiling of embryonic cells in 3 different stages:
  1. 2-4h after egg laying (stage 5 blastoderm)
  2. 6-8h after egg laying (stage 10-11, midpoint when major lineages in mesoderm and ectoderm are specified)
  3. 10-12h after egg laying (cell's terminal differentiation)
    • sci-ATAC-seq: single cell combinatorial indexing ATAC-seq
    • 53 000+ potential cis-regulatory elements identified. Almost 41 000 are clade specific (for each cell-lineage at each time point, most for the later time points)
    • Highly dynamic and heterogeneous nature of chromatin accessibility during embryogenesis
    • 4 major clades identified at 6-8h and 10-12h:
  1. Neurogenic ectoderm
  2. Non-neurogenic ectoderm
  3. Myogenic mesoderm
  4. Non-myogenic mesoderm combined with endoderm
  • Validation of in-silico sorting and clade assignments by comparing with DNase-seq readings after FACS (fluorescence-activated cell sorting)

Spearman's ρ > 0.85 for matched versus 0.53 for non-matched comparisons globally (in bulk)

    • Clade assignments further supported by motif enrichments for transcription factor binding sites and transcription factor occupancy at putative enhancers (specific for each cell lineage).
    • 18 cell clusters at 2-4h (identified through t-SNE)

Data supports view that early pre-gastrulation cell specification events are underpinned by spatial heterogeneity in chromatin accessibility.

    • Determining if elements that exhibit tissue specific chromatin accessibility correspond to bona fide tissue specific enhancers: cloning of regulatory element upstream of a minimal promoter driving lacZ reporter and stable integration at a common location in transgenic fly's genome (minimize positional effect). Enhancer activity assessed across all stages of embryo development by in situ hybridization.

    • 94% of selected regions functioned as in vivo developmental enhancers
    • 4 out of 7 elements accessible in clade 4 also tested active in yolk nuclei (extra-embryonic). This suggests a potential regulatory link between the yolk and mesendodermal tissues (supported by the role of the GATA transcription factor serpent in both yolk and non-myogenic mesoderm)

  • In summary

Analysing development dynamics of chromatin accessibility through sci-ATAC-seq

    • 30 000+ putative distal regulatory elements exhibiting clade-specific accessibility identified
    • Sparsity of data from sci-ATAC-seq is still a challenge -> insights can be derived by aggregating observations across subsets of cells
    • around 12% of cell barcodes are expected to represent aggregates of 2 or more cells
  • Looking forward

    • Expanded dataset (more cells per time point and covering the whole of embryonic development) -> identify rarer cell types and reveal a fully continuous view of the landscape of chromatin accessibility as it unfolds.
    • integration of chromatin state, transcriptional output, lineage history and spatial information at single-cell resolution has the po

Landscape of stimulation-responsive chromatin across diverse human immune cells

Take home message: This study maps how chromatin accessibility and gene expression change in various human immune cells when they’re stimulated, revealing insights into autoimmune diseases.

  • Study overview

research focus:

    • investigating chromatin accessibility and gene expression in human immune cells during resting and stimulated states

methodology:

    • collected 25 primary human immune cell types from four blood donors
    • performed ATAC-seq and RNA-seq in resting and activated conditions
    • included six thymocyte subsets and thymic epithelial cells

  • Methods

ATAC-seq

    • profiles chromatin accessibility in resting and stimulated immune cells
    • identification of genomic regions that are open and accessible for gene transcription

RNA-seq

    • determination of up- or downregulated genes in response to stimulation by measuring gene expression levels in resting and stimulated cells

allele-specific analysis

    • examines allele-specific chromatin accessibility (ASC) to identify genetic variants affecting chromatin regulation
    • pinpoints how genetic variation affects gene regulation

GWAS enrichment analysis

    • genome-wide association study (GWAS) enrichment analysis to identify autoimmune-related genetic variants
    • determines whether genetic variants associated with autoimmune diseases are enriched in specific regions of the genome identified by ATAC-seq → links genetic risk factors to specific immune cell types and states
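The idea behind allele-specific chromatin accessibility can be sketched as a test for allelic imbalance at a heterozygous site. This is an assumed, simplified approach and not necessarily the test used in the study: reads are expected to split 50:50 between alleles under the null, the read counts are made up, and mapping bias and overdispersion are ignored.

```python
from math import comb

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum of outcomes no more likely than the observed one."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)   # small tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))

def allelic_imbalance(ref_reads, alt_reads):
    """Reference allele fraction and p-value for a site with at least one read."""
    n = ref_reads + alt_reads
    return ref_reads / n, binomial_two_sided(ref_reads, n)

# Made-up ATAC-seq read counts over two heterozygous variants.
print(allelic_imbalance(ref_reads=5, alt_reads=5))    # balanced
print(allelic_imbalance(ref_reads=28, alt_reads=4))   # skewed towards the reference allele
```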

  • Key results

chromatin remodelling

    • stimulation causes widespread changes in chromatin accessibility, especially in B and T cells
    • average of ca. 30 000 additional peaks in stimulated cells

shared responses

    • B and T cells share significant stimulation responses, indicating common regulatory networks

genetic variation

    • identification of genetic variants that alter chromatin accessibility in specific conditions, offering clues about autoimmune mechanisms

autoimmune insights

    • heritability enrichment is strongest in stimulated immune cells, with contributions from both B and T cell lineages
    • no subset definitively drives autoimmunity
    • a variant at the TNFAIP3 locus affects NFKB1 binding in stimulated CD4+ T cells, potentially regulating autoimmune disease
    • the variant is associated with rheumatoid arthritis and ulcerative colitis

  • Implications

    • significant implications for understanding and potentially treating autoimmune diseases
    • offers a valuable resource for future research, since it provides a comprehensive map of immune cell chromatin dynamics
    • new therapeutic targets for autoimmune diseases from identifying specific variants and their effects on gene regulation (e.g. TNFAIP3)