Week 2 Task A1 & 2 Explanation by Seda - gy315-K/REAL_FORKED_abT-Tact-cells-Team2 GitHub Wiki

📘 Task A1 & A2: Distance to TSS and Signal Intensity — Full Explanation for Team

This document explains everything we did in Tasks A1 and A2.


🧭 Goal

We want to find out whether the distance of ATAC-seq peaks to the nearest Transcription Start Site (TSS) has any relationship with their accessibility signal. This involves two tasks:

  • Task A1: Compute distance of each peak to the closest TSS.

  • Task A2: Analyze the relationship between peak signal and distance to TSS.


✅ Task A1 – Compute Distance to TSS (TSS Assignment)

Step 1: Load the RefFlat Gene Annotation File

ref = pd.read_csv("refFlat", sep="\t", header=None)
  • This file tells us where all genes start and end.

  • It doesn't come with column names, so we add them manually:

ref.columns = [
    "gene_name", "transcript_name", "chrom", "strand",
    "tx_start_5prime", "tx_start_3prime",
    "cds_start", "cds_end",
    "exon_count", "exon_starts", "exon_ends"
]

Step 2: Extract the Strand-Aware TSS

ref["tss"] = ref.apply(
    lambda row: row["tx_start_5prime"] if row["strand"] == "+" else row["tx_start_3prime"],
    axis=1
)
  • Genes can be on the forward (+) or reverse (-) strand.

  • Depending on the strand, the "start" of the transcript changes.

Step 3: Create TSS Table (Chrom + TSS)

tss_table = ref[["chrom", "tss"]].copy()
  • We'll match peaks to TSSs by chromosome.


Step 4: Load the ATAC-seq Peaks

peaks = pd.read_csv("ATAC-seq/refined_ATAC.csv")
  • This contains all the ATAC-seq peaks from ImmGen.

  • We define the peak "center" as its Summit:

peaks["peak_center"] = peaks["Summit"]
  • Then keep only useful columns for now:

peaks = peaks[["ImmGenATAC1219.peakID", "chrom", "peak_center"]].copy()

Step 5: Build Fast Lookup for Each Chromosome

tss_dict = {
    chrom: np.sort(group["tss"].values)
    for chrom, group in tss_table.groupby("chrom")
}
  • This creates a dictionary: for each chromosome, we get a sorted array of all TSSs.

Step 6: Find Closest TSS (Using Binary Search)

def fast_closest_tss(chrom, center):
    if chrom not in tss_dict:
        return np.nan
    tss_array = tss_dict[chrom]
    idx = np.searchsorted(tss_array, center)
    if idx == 0:
        return tss_array[0]
    elif idx == len(tss_array):
        return tss_array[-1]
    else:
        left = tss_array[idx - 1]
        right = tss_array[idx]
        return left if abs(center - left) < abs(center - right) else right

Step 7: Apply Closest-TSS Function to All Peaks

peaks["closest_tss"] = peaks.apply(
    lambda row: fast_closest_tss(row["chrom"], row["peak_center"]),
    axis=1
)

peaks["distance_to_tss"] = np.abs(peaks["peak_center"] - peaks["closest_tss"])

  • Now we have a new column distance_to_tss for each peak.

Step 8: Histogram of Distances

plt.hist(peaks["distance_to_tss"], bins=100, range=(0, 600000), color='hotpink')
  • This shows the distribution of distances. Most peaks are near a TSS, but some are far.


✅ Task A2 – Analyze Distance vs Signal Strength

Step 1: Load Full Peak Table Again

peaks_full = pd.read_csv("ATAC-seq/refined_ATAC.csv")

Step 2: Average the Signal Per Peak

peaks_mean = peaks_full.groupby("ImmGenATAC1219.peakID").agg({
    "Signal": "mean",
    "Summit": "first",
    "chrom": "first"
}).reset_index()
  • Rename for clarity:

peaks_mean.rename(columns={
    "Signal": "mean_signal",
    "Summit": "peak_center"
}, inplace=True)

Step 3: Clean Columns and Reset Index

peaks_mean_clean = peaks_mean.drop(columns=[...], errors='ignore')
peaks_mean_clean = peaks_mean_clean.reset_index(drop=True)

Step 4: Merge Distance Values

peaks_mean_clean = peaks_mean_clean.merge(
    peaks[["ImmGenATAC1219.peakID", "distance_to_tss"]],
    on="ImmGenATAC1219.peakID",
    how="left"
)
  • This brings in the distance_to_tss values from Task A1 to match with signal values.


Step 5: Plot Regression (Signal vs Distance)

sns.regplot(
    data=sampled,
    x="distance_to_tss",
    y="mean_signal",
    scatter_kws={"s": 4, "alpha": 0.3},
    line_kws={"color": "red"}
)
  • A sharp decline is visible: the signal tends to drop as distance increases.

Step 6: Pearson Correlation

from scipy.stats import pearsonr
r, p = pearsonr(filtered["distance_to_tss"], filtered["mean_signal"])
r_squared = r ** 2

Results:

Pearson r    = -0.1381
R-squared    = 0.0191
p-value      = 0.0000
  • The correlation is statistically significant but very weak.

Interpretation: The farther a peak is from a TSS, the weaker its signal on average, but the effect is tiny. Most peaks still vary a lot regardless of TSS proximity.



🎓 Final Thoughts

  • The method is biologically sound.

  • The correlation is real but weak: regulatory peaks are not always close to TSSs.

  • You now have distance_to_tss and mean_signal ready for other analyses!

📈 What Does ATAC-seq Signal Depend On?

In ATAC-seq data, signal reflects how accessible a region of DNA is to enzymatic cleavage and sequencing. It’s influenced by both biological and technical factors.


🔬 Biological Factors

  1. Chromatin Accessibility

    • The more “open” a DNA region is (i.e., not tightly wrapped around nucleosomes), the higher the signal.
    • Promoters and enhancers are typically more accessible and therefore show stronger signals.
  2. Cell Type–Specific Regulation

    • Some peaks are only accessible in specific cell types or conditions (e.g., activated T-cells vs. naive T-cells).
  3. Transcription Factor Binding

    • DNA regions bound by transcription factors (TFs) often have elevated accessibility—especially if TFs recruit coactivators that open chromatin.
  4. Histone Modifications

    • Active histone marks such as H3K27ac (enhancers) and H3K4me3 (promoters) are associated with higher ATAC-seq signals.
  5. Distance to TSS (Transcription Start Site)

    • Promoters near TSS often show strong signal, but strong enhancers can also exist far away from the TSS.
    • In your project, this distance explains only a small part of the variability in signal.

🧪 Technical Factors

  1. GC Content and Mappability

    • Regions with extreme GC content or repetitive sequences may have reduced signal due to sequencing or alignment biases.
  2. Read Depth

    • Low sequencing depth in a given region can reduce the observed signal, even if it’s biologically active.
  3. Batch Effects / Noise

    • Differences in library prep, sequencing batches, or data processing can introduce signal variability that’s not biologically meaningful.

🧠 Summary

Interpretation: The farther a peak is from a TSS, the weaker its signal on average—but this effect is very small. Most peaks vary in signal due to other biological and technical factors.

⚠️ **GitHub.com Fallback** ⚠️