Week 2 Task A1 & 2 Explanation by Seda - gy315-K/REAL_FORKED_abT-Tact-cells-Team2 GitHub Wiki
This document explains everything we did in Tasks A1 and A2.
We want to find out whether the distance of ATAC-seq peaks to the nearest Transcription Start Site (TSS) has any relationship with their accessibility signal. This involves two tasks:
-
Task A1: Compute distance of each peak to the closest TSS.
-
Task A2: Analyze the relationship between peak signal and distance to TSS.
ref = pd.read_csv("refFlat", sep="\t", header=None)
-
This file tells us where all genes start and end.
-
It doesn't come with column names, so we add them manually:
ref.columns = [
"gene_name", "transcript_name", "chrom", "strand",
"tx_start_5prime", "tx_start_3prime",
"cds_start", "cds_end",
"exon_count", "exon_starts", "exon_ends"
]
ref["tss"] = ref.apply(
lambda row: row["tx_start_5prime"] if row["strand"] == "+" else row["tx_start_3prime"],
axis=1
)
-
Genes can be on the forward (+) or reverse (-) strand.
-
Depending on the strand, the "start" of the transcript changes.
tss_table = ref[["chrom", "tss"]].copy()
-
We'll match peaks to TSSs by chromosome.
peaks = pd.read_csv("ATAC-seq/refined_ATAC.csv")
-
This contains all the ATAC-seq peaks from ImmGen.
-
We define the peak "center" as its
Summit
:
peaks["peak_center"] = peaks["Summit"]
-
Then keep only useful columns for now:
peaks = peaks[["ImmGenATAC1219.peakID", "chrom", "peak_center"]].copy()
tss_dict = {
chrom: np.sort(group["tss"].values)
for chrom, group in tss_table.groupby("chrom")
}
-
This creates a dictionary: for each chromosome, we get a sorted array of all TSSs.
def fast_closest_tss(chrom, center):
if chrom not in tss_dict:
return np.nan
tss_array = tss_dict[chrom]
idx = np.searchsorted(tss_array, center)
if idx == 0:
return tss_array[0]
elif idx == len(tss_array):
return tss_array[-1]
else:
left = tss_array[idx - 1]
right = tss_array[idx]
return left if abs(center - left) < abs(center - right) else right
peaks["closest_tss"] = peaks.apply(
lambda row: fast_closest_tss(row["chrom"], row["peak_center"]),
axis=1
)
peaks["distance_to_tss"] = np.abs(peaks["peak_center"] - peaks["closest_tss"])
-
Now we have a new column
distance_to_tss
for each peak.
plt.hist(peaks["distance_to_tss"], bins=100, range=(0, 600000), color='hotpink')
-
This shows the distribution of distances. Most peaks are near a TSS, but some are far.
peaks_full = pd.read_csv("ATAC-seq/refined_ATAC.csv")
peaks_mean = peaks_full.groupby("ImmGenATAC1219.peakID").agg({
"Signal": "mean",
"Summit": "first",
"chrom": "first"
}).reset_index()
-
Rename for clarity:
peaks_mean.rename(columns={
"Signal": "mean_signal",
"Summit": "peak_center"
}, inplace=True)
peaks_mean_clean = peaks_mean.drop(columns=[...], errors='ignore')
peaks_mean_clean = peaks_mean_clean.reset_index(drop=True)
peaks_mean_clean = peaks_mean_clean.merge(
peaks[["ImmGenATAC1219.peakID", "distance_to_tss"]],
on="ImmGenATAC1219.peakID",
how="left"
)
-
This brings in the
distance_to_tss
values from Task A1 to match with signal values.
sns.regplot(
data=sampled,
x="distance_to_tss",
y="mean_signal",
scatter_kws={"s": 4, "alpha": 0.3},
line_kws={"color": "red"}
)
-
A sharp decline is visible: the signal tends to drop as distance increases.
from scipy.stats import pearsonr
r, p = pearsonr(filtered["distance_to_tss"], filtered["mean_signal"])
r_squared = r ** 2
Results:
Pearson r = -0.1381
R-squared = 0.0191
p-value = 0.0000
-
The correlation is statistically significant but very weak.
Interpretation: The farther a peak is from a TSS, the weaker its signal on average, but the effect is tiny. Most peaks still vary a lot regardless of TSS proximity.
-
The method is biologically sound.
-
The correlation is real but weak: regulatory peaks are not always close to TSSs.
-
You now have
distance_to_tss
andmean_signal
ready for other analyses!
In ATAC-seq data, signal reflects how accessible a region of DNA is to enzymatic cleavage and sequencing. It’s influenced by both biological and technical factors.
-
Chromatin Accessibility
- The more “open” a DNA region is (i.e., not tightly wrapped around nucleosomes), the higher the signal.
- Promoters and enhancers are typically more accessible and therefore show stronger signals.
-
Cell Type–Specific Regulation
- Some peaks are only accessible in specific cell types or conditions (e.g., activated T-cells vs. naive T-cells).
-
Transcription Factor Binding
- DNA regions bound by transcription factors (TFs) often have elevated accessibility—especially if TFs recruit coactivators that open chromatin.
-
Histone Modifications
- Active histone marks such as H3K27ac (enhancers) and H3K4me3 (promoters) are associated with higher ATAC-seq signals.
-
Distance to TSS (Transcription Start Site)
- Promoters near TSS often show strong signal, but strong enhancers can also exist far away from the TSS.
- In your project, this distance explains only a small part of the variability in signal.
-
GC Content and Mappability
- Regions with extreme GC content or repetitive sequences may have reduced signal due to sequencing or alignment biases.
-
Read Depth
- Low sequencing depth in a given region can reduce the observed signal, even if it’s biologically active.
-
Batch Effects / Noise
- Differences in library prep, sequencing batches, or data processing can introduce signal variability that’s not biologically meaningful.
Interpretation: The farther a peak is from a TSS, the weaker its signal on average—but this effect is very small. Most peaks vary in signal due to other biological and technical factors.