prepare - MiraldiLab/maxATAC GitHub Wiki
Prepare
The prepare function converts a .bam file of paired-end sequencing reads to a .bigwig file of Tn5 cut-site counts that are normalized for use with maxatac predict and maxatac train. This page walks through different aspects of the prepare function and how they relate to ATAC-seq data processing. The maxatac prepare function requires samtools, bedtools, pigz, and bedGraphToBigWig to be installed on your PATH.
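Before running prepare, it can help to confirm that the four external tools are discoverable. The snippet below is a minimal sketch using only the Python standard library; the helper name find_missing_tools is hypothetical and is not part of maxATAC.

```python
import shutil

# Tool names are taken from the requirements listed above.
REQUIRED_TOOLS = ("samtools", "bedtools", "pigz", "bedGraphToBigWig")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that an executable with each name exists; it does not check tool versions.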
Overview
The maxatac prepare function is a convenience function for users with data that has already been aligned with either bowtie2 or bwa-mem. The input to maxatac prepare is either a .bam file of aligned reads from bulk ATAC-seq or a scATAC-seq pseudobulk .tsv fragment file. The objective of maxatac prepare is to perform all of the quality control filtering, PCR deduplication, signal track generation, and normalization required to produce .bigwig inputs for maxATAC.
The steps that maxatac prepare performs on bulk ATAC-seq data are also available in other formats:
- snakeATAC, a snakemake pipeline for ATAC-seq to process data from SRA. This workflow covers data download from SRA and alignment.
- A common workflow language pipeline for ATAC-seq. This workflow covers data processing from .fastq files.
- A simple shell script that covers data processing from .fastq files to read-depth normalized signal tracks. Users will need to use maxatac normalize to generate a min-max normalized signal track for input into maxatac predict.
Output files
There are multiple output files from the maxatac prepare function. The main output used by maxATAC ends with minmax01.bw; this file has been min-max normalized and blacklisted.
Filename | Description |
---|---|
{name}_IS_slop20_RP20M_minmax01_chromosome_min_max.txt | Contains the minimum and maximum values per chromosome |
{name}_IS_slop20_RP20M_minmax01_genome_stats.txt | Contains genome-wide summary statistics for the input file, including the min, max, and median. |
{name}_IS_slop20_RP20M_minmax01.bw | The output file that is to be used for prediction. This file has been min-max normalized. |
{name}_IS_slop20_RP20M.bw | The read-depth normalized signal tracks. |
{name}_IS_slop20.bed.gz | The compressed bed file of individual cut sites that have been corrected for the Tn5 shift. |
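To locate these files programmatically, the filenames can be built from the prefix passed with -n. The sketch below simply fills in the templates from the table above for the default settings; the helper name expected_outputs and the dictionary keys are hypothetical, and whether the filenames change when non-default slop or RPM values are used is not covered here.

```python
from pathlib import Path

# Filename templates copied from the output table (default slop and RPM factor).
OUTPUT_TEMPLATES = {
    "chromosome_min_max": "{name}_IS_slop20_RP20M_minmax01_chromosome_min_max.txt",
    "genome_stats": "{name}_IS_slop20_RP20M_minmax01_genome_stats.txt",
    "minmax_bigwig": "{name}_IS_slop20_RP20M_minmax01.bw",
    "depth_normalized_bigwig": "{name}_IS_slop20_RP20M.bw",
    "cut_sites_bed": "{name}_IS_slop20.bed.gz",
}


def expected_outputs(output_dir, name):
    """Map a short label to the expected path of each prepare output."""
    out = Path(output_dir)
    return {key: out / template.format(name=name)
            for key, template in OUTPUT_TEMPLATES.items()}


if __name__ == "__main__":
    for label, path in expected_outputs("prepare_output_bulk", "GM12878_bulk").items():
        print(f"{label}: {path}")
```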
Required flags
- -i, --input: The input file to be processed. The input file can be either:
  - .bam: Bulk ATAC-seq BAM file.
  - .tsv: 10X scATAC fragments file. Must end in .tsv or .tsv.gz.
- -o, --output: The output directory path.
- -n, --name, --prefix, -prefix: The prefix used to build the output filenames.
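The input type is described above in terms of the file extension. As an illustration only, the following sketch classifies a path by its suffix; the function name classify_input and the returned labels are hypothetical, and maxATAC's own detection logic may differ.

```python
def classify_input(path):
    """Classify an input path by the extensions listed for -i/--input."""
    path = str(path)
    if path.endswith(".bam"):
        return "bulk_bam"
    if path.endswith((".tsv", ".tsv.gz")):
        return "scatac_fragments"
    raise ValueError(f"Unsupported input file extension: {path}")


if __name__ == "__main__":
    for example in ("sample.bam", "fragments.tsv", "fragments.tsv.gz"):
        print(example, "->", classify_input(example))
```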
Optional flags
The optional flags for the prepare function are related to tuning the preprocessing steps. The primary flags are described below and in the prepare documentation.
The default values for the optional arguments are based on the testing performed in the maxATAC publication. See the Methods of our publication for a detailed explanation of each parameter choice.
- -skip_dedup, --skip_deduplication: It is important to remove PCR duplicates from your ATAC-seq data if you have not done so already. By default, prepare performs PCR deduplication of the input BAM file. Include this flag to skip deduplication only if you know that the file has already been deduplicated. Skipping this step will speed up data processing. Default: False.
- -slop, --slop: The slop size used to smooth the signal from sparse Tn5 cut sites. Each Tn5 cut site will be extended +/- the slop size (in bp). Because maxATAC models were trained using a slop size of 20 bp (a value that approximates the size of the Tn5 transposase), this parameter should not be changed from the default (20 bp) when using the trained models provided by maxATAC. Default: 20 bp.
- -rpm, --rpm_factor: The reads per million (RPM) factor used for read-depth normalization of the signal. Most groups use RPM and therefore 1,000,000 as a scaling factor, but maxATAC uses RP20M and therefore 20,000,000 because it is approximately the median sequencing depth of the ATAC-seq data used for training. Changing from the default (20000000) is not problematic for maxATAC prediction, as this track is only used for visualization. (Predictions are made on a min-max-like normalized signal track, also an output from maxatac prepare.) Default: 20000000.
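To make the slop parameter concrete, the sketch below extends a single cut site by the slop size on each side and clips the result to the chromosome bounds. It assumes a cut site is represented as a 1 bp half-open BED interval [position, position + 1); that representation and the function name slop_cut_site are assumptions for illustration, not a description of maxATAC internals.

```python
def slop_cut_site(position, chrom_size, slop=20):
    """Extend a 1 bp cut site by `slop` bp on each side, clipped to the chromosome.

    Returns a half-open (start, end) interval in BED-style coordinates.
    """
    if not 0 <= position < chrom_size:
        raise ValueError("position must fall inside the chromosome")
    start = max(0, position - slop)
    end = min(chrom_size, position + 1 + slop)
    return start, end


if __name__ == "__main__":
    # A cut site in the middle of a chromosome becomes a 41 bp window.
    print(slop_cut_site(100, 1000))
    # A cut site near the chromosome start is clipped at 0.
    print(slop_cut_site(5, 1000))
```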
Tutorial
The prepare function can be used to prepare .bam files from bulk ATAC-seq or .tsv fragment files from scATAC-seq. An example of each approach is provided below.
Bulk ATAC-seq
To use the maxatac prepare function, you need to have aligned your reads to a reference genome. The currently supported genome is hg38. You can use BWA-mem or bowtie2 for alignment. The reads are expected to come from a paired-end library. More details about the ATAC-seq data processing implemented in the maxatac prepare function are outlined in the ATAC-seq data processing wiki page.
Example bulk command
maxatac prepare -i GM12878_bulk.bam -o prepare_bulk_output -n GM12878_bulk
Output
A truncated version of the output is shown below. The sections are broken up for explanation.
Parameters
The first few lines that are printed will lay out the run parameters that are going to be used and where the files will be saved.
_______ _____
/\|__ __|/\ / ____|
_ __ ___ __ ___ __ / \ | | / \ | |
| '_ ` _ \ / _` \ \/ / / /\ \ | | / /\ \| |
| | | | | | (_| |> < / ____ \| |/ ____ \ |____
|_| |_| |_|\__,_/_/\_\/_/ \_\_/_/ \_\_____|
[2022-08-23 17:34:16,563]
Prepare Parameters:
Input file: bulk_atac_SRX2717911_sub01.bam
Input chromosome sizes file: hg38.chrom.sizes
Tn5 cut sites will be slopped 20 bps on each side
Input blacklist file: hg38_maxatac_blacklist.bw
Output filename: GM12878_bulk
Output directory: prepare_output_bulk
Using a millions factor of: 20000000
Using 9 threads to run job.
[2022-08-23 17:34:16,565]
Generate the normalized signal tracks.
[2022-08-23 17:34:16,566]
Working on a bulk ATAC-seq BAM file
Getting the number of reads in the BAM file
There are 5262882 reads in the file
[2022-08-23 17:34:20,014]
Processing BAM to bigwig. Running eduplication
Inputs:
BAM: bulk_atac_SRX2717911_sub01.bam
Sample name: GM12878_bulk
Cores: 9
Chr Keep: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
Blacklist: hg38_maxatac_blacklist.bed
Chr Sizes: hg38.chrom.sizes
Slop Size: 20
Scale factor: 3.800199206442402
Create working directory: prepare_output_bulk
Sorting and Deduplication
If you run prepare with the default parameters, reads in the .bam file will be deduplicated. This step uses the samtools implementation of PCR duplicate removal. The output is shown below. No PCR duplicates were detected in the input file used here.
Filtering with Samtools
Samtools sort reads by name for bulk_atac_SRX2717911_sub01.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Samtools fixmate on GM12878_bulk_filtered.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Remove duplicates from GM12878_bulk_fixmate.bam
[markdup] warning: unable to calculate estimated library size. Read pairs 2631441 should be greater than duplicate pairs 0, which should both be non zero.
COMMAND: samtools markdup -@ 9 -r -s GM12878_bulk_fixmate.bam -
READ: 5262882
WRITTEN: 5262882
EXCLUDED: 0
EXAMINED: 5262882
PAIRED: 5262882
SINGLE: 0
DUPLICATE PAIR: 0
DUPLICATE SINGLE: 0
DUPLICATE PAIR OPTICAL: 0
DUPLICATE SINGLE OPTICAL: 0
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 0
DUPLICATE TOTAL: 0
ESTIMATED_LIBRARY_SIZE: 0
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Remove unwanted chr from GM12878_bulk_deduped.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
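For readers who want to reproduce a comparable deduplication outside of prepare, the sketch below builds the commonly used samtools sequence (name sort, fixmate, coordinate sort, markdup). It is an assumption that this mirrors what prepare runs; only the markdup options -r and -s appear in the log above, and the intermediate filenames and the helper dedup_commands are hypothetical. The commands are printed rather than executed.

```python
def dedup_commands(bam, name, threads=1):
    """Build (but do not run) samtools commands for PCR duplicate removal."""
    t = str(threads)
    # Intermediate filenames are illustrative only.
    by_name = f"{name}_namesorted.bam"
    fixmate = f"{name}_fixmate.bam"
    by_position = f"{name}_possorted.bam"
    deduped = f"{name}_deduped.bam"
    return [
        # Group mates together so fixmate can inspect each pair.
        ["samtools", "sort", "-n", "-@", t, "-o", by_name, bam],
        # -m adds the mate score tags that markdup relies on.
        ["samtools", "fixmate", "-m", "-@", t, by_name, fixmate],
        # markdup expects coordinate-sorted input.
        ["samtools", "sort", "-@", t, "-o", by_position, fixmate],
        # -r removes duplicates, -s prints summary statistics.
        ["samtools", "markdup", "-r", "-s", "-@", t, by_position, deduped],
    ]


if __name__ == "__main__":
    for command in dedup_commands("bulk_atac_SRX2717911_sub01.bam", "GM12878_bulk", threads=9):
        print(" ".join(command))
```

To execute the steps, each list could be passed to subprocess.run(command, check=True) once samtools is available on PATH.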
Converting to Tn5 sites and generating coverage tracks
This section shows the scale factor that was calculated for the RP20M normalization used by bedtools genomecov.
Scale factor: 3.800199206442402
Using Bedtools to convert BAM to bedgraph
Using bedGraphToBigWig to convert bedgraph to bigwig
Done!
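The scale factor in the log is consistent with dividing the RPM factor by the number of reads reported earlier (20,000,000 / 5,262,882). The sketch below shows that arithmetic; the function name depth_scale_factor is hypothetical, and treating the read count as the denominator is inferred from the log values rather than from the maxATAC source.

```python
def depth_scale_factor(n_reads, rpm_factor=20_000_000):
    """Return the multiplier that rescales raw counts to reads per `rpm_factor`."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return rpm_factor / n_reads


if __name__ == "__main__":
    # Read count taken from the bulk ATAC-seq log above.
    print(depth_scale_factor(5_262_882))
```

With the 2,000,000 Tn5 cut sites reported in the scATAC-seq example, the same arithmetic gives a factor of 10, which would be consistent with the genomic median of 10.0 and maximum of 60.0 in that log.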
Min-max normalization with maxatac normalize
The last part of the printout shows the use of maxatac normalize to min-max normalize an input bigwig file.
[2022-08-23 17:36:07,853]
Min-max normalize signal tracks
[2022-08-23 17:36:07,855]
Normalizing
Input bigwig file: /prepare_output_bulk/GM12878_bulk_IS_slop20_RP20M.bw
Output filename: /prepare_output_bulk/GM12878_bulk_IS_slop20_RP20M_minmax01.bw
Output directory: /prepare_output_bulk
Using min-max normalization
[2022-08-23 17:36:07,864]
Calculating stats per chromosome
[2022-08-23 17:36:07,864]
Calculating statistics for chr1
[2022-08-23 17:36:27,439]
...
[2022-08-23 17:42:32,200]
Calculating statistics for chr21
[2022-08-23 17:42:49,563]
Calculating statistics for chr22
[2022-08-23 17:43:09,052]
Calculating genome-wide statistics.
[2022-08-23 17:44:08,692]
Sample Statistics
Genomic minimum value: 0.0
Genomic max value: 38.00199890136719
Genomic median (non-zero): 3.8002
Genomic mean: 5.9282417
Genomic standard deviation: 6.279289
[2022-08-23 17:44:08,693]
Normalize and Write BigWig file
[2022-08-23 18:01:21,814]
Total normalization time: 0:25:13.
[2022-08-23 18:01:21,815]
Results saved to: /prepare_output_bulk
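As a rough illustration of what the normalization step does with the genome-wide statistics, the sketch below applies textbook min-max scaling, (x - min) / (max - min), to a few signal values. This is a simplified assumption: the flag documentation describes the track as "min-max-like", so the exact definition of the maximum used by maxatac normalize may differ. The function name min_max_normalize is hypothetical.

```python
def min_max_normalize(values, minimum, maximum):
    """Scale values to the [0, 1] range given a genome-wide minimum and maximum."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")
    return [(value - minimum) / span for value in values]


if __name__ == "__main__":
    # Minimum and maximum taken from the bulk sample statistics above;
    # the three signal values are made up for illustration.
    genome_min, genome_max = 0.0, 38.00199890136719
    print(min_max_normalize([0.0, 3.8002, 38.00199890136719], genome_min, genome_max))
```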
scATAC-seq
One of the outputs from the 10X CellRanger pipeline is a .tsv.gz file that represents the fragments that passed QC. You do not need a CellRanger-specific output, but the fragment file must have 4 columns: chr, start, stop, and barcode. The traditional CellRanger output has a support column in the fifth position. maxATAC does not currently use this column, and the pipeline expects a 4-column file.
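The sketch below illustrates two ideas from this section: keeping only the first four columns of a fragment record, and turning each fragment into two Tn5 cut sites (one per fragment end), which matches the 2,000,000 cut sites reported for the 1M-fragment example in the log below. The exact coordinates maxATAC assigns to each cut site are an assumption here, as is the function name fragments_to_cut_sites.

```python
import csv


def fragments_to_cut_sites(lines):
    """Yield two 1 bp cut sites (chrom, start, end, barcode) per fragment line."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header comments
        # Keep the first four columns; an extra support column is ignored.
        chrom, start, stop, barcode = row[:4]
        start, stop = int(start), int(stop)
        yield (chrom, start, start + 1, barcode)  # cut site at the fragment start
        yield (chrom, stop - 1, stop, barcode)    # cut site at the fragment end


if __name__ == "__main__":
    # Made-up records: one with 5 columns, one with 4.
    example = ["chr1\t100\t200\tAAAC-1\t3", "chr2\t50\t90\tTTTG-1"]
    for site in fragments_to_cut_sites(example):
        print(*site, sep="\t")
```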
Example scATAC command
maxatac prepare -i GM12878_fragments_subsample_1M.tsv -o prepare_output -n GM12878_scatac_1M
Outputs
Below is a truncated version of the output you should see while running.
_______ _____
/\|__ __|/\ / ____|
_ __ ___ __ ___ __ / \ | | / \ | |
| '_ ` _ \ / _` \ \/ / / /\ \ | | / /\ \| |
| | | | | | (_| |> < / ____ \| |/ ____ \ |____
|_| |_| |_|\__,_/_/\_\/_/ \_\_/_/ \_\_____|
[2022-08-23 16:25:05,087]
Prepare Parameters:
Input file: GM12878_fragments_subsample_1M.tsv
Input chromosome sizes file: ./data/hg38/hg38.chrom.sizes
Tn5 cut sites will be slopped 20 bps on each side
Input blacklist file: ./data/hg38/hg38_maxatac_blacklist.bw
Output filename: GM12878_scatac_1M
Output directory: .prepare_output
Using a millions factor of: 20000000
Using 9 threads to run job.
[2022-08-23 16:25:05,088]
Generate the normalized signal tracks.
[2022-08-23 16:25:05,088]
Working on 10X scATAC fragments file
Converting fragment files to Tn5 sites
There are 2000000 Tn5 cut sites in the file
[2022-08-23 16:25:05,845]
Getting the number of Tn5 cut sites in the fragment file
[2022-08-23 16:25:09,665]
Slopping Tn5 cut sites and generating RPM normalized bigwig
Converting bedgraph to bigwig
Compressing files
Done!
[2022-08-23 16:25:48,796]
Min-max normalize signal tracks
[2022-08-23 16:25:48,797]
Normalizing
Input bigwig file: /prepare_output/GM12878_scatac_1M_IS_slop20_RP20M.bw
Output filename: /prepare_output/GM12878_scatac_1M_IS_slop20_RP20M_minmax01.bw
Output directory: prepare_output
Using min-max normalization
[2022-08-23 16:25:48,801]
Calculating stats per chromosome
[2022-08-23 16:25:48,802]
Calculating statistics for chr1
...
[2022-08-23 16:32:01,447]
Calculating statistics for chr21
[2022-08-23 16:32:18,022]
Calculating statistics for chr22
[2022-08-23 16:32:35,487]
Calculating genome-wide statistics.
[2022-08-23 16:33:24,501]
Sample Statistics
Genomic minimum value: 0.0
Genomic max value: 60.0
Genomic median (non-zero): 10.0
Genomic mean: 13.236136
Genomic standard deviation: 9.855374
[2022-08-23 16:33:24,504]
Normalize and Write BigWig file
[2022-08-23 16:49:30,168]
Total normalization time: 0:23:41.
[2022-08-23 16:49:30,169]
Results saved to: /prepare_output
You can now use the {name}_IS_slop20_RP20M_minmax01.bw file for maxATAC predictions!