prepare - MiraldiLab/maxATAC GitHub Wiki

Prepare

Schematic overview

The prepare function converts a .bam file of paired-end sequencing reads to a .bigwig file of Tn5 cut sites counts that are normalized for use with maxatac predict and maxatac train. This page will provide a walkthrough of different aspects of the prepare function and how they relate to ATAC-seq data processing. The maxatac prepare function requires samtools, bedtools, pigz, and bedGraphToBigWig be installed on your PATH to run.

Overview

The maxatac prepare function is a convenience function for users with data that has already been aligned with either bowtie2 or bwa-mem. The input to maxatac prepare is a .bam file of aligned reads from bulk ATAC-seq or a scATAC-seq pseudobulk .tsv fragment files. The objective of maxatac prepare is to perform all necessary quality control filtering, PCR deduplication, signal track generation, and normalization required for .bigwig inputs to maxATAC.

The steps that are performed by maxatac prepare on bulk ATAC-seq data are also available in other formats such as:

snakeATAC, a snakemake pipeline for ATAC-seq to process data from SRA. This workflow covers data download from SRA with alignment.
A common workflow language pipeline for ATAC-seq. This workflow covers data processing from .fastq files.
A simple shell script that covers data processing from .fastq files to read-depth normalized signal tracks. Users will need to use maxatac normalize to generate a min-max normalized signal track for input into maxatac predict.

Output files

There are multiple output files from the maxatac prepare function. The main output that will be used for maxATAC ends with minmax01.bw and is the file that has been min-max normalized and blacklisted.

Filename	Description
{name}_IS_slop20_RP20M_minmax01_chromosome_min_max.txt	Contains the minimum and maximum values per chromosome
{name}_IS_slop20_RP20M_minmax01_genome_stats.txt	Contains the min, max, median, and stats on the input file.
{name}_IS_slop20_RP20M_minmax01.bw	The output file that is to be used for prediction. This file has been min-max normalized.
{name}_IS_slop20_RP20M.bw	The read-depth normalized signal tracks.
{name}_IS_slop20.bed.gz	The compressed bed file of individual cut sites that have been corrected for the Tn5 shift.

Required flags

-i, --input:The input file to be processed. The input file can be either:

.bam: Bulk ATAC-seq BAM file.

.tsv: 10X scATAC fragments file. Must end in .tsv or .tsv.gz.

-o, --output: The output directory path.
-n, --name, --prefix, -prefix: This argument is used to set the prefix for setting the filenames.

Optional flags

The optional flags for the prepare function are related to tuning the preprocessing steps. The primary flags are described below and in the prepare documentation.

The default values for the optional arguments are based on the testing performed in the maxATAC publication. See the Methods of our publication for a detailed explanation of each parameter choice.

-skip_dedup, --skip_deduplication: It is important to remove PCR duplicates from your ATAC-seq data if you have not done so already. Include this flag to perform PCR deduplication of the input BAM file if you know that it has not been deduplicated. Skipping this step will speed up data processing. Defualt: False
-slop, --slop: The slop size used to smooth sparse Tn5 cut sites' signal. Each Tn5 cut site will be extended +/- the slop size (in bp). Because maxATAC models were trained using slop size of 20bp (a value that approximates the size of Tn5 transposase), this parameter should not be changed from default (20 bp) when using the trained models provided by maxATAC. Default: 20 bp.
-rpm, --rpm_factor: The reads per million (RPM) factor used for read-depth normalization of signal. Most groups use RPM and therefore 1,000,000 as a scaling factor, but maxATAC uses RP20M and therefore 20,000,000 because it is approximately the median sequencing depth of the ATAC-seq data used for training. Changing from the default (20000000) is not problematic for maxATAC prediction, as this track is only used for visualization. (Predictions are made on a min-max-like normalized signal track, also an output from maxatac prepare.) Default: 20000000.

Tutorial

The prepare function can be used to prepare .bam files from bulk ATAC-seq or .tsv fragment files from scATAC-seq. Below an example for each approach is provided.

Bulk ATAC-seq

In order to use the maxatac prepare function, you need to have aligned your reads to a reference genome. The currently supported genome is hg38. You can use BWA-mem or bowtie2 for alignment. The reads are expected to have a paired-end library format. More details about ATAC-seq data processing implemented in the maxatac prepare function is outlined in the ATAC-seq data processing wiki page.

Example bulk command

maxatac prepare -i GM12878_bulk.tsv -o prepare_bulk_output -n GM12878_bulk

Output

A truncated version of the output is shown below. The sections are broken up for explanation.

Parameters

The first few lines that are printed will lay out the run parameters that are going to be used and where the files will be saved.

                             _______       _____ 
                          /\|__   __|/\   / ____|
 _ __ ___   __ ___  __   /  \  | |  /  \ | |     
| '_ ` _ \ / _` \ \/ /  / /\ \ | | / /\ \| |     
| | | | | | (_| |>  <  / ____ \| |/ ____ \ |____ 
|_| |_| |_|\__,_/_/\_\/_/    \_\_/_/    \_\_____|
                                                 
                                                 

[2022-08-23 17:34:16,563]
Prepare Parameters:
Input file: bulk_atac_SRX2717911_sub01.bam 
Input chromosome sizes file: hg38.chrom.sizes 
Tn5 cut sites will be slopped 20 bps on each side 
Input blacklist file: hg38_maxatac_blacklist.bw 
Output filename: GM12878_bulk 
Output directory: prepare_output_bulk 
Using a millions factor of: 20000000 
Using 9 threads to run job.
[2022-08-23 17:34:16,565]
Generate the normalized signal tracks.
[2022-08-23 17:34:16,566]
Working on a bulk ATAC-seq BAM file 
Getting the number of reads in the BAM file
There are 5262882 reads in the file
[2022-08-23 17:34:20,014]
Processing BAM to bigwig. Running eduplication
Inputs:
BAM: bulk_atac_SRX2717911_sub01.bam
Sample name: GM12878_bulk
Cores: 9
Chr Keep: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
Blacklist: hg38_maxatac_blacklist.bed
Chr Sizes: hg38.chrom.sizes
Slop Size: 20
Scale factor: 3.800199206442402
Create working directory: prepare_output_bulk

Sorting and Deduplication

If you run the default parameters, you will perform deduplication of reads on a .bam file. This process will use the samtools implementation of PCR duplicate remove. The output is shown below. The input file used did not have any PCR duplicates detected.

Filtering with Samtools
Samtools sort reads by name for bulk_atac_SRX2717911_sub01.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Samtools fixmate on GM12878_bulk_filtered.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Remove duplicates from GM12878_bulk_fixmate.bam
[markdup] warning: unable to calculate estimated library size. Read pairs 2631441 should be greater than duplicate pairs 0, which should both be non zero.
COMMAND: samtools markdup -@ 9 -r -s GM12878_bulk_fixmate.bam -
READ: 5262882
WRITTEN: 5262882
EXCLUDED: 0
EXAMINED: 5262882
PAIRED: 5262882
SINGLE: 0
DUPLICATE PAIR: 0
DUPLICATE SINGLE: 0
DUPLICATE PAIR OPTICAL: 0
DUPLICATE SINGLE OPTICAL: 0
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 0
DUPLICATE TOTAL: 0
ESTIMATED_LIBRARY_SIZE: 0
[bam_sort_core] merging from 0 files and 9 in-memory blocks...
Remove unwanted chr from GM12878_bulk_deduped.bam
[bam_sort_core] merging from 0 files and 9 in-memory blocks...

Converting to Tn5 sites and generating coverage tracks

This section shows the scale factor that was calculated for the RP20M normalization used by bedtools genomecov.

Scale factor:  3.800199206442402
Using Bedtools to convert BAM to bedgraph
Using bedGraphToBigWig to convert bedgraph to bigwig
Done!

Min-max normalization with `maxatac normalize`

The last part of the printout shows the use of maxatac normalize to min-max normalize an input bigwig file.

[2022-08-23 17:36:07,853]
Min-max normalize signal tracks
[2022-08-23 17:36:07,855]
Normalizing
  Input bigwig file: /prepare_output_bulk/GM12878_bulk_IS_slop20_RP20M.bw
  Output filename: /prepare_output_bulk/GM12878_bulk_IS_slop20_RP20M_minmax01.bw
  Output directory: /prepare_output_bulk
  Using min-max normalization
[2022-08-23 17:36:07,864]
Calculating stats per chromosome
[2022-08-23 17:36:07,864]
Calculating statistics for chr1
[2022-08-23 17:36:27,439]

...

[2022-08-23 17:42:32,200]
Calculating statistics for chr21
[2022-08-23 17:42:49,563]
Calculating statistics for chr22
[2022-08-23 17:43:09,052]
Calculating genome-wide statistics.
[2022-08-23 17:44:08,692]
Sample Statistics
  Genomic minimum value: 0.0
  Genomic max value: 38.00199890136719
  Genomic median (non-zero): 3.8002
  Genomic mean: 5.9282417
  Genomic standard deviation: 6.279289
[2022-08-23 17:44:08,693]
Normalize and Write BigWig file
[2022-08-23 18:01:21,814]
Total normalization time: 0:25:13.

[2022-08-23 18:01:21,815]
Results saved to: /prepare_output_bulk

scATAC-seq

One of the outputs from the 10X CellRanger pipeline is a .tsv.gz file that represents the fragments that passed QC. You do not need to have a CellRanger specific output, but the fragment file must be 4-columns : chr, start, stop, and barcode. The traditional CellRanger output has the support column in the 5-th column position. We currently do not use this and the maxATAC pipeline expects a 4-column file.

Example scATAC command

maxatac prepare -i GM12878_fragments_subsample_1M.tsv -o prepare_output -n GM12878_scatac_1M

Outputs

Below is a truncated output text you should see while running.

                             _______       _____ 
                          /\|__   __|/\   / ____|
 _ __ ___   __ ___  __   /  \  | |  /  \ | |     
| '_ ` _ \ / _` \ \/ /  / /\ \ | | / /\ \| |     
| | | | | | (_| |>  <  / ____ \| |/ ____ \ |____ 
|_| |_| |_|\__,_/_/\_\/_/    \_\_/_/    \_\_____|
                                                 
                                                 

[2022-08-23 16:25:05,087]
Prepare Parameters:
Input file: GM12878_fragments_subsample_1M.tsv 
Input chromosome sizes file: ./data/hg38/hg38.chrom.sizes 
Tn5 cut sites will be slopped 20 bps on each side 
Input blacklist file: ./data/hg38/hg38_maxatac_blacklist.bw 
Output filename: GM12878_scatac_1M 
Output directory: .prepare_output 
Using a millions factor of: 20000000 
Using 9 threads to run job.
[2022-08-23 16:25:05,088]
Generate the normalized signal tracks.
[2022-08-23 16:25:05,088]
Working on 10X scATAC fragments file 
 Converting fragment files to Tn5 sites
There are 2000000 Tn5 cut sites in the file
[2022-08-23 16:25:05,845]
Getting the number of Tn5 cut sites in the fragment file
[2022-08-23 16:25:09,665]
Slopping Tn5 cut sites and generating RPM normalized bigwig
Converting bedgraph to bigwig
Compressing files
Done!
[2022-08-23 16:25:48,796]
Min-max normalize signal tracks
[2022-08-23 16:25:48,797]
Normalizing
  Input bigwig file: /prepare_output/GM12878_scatac_1M_IS_slop20_RP20M.bw
  Output filename: /prepare_output/GM12878_scatac_1M_IS_slop20_RP20M_minmax01.bw
  Output directory: prepare_output
  Using min-max normalization
[2022-08-23 16:25:48,801]
Calculating stats per chromosome
[2022-08-23 16:25:48,802]
Calculating statistics for chr1

...

[2022-08-23 16:32:01,447]
Calculating statistics for chr21
[2022-08-23 16:32:18,022]
Calculating statistics for chr22
[2022-08-23 16:32:35,487]
Calculating genome-wide statistics.
[2022-08-23 16:33:24,501]
Sample Statistics
  Genomic minimum value: 0.0
  Genomic max value: 60.0
  Genomic median (non-zero): 10.0
  Genomic mean: 13.236136
  Genomic standard deviation: 9.855374
[2022-08-23 16:33:24,504]
Normalize and Write BigWig file
[2022-08-23 16:49:30,168]
Total normalization time: 0:23:41.

[2022-08-23 16:49:30,169]
Results saved to: /prepare_output

You can now use the {name}_minmax01.bw file for maxATAC predictions!