juicer tools - BiocottonHub/BioSoftware GitHub Wiki

juicer-tools

juicer-tools works on the .hic files produced by Juicer; its subcommands are used to annotate TADs and chromatin loops.

HiCCUPS
HiCCUPS Diff
Eigenvector

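1. Parameters in detail

For the CPU version, run the following command to print the help text: java -Xms512m -Xmx2048m -jar scripts/common/juicer_tools.jar -h

1.1 The pre subcommand

pre converts other file types into a .hic file. The example below uses the test data provided by the authors (test_data); the resulting .hic file can be visualized directly in the graphical package Juicebox.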

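# Download the test data
cd data
wget -c https://github.com/aidenlab/juicer/wiki/data/test.txt.gz
# Return to the parent directory so the relative paths below resolve
cd ..
## Run pre to generate the .hic file
java -Xms512m -Xmx2048m -jar scripts/common/juicer_tools.jar pre data/test.txt.gz data/test.hic hg19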

1.1.1 Input file format

The usual input is a space-delimited text file with the following fields:

readname str1 chr1 pos1 frag1 str2 chr2 pos2 frag2 mapq1 mapq2

str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
mapq = mapping quality score
cigar = cigar string as reported by aligner
sequence = DNA sequence
score = the score imputed to this read

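If no restriction-site information is provided, the frag fields are ignored. If no filtering threshold is specified, the mapq fields are ignored. Read names and strand information are not stored in the final .hic file.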
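To make the field order concrete, here is a minimal Python sketch that parses one line of the 11-field layout shown above. The `Contact` type, the helper names, and the sample line are hypothetical and exist only for illustration; the cigar, sequence, and score fields are not part of this layout and are not handled.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    """One read pair in the 11-field layout listed above (illustrative type)."""
    readname: str
    str1: int
    chr1: str
    pos1: int
    frag1: int
    str2: int
    chr2: str
    pos2: int
    frag2: int
    mapq1: int
    mapq2: int


def parse_contact(line: str) -> Contact:
    """Split a whitespace-delimited line and convert the numeric fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    (readname, str1, chr1, pos1, frag1,
     str2, chr2, pos2, frag2, mapq1, mapq2) = fields
    return Contact(readname, int(str1), chr1, int(pos1), int(frag1),
                   int(str2), chr2, int(pos2), int(frag2),
                   int(mapq1), int(mapq2))


def is_forward(strand: int) -> bool:
    """Strand is 0 for forward, anything else for reverse."""
    return strand == 0


# Made-up example line, not taken from the test data.
example = parse_contact("read1 0 1 10000 0 16 1 250000 1 30 42")
print(example.chr1, example.pos1, is_forward(example.str1), is_forward(example.str2))
```

A checker like this can be useful for catching malformed lines before handing a large file to pre.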

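1.2 The dump subcommand

dump takes eight arguments and extracts the corresponding sparse matrix from a .hic file.

1.2.1 Arguments

observed/oe: matrix type
NONE/VC/VC_SQRT/KR: normalization method
.hic file(s): more than one may be given, and the program merges them automatically
First chromosome: a chromosome name or number, optionally followed by a range x1:x2
Second chromosome
BP/FRAG: resolution unit, defining bins by base pairs or by restriction fragments
Bin size: nine BP resolutions are available
Optional flags: -v -d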

BP, this is one of <2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000>

FRAG this must be one of <500, 200, 100, 50, 20, 5, 2, 1>
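The constraints listed above can be checked before starting Java. The following is a rough Python sketch that assembles a dump command in the argument order given above. It takes the two bin-size lists at face value; the function name `build_dump_command` and the default jar path are assumptions made for this example, and the optional -v and -d flags are left out.

```python
BP_SIZES = {2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000}
FRAG_SIZES = {500, 200, 100, 50, 20, 5, 2, 1}


def build_dump_command(matrix, norm, hic_file, chr1, chr2, unit, binsize,
                       outfile=None, jar="scripts/common/juicer_tools.jar"):
    """Return the argument list for a juicer_tools dump call (hypothetical helper)."""
    if matrix not in {"observed", "oe"}:
        raise ValueError(f"unknown matrix type: {matrix}")
    if norm not in {"NONE", "VC", "VC_SQRT", "KR"}:
        raise ValueError(f"unknown normalization: {norm}")
    allowed = {"BP": BP_SIZES, "FRAG": FRAG_SIZES}
    if unit not in allowed:
        raise ValueError(f"unit must be BP or FRAG, got {unit}")
    if binsize not in allowed[unit]:
        raise ValueError(f"{binsize} is not a listed {unit} bin size")
    cmd = ["java", "-jar", jar, "dump", matrix, norm, hic_file,
           str(chr1), str(chr2), unit, str(binsize)]
    if outfile is not None:
        cmd.append(outfile)
    return cmd


cmd = build_dump_command("observed", "KR", "data/test.hic", 1, 1, "BP", 2500000, "chr1.txt")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the jar and the .hic file are in place.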

Extracting a contact matrix from a .hic file:

juicer_tools dump <observed/oe> <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr1>[:x1:x2] <chr2>[:y1:y2] <BP/FRAG> <binsize> [outfile]

For example, to extract the intra-chromosomal Hi-C contacts of chr1 at 2.5 Mb resolution:

java -jar scripts/common/juicer_tools.jar dump observed KR data/test.hic 1 1 BP 2500000 chr1.txt

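The output is the chr1 contact matrix in sparse form, one line per non-empty bin pair (start of the first bin, start of the second bin, value):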

0       0       1.8131434
2500000 0       3.0028903
5000000 0       1.2578616
10000000        0       3.461795
12500000        0       1.3383466
15000000        0       0.957843
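The three-column output above is easier to inspect once it is expanded into a dense matrix. The sketch below does that in plain Python, assuming an intra-chromosomal dump where each bin pair appears once, so every value is mirrored across the diagonal. The function name `sparse_to_dense` is made up for this example.

```python
def sparse_to_dense(lines, binsize):
    """Expand 'start1 start2 value' rows into a symmetric dense matrix."""
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        start1, start2, value = line.split()
        entries.append((int(start1) // binsize, int(start2) // binsize, float(value)))
    size = max(max(i, j) for i, j, _ in entries) + 1
    dense = [[0.0] * size for _ in range(size)]
    for i, j, value in entries:
        dense[i][j] = value
        dense[j][i] = value  # assumption: only one triangle is written
    return dense


# The six rows shown above.
rows = """0       0       1.8131434
2500000 0       3.0028903
5000000 0       1.2578616
10000000        0       3.461795
12500000        0       1.3383466
15000000        0       0.957843""".splitlines()

matrix = sparse_to_dense(rows, 2500000)
print(len(matrix), matrix[1][0], matrix[0][1], matrix[3][0])
```

Bin pairs that are missing from the dump, such as the one starting at 7500000 here, stay at 0.0.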

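1.3 Feature annotation

Annotation of Hi-C results is split across several subcommands, all of which operate on .hic files. If you do not have a .hic file yet, use pre to build one from the alignment output.

1.3.1 Arrowhead

Arrowhead finds contact domains (interacting regions) on chromatin and writes a file with 12 columns.

Basic usage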

arrowhead [-c chromosome(s)] [-m matrix size] [-r resolution] [--threads num_threads]
		[-k normalization (NONE/VC/VC_SQRT/KR)] <HiC file> 
		<output_file> [feature_list] [control_list]

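<HiC file>: input file; the name must end in .hic
<output_file>: output file, which can be loaded directly into Juicebox for visualization
[feature_list]: a loops/domains file; if provided, the corresponding scores are computed
[control_list]: a background loops/domains file
-c chromosome(s); several may be given, e.g. chr1,chr2
-m size of the sliding window used to search for domains; must be an even number
-r resolution
-k normalization method <NONE/VC/VC_SQRT/KR>
--threads number of threads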

1.3.2 HiCCUPS

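HiCCUPS finds chromatin loops and is one of the most frequently used tools in Hi-C analysis. After a successful run, the output directory contains the following results: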

.../outputDirectory/
.../outputDirectory/merged_loops (the final looplist - this is likely what you'll use)
.../outputDirectory/enriched_pixels_10000
.../outputDirectory/enriched_pixels_5000 (contains raw enriched pixels from GPU)
.../outputDirectory/fdr_thresholds_10000 
.../outputDirectory/fdr_thresholds_5000 (threshold values used to calculate enrichment)
.../outputDirectory/postprocessed_pixels_10000
.../outputDirectory/postprocessed_pixels_5000 (clustered pixels for each resolution)

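When no parameters are specified, the software picks one of two parameter sets, high resolution or low resolution, according to the density of the Hi-C map.

Basic usage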

hiccups [-m matrixSize] [-c chromosome(s)] [-r resolution(s)] [--threads num_threads]
		[-k normalization (NONE/VC/VC_SQRT/KR)] [-f fdr] 
		[-p peak width] [-i window] [-t thresholds] 
		[-d centroid distances] <HiC file> <outputDirectory> [specified_loop_list]

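<HiC file>: input .hic file
<outputDirectory>: output directory
[specified_loop_list]: a list of given loops; the tool computes the enrichment of these loops. To save time, the CPU version only searches near the diagonal
-m size of the matrix passed to the GPU; must be an even number greater than 40, and is also limited by the GPU model
-c chromosome(s)
-r resolution(s); several may be given at once
-k normalization method
-f FDR threshold(s); when several are given, each corresponds to one resolution
-p peak width used for enriched pixels
-i window width used to find enriched pixels
-t thresholds for merging loops found at different resolutions, e.g. 0.02,1.5,1.75,2
-d centroid distance within which pixels are merged
--threads number of threads

Example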

hiccups -m 500 -r 5000,10000 -f 0.1,0.1 -p 4,2 -i 7,5 -d 20000,20000 -c 22 HIC006.hic all_hiccups_loops
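In the example above, every comma-separated option carries one value per resolution. The parameter list states this pairing only for -f; the sketch below assumes the same holds for -p, -i and -d, which is consistent with the example but is an assumption. The function `per_resolution_settings` and its keyword names are hypothetical.

```python
def per_resolution_settings(resolutions, **options):
    """Pair each comma-separated option value with its resolution."""
    res_list = [int(r) for r in resolutions.split(",")]
    settings = {res: {} for res in res_list}
    for name, raw in options.items():
        values = raw.split(",")
        if len(values) != len(res_list):
            raise ValueError(
                f"{name} has {len(values)} values for {len(res_list)} resolutions")
        for res, value in zip(res_list, values):
            settings[res][name] = float(value)
    return settings


# Values copied from the example command above.
settings = per_resolution_settings(
    "5000,10000",
    fdr="0.1,0.1",
    peak_width="4,2",
    window="7,5",
    centroid_distance="20000,20000",
)
for resolution, values in settings.items():
    print(resolution, values)
```

Checking the list lengths up front catches a mismatched option before a long run starts.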

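1.3.3 CPU HiCCUPS (reduced version)

The CPU version is still experimental, so the GPU version of HiCCUPS is preferred. The loops found by the two versions differ somewhat: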

Using the cohesin-degron maps from Rao et al. 2017, regular GPU-based HiCCUPS finds 3444 loops in the untreated megamap and 350 loops in the treated megamap. CPU-based HiCCUPS (and restricted GPU-based HiCCUPS) finds 3300 loops in the untreated megamap and 239 loops in the treated megamap.

1.3.4 HiCCUPSDiff

HiCCUPSDiff finds the loops that differ between two loop lists.

Basic usage

hiccupsdiff 
[-m matrixSize]
[-k normalization (NONE/VC/VC_SQRT/KR)]
[-c chromosome(s)] 
[-f fdr] 
[-p peak width]
[-i window] 
[-t thresholds] 
[-d centroid distances]
<firstHicFile>
<secondHicFile> 
<firstLoopList>
<secondLoopList> <outputDirectory>

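firstHicFile: the first .hic file
secondHicFile: the second .hic file
firstLoopList: the HiCCUPS result for the first .hic file
secondLoopList: the HiCCUPS result for the second .hic file

1.3.5 Eigenvector

Eigenvector is used to delineate compartments in low-resolution Hi-C data. The eigenvector is the first principal component of the Pearson correlation matrix.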

# Compute the eigenvector for chromosome 1 at 1 Mb resolution with KR normalization and print the result to standard output
java -jar juicer_tools.jar eigenvector KR HIC001.hic 1 BP 1000000
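To illustrate the idea rather than Juicer's actual implementation, the following pure-Python sketch builds a Pearson correlation matrix from a toy observed/expected matrix and estimates its leading eigenvector by power iteration, used here as a simple stand-in for the first principal component. The toy values and function names are invented for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length rows (rows must not be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def pearson_matrix(matrix):
    """Correlate every row with every other row."""
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Toy O/E matrix: bins 0-1 interact strongly with each other, as do bins 2-3.
toy_oe = [
    [2.0, 1.8, 0.5, 0.4],
    [1.8, 2.0, 0.6, 0.5],
    [0.5, 0.6, 2.0, 1.7],
    [0.4, 0.5, 1.7, 2.0],
]
eigenvector = leading_eigenvector(pearson_matrix(toy_oe))
print(["+" if x > 0 else "-" for x in eigenvector])
```

Bins with the same sign fall into the same compartment. The sign itself is arbitrary, so assigning A or B needs extra information such as gene density.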

1.3.6 Pearsons

Pearsons computes the Pearson correlation matrix within a chromosome. The finer the resolution, the longer the computation takes.

java -jar juicer_tools.jar pearsons KR HIC001.hic 1 BP 1000000

1.3.7 APA

APA (Aggregate Peak Analysis) measures how strongly peaks are enriched in aggregate, based on the contact matrix.
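As a rough illustration of the aggregation idea, and not of the juicer_tools implementation, the sketch below sums fixed-size windows centered on each peak pixel and compares the central pixel with the mean of the lower-left corner. All names, the window sizes, and the toy matrix are assumptions chosen for the example.

```python
def aggregate_windows(matrix, peaks, half_width):
    """Sum the (2*half_width+1)-square windows centered on each peak pixel."""
    size = 2 * half_width + 1
    total = [[0.0] * size for _ in range(size)]
    n = len(matrix)
    for row, col in peaks:
        # Skip peaks whose window would run off the edge of the matrix.
        if min(row, col) - half_width < 0 or max(row, col) + half_width >= n:
            continue
        for dr in range(size):
            for dc in range(size):
                total[dr][dc] += matrix[row - half_width + dr][col - half_width + dc]
    return total


def peak_to_lower_left(aggregate, corner):
    """Ratio of the central pixel to the mean of the lower-left corner block."""
    size = len(aggregate)
    center = aggregate[size // 2][size // 2]
    cells = [aggregate[r][c] for r in range(size - corner, size) for c in range(corner)]
    return center / (sum(cells) / len(cells))


# Toy 12x12 matrix with a flat background of 1.0 and two enriched pixels.
toy = [[1.0] * 12 for _ in range(12)]
toy[2][6] = 5.0
toy[3][9] = 3.0
aggregate = aggregate_windows(toy, [(2, 6), (3, 9)], half_width=2)
ratio = peak_to_lower_left(aggregate, corner=2)
print(aggregate[2][2], ratio)
```

A ratio well above 1 indicates that the peak pixels are enriched over their local background once all windows are stacked.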