ScSmOP Standard Output - ZhengmzLab/ScSmOP GitHub Wiki
ScSmOP stores the processed files in four directories:
- 01.BarcodeIden
  - Results of barcode identification
    - Result FASTQs with identified barcodes and identifier information
    - Barcoding statistics
    - Genomic material statistics
    - Library idb files
- 02.ReadAlign
- 03.GroupAndRefine
  - Results of barcode grouping and data refinement
- 04.QualityAssess
  - Result of library statistics
    - Statistic table
    - Statistic plot, if available
There are three possible kinds of output FASTQs: DNA, RNA, and NDNR. Each kind has n read files corresponding to the input FASTQs, from Read 1 to Read 4.
- DNA FASTQs: Reads sorted to DNA based on identified barcode.
- RNA FASTQs: Reads sorted to RNA based on identified barcode.
- NDNR FASTQs: Reads that cannot be sorted to DNA or RNA based on the identified barcode.
The output FASTQs follow the original FASTQ format, with the identified barcodes and identifiers stored in the name of each record.
@SRR7216005.man.87|||DPM6D6|NYBot17_Stg|Odd2Bo30|Even2Bo22|Odd2Bo50|||COMPLEX_1
TTTGAGGGCTGACTCTTTACTTGCACTGTCCTAGGTAGGAAGTGGAAGATAAAAAA
+
EEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@SRR7722051.1.1|||BC:AGTACCACAAGAAGAG|||COMPLEX_1
ATTGGGGTATAAAATGAGAAAATTTTAAGTCGATTTACAAGTGTAAAAAAAATGTCAAAAAATATCACCTTATTTTTCGGAAGTGTGGGCGTGACAGTTTTGTGCGGCACGGAAGAGCACACGNCTG
+
<-7<J<AJAFAFFJJJ-A7--<-<AFJA--<FAFAFFFJ<F<AFFF<F-7F<FAJFAAFFFA<JJAJJJJFJ<F<F7FF<<-)7-7-))7)--FFJ<--A7FF-A)-))--)7-<-<F<)A77#F<7
@SRR12212044.sra.3|||NYbotLigEven_D12_Stg|Even2Bo70|Odd2Bo1|Even2Bo75|Odd2Bo88|DPMPRE|DPM6bot91|||CELL_1|COMPLEX_1
NTGCCAGGGGGTTGAATGTCTTTTTCCTTTTCTTACTAAGAATATAGTACTTGACAACACGCTGCCATTAGGAAGAAGAAAATAATCTTACGAGAAGAAA
+
#FFFFFFFF,,FFFFFFFFF::FFFFFFFFFFFF:FF:FFF::FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The name is split into 3 fields by `|||`:
Field | Content | Description |
---|---|---|
1st | Sequence name | Only the part of the original FASTQ name before the first space is retained |
2nd | Identified barcode | Barcodes are separated with `\|` |
3rd | Identifier | Identifiers are separated with `\|` |
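The three-field layout can be unpacked with two splits. The following is a minimal sketch, assuming the separators shown in the example records above (`|||` between fields, `|` within a field); `parse_read_name` is a hypothetical helper and not part of ScSmOP.

```python
def parse_read_name(header):
    """Split a FASTQ name line into (sequence name, barcodes, identifiers)."""
    header = header.strip()
    # Drop the leading '@' that marks a FASTQ name line, if present.
    if header.startswith("@"):
        header = header[1:]
    # The three fields are separated by '|||'.
    name, barcode_field, identifier_field = header.split("|||")
    # Inside a field, individual items are separated by a single '|'.
    return name, barcode_field.split("|"), identifier_field.split("|")


name, barcodes, identifiers = parse_read_name(
    "@SRR12212044.sra.3|||NYbotLigEven_D12_Stg|Even2Bo70|Odd2Bo1|Even2Bo75"
    "|Odd2Bo88|DPMPRE|DPM6bot91|||CELL_1|COMPLEX_1"
)
print(name)         # SRR12212044.sra.3
print(identifiers)  # ['CELL_1', 'COMPLEX_1']
```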
*** $ head BarcodeIdentification_statistic.txt ***
COMPLEX : 7132895
[DPM] [GENOMEA|1] | Read Count : 42888991
[Not_Found] [GENOMEA|1] | Read Count : 1528760
[DPM] [Not_Found] | Read Count : 0
[Not_Found] [Not_Found] | Read Count : 0
[Y] [ODD] [EVEN] [ODD] | Read Count : 28032533
[Y] [ODD] [Not_Found] [Not_Found] | Read Count : 7792381
[Y] [ODD] [EVEN] [Not_Found] | Read Count : 3383577
[Y] [Not_Found] [Not_Found] [Not_Found] | Read Count : 2404195
[Not_Found] [ODD] [EVEN] [ODD] | Read Count : 929331
The first several lines specify the unique group count in the processed library.
The following lines specify the read count of each barcode permutation in the processed library. These read counts should be adjusted according to the barcode identification algorithm to recover the real situation.
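To work with the permutation lines programmatically, each one can be reduced to a tuple of labels and a count. This is a rough sketch that assumes the line layout shown in the sample above; `parse_permutation_line` and the regular expression are illustrative and not part of ScSmOP.

```python
import re

# One or more "[LABEL]" groups, then "| Read Count : N".
LINE_RE = re.compile(r"^((?:\[[^\]]+\]\s*)+)\|\s*Read Count\s*:\s*(\d+)\s*$")


def parse_permutation_line(line):
    """Return (labels, read_count), or None for lines in another layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    labels = tuple(re.findall(r"\[([^\]]+)\]", match.group(1)))
    return labels, int(match.group(2))


print(parse_permutation_line("[Y] [ODD] [EVEN] [ODD] | Read Count : 28032533"))
print(parse_permutation_line("COMPLEX : 7132895"))  # not a permutation line
```

Group-count lines such as `COMPLEX : 7132895` do not match the pattern, so the helper returns `None` for them and they can be handled separately.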
*** $ head stat ***
Total reads:44417751
Total DNA reads:0
Fully barcoded DNA reads:0
Not fully barcoded DNA reads:0
Total RNA reads:0
Fully barcoded RNA reads:0
Not fully barcoded RNA reads:0
Total non-defined reads:44417751
Fully barcoded non-defined reads:27219050
Not fully barcoded non-defined reads:17198701
The stat file reports the reads sorted to DNA, RNA, or NDNR, and the fully barcoded read count in each category.
Please refer to the supplementary note of the paper for details of the IDB value.
*** $ head rdSPRITE.COMPLEX.idb ***
--- IDB value - IDB ID - IDB count in the library ---
83886080 3639261 2
138414081 1243089 1
25169922 134200 1
125833218 5075147 1
File name | Description | Appear in |
---|---|---|
LIB_DNA.bam | DNA alignment | ChIA-Drop, SPRITE, rdSPRITE, scATAC-seq |
LIB.Barcoded.Aligned.out.bam | Fully barcoded DNA alignment | scSPRITE |
LIBAligned.sortedByCoord.out.bam | Fully barcoded RNA alignment | scRNA-seq, Spatial-RNA-seq, rdSPRITE |
LIBSignal.Unique.str1.out.bg; LIBSignal.Unique.str2.out.bg | RNA Coverage | scRNA-seq, Spatial-RNA-seq, rdSPRITE |
Alignments are filtered to retain uniquely mapped reads.
File name | Filter with samtools | Description | Appear in |
---|---|---|---|
LIB.F2304.q30.bam | -F 2304 -q 30 | Uniquely aligned DNA with valid barcode | ChIA-Drop |
LIB.PrimaryAlign.out.bam | -F 0x100 | One alignment for one read | scRNA-seq, scRNA-seq in scARC-seq |
LIB_DNA.F2304.q30.bam | -F 2304 -q 30 | Uniquely aligned DNA with valid barcode | scATAC-seq, scATAC-seq in scARC-seq |
LIB.Barcoded.Aligned.out.bam | -q 255 | Uniquely aligned DNA with full set of barcodes | scSPRITE |
LIB.Barcoded.UniqAlign.DNA.bam | -F 2304 -q 30 | Uniquely aligned DNA with full set of barcodes | rdSPRITE, SPRITE |
LIB.Barcoded.UniqAlign.RNA.bam | -q 255 | Uniquely aligned RNA with full set of barcodes | rdSPRITE |
LIB_P2S.bam | - | One alignment per read, with barcode and UMI information stored in the BAM tags CB and UR respectively | scATAC-seq, scATAC-seq in scARC-seq |
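The `-F 2304` filter in the table excludes any alignment whose SAM flag has the secondary (0x100) or supplementary (0x800) bit set, and `-q 30` drops alignments with a mapping quality below 30. The sketch below reproduces that decision on plain integers to show how the numbers combine; `passes_filter` is a hypothetical helper, not a samtools or ScSmOP function.

```python
SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment
EXCLUDE = SECONDARY | SUPPLEMENTARY  # 2304, the value passed to -F


def passes_filter(flag, mapq, exclude=EXCLUDE, min_mapq=30):
    """Mimic `samtools view -F <exclude> -q <min_mapq>` for one alignment."""
    # -F removes the alignment if ANY of the excluded bits is set.
    # -q removes the alignment if its MAPQ is below the threshold.
    return (flag & exclude) == 0 and mapq >= min_mapq


print(EXCLUDE)                 # 2304
print(passes_filter(16, 60))   # reverse-strand primary alignment, kept
print(passes_filter(272, 60))  # 256 + 16: secondary alignment, removed
```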
*** $ head rdSPRITE.DNA.cluster ***
#This file is produced by barp p2s.
@CN COMPLEX Count
COMPLEX_4196353 2 chr2 96439353 96439478 chr2 96445065 96445120
COMPLEX_4200451 2 chr1 85968959 85969031 chr1 85968959 85969031
COMPLEX_8196 5 chr2 157832187 157832245 chr2 157883432 157883582 chr2 157832187 157832245 chr2 157832187 157832245 chr2 157883432 157883582
COMPLEX_4204549 1 chr8 14266475 14266539
COMPLEX_4208647 1 chr7 27303335 27303459
*** $ head rdSPRITE.DNA.RDP.cluster ***
# This file is produced by barp rdp
@CN complex COUNT
COMPLEX_4196353 2 chr2 96439353 96439478 chr2 96445065 96445120 1 1
COMPLEX_4200451 1 chr1 85968959 85969031 2
COMPLEX_8196 2 chr2 157832187 157832245 chr2 157883432 157883582 3 2
COMPLEX_4204549 1 chr8 14266475 14266539 1
COMPLEX_4208647 1 chr7 27303335 27303459 1
*** $ head scSPRITE.DNA.RDP.cluster ***
# This file is produced by barp rdp
@CN cell complex COUNT
CELL_926 COMPLEX_2049 1 chr16 38350928 38350960 2
CELL_778 COMPLEX_4098 1 chr8 9041233 9041287 1
CELL_1192 COMPLEX_6147 1 chr10 17566046 17566141 1
CELL_157 COMPLEX_8196 8 chr2 98667125 98667213 chr1 3037492 3037582 chr2 98667125 98667215 chr2 98666550 98666640 chr3 7058394 7058484 chrUn_GL456383 28713 28802 chr2 98666874 98666964 chr2 98662870 98662960 1 1 7 1 1 1 1 1
cluster stores the deconvolved biological groups, one group per line, in what we call inline BED format. Each line is divided into 4 fields. The fields are not delimited by a dedicated symbol; instead, they are organized according to the count field. RDP.cluster stores the deduplicated groups, and the number of duplicates of each fragment is stored in the fourth field.
Field | Content | Description |
---|---|---|
1st | Group field | This field contains columns before "Count" field, specifying the biological group of all the fragments within the line. |
2nd | Count field | This field has only one column, which specifies the fragment count within the group. |
3rd | Fragments field | Fragments within the group; each fragment is specified as chromosome, start, end. |
4th | Additional field | This field stores additional information for each fragment. If there are multiple kinds of additional information per fragment, each kind is recorded separately. |
An example of inline BED format with two kinds of additional information for each fragment, InfoA and InfoB:
COMPLEX_8196 2 chr2 157832187 157832245 chr2 157883432 157883582 InfoA InfoA InfoB InfoB
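Because the fields are located by the count rather than by a delimiter, a parser needs to know how many columns make up the group field. The sketch below takes that number as a parameter (1 for `COMPLEX_...` lines, 2 for `CELL_... COMPLEX_...` lines) and assumes whitespace-separated columns; `parse_inline_bed` is a hypothetical helper, not part of ScSmOP.

```python
def parse_inline_bed(line, n_group_cols=1):
    """Split one inline BED line into (group, count, fragments, additional)."""
    cols = line.split()
    group = cols[:n_group_cols]
    count = int(cols[n_group_cols])
    # The fragments field holds `count` triples of chromosome, start, end.
    frag_start = n_group_cols + 1
    frag_end = frag_start + 3 * count
    fragments = [
        (cols[i], int(cols[i + 1]), int(cols[i + 2]))
        for i in range(frag_start, frag_end, 3)
    ]
    # Whatever is left is additional information, stored kind by kind
    # with `count` values per kind.
    extra = cols[frag_end:]
    additional = [extra[i:i + count] for i in range(0, len(extra), count)] if count else []
    return group, count, fragments, additional


group, count, fragments, additional = parse_inline_bed(
    "COMPLEX_8196 2 chr2 157832187 157832245 chr2 157883432 157883582 "
    "InfoA InfoA InfoB InfoB"
)
print(fragments)   # [('chr2', 157832187, 157832245), ('chr2', 157883432, 157883582)]
print(additional)  # [['InfoA', 'InfoA'], ['InfoB', 'InfoB']]
```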
Merged_cluster.txt stores the information of merged fragments in ChIA-Drop in inline BED format.
LIB.Qualified.bed is the inline BED format RDP.cluster converted to BED format, with the columns: Fragment chromosome | start | end | Cell ID | Duplicate count.
*** $ head PBMC.Qualified.bed ***
chr10 116264702 116264748 CELL_2049 1
chr18 23631039 23631090 CELL_4098 2
chr5 175075551 175075637 CELL_6147 1
LIB.Qualified.Sorted.bed contains the fragments from LIB.Qualified.bed sorted by genomic coordinates.
*** $ head PBMC.Qualified.Sorted.bed ***
chr1 10074 10322 CELL_63 1
chr1 10098 10334 CELL_220 1
chr1 10152 10438 CELL_118 1
chr1 10169 10341 CELL_185 1
chr1 10229 10304 CELL_550 1
LIB.QualifiedFragments.bedgraph is the coverage generated from LIB.Qualified.Sorted.bed using bedtools genomecov.
*** $ head PBMC.QualifiedFragments.bedgraph ***
chr1 10074 10098 1
chr1 10098 10152 2
chr1 10152 10169 3
chr1 10169 10229 4
chr1 10229 10248 5
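The relationship between the sorted fragments and the bedgraph can be illustrated with a small sweep over fragment starts and ends. This is only a sketch of the idea for a single chromosome, not the bedtools implementation; `coverage` is a hypothetical helper, and the input is the five fragments shown in the PBMC.Qualified.Sorted.bed sample.

```python
def coverage(fragments):
    """Return (start, end, depth) intervals for fragments on one chromosome."""
    events = []
    for start, end in fragments:
        events.append((start, 1))   # depth rises where a fragment starts
        events.append((end, -1))    # depth falls where a fragment ends
    events.sort()

    intervals = []
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            last = intervals[-1] if intervals else None
            if last and last[1] == prev and last[2] == depth:
                # Extend the previous interval when the depth is unchanged.
                intervals[-1] = (last[0], pos, depth)
            else:
                intervals.append((prev, pos, depth))
        depth += delta
        prev = pos
    return intervals


cov = coverage([(10074, 10322), (10098, 10334), (10152, 10438),
                (10169, 10341), (10229, 10304)])
for start, end, depth in cov[:4]:
    print("chr1", start, end, depth)
```

The first four intervals match the first four bedgraph lines above. Later intervals differ from the real file because the library contains more fragments than the five used here.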
LIB.narrowPeaks contains the peaks called from LIB.QualifiedFragments.bedgraph using macs2 bdgpeakcall.
*** $ head PBMC.narrowPeaks ***
track type=narrowPeak name="PBMC.narrowPeaks" description="PBMC.narrowPeaks" nextItemButton=on
chr1 778240 779288 PBMC.narrowPeaks_narrowPeak1 1550 . 0 0 0 466
chr1 817218 817486 PBMC.narrowPeaks_narrowPeak2 320 . 0 0 0 149
chr1 826703 826928 PBMC.narrowPeaks_narrowPeak3 70 . 0 0 0 28
chr1 827068 827852 PBMC.narrowPeaks_narrowPeak4 1240 . 0 0 0 523
LIB.PeakCellCount.bed stores, for each cell, the count of fragments that fall into each called peak.
*** $ head PBMC.PeakCellCount.bed ***
-- Peak coordinate -- -- Cell ID -- -- Fragment count --
chr1 778240 779288 CELL_187 2
chr1 778240 779288 CELL_154 2
chr1 778240 779288 CELL_275 2
chr1 778240 779288 CELL_38 4
chr1 778240 779288 CELL_341 2
SubGEM is generated for ChIA-Drop type data in inline BED format. Complexes are split into subGEMs by chromosome.
*** $ head CHDP.ChIADrop.SubGEM ***
--- The complexes are named as "COMPLEX_ID-nth SubGEM in the complex" ---
COMPLEX_2049-0 1 chr2 12713870 12714100
COMPLEX_2049-1 1 chr2R 7863021 7863695
COMPLEX_2049-2 1 chr3L 11185707 11185889
COMPLEX_2049-3 1 chr3R 3018595 3018777
COMPLEX_4098-0 2 chr2R 6408379 6408606 chr2R 12027465 12027974
COMPLEX_4098-1 4 chr3L 1490949 1491382 chr3L 4460139 4460366 chr3L 21232289 21232501 chr3L 22314225 22314879
COMPLEX_4098-2 5 chr3R 3671537 3671764 chr3R 5063085 5063312 chr3R 9839373 9839600 chr3R 19349478 19349626 chr3R 25367853 25368080
COMPLEX_4098-3 2 chrX 9123872 9124094 chrX 17155881 17156108
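Splitting by chromosome amounts to grouping the fragments of a complex by their chromosome and numbering the groups. The sketch below numbers subGEMs in order of first appearance, which is an assumption about the ordering; `split_into_subgems` is a hypothetical helper, and the input is reconstructed from the chr2R and chr3L fragments of COMPLEX_4098 in the sample above.

```python
def split_into_subgems(complex_id, fragments):
    """Group fragments by chromosome and name each group COMPLEX_ID-n."""
    by_chrom = {}
    for chrom, start, end in fragments:
        # dict keeps insertion order, so subGEMs follow first appearance.
        by_chrom.setdefault(chrom, []).append((chrom, start, end))
    return [
        (f"{complex_id}-{n}", len(frags), frags)
        for n, frags in enumerate(by_chrom.values())
    ]


subgems = split_into_subgems("COMPLEX_4098", [
    ("chr2R", 6408379, 6408606), ("chr2R", 12027465, 12027974),
    ("chr3L", 1490949, 1491382), ("chr3L", 4460139, 4460366),
    ("chr3L", 21232289, 21232501), ("chr3L", 22314225, 22314879),
])
for name, count, frags in subgems:
    print(name, count)
```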
The rgn file is generated for visualization with ChIA-View, with one fragment per line. The fourth column in the .rgn files is the unique ID of the fragment.
*** $ head rdSPRITE.DNA.rgn ***
--- For data that do not require SubGEM, the complexes are named as "COMPLEX_ID-Total fragment in the complex-nth fragment" ---
chr2 96439353 96439478 COMPLEX_4196353-2-0
chr2 96445065 96445120 COMPLEX_4196353-2-1
chr1 85968959 85969031 COMPLEX_4200451-1-0
chr2 157832187 157832245 COMPLEX_8196-2-0
*** $ head CHDP.rgn ***
--- For data that require SubGEM, the complexes are named as "COMPLEX_ID-nth SubGEM-Total fragment in the SubGEM-nth fragment" ---
chr2L 12713870 12714100 COMPLEX_2049-0-1-0
chr2R 7863021 7863695 COMPLEX_2049-1-1-0
chr3L 11185707 11185889 COMPLEX_2049-2-1-0
chr3R 3018595 3018777 COMPLEX_2049-3-1-0
chr2R 6408379 6408606 COMPLEX_4098-0-2-0
chr2R 12027465 12027974 COMPLEX_4098-0-2-1
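Both naming schemes follow the same rule: append the total fragment count and the zero-based fragment index to the group name. A minimal sketch, assuming tab-separated output columns; `group_to_rgn` is a hypothetical helper, not part of ScSmOP.

```python
def group_to_rgn(group_name, fragments):
    """Return one rgn line per fragment, with ID 'group-total-index'."""
    total = len(fragments)
    return [
        f"{chrom}\t{start}\t{end}\t{group_name}-{total}-{index}"
        for index, (chrom, start, end) in enumerate(fragments)
    ]


# Data without SubGEM: the group name is the complex ID.
plain = group_to_rgn("COMPLEX_4196353", [
    ("chr2", 96439353, 96439478), ("chr2", 96445065, 96445120),
])
# Data with SubGEM: the group name already carries the SubGEM index.
subgem = group_to_rgn("COMPLEX_4098-0", [
    ("chr2R", 6408379, 6408606), ("chr2R", 12027465, 12027974),
])
print("\n".join(plain + subgem))
```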
There are at least three files under the directory, showing the results for all barcodes. The directory is not necessarily named raw_matrix; for scRNA-seq related pipelines it can be raw under Solo.out/Gene/.
File name | Description | Experiment |
---|---|---|
features.tsv | Gene information as row name of expression matrix | scRNA-seq, scRNA-seq in scARC-seq |
barcodes.tsv | Barcode information as column name of expression matrix | scRNA-seq, scRNA-seq in scARC-seq |
matrix.mtx | Expression matrix/peak cell fragment count matrix in MatrixMarket format | scRNA-seq, scATAC-seq, scRNA-seq in scARC-seq, scATAC-seq in scARC-seq |
barcodes.tsv_BARPID.txt | Converted barcode sequence to BARP identifier ID | scRNA-seq in scARC-seq |
BarcodeToBARPIDConvertTable.tsv | Pairwise conversion from barcode to BARP ID | scRNA-seq in scARC-seq |
Peak | Peak information as row name of peak cell fragment count matrix | scATAC-seq, scATAC-seq in scARC-seq |
Cell | Cell information as column name of peak cell fragment count matrix | scATAC-seq, scATAC-seq in scARC-seq |
filtered_cells.tsv | Filtered barcodes with emptyDrops | scATAC-seq, scATAC-seq in scARC-seq |
*** $ head features.tsv ***
--- Gene ID - Gene symbol - Gene Expression ---
ENSG00000223972.5 DDX11L1 Gene Expression
ENSG00000227232.5 WASH7P Gene Expression
ENSG00000278267.1 MIR6859-1 Gene Expression
ENSG00000243485.5 MIR1302-2HG Gene Expression
ENSG00000284332.1 MIR1302-2 Gene Expression
*** $ head barcodes.tsv ***
AAACAGCCAAACCTAT
AAACAGCCAAACCTTG
AAACAGCCAAACGGGC
AAACAGCCAAACTAAG
AAACAGCCAAACTGCC
*** $ head matrix.mtx ***
--- Header line (necessary) tells the attributes of MatrixMarket ---
--- %%MatrixMarket - object - format - field - symmetry ---
%%MatrixMarket matrix coordinate integer general
--- Annotation line ---
%
--- number of rows - number of columns - number of stored (non-zero) entries ---
58780 303576 3981045
--- row index - column index - value ---
665 1 1
2910 1 1
3236 1 1
6355 1 1
6475 1 1
17632 1 1
Detailed information about MatrixMarket (English, Chinese).
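For orientation, the coordinate layout can be read with a few lines of plain Python. This is a sketch for integer coordinate files only, and the tiny matrix is a made-up example rather than ScSmOP output; `read_mtx_coordinate` is a hypothetical helper. For real matrices, a library reader such as `scipy.io.mmread` is the more practical choice.

```python
import io


def read_mtx_coordinate(handle):
    """Read an integer coordinate MatrixMarket file into a dict of entries."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("missing MatrixMarket header line")
    # Skip annotation lines, which start with '%'.
    line = handle.readline()
    while line.startswith("%"):
        line = handle.readline()
    # Size line: rows, columns, number of stored entries.
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        # Row and column indices are 1-based in MatrixMarket.
        entries[(int(row), int(col))] = int(value)
    return (n_rows, n_cols, n_entries), entries


# Hypothetical 3 x 2 matrix with three stored entries.
SAMPLE = """%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 5
3 1 2
2 2 7
"""
shape, entries = read_mtx_coordinate(io.StringIO(SAMPLE))
print(shape)            # (3, 2, 3)
print(entries[(3, 1)])  # 2
```

In the expression matrix, a row index points to a line of features.tsv and a column index to a line of barcodes.tsv.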
*** $ head barcodes.tsv_BARPID.txt ***
CELL_16833
CELL_60050
CELL_149669
CELL_174442
CELL_120091
CELL_119651
*** $ head BarcodeToBARPIDConvertTable.tsv ***
AAACAGCCAAACCTAT CELL_16833
AAACAGCCAAACCTTG CELL_60050
AAACAGCCAAACGGGC CELL_149669
AAACAGCCAAACTAAG CELL_174442
AAACAGCCAAACTGCC CELL_120091
*** $ head filtered_cells.tsv ***
CELL_2585
CELL_53
CELL_1174
CELL_2272
CELL_664
There are at least three files under the directory, showing the results filtered for real cells. The directory is not necessarily named raw_matrix; for scRNA-seq related pipelines it can be raw under Solo.out/Gene/.
File name | Description | Experiment |
---|---|---|
features.tsv | Gene information as row name of expression matrix | scRNA-seq, scRNA-seq in scARC-seq |
barcodes.tsv | Barcode information as column name of expression matrix | scRNA-seq, scRNA-seq in scARC-seq |
matrix.mtx | Expression matrix/peak cell fragment count matrix in MatrixMarket format | scRNA-seq, scATAC-seq, scRNA-seq in scARC-seq, scATAC-seq in scARC-seq |
barcodes.tsv_BARPID.txt | Converted barcode sequence to BARP identifier ID | scRNA-seq in scARC-seq |
BarcodeToBARPIDConvertTable.tsv | Pairwise conversion from barcode to BARP ID | scRNA-seq in scARC-seq |
Peak | Peak information as row name of peak cell fragment count matrix | scATAC-seq, scATAC-seq in scARC-seq |
Cell | Cell information as column name of peak cell fragment count matrix | scATAC-seq, scATAC-seq in scARC-seq |
results.tsv | Pairwise cell-to-cell matrix counting identical transposition events (scATAC-seq algorithm) | scATAC-seq, scATAC-seq in scARC-seq |
- | CELL_10 | CELL_11 | CELL_524 | CELL_13 | CELL_39 | CELL_101 | CELL_105 | CELL_106 | CELL_108 | CELL_958 |
---|---|---|---|---|---|---|---|---|---|---|
CELL_10 | 31758 | 200 | 142 | 331 | 118 | 294 | 284 | 344 | 387 | 32 |
CELL_11 | 200 | 2540 | 45 | 93 | 23 | 77 | 110 | 98 | 91 | 6 |
CELL_524 | 142 | 45 | 2656 | 68 | 15 | 64 | 75 | 96 | 64 | 527 |
CELL_13 | 331 | 93 | 68 | 5106 | 65 | 100 | 177 | 203 | 191 | 11 |
CELL_39 | 118 | 23 | 15 | 65 | 786 | 41 | 54 | 50 | 69 | 3 |
CELL_101 | 294 | 77 | 64 | 100 | 41 | 3214 | 148 | 175 | 128 | 18 |
CELL_105 | 284 | 110 | 75 | 177 | 54 | 148 | 5204 | 225 | 148 | 13 |
CELL_106 | 344 | 98 | 96 | 203 | 50 | 175 | 225 | 4784 | 159 | 24 |
CELL_108 | 387 | 91 | 64 | 191 | 69 | 128 | 148 | 159 | 6388 | 17 |
CELL_958 | 32 | 6 | 527 | 11 | 3 | 18 | 13 | 24 | 17 | 78 |