Output of ScSmOP

ScSmOP stores its processed files in four directories:

01.BarcodeIden
- Results of barcode identification
  - Result FASTQs with identified barcodes and identifier information
  - Barcoding statistics
  - Genomic material statistics
  - Library idb files
02.ReadAlign
- Results of read alignment
03.GroupAndRefine
- Results of barcode grouping and data refinement
04.QualityAssess
- Results of library statistics
  - Statistics table
  - Statistics plot, if available

File format explanation

Resulting FASTQs

There are three possible kinds of output FASTQs: DNA, RNA, and NDNR. Each kind contains n read files, corresponding to the input FASTQs (Read 1 to Read 4).

DNA FASTQs: Reads assigned to DNA based on the identified barcode.
RNA FASTQs: Reads assigned to RNA based on the identified barcode.
NDNR FASTQs: Reads that cannot be assigned to DNA or RNA based on the identified barcode.

The output follows the original FASTQ format, with the identified barcodes and identifiers stored in the name of each record.

    @SRR7216005.man.87|||DPM6D6|NYBot17_Stg|Odd2Bo30|Even2Bo22|Odd2Bo50|||COMPLEX_1
    TTTGAGGGCTGACTCTTTACTTGCACTGTCCTAGGTAGGAAGTGGAAGATAAAAAA
    +
    EEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

    @SRR7722051.1.1|||BC:AGTACCACAAGAAGAG|||COMPLEX_1
    ATTGGGGTATAAAATGAGAAAATTTTAAGTCGATTTACAAGTGTAAAAAAAATGTCAAAAAATATCACCTTATTTTTCGGAAGTGTGGGCGTGACAGTTTTGTGCGGCACGGAAGAGCACACGNCTG
    +
    <-7<J<AJAFAFFJJJ-A7--<-<AFJA--<FAFAFFFJ<F<AFFF<F-7F<FAJFAAFFFA<JJAJJJJFJ<F<F7FF<<-)7-7-))7)--FFJ<--A7FF-A)-))--)7-<-<F<)A77#F<7

    @SRR12212044.sra.3|||NYbotLigEven_D12_Stg|Even2Bo70|Odd2Bo1|Even2Bo75|Odd2Bo88|DPMPRE|DPM6bot91|||CELL_1|COMPLEX_1
    NTGCCAGGGGGTTGAATGTCTTTTTCCTTTTCTTACTAAGAATATAGTACTTGACAACACGCTGCCATTAGGAAGAAGAAAATAATCTTACGAGAAGAAA
    +
    #FFFFFFFF,,FFFFFFFFF::FFFFFFFFFFFF:FF:FFF::FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The name is split into 3 fields by |||:

Field	Content	Description
1st	Sequence name	Only the part of the original FASTQ name before the first space is retained
2nd	Identified barcode	Barcodes are separated with `|`
3rd	Identifier	Identifiers are separated with `|`

Barcoding statistics

    *** $ head BarcodeIdentification_statistic.txt ***
    COMPLEX : 7132895
    [DPM] [GENOMEA|1] | Read Count : 42888991
    [Not_Found] [GENOMEA|1] | Read Count : 1528760
    [DPM] [Not_Found] | Read Count : 0
    [Not_Found] [Not_Found] | Read Count : 0
    [Y] [ODD] [EVEN] [ODD] | Read Count : 28032533
    [Y] [ODD] [Not_Found] [Not_Found] | Read Count : 7792381
    [Y] [ODD] [EVEN] [Not_Found] | Read Count : 3383577
    [Y] [Not_Found] [Not_Found] [Not_Found] | Read Count : 2404195
    [Not_Found] [ODD] [EVEN] [ODD] | Read Count : 929331

The first several lines specify the unique group counts in the processed library.

The following lines specify the read count of each barcode permutation in the processed library. To reflect the real situation, these read counts should be adjusted according to the barcode identification algorithm.
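For downstream summaries, the statistics file can be read into two dictionaries. The sketch below is illustrative only: `parse_barcode_stats` is a hypothetical helper, and the two line layouts it recognises (`NAME : count` and `[tag] [tag] | Read Count : count`) are inferred from the sample output above.

```python
import re


def parse_barcode_stats(lines):
    """Return (group_counts, permutation_counts) from the statistics lines."""
    group_counts = {}
    permutation_counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("["):
            # Permutation line: bracketed tags, then '| Read Count : N'.
            tags_part, _, count = line.rpartition("| Read Count :")
            tags = tuple(re.findall(r"\[([^\]]*)\]", tags_part))
            permutation_counts[tags] = int(count)
        else:
            # Group line such as 'COMPLEX : 7132895'.
            name, _, count = line.partition(":")
            group_counts[name.strip()] = int(count)
    return group_counts, permutation_counts


sample = [
    "COMPLEX : 7132895",
    "[DPM] [GENOMEA|1] | Read Count : 42888991",
    "[Y] [ODD] [EVEN] [ODD] | Read Count : 28032533",
    "[Y] [ODD] [Not_Found] [Not_Found] | Read Count : 7792381",
]
groups, permutations = parse_barcode_stats(sample)
print(groups["COMPLEX"], permutations[("Y", "ODD", "EVEN", "ODD")])
```

Keying the permutations by a tuple of tags keeps permutations of different lengths (for example the DPM and the Y/ODD/EVEN tags) in one dictionary without ambiguity.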

Genomic material statistics

    *** $ head stat ***
    Total reads:44417751
    Total DNA reads:0
    Fully barcoded DNA reads:0
    Not fully barcoded DNA reads:0
    Total RNA reads:0
    Fully barcoded RNA reads:0
    Not fully barcoded RNA reads:0
    Total non-defined reads:44417751
    Fully barcoded non-defined reads:27219050
    Not fully barcoded non-defined reads:17198701

Counts of reads assigned to DNA, RNA, or NDNR, and of fully barcoded reads in each category.
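A quick sanity check on this file is that, within each category, the fully and not fully barcoded counts add up to the category total, as they do in the sample above. The following sketch assumes the `label:count` layout shown in the sample; `parse_material_stats` and `category_is_consistent` are hypothetical helper names.

```python
def parse_material_stats(lines):
    """Parse 'label:count' lines into a dict of integer counts."""
    stats = {}
    for line in lines:
        label, sep, value = line.strip().partition(":")
        if sep:  # skip lines that contain no colon
            stats[label] = int(value)
    return stats


def category_is_consistent(stats, category):
    """Check that fully + not fully barcoded reads equal the category total."""
    total = stats[f"Total {category} reads"]
    full = stats[f"Fully barcoded {category} reads"]
    partial = stats[f"Not fully barcoded {category} reads"]
    return full + partial == total


stats = parse_material_stats([
    "Total reads:44417751",
    "Total non-defined reads:44417751",
    "Fully barcoded non-defined reads:27219050",
    "Not fully barcoded non-defined reads:17198701",
])
print(category_is_consistent(stats, "non-defined"))
```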

IDB files

Please refer to the supplementary note of the paper for details of the IDB value.

    *** $ head rdSPRITE.COMPLEX.idb ***
    --- IDB value - IDB ID - IDB count in the library ---
    83886080        3639261 2
    138414081       1243089 1
    25169922        134200  1
    125833218       5075147 1

Result of Read alignment

File name	Description	Appear in
LIB_DNA.bam	DNA alignment	ChIA-Drop, SPRITE, rdSPRITE, scATAC-seq
LIB.Barcoded.Aligned.out.bam	Fully barcoded DNA alignment	scSPRITE
LIBAligned.sortedByCoord.out.bam	Fully barcoded RNA alignment	scRNA-seq, Spatial-RNA-seq, rdSPRITE
LIBSignal.Unique.str1.out.bg; LIBSignal.Unique.str2.out.bg	RNA Coverage	scRNA-seq, Spatial-RNA-seq, rdSPRITE

Filtered alignment

Alignments are filtered to retain uniquely mapped reads.

File name	Filter with samtools	Description	Appear in
LIB.F2304.q30.bam	-F 2304 -q 30	Uniquely aligned DNA with valid barcode	ChIA-Drop
LIB.PrimaryAlign.out.bam	-F 0x100	One alignment for one read	scRNA-seq, scRNA-seq in scARC-seq
LIB_DNA.F2304.q30.bam	-F 2304 -q 30	Uniquely aligned DNA with valid barcode	scATAC-seq, scATAC-seq in scARC-seq
LIB.Barcoded.Aligned.out.bam	-q 255	Uniquely aligned DNA with full set of barcodes	scSPRITE
LIB.Barcoded.UniqAlign.DNA.bam	-F 2304 -q 30	Uniquely aligned DNA with full set of barcodes	rdSPRITE, SPRITE
LIB.Barcoded.UniqAlign.RNA.bam	-q 255	Uniquely aligned RNA with full set of barcodes	rdSPRITE
LIB_P2S.bam	-	One alignment per read, with barcode and UMI information stored in the BAM tags CB and UR, respectively	scATAC-seq, scATAC-seq in scARC-seq
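The numeric filters in the table are SAM flag bit masks: 2304 is the sum of 0x100 (256, secondary alignment) and 0x800 (2048, supplementary alignment), so `-F 2304` drops both kinds of record, while `-q N` keeps alignments with mapping quality of at least N. The sketch below only illustrates the bit arithmetic; `excluded_by` is a hypothetical helper, not a samtools or ScSmOP function.

```python
SECONDARY = 0x100      # 256: secondary (not primary) alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment


def excluded_by(flag: int, exclude_mask: int) -> bool:
    """True if 'samtools view -F exclude_mask' would drop a record with this flag."""
    return (flag & exclude_mask) != 0


print(SECONDARY | SUPPLEMENTARY)   # the combined mask used in the table
print(excluded_by(16, 2304))       # reverse-strand primary alignment
print(excluded_by(256, 0x100))     # secondary alignment
```

For the `-q 255` rows, note that some aligners (STAR by default) assign mapping quality 255 to uniquely mapped reads; whether that applies here depends on the aligner ScSmOP runs for that experiment type.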

cluster and RDP.cluster

    *** $ head rdSPRITE.DNA.cluster ***
    #This file is produced by barp p2s.
    @CN     COMPLEX Count
    COMPLEX_4196353 2   chr2    96439353    96439478    chr2    96445065    96445120
    COMPLEX_4200451 2   chr1    85968959    85969031    chr1    85968959    85969031
    COMPLEX_8196    5   chr2    157832187   157832245   chr2    157883432   157883582   chr2    157832187   157832245   chr2    157832187    157832245        chr2    157883432 157883582
    COMPLEX_4204549 1   chr8    14266475    14266539
    COMPLEX_4208647 1   chr7    27303335    27303459

    *** $ head rdSPRITE.DNA.RDP.cluster ***
    # This file is produced by barp rdp
    @CN     complex COUNT
    COMPLEX_4196353 2   chr2    96439353    96439478    chr2    96445065    96445120    1   1
    COMPLEX_4200451 1   chr1    85968959    85969031    2
    COMPLEX_8196    2   chr2    157832187   157832245   chr2    157883432   157883582   3   2
    COMPLEX_4204549 1   chr8    14266475    14266539    1
    COMPLEX_4208647 1   chr7    27303335    27303459    1

    *** $ head scSPRITE.DNA.RDP.cluster ***
    # This file is produced by barp rdp
    @CN     cell    complex COUNT
    CELL_926    COMPLEX_2049    1   chr16   38350928    38350960    2
    CELL_778    COMPLEX_4098    1   chr8    9041233     9041287     1
    CELL_1192   COMPLEX_6147    1   chr10   17566046    17566141    1
    CELL_157    COMPLEX_8196    8   chr2    98667125    98667213    chr1    3037492 3037582 chr2    98667125    98667215    chr2    98666550    98666640    chr3    7058394 7058484 chrUn_GL456383  28713   28802   chr2    98666874    98666964    chr2    98662870    98662960    1   1    7   1   1   1   1   1

cluster stores the deconvolved biological groups, one group per line, in what we call inline BED format. Each line consists of 4 fields. The fields are not delimited by a dedicated symbol; instead, they are located using the count field. RDP.cluster stores the deduplicated groups; the number of duplicates of each fragment is stored in the fourth field.

Field	Content	Description
1st	Group field	The columns before the "Count" field, specifying the biological group of all the fragments in the line.
2nd	Count field	A single column specifying the number of fragments in the group.
3rd	Fragments field	The fragments in the group, each specified as chromosome, start, end.
4th	Additional field	Additional information for each fragment. If there are several kinds of additional information per fragment, each kind is recorded separately.

An example of inline BED format with two pieces of additional information for each fragment, InfoA and InfoB:

    COMPLEX_8196    2   chr2    157832187   157832245   chr2    157883432   157883582   InfoA   InfoA   InfoB   InfoB
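Because the fields are located through the count column, a parser has to know how many group columns precede it. The sketch below is one possible reading of the format, not the barp implementation: `parse_inline_bed` and `n_group_cols` are hypothetical names, and the caller is assumed to take the number of group columns from the `@CN` header line (1 for `COMPLEX`, 2 for `cell complex`).

```python
from typing import NamedTuple


class InlineBedRecord(NamedTuple):
    group: list[str]
    count: int
    fragments: list[tuple[str, int, int]]
    extra: list[list[str]]  # one inner list per kind of additional information


def parse_inline_bed(line: str, n_group_cols: int = 1) -> InlineBedRecord:
    cols = line.split()
    group = cols[:n_group_cols]
    count = int(cols[n_group_cols])
    if count <= 0:
        raise ValueError("count field must be a positive integer")
    start = n_group_cols + 1
    end = start + 3 * count
    frag_cols = cols[start:end]
    if len(frag_cols) != 3 * count:
        raise ValueError("fewer fragment columns than the count field implies")
    fragments = [
        (frag_cols[i], int(frag_cols[i + 1]), int(frag_cols[i + 2]))
        for i in range(0, 3 * count, 3)
    ]
    rest = cols[end:]
    if len(rest) % count != 0:
        raise ValueError("additional field is not a multiple of the count")
    # Each kind of additional information occupies `count` consecutive columns.
    extra = [rest[i:i + count] for i in range(0, len(rest), count)]
    return InlineBedRecord(group, count, fragments, extra)


record = parse_inline_bed(
    "COMPLEX_8196    2   chr2    157832187   157832245   "
    "chr2    157883432   157883582   InfoA   InfoA   InfoB   InfoB"
)
print(record.fragments[1], record.extra)
```

For a line from a file whose header reads `@CN cell complex COUNT`, the same function would be called with `n_group_cols=2`.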

Merged cluster

Merged_cluster.txt stores the merged fragments of ChIA-Drop in inline BED format.

Qualified fragments in scATAC-seq

LIB.Qualified.bed is the inline BED format RDP.cluster converted to BED format, with the columns: Fragment chromosome | start | end | Cell ID | Duplicate count.

    *** $ head PBMC.Qualified.bed ***
    chr10   116264702   116264748   CELL_2049   1
    chr18   23631039    23631090    CELL_4098   2
    chr5    175075551   175075637   CELL_6147   1

LIB.Qualified.Sorted.bed contains the fragments from LIB.Qualified.bed sorted by genomic coordinates.

    *** $ head PBMC.Qualified.Sorted.bed ***
    chr1    10074   10322   CELL_63 1
    chr1    10098   10334   CELL_220        1
    chr1    10152   10438   CELL_118        1
    chr1    10169   10341   CELL_185        1
    chr1    10229   10304   CELL_550        1

LIB.QualifiedFragments.bedgraph contains the coverage generated from LIB.Qualified.Sorted.bed using bedtools genomecov.

    *** $ head PBMC.QualifiedFragments.bedgraph ***
    chr1    10074   10098   1
    chr1    10098   10152   2
    chr1    10152   10169   3
    chr1    10169   10229   4
    chr1    10229   10248   5
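To see how the bedgraph values arise from the sorted fragments, the sweep-line sketch below computes coverage for intervals on a single chromosome. It is a simplified illustration and not bedtools itself: `coverage_segments` is a hypothetical function, intervals are treated as half-open, and every fragment is counted once (whether the pipeline weights by duplicate count is not stated here).

```python
def coverage_segments(intervals):
    """Return (start, end, depth) segments with depth > 0 for half-open intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # coverage rises at a fragment start
        events.append((end, -1))    # and falls at a fragment end
    events.sort()

    segments = []
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            last = segments[-1] if segments else None
            if last and last[1] == prev and last[2] == depth:
                # Merge with the previous segment when the depth is unchanged.
                segments[-1] = (last[0], pos, depth)
            else:
                segments.append((prev, pos, depth))
        depth += delta
        prev = pos
    return segments


# The five chr1 fragments shown in PBMC.Qualified.Sorted.bed above.
fragments = [(10074, 10322), (10098, 10334), (10152, 10438),
             (10169, 10341), (10229, 10304)]
for segment in coverage_segments(fragments)[:4]:
    print("chr1", *segment)
```

The first four segments agree with the first four bedgraph lines above. Later segments differ from the real file because it contains more fragments than the five shown by `head`.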

LIB.narrowPeaks contains the peaks called from LIB.QualifiedFragments.bedgraph using macs2 bdgpeakcall.

    *** $ head PBMC.narrowPeaks ***
    track type=narrowPeak name="PBMC.narrowPeaks" description="PBMC.narrowPeaks" nextItemButton=on
    chr1    778240  779288  PBMC.narrowPeaks_narrowPeak1    1550    .       0       0       0       466
    chr1    817218  817486  PBMC.narrowPeaks_narrowPeak2    320     .       0       0       0       149
    chr1    826703  826928  PBMC.narrowPeaks_narrowPeak3    70      .       0       0       0       28
    chr1    827068  827852  PBMC.narrowPeaks_narrowPeak4    1240    .       0       0       0       523

LIB.PeakCellCount.bed contains, for each cell, the number of fragments falling into each called peak.

    *** $ head PBMC.PeakCellCount.bed ***
    -- Peak coordinate --  -- Cell ID --    -- Fragment count --
    chr1    778240  779288  CELL_187        2
    chr1    778240  779288  CELL_154        2
    chr1    778240  779288  CELL_275        2
    chr1    778240  779288  CELL_38 4
    chr1    778240  779288  CELL_341        2

SubGEM

SubGEM is generated for ChIA-Drop type data in inline BED format; complexes are split into subGEMs by chromosome.

    *** $ head CHDP.ChIADrop.SubGEM ***
    --- The complexes are named as "COMPLEX_ID-nth SubGEM in the complex" ---
    COMPLEX_2049-0  1   chr2    12713870    12714100
    COMPLEX_2049-1  1   chr2R   7863021     7863695
    COMPLEX_2049-2  1   chr3L   11185707    11185889
    COMPLEX_2049-3  1   chr3R   3018595     3018777
    COMPLEX_4098-0  2   chr2R   6408379     6408606     chr2R   12027465    12027974
    COMPLEX_4098-1  4   chr3L   1490949     1491382     chr3L   4460139     4460366     chr3L   21232289    21232501    chr3L   22314225    22314879
    COMPLEX_4098-2  5   chr3R   3671537     3671764     chr3R   5063085     5063312     chr3R   9839373     9839600     chr3R   19349478    19349626    chr3R   25367853    25368080
    COMPLEX_4098-3  2   chrX    9123872     9124094     chrX    17155881    17156108
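The split by chromosome can be pictured with the following sketch. It is an assumption-laden illustration rather than the ScSmOP code: `split_into_subgems` is a hypothetical function, and numbering the subGEMs in sorted chromosome-name order is a guess that happens to be consistent with the sample above.

```python
from collections import defaultdict


def split_into_subgems(complex_id, fragments):
    """Group (chrom, start, end) fragments by chromosome into named subGEMs."""
    by_chrom = defaultdict(list)
    for chrom, start, end in fragments:
        by_chrom[chrom].append((chrom, start, end))

    subgems = []
    # Assumption: subGEMs are numbered following sorted chromosome names.
    for index, chrom in enumerate(sorted(by_chrom)):
        frags = sorted(by_chrom[chrom], key=lambda f: (f[1], f[2]))
        subgems.append((f"{complex_id}-{index}", len(frags), frags))
    return subgems


# A subset of the COMPLEX_4098 fragments from the sample, in arbitrary order.
fragments = [
    ("chrX", 17155881, 17156108),
    ("chr2R", 6408379, 6408606),
    ("chr3L", 1490949, 1491382),
    ("chr2R", 12027465, 12027974),
    ("chrX", 9123872, 9124094),
]
for name, count, frags in split_into_subgems("COMPLEX_4098", fragments):
    print(name, count, frags)
```

Each returned tuple corresponds to one inline BED line: the subGEM name, the fragment count, and the fragments themselves.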

rgn file

The rgn file is generated for visualization with ChIA-View, one fragment per line. The fourth column of the .rgn file is the unique ID of the fragment.

    *** $ head rdSPRITE.DNA.rgn ***
    --- For data that do not require SubGEM, the complexes are named as "COMPLEX_ID-Total fragment in the complex-nth fragment" ---
    chr2    96439353        96439478        COMPLEX_4196353-2-0
    chr2    96445065        96445120        COMPLEX_4196353-2-1
    chr1    85968959        85969031        COMPLEX_4200451-1-0
    chr2    157832187       157832245       COMPLEX_8196-2-0

    *** $ head CHDP.rgn ***
    --- For data that require SubGEM, the complexes are named as "COMPLEX_ID-nth SubGEM-Total fragment in the SubGEM-nth fragment" ---
    chr2L   12713870        12714100        COMPLEX_2049-0-1-0
    chr2R   7863021 7863695 COMPLEX_2049-1-1-0
    chr3L   11185707        11185889        COMPLEX_2049-2-1-0
    chr3R   3018595 3018777 COMPLEX_2049-3-1-0
    chr2R   6408379 6408606 COMPLEX_4098-0-2-0
    chr2R   12027465        12027974        COMPLEX_4098-0-2-1
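The two naming schemes can be reproduced from a group ID and its fragments as sketched below. `rgn_lines` is a hypothetical helper written to match the sample IDs above; it is not the code ScSmOP uses, and tab separation of the output columns is an assumption.

```python
def rgn_lines(complex_id, fragments, subgem_index=None):
    """Build one rgn line per (chrom, start, end) fragment.

    Without subgem_index the ID is 'COMPLEX_ID-total-n';
    with it, the ID is 'COMPLEX_ID-subgem-total-n'.
    """
    prefix = complex_id if subgem_index is None else f"{complex_id}-{subgem_index}"
    total = len(fragments)
    return [
        f"{chrom}\t{start}\t{end}\t{prefix}-{total}-{n}"
        for n, (chrom, start, end) in enumerate(fragments)
    ]


plain = rgn_lines(
    "COMPLEX_4196353",
    [("chr2", 96439353, 96439478), ("chr2", 96445065, 96445120)],
)
with_subgem = rgn_lines(
    "COMPLEX_4098",
    [("chr2R", 6408379, 6408606), ("chr2R", 12027465, 12027974)],
    subgem_index=0,
)
print("\n".join(plain + with_subgem))
```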

raw_matrix

There are at least three files in this directory, showing the results for all barcodes. The directory is not necessarily named raw_matrix; for scRNA-seq related pipelines it can be raw under Solo.out/Gene/.

File name	Description	Experiment
features.tsv	Gene information as row name of expression matrix	scRNA-seq, scRNA-seq in scARC-seq
barcodes.tsv	Barcode information as column name of expression matrix	scRNA-seq, scRNA-seq in scARC-seq
matrix.mtx	Expression matrix/peak-cell fragment count matrix in Matrix Market format	scRNA-seq, scATAC-seq, scRNA-seq in scARC-seq, scATAC-seq in scARC-seq
barcodes.tsv_BARPID.txt	Converted barcode sequence to BARP identifier ID	scRNA-seq in scARC-seq
BarcodeToBARPIDConvertTable.tsv	Pairwise conversion from barcode to BARP ID	scRNA-seq in scARC-seq
Peak	Peak information as row name of peak cell fragment count matrix	scATAC-seq, scATAC-seq in scARC-seq
Cell	Cell information as column name of peak cell fragment count matrix	scATAC-seq, scATAC-seq in scARC-seq
filtered_cells.tsv	Filtered barcodes with emptyDrops	scATAC-seq, scATAC-seq in scARC-seq

    *** $ head features.tsv ***
    --- Gene ID - Gene symbol - Gene Expression ---
    ENSG00000223972.5       DDX11L1 Gene Expression
    ENSG00000227232.5       WASH7P  Gene Expression
    ENSG00000278267.1       MIR6859-1       Gene Expression
    ENSG00000243485.5       MIR1302-2HG     Gene Expression
    ENSG00000284332.1       MIR1302-2       Gene Expression

    *** $ head barcodes.tsv ***
    AAACAGCCAAACCTAT
    AAACAGCCAAACCTTG
    AAACAGCCAAACGGGC
    AAACAGCCAAACTAAG
    AAACAGCCAAACTGCC

    *** $ head matrix.mtx ***
    --- Header line (necessary) tells the attributes of MatrixMarket ---
    --- %%MatrixMarket - object - format - field - symmetry ---
    %%MatrixMarket matrix coordinate integer general
    --- Annotation line ---
    %
    --- number of rows - number of columns - number of non-zero entries ---
    58780 303576 3981045
    --- row index - column index - value ---
    665 1 1
    2910 1 1
    3236 1 1
    6355 1 1
    6475 1 1
    17632 1 1

Detailed information about MatrixMarket (English, Chinese).
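To make the layout concrete, the sketch below reads a coordinate-format integer matrix with the standard library only. `read_mtx_coordinate` is a hypothetical function and the small matrix is invented for illustration; for real data a library reader such as `scipy.io.mmread` is the more practical choice.

```python
def read_mtx_coordinate(lines):
    """Parse coordinate-format Matrix Market text into ((rows, cols), entries).

    `entries` maps zero-based (row, column) pairs to integer values.
    """
    it = iter(lines)
    header = next(it)
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket header line")

    for line in it:
        if line.startswith("%") or not line.strip():
            continue  # comment or blank line before the size line
        n_rows, n_cols, n_entries = map(int, line.split())
        break
    else:
        raise ValueError("missing size line")

    entries = {}
    for line in it:
        if not line.strip():
            continue
        row, col, value = line.split()
        # Matrix Market indices are one-based; convert to zero-based.
        entries[(int(row) - 1, int(col) - 1)] = int(value)

    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


text = """%%MatrixMarket matrix coordinate integer general
%
4 3 3
1 1 2
3 1 1
4 2 5
"""
shape, entries = read_mtx_coordinate(text.splitlines())
print(shape, entries)
```

In the expression matrix, a row index refers to a line of features.tsv and a column index to a line of barcodes.tsv, both counted from one.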

    *** $ head barcodes.tsv_BARPID.txt ***
    CELL_16833
    CELL_60050
    CELL_149669
    CELL_174442
    CELL_120091
    CELL_119651

    *** $ head BarcodeToBARPIDConvertTable.tsv ***
    AAACAGCCAAACCTAT        CELL_16833
    AAACAGCCAAACCTTG        CELL_60050
    AAACAGCCAAACGGGC        CELL_149669
    AAACAGCCAAACTAAG        CELL_174442
    AAACAGCCAAACTGCC        CELL_120091

    *** $ head filtered_cells.tsv ***
    CELL_2585
    CELL_53
    CELL_1174
    CELL_2272
    CELL_664

filtered_matrix

There are at least three files in this directory, showing the results filtered for real cells. The directory is not necessarily named filtered_matrix; for scRNA-seq related pipelines it can be filtered under Solo.out/Gene/.

File name	Description	Experiment
features.tsv	Gene information as row name of expression matrix	scRNA-seq, scRNA-seq in scARC-seq
barcodes.tsv	Barcode information as column name of expression matrix	scRNA-seq, scRNA-seq in scARC-seq
matrix.mtx	Expression matrix/peak-cell fragment count matrix in Matrix Market format	scRNA-seq, scATAC-seq, scRNA-seq in scARC-seq, scATAC-seq in scARC-seq
barcodes.tsv_BARPID.txt	Converted barcode sequence to BARP identifier ID	scRNA-seq in scARC-seq
BarcodeToBARPIDConvertTable.tsv	Pairwise conversion from barcode to BARP ID	scRNA-seq in scARC-seq
Peak	Peak information as row name of peak cell fragment count matrix	scATAC-seq, scATAC-seq in scARC-seq
Cell	Cell information as column name of peak cell fragment count matrix	scATAC-seq, scATAC-seq in scARC-seq
results.tsv	Pairwise cell-to-cell count matrix of shared transposition events (scATAC-seq algorithm)	scATAC-seq, scATAC-seq in scARC-seq

-	CELL_10	CELL_11	CELL_524	CELL_13	CELL_39	CELL_101	CELL_105	CELL_106	CELL_108	CELL_958
CELL_10	31758	200	142	331	118	294	284	344	387	32
CELL_11	200	2540	45	93	23	77	110	98	91	6
CELL_524	142	45	2656	68	15	64	75	96	64	527
CELL_13	331	93	68	5106	65	100	177	203	191	11
CELL_39	118	23	15	65	786	41	54	50	69	3
CELL_101	294	77	64	100	41	3214	148	175	128	18
CELL_105	284	110	75	177	54	148	5204	225	148	13
CELL_106	344	98	96	203	50	175	225	4784	159	24
CELL_108	387	91	64	191	69	128	148	159	6388	17
CELL_958	32	6	527	11	3	18	13	24	17	78

ScSmOP Standard Output - ZhengmzLab/ScSmOP GitHub Wiki
