04. Preparing input - cbg-ethz/LongSom GitHub Wiki

Input Files

LongSom is designed to run multiple samples in parallel.

Sample map

LongSom first reads a samplemap.tsv containing all sample names that should be analyzed. These SampleIDs will be used to name files and identify the sample throughout the workflow.

samplemap.tsv should look like:

sample
SampleID1
SampleID2

For each SampleID, LongSom takes two files as input: an aligned BAM file (SampleID.bam) and a file linking barcodes to their cell type annotations (SampleID.tsv).

Input directory

The input directory has to be organized as follows:

input_dir
--| samplemap.tsv
--| bam
   --| SampleID1.bam
   --| SampleID2.bam
--| barcodes
   --| SampleID1.tsv
   --| SampleID2.tsv
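
Before launching the workflow, it can help to confirm that every sample listed in samplemap.tsv has both of its input files in place. The following is a minimal sketch, not part of LongSom itself; the function names `read_sample_ids` and `missing_inputs` are hypothetical, and it assumes samplemap.tsv has the single `sample` header shown above.

```python
import csv
from pathlib import Path


def read_sample_ids(samplemap_path):
    """Return the SampleIDs listed under the 'sample' header of samplemap.tsv."""
    with open(samplemap_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        ids = []
        for row in reader:
            sample_id = (row.get("sample") or "").strip()
            if sample_id:
                ids.append(sample_id)
        return ids


def missing_inputs(input_dir):
    """List the expected BAM and barcodes files that are absent from input_dir."""
    root = Path(input_dir)
    missing = []
    for sample_id in read_sample_ids(root / "samplemap.tsv"):
        expected = (
            root / "bam" / f"{sample_id}.bam",
            root / "barcodes" / f"{sample_id}.tsv",
        )
        for path in expected:
            if not path.is_file():
                missing.append(path)
    return missing
```

Calling `missing_inputs("input_dir")` returns an empty list when the layout is complete, and otherwise the paths that still need to be added.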

BAM files

The input BAM files must be aligned and barcoded, i.e. each read must carry a CB tag (as reported by Cell Ranger, 10x Genomics).
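
One way to spot-check the CB tag is to look at a BAM file's records in SAM text form (for example, the output of `samtools view`). The sketch below is illustrative only and not part of LongSom; the helper names `cell_barcode` and `fraction_barcoded` are hypothetical. It relies on the SAM convention that optional tags follow the 11 mandatory columns and that a string-valued CB tag is written as `CB:Z:<barcode>`.

```python
def cell_barcode(sam_record):
    """Return the CB tag value of one SAM text record, or None if it has none."""
    fields = sam_record.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        if field.startswith("CB:Z:"):
            return field[len("CB:Z:"):]
    return None


def fraction_barcoded(sam_records):
    """Return the fraction of alignment records that carry a CB tag."""
    total = 0
    tagged = 0
    for record in sam_records:
        if not record.strip() or record.startswith("@"):
            continue  # skip blank lines and SAM header lines
        total += 1
        if cell_barcode(record) is not None:
            tagged += 1
    return tagged / total if total else 0.0
```

A fraction well below 1 suggests that many reads lack a barcode and the BAM file may not be suitable as input.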

Barcodes files

Barcodes files should have an Index column, containing unique barcodes, and a Cell_type column, containing the cell type annotation of each barcode:

Index			Cell_type
AAACCCATCGAGATAA	HGSOC
AAAGTGATCCAACTGA	T.cell
ACACCAAAGGTCCAGA	Fibroblast
ACATGCAGTACGGATG	HGSOC
etc.
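
The two requirements above (both columns present, barcodes unique) can be checked with a short script. This is a minimal sketch under the assumption that the file is tab-separated with a single tab between columns; `load_barcodes` is a hypothetical helper, not a LongSom function.

```python
import csv
from collections import Counter

REQUIRED_COLUMNS = ("Index", "Cell_type")


def load_barcodes(path):
    """Read a barcodes TSV into a {barcode: cell_type} dict, checking its structure."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        absent = [col for col in REQUIRED_COLUMNS if col not in header]
        if absent:
            raise ValueError(f"{path}: missing column(s) {absent}")
        rows = [(row["Index"], row["Cell_type"]) for row in reader]

    # Barcodes must be unique, so report any that occur more than once.
    counts = Counter(barcode for barcode, _ in rows)
    duplicates = sorted(b for b, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"{path}: duplicated barcode(s) {duplicates}")
    return dict(rows)
```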

LongSom compares "cancer" and "non-cancer" cells. To do so, specify in the config/config.yaml file which cell type should be treated as "cancer". All other cell types are aggregated and treated as "non-cancer". LongSom currently supports only one cancer cell type.
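
To illustrate the grouping described above, the sketch below relabels each barcode as "cancer" or "non-cancer" from its Cell_type annotation. It is not LongSom's implementation; `label_cells` and its `cancer_cell_type` argument are hypothetical names, and the actual key used in config/config.yaml is not shown here.

```python
def label_cells(barcode_to_cell_type, cancer_cell_type):
    """Map each barcode to 'cancer' or 'non-cancer' given one cancer cell type."""
    if cancer_cell_type not in set(barcode_to_cell_type.values()):
        # A typo in the cancer cell type would silently label every cell non-cancer.
        raise ValueError(f"no cell is annotated as {cancer_cell_type!r}")
    return {
        barcode: "cancer" if cell_type == cancer_cell_type else "non-cancer"
        for barcode, cell_type in barcode_to_cell_type.items()
    }


annotations = {
    "AAACCCATCGAGATAA": "HGSOC",
    "AAAGTGATCCAACTGA": "T.cell",
    "ACACCAAAGGTCCAGA": "Fibroblast",
    "ACATGCAGTACGGATG": "HGSOC",
}
labels = label_cells(annotations, cancer_cell_type="HGSOC")
```

With the example annotations, the two HGSOC barcodes are labelled "cancer", while the T.cell and Fibroblast barcodes are pooled as "non-cancer".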