Walk through of test data - JGEnglishLab/TRE-MPRA-Pipeline GitHub Wiki

Explanation of input files

test_data

input

run1_dna

run1X42_220722_A00421_0459_AHH3JFDRX2_S42_L002_R1_001.fastq.gz run1X42_220722_A00421_0459_AHH3JFDRX2_S42_L002_R2_001.fastq.gz run1X42_220722_A00421_0459_AHH3JFDRX2_S42_L001_R1_001.fastq.gz run1X42_220722_A00421_0459_AHH3JFDRX2_S42_L001_R2_001.fastq.gz
etc.

run1_rna

run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L001_R1_001.fastq.gz
run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L002_R1_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R2_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L002_R1_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L002_R2_001.fastq.gz
etc.

spike-ins.txt
run1.tsv
run1_dna_map.tsv

Fastq files in run1_rna and run1_dna

Note that these are raw fastq files after demultiplexing. It is not necessary to split RNA and DNA fastq files into separate directories. Either option (having all fastq files in 1 directory, or splitting DNA and RNA fastq files into separate directories) is perfectly fine, and both make sense depending on the needs of the user.

Sample numbers

The sample number can be seen in each file here _S{}_. TMP will automatically look for a number following _S to detect a sample number. If your files have a different format, for example, run1_220722_A00421_0459_AHH3JFDRX2_sample42_L002_R1_001.fastq.gz you would need to use the -sr flag to specify a new regex pattern to detect sample numbers from file names. For example, using -sr sample would automatically detect 42 as the sample number for the file run1_220722_A00421_0459_AHH3JFDRX2_sample42_L002_R1_001.fastq.gz

Paired files

Note that some of these files are single-end reads, like sample 1 in run1_rna, and some paired-end reads, like sample 8 in run1_rna. TMP accepts both types of reads.

Here is an in-depth explanation of these file names

run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L001_R1_001.fastq.gz (Sample1, Lane1, Read1) run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L002_R1_001.fastq.gz (Sample1, Lane2, Read1) run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001.fastq.gz (Sample8, Lane1, Read1) run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R2_001.fastq.gz (Sample8, Lane1, Read2)

In the bolded section of each name, you can see the sample number, (S1 or S8) the lane number (L001 or L002), and the read number (R1 or R2). The first two files are single-end reads because they are both read 1 (R1) and they don't have a corresponding read 2 (R2) file. The last two files are paired-end because they are from the exact same except for their read number.

This is how TMP detects paired-end files. If every character in the file name is the same except for a 1 and 2 following an R they will be paired. For example,

run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R2_001.fastq.gz
These two files would be paired because the only difference in the file name is the 1 and 2 directly following the R

run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001.fastq.gz
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L002_R2_001.fastq.gz
These two files would not be paired because even though they are read1 and read2 files, they belong to different lanes.

It is ok if your file names do not follow all the same conventions.

In summary

  • A file that contains an "R" or "r" followed directly by a 1 will be treated as a "Read 1" or "Forward read"
  • A file that contains an "R" or "r" followed directly by a 2 will be treated as a "Read 2" or "Reverse read"
  • "Read 1" and "Read 2" files will automatically be paired if the only difference in the file name is the number following the "R" or "r"
  • If a "Read 2" file doesn't have a corresponding "Read 1" file, it will be ignored
  • If a file doesn't have an "R" or "r" followed directly by a 1 or 2. It will be treated as a "Read 1" or "Forward read" and will not be paired with another file.

run1.tsv

This tsv pairs the sample numbers and treatment names. The only two columns that are required are the sample_number and treatment. The other columns are optional and will not do anything unless you plan to use it in a downstream analysis. We (The English Lab) will use the optional columns in our data visualization.

Every number in the sample_number column must be unique.

The short names may not contain any of the following characters #_%&{}<>*?/\$'!":@+`|=, or spaces

The other columns may not include commas (,)

It is important to note that your DNA sample numbers must be labeled as DNA even if there are multiple DNA samples.

sample_number	treatment	long_name	concentration	time	cell_type
1	SF	Serum Free	350 (pg/uL)	6 (hours)	HEK
2	SF	Serum Free	350 (pg/uL)	6 (hours)	HEK
3	SF	Serum Free	350 (pg/uL)	6 (hours)	HEK
5	FBS	Fetal Bovine Serum	450 (pg/uL)	6 (hours)	HEK
6	FBS	Fetal Bovine Serum	450 (pg/uL)	6 (hours)	HEK
8	Dex	Dexamethasone	300 (pg/uL)	6 (hours)	HEK
42	DNA	DNA	NA	NA	HEK
43	DNA	DNA	NA	NA	HEK

run1_dna_map.tsv

This tsv is only needed if multiple DNA samples are present. It is used to map which DNA samples correspond to which RNA samples. In this example, DNA sample 42 corresponds to RNA sample 8. Sample 43, corresponds to the remaining RNA samples. In the tsv, the first column contains the DNA sample numbers, and the second column contains the corresponding RNA sample numbers.

DNA_sample_number	RNA_sample_number
43	1
43	2
43	3
43	5
43	6
42	8

pairwise_comparisons.tsv

This tsv is used to specify the pairwise comparisons for TMP_comparative.py. The treatment names should correspond to the treatment column of the the input tsv (run1.tsv)

id	base_treatment	stim_treatment	base_run	stim_run
1	SF	FBS	run1	run1
2	SF	Dex	run1	run1

Running TMP_empirical.py

To run TMP_empirical.py start by making sure that you have all the dependencies installed and the conda environment activated.

python TMP_empirical.py -r run1 -f test_data/input/run1_rna -d test_data/input/run1_dna -t test_data/input/run1.tsv -dt test_data/input/run1_dna_map.tsv

Note that if all of your fastq files were in the same directory, you would not use the -d flag. The -d flag is only used if you happen to have separate directories for holding your RNA fastq and DNA fastq files.

See here for a more comprehensive look at the flags used.

When this command is run, a new directory named run1/ will be created in the runs/ directory. This directory will include; a Snakemake file that will run the whole pipeline, all of the intermediate files used in the pipeline, and the output files. (see Explanation of output files)

Running TMP_comparative.py

When running TMP_comparative.py you can either create the tsv file yourself ahead of time, or TMP_comparative.py will prompt you to create them interactively. Either way, you must wait for TMP_empirical.py to finish before using TMP_comparative.py.

If you decide to enter the comparisons manually, you will be given a prompt like this for creating pairwise comparisons.

id	treatment	run
1	Dexamethasone	run1
2	Serum Free	run1
3	FBS	run1
***********************************
***********************************
***********************************
Now, enter the pairwise comparisons that you want to make
Enter the numbers corresponding to the treatments you want to compare
The first treatment entered will be the basal treatment, the second will be the stimulated treatment
Separate the two numbers with a comma e.g. "1,2"
Enter "done" when finished
Enter "menu" to see the list of treatments again


Enter pairwise comparison starting with basal treatment e.g. "1,3" (Type "done" after entering all comparisons):

So in this case, if you wanted to compare Serum Free against FBS (with Serum Free as the baseline) you would enter 2,3. Once you are done entering comparisons you would enter done.

If you create your comparisons manually, they will automatically start running as soon as you finish entering them. Additionally, your pairwise comparisons will be automatically stored in the entered_pairwise_comparisons/ directory for your convenience in remembering which comparisons you have run.

If you created the tsv comparisons beforehand, like with pairwise_comparisons.tsv, you can run TMP_comparative.py with flags to point to the comparison tsvs. The command would look like this.

python TMP_comparative.py -p test_data/input/pairwise_comparisons.tsv

Once you run TMP_comparative.py the output files will be written to the pairwise_results/ directory.

Explanation of output files

TMP_empirical.py output

Each time you TMP_empirical.py a new directory will be created in the runs/ directory. The name of the new directory will be the name of the run, which can be specified with the -r flag. This new directory will contain intermediate files used for the pipeline and output files. The main output is the {run_name}__empirical_results.csv file.

{run_name}__empirical_results.csv

The columns of {run_name}__empirical_results.csv will be the following

  • architecture
    • The name of the architecture
    • Architecture names follow this pattern
    • {TRE_name}-{TRE_variant_number}:{a unique id}, {Period}, {Spacer}, {Promoter}
    • For example ALX3-1:224, 1, set1, minCMV
  • statistic_{treatment name}
    • The alpha value. Essentially the transcription rate.
  • controls
    • A boolean stating if the architecture is a negative control or not.
  • mad.score_{treatment name}
  • pval.empirical_{treatment name}
  • pval.mad_{treatment name}
  • pval.zscore_{treatment name}
  • zscore_{treat name}
  • DNA_barcodes_{treat name}
    • The number of DNA barcodes associated with this architecture and treatment
  • RNA_barcodes_{treat name}
    • The number of RNA barcodes associated with this architecture and treatment

{run_name}__meta_data.csv

  • This is used for keeping track of information about the run. We use it for our data visualization.

/run_stats/

The run_stats/ directory contains several files with information about the run. In order to understand most of these files, you need to understand the Starcode clustering step we take as a part of pre-processing the data. We use Starcode to cluster all of the barcode sequences for a given sample. We do this so that barcode counts that would be ignored due to sequencing error can still be attributed to a real barcode.

After running Starcode we categorize the clusters into 1 of 5 types

  • Type 1 = The centroid is in the barcode map, and none of the clustered barcodes are in the barcode map.
  • Type 1.2 = The centroid is in the barcode map, and one or more of the clustered barcodes are in the barcode map, but they all map to the same architecture.
  • Type 2 = The centroid is not in the barcode map and neither are the clustered barcodes.
  • Type 3 = The centroid is in the barcode map, and one or more of the clustered barcodes are in the barcode map, but they map to different architectures
  • Type 4 = The centroid was not in the barcode map, and one or more of the clustered barcodes are in the barcode map.

Type 1 and 1.2 are kept the rest are filtered.

/run_stats/pre_filter_data.csv

This file contains all the barcode information before filtering for type 1 and 1.2 clusters.

Each row of this file is a cluster from Starcode

The columns will be as follows

  • sample_number
  • norm
    • Used for calculating the rpm
  • mean_cluster_count
    • The median barcode count for every barcode in the cluster
  • median_cluster_count
    • The mean barcode count for every barcode in the cluster
  • total_clustered
    • The number of barcodes in the cluster
  • total_counts
    • The sum of all the barcode counts in the cluster
  • centroid
    • The centroid barcode
  • centroid_counts
    • The barcode count of the centroid barcode
  • centroid_in_map
    • A boolean stating if the centroid barcode is found in the barcode map
  • clustered_barcodes
    • All of the barcodes that clustered to the centroid (comma separated)
  • clustered_in_map
    • All of the barcodes that clustered to the centroid that are in the barcode map (comma separated)
    • Will be NULL if none are in the barcode map
  • architectures_match
    • A boolean or NULL value.
    • True means the barcodes from clustered_in_map map to the same architecture as the centroid
    • False means the barcodes from clustered_in_map map to different architectures as the centroid
    • NULL means the centroid wasn't in the barcode map
  • type
    • Cluster type (1, 1.2, 2, 3, 4) see above.
  • collapse_stat
    • centroid_counts/total_clustered
  • treatment
  • rpm
    • reads per million
  • architecture
    • Architecture names follow this pattern
    • {TRE_name}-{TRE_variant_number}:{a unique id}, {Period}, {Spacer}, {Promoter}
    • For example ALX3-1:224, 1, set1, minCMV
  • class
    • Candidate, spacer, scramble or Promega

/run_stats/run_summary.csv

Contains information about the run including the total number of reads and barcodes for each sample, the number of reads and barcodes that weren't filtered, and information about spike-in counts if provided.

/run_stats/filtering_ratios.png

Shows the ratio of reads that are filtered vs. not for each sample.

/run_stats/type_ratios.png

Shows the number of reads that fit into each of the 5 clustering categories.

/run_stats/dna_per_barcode.png

Shows a histogram of the number of DNA counts per barcode (post-filtering)

/run_stats/spike_in_stats.png

Shows relative ratios of spike-ins if provided

TMP_comparative.py output

When you run TMP_comparative.py the results will be stored in the multi_results and comparative_results directories. Each comparison will create two output files. The first one contains the results of the comparison, the second contains a list of architectures that were not present in the comparison.

The file names for pairwise files will be in this format {Basal Treatment}__{Basal Run}_vs_{Stimulated Treatment}__{Stimulated Run}.csv The file names for multivariate files will be in this format {Treatment 1}__{Run 1}_vs_{Treatment 2}__{Run 2}_vs_{Treatment 3}__{Run 3}{...etc}.csv

The columns of the pairwise results will be the following

  • statistic
  • pval
  • fdr
  • df.test
  • df.dna
  • df.rna.full
  • df.rna.red
  • logFC
  • DNA_barcodes_{treat name}
    • The number of DNA barcodes associated with this architecture and treatment
  • RNA_barcodes_{treat name}
    • The number of RNA barcodes associated with this architecture and treatment
  • architecture
    • The name of the architecture
    • Architecture names follow this pattern
    • {TRE_name}-{TRE_variant_number}:{a unique id}, {Period}, {Spacer}, {Promoter}
    • For example ALX3-1:224, 1, set1, minCMV