November 2021 - Bozhie/transcription-modeling GitHub Wiki

11.02.2021

Using mm9 RefSeq for TSS analysis

Purpose: Redoing analysis using actual TSS

Process/Notes:

Using the RefSeq GTF files downloaded from UCSC browser, find TSS
- Simply took the first index of all the transcripts in the file, depending on the strand/orientation
Repeating scatter-plot and mapping by index method from last time, looks like a tighter relationship/much less outliers
- Also noticed that using bioframe to find closest overlap between transcripts and TSS, >90% of them had a distance of 0 (makes sense)

images/20211102_TSS_v_CTCF_scatter_redo.png

TSS binning with pybbi
- Should be able to just use the TSS windows build from
- If they're from the same genome, then the stackup should filter according to the windows (I think this means we don't need to match / filter back from TSS)
- ISSUE: Still received error ValueError: Start exceeds the chromosome length, 15902555.

11.03.2021

Explorations of sleuth-Kalliso

Resources

(Bray 2016) Near-Optimal probabilistic RNA-seq Quantification sleuth walkthrough Kallisto getting started

Notes

Starting new directory in pokorny/RNAseq
Goal: follow the kallisto tutorial, using RNA-seq data from Rao 2017
- Added transcriptome from NCBI GRCm38 choosing the following download settings: REfSeq source database, filetype: CDS from genomic fasta, saved in scratch/pokorny/genomic_annotations/mm10/genome_assemblies_cds_fasta.tar
- SRA download from the cloud might be another way to download raw fastq results from individual experiments
- Tried to download GSE106886_RAW.tar from Rao_2017, seems to be a huge file and tar -xv but lost remote connection while waiting.
- Submitting unzip as a job (id: Submitted batch job 6543562)

Meeting with Geoff

Goals for EOW:

Check if cleaning up TSS windows by removing anything with '_random' chrom location helps
Continue trying to run kallisto/sleuth -- minimally, get a good RNAseq dataset downloaded
- Rao_2017 nascient PROseq data might not be the best/easiest dataset to work with here
- Where is raw RNAseq from Elphege et al?

Notes:

Look into SRA downloads to retrieve RNAseq data. Also, Liu et al might be a good place to turn
some tools for downloading sras: https://github.com/saketkc/pysradb , e.g. https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/02.Commandline_download.ipynb
https://github.com/ncbi/sra-tools
a nice command for seeing cluster usage: /project/qcb_640/local/bin/clstat ( you can add an alias in your bashrc )
place to download some kallisto indexes https://github.com/pachterlab/kallisto-transcriptome-indices

TSS Buckets / Visualization

Steps/Notes:

new TSS buckets after RefSeqTranscript clean-up worked! Tested with a subset (~100, 1000 TSS windows)
Figured out good visualization settings with imshow()

Next Steps:

Put it all together, get TSSwindows for only the expressed transcripts, ordered by transript's FPKM
Map both CTCF-TSS binding and FPKM levels together on same y-axis

11.04.2021

TSS and FPKM heatmap TOGETHER 👯

Notes/Steps:

Building TSS buckets for RNA-seq data, ordered by delta FPKM
Mapped TSS locations from refTranscripts onto RNAseq FPKM dataframe (annotated file from Elphege supp. excel) using the gene name
Strange: not all gene names in RNAseq data were present in RefSeq set.
- For now, just going to clean them out (less than 2%).
- hopefully building our own DE gene RNAseq set will eliminate this issue? So that we are all using the same set of unique identifiers/references end-end of pipeline
Again received chrom index out of bounds error
- removing chromosome X, looks like it's small enough that this issue could be coming from it? https://www.ncbi.nlm.nih.gov/grc/mouse/data
filter out any TSS that are not within this region
- probably points to some other issue, or something I'm not understanding so we should keep note of this!
Going back and attempting to run pybbi on full set of TSS from refseq (just to see if my matching of TSS to RNAseq is where these values become confused)


Ran full set of intervals to do bbi stackup (under header: Getting closest TSS to regions bound by CTCF)

Why are these different dimensions?? 


input: (refTSSwindow.size) --> 131931
output: (matrix.shape) --> (43977, 20)

11.05.2021

Download and storage for SRA files

References:

overview: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
downloading sra tools to download directly for ncbi: https://github.com/ncbi/sra-tools/wiki/
Other option is cloud delivery, might be an option to look into later GCP - google cloud

Steps

Downloaded SRA files for Wapl set at 0h, 6h, 24h, 48h, 96h ( Wapl-C20_Xh_RNA_rep2; Mus musculus; RNA-Seq )
- ncbi group acc: SRP216929; each acc: SRX9106184, SRX9106186, SRX9106188, SRX9106190, SRX9106192

Next Steps:

Look into package manager, options to download all SRA files (compare w
Try kallisto!

11.10.2021

Trying Kallisto

Downloads/Prep/Resources

In kallisto FAQs: "Transcriptome fasta files for model organisms can be downloaded from the Ensembl database. We recommend using cDNA fasta, specifically the *.cdna.all.fa.gz files. kallisto can build indices directly from gzipped files."
Downloaded transcriptomes for multiple mouse transcriptomes into dir /scratch/pokorny/kallisto_resources/ensembl_cdna by navigating ensembl ftp downloads
- GRCm39 is most up-to-date mus musculus genome assembly, downloaded file: Mus_musculus.GRCm39.cdna.all.fa.gz
- mm10 = GRCm38, file: Mus_musculus.GRCm38.cdna.all.fa.gz
- mm9 = NCBIM37 = GRCm37, file: Mus_musculus.NCBIM37.65.cdna.all.fa.gz

Steps

Submitting as jobs, and also tried on personal computer

kallistoquant4.job --> output4, from kallisto.job4 = JOBID 6629286 kallistoquant5.job --> output5, from kallisto.job5 = JOBID 6629287, activated biovenv

Results

11.11.2021

Todo:

read paper/tutorial: https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr#scrollTo=qVh9frJDgVQ-
look into indexing / reading
- https://arxiv.org/pdf/1506.05101.pdf
- knowledge graphs
  - neo4j, cypher

11.12.2021

Steps:

Made new jupyter notebook for easier organization
Generated Fig 6.a heatmap (did not do clustering, maybe later?)
using bioframe overlap function, mapping RefSeq transcripts to RNAseq data
filter by: distance, greatest amount of coverage -- choose only one refseq

Todo:

redo stack-up, with all 4 of the auxin conditions
remove "unchanged" RNAseqs, only include those with large deltas --> maybe should have just done that with orig. data in orig. notebook but we've come too far now
Do same analysis with sleuth results!

11.15.2021

Figure 6(b) Fixed

Overview: Started a new notebook to approach figure 6, starting with mapping TSS from the beginning. The goal here is to build it more cleanly, and make functions so that can do the analysis across multiple sets of data.

Notebook Name: Elphege_2017_Fig6.

Steps/Notes:

Noticed issues with trying to take log of delta
- At first, delta was calculated as the difference between the auxin-treated RNA-seq FPKM and the untreated). Can't take log of negative numbers.
- Also tried defining delta as (x_treated - x_original) / x_original.
  - this gave negative change as very small decimals. I was trying to take log of that and it was essentially deleting all down-regulated transcripts.
- FIXED: took log10change from the beginning: log10(x_treated/x_original). Use this normalized value to compare/sort the change in expression.
Put into a couple of
We have figures!

Questions:

what is n in Fig 6(b)? I wasn't able to re-do that
cutoff=0.1 seems best to me. Is there a way to choose a cutoff other than visualization?

11.16.2021

Finished Fig 6
Meeting with Geoff

11.17.2021

https://learning.cyverse.org/projects/kallisto_tutorial/en/latest/step3.html

11.23.2021

Working on downloading RNA-seq files from Elphege

Purpose: Want to get RNA-seq files from Elphege_2017 to run through kallisto, sleuth pipeline and have a standard format for processing. Would be good to establish a pipeline for processing RNA-seq data for differential expression analysis.

Process:

Trying to use package manager: pysradb
- downloaded all files from project to /scratch/pokorny/Elphege_2017/SRP106652/pysradb_downloads/SRP106652/
  - Issue: this is EVERYTHING - mixture of fastq, Hi-C data, etc all with files in .sra format.

Better Process:

Use Entrez search with project number and rna-seq to download only the rna-sequencing data
- have some way to verify whether paired-end or not for kallisto input

todo:

use SraAccList.txt to download only RNA-seq datasets
feed all files from that into slurm job: fastq-dump -N 7164180 -X 14328358 -O /tmp/pfd_qtpv_w_r/1 --split-files --gzip {SRR5517500.sra_files_list}
kallisto for all these output files (also in slurm job?)
look into https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth
- does this have what we need?
- maybe use this and input entire directory of .sra / fastq files. I think you just run the snakeFile in the directory with the needed files??

12.24.2021

Still downloading and running rna-seq and kallisto for Elphege et. al

I'm downloading metadata and downloaded the list of files manually (in SraAccList.txt)
- Entrez search might be a good tool if we want to automate this --> https://www.ncbi.nlm.nih.gov/books/NBK179288/
Todo: need to parse the 'condition' out of 'Experiment Title' column in metadata/sra_run_summary.csv

Useful resources:

For using loading a conda environment for carc job submissions: https://www.carc.usc.edu/user-information/user-guides/software-and-programming/anaconda
Sleuth tutorial is a little more helpful than the one on their website: https://github.com/Kapeel/Sleuth-tutorial-iPlant
querying ncbi API for SRA files using python: https://bootstrappers.umassmed.edu/guides/main/python_get_sra_run_ids.html

todo:

watch: https://www.youtube.com/watch?v=tLbXl2x6nxY