January 2022 - Bozhie/transcription-modeling GitHub Wiki

January Wiki

12.14.22

Getting familiar

looking into options for downloading EMSEMBL--> RefSeq
- emsembl perl API
- most promising (for me): access SQL db directly: https://www.biostars.org/p/106470/

Todo:

move shared files into genome folder
read/reacquaint with the different sleuth options

---> plan of action for sleuth analysis:

download ensembl to refseq mapping

in /scratch/pokorny/ensembl_queries_downloads/mm9_ensembl_refseq_id.csv --> maps ensembl cdna/transcript ids to refseq transcript ids

I see two options:

simply add gene name to the transcriptional mapping from kallisto (the abundance file)
redo kallisto alignments using the gene set (I don't think this makes sense though)
generate sleuth models for comparison of time=0 to different time points for both WT and condition
how to compare WT vs dCTCF?

mySQL query to download RefSeq transcripts (used for mapping of RNAseq data) to the ensembl cDNA sequences

next steps:

what exactly is the info we're getting out of kallisto vs sleuth
do some graphing and playing with kallisto tpms
- what did the orig. paper do? what are tpms vs rpkm? can we create a similar sort of mapping?
move stuff to shared folder finally!
- probably move the sql downloaded files into shared folder too. Or, maybe since these can be pulled so easily it doesn't matter? Maybe better in scratch

12.18.2022

Mapping DE Analysis by genes

Goal: Perform some analysis of the RNAseq data that was mapped to transcripts by kallisto. Particularly: 1. collect the transcripts into genes to get Expression values by-gene 2. compare with results from Nora analysis (i.e. FPKM in table 11 vs TPM), are these both by genes? 3. Compare differential expression for WT vs dCTCF using sleuth

Process/Notes:

Using mySQL Interface for ensembl is pretty straight-forward:
- schema for entire database documented: https://m.ensembl.org/info/docs/api/core/core_schema.html
- .sql files with queries found in: /scratch/pokorny/ensembl_queries_downloads/
- run file with command: $ mysql -u anonymous -h ensembldb.ensembl.org < ensembl_gene_transcript.sql | sed 's/\t/,/g' > mm9_transcripts_gene.csv

Wald Test vs likihood ratio test (LRT)

"Many packages, including limma, use Wald tests for two-sample comparisons. The LRT approach is a bit more elegant in that it is formally testing the relative goodness of fit of a more parameterized (i.e. including an effect of treatment) vs. a less parameterized model, instead of the null hypothesis of no difference in expression between conditions. Thus, in LRT mode, the results table output does not include a logfold change estimate. But, with a little bit of data manipulation, we can extract mean TPM values per treatment and add them to our sleuth results table so that, for transcripts with significant differential expression, we can asses the direction of the difference."

read: https://shiny.york.ac.uk/bioltf/gene_expression_course/day3/day3.Rmd#section-wald-tests-and-interactions

12.25.22

todo:

map RNAseq data using kallisto to mm9 instead
perform simplest wald test for auxin treatments compared to CTCF_untreated. export relevant data to compare
add gene aggregations

commit "sleuth running basic WT against correct base case, and building full summary in df"

so far:

--> successfully mapped new kallisto reads --> switch to using those --> wald test and building a mega-df is ~complete for simplest case

need still:

add gene aggregation. Do simple comparisons
add significant or not? or can do this in python/pandas more easily probably
switch to mm9.

FOUND THIS: http://www.informatics.jax.org/downloads/reports/MRK_ENSEMBL.rpt --> noticed the names were weird, ncbi isn't really the source of gene names for this, but MRK is.

was able to download the MRK acc #, but may need yet another relationship to get everything in one place

WT for each condition - download gene aggregation as a .csv for each --> can map the MKG (mouse genome annot) before downloading if want/if easy
in pandas, join all on WTs on transcription id (or --> just keep as separate tables?)
also add the MKG to the elphege supp. table 11

12.27.2022

Fixing gene-transcription mapping downloads

Found many blank entries in

successfully used biomart to download, but still appears to be many blank entries for the MGI symbol ` library(biomaRt)

for mm9 -- use ensembl version 54 / mmusculus genome NCBI37

mm9mart = useEnsembl(biomart="ensembl", version=54, dataset="mmusculus_gene_ensembl")

t2g <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "mgi_id"), mart = mm9mart)

dim(t2g)

50179

dim(t2g[which(t2g$mgi_symbol != ""),]) 38921 ` ** Looking at source kallisto file used to build index --> 88366 transcripts are being mapped

proceeding with this biomart version for now, since it seems to be standard procedure