January 2022 - Bozhie/transcription-modeling Wiki
- looking into options for downloading an Ensembl --> RefSeq mapping
- Ensembl Perl API
- most promising (for me): access the SQL db directly: https://www.biostars.org/p/106470/
move shared files into genome folder
read/reacquaint with the different sleuth options
---> plan of action for sleuth analysis:
- download ensembl to refseq mapping
in /scratch/pokorny/ensembl_queries_downloads/mm9_ensembl_refseq_id.csv --> maps ensembl cdna/transcript ids to refseq transcript ids
I see two options:
simply add gene name to the transcriptional mapping from kallisto (the abundance file)
redo kallisto alignments using the gene set (I don't think this makes sense though)
generate sleuth models for comparison of time=0 to different time points for both WT and condition
how to compare WT vs dCTCF?
MySQL query to download the mapping from RefSeq transcripts (used for mapping of the RNAseq data) to the Ensembl cDNA sequences
- what exactly is the info we're getting out of kallisto vs sleuth
- do some graphing and playing with kallisto tpms
- what did the orig. paper do? what are tpms vs rpkm? can we create a similar sort of mapping?
- move stuff to shared folder finally!
- probably move the sql downloaded files into shared folder too. Or, maybe since these can be pulled so easily it doesn't matter? Maybe better in scratch
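To make the TPM vs RPKM/FPKM question above concrete, here is a minimal sketch of how the two normalizations differ. It is plain Python on made-up counts and lengths, not what kallisto or the original paper ran; note that kallisto computes TPM from effective lengths, whereas this sketch takes whatever lengths it is given. The function names are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads (FPKM is the fragment analogue)."""
    total = sum(counts)  # library size is taken over raw counts
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to sum to 1e6."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


def rpkm_to_tpm(rpkm_values):
    """Within one sample, TPM is RPKM rescaled so the values sum to 1e6."""
    total = sum(rpkm_values)
    return [v / total * 1e6 for v in rpkm_values]
```

The practical consequence: TPM values sum to the same total in every sample, RPKM/FPKM values do not, so a direct FPKM vs TPM comparison across tables is only meaningful after rescaling within each sample.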
Mapping DE Analysis by genes
Goal: Perform some analysis of the RNAseq data that was mapped to transcripts by kallisto. Particularly:
1. collect the transcripts into genes to get expression values by gene
2. compare with results from the Nora analysis (i.e. FPKM in table 11 vs TPM); are these both by genes?
3. compare differential expression for WT vs dCTCF using sleuth
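A rough sketch of step 1 (collecting transcripts into genes), assuming a mapping CSV with a header row. The column names `transcript_id` and `gene_name` are placeholders for whatever the downloaded mapping actually uses. The `target_id` and `tpm` columns are the ones kallisto writes to `abundance.tsv`. Transcript IDs are matched exactly, so any version suffix would need to be handled separately.

```python
import csv
from collections import defaultdict


def load_tx2gene(mapping_csv, tx_col="transcript_id", gene_col="gene_name"):
    """Read a transcript -> gene lookup from a CSV (column names are assumptions)."""
    with open(mapping_csv, newline="") as fh:
        return {row[tx_col]: row[gene_col] for row in csv.DictReader(fh)}


def gene_tpm(abundance_tsv, tx2gene):
    """Sum transcript TPMs per gene; return (gene totals, unmapped transcript ids)."""
    totals = defaultdict(float)
    unmapped = []
    with open(abundance_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            gene = tx2gene.get(row["target_id"])
            if gene is None:
                unmapped.append(row["target_id"])  # keep track instead of dropping silently
                continue
            totals[gene] += float(row["tpm"])
    return dict(totals), unmapped
```

Summing TPM across a gene's transcripts is a reasonable by-gene value because TPM is already length-normalized. The unmapped list is worth checking before trusting the totals.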
- Using the MySQL interface for Ensembl is pretty straightforward:
- schema for entire database documented: https://m.ensembl.org/info/docs/api/core/core_schema.html
- .sql files with queries found in:
- run file with command:
$ mysql -u anonymous -h ensembldb.ensembl.org < ensembl_gene_transcript.sql | sed 's/\t/,/g' > mm9_transcripts_gene.csv
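One caveat with the `sed 's/\t/,/g'` step: any field that already contains a comma ends up split across columns in the CSV. A possible alternative, sketched here with Python's `csv` module, quotes such fields instead. The function name is made up, and this is not part of the original pipeline.

```python
import csv


def tsv_to_csv(src, dst):
    """Convert tab-separated text from `src` to CSV on `dst`, quoting fields as needed."""
    # QUOTE_NONE on the reader: treat quote characters in the mysql output as plain text.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
```

To use it in place of `sed`, a small script would call `tsv_to_csv(sys.stdin, sys.stdout)` and sit at the same position in the pipe.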
Wald test vs likelihood ratio test (LRT)
"Many packages, including limma, use Wald tests for two-sample comparisons. The LRT approach is a bit more elegant in that it is formally testing the relative goodness of fit of a more parameterized (i.e. including an effect of treatment) vs. a less parameterized model, instead of the null hypothesis of no difference in expression between conditions. Thus, in LRT mode, the results table output does not include a logfold change estimate. But, with a little bit of data manipulation, we can extract mean TPM values per treatment and add them to our sleuth results table so that, for transcripts with significant differential expression, we can assess the direction of the difference."
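The "little bit of data manipulation" in the quote could look roughly like the sketch below: mean TPM per condition, plus a log2 ratio to read off the direction. This is plain Python on an in-memory table, not sleuth's API. The function name, sample names, and pseudocount are assumptions for illustration.

```python
import math
from statistics import mean


def direction_by_condition(tpm_by_tx, sample_condition, reference, treatment,
                           pseudocount=1.0):
    """Return {transcript: (mean_ref, mean_trt, log2 ratio)} from per-sample TPMs.

    tpm_by_tx:        {transcript_id: {sample_name: tpm}}
    sample_condition: {sample_name: condition_label}
    The pseudocount avoids division by zero; its value is an arbitrary choice.
    """
    out = {}
    for tx, per_sample in tpm_by_tx.items():
        ref = mean(v for s, v in per_sample.items() if sample_condition[s] == reference)
        trt = mean(v for s, v in per_sample.items() if sample_condition[s] == treatment)
        log2_ratio = math.log2((trt + pseudocount) / (ref + pseudocount))
        out[tx] = (ref, trt, log2_ratio)
    return out
```

A positive log2 ratio means higher mean TPM in the treatment (e.g. dCTCF) than in the reference (e.g. WT). This only gives the direction for transcripts the LRT already flagged; it is not itself a test.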