Cleaning and Normalizing Data - bcb420-2025/Clare_Gillis GitHub Wiki
1 - Initial exploration
What are Hugo gene symbols? - Standardized symbols to represent genes. My files do not have these.
What do the rows and columns in my file represent?
Rows: Sample ID
Columns: Genes (Ensembl ID for the gene)
Supplementary files provided that map Sample ID to metadata, and Ensembl accession to gene info (including HUGO symbol)
Need to install an updated version of the readxl package to import the gene metadata file.
I don't see any obvious signs of outliers due to measurement error, so I will not exclude any data.
Some of the counts are fractional - this is odd because the file is entitled count_matrix. However, many of the values are integers. The read depth is not uniform across samples, so the dataset is not CPM normalized (CPM normalization is mentioned later in the paper).
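A quick way to check the "not CPM normalized" conclusion: in a CPM matrix every sample's column sums to ~1 million, so unequal column sums mean raw(ish) counts. A toy sketch in Python (the mini matrix and sample names are made up for illustration; my real check was on the full matrix):

```python
# Toy count matrix: sample -> per-gene counts (fractional, like RSEM output).
counts = {
    "h270-A": [10.0, 250.5, 0.0, 39.5],
    "h271-A": [12.0, 300.0, 1.0, 87.0],
}

# In a CPM-normalized matrix, every column would sum to ~1e6.
col_sums = {s: sum(vals) for s, vals in counts.items()}
is_cpm = all(abs(total - 1_000_000) < 1 for total in col_sums.values())

print(col_sums)  # unequal totals -> different read depths per sample
print(is_cpm)    # False -> not CPM normalized
```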
According to a BioStars post by swbarnes2, "RSEM will return expected counts as fractions, because it assigns ambiguously assigned reads fractionally to all the places it thinks that read might have come from" - so the count matrices are likely just the RSEM count matrices (the paper mentions RSEM is the tool they used to quantify aligned reads).
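To make the fractional-counts idea concrete, here's a toy sketch of that assignment scheme (reads and gene IDs invented; the split here is uniform, whereas RSEM actually weights the split by its expression model):

```python
from collections import defaultdict

# Each read lists the genes it could have come from.
reads = [
    ("read1", ["ENSG_A"]),            # unique hit -> whole count of 1
    ("read2", ["ENSG_A", "ENSG_B"]),  # ambiguous -> 0.5 to each gene
    ("read3", ["ENSG_B"]),
]

expected = defaultdict(float)
for _, hits in reads:
    for gene in hits:
        expected[gene] += 1.0 / len(hits)  # fractional assignment

print(dict(expected))  # {'ENSG_A': 1.5, 'ENSG_B': 1.5} - fractional "counts"
```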
2 - Normalization
After watching the lecture videos, it is obvious that RSEM gives fractional counts. Not much of an insight anymore.
After normalizing, I get this. Not too bad - the points are quite well segregated between the different layers.
How am I supposed to choose the min_num_samples?
Mine seems to have a LOT of genes that are lowly expressed in many samples, so I don't get a smooth-ish curve until I use a crazy high min_num_samples (80+). I think I'm just going to deal with a messy curve - there just ARE many lowly expressed genes. My smallest group is 14 samples (Layer 5, DS), so I'll go with a min_num_samples of 11 to make sure genes expressed only in Layer5-DS are picked up.
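The filter above boils down to: keep a gene only if its CPM clears a cutoff in at least min_num_samples samples. A hand-rolled sketch of that logic in Python (toy matrix, toy library sizes, and a toy cutoff - in the real data min_num_samples is 11 and the cutoff would be something like CPM > 1, done via edgeR in R):

```python
# Toy counts: gene -> counts across 4 samples.
counts = {
    "geneA": [0, 1, 0, 2],      # lowly expressed almost everywhere
    "geneB": [50, 60, 55, 70],  # well expressed in all samples
}
lib_sizes = [100, 120, 110, 150]  # total reads per sample (invented)

def cpm(row):
    """Counts per million, per sample."""
    return [1e6 * c / n for c, n in zip(row, lib_sizes)]

min_num_samples = 3  # toy value; real analysis uses 11 (smallest group is 14)
cpm_cutoff = 1e5     # toy cutoff scaled to the tiny library sizes above

keep = {g: sum(v > cpm_cutoff for v in cpm(row)) >= min_num_samples
        for g, row in counts.items()}
print(keep)  # geneA filtered out, geneB kept
```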
3 - Post normalization
Why do some sample names have an R? Most are like h270, h271..., but some come in pairs like h293 & h293R. Does this mean replicate? If so, why do only some have that?
Actually no. It looks like not all of the samples have technical replicates (based on the supplementary files).
FOUND IT! I read the files in under opposite names accidentally (read L3 as L5 and vice versa). This still doesn't really explain the R suffixes, but the filenames now match the supplementary files, so at least that's good...
Ok, I think their naming conventions are just reallyyyyy weird. Probably multiple people worked on this and never standardized their naming. From what I can infer, it looks like initially all samples were run once but split into two technical replicates (A and B). If that didn't work (one or both failed to meet the quality threshold), they ran them again, again split into 2 technical replicates, and added an R to the name. This part is NOT standardized - some people put the R right after the sample ID (h270R-A), some put it after a dash with the replicate ID (h270-AR).

After this, things get even LESS standardized. For Layer 5, they combined exactly 2 technical replicates per sample into 1 sample, then left all extras (leftover technical replicates from the re-runs) as individuals. For Layer 3, they left replicates separate if they aren't from the same round of RNAseq (i.e. one has an R, one does not).

All this is to say: I think if two samples have the same sample ID and layer, they are technical replicates and should be combined.
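That combining rule can be sketched like this (toy column names and counts; the suffix-stripping regex is my own guess at the pattern, not anything from the paper):

```python
import re
from collections import defaultdict

# Column name -> (layer, counts per gene). Invented for illustration.
samples = {
    "h270-A":  ("L5", [3, 10]),
    "h270-B":  ("L5", [2, 12]),
    "h270R-A": ("L5", [4, 9]),   # re-run replicate, same sample + layer
    "h271-A":  ("L3", [7, 1]),
}

combined = defaultdict(lambda: [0, 0])
for name, (layer, counts) in samples.items():
    sample_id = re.match(r"(h\d+)", name).group(1)  # strip -A/-B/R suffixes
    key = (sample_id, layer)
    combined[key] = [a + b for a, b in zip(combined[key], counts)]

print(dict(combined))  # all three h270/L5 columns summed into one
```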
4 - Writing report
Mapping to HUGO symbols is harder than I initially thought - why are there so many duplicate HUGO mappings, and so many missing?
When I map using the current Ensembl dataset, there are like 1100 HUGO symbols missing. Maybe I should use the Ensembl dataset from when this dataset was made? There is nothing from around then - this is from 2018, and the closest archive before that is 2015... I think they delete archives after 5 years or more, keeping only occasional snapshots. I'll try 2019 because that's close. Now there are like 1500 missing...
Ok, what if I use both 2019 (the actual closest to my dataset, timeline-wise) and the most recent one - much better! Only 107 missing now!
I'll pick the most recent one first, I guess - is that more accurate?
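The two-pass lookup is just "current first, fall back to the archive." A toy sketch (the IDs and symbols are stand-ins; the real tables come from biomaRt queries against the current and 2019 Ensembl releases):

```python
# Current annotation and 2019 archive, as Ensembl ID -> HUGO symbol maps.
current = {"ENSG01": "GENE1", "ENSG02": "GENE2"}
archive_2019 = {"ENSG02": "GENE2_OLD", "ENSG03": "GENE3"}  # retired ID here

ids = ["ENSG01", "ENSG02", "ENSG03", "ENSG04"]

# Most recent release wins; the archive only fills gaps.
hugo = {i: current.get(i, archive_2019.get(i)) for i in ids}

print(hugo)    # ENSG03 rescued by the archive, ENSG04 still unmapped
missing = [i for i, s in hugo.items() if s is None]
print(missing)
```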
Still a lot of duplicate IDs and symbols. I'll try keeping the best one for each set of duplicates. Longest transcript? Protein-coding gene? There doesn't seem to be a filter for transcript length. There is one for biotype, but it doesn't give me anything back. I guess I'll have to just take the first one (not a great solution). I don't think averaging them would be a good idea since I'm not 100% sure they're the same gene...
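For the record, "take the first one" looks like this (toy biomaRt-style rows; not ideal, as noted, but at least deterministic):

```python
# Mapping rows as (Ensembl ID, HUGO symbol); duplicates are invented.
rows = [
    ("ENSG10", "DUPA"),
    ("ENSG10", "DUPA-AS1"),  # later mapping for the same ID gets dropped
    ("ENSG11", "UNIQ1"),
]

first_mapping = {}
for ens_id, symbol in rows:
    first_mapping.setdefault(ens_id, symbol)  # keep only the first hit

print(first_mapping)  # one symbol per Ensembl ID
```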
4.2 Compiling and Uploading
Uh oh. It won't compile. It compiled yesterday. It said Ensembl timed out - I think Ensembl is down. Yep, the website says it sent me to a mirror. So how can I check if it will compile once Ensembl is back up??
It might be back up - it compiled in my Docker RStudio; trying now in command-line Docker - it worked eventually.
Checking over the MDS plot before submitting - why is it only showing one point for Layer5 CTR??? They're not identical... FOUND IT!! I used my sample_type_df instead of sample_type_dt!!
Ok finally - submitting.