FAQs - GarrettJenkinson/informME GitHub Wiki

What quality of data is recommended for informME?

In our experience, WGBS data consisting of ~125bp paired-end reads at 15X to 20X coverage yields high-quality models across most regions of interest in the genome.

My WGBS data quality is low compared to your recommendations (e.g., I have short 50bp single-end reads). Should I use informME?

The power of informME over standard marginal processing (e.g., treating each CpG site as an independent Bernoulli random variable) is that we actively exploit the single-cell information a WGBS read provides when it covers two or more CpG sites simultaneously. In your data, we will never simultaneously observe two CpG sites that are more than 50bp apart, so that advantage is reduced. Even so, informME would do no worse than marginal modeling; you just lose that extra edge. Where your data will shine is in CpG islands, where the spacing between CpG sites is small. However, your read depth might limit the number of islands informME chooses to model: more CpG sites mean that gaps in the data make estimation more difficult, and informME is cautious about what it chooses to model. We would rather you have gaps in modeling along the genome, but have full confidence in any regions that do have models.

One thing we suggest when data depth is low is to try building "pooled" models. For example, if you have two normal samples and two cancer samples, rather than building four models you can build one pooled normal model and compare it against one pooled cancer model. The pooled models then have effectively twice the depth. A caveat is that these models no longer represent individual samples but a population of samples. High entropy in the population, for example, could be due to heterogeneity between the samples rather than high entropy within each sample. As long as you are aware of this distinction when drawing your scientific conclusions, there is no problem, and often people are indeed interested in the population-level difference between a normal and a cancerous population. In informME, building pooled models is easily handled at the informME_run.sh stage, where you can specify multiple BAM files when building a model.
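For concreteness, a pooled analysis might look like the sketch below: one modeling call per phenotype, each given all replicate BAM files. The file names are placeholders and the argument syntax shown is an assumption, not the verified interface; consult the usage message of informME_run.sh on your installation for the exact flags and argument order.

```shell
# Hypothetical sketch only: paths, sample names, and the exact argument
# syntax of informME_run.sh are illustrative placeholders.

# Pooled normal model built from two normal replicates
informME_run.sh normal1.bam,normal2.bam pooled_normal

# Pooled cancer model built from two cancer replicates
informME_run.sh cancer1.bam,cancer2.bam pooled_cancer

# Downstream differential analysis then compares the two pooled models,
# each with roughly twice the effective read depth of a single replicate.
```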

Will informME work for my RRBS data?

At this time, informME is not recommended for RRBS data, since the underlying modeling assumptions have been tailored to WGBS data. The pipeline will indeed run on RRBS data, and will process the data correctly, since the bisulfite sequencing reads stored in a BAM file are interpreted the same way. But the coverage patterns of RRBS data are dramatically different from WGBS, so our modeling choices (especially our exclusion criteria where we decide which regions have sufficient data to build a model) may not be appropriate for your data.

We welcome community contributions to informME that add optional settings tailoring the software to RRBS, and we also welcome feedback from users who ignore our suggestion and run RRBS data through the pipeline anyway.

What memory settings should I use on the cluster? And how long will it take to run my samples through?

This is cluster dependent as well as data dependent, and generally requires a little trial and error. As a guideline, we have included example submission scripts whose threading and memory choices have generally worked well for us on ~125bp paired-end reads at 15X to 20X coverage. With these submission scripts, it generally takes our computing cluster less than 24 hours to process a BAM file through all steps of the pipeline (assuming the reference genome has been previously analyzed; this only needs to be done once, before your first sample, and can itself take up to a day).
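As a starting point, a submission header in the spirit of the example scripts might look like the following. This is a sketch only: it assumes a SLURM scheduler, and the core, memory, and time values are illustrative numbers for ~125bp paired-end data at 15X to 20X coverage, to be tuned by trial and error on your own cluster.

```shell
#!/bin/bash
# Hypothetical SLURM header; adjust directives for your scheduler and data.
#SBATCH --job-name=informME_sample1
#SBATCH --cpus-per-task=8        # informME steps can run multithreaded
#SBATCH --mem=64G                # raise this if jobs are killed for memory
#SBATCH --time=24:00:00          # <24h per BAM has been typical in our hands

# Placeholder invocation; replace with the actual pipeline step and arguments.
informME_run.sh sample1.bam sample1
```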

The cluster killed my jobs partway through. What should I do?

Frequent cluster users will find this is not uncommon, for a variety of reasons (nodes going down, other jobs encroaching on your resources, time/memory requests set too low, etc.). The first step is simply to resubmit the job that was killed. informME is designed not to redo previously completed computations, and is generally "smart" enough to do only the work required to finish your job. On rare occasions, a job will have died in the middle of writing a file to disk, leaving corrupted ".mat" files behind. It may therefore be wise to use a MATLAB script such as the following (replacing '/path/to/mat/files/' with the scratch or intermediate directory to which you have written the informME ".mat" files) to detect and delete corrupted .mat files:

% Directory containing the informME intermediate ".mat" files
myFolder = '/path/to/mat/files/';
matFiles = dir(fullfile(myFolder, '*.mat'));
for ind = 1:length(matFiles)
  fname = fullfile(myFolder, matFiles(ind).name);
  try
    load(fname);    % attempt to read the file
  catch
    delete(fname);  % unreadable, so assume corrupted and remove it
  end
end

Alternatively, if you would rather not trust a script to delete your files, replace "delete" with "display" to print the names of the corrupted files instead.

What preprocessing steps of our WGBS data do you recommend upstream of informME?

See "Online Methods: Quality control and alignment" in reference [1] below, for the preprocessing steps we use when generating a sorted, indexed, deduplicated BAM file to input to informME.
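As a rough sketch of what such preprocessing typically looks like, the commands below produce a sorted, indexed, deduplicated BAM. These are not the exact commands from reference [1]; the tool choices (Trim Galore, Bismark, samtools) and file names are common but are our illustrative assumptions.

```shell
# Illustrative WGBS preprocessing sketch; see reference [1] for the exact
# procedure used by the authors. Tool choices and parameters are assumptions.

# 1. Adapter/quality trimming of paired-end reads
trim_galore --paired reads_R1.fastq.gz reads_R2.fastq.gz

# 2. Bisulfite alignment to the reference genome
bismark genome_dir/ -1 reads_R1_val_1.fq.gz -2 reads_R2_val_2.fq.gz

# 3. Remove PCR duplicates
deduplicate_bismark --paired reads_R1_val_1_bismark_bt2_pe.bam

# 4. Sort and index; the result is ready as input to informME
samtools sort -o sample.sorted.bam reads_R1_val_1_bismark_bt2_pe.deduplicated.bam
samtools index sample.sorted.bam
```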

I don't have a MATLAB license. Can I use informME?

Unfortunately, at this time the informME software depends on a MATLAB license. If you are at a university with a computing cluster, it is common for the cluster to have a site license, so check with your administrator. We are working on porting the software to the Julia language, which would remove the MATLAB requirement. That code is available at https://github.com/GarrettJenkinson/InformMe.jl, but it is still considered experimental/unsupported at this time.