MoBaGenetics1.0 - folkehelseinstituttet/mobagen GitHub Wiki
This release comes with the following disclaimer: Although all the batches in version 1.0 have been quality controlled, pre-phased and imputed (see description below), there is still ongoing work retrieving additional meta-data from the respective cohorts, running additional quality assurances and preparing proper documentation. Although we have no reason to believe there are any bugs in the quality control, researchers who want to utilize this version 1.0 should be aware of its current status and hence scrutinize their results carefully.
MoBa Genetics 1.0 is the merger, after quality control and imputation, of all data-sets tagged 1.0 on the page projects that have contributed to MoBa Genetics. The merged data-set contains 98110 samples. The directly genotyped markers on the various chips used in the respective projects overlap only sparsely, so the shared backbone was too sparse to pre-phase all samples together, which would have made harmonization a lot easier.
How to find the data, assuming that you have secured access to TSD, is described on the Access data from TSD page.
Once you have located V1.0, you will see a MERGE sub-directory (see below), which is probably what you will end up using. However, should you want the raw data files, the other sub-directories are named after the projects that had the samples genotyped. Refer to the page projects that have contributed to MoBa Genetics for names and information.
In MoBaGenetics 1.0, QC pipelines have been run for each individual dataset; see details on Quality control (QC). The MERGEd dataset is simply a merge of these; see details below.
All SNP reference locations are from GRCh37.
Sub-directories are
- plink: bedsets suitable for use with plink and the source for the other data. These are simply the concatenations of all the bedset/plink files described under Individual data-sets
- vcf: imputed data. See details on Information score below
- bgen: Plink2 generated bgen 1.2 files
- markerinfo: info on the markers but split in all data-sets (and sub-data-sets of these). This is of great help to find what sets had what markers. Refer to the section on the individual data-sets for more info.
- aux: a subdirectory with auxiliary files
  - flaglist-merged is a flaglist related to the merging process
  - pedigree contains ethnic core samples and why segments used to identify them
  - markerinfo contains info on markers in the merged set
  - recode-files-all-prefullinference has probably little value but was (probably?) used to recode ids/parents.
  - pca are plots produced during QC
(If you are bold enough to face individual data-sets, you might be interested in Version 1.5 before it is officially released.) Every data-set resides in its own sub-directory. They are organized by the project that scanned them.
Certain projects have scanned multiple data-sets, sometimes even with different chips. When a project, e.g. NORMENT1, scanned the biological material in multiple batches, an internal name, e.g. FEB18, was used. See the complete list of sets for v1.0 for details on the sets.
If there were multiple chips used, there will be one sub-directory for each chip.
All sets contain a directory structure that roughly contains the directories described below.
The docs directory contains various documentation, such as QC reports.
These data are close to what was delivered by the lab that processed the underlying biological material. Check out details on the Minor changes below related to renaming of SNPs and individuals.
If you want to do your own QC or verify what has been done, the raw-data area is the place to look. The MoBa Genetics team will not be able to help you with this, but you can of course use the slack channel to discuss this with other members of the community.
There are two types of files here:
- Bedsets suitable for analysis with programs like plink (directory plink)
- idat files: the idat directory contains raw data (.idat) files as well as sample sheets
Contains info on which samples have passed or failed which stages of the QC. The steps are described in the docs directory described above.
Data imputed from the corresponding bedset (plink) files found under raw-data.
All batches were imputed at the Sanger Imputation Service using PBWT and the HRC v1.1 reference panel. Pre-phasing prior to imputation was performed using Shapeit2. In batches with available triads, pre-phasing was performed locally (pre-phasing using pedigree information was not possible at Sanger). In batches without triads pre-phasing and imputation were both performed at Sanger. Of special note, the X-chromosome was always pre-phased without pedigree information due to limitations in the software.
Providing a "one size fits all" QC is difficult. Some researchers are likely to prefer doing their own QC for various reasons. We try to accommodate this by providing signal intensities and raw genotype calls (PLINK format) for those who would like to embark on this. Additionally, we regard the availability of signal intensities as necessary in an interim release, to enable researchers to check the underlying calls if needed. For those who would like to use the quality-controlled data, a detailed QC report and accompanying QC flowchart are available for full transparency.
MoBa Genetics is for all practical purposes merely a merge of all the batches post imputation and very little tinkering is done to the files returned from the imputation service except for 1) renaming SNPs and 2) generating weighted average information scores.
Most markers in the HRC v1.1 reference panel have rsIDs. Those without an rsID are returned with a . (dot), which is cumbersome to work with in downstream analyses, as many tools do not accept duplicated marker names and/or rely on marker names carrying information. For this reason, marker names have been updated from . to the rsID if an rsID was found in dbSNP 151. Markers without an rsID were converted from . to chr<CHR>:<POSITION>_<REF>/<ALT> to avoid duplicated marker names.
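The renaming rule can be sketched as below. This is a minimal illustration under stated assumptions, not the code used to build the release: the function `rename_marker` and the `dbsnp_lookup` dictionary (standing in for a lookup against dbSNP 151) are hypothetical, the rsID in the toy data is made up, and the positional name is assumed to contain no whitespace.

```python
from typing import Dict, Tuple

# Hypothetical key: (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


def rename_marker(marker_id: str, chrom: str, pos: int, ref: str, alt: str,
                  dbsnp_lookup: Dict[VariantKey, str]) -> str:
    """Return a usable marker name for a variant returned by imputation."""
    # Markers that already carry a name are left untouched.
    if marker_id != ".":
        return marker_id
    # Prefer an rsID when the (hypothetical) dbSNP lookup has one.
    rsid = dbsnp_lookup.get((chrom, pos, ref, alt))
    if rsid is not None:
        return rsid
    # Otherwise fall back to a positional name that is unique per allele pair.
    return f"chr{chrom}:{pos}_{ref}/{alt}"


# Toy lookup with a made-up rsID, for illustration only.
toy_lookup = {("1", 1000, "A", "G"): "rs0000001"}
print(rename_marker(".", "1", 1000, "A", "G", toy_lookup))
print(rename_marker(".", "1", 2000, "C", "T", toy_lookup))
```

Including both alleles in the fallback name is what keeps multi-allelic sites at the same position from colliding.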
Since every batch of samples has been QCed and imputed separately, the same marker will be associated with a different information score/imputation quality score (INFO-score) in each batch. The INFO-score in the resulting merged MoBa Genetics datasets is calculated as a weighted average of all included batches. Although the scores are largely consistent across batches, a marker could have a very low info score in one batch but still have a decent weighted average. It is important for researchers to take this into account if they want to filter on INFO-score. They might want to look into each specific batch when deciding on what markers and samples to include in their analysis.
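To see how a weighted average can hide a poorly imputed batch, consider the sketch below. The weighting scheme is not stated here, so weighting by the number of samples per batch is an assumption, and the function names and numbers are purely illustrative.

```python
from typing import List, Tuple

# Each entry is (INFO-score in the batch, number of samples in the batch).
BatchScore = Tuple[float, int]


def weighted_info(batches: List[BatchScore]) -> float:
    """Average INFO-score, weighted by batch sample count (an assumed weighting)."""
    total = sum(n for _, n in batches)
    if total <= 0:
        raise ValueError("at least one batch with samples is required")
    return sum(info * n for info, n in batches) / total


def min_info(batches: List[BatchScore]) -> float:
    """Lowest INFO-score observed in any single batch."""
    return min(info for info, _ in batches)


def passes(batches: List[BatchScore], threshold: float,
           per_batch: bool = False) -> bool:
    """Filter on the merged score, or on the worst batch when per_batch is True."""
    score = min_info(batches) if per_batch else weighted_info(batches)
    return score >= threshold


# Made-up marker: well imputed in a large batch, poorly in a small one.
marker = [(0.95, 9000), (0.30, 1000)]
print(weighted_info(marker))                 # roughly 0.885
print(passes(marker, 0.8))                   # merged score clears the threshold
print(passes(marker, 0.8, per_batch=True))   # the worst batch does not
```

A stricter per-batch filter like this removes more markers, so whether it is appropriate depends on which batches contribute samples to a given analysis.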
Although NIPH aims at not re-genotyping samples in order to maximize the utilization of precious DNA, some earlier projects have overlapping samples with later projects due to various reasons like technical restrictions in the biobank and in some situations the desire to genotype on a more modern platform. As a consequence, researchers would need to scrutinize the dataset in order to avoid duplicated samples in their analyses.
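One way to screen for such overlap is sketched below. It assumes you have a table linking every genotyped sample to a stable individual identifier; that table, the tuple layout and the batch names in the toy data are hypothetical and only serve to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (individual ID, batch name, sample ID)
SampleRecord = Tuple[str, str, str]


def find_regenotyped(samples: Iterable[SampleRecord]) -> Dict[str, List[Tuple[str, str]]]:
    """Return the individuals that appear more than once, with their (batch, sample) pairs."""
    by_individual: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for individual_id, batch, sample_id in samples:
        by_individual[individual_id].append((batch, sample_id))
    # Keep only individuals represented by more than one genotyped sample.
    return {ind: recs for ind, recs in by_individual.items() if len(recs) > 1}


# Made-up records for illustration.
toy_samples = [
    ("IND1", "BATCH_A", "S001"),
    ("IND2", "BATCH_A", "S002"),
    ("IND1", "BATCH_B", "S103"),
]
print(find_regenotyped(toy_samples))
```

Which of the duplicated samples to keep (for example the one from the newer platform, or the one with the better call rate) is an analysis decision that this sketch leaves open.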
Several of the aforementioned projects used arrays targeting rare variants. These variants require a more extensive QC in order to achieve sufficient quality. These protocols often necessitate a lot of manual labour assessing cluster plots to avoid dodgy calls. Due to the many sub-projects multiplying the amount of plots that would need manual inspection, rare variants QC was outside the scope of our QC. Analyses performed on robust phenotypes on markers with MAF > 0.5% have shown satisfactory reproducibility of previously established loci. We highly recommend additional QC if analysts would like to investigate rare variants.
An old version of the data is available on MoBa_harvest. This directory is only of interest for projects that have started or completed their analysis on these data.