Step 3. Obtain final format for genotype data

In this step, we remove SNPs with a minor allele frequency (MAF) below 5% or a Hardy-Weinberg equilibrium p-value below 0.05, and keep only the samples in the final list obtained in the previous step. To do so, we convert the VCF into a PLINK binary file set, which is the genotype input format for TensorQTL. Samples are filtered with the --keep flag. In some cases this flag does not work properly: if the output of this step reports the wrong number of samples passing the filter, replace --keep with --remove, as indicated in the Readme.
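
The filtering and conversion described above can be sketched as follows. This is a minimal illustration, not the exact command from the Readme: it assumes PLINK 1.9 is installed as `plink`, and the file names (`genotypes.vcf.gz`, `final_samples.txt`, `genotypes_filtered`) and helper function names are hypothetical. The helper only builds the argument list, and a second helper compares the number of samples in the resulting .fam file with the number of lines in the keep list, which is one way to spot the --keep problem mentioned above.

```python
def build_plink_filter_command(vcf, keep_file, out_prefix, maf=0.05, hwe=0.05):
    """Return a PLINK 1.9 argument list; nothing is executed here.

    All paths are placeholders (assumptions), not names used by the protocol.
    """
    return [
        "plink",
        "--vcf", str(vcf),
        "--const-fid",             # FID = 0, IID = full VCF sample name
        "--keep", str(keep_file),  # two columns: FID and IID of samples to keep
        "--maf", str(maf),         # drop SNPs with MAF below this threshold
        "--hwe", str(hwe),         # drop SNPs with HWE p-value below this threshold
        "--make-bed",              # write the .bed/.bim/.fam file set
        "--out", str(out_prefix),
    ]


def count_samples(path):
    """Count non-empty lines (one sample per line in .fam and keep files)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def sample_count_matches(fam_file, keep_file):
    """True if the .fam file has as many samples as the keep list.

    Assumes every sample in the keep list is present in the VCF.
    """
    return count_samples(fam_file) == count_samples(keep_file)


cmd = build_plink_filter_command(
    "genotypes.vcf.gz", "final_samples.txt", "genotypes_filtered"
)
# To actually run it:
# import subprocess; subprocess.run(cmd, check=True)
```

If `sample_count_matches("genotypes_filtered.fam", "final_samples.txt")` returns `False` after running the command, that is the situation in which the text above suggests switching to --remove.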

Be careful: when PLINK converts a VCF into a binary PLINK file set (.bed, .bim, .fam), it splits each VCF sample name into a family ID (FID) and a within-family ID (IID) by searching for a separator, which by default is _. If this causes problems, you can use the following link to customize how PLINK reads sample names, or to rename the samples once you already have the binary PLINK file set. In our case, we always set the --const-fid flag, which sets the FID to 0 for all samples and uses the whole sample name from the VCF as the IID. Remember that, to run TensorQTL, the IIDs in the PLINK files must match the sample names in the methylome BED file.
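
A quick way to check the last point is to compare the IIDs in the .fam file with the sample columns of the methylome BED header. The sketch below assumes the usual layout: the IID is the second whitespace-separated column of the .fam file, and the BED file is tab-separated with four leading columns (chromosome, start, end, phenotype ID) before the sample names. The function names and any file paths passed to them are hypothetical.

```python
import gzip


def read_fam_iids(fam_path):
    """Return the IID (second column) of every sample in a PLINK .fam file."""
    with open(fam_path) as handle:
        return [line.split()[1] for line in handle if line.strip()]


def read_bed_samples(bed_path):
    """Return the sample names from the header of a phenotype BED file.

    Assumes the first four header columns are chr, start, end and phenotype ID.
    """
    opener = gzip.open if str(bed_path).endswith(".gz") else open
    with opener(bed_path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
    return header[4:]


def compare_ids(fam_path, bed_path):
    """Return (IIDs missing from the BED, BED samples missing from the .fam)."""
    iids = set(read_fam_iids(fam_path))
    samples = set(read_bed_samples(bed_path))
    return sorted(iids - samples), sorted(samples - iids)
```

Both returned lists should be empty when the genotype and methylome files refer to the same samples under the same names.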

Finally, the genotype counts and linkage disequilibrium information are computed for each SNP so that they can be sent to us (IRLab, INMA cohort).
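
As a rough sketch of this last step, the two reports could be produced from the filtered file set with PLINK 1.9. The flags used here are an assumption, since this page does not name them: --freqx writes a genotype count report and --r2 writes a pairwise LD report. The Readme remains the reference for the exact command and options, and the prefixes and function name below are hypothetical.

```python
def build_summary_commands(bfile_prefix, out_prefix):
    """Return two PLINK 1.9 argument lists; nothing is executed here.

    Flag choice (--freqx, --r2) is an assumption, not taken from the protocol.
    """
    counts_cmd = [
        "plink", "--bfile", str(bfile_prefix),
        "--freqx",                        # genotype counts per SNP (.frqx)
        "--out", f"{out_prefix}_counts",
    ]
    ld_cmd = [
        "plink", "--bfile", str(bfile_prefix),
        "--r2",                           # pairwise LD report (.ld)
        "--out", f"{out_prefix}_ld",
    ]
    return counts_cmd, ld_cmd


counts_cmd, ld_cmd = build_summary_commands("genotypes_filtered", "genotypes_filtered")
# To actually run them:
# import subprocess
# for cmd in (counts_cmd, ld_cmd):
#     subprocess.run(cmd, check=True)
```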