Tasks to do - MartenKlaesson/Research_training GitHub Wiki

Popgen task 2019-06-19

1) Make a new plink file that contains all the ‘Unk’ populations as well as 5 individuals from each of the following: Biaka, Yoruba, Mbuti, Ju_hoan_North, CEU, FIN, TSI, French, Sardinian, Burosho, GIH, Brahui, Makrani, Punjabi, Han, Dai, Japanese, Miao, Oroqen, Koryak, Yakut, Altaian, Eskimo_Naukan, Aleut, Yukagir, Mayan, Zapotec, Mixe, Mixtec, Pima, Papuan, ONG.SG, Micronesian, Murut, Thai, Kankanaey, Kinh, Lao, Malay

I started off by extracting all the Unknowns and 5 individuals from each of the populations above into a text file, filter.txt. This text file was then used in the following command to create my .fam, .bed and .bim files:

plink --bfile Panel2019Exercise --keep filter.txt --make-bed --out Panel2019Exercise_sorted

The .fam file contains information about the samples (family ID, individual ID, parents, sex and phenotype), one line per individual, so it defines which populations/individuals are included. The .bim file lists the variant markers (chromosome, ID, position and alleles), and the .bed file is a binary file containing the genotypes.

597573 variants and 215 people pass filters and QC.
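The notes do not say how filter.txt was built, so here is a minimal sketch of one way to do it with awk. It assumes the population label sits in column 1 (FID) of the .fam file and that the unknown samples have labels starting with "Unk"; the function name make_keep_file is made up for this example. plink --keep expects one FID and IID per line, which is what the function prints.

```shell
# Sketch: build a --keep file (FID IID per line) from a PLINK .fam file.
# Assumption: column 1 (FID) of the .fam file holds the population label.

# make_keep_file FAM N POP1,POP2,...
# Prints FID and IID for every sample whose FID starts with "Unk",
# plus the first N samples (in file order) of each listed population.
make_keep_file() {
    awk -v n="$2" -v pops="$3" '
        BEGIN {
            # Turn the comma-separated list into a lookup table.
            k = split(pops, a, ",")
            for (i = 1; i <= k; i++) want[a[i]] = 1
        }
        # Keep every unknown sample.
        $1 ~ /^Unk/ { print $1, $2; next }
        # Keep at most n samples per requested population.
        ($1 in want) && (++seen[$1] <= n) { print $1, $2 }
    ' "$1"
}

# Example (population list shortened here):
# make_keep_file Panel2019Exercise.fam 5 "Biaka,Yoruba,Mbuti" > filter.txt
```

This takes the first 5 individuals it finds per population rather than a random 5, which is fine for an exercise but worth knowing.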

Then I removed the markers with more than 10% missing data with the following command:

plink --bfile Panel2019Ex_sorted --geno 0.1 --make-bed --out Panel2019Ex_sorted_2

6874 variants removed due to missing genotype data (--geno). 590699 variants and 215 people pass filters and QC.

The next step was to filter individuals on missingness, removing anyone with more than 15% missing genotypes. This did not remove anyone from the set.

plink --bfile Panel2019Ex_sorted_2 --mind 0.15 --make-bed --out Panel2019Ex_sorted_3

590699 variants and 215 people pass filters and QC.

Then I removed SNPs that are out of Hardy-Weinberg equilibrium, using a Hardy-Weinberg exact test with a p-value threshold of 0.001. This is done to remove problems that occurred during genotyping.

plink --bfile Panel2019Ex_sorted_3 --hwe 0.001 --make-bed --out Panel2019Ex_sorted_4

420124 variants and 215 people pass filters and QC.
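To keep track of how much each filtering step removes, the "pass filters and QC" line can be pulled out of the .log file that plink writes next to every --out prefix. This is a small sketch, not part of the original workflow; the function name qc_summary is made up, and it assumes the logs are given in the order the steps were run.

```shell
# qc_summary LOG...
# For each PLINK log (in the order given) print:
#   log name, variants kept, people kept, variants removed since the previous log
qc_summary() {
    awk '
        /pass filters and QC/ {
            # $1 = number of variants, $4 = number of people on this line
            removed = (prev == "" ? 0 : prev - $1)
            print FILENAME, $1, $4, removed
            prev = $1
        }
    ' "$@"
}

# Example:
# qc_summary Panel2019Ex_sorted_2.log Panel2019Ex_sorted_3.log Panel2019Ex_sorted_4.log
```

With the numbers above, the last step goes from 590699 to 420124 variants, so the --hwe filter removed 170575 variants.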

This is where I am now. Do I need to merge my data with comparative datasets? If so, which ones should I use? Also, the last time I did a PCA I had to run this code first:

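# Columns 1-5 of the .fam file, and column 1 (the population label) on its own
cut -d " " -f1-5 PopStrucIn1.fam > file1a
cut -d " " -f1 PopStrucIn1.fam > file2a
# Recode the labels as numbers. Unknown11 has to come before Unknown1, otherwise it turns into 511.
sed "s/Unknown11/61/g" < file2a | sed "s/Unknown1/51/g" | sed "s/Unknown3/53/g" | sed "s/Unknown5/55/g" | sed "s/CEU/81/g" | sed "s/YRI/82/g" | sed "s/Han/83/g" | sed "s/San/84/g" | sed "s/MbutiPygmies/85/g" > file3a
# Put the recoded label in as column 6 and make the file space-separated
paste file1a file3a > fileComb
sed "s/\t/ /g" fileComb > PopStrucIn1.pedind
rm file1a file2a file3a fileComb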

Is that the case this time as well?
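For reference, the same .fam to .pedind recoding can be sketched in one awk call with a lookup table. This is an alternative I have not run on the real data: the function name fam_to_pedind is made up, and unlike the sed version it matches whole labels only, so Unknown1 and Unknown11 cannot be mixed up. Labels that are not in the table are passed through unchanged.

```shell
# fam_to_pedind FAM
# Copies columns 1-5 of the .fam file and writes a numeric population
# code, looked up from column 1, as column 6.
fam_to_pedind() {
    awk '
        BEGIN {
            # Same label-to-number mapping as the sed commands above.
            code["Unknown1"] = 51; code["Unknown3"] = 53; code["Unknown5"] = 55
            code["Unknown11"] = 61; code["CEU"] = 81; code["YRI"] = 82
            code["Han"] = 83; code["San"] = 84; code["MbutiPygmies"] = 85
        }
        {
            # Whole-label match; unknown labels are kept as they are.
            label = ($1 in code) ? code[$1] : $1
            print $1, $2, $3, $4, $5, label
        }
    ' "$1"
}

# Example:
# fam_to_pedind PopStrucIn1.fam > PopStrucIn1.pedind
```

For the new dataset the table would need the new population labels, which is the part that changes from run to run.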

Also, what parameter file do I need for the PCA?

Finally, I tried to run the Admixture script and I submitted 120 sbatch scripts. Is that expected?

AFTER MEETING WITH MAX: I do not need to merge my dataset with a comparative dataset. The output from the PCA will be a .fts file plus .evec and .eval files. The .fts file can be used to build a tree in the MEGA program, but the other programs only require the plink files. To run the PCA I need the parameter file, which Max should have uploaded to UPPMAX.
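Until I have Max's file, this is a sketch of what a minimal smartpca (EIGENSOFT) parameter file usually looks like, written out by a small shell function. It assumes the PCA is run with smartpca on the plink files plus a .pedind file; the function name write_smartpca_par, the output file names and numoutevec: 10 are my own choices, and Max's real parameter file may set more options (for example whatever produces the .fts file).

```shell
# write_smartpca_par PREFIX
# Prints a minimal smartpca parameter file for PREFIX.bed / .bim / .pedind.
write_smartpca_par() {
    printf '%s\n' \
        "genotypename: $1.bed" \
        "snpname: $1.bim" \
        "indivname: $1.pedind" \
        "evecoutname: $1.evec" \
        "evaloutname: $1.eval" \
        "numoutevec: 10"
}

# Example:
# write_smartpca_par Panel2019Ex_sorted_4 > pca.par
# smartpca -p pca.par
```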

2) Run PCA on the new plink file

3)Run Admixture on the new plink file with 12 K and 10 iterations each (there is a script “AdmixtureRun.sh” that you can use)

4) Construct a phylogenetic tree on the new plink file
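Task 3 asks for 12 values of K with 10 iterations each, which comes to 12 x 10 = 120 runs, so 120 jobs if every run is submitted separately. The sketch below only prints the job list as a dry run. I do not know how AdmixtureRun.sh actually takes its arguments, so the "input K iteration" argument order is hypothetical, and K = 2 to 13 is an assumption (the task only says 12 values of K).

```shell
# list_admixture_jobs PREFIX KMIN KMAX REPS
# Prints one sbatch line per (K, iteration) pair without submitting anything.
# The body runs in a subshell so the loop variables do not leak out.
list_admixture_jobs() (
    k=$2
    while [ "$k" -le "$3" ]; do
        rep=1
        while [ "$rep" -le "$4" ]; do
            # Hypothetical argument convention for AdmixtureRun.sh.
            echo "sbatch AdmixtureRun.sh $1.bed $k $rep"
            rep=$((rep + 1))
        done
        k=$((k + 1))
    done
)

# Example: count the jobs before submitting them
# list_admixture_jobs Panel2019Ex_sorted_4 2 13 10 | wc -l
```

Checking the printed list first makes it easy to see the number of jobs before anything goes to the queue.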

2019-07-01

At this point:

I'm currently running the phasing job: 450 sbatch jobs, expected to be done before Wednesday. I'm trying to create a phylogenetic tree using Treemax (or whatever the name is, on Rackham). I'm also trying to make a neighbor-joining tree in MEGA, and am currently working on getting the file into a format the program can read. I'm trying to run CLUMP on my data from Admixture, and I will also try to get PONG to work so I can visualize the results.