Parsing - sungsik-kong/PhyNEST.jl GitHub Wiki

Parsing DNA alignment

The function readPhylip(args), where args stands for its arguments, parses the input alignment and stores the observed site pattern frequencies for every quartet (i.e., every combination of four taxa, or sequences) in the data, along with other relevant information, as a Julia object. readPhylip(args) accepts several arguments. What are they?

Mandatory argument

The file name of the sequence alignment. For example,

    julia> phylip_data = readPhylip("filename.phy")

Optional arguments

showProgress=true/false
- The boolean argument showProgress controls whether a progress bar and the estimated remaining time are displayed during data parsing. showProgress=true by default.
checkpoint=true/false
- The boolean argument checkpoint, when set to true, creates a .ckp file in the working directory once data parsing completes. This file has the same name as the input alignment file, with the extension .ckp. By default, checkpoint=false. See the Checkpointing page for more information.
writecsv=true/false
- The boolean argument writecsv creates a .csv file in the working directory upon the completion of the data parsing. This .csv file contains observed site pattern frequencies extracted from the data for every quartet. writecsv=false by default.
csvname=""
- csvname lets users change the name of the .csv file created when writecsv=true. The desired name is given inside the quotes. If csvname is not specified, the output .csv file has the same name as the input alignment file, with the prefix sitePatternCounts_.
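
As a reminder of how Julia keyword arguments with defaults behave, here is a minimal sketch. The function resolve_options is a hypothetical stand-in written for this page, not part of PhyNEST; it only mirrors the argument names and default values listed above and returns the settings it received.

```julia
# Hypothetical stand-in (NOT a PhyNEST function): it mirrors the keyword
# arguments and defaults described above and returns the resolved settings.
function resolve_options(inputfile::AbstractString;
                         showProgress::Bool=true,
                         checkpoint::Bool=false,
                         writecsv::Bool=false,
                         csvname::AbstractString="")
    return (inputfile=inputfile, showProgress=showProgress,
            checkpoint=checkpoint, writecsv=writecsv, csvname=csvname)
end

# Only the mandatory argument: every option keeps its default.
defaults = resolve_options("filename.phy")

# Override two options by name; the order of keyword arguments does not matter.
custom = resolve_options("filename.phy", csvname="my_counts", writecsv=true)
```

Any optional argument that is left out keeps its default, so only the options that differ from the defaults need to be written in a call.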

Task

Let's say we want to use readPhylip(args) to parse the input alignment sample_n5h1.phy, located in PhyNEST.jl.wiki/example-data, and name the resulting data object phylip_data. Using the optional arguments, we turn off the progress bar and create both a .ckp and a .csv file upon completion, with the .csv file named sample_n5h1.csv. Can you guess the command?

Command:

    julia> phylip_data = readPhylip("sample_n5h1.phy", showProgress=false, checkpoint=true, writecsv=true, csvname="sample_n5h1")

When you are ready, let's execute the command. This should take less than a minute.

Output:

    julia> phylip_data = readPhylip("sample_n5h1.phy", showProgress=false, checkpoint=true, writecsv=true, csvname="sample_n5h1")
    A [.csv] file is saved as sample_n5h1.csv.csv in the current working directory.
    Summary of Phylip File
    Parsing the file [sample_n5h1.phy] took 26.123 seconds.
    Number of taxa: 5
    Species names: ["outgroup", "species_4", "species_3", "species_1", "species_2"]
    Alignment length (b.p): 1000000
    Site patterns frequencies for 120 quartets computed and stored.
    Try `show_sp()` function to see all quartet site patterns.
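
The figure of 120 quartets may look surprising for only five taxa, since there are just 5 ways to choose four taxa out of five when order is ignored. The number does match the count of ordered selections of four taxa, 5 × 4 × 3 × 2 = 120. That one entry is stored per ordering is an inference from this number, not something stated on this page. The arithmetic can be checked with Julia's built-in binomial and factorial:

```julia
n = 5                                         # number of taxa reported in the summary

unordered_quartets = binomial(n, 4)           # choices of four taxa, order ignored
orderings_per_quartet = factorial(4)          # ways to order four chosen taxa
ordered_quartets = unordered_quartets * orderings_per_quartet

println("unordered: ", unordered_quartets)
println("ordered:   ", ordered_quartets)
```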

Now, we have parsed the input sequence alignment and computed observed quartet site pattern frequencies. But what does this mean?
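
As a rough first illustration, the sketch below tallies site patterns for a single quartet in plain Julia. The four short sequences are made-up toy data, and the helper names site_pattern and count_patterns are hypothetical. The relabeling scheme (bases renamed A, B, C, D in order of first appearance at each site) is one simple way to group sites into patterns, assumed here for illustration; it is not taken from PhyNEST's source and its labels may differ from those used by PhyNEST.

```julia
# Toy quartet: four aligned sequences of equal length (made-up data).
seqs = ["ACGTAC",
        "ACGTAA",
        "ACCTAC",
        "ATCTAG"]

# Relabel the four bases at one site by order of first appearance,
# e.g. ['C', 'C', 'G', 'T'] becomes "AABC".
function site_pattern(column)
    labels = Dict{Char,Char}()
    out = Char[]
    for base in column
        if !haskey(labels, base)
            labels[base] = 'A' + length(labels)   # next unused label
        end
        push!(out, labels[base])
    end
    return join(out)
end

# Tally the pattern observed at every site of the quartet.
function count_patterns(seqs)
    counts = Dict{String,Int}()
    for i in 1:length(seqs[1])
        column = [s[i] for s in seqs]
        pattern = site_pattern(column)
        counts[pattern] = get(counts, pattern, 0) + 1
    end
    return counts
end

counts = count_patterns(seqs)
```

For this toy quartet, three sites are identical across all four sequences (pattern AAAA), and the remaining three sites fall into the patterns AAAB, AABB and ABAC, once each.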

Next: Observed site patterns and Checkpointing