iPyRAD introductory tutorial with simulated data with Maribeth Latvis - barrettlab/2021-Genomics-bootcamp GitHub Wiki


Maribeth Latvis

Please see the Documentation for iPyRAD

Copyright 2019, Deren Eaton & Isaac Overcast Revision 1f88f521.

  1. Getting set up: type this first so your shell can find the ipyrad software:
# For 35.202.62.133: 

PATH=$PATH:/home/drmckain/miniconda3/bin

# For 25.193.79.219: 

PATH=$PATH:/home/drmckain/anaconda3/bin

# This tells unix where to find all of your software
# To check what is in your PATH, type:

echo $PATH

# output:
/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/drmckain/miniconda3/bin
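If you log in repeatedly, appending to PATH each time makes it grow with duplicate entries. A small sketch of a guard (the directory shown assumes the miniconda3 location above; adjust it to your own server):

```shell
# Append the conda bin directory only if it is not already in PATH
dir=/home/drmckain/miniconda3/bin   # adjust to your own install location
case ":$PATH:" in
  *":$dir:"*) echo "already in PATH" ;;
  *)          PATH="$PATH:$dir" ;;
esac
echo "$PATH" | tr ':' '\n' | tail -n 1   # show the last PATH entry
```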
  1. We also need to download some example simulated data:
curl -LkO https://eaton-lab.org/data/ipsimdata.tar.gz
tar -xvzf ipsimdata.tar.gz

# curl is a command that transfers data to/from a server
# tar is a command that extracts the data files
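If you want to see what an archive contains before unpacking it, tar can list its contents without extracting anything:

```shell
# -t lists the archive contents, -z handles gzip compression, -f names the file
tar -tzf ipsimdata.tar.gz | head
```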
  1. We will use just two files from this directory. The file rad_example_R1_.fastq.gz contains Illumina fastQ formatted reads and is gzip compressed. This is a typical format for raw data. The other file, rad_example_barcodes.txt, is a tab-separated table matching barcodes to sample IDs.

  2. To take a look at a fastq file:

gunzip -c ./ipsimdata/rad_example_R1_.fastq.gz | head -n 12

# gunzip uncompresses the file
# The '|' is a pipe, taking the output of one command and sending it to another
# head -n 12 will allow us to look at just the top 12 lines of the file

Each read takes four lines:

  • The first is the name of the read (its location on the plate).

  • The second line contains the sequence data.

  • The third line is a spacer.

  • And the fourth line shows the quality scores for the base calls. In this case they are arbitrarily high, since the data were simulated.
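Because each read takes exactly four lines, counting the reads in a fastq file is just a line count divided by four:

```shell
# Each fastq record is 4 lines, so reads = total lines / 4
gunzip -c ./ipsimdata/rad_example_R1_.fastq.gz | wc -l | awk '{print $1/4, "reads"}'
```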

  1. To take a look at the barcode file:
cat ./ipsimdata/rad_example_barcodes.txt
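A couple of quick sanity checks on a barcodes file can save a failed demultiplexing run. This sketch assumes the two-column layout shown by cat (sample ID, then barcode): all barcodes should usually be the same length, and the line count gives the number of samples.

```shell
# Column 2 holds the barcode; this prints the distinct barcode lengths seen
awk '{print length($2)}' ./ipsimdata/rad_example_barcodes.txt | sort -u
# One line per sample, so this is the sample count
wc -l < ./ipsimdata/rad_example_barcodes.txt
```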
  1. Create a parameter file for ipyrad!
ipyrad -n iptest

This will create a file in the current directory called params-iptest.txt. Each line of the params file holds one parameter value, followed by a ## mark, the parameter's name, and a short description of its purpose. Take a look at it using the unix command ‘cat’ (or any text editor you like).

  1. Edit the parameter file!
nano params-iptest.txt

In general the default parameter values are sensible, and we won’t mess with them for now, but there are a few parameters we must change. We need to set the path to the raw data we want to analyse, and we need to set the path to the barcodes file.
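As a scripted alternative to nano, the two paths can be filled in with sed. This is a sketch that assumes ipyrad's usual params layout, where raw_fastq_path is line ## [2] and barcodes_path is line ## [3]; check your own params-iptest.txt before scripting against it.

```shell
# Replace the value portion of the raw_fastq_path ([2]) and
# barcodes_path ([3]) lines, keyed on the ## [n] markers
sed -i 's|^[^#]*## \[2\]|./ipsimdata/rad_example_R1_.fastq.gz ## [2]|' params-iptest.txt
sed -i 's|^[^#]*## \[3\]|./ipsimdata/rad_example_barcodes.txt ## [3]|' params-iptest.txt
grep '## \[[23]\]' params-iptest.txt   # confirm the edits
```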

Running the iPyRAD pipeline

Step 1: Demultiplexing

Now we will start assembling the data with ipyrad. Step 1 reads in the barcodes file and the raw data. It scans through the raw data and sorts each read based on the mapping of samples to barcodes.

At the end of this step we’ll have a new directory in our project_dir called iptest_fastqs/. Inside this directory will be individual fastq.gz files for each sample.

ipyrad -p params-iptest.txt -s 1

ls iptest_fastqs

A more informative metric of success might be the number of raw reads demultiplexed for each sample. ipyrad tracks the state of all your steps in your current assembly, so at any time you can ask for results by invoking the -r flag.

ipyrad -p params-iptest.txt -r

# -r gives us information about the last executed step
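The -r report already lists per-sample read counts; if you want to confirm them straight from the demultiplexed files, the four-lines-per-read rule applies again:

```shell
# Count reads in each demultiplexed fastq (4 lines per read)
for f in iptest_fastqs/*.fastq.gz; do
  printf '%s\t%d\n' "$(basename "$f")" "$(( $(gunzip -c "$f" | wc -l) / 4 ))"
done
```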

Step 2: Filtering

This step filters reads based on quality scores and can also be used to detect Illumina adapters in your reads, a common concern with any NGS data set, and especially so for homebrew library preparations. Here the adapter filter is set to the default value of 0 (zero), meaning reads are filtered only on the quality scores of their base calls, with no adapter search. This is a good option if your data are already pre-filtered. The resulting filtered files from step 2 are written to a new directory called iptest/edits.
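Before choosing a filter setting, a rough check for adapter contamination can help. AGATCGGAAGAGC is the start of the common Illumina adapter sequence; a large count here would suggest turning the adapter filter on. This is only a crude indicator, not a substitute for a tool like FastQC.

```shell
# Count reads containing the common Illumina adapter prefix;
# '|| true' keeps the command from failing when zero matches are found
gunzip -c ./ipsimdata/rad_example_R1_.fastq.gz | grep -c AGATCGGAAGAGC || true
```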


ipyrad -p params-iptest.txt -s 2

Check output:

ls iptest/edits
ipyrad -p params-iptest.txt -r

Step 3: Clustering

Step 3 de-replicates and then clusters reads within each sample at the set clustering threshold, writing the clusters to new files in a directory called iptest_clust_0.85/.

  • Intuitively we are trying to identify all the reads that map to the same locus within each sample. The clustering threshold specifies the minimum sequence similarity two reads must share to be considered as coming from the same locus; below that threshold they are treated as different loci.

  • The true name of this output directory will be dictated by the value you set for the clust_threshold parameter in the params file.

  • You can see the default value is 0.85, so our default directory is named accordingly. This value dictates the percentage of sequence similarity that reads must have in order to be considered reads at the same locus.

  • You may want to experiment with this value, but 0.85-0.90 is a fairly reliable range, balancing over-splitting of loci vs over-lumping.

  • Note that the clustering steps take a little longer. With this simple, simulated data, it’ll take a couple of minutes. It would take much longer with real data.

ipyrad -p params-iptest.txt -s 3

Check output:

ipyrad -p params-iptest.txt -r

The stats output tells you how many clusters were found, and the number of clusters that pass the mindepth thresholds.

Check one of the files:

gunzip -c iptest_clust_0.85/1A_0.clustS.gz | head -n 28
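In the clusters file, de-replicated read names typically end in a size=N tag recording how many identical raw reads collapsed into that sequence, with clusters separated by // lines. Assuming that tag convention holds in your file (verify it against the head output above), depths can be summarized with a quick awk pass:

```shell
# Sum the size=N depth tags in one sample's clusters file
gunzip -c iptest_clust_0.85/1A_0.clustS.gz | grep -o 'size=[0-9]*' | \
  awk -F= '{total += $2; n++} END {print n " tagged reads, total depth " total}'
```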

Step 4: Joint estimation of heterozygosity and error rate

Here ipyrad jointly estimates the sequencing error rate and heterozygosity across samples. We’re trying to tell the difference between base calls that reflect real variation and those that are errors.

ipyrad -p params-iptest.txt -s 4

This step does not produce new output files, only a stats file with the estimated heterozygosity and error rate parameters.

Check the stats:

ipyrad -p params-iptest.txt -r

Step 5: Consensus base calls

This step uses the inferred error rate and heterozygosity (step 4) to call the consensus of sequences within each cluster.

Here we are identifying what we believe to be the real haplotypes at each locus within each sample.

ipyrad -p params-iptest.txt -s 5

Check output:

ipyrad -p params-iptest.txt -r

The important information is the number of reads_consens. This is the number of “good” reads within each sample that go on to the next step.

With empirical data, this is often a step where many reads are filtered out of the data set.

This step creates a new directory called ./iptest_consens to store the consensus sequences for each sample.

gunzip -c iptest_consens/1A_0.consens.gz | head
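The consensus files are fasta-like, so the reads_consens numbers reported by -r can be cross-checked by counting '>' headers per sample (assuming one header per consensus sequence, as the head output above shows):

```shell
# Count consensus sequences per sample (one '>' header per consensus)
for f in iptest_consens/*.consens.gz; do
  printf '%s\t%d\n' "$(basename "$f" .consens.gz)" "$(gunzip -c "$f" | grep -c '^>')"
done
```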

Step 6: Clustering across samples

Identifies similar sequences at each locus between samples.

This uses the same clustering threshold as step 3 to identify sequences between samples that are probably sampled from the same locus, based on sequence similarity.

ipyrad -p params-iptest.txt -s 6

Step 7: Filter and write output files

This step applies a filter for max number of indels per locus, max heterozygosity, max number of SNPs, and the min number of samples per locus (all of these are specified in the parameters file).

This step also writes a bunch of output files for common downstream applications.

ipyrad -p params-iptest.txt -s 7

On real (non-simulated) data:

A new directory is created called iptest_outfiles.

This directory contains all the output files specified in the params file.

The default is to create all supported output files which include:

  • PHYLIP (.phy)
  • NEXUS (.nex)
  • EIGENSTRAT’s genotype format (.geno)
  • STRUCTURE (.str)
  • as well as many others.
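A quick way to confirm the final assembly's dimensions is the PHYLIP file, whose first line gives the number of taxa and sites. The file name iptest.phy is assumed here from the assembly name set with -n; adjust if you used a different name.

```shell
ls iptest_outfiles/
# First line of the phylip file: n_taxa n_sites
head -n 1 iptest_outfiles/iptest.phy
```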