Introduction - graph-genome/Schematize GitHub Wiki

Introduction

Welcome to the Pantograph Pangenome Browser. This tool was designed to overcome several limitations of existing techniques and to communicate more information about the biological reality of genetic diversity. Pantograph’s main goals are to 1) show any combination of genome rearrangements between individuals and 2) show all sequence diversity in a reference-free global view.

The left-to-right x-axis of the browser is not a nucleotide position in a single reference genome, but a position in the pangenome: the union of all genetic diversity within the species. You’ll be able to see the sequences of specific individuals placed in the larger context of total genetic diversity.

At the end of this tutorial you should be able to browse a pangenome and translate what you see into biological insights.

An Introduction to Pangenome Graphs

Pangenome Sequence

Graph sorting imposes a single linear coordinate system on the whole graph by determining the order in which Nodes are listed. This is extremely useful for implementation and navigation. The global coordinates do not exactly match a reference genome or any other individual genome; converting Path coordinates to global coordinates is a separate step, described elsewhere. Thankfully, the graph genome is already mostly linear. From an evolutionary standpoint, we start with a single individual with a linear genome. Deletions, insertions, and SNPs add more columns to the matrix, but the result is still collinear (syntenic). Inversions and translocations introduce the first truly non-linear variation.
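To make the idea of global coordinates concrete, here is a minimal Python sketch. It is an illustration only, not the method Pantograph or odgi uses: the function names, the dictionary-based graph, and the toy node lengths are all assumptions, and node orientation (needed for inversions) is ignored.

```python
def global_offsets(sorted_nodes, node_lengths):
    """Return the global coordinate of the first base of each node.

    sorted_nodes: node IDs in sorted (left-to-right) order.
    node_lengths: dict mapping node ID to sequence length.
    Both structures are hypothetical, chosen for illustration.
    """
    offsets = {}
    position = 0
    for node in sorted_nodes:
        offsets[node] = position
        position += node_lengths[node]
    return offsets


def path_to_global(path, path_pos, node_lengths, offsets):
    """Convert a 0-based position along one individual's Path
    into a 0-based global (pangenome) coordinate."""
    remaining = path_pos
    for node in path:
        length = node_lengths[node]
        if remaining < length:
            return offsets[node] + remaining
        remaining -= length
    raise IndexError("path position is past the end of the path")


# Toy graph: B and C are the two alleles of a SNP between A and D.
node_lengths = {"A": 3, "B": 1, "C": 1, "D": 4}
offsets = global_offsets(["A", "B", "C", "D"], node_lengths)

individual_1 = ["A", "B", "D"]  # skips node C
individual_2 = ["A", "C", "D"]  # skips node B
```

In this toy graph, position 4 of `individual_1` falls on the first base of node D, so `path_to_global(individual_1, 4, node_lengths, offsets)` gives global coordinate 5: the individual's own coordinate and the global coordinate drift apart by one for every pangenome column the individual does not contain.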

Finding an ideal sort is a tricky problem, addressed by odgi sort. The goal is to place all syntenic variant Nodes next to each other, so that only a few rare Links need to bridge long distances across the pangenome to describe the unique ordering of each individual genome. A bad sort results in too many long-distance Links and in chromosomes scrambled together.
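One rough way to see why some sorts are better than others is to score an ordering by how far its Links have to reach. The sketch below is a toy scoring function written for this tutorial, not the algorithm used by odgi sort; the function name and the example graph are assumptions.

```python
def total_link_span(order, links):
    """Sum, over all Links, of the distance in sort rank
    between the two Nodes each Link connects.

    order: node IDs in sorted order.
    links: iterable of (from_node, to_node) pairs.
    A lower total means Links mostly join neighbouring Nodes.
    """
    rank = {node: i for i, node in enumerate(order)}
    return sum(abs(rank[a] - rank[b]) for a, b in links)


# Toy SNP bubble: A -> B -> D and A -> C -> D.
links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

good_order = ["A", "B", "C", "D"]  # alleles B and C sit side by side
bad_order = ["A", "D", "B", "C"]   # D is placed before its predecessors

good_score = total_link_span(good_order, links)
bad_score = total_link_span(bad_order, links)
```

Here `good_score` is lower than `bad_score`, reflecting that the first ordering keeps the two alleles between their shared neighbours. On a real pangenome the same intuition applies at a much larger scale, which is why a dedicated tool is needed.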

Pangenome Matrix depiction of Single Nucleotide Variants

Let’s start with the familiar territory of Multiple Sequence Alignments (MSA). An MSA shows many individual sequences together by inserting gaps in Individual A at the position where Individual B has extra sequence. This means that each column of an MSA shows one homologous base across all individuals. Of course, homology is inferred, not known. We tweak this concept by introducing a new column for every observed variant.

In a Pangenome Matrix (Matrix), each variant has its own column and is shown as either present or absent in each individual. In this representation of the multiple sequences, SNPs, insertions, and deletions are all represented in the same way. This allows you to read off the sequence of any individual by reading the top “Pangenome Sequence” and skipping nucleotides that are not present for that individual.
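The read-off rule can be written in a few lines of Python. This is a minimal sketch with made-up data: the pangenome sequence, the individual names, and the `read_sequence` helper are hypothetical and exist only to illustrate the presence/absence idea.

```python
def read_sequence(pangenome, presence):
    """Rebuild one individual's sequence from the Pangenome Sequence
    and that individual's presence/absence row (1 = present)."""
    if len(pangenome) != len(presence):
        raise ValueError("presence row must have one entry per column")
    return "".join(
        base for base, present in zip(pangenome, presence) if present
    )


# Hypothetical Pangenome Sequence, one column per base:
# index:     0  1  2  3  4  5  6  7
# base:      A  C  G  T  T  G  G  A
# Columns 2 and 3 are the two alleles of a SNP;
# columns 5 and 6 are an insertion; column 1 is deleted in one individual.
pangenome = "ACGTTGGA"

matrix = {
    "individual_1": [1, 1, 1, 0, 1, 0, 0, 1],  # G allele, no insertion
    "individual_2": [1, 1, 0, 1, 1, 1, 1, 1],  # T allele, with insertion
    "individual_3": [1, 0, 1, 0, 1, 0, 0, 1],  # G allele, C deleted
}

sequences = {
    name: read_sequence(pangenome, row) for name, row in matrix.items()
}
# sequences["individual_1"] == "ACGTA"
# sequences["individual_2"] == "ACTTGGA"
# sequences["individual_3"] == "AGTA"
```

Note that the SNP, the insertion, and the deletion are all encoded the same way, as columns that are simply present or absent in each row.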

Next: Links