xyzzy - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki

SEGUID sequence checksum calculation

What are checksums and why use them?

Biological sequence data files are stored in databases such as GenBank under an identifier (or key) that is unique for each sequence in the database. This ensures that a given key always retrieves the same file. For GenBank, the identifier is called the ACCESSION number (Figure 1).

Figure 1: The ACCESSION number is highlighted by the yellow box. The ACCESSION number is a unique identifier for GenBank records. See this link for further information on the GenBank file format.

The same sequence may be stored in different databases under different keys that are specific to each database. This is a problem since reference tables between keys have to be maintained in order to identify the same sequence.

The SEquence Globally Unique IDentifier (SEGUID) was proposed to solve this problem by allowing the calculation of a 27-character code from any biological sequence. The calculation of the SEGUID is based on a cryptographic checksum algorithm called SHA-1.

For example, the SEGUID for the sequence "atat" is 4OTkHvexAJWhbulLjgdi807FwqA.

The SEGUIDs for two sequences are very different even if the sequences are very similar.
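The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the original SEGUID definition: SHA-1 of the uppercase sequence, encoded with standard Base64, with the trailing "=" padding removed. The function name `seguid` is chosen here for illustration, and the calculator used below may use a different Base64 alphabet, so its output may not match character for character.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style checksum: Base64 of the SHA-1 digest, padding removed."""
    # Uppercase first so that "atat" and "ATAT" give the same checksum.
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # A 20-byte SHA-1 digest encodes to 28 Base64 characters, the last one being "=".
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Two sequences that differ in a single nucleotide give unrelated checksums.
for seq in ("atat", "atac"):
    print(seq, seguid(seq))
```

Each printed checksum is 27 characters long, and the two differ although the input sequences differ in only one position.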

Using the SEGUID calculator

Go to https://seguidcalculator.pythonanywhere.com/legacy. You should see a window similar to the one in Figure 2.

Figure 2: Empty SEGUID calculator main window

Now, open the text file "plasmid_pUC19_in_raw_sequence_format.txt" that should be in the same folder as this document. This file contains the complete sequence of the pUC19 plasmid. Paste the sequence into the large window marked with the orange rectangle and click on the "Calc" button. The same sequence can also be found in GenBank with the accession number L09137.

Figure 3: The SEGUID checksums for the pUC19 vector

You should get lsseguid=71B4PwSgBZ3htFjJXwHPxtUIPYE, the checksum for linear single-stranded sequences. There are also checksums for other topologies.

Now change the first character of the sequence from "T" to "a", as depicted in Figure 4. Do not change anything else. Then click the "Calc" button.

Figure 4: Changing the first letter in the pUC19 sequence. Compare with the previous figure.

Clicking the "Calc" button should give you lsseguid=scIQxOHOpWQOoMyD6VxzCBggwO8. As we can see, the new result is very different although only one out of a total of 2686 nucleotides was changed.

**Note!** Identical sequences in UPPERCASE, lowercase or mixed case have the same seguids.

Digits, spaces and any other characters that are not allowed in biological sequences can be present in the input, but they are removed automatically. This means that all characters except "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z" or "a b c d e f g h i j k l m n o p q r s t u v w x y z" are removed from the input. The Characters window shows all the different characters present in the sequence after filtering.

Always compare the calculated size with the size you expect, and check the characters that the sequence contains. You should be familiar with this tool, as we will use it to check that the DNA sequences that result from exercises are correct.
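The filtering described above can be sketched as follows. This is an illustration only, not the calculator's actual code; the function name `clean_sequence` and the example input are made up for this sketch.

```python
def clean_sequence(raw: str) -> str:
    """Keep only the ASCII letters A-Z and a-z and drop everything else."""
    return "".join(ch for ch in raw if ch.isascii() and ch.isalpha())


# Hypothetical input with position numbers, spaces, a line break and stray symbols.
raw = "1 gatt aatc 60\n61 GG-TC*"
cleaned = clean_sequence(raw)

print(cleaned)                        # gattaatcGGTC
print(len(cleaned))                   # 12
print(sorted(set(cleaned.upper())))   # ['A', 'C', 'G', 'T']
```

The last two lines correspond to the size and character checks: the length after filtering and the set of distinct characters that remain.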

There is a button called "Reverse complement". This button transforms the sequence in the window into the reverse complement and calculates the seguid.

Click the "Reverse complement" button. The new lsseguid should be lsseguid=giuu4irAEGCUtb9JD50X909J73Y.
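What the button does to the sequence can be sketched in Python. This is a minimal illustration that assumes only the four unambiguous DNA bases; the names `COMPLEMENT` and `reverse_complement` are chosen for this sketch, and the calculator itself may also handle ambiguity codes.

```python
# Complement table for the four unambiguous DNA bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence.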

Question 1:

Calculate the lsseguid for the sequence "gatt" and for its reverse complement. The first six characters are lsseguid=kAeu2g and lsseguid=Zd45Jj. What are the last three characters of each checksum?

Question 2:

Compare the seguids for the sequence "ggatcc" and the reverse complement. Why are they the same?

There is also an online version of the SEGUID calculator here.

The ldseguid checksum

A linear double-stranded (ds) DNA molecule can be described by either its upper or its lower strand. For example, a blunt dsDNA molecule with the upper strand TACGACC can be described as either TACGACC or GGTCGTA.

In order to produce a unique checksum, the ldseguid calculates the checksum for the lexicographically smallest of the strands (the one that comes first in alphabetical order) followed by the second strand.

If we sort the two strands, the first strand would be GGTCGTA and the second TACGACC, since G comes before T.
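The ordering step can be sketched as below. This only illustrates how the two strands are sorted. How they are then combined before hashing is not specified here, so the sketch stops at the ordering. The function name `canonical_strands` is made up for this sketch, and a blunt molecule with unambiguous bases is assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_strands(upper_strand: str) -> tuple:
    """Return both strands of a blunt dsDNA molecule, lexicographically smallest first."""
    upper = upper_strand.upper()
    # The lower strand, read 5' to 3', is the reverse complement of the upper strand.
    lower = upper.translate(COMPLEMENT)[::-1]
    first, second = sorted((upper, lower))
    return first, second


print(canonical_strands("TACGACC"))  # ('GGTCGTA', 'TACGACC')
print(canonical_strands("GGTCGTA"))  # ('GGTCGTA', 'TACGACC')
```

Both descriptions of the molecule give the same ordered pair, which is what makes the resulting checksum independent of the strand chosen as input.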

The csseguid and cdseguid checksums

The c(s/d)seguid algorithm looks at the circular sequence and rotates it until the lexicographically smallest rotation is found. For a double-stranded sequence, the algorithm then does the same thing for the reverse complement of the sequence, determines which of the two rotations is the smallest and uses this one for the calculation.

For example, consider the circular single-stranded sequence GATT, which consists of four nucleotides where the first G and the last T are connected.

There are four different cyclic permutations of the sequence and four for the reverse complement (AATC).

The smallest rotation is ATTG. The csseguid for GATT (and for all of its cyclic permutations) is therefore csseguid=7vp...egs.


If the sequence is double-stranded, there are four more sequences to consider: the cyclic permutations of the reverse complement. Now the smallest sequence is AATC.

The cdseguid for GATT (and for all eight of these sequences) is: cdseguid=PyB...An8.
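The search for the smallest rotation can be sketched as follows. This is a simple illustration that lists every rotation and takes the minimum. It is not necessarily how the calculator is implemented, and the function names are chosen for this sketch. Unambiguous bases and a non-empty sequence are assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(sequence: str) -> list:
    """Return all cyclic permutations of a circular sequence."""
    seq = sequence.upper()
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def smallest_rotation(sequence: str) -> str:
    """Return the lexicographically smallest cyclic permutation."""
    return min(rotations(sequence))


def reverse_complement(sequence: str) -> str:
    return sequence.upper().translate(COMPLEMENT)[::-1]


print(rotations("GATT"))          # ['GATT', 'ATTG', 'TTGA', 'TGAT']
print(smallest_rotation("GATT"))  # ATTG

# Double-stranded case: also consider the rotations of the reverse complement.
both = rotations("GATT") + rotations(reverse_complement("GATT"))
print(min(both))                  # AATC
```

Any rotation of GATT used as input gives the same smallest rotation, which is why all of them share one checksum.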

Question 3:

pUC19 is a commonly used E. coli cloning vector. It is a circular double-stranded DNA molecule. There are four sequences in GenBank that are very similar to the pUC19 plasmid sequence. The following four text files should be present in the directory of this document. They are downloaded copies of these sequences.

One differs from the others, which one?


JD007526_in_raw_sequence_format.txt
JD009973_in_raw_sequence_format.txt
L09137_in_raw_sequence_format.txt
M77789_in_raw_sequence_format.txt

Question 4:

The sequence CAT represents a circular double-stranded DNA molecule. List all the circular permutations of both the sequence and the reverse complement of the sequence. How many permutations are there? Do this manually on a piece of paper or in a text editor on your computer.

Question 5:

This is an individual exercise for each student. The input data can be found in the TP01 Google spreadsheet, where you can find your name in the leftmost column. Please answer as indicated for the first example student "Max Maximus". If your name is *not* in the list, please inform your instructor.

The three rightmost columns contain three sequences (Sequence1, 2 and 3).

Two of the sequences represent the same circular sequence and one of the sequences is different. Which one is different? Put your answer in the "Answer" column.

Literature

If you want to know more, there is optional further reading:

The Wikipedia article about checksums (Portuguese) or the English Wikipedia article about checksums. These articles explain checksums from a technical point of view.

Babnigg, G., and Giometti, C. S. (2006) A database of unique protein sequence identifiers for proteome studies. Proteomics 6, 4514-4522. link. This publication suggests the SEGUID as protein sequence identifiers.

Bassi, S., and Gonzalez, V. (2007) New checksum functions for Biopython. Nature Precedings. Link. This presentation deals with SEGUID implementation.

Website from the Argonne National Laboratory where the SEGUID was first developed link.
