WURCS notations - wurcs/WorkingGroup GitHub Wiki
WURCS (Web3 Unique Representation of Carbohydrate Structures) is a notation scheme and algorithm that encodes the wide-ranging and complex carbohydrate structure data held in databases as a unique linear string, based on atomic-level information about monosaccharides and their linkages.
Each section in WURCS 2.0 is composed of the following components.
<WURCS>
  L <Version>
  L <Unit Count>
      L <UniqueRES Count>
      L <RES Count>
      L <LIN Count> | <Uncertain LIN Count>
  L <UniqueRES List>
      L <UniqueRES>
          L <ResidueCode>
              L <BackboneCode>
                  L <SkeletonCode>
                      L <CarbonDescriptor>
                  L <Anomeric Information>
              L <MOD>
                  L <LIP> | <Statistic LIP> | <Alternative LIP>
                      <LIP>
                          L <Position>
                          L <Direction>
                          L <Star Index>
                      <Statistic LIP>
                          L <Probability Range>
                              L <Upper Probability>
                              L <Lower Probability>
                          L <LIP>
                      <Alternative LIP>
                          L <LIP>
                  L <MAP>
  L <RES Sequence>
      L <RES>
  L <LIN List>
      L <LIN> | <Repeated LIN>
          <LIN>
              L <GLIP> | <Statistic GLIP> | <Alternative GLIP> | <RES Alternative GLIP>
                  <GLIP>
                      L <RES Index>
                      L <Position>
                      L <Direction>
                      L <Star Index>
                  <Statistic GLIP>
                      L <Probability Range>
                      L <GLIP>
                  <Alternative GLIP>
                      L <GLIP>
                  <RES Alternative GLIP>
                      L <Alternative GLIP>
              L <MAP>
          <Repeated LIN>
              L <LIN>
              L <Repeat Count Range>
                  L <Max Repeat Count>
                  L <Min Repeat Count>
WURCS= <Version> / <Unit Count> / <UniqueRES List> / <RES Sequence> / <LIN List>
The format starts with WURCS= and is followed by the sections Version, Unit Count, UniqueRES List, RES Sequence, and LIN List, separated by forward slashes /. Version is the WURCS version (currently 2.0), Unit Count is the section which contains the number of UniqueRESs, RESs, and LINs, UniqueRES List is the section in which UniqueRESs are listed, RES Sequence is the section which uniquely lists the RESs in order, and LIN List is the section in which LINs are listed. Here, UniqueRES is a string representing every unique monosaccharide contained in the glycan, RES is an index number of the UniqueRES in UniqueRES List representing each monosaccharide residue, and LIN is a string representing linkage information between two or more different RESs. Components of these sections are ordered according to a sorting algorithm.
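The top-level layout can be sketched in Python as follows. This is a minimal illustration rather than a reference parser: the function name `split_wurcs` and the example string are assumptions made for this sketch. Because a MAP may itself contain a forward slash (for example `*NCC/3=O`), the sketch avoids a plain split on `/` and instead relies on two assumptions: UniqueRESs are always bracketed, and the RES Sequence contains no slash.

```python
import re

# Assumption: "/" inside a MAP is only protected by the surrounding brackets of
# a UniqueRES, and everything after the RES Sequence belongs to the LIN List.
_WURCS_PATTERN = re.compile(
    r"^WURCS=(?P<version>[^/]+)"
    r"/(?P<unit_count>[^/]+)"
    r"/(?P<unique_res_list>(?:\[[^\]]*\])+)"
    r"/(?P<res_sequence>[^/]*)"
    r"/(?P<lin_list>.*)$"
)


def split_wurcs(wurcs: str) -> dict:
    """Split a WURCS 2.0 string into its five top-level sections."""
    match = _WURCS_PATTERN.match(wurcs)
    if match is None:
        raise ValueError(f"not a recognisable WURCS string: {wurcs!r}")
    return match.groupdict()


# Illustrative input (hypothetical example, not taken from this page).
example = "WURCS=2.0/2,2,1/[a2122h-1x_1-5][a2112h-1b_1-5]/1-2/a4-b1"
sections = split_wurcs(example)
print(sections["unique_res_list"])
print(sections["lin_list"])
```

A single monosaccharide has an empty LIN List, so `split_wurcs("WURCS=2.0/1,1,0/[a2122h-1b_1-5_2*NCC/3=O]/1/")` returns an empty string for `lin_list` while keeping the slash inside the MAP intact.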
The Version section describes the current WURCS version as follows:
<Version>: 2.0 (current version)
The Unit Count section is composed of UniqueRES Count, RES Count and LIN Count, in that order, separated by commas , as follows:
<Unit Count>: <UniqueRES Count>,<RES Count>,<LIN Count>
They represent the number of UniqueRESs, RESs and LINs, respectively, in the represented glycan. However, the number of linkages may be uncertain. If the glycan potentially contains uncertain linkages, Uncertain LIN Count is used in place of LIN Count as follows:
<Uncertain LIN Count>: <LIN Count> +
For Uncertain LIN Count, plus + is added after the LIN Count, where LIN Count indicates the number of known linkages. Here, plus + indicates that there may potentially be more than LIN Count linkages.
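As a rough illustration, the Unit Count section could be read like this. The names `UnitCount` and `parse_unit_count` are hypothetical and only serve this sketch.

```python
from typing import NamedTuple


class UnitCount(NamedTuple):
    unique_res_count: int
    res_count: int
    lin_count: int             # number of known linkages
    lin_count_uncertain: bool  # True when the LIN Count carries a trailing "+"


def parse_unit_count(section: str) -> UnitCount:
    """Parse '<UniqueRES Count>,<RES Count>,<LIN Count>' with an optional '+'."""
    unique_res, res, lin = section.split(",")
    uncertain = lin.endswith("+")
    return UnitCount(int(unique_res), int(res), int(lin.rstrip("+")), uncertain)


print(parse_unit_count("2,2,1"))
print(parse_unit_count("4,6,5+"))
```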
The UniqueRES List section is composed of one or more UniqueRESs with no separator as follows:
<UniqueRES List>: <UniqueRES #1> <UniqueRES #2> ... <UniqueRES #n>
A UniqueRES is the ResidueCode enclosed in square brackets [ and ] as follows:
<UniqueRES>: [ <ResidueCode> ]
ResidueCode represents a monosaccharide residue consisting of a backbone and its modifications and is composed of a BackboneCode and some MODs separated by underscore _ as follows:
<ResidueCode>: <BackboneCode> _ <MOD #1> _ <MOD #2> _ ... _ <MOD #n>
BackboneCode represents the backbone structure of the monosaccharide residue and is composed of SkeletonCode and Anomeric Information, both described below, separated by a hyphen - as follows:
<BackboneCode>: <SkeletonCode> - <Anomeric Information>
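The nesting of UniqueRES List, ResidueCode and BackboneCode can be sketched as below. The function names are hypothetical, and the sketch assumes that neither `]` nor `_` occurs inside a MAP, and that a BackboneCode without a hyphen simply has no Anomeric Information.

```python
import re


def split_unique_res_list(section: str) -> list[str]:
    """Return the ResidueCode of each bracketed UniqueRES, in order."""
    return re.findall(r"\[([^\]]*)\]", section)


def split_residue_code(residue_code: str) -> dict:
    """Split a ResidueCode into SkeletonCode, Anomeric Information and MODs."""
    # The first underscore-separated field is the BackboneCode; the rest are MODs.
    backbone_code, *mods = residue_code.split("_")
    # The BackboneCode is '<SkeletonCode>-<Anomeric Information>'.
    skeleton_code, _, anomeric = backbone_code.partition("-")
    return {
        "skeleton_code": skeleton_code,
        "anomeric_information": anomeric,
        "mods": mods,
    }


codes = split_unique_res_list("[a2122h-1x_1-5][a2112h-1b_1-5]")
print(codes)
print(split_residue_code("a2122h-1b_1-5_2*NCC/3=O"))
```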
SkeletonCode represents a backbone carbon chain structure and is a sequence of CarbonDescriptors.
More detailed definitions are described in WURCS components#SkeletonCode.
CarbonDescriptor represents a carbon on the backbone carbon chain as a character.
More detailed definitions are described in WURCS components#CarbonDescriptor.
Anomeric Information represents the position and stereochemistry of an anomeric center.
Detailed definitions are described in WURCS components#Anomeric Information.
MOD represents a substituent modifying a backbone and is composed of one or more LIPs and a MAP. The LIPs are separated from each other by a hyphen -, and the MAP follows the last LIP, as follows:
<MOD>: <LIP #1> - <LIP #2> - ... - <LIP #n> <MAP>
LIP represents the linkage information between a backbone and its modification, and is composed of Position, Direction and Star Index as follows:
<LIP>: <Position> <Direction> <Star Index>
Position is the carbon number of the backbone, represented either by a positive integer or by a question mark ? indicating an unknown position. Direction is an alphabetical symbol that distinguishes modifications connected to the same backbone carbon by comparing them. Star Index is the index number that identifies which asterisk * in the MAP is connected to the backbone carbon, and is represented by a positive integer. Direction and Star Index can be omitted in some cases.
More detailed definitions are described in WURCS components#LIP/GLIP.
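A plain LIP can be picked apart with a small regular expression, as in the sketch below. The names `Lip` and `parse_lip` are hypothetical, and Statistic LIP and Alternative LIP are deliberately not handled. Note that without a Direction letter, digits following the Position could not be told apart from the Position itself, which is why a letter must sit between Position and Star Index.

```python
import re
from typing import NamedTuple, Optional


class Lip(NamedTuple):
    position: Optional[int]    # None stands for the unknown position "?"
    direction: Optional[str]   # None when the Direction is omitted
    star_index: Optional[int]  # None when the Star Index is omitted


# Position (digits or "?"), optional Direction letter, optional Star Index.
_LIP_PATTERN = re.compile(r"^(\?|\d+)([A-Za-z])?(\d+)?$")


def parse_lip(text: str) -> Lip:
    match = _LIP_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a plain LIP: {text!r}")
    position, direction, star_index = match.groups()
    return Lip(
        None if position == "?" else int(position),
        direction,
        int(star_index) if star_index else None,
    )


print(parse_lip("2"))
print(parse_lip("3n2"))
print(parse_lip("?"))
```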
MAP represents the atomic group of the modification. Each position on the modification that is connected to a backbone carbon is also included in the MAP, represented as an asterisk *. The MAP has two or more asterisks when the modification is connected to multiple backbone carbons. If these positions can be distinguished from each other, a Star Index must be added after each asterisk for indexing.
More detailed definitions are described in WURCS components#MAP.
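Putting LIPs and MAP together, a MOD could be separated as sketched below. The function name `split_mod` is hypothetical. The sketch assumes that the MAP begins at the first asterisk in the MOD and that only plain LIPs are present (no probability ranges, which would themselves contain hyphens).

```python
def split_mod(mod: str) -> tuple[list[str], str]:
    """Split a MOD into its LIP strings and its MAP ('' when the MAP is omitted)."""
    star = mod.find("*")
    if star == -1:
        # No asterisk at all: the MAP has been omitted.
        lips_part, map_part = mod, ""
    else:
        lips_part, map_part = mod[:star], mod[star:]
    return lips_part.split("-"), map_part


print(split_mod("2*NCC/3=O"))
print(split_mod("1-5"))
```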
The LIN List section is composed of a list of LINs separated by an underscore _ as follows:
<LIN List>: <LIN #1> _ <LIN #2> _ ... _ <LIN #n>
If there is no LIN, e.g. in the case of a single monosaccharide, LIN List is left blank.
A LIN is composed of one or more GLIPs and a MAP as follows:
<LIN>: <GLIP #1> - <GLIP #2> - ... - <GLIP #n> <MAP>
A GLIP is the same as a LIP, except that the connecting modification bridges two or more backbones. Therefore, GLIP requires the specification of a RES Index as follows:
<GLIP>: <RES Index> <Position> <Direction> <Star Index>
RES Index is an alphabetical index corresponding to the order of the RESs in the RES Sequence, where a = 1, b = 2, etc. Because letters and numbers alternate, the components of a LIP or GLIP can be parsed without separators, which greatly reduces the length of the whole string.
More detailed definitions are described in WURCS components#LIP/GLIP.
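The alternation of letters and numbers makes a plain GLIP easy to tokenise, as the following sketch shows. The names `Glip`, `parse_glip` and `parse_lin` are hypothetical. The sketch only covers RES Indices a to z (how indices beyond z are written is not covered here), assumes the MAP starts at the first asterisk, and ignores Statistic, Alternative and Repeated forms.

```python
import re
from typing import NamedTuple, Optional


class Glip(NamedTuple):
    res_index: int             # a = 1, b = 2, ...
    position: Optional[int]    # None stands for the unknown position "?"
    direction: Optional[str]
    star_index: Optional[int]


# RES Index letter, Position, optional Direction letter, optional Star Index.
_GLIP_PATTERN = re.compile(r"^([a-z])(\?|\d+)([A-Za-z])?(\d+)?$")


def parse_glip(text: str) -> Glip:
    match = _GLIP_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a plain GLIP: {text!r}")
    letter, position, direction, star_index = match.groups()
    return Glip(
        ord(letter) - ord("a") + 1,
        None if position == "?" else int(position),
        direction,
        int(star_index) if star_index else None,
    )


def parse_lin(lin: str) -> tuple[list[Glip], str]:
    """Split a LIN into parsed GLIPs and its MAP ('' when the MAP is omitted)."""
    star = lin.find("*")
    glips_part, map_part = (lin, "") if star == -1 else (lin[:star], lin[star:])
    return [parse_glip(glip) for glip in glips_part.split("-")], map_part


print(parse_lin("a4-b1"))
```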
To handle ambiguous linkages, several sections are defined in both WURCS 1.0 and 2.0.
If the linkage information is represented statistically, Statistic LIP is used instead of LIP. For Statistic LIP, the Probability Range is enclosed by two percent sign symbols %, alongside the LIP section as follows:
<Statistic LIP>: % <Probability Range> % <LIP> or <LIP> % <Probability Range> %
If the Probability Range precedes the LIP, the probability indicates that the linkage is indefinite on the backbone side; if it follows the LIP, the linkage is indefinite on the modification side. Within the Probability Range, the upper and lower probability values are separated by a hyphen -, but they are written as a single value if they are the same, as follows:
<Probability Range>: <Upper Probability> - <Lower Probability> or <Probability>
The probability value is represented as a decimal fraction, with no zero in the whole number position, e.g. 0.333 is represented as .333. A question mark ? is used for an unknown probability value.
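A Statistic LIP could be decoded along these lines. The function names and the returned dictionary layout are hypothetical choices for this sketch, which assumes exactly one Probability Range per LIP, placed either before or after it.

```python
import re
from typing import Optional


def parse_probability(text: str) -> Optional[float]:
    """'.333' -> 0.333; '?' -> None (unknown probability)."""
    return None if text == "?" else float(text)


def parse_probability_range(text: str) -> tuple:
    """Return (upper, lower); a single value is used for both bounds."""
    upper, separator, lower = text.partition("-")
    if not separator:
        lower = upper
    return parse_probability(upper), parse_probability(lower)


def parse_statistic_lip(text: str) -> dict:
    before = re.match(r"^%([^%]+)%([^%]+)$", text)
    if before:
        return {"lip": before.group(2),
                "range": parse_probability_range(before.group(1)),
                "range_position": "before"}
    after = re.match(r"^([^%]+)%([^%]+)%$", text)
    if after:
        return {"lip": after.group(1),
                "range": parse_probability_range(after.group(2)),
                "range_position": "after"}
    raise ValueError(f"not a Statistic LIP: {text!r}")


print(parse_statistic_lip("%.5-.3%2"))
print(parse_statistic_lip("6%?%"))
```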
If alternative linkage positions are possible, Alternative LIP is used instead of LIP. For Alternative LIP, each candidate LIP is separated by a vertical bar | as follows:
<Alternative LIP>: <LIP #1> | <LIP #2> | ... | <LIP #n>
Statistic LIP and Alternative LIP cannot be used together.
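The distinction between the three LIP forms, and the rule that the statistic and alternative forms are mutually exclusive, can be sketched as a simple check on the delimiter characters. The function names here are hypothetical.

```python
def classify_lip(text: str) -> str:
    """Return 'statistic', 'alternative' or 'plain' for a LIP string."""
    has_probability = "%" in text
    has_alternatives = "|" in text
    if has_probability and has_alternatives:
        # Statistic LIP and Alternative LIP cannot be used together.
        raise ValueError(f"both % and | found in LIP: {text!r}")
    if has_probability:
        return "statistic"
    if has_alternatives:
        return "alternative"
    return "plain"


def split_alternative_lip(text: str) -> list[str]:
    """Return the candidate LIPs of an Alternative LIP."""
    return text.split("|")


print(classify_lip("2|3|4"), split_alternative_lip("2|3|4"))
```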
Statistic GLIP and Alternative GLIP can also be used in the same way as Statistic LIP and Alternative LIP, respectively. These are represented as follows:
<Statistic GLIP>: % <Probability Range> % <GLIP> or <GLIP> % <Probability Range> %
<Alternative GLIP>: <GLIP #1> | <GLIP #2> | ... | <GLIP #n>
If Alternative GLIP contains linkages toward two or more backbones, RES Alternative GLIP is used instead. In the RES Alternative GLIP section, curly brackets { or } are added surrounding Alternative GLIP as follows:
<RES Alternative GLIP>: { <Alternative GLIP> or <Alternative GLIP> }
{ is added at the beginning to indicate the start of an alternative branch, and } at the end to indicate the end.
If the glycan has a repeating unit, Repeated LIN is used instead of LIN. Repeated LIN contains the linkage LIN between the starting and ending monosaccharides in the repeating unit. In Repeated LIN, the Repeat Count Range section is added following the LIN separated by a tilde ~ as follows:
<Repeated LIN>: <LIN> ~ <Repeat Count Range>
In Repeat Count Range, the maximum and minimum count numbers are separated by a colon :, or a single number is used if they are the same, as follows:
<Repeat Count Range>: <Max Repeat Count> : <Min Repeat Count> or <Repeat Count>
The count number is represented as a positive integer. n is used for an unknown count number.
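A Repeated LIN could be separated from its Repeat Count Range as in this sketch. The function names are hypothetical, the order of the two counts follows the definition above (maximum first), and the split is made on the last tilde in the string.

```python
from typing import Optional


def parse_repeat_count(text: str) -> Optional[int]:
    """'n' -> None (unknown count); otherwise a positive integer."""
    return None if text == "n" else int(text)


def parse_repeated_lin(text: str) -> tuple:
    """Return (LIN string, (max, min)) or (LIN string, None) for a plain LIN."""
    lin, separator, count_range = text.rpartition("~")
    if not separator:
        return text, None
    maximum, colon, minimum = count_range.partition(":")
    if not colon:
        # A single number means that maximum and minimum are the same.
        minimum = maximum
    return lin, (parse_repeat_count(maximum), parse_repeat_count(minimum))


print(parse_repeated_lin("a1-b4~n"))
print(parse_repeated_lin("a1-c4~10:2"))
print(parse_repeated_lin("a4-b1"))
```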
Some sections can be omitted when the information is obvious from the rest of the structure.
MOD is omitted if the substituent is a hydrogen, hydroxyl or carbonyl group.
MAP is omitted if it is an ether linkage, represented as *O*.
Star Index is omitted if its number is zero (0).
Direction is omitted if it is obvious from the CarbonDescriptor and the Star Index is also omitted. When the Star Index is not omitted but the Direction is obvious from the CarbonDescriptor, n is used for the Direction to serve as a separator between Position and Star Index. The CarbonDescriptors for which Direction is not omitted are M, C, c, N and n.