Variations - giffordlabcvr/Hepadnaviridae-GLUE GitHub Wiki

Overview


Patterns of residues within virus sequences, at both the nucleotide and amino acid levels, are associated with specific functions or phenotypes. Knowledge about such residue patterns is typically derived from testing specific virus strains in the laboratory or observing their specific phenotypes. As these patterns emerge from research, it is often of interest to quickly scan sets of sequences to investigate which sequences contain the pattern, and if so, in what form.

A Variation is a named nucleotide or amino-acid residue pattern, captured as an object in the GLUE database. Variations are created and configured using GLUE commands. Once they exist within a project, a range of GLUE commands can be used to quickly scan for their presence in different forms of sequence data.

  1. Variation concepts
  2. Variation type: nucleotideSimplePolymorphism
  3. Variation type: nucleotideRegexPolymorphism
  4. Variation type: nucleotideInsertion
  5. Variation type: nucleotideDeletion
  6. Variation type: aminoAcidSimplePolymorphism
  7. Variation type: aminoAcidRegexPolymorphism
  8. Variation type: aminoAcidInsertion
  9. Variation type: aminoAcidDeletion
  10. Variation type: conjunction
  11. Examples of Variation creation commands
  12. Commands for Variation-scanning

Variation concepts


A Variation must be created within an existing FeatureLocation object belonging to a specific ReferenceSequence. This is in order to anchor it to a specific genomic location. This design allows documented residue patterns from the research literature to be quickly incorporated into a GLUE project using standardised reference coordinates.

To create a Variation you need to follow these steps:

  1. Choose a ReferenceSequence on which the Variation will be defined, and enter the command mode for that ReferenceSequence. If your project uses an alignment tree, this would often be the constraining ReferenceSequence of the root alignment tree node.
  2. Choose a FeatureLocation within the ReferenceSequence. This would often be the gene where the Variation pattern might be located. Enter the command mode for that FeatureLocation.
  3. Choose a name for the Variation. This identifier has to be unique amongst the Variations within the selected FeatureLocation.
  4. Choose a type for the Variation. This will depend on the kind of pattern you want to scan for. There are nine types available; each type is described below.
  5. Use the create variation command. For most types of variation, you must also supply start/end coordinates within the FeatureLocation; these can be expressed in terms of the ReferenceSequence nucleotide position. Alternatively, for protein-coding Features, they can be expressed using codon labels.
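
The relationship between the two coordinate styles in step 5 can be illustrated with a little arithmetic. This is a minimal sketch, not GLUE code: it assumes codon labels are consecutive integers starting at 1 and that the feature is a single contiguous coding region. The function name `codon_to_nt_range` is hypothetical, and the ORF3 start position 5106 is inferred from the scan output shown at the end of this page (codon 96 maps to reference nucleotide 5391).

```python
def codon_to_nt_range(feature_nt_start, first_codon, last_codon):
    """Convert a labeled-codon range to reference nucleotide coordinates.

    Assumes 1-based, consecutive integer codon labels on a contiguous feature.
    """
    nt_start = feature_nt_start + 3 * (first_codon - 1)
    nt_end = feature_nt_start + 3 * last_codon - 1
    return nt_start, nt_end


# ORF3 start inferred from the scan output later on this page (an assumption).
ORF3_START = 5106
print(codon_to_nt_range(ORF3_START, 96, 99))    # (5391, 5402)
print(codon_to_nt_range(ORF3_START, 105, 108))  # (5418, 5429)
```

Features with non-integer codon labels or spliced locations would need a lookup table rather than this formula.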

Once the Variation has been created, a command mode for that variation becomes available. Within this command mode you can use set metatag to configure values for the metatags of the Variation. These configure how scanning for the Variation will operate. The specific metatags for each variation type are explained in the relevant section below.

Concrete examples of how Variations may be created are given in the section "Examples of Variation creation commands" below.

Finally, you can apply your Variations to sequence data using the Variation-scanning commands described at the end of this page.

Variation type: nucleotideSimplePolymorphism


Scans for a fixed, contiguous string of unambiguous nucleotides anywhere between the start and end locations of the Variation (inclusive).

Required metatags
  • SIMPLE_NT_PATTERN: The nucleotide pattern: a string of A, C, G or T characters.
Optional metatags
  • MIN_COMBINED_NT_FRACTION: Floating-point number between 0.0 and 1.0, default 1.0. Setting this to less than 1.0 allows the pattern to be detected in cases where the sequence data contains ambiguous nucleotide characters. Ambiguous FASTA characters allow alternative unambiguous options. For example the ambiguous Y allows two options: C or T. A fraction is computed for each possible match between the sequence and pattern. Initially the fraction is 1.0. Each time an ambiguous character is encountered which allows the relevant pattern character, the fraction is divided by the number of alternative options. So if for example a C is required by the pattern but in the potential match location the sequence contains Y, the fraction for that match would be halved. The metatag value specifies the minimum combined fraction which is allowed for the Variation to be considered present.
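
The combined fraction described above can be sketched in a few lines of Python. This is an illustration of the calculation as described, not GLUE's implementation; the function names `combined_nt_fraction` and `scan_simple_nt` are hypothetical, and the table uses the standard IUPAC nucleotide ambiguity codes.

```python
# Standard IUPAC ambiguity codes and the unambiguous bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def combined_nt_fraction(pattern, window):
    """Fraction for one candidate match; 0.0 if any position cannot match."""
    fraction = 1.0
    for required, observed in zip(pattern, window):
        options = IUPAC[observed]
        if required not in options:
            return 0.0
        # Each ambiguous character divides the fraction by its number of options.
        fraction /= len(options)
    return fraction


def scan_simple_nt(pattern, sequence, min_fraction=1.0):
    """Return (0-based offset, fraction) for each window meeting the threshold."""
    hits = []
    for i in range(len(sequence) - len(pattern) + 1):
        f = combined_nt_fraction(pattern, sequence[i:i + len(pattern)])
        if f > 0.0 and f >= min_fraction:
            hits.append((i, f))
    return hits


print(scan_simple_nt("ACG", "TTAYGT"))       # [] with the default threshold of 1.0
print(scan_simple_nt("ACG", "TTAYGT", 0.5))  # [(2, 0.5)] because Y allows C or T
```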

Variation type: nucleotideRegexPolymorphism


A nucleotide pattern match which allows for greater expressive flexibility. A nucleotide regular expression is defined; this may match anywhere between the start and end locations of the Variation (inclusive).

Required metatags
  • REGEX_NT_PATTERN: The regular expression to be matched. This may use any metacharacter or mechanism described in the Java 8 java.util.regex.Pattern documentation.

Example: the regular expression C[GT]{3,7}AA calls for a C, followed by between 3 and 7 Gs or Ts in any combination, followed by two As.
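
To see how this expression behaves, it can be tried outside GLUE. The sketch below uses Python's `re` module rather than Java's `java.util.regex`; the two agree on the syntax used in this particular pattern, but that is not guaranteed for every construct. The helper `find_regex_hits` and the test sequence are made up for illustration, and the helper reports non-overlapping matches only.

```python
import re

PATTERN = re.compile(r"C[GT]{3,7}AA")


def find_regex_hits(sequence, start, end):
    """Return (start, end, text) for matches inside the 1-based inclusive
    region [start, end] of the sequence."""
    region = sequence[start - 1:end]
    return [
        (m.start() + start, m.end() + start - 1, m.group())
        for m in PATTERN.finditer(region)
    ]


print(find_regex_hits("AACGTGTAATT", 1, 11))  # [(3, 9, 'CGTGTAA')]
print(find_regex_hits("CGTAA", 1, 5))         # [] since only two G/T follow the C
```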

Variation type: nucleotideInsertion


An insertion of nucleotides relative to the ReferenceSequence within which the Variation is defined. The insertion must start after the start nucleotide and finish before the end nucleotide of the Variation.

Optional metatags
  • FLANKING_NTS: Optional integer, default 3. The inserted nucleotides must be flanked on either side by blocks of nucleotides which are homologous to adjacent blocks on the ReferenceSequence. This metatag dictates the minimum length of these blocks.
  • MIN_INSERTION_LENGTH_NTS: Optional integer, default null. The minimum number of inserted nucleotides.
  • MAX_INSERTION_LENGTH_NTS: Optional integer, default null. The maximum number of inserted nucleotides.

Variation type: nucleotideDeletion


A deletion of nucleotides relative to the ReferenceSequence within which the Variation is defined. The deletion must start at or after the start nucleotide and finish at or before the end nucleotide of the Variation.

Optional metatags
  • FLANKING_NTS: Optional integer, default 3. The deleted nucleotides on the ReferenceSequence must be flanked on either side by blocks of nucleotides which are homologous to adjacent blocks on the query sequence. This metatag dictates the minimum length of these blocks.
  • MIN_DELETION_LENGTH_NTS: Optional integer, default null. The minimum number of deleted nucleotides.
  • MAX_DELETION_LENGTH_NTS: Optional integer, default null. The maximum number of deleted nucleotides.
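
The rules for the two nucleotide indel types can be sketched as follows. This is a rough illustration under stated assumptions, not GLUE's algorithm: the homology between query and reference is assumed to be a sorted list of ungapped aligned segments with 1-based inclusive coordinates, and `Segment` and `find_indels` are hypothetical names. A single `min_len`/`max_len` pair stands in for the separate insertion and deletion length metatags.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One ungapped block of homology (1-based, inclusive coordinates)."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


def _length_ok(n, lo, hi):
    # A null (None) bound means no limit on that side.
    return (lo is None or n >= lo) and (hi is None or n <= hi)


def find_indels(segments, var_start, var_end, flanking=3, min_len=None, max_len=None):
    """Return (insertions, deletions) found between consecutive segments."""
    insertions, deletions = [], []
    for a, b in zip(segments, segments[1:]):
        # Both flanking homologous blocks must be long enough.
        if min(a.ref_end - a.ref_start, b.ref_end - b.ref_start) + 1 < flanking:
            continue
        ref_gap = b.ref_start - a.ref_end - 1
        query_gap = b.query_start - a.query_end - 1
        if ref_gap == 0 and query_gap > 0:
            # Insertion: starts after var_start and finishes before var_end.
            if a.ref_end >= var_start and b.ref_start <= var_end \
                    and _length_ok(query_gap, min_len, max_len):
                insertions.append((a.ref_end, b.ref_start, query_gap))
        elif query_gap == 0 and ref_gap > 0:
            # Deletion: deleted reference range lies within [var_start, var_end].
            if a.ref_end + 1 >= var_start and b.ref_start - 1 <= var_end \
                    and _length_ok(ref_gap, min_len, max_len):
                deletions.append((a.ref_end + 1, b.ref_start - 1, ref_gap))
    return insertions, deletions


segs = [Segment(1, 10, 1, 10), Segment(11, 20, 14, 23), Segment(24, 30, 24, 30)]
print(find_indels(segs, 1, 30))  # ([(10, 11, 3)], [(21, 23, 3)])
```

Insertions are reported as the two reference positions they sit between plus their length; deletions as the deleted reference range plus its length.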

Variation type: aminoAcidSimplePolymorphism


Scans for a fixed, contiguous string of amino acid residues within the protein translation, anywhere between the start and end locations of the Variation (inclusive).

Required metatags
  • SIMPLE_AA_PATTERN: The amino acid pattern: a string of amino acid FASTA characters.
Optional metatags
  • MIN_COMBINED_TRIPLET_FRACTION: Floating-point number between 0.0 and 1.0, default 1.0. Allows for matching of amino acid patterns in the presence of ambiguous nucleotide bases. For each amino acid residue in the pattern, a triplet of three possibly ambiguous nucleotide characters is scanned. If the triplet contains ambiguous characters then there may be multiple unambiguous nucleotide triplets which are consistent. The fraction of unambiguous consistent triplets which code for the required amino acid residue is the triplet fraction. If the pattern contains multiple amino acid residues, the triplet fractions of each are multiplied together to produce a fraction for the possible match. The metatag provides the minimum value for the possible match.
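
The triplet fraction calculation described above can be sketched in Python. This is an illustration of the description, not GLUE's code: it assumes the standard genetic code and the standard IUPAC ambiguity codes, and the names `triplet_fraction` and `combined_triplet_fraction` are hypothetical.

```python
from itertools import product

# Standard IUPAC ambiguity codes and the unambiguous bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def triplet_fraction(triplet, amino_acid):
    """Fraction of the unambiguous triplets consistent with `triplet`
    that code for `amino_acid`."""
    options = ["".join(c) for c in product(*(IUPAC[b] for b in triplet))]
    coding = sum(1 for codon in options if CODON_TABLE[codon] == amino_acid)
    return coding / len(options)


def combined_triplet_fraction(aa_pattern, nts):
    """Multiply the triplet fractions for each residue of the pattern."""
    fraction = 1.0
    for i, aa in enumerate(aa_pattern):
        fraction *= triplet_fraction(nts[3 * i:3 * i + 3], aa)
    return fraction


print(triplet_fraction("AAN", "K"))                         # 0.5 (AAA, AAG code for K)
print(combined_triplet_fraction("PSAP", "CCCTCGGCTCCT"))    # 1.0
print(combined_triplet_fraction("PSAP", "CCCWCGGCTCCN"))    # 0.5 (WCG is S or T)
```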

Variation type: aminoAcidRegexPolymorphism


An amino acid pattern match which allows for greater expressive flexibility. An amino acid regular expression is defined; this may match anywhere in the protein translation between the start and end locations of the Variation (inclusive).

Required metatags
  • REGEX_AA_PATTERN: The regular expression to be matched. This may use any metacharacter or mechanism described in the Java 8 java.util.regex.Pattern documentation.

Variation type: aminoAcidInsertion


An insertion of amino acids relative to the ReferenceSequence within which the Variation is defined. The insertion must start after the start location and finish before the end location of the Variation.

Optional metatags
  • FLANKING_AAS: Optional integer, default 1. The inserted amino acids must be flanked on either side by blocks of amino acids which are homologous to adjacent blocks on the ReferenceSequence. This metatag dictates the minimum length of these blocks (in amino acids).
  • MIN_INSERTION_LENGTH_AAS: Optional integer, default null. The minimum number of inserted amino acids.
  • MAX_INSERTION_LENGTH_AAS: Optional integer, default null. The maximum number of inserted amino acids.

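Variation type: aminoAcidDeletion


A deletion of amino acids relative to the ReferenceSequence within which the Variation is defined. The deletion must start at or after the start location and finish at or before the end location of the Variation.

Optional metatags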
  • FLANKING_AAS: Optional integer, default 1. The deleted amino acids on the ReferenceSequence must be flanked on either side by blocks of amino acids which are homologous to adjacent blocks on the query sequence. This metatag dictates the minimum length of these blocks (in amino acids).
  • MIN_DELETION_LENGTH_AAS: Optional integer, default null. The minimum number of deleted amino acids.
  • MAX_DELETION_LENGTH_AAS: Optional integer, default null. The maximum number of deleted amino acids.

Variation type: conjunction


Scan for the conjunction of multiple Variations. A set of up to 5 "conjunct" Variations is specified using metatags. The conjunction Variation is considered to have matched if and only if all the conjunct Variations have matched. Note that start / end locations are not required for this Variation type.

Required metatags
  • CONJUNCT_NAME_1: Names the first conjunct Variation. This must be defined on the same ReferenceSequence and FeatureLocation as the conjunction.
Optional metatags
  • CONJUNCT_NAME_2, CONJUNCT_NAME_3, CONJUNCT_NAME_4, CONJUNCT_NAME_5: Name the second, third, fourth and fifth conjunct Variations as necessary. These must be defined on the same ReferenceSequence and FeatureLocation as the conjunction.
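
The matching rule for a conjunction is a logical AND over its conjuncts, which can be sketched as below. The function `conjunction_matches` and the dictionary of per-Variation results are hypothetical stand-ins for illustration; the Variation names in the usage lines are made up.

```python
def conjunction_matches(conjunct_names, scan_results):
    """Return True if and only if every named conjunct Variation matched.

    `scan_results` maps a Variation name to True/False (did it match?).
    A conjunct with no recorded result is treated as not matched.
    """
    if not 1 <= len(conjunct_names) <= 5:
        raise ValueError("a conjunction takes between 1 and 5 conjuncts")
    return all(scan_results.get(name, False) for name in conjunct_names)


results = {"VAR_A": True, "VAR_B": True, "VAR_C": False}
print(conjunction_matches(["VAR_A", "VAR_B"], results))           # True
print(conjunction_matches(["VAR_A", "VAR_B", "VAR_C"], results))  # False
```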

Examples of Variation creation commands


  1. This example is from hepatitis C research. A nucleotide binding motif (NBM) was suggested in the NS4B protein (Einav et al., 2004). The most conserved element of the NBM is the "A" motif, consisting of a Glycine (G) at codon position 129 followed by four amino acids of any type, followed by a Glycine (G) at position 134 and a Lysine (K) at position 135. The research showed that Arginine (R) or Serine (S) could be substituted for the Lysine with only a small reduction in the binding activity, so one formulation of the "A" motif might allow either of these substitutions.

    We could create this as a GLUE Variation within the HCV-GLUE project:

    reference REF_MASTER_NC_004102
      feature-location NS4B
        create variation NBM_A --vtype aminoAcidRegexPolymorphism --labeledCodon 129 135
        variation NBM_A
          set metatag REGEX_AA_PATTERN "G.{4}G[KRS]"
          exit
        exit
      exit

  2. This example is from hepatitis E virus (HEV) research. The ORF3 protein of HEV contains one or two Proline-Serine-Alanine-Proline (PSAP) amino acid motifs, which may play a role as a functional domain for virion release (Nagashima et al. 2011).

    We could create this as a Variation within the HEV-GLUE project, or in the example project which is based on HEV:

    reference REF_MASTER_M73218
      feature-location ORF3
        create variation PSAP -t aminoAcidSimplePolymorphism --labeledCodon 96 108
        variation PSAP
          set metatag SIMPLE_AA_PATTERN PSAP
          exit
        exit
      exit

    Here we will scan for the PSAP motif anywhere between locations 96 and 108 on our master reference M73218, so this will detect where sequences have two copies of the motif in this genome region. Note that the coordinates will be slightly different from those mentioned in Nagashima et al. as a different reference sequence was used.
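
The two patterns in these examples can be tried on toy data outside GLUE. The sketch below uses Python's `re` module, which agrees with Java regular expressions for these simple patterns. The helper `scan_aa_region` is hypothetical, the protein strings are invented placeholders rather than real NS4B or ORF3 sequences, and codon labels are assumed to be plain 1-based positions in the translated protein.

```python
import re


def scan_aa_region(protein, first_codon, last_codon, regex):
    """Return (firstCodon, lastCodon, residues) for each match inside the
    1-based inclusive codon range of the protein string."""
    region = protein[first_codon - 1:last_codon]
    return [
        (m.start() + first_codon, m.end() + first_codon - 1, m.group())
        for m in re.finditer(regex, region)
    ]


# Toy protein with the "A" motif placed at codons 129-135 (invented sequence).
toy_ns4b = "M" * 128 + "GSIGLGK" + "M" * 5
print(scan_aa_region(toy_ns4b, 129, 135, r"G.{4}G[KRS]"))
# [(129, 135, 'GSIGLGK')]

# Toy protein with PSAP at codons 96-99 and 105-108 (invented sequence).
toy_orf3 = "A" * 95 + "PSAP" + "GGGGG" + "PSAP"
print(scan_aa_region(toy_orf3, 96, 108, r"PSAP"))
# [(96, 99, 'PSAP'), (105, 108, 'PSAP')]
```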

Commands for Variation-scanning


There are three broad categories of data which we can scan for GLUE Variations. In each of these areas, there are GLUE commands and/or module types associated with Variation-scanning.

  1. Sequences stored in GLUE alignments
    The variation frequency command in alignment mode computes summary figures for the presence of many variations over a set of member sequences.
    The variation member scan command, also in alignment mode, scans a set of member sequences for a single specified variation, providing detailed results on any matches.
    The variation scan command in member mode scans a single alignment member for multiple variations and may also provide detailed match results.
  2. Consensus sequences stored in FASTA files
    The variation scan command in the fastaSequenceReporter module type will scan sequences in a FASTA file for multiple variations, providing detailed match results.
  3. Deep sequencing data stored in SAM/BAM files
    The variation scan command in the samReporter module type will scan individual reads in a SAM/BAM file for multiple variations, providing summaries of how many reads contained each Variation.

Scanning for a Variation within a piece of sequence data requires a homology to be established between the sequence data and the ReferenceSequence where the Variation is defined. In the case of sequences stored in GLUE alignments the homology is already in place; the other two cases are slightly more complicated as the homology must be computed as a preliminary step.

Here is an example usage of the variation member scan command. We ran this in the example project, after adding the PSAP motif Variation suggested above.

Mode path: /
GLUE> project example
OK
Mode path: /project/example
GLUE> alignment AL_3
OK
Mode path: /project/example/alignment/AL_3
GLUE> variation member scan -c -w "sequence.source.name = 'ncbi-refseqs' and referenceMember = false" -r REF_MASTER_M73218 -f ORF3 -v PSAP -t
+=============+=============+============+=============+=============+==========+============+==========+=============+============+=============+=============+
|alignmentName| sourceName  | sequenceID |firstRefCodon|lastRefCodon | queryAAs | refNtStart | refNtEnd |queryNtStart | queryNtEnd |  queryNts   |combinedTripl|
|             |             |            |             |             |          |            |          |             |            |             | etFraction  |
+=============+=============+============+=============+=============+==========+============+==========+=============+============+=============+=============+
|AL_3e        |ncbi-refseqs | AB248521   |96           |99           | PSAP     | 5391       | 5402     |5419         | 5430       |CCCTCGGCTCCT |1.00         |
|AL_3e        |ncbi-refseqs | AB248521   |105          |108          | PSAP     | 5418       | 5429     |5446         | 5457       |CCCAGCGCCCCC |1.00         |
|AL_3_AB290312|ncbi-refseqs | AB290312   |96           |99           | PSAP     | 5391       | 5402     |5416         | 5427       |CCCTCGGCTCCA |1.00         |
|AL_3_AB290312|ncbi-refseqs | AB290312   |105          |108          | PSAP     | 5418       | 5429     |5443         | 5454       |CCCAGCGCCCCT |1.00         |
|AL_3_AB290313|ncbi-refseqs | AB290313   |105          |108          | PSAP     | 5418       | 5429     |5443         | 5454       |CCCAGCGCCCCT |1.00         |
|AL_3f        |ncbi-refseqs | AB369687   |96           |99           | PSAP     | 5391       | 5402     |5399         | 5410       |CCCTCGGCTCCT |1.00         |
|AL_3f        |ncbi-refseqs | AB369687   |105          |108          | PSAP     | 5418       | 5429     |5426         | 5437       |CCCAGCGCCCCT |1.00         |
|AL_3_AB369689|ncbi-refseqs | AB369689   |105          |108          | PSAP     | 5418       | 5429     |5423         | 5434       |CCCAGCGCCCCT |1.00         |
|AL_3a        |ncbi-refseqs | AF082843   |105          |108          | PSAP     | 5418       | 5429     |5442         | 5453       |CCCAGCGCCCCT |1.00         |
|AL_3g        |ncbi-refseqs | AF455784   |96           |99           | PSAP     | 5391       | 5402     |5401         | 5412       |CCCTCGGCTCCT |1.00         |
|AL_3g        |ncbi-refseqs | AF455784   |105          |108          | PSAP     | 5418       | 5429     |5428         | 5439       |CCCAGCGCCCCT |1.00         |
|AL_3b        |ncbi-refseqs | AP003430   |105          |108          | PSAP     | 5418       | 5429     |5444         | 5455       |CCCAGCGCCCCA |1.00         |
|AL_3j        |ncbi-refseqs | AY115488   |105          |108          | PSAP     | 5418       | 5429     |5459         | 5470       |CCCAGCGCCCCC |1.00         |
|AL_3_EU360977|ncbi-refseqs | EU360977   |96           |99           | PSAP     | 5391       | 5402     |5423         | 5434       |CCCTCGGCTCCT |1.00         |
|AL_3_EU360977|ncbi-refseqs | EU360977   |105          |108          | PSAP     | 5418       | 5429     |5450         | 5461       |CCCAGCGCCCCC |1.00         |
|AL_3f        |ncbi-refseqs | EU723513   |96           |99           | PSAP     | 5391       | 5402     |5394         | 5405       |CCCTCGGCTCCT |1.00         |
|AL_3f        |ncbi-refseqs | EU723513   |105          |108          | PSAP     | 5418       | 5429     |5421         | 5432       |CCCAGCGCCCCC |1.00         |
|AL_3c        |ncbi-refseqs | FJ705359   |96           |99           | PSAP     | 5391       | 5402     |5416         | 5427       |CCCTCGGCTCCT |1.00         |
|AL_3c        |ncbi-refseqs | FJ705359   |105          |108          | PSAP     | 5418       | 5429     |5443         | 5454       |CCCAGCGCCCCT |1.00         |
|AL_3ra       |ncbi-refseqs | FJ906895   |105          |108          | PSAP     | 5418       | 5429     |5501         | 5512       |CCCAGCGCCCCC |1.00         |
|AL_3i        |ncbi-refseqs | FJ998008   |105          |108          | PSAP     | 5418       | 5429     |5418         | 5429       |CCCAGCGCCCCT |1.00         |
|AL_3ra       |ncbi-refseqs | JQ013791   |105          |108          | PSAP     | 5418       | 5429     |5481         | 5492       |CCCAGCGCCCCC |1.00         |
|AL_3h        |ncbi-refseqs | JQ013794   |96           |99           | PSAP     | 5391       | 5402     |5357         | 5368       |CCCTCGGCTCCA |1.00         |
|AL_3h        |ncbi-refseqs | JQ013794   |105          |108          | PSAP     | 5418       | 5429     |5384         | 5395       |CCCAGCGCCCCT |1.00         |
|AL_3_JQ953664|ncbi-refseqs | JQ953664   |105          |108          | PSAP     | 5418       | 5429     |5444         | 5455       |CCCAGCGCCCCC |1.00         |
|AL_3ra       |ncbi-refseqs | KJ013415   |105          |108          | PSAP     | 5418       | 5429     |5503         | 5514       |CCCAGCGCCCCC |1.00         |
+=============+=============+============+=============+=============+==========+============+==========+=============+============+=============+=============+

Mode path: /project/example/alignment/AL_3
GLUE>

In this case we have scanned all reference sequences in HEV genotype 3 or any of its subtypes. We find that all the sequences have the PSAP motif at location 105-108; some also contain the motif at location 96-99. For each match, the start / end locations on the member sequence are given, as well as the underlying nucleotides.
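
If the table output is saved as text, a short script can summarise which sequences carry the motif at each location. This is a minimal sketch that assumes the pipe-delimited layout shown above, with sequenceID, firstRefCodon and lastRefCodon in the third, fourth and fifth columns. The function name `motif_locations` is hypothetical, and the sample contains only three rows copied from the output above.

```python
from collections import defaultdict

SAMPLE = """
+=============+=============+============+=============+=============+==========+============+==========+=============+============+=============+=============+
|alignmentName| sourceName  | sequenceID |firstRefCodon|lastRefCodon | queryAAs | refNtStart | refNtEnd |queryNtStart | queryNtEnd |  queryNts   |combinedTripl|
|             |             |            |             |             |          |            |          |             |            |             | etFraction  |
+=============+=============+============+=============+=============+==========+============+==========+=============+============+=============+=============+
|AL_3e        |ncbi-refseqs | AB248521   |96           |99           | PSAP     | 5391       | 5402     |5419         | 5430       |CCCTCGGCTCCT |1.00         |
|AL_3e        |ncbi-refseqs | AB248521   |105          |108          | PSAP     | 5418       | 5429     |5446         | 5457       |CCCAGCGCCCCC |1.00         |
|AL_3_AB290313|ncbi-refseqs | AB290313   |105          |108          | PSAP     | 5418       | 5429     |5443         | 5454       |CCCAGCGCCCCT |1.00         |
"""


def motif_locations(table_text):
    """Map each sequenceID to its list of (firstRefCodon, lastRefCodon) hits."""
    hits = defaultdict(list)
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue  # separator or blank line
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 5 or not cells[3].isdigit():
            continue  # header rows have no numeric codon column
        hits[cells[2]].append((int(cells[3]), int(cells[4])))
    return dict(hits)


locations = motif_locations(SAMPLE)
print(locations)
# {'AB248521': [(96, 99), (105, 108)], 'AB290313': [(105, 108)]}
print([seq for seq, locs in locations.items() if len(locs) == 2])
# ['AB248521']
```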