Command Layer - giffordlabcvr/Hepadnaviridae-GLUE GitHub Wiki
The GLUE interactive command line interpreter is a powerful environment for interacting with the GLUE command layer at a fine-grained level. The design could be compared to interactive R or Python interpreters, or command line clients provided by databases such as MySQL.
The tutorial covers most of the key aspects of using the interpreter, interspersed with some examples of what can be achieved with it.
In this tutorial we will use the example GLUE project, so you should download and build this project if you have not already done so.
- Mode navigation
- Automatic command completion
- Example commands: Data query
- Command syntax
- Example commands: Sequence analysis
- Command history and other useful keystrokes
- Example commands: Exporting alignments
- Mode wrapping
- Console options
- Next steps
On starting the GLUE interpreter you will see something like this:
GLUE Version 0.1.144
Copyright (C) 2018 The University of Glasgow
This program comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under certain conditions. For details see
GNU Affero General Public License v3: http://www.gnu.org/licenses/
Mode path: /
GLUE>
Try entering the commands displayed below:
Mode path: /
GLUE> project example
OK
Mode path: /project/example
GLUE> feature MT
OK
Mode path: /project/example/feature/MT
GLUE> exit
OK
Mode path: /project/example
GLUE> exit
OK
Mode path: /
GLUE>
Notice that the GLUE interpreter outputs a mode path each time a command is typed. The project example command changed the mode path from "/" to "/project/example". Then the feature MT command changed the mode path from "/project/example" to "/project/example/feature/MT".
The GLUE interpreter provides a hierarchy of command modes, based on the GLUE data schema. A specific command mode operates on a specific object in the database. We started in root mode (path "/"), which has no associated data object. We then moved to the project mode for the example project, and within this we moved to the mode for the "MT" Feature object within the example project.
Different command modes allow different sets of commands to be entered. While some commands are non-mode-specific, others can only be executed in a specific mode. The set of available commands can be found in the online reference documentation, or from within the interpreter by using the help command, with no arguments, in a specific command mode.
The command mode hierarchy represents data object containment. The path associated with each mode specifies its position in the hierarchy. The exit command changes the current mode to the parent mode within the hierarchy. If you are at one of the modes inside project mode, you can return to project mode using the project-mode command. Similarly, you can return to root mode from anywhere in the hierarchy using the root-mode command.
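The navigation rules above can be pictured as operations on a path of (mode type, name) pairs. The following Python sketch is only a toy model of that idea, not GLUE's implementation; the class and method names are hypothetical and chosen to mirror the commands described in this section.

```python
class ModePath:
    """Toy model of a hierarchical mode path such as /project/example/feature/MT."""

    def __init__(self):
        # An empty list stands for root mode, whose path is "/".
        self.parts = []

    def enter(self, mode_type, name):
        # Mimics typing e.g. `project example` or `feature MT`.
        self.parts += [mode_type, name]

    def exit(self):
        # Move to the parent mode by dropping the last (type, name) pair.
        self.parts = self.parts[:-2]

    def project_mode(self):
        # Jump back to the enclosing project mode (the first pair).
        self.parts = self.parts[:2]

    def root_mode(self):
        # Jump back to root mode from anywhere.
        self.parts = []

    def __str__(self):
        return "/" + "/".join(self.parts)


path = ModePath()
path.enter("project", "example")
path.enter("feature", "MT")
print(path)  # /project/example/feature/MT
path.exit()
print(path)  # /project/example
```

The model reproduces the mode paths printed in the transcript above; real command modes additionally determine which commands are available, which this sketch does not attempt to capture.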
Navigate into "/project/example" mode, enter "list f" on the command line and press the <Tab> key:
Mode path: /
GLUE> project example
Mode path: /project/example
GLUE> list f
feature format
GLUE> list f
GLUE is suggesting that the next word after "list" in the command could be "feature" or "format". Add an "e" to the command, then press the <Tab> key again. GLUE will auto-complete the command so that it is "list feature".
- Auto-completion applies to both command keywords and to command options and arguments.
- If there are multiple options based on what is already typed, auto-completion will suggest these.
- Auto-completion will apply at the end or in the middle of the command line, wherever the cursor is located.
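The behaviour seen with "list f" and "list fe" is ordinary prefix matching. As a rough illustration only (this is not how GLUE is implemented, and the function name and the candidate list are assumptions for the example), prefix completion can be sketched in Python as:

```python
def complete(prefix, candidates):
    """Return (completed_word, suggestions) for a partially typed word.

    If exactly one candidate starts with the prefix, it is returned as the
    completion; otherwise the prefix is left unchanged and all matching
    candidates are offered as suggestions.
    """
    matches = sorted(c for c in candidates if c.startswith(prefix))
    if len(matches) == 1:
        return matches[0], []
    return prefix, matches


# Hypothetical set of words that may follow "list" in project mode.
next_words = ["feature", "format", "sequence"]

print(complete("f", next_words))   # ('f', ['feature', 'format'])
print(complete("fe", next_words))  # ('feature', [])
```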
Let's run a command to query some data from the example project. We will use the list sequence command in "/project/example" mode.
Mode path: /project/example
GLUE> list sequence
This command lists all data objects of type Sequence. The core schema documentation explains the structure of Sequence data in GLUE. The results of the command are displayed in an interactive table.
+====================+===================+
| source.name | sequenceID |
+====================+===================+
| fasta-hev-examples | IND-HEV-AVH1-1991 |
| fasta-hev-examples | IND-HEV-AVH2-1998 |
| fasta-hev-examples | IND-HEV-AVH3-2000 |
| fasta-hev-examples | IND-HEV-AVH4-2006 |
| fasta-hev-examples | IND-HEV-AVH5-2010 |
| fasta-hev-examples | IND-HEV-FHF1-2003 |
| fasta-hev-examples | IND-HEV-FHF2-2004 |
| fasta-hev-examples | IND-HEV-FHF3-2005 |
| fasta-hev-examples | IND-HEV-FHF4-2006 |
| fasta-hev-examples | IND-HEV-FHF5-2007 |
| ncbi-hev-examples | AB481226 |
| ncbi-hev-examples | AB591734 |
| ncbi-hev-examples | AF444003 |
| ncbi-hev-examples | FJ705359 |
| ncbi-hev-examples | FJ763142 |
| ncbi-hev-examples | FJ998015 |
| ncbi-hev-examples | JF443717 |
| ncbi-hev-examples | JQ013791 |
| ncbi-hev-examples | JX855794 |
+====================+===================+
Sequences 1 to 19 of 63 [F:first, L:last, P:prev, N:next, Q:quit]
You can page backwards and forwards through the table using the <F>, <L>, <P> and <N> keys, and return to the command line using <Q>.
The list sequence command and other "list" commands in GLUE are quite powerful.
- By default, values for the source name and sequenceID of each Sequence are returned; however, you can specify that list sequence returns other properties, including those accessed by traversing relationships.
- The Sequence objects for which data is returned can be filtered using a "where clause" option, based on a logical user-defined expression.
- The results can be sorted by a combination of properties, in ascending or descending order.
As an example let's run this variant of the list sequence command:
GLUE> list sequence -s gb_create_date -w "source.name = 'ncbi-hev-examples'" sequenceID host_species gb_create_date
+============+=======================+================+
| sequenceID | host_species | gb_create_date |
+============+=======================+================+
| AF444003 | - | 21-Dec-2001 |
| AB481226 | - | 18-Feb-2009 |
| FJ763142 | Homo sapiens | 21-Mar-2009 |
| FJ705359 | Sus scrofa | 07-Jun-2009 |
| FJ998015 | Sus scrofa | 12-Sep-2009 |
| AB591734 | Herpestes javanicus | 29-Mar-2011 |
| JQ013791 | Oryctolagus cuniculus | 10-Aug-2012 |
| JF443717 | Homo sapiens | 21-Oct-2012 |
| JX855794 | Sus scrofa | 09-Dec-2012 |
| KP294371 | Sus scrofa | 15-Jul-2016 |
+============+=======================+================+
- Only Sequence objects within the Source named "ncbi-hev-examples" are listed. This is specified using a "where clause": -w "source.name = 'ncbi-hev-examples'". Within the example project, this Source actually contains a set of 10 full-length Hepatitis E virus non-reference sequences from GenBank.
- The sequences are listed in ascending order of their creation date on GenBank, using -s gb_create_date.
- At the end of the command we added three properties: sequenceID, host_species and gb_create_date. This specifies the columns of the result table.
Many other GLUE commands use "where clause" filters and produce tabular results. One such command is the amino-acid frequency command. This command can be used to compute amino acid frequencies for particular genome locations based on a certain set of sequences within the alignment tree. Try this:
Mode path: /project/example
GLUE> alignment AL_MASTER
OK
Mode path: /project/example/alignment/AL_MASTER
GLUE> amino-acid frequency -c -w "sequence.host_species = 'Sus scrofa' and referenceMember = false" -r REF_MASTER_M73218 -f MT -l 60 65
+=======+===========+============+============+
| codon | aminoAcid | numMembers | pctMembers |
+=======+===========+============+============+
| 60 | E | 14 | 100.00 |
| 61 | V | 14 | 100.00 |
| 62 | L | 10 | 71.43 |
| 62 | F | 4 | 28.57 |
| 63 | W | 14 | 100.00 |
| 64 | N | 14 | 100.00 |
| 65 | H | 14 | 100.00 |
+=======+===========+============+============+
This shows the frequency of different amino acid residues at codon locations 60 to 65 (-l 60 65) within the Methyltransferase (-f MT) region of ORF1 as defined on the master reference sequence (-r REF_MASTER_M73218). Alignment members within different clades are considered because the command recursively (-c) visits all descendants of the root alignment (AL_MASTER). Alignment members are only considered if their sequence host species is pig (sequence.host_species = 'Sus scrofa'). Reference members, which only exist to satisfy the alignment tree invariant, are excluded (referenceMember = false).
For location 62 there is some variation whereas at the other locations the amino acid residues are fixed for this group.
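The pctMembers column can be checked by hand. The short Python sketch below recomputes the percentages for codon 62 from the numMembers values in the table, assuming that the denominator is the total number of members counted at that codon (14 here); the function name is illustrative.

```python
def pct_members(counts):
    """Map each amino acid to its percentage of members, to 2 decimal places."""
    total = sum(counts.values())
    return {aa: f"{100 * n / total:.2f}" for aa, n in counts.items()}


# numMembers values for codon 62, taken from the table above.
codon_62 = {"L": 10, "F": 4}
print(pct_members(codon_62))  # {'L': '71.43', 'F': '28.57'}
```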
For more information on how "where clause" filters are constructed consult the guide to querying the GLUE database.
It is possible to save the results from a GLUE command to a tabular text file, so that we can put them in a spreadsheet or use them in another program. One quick way to do this is to run two console commands immediately before running a command such as list sequence:
GLUE> console set cmd-output-file-format tab
OK
Mode path: /project/example
GLUE> console set next-cmd-output-file sequences.txt
OK
Mode path: /project/example
GLUE> list sequence -s gb_create_date -w "source.name = 'ncbi-hev-examples'" sequenceID host_species gb_create_date
The list sequence command will operate as normal, but with the side effect that the results will be written to a tab-delimited file, sequences.txt.
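A tab-delimited results file can then be read by other programs. The Python sketch below shows one way to load such a file with the standard csv module. It assumes that the first line of the file holds the column names, which is not stated above, and it uses a small in-memory sample (two rows copied from the earlier table) in place of the real sequences.txt.

```python
import csv
import io


def parse_tab_table(handle):
    """Read tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for sequences.txt; the header row is an assumption.
sample = (
    "sequenceID\thost_species\tgb_create_date\n"
    "FJ763142\tHomo sapiens\t21-Mar-2009\n"
    "FJ705359\tSus scrofa\t07-Jun-2009\n"
)

rows = parse_tab_table(io.StringIO(sample))
for row in rows:
    print(row["sequenceID"], row["host_species"])

# For a real file, pass an open handle instead:
#   with open("sequences.txt", newline="") as handle:
#       rows = parse_tab_table(handle)
```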
Each GLUE command has a syntax structure which is documented in the help system. You can look up the syntax in the online reference documentation, or use the help command from within the interpreter.
Below, documentation is retrieved for the multi-set field command.
Mode path: /project/example
GLUE> help multi-set field
multi-set field: Set a field value for one or more configurable table objects
Usage: multi-set field <tableName> (-w <whereClause> | -a) <fieldName> <fieldValue> [-b <batchSize>]
Options:
-w <whereClause>, --whereClause <whereClause> Qualify updated objects
-a, --allObjects Update all objects in table
-b <batchSize>, --batchSize <batchSize> Update batch size
The usage line specifies the command syntax:
multi-set field <tableName> (-w <whereClause> | -a) <fieldName> <fieldValue> [-b <batchSize>]
- The initial part of any command consists of 1-3 (possibly hyphenated) keywords which identify the command. In this case the keywords are multi-set field.
- Commands may also have arguments, which are values supplied by the user; these are indicated in the syntax using angle brackets < >. In this case, for example, the command takes a <tableName> argument.
- Similar to arguments, commands may take options. Options have a short form using a single hyphen, e.g. -a, and an alternate long form using a double hyphen, e.g. --allObjects. Some options themselves require an argument. For example, the -b option above requires an integer <batchSize> argument.
- Arguments and options may be mandatory or optional. In the command syntax, optional elements are indicated using square brackets [ ]. For example, [-b <batchSize>] indicates that this option is optional.
- Sometimes alternative options may be supplied; this is indicated in the command syntax using the pipe character |. For example, (-w <whereClause> | -a) indicates that either the -w or the -a option may be used, but not both.
- Some options or arguments may be repeated. This is indicated by ... in the command syntax. For example, in the list sequence command syntax, [<fieldName> ...] indicates that multiple values may be supplied for this argument.
- Quoting: Sometimes we would like to supply an argument value string which contains reserved characters such as space. In this case double or single quotation marks can be used to enclose the string. The \ character is used to escape quotation marks within the string where necessary. If double quotation marks enclose the string, single quotation marks within the string do not need to be escaped (and vice versa).
Some examples:
list sequence --whereClause "source.name = 'ncbi-hev-examples'"
list sequence --whereClause "source.name = \"ncbi-hev-examples\""
list sequence --whereClause 'source.name = "ncbi-hev-examples"'
list sequence --whereClause 'source.name = \'ncbi-hev-examples\''
We can use GLUE to analyse sequences in FASTA files. The example project zip contains a file, sequence.fasta, containing a single HEV sequence. We will analyse this sequence using a couple of modules which have been defined in the example project. Modules provide extra commands within module command modes. First let's navigate from project mode to the command mode for module "exampleMaxLikelihoodGenotyper", a module of type maxLikelihoodGenotyper.
Mode path: /project/example
GLUE> module exampleMaxLikelihoodGenotyper
OK
Mode path: /project/example/module/exampleMaxLikelihoodGenotyper
GLUE>
We can use the genotype file command to quickly identify the genotype and subtype of the sequence.
Mode path: /project/example/module/exampleMaxLikelihoodGenotyper
GLUE> genotype file -f sequence.fasta
+===========+====================+===================+
| queryName | genotypeFinalClade | subtypeFinalClade |
+===========+====================+===================+
| sequence1 | AL_4 | AL_4b |
+===========+====================+===================+
Now let's exit the current module mode and enter the command mode for module "exampleSequenceReporter", a module of type fastaSequenceReporter.
Mode path: /project/example/module/exampleMaxLikelihoodGenotyper
GLUE> exit
OK
Mode path: /project/example/
GLUE> module exampleSequenceReporter
OK
Mode path: /project/example/module/exampleSequenceReporter
GLUE>
We can translate the nucleotides within the example sequence to amino acids using the amino-acid command within this module.
This command determines the reading frame by aligning the sequence with a "target" ReferenceSequence. We know from the previous command that the sequence is of subtype 4b. So we will supply the ReferenceSequence for subtype 4b as the target as this should generate a good alignment.
GLUE> amino-acid --fileName sequence.fasta --acRefName REF_MASTER_M73218 --featureName ORF1 --targetRefName REF_4b_DQ279091
+===========+=========+=========+==========+===========+===========+===========+
|codonLabel | queryNt | acRefNt | codonNts | aminoAcid |definiteAas|possibleAas|
+===========+=========+=========+==========+===========+===========+===========+
|1 | 26 | 28 | ATG | M |M |M |
|2 | 29 | 31 | GAG | E |E |E |
|3 | 32 | 34 | GCC | A |A |A |
|4 | 35 | 37 | CAT | H |H |H |
|5 | 38 | 40 | CAG | Q |Q |Q |
|6 | 41 | 43 | TTC | F |F |F |
|7 | 44 | 46 | ATA | I |I |I |
|8 | 47 | 49 | AAG | K |K |K |
|9 | 50 | 52 | GCT | A |A |A |
|10 | 53 | 55 | CCT | P |P |P |
|11 | 56 | 58 | GGC | G |G |G |
|12 | 59 | 61 | GTT | V |V |V |
|13 | 62 | 64 | ACT | T |T |T |
|14 | 65 | 67 | ACT | T |T |T |
|15 | 68 | 70 | GCT | A |A |A |
|16 | 71 | 73 | ATT | I |I |I |
|17 | 74 | 76 | GAC | D |D |D |
|18 | 77 | 79 | CAG | Q |Q |Q |
|19 | 80 | 82 | GCT | A |A |A |
+===========+=========+=========+==========+===========+===========+===========+
Rows 1 to 19 of 1694 [F:first, L:last, P:prev, N:next, Q:quit]
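The codonNts and aminoAcid columns are related by the standard genetic code. As a cross-check of the first ten rows, the Python sketch below builds the standard codon table and translates the codons shown. It is only an illustration: it assumes the input is already in frame and contains no ambiguity codes, whereas GLUE determines the reading frame by alignment to the target reference and also reports definite and possible amino acids.

```python
# Standard genetic code (NCBI translation table 1), laid out in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nucleotides):
    """Translate an in-frame nucleotide string; unknown codons become 'X'."""
    nts = nucleotides.upper()
    usable = len(nts) - len(nts) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(nts[i:i + 3], "X") for i in range(0, usable, 3))


# codonNts for codons 1 to 10, copied from the table above.
codons = ["ATG", "GAG", "GCC", "CAT", "CAG", "TTC", "ATA", "AAG", "GCT", "CCT"]
print(translate("".join(codons)))  # MEAHQFIKAP
```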
Even with auto-completion, GLUE commands can be long and complex to manage. Similar to the Unix Bash command line, the GLUE interpreter stores a history of previously-typed commands so that you can alter and re-use old commands. The history persists between GLUE interpreter sessions.
Use <Up> and <Down> to scroll through the command history.
You can also search for a previous command containing a specific string: press <Ctrl+R>; the (reverse-i-search): prompt will appear. Start typing the string and GLUE will scroll to a recent command which matches. Keep pressing <Ctrl+R> and older matches will be found.
You can also use the following keystrokes to edit commands more quickly.
Keystroke | Function |
---|---|
<Ctrl+A> | Move cursor to start of line |
<Ctrl+E> | Move cursor to end of line |
<Alt+Left> | Move cursor backwards by one word |
<Alt+Right> | Move cursor forwards by one word |
<Ctrl+W> | Cut previous word to clipboard |
<Ctrl+U> | Cut all previous words to clipboard |
<Ctrl+K> | Cut all following words to clipboard |
<Ctrl+Y> | Paste from clipboard into command line |
The example project stores 10 example sequences within a structure called an alignment tree. This structure links sequence data based on nucleotide homologies and evolutionary relationships.
GLUE allows us to export data from the alignment tree as nucleotide or protein alignments in FASTA format. First let's navigate from project mode to the command mode for module "exampleFastaAlignmentExporter", a module of type fastaAlignmentExporter.
Mode path: /project/example
GLUE> module exampleFastaAlignmentExporter
OK
Mode path: /project/example/module/exampleFastaAlignmentExporter
GLUE>
The following export command will create a FASTA file, alignment1.fna, containing a nucleotide alignment of the 10 example sequences, constrained to the master reference sequence (restricting the set of nucleotide columns).
Mode path: /project/example/module/exampleFastaAlignmentExporter
GLUE> export AL_MASTER --recursive --whereClause "sequence.source.name = 'ncbi-hev-examples'" --fileName alignment1.fna
OK
We can also export a protein alignment. Let's exit the current module mode and enter the command mode for module "exampleFastaProteinAlignmentExporter", a module of type fastaProteinAlignmentExporter.
Mode path: /project/example/module/exampleFastaAlignmentExporter
GLUE> exit
OK
Mode path: /project/example/
GLUE> module exampleFastaProteinAlignmentExporter
OK
Mode path: /project/example/module/exampleFastaProteinAlignmentExporter
GLUE>
The following export command will create a FASTA file, alignment2.faa, containing a protein alignment of the ORF1 region from the 10 example sequences, again constrained to the master reference sequence.
Mode path: /project/example/module/exampleFastaProteinAlignmentExporter
GLUE> export AL_MASTER -c -r REF_MASTER_M73218 -f ORF1 -w "sequence.source.name = 'ncbi-hev-examples'" -o alignment2.faa
OK
Some research questions focus on a small region of the viral genome. The following export command will create a protein alignment of the ORF1 region, but in this case only the region between codons 135 and 155 inclusive. We also use the --preview option, which means the alignment is "previewed" in the interpreter rather than saved to a file.
Mode path: /project/example/module/exampleFastaProteinAlignmentExporter
GLUE> export AL_MASTER -c -r REF_MASTER_M73218 -f ORF1 -l 135 155 -w "sequence.source.name = 'ncbi-hev-examples'" --preview
>AL_3e.ncbi-hev-examples.AB481226
LRGLPPVDRTYCFDGFSRCTF
>AL_3a.ncbi-hev-examples.AB591734
LRGLPPADRTYCFDGFSRCAF
>AL_1b.ncbi-hev-examples.AF444003
LRGLPAADRTYCFDGFSGCNF
>AL_3c.ncbi-hev-examples.FJ705359
LRGLPPVDRTYCFDGFSRCSF
>AL_4a.ncbi-hev-examples.FJ763142
LRGLPPVDRTYCFDGFSGCTF
>AL_3e.ncbi-hev-examples.FJ998015
LRGLPPVDRTYCFDGFSCCAF
>AL_1c.ncbi-hev-examples.JF443717
LRGLSAADRTYCFDGFSGCNF
>AL_3ra.ncbi-hev-examples.JQ013791
LRGLPPVDRTYCFDGFARCAF
>AL_4b.ncbi-hev-examples.JX855794
LRGLPPADRTYCFDGFSGCTF
>AL_3i.ncbi-hev-examples.KP294371
LRGLPPVDRSYCFDGFSRCAF
The default FASTA ID generated for each row of these FASTA alignments is a dotted string which identifies the AlignmentMember from which the row was generated: the alignment name, the source name and the sequence ID.
This can be customised in the module's stored configuration document. One way to update this document is the set property command. In this case we will specify that the FASTA ID consists of the sequence ID, host species, and subtype name.
Mode path: /project/example/module/exampleFastaProteinAlignmentExporter
GLUE> set property idTemplate "${sequence.sequenceID}/${sequence.renderProperty('host_species')}/${alignment.displayName}"
OK
Mode path: /project/example/module/exampleFastaProteinAlignmentExporter
GLUE> export AL_MASTER -c -r REF_MASTER_M73218 -f ORF1 -l 135 155 -w "sequence.source.name = 'ncbi-hev-examples'" --preview
>AB481226/-/Subtype 3e
LRGLPPVDRTYCFDGFSRCTF
>AB591734/Herpestes javanicus/Subtype 3a
LRGLPPADRTYCFDGFSRCAF
>AF444003/-/Subtype 1b
LRGLPAADRTYCFDGFSGCNF
>FJ705359/Sus scrofa/Subtype 3c
LRGLPPVDRTYCFDGFSRCSF
>FJ763142/Homo sapiens/Subtype 4a
LRGLPPVDRTYCFDGFSGCTF
>FJ998015/Sus scrofa/Subtype 3e
LRGLPPVDRTYCFDGFSCCAF
>JF443717/Homo sapiens/Subtype 1c
LRGLSAADRTYCFDGFSGCNF
>JQ013791/Oryctolagus cuniculus/Subtype 3ra
LRGLPPVDRTYCFDGFARCAF
>JX855794/Sus scrofa/Subtype 4b
LRGLPPADRTYCFDGFSGCTF
>KP294371/Sus scrofa/Subtype 3i
LRGLPPVDRSYCFDGFSRCAF
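When exported alignments are processed outside GLUE, the default dotted FASTA IDs shown earlier can be split back into alignment name, source name and sequence ID. The Python sketch below is a minimal example with hypothetical function names. It assumes that neither the alignment name nor the source name contains a dot; it is not applicable once the ID template has been customised as above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_default_id(fasta_id):
    """Split 'alignment.source.sequenceID' into its three parts."""
    alignment, source, sequence_id = fasta_id.split(".", 2)
    return alignment, source, sequence_id


# Two rows copied from the default-ID preview output earlier in this section.
preview = """>AL_3e.ncbi-hev-examples.AB481226
LRGLPPVDRTYCFDGFSRCTF
>AL_4b.ncbi-hev-examples.JX855794
LRGLPPADRTYCFDGFSGCTF
"""

for header, sequence in parse_fasta(preview):
    print(split_default_id(header), len(sequence))
```

Each preview row covers codons 135 to 155 inclusive, so every sequence in this sample is 21 residues long.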
Navigation between command modes can become cumbersome, especially if you want to only execute a single command within the mode before exiting that mode, as in this example:
Mode path: /project/example
GLUE> reference REF_MASTER_M73218
OK
Mode path: /project/example/reference/REF_MASTER_M73218
GLUE> list feature-location
+==============+
| feature.name |
+==============+
| ORF3 |
| ORF2 |
| ORF1 |
| Y |
| X |
| RdRp |
| PPR |
| PCP |
| MT |
| Hel |
+==============+
FeatureLocations found: 10
Mode path: /project/example/reference/REF_MASTER_M73218
GLUE> exit
OK
Mode path: /project/example
To streamline this, GLUE allows mode wrapping. The command which you would have used to enter the command mode (in this case reference REF_MASTER_M73218) can be prepended to the command which you want to execute within the mode (list feature-location). This creates a single mode-wrapped command, executed from the outer mode, which returns the same result:
Mode path: /project/example
GLUE> reference REF_MASTER_M73218 list feature-location
+==============+
| feature.name |
+==============+
| ORF3 |
| ORF2 |
| ORF1 |
| Y |
| X |
| RdRp |
| PPR |
| PCP |
| MT |
| Hel |
+==============+
FeatureLocations found: 10
Mode path: /project/example
Mode wrapping can be nested; for example, you could run the above command from root mode as follows:
Mode path: /
GLUE> project example reference REF_MASTER_M73218 list feature-location
The GLUE interpreter itself has a number of settings which you may find useful to modify. These are grouped under the general heading of console options. Some examples of the most useful console options are given in the table below.
Console option | Function |
---|---|
log-level | GLUE outputs logging messages to the console during certain operations. This option configures the level of detail: INFO (default) implies minimal detail whereas FINEST gives maximum detail. |
load-save-path | Absolute path for loading/saving file data. All commands which load or save data will use this file path as a basis (unless they are passed an absolute path). |
cmd-result-format | GLUE command results can be rendered in different formats such as tab-delimited, JSON or XML. This configures the format which is rendered on the console. |
verbose-error | Can be set to true or false. If true the full Java stack trace is shown on the console when there is an error. Useful for debugging the GLUE engine. |
table-result-float-precision | Sets the number of decimal places used for floating point numbers in command output tables. |
Console options can be set using the console set command and queried using the console show command. The console add option-line command can be used to monitor console options.
Console option settings are not stored in the database. However, users can set their preferences in a file named .gluerc, which is stored in the user's home directory. The GLUE interpreter will read this file when it starts up and apply any console settings it contains.
An example .gluerc file is shown below:
console set log-level FINEST
console add option-line load-save-path
console set table-result-float-precision 2
You can now follow the step-by-step guide to building your own GLUE project.