Glossifier service API - NCIOCPL/cgov-digital-platform GitHub Wiki
Description
The glossifier service scans an HTML fragment and identifies portions of that fragment which match entries in the NCI Dictionary of Cancer Terms. The service, which is available at /pdq/api/glossifier, takes three parameters, each of which is required, though the dictionaries and languages parameters can be empty arrays. The parameters and the returned result are encoded as JSON.
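As a rough illustration of the three required parameters, the following Python sketch serializes a request body. The helper name build_request is hypothetical, and how the body is delivered to /pdq/api/glossifier (HTTP method, headers) is not stated on this page, so the sketch stops at producing the JSON text.

```python
import json
from typing import Sequence


def build_request(
    fragment: str,
    dictionaries: Sequence[str] = (),
    languages: Sequence[str] = (),
) -> str:
    """Serialize the three required glossifier parameters as JSON.

    Hypothetical helper: all three keys are always emitted, because the
    parameters are required even when the arrays are empty.
    """
    payload = {
        "fragment": fragment,
        "dictionaries": list(dictionaries),
        "languages": list(languages),
    }
    # json.dumps escapes non-ASCII characters by default (U+2019 -> \u2019).
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_request("<p>mama</p>", languages=["es"]))
```

Passing no dictionaries or languages yields empty arrays, which asks the service not to filter on that criterion.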
Parameters
-
fragment
Possibly empty string containing an HTML fragment, which is not necessarily well-formed XML. The service scans the input string for matches with term names which represent glossary concepts in PDQ. Matches are made without regard to case or differences in whitespace embedded within the term name. In addition, the Unicode character RIGHT SINGLE QUOTATION MARK (U+2019) is mapped to the APOSTROPHE character (U+0027). Matches are made to the longest possible sequence of characters in the input string, scanning from the beginning of the string toward the end, and substrings of a longer matched string are not reported. For example, in the input fragment "breast cancer" a match with the term name "breast cancer" is reported, but no match is reported for "breast" or for "cancer", even though the latter two words are also found in the PDQ glossary.
Even though the fragment is not necessarily well-formed XML, the service attempts to identify and mask portions of the input which are likely to match the patterns for certain HTML markup. This masking is done in three passes, and any portion of the input fragment which is masked by an earlier pass is ignored by subsequent masking passes when looking for these markup constructs. The first pass masks out all substrings which look like HTML comments (everything beginning with "<!--" through the next occurrence of "-->"). The second pass masks out substrings which match the pattern for HTML anchor elements, including text or mixed content (that is, everything between "<a\s+", where "\s+" represents a sequence of one or more whitespace characters, and the next occurrence of "</a>"); this pass also masks out everything between "{{" and the next occurrence of "}}" (inclusive). The third pass masks out substrings which match the pattern for HTML tags (but not the text content of the elements delimited by those tags), looking for everything not already masked out which begins with the character "<" through the next occurrence of the ">" character.
Although this approach risks masking out too much in the event of an unescaped "<" appearing in the middle of text content, that risk is balanced by the decreased likelihood of presenting attribute values inside element tags for glossary markup.
-
dictionaries
Possibly empty array of strings identifying dictionaries to which PDQ glossary terms are assigned. If the array of dictionary strings is empty, then matched term names are reported without regard to which dictionaries have been associated with the terms, even for terms which have not been associated with any dictionary at all. If the array of dictionary strings is not empty, then only those terms which have been assigned to at least one of the dictionaries named in the array will be reported. Currently, the only dictionary which is associated with the PDQ glossary terms is "Cancer.gov".
-
languages
Possibly empty array of strings identifying languages for which term names should be reported. If the array of language strings is empty, then matched term names are reported without regard to language of the term. If the array of language strings is not empty, then only term names in the language(s) specified will be reported. ISO two-character standard language codes are used as the values in the array. Currently the only two languages supported are English ("en") and Spanish ("es").
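The matching and masking rules described for the fragment parameter can be approximated in Python as follows. This is a sketch, not the service's implementation: the names mask_markup, normalize and FILLER are hypothetical, and the comment delimiters "<!--" and "-->" and the anchor terminator "</a>" are assumed from standard HTML syntax. Masked characters are overwritten with a filler of the same length, so offsets into the original fragment remain valid.

```python
import re

FILLER = "\x00"  # placeholder character; keeps every offset unchanged

# One tuple of patterns per masking pass (assumed regular expressions).
PASSES = [
    # Pass 1: HTML comments.
    (re.compile(r"<!--.*?-->", re.DOTALL),),
    # Pass 2: anchor elements with their content, then {{ ... }} blocks.
    (
        re.compile(r"<a\s+.*?</a>", re.DOTALL),
        re.compile(r"\{\{.*?\}\}", re.DOTALL),
    ),
    # Pass 3: any remaining tag, from "<" through the next ">".
    (re.compile(r"<.*?>", re.DOTALL),),
]


def mask_markup(fragment: str) -> str:
    """Return a copy of fragment with markup replaced by filler characters."""
    masked = fragment
    for patterns in PASSES:
        for pattern in patterns:
            # Earlier passes leave only filler behind, so later passes
            # cannot start a match inside an already masked region.
            masked = pattern.sub(lambda m: FILLER * len(m.group(0)), masked)
    return masked


def normalize(text: str) -> str:
    """Fold case, collapse whitespace runs, and map U+2019 to U+0027."""
    text = text.replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


if __name__ == "__main__":
    sample = '<p>mama <a href="x">breast cancer</a> {{mama}}</p>'
    print(mask_markup(sample).replace(FILLER, "."))
```

Because normalize changes the length of the text, a real matcher would need to keep track of original offsets separately; the sketch only illustrates the comparison rules.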
Response
A possibly empty array of objects, each of which represents a single match of a segment of the input fragment with a PDQ glossary term name. Only names which are published on Cancer.gov will be included in the array. Matching names in a given language for which the PDQ glossary does not have a published definition in that language will not be included in the array. The following members are defined for the term elements of the array:
-
start
Required integer representing the offset of the beginning of the portion of the input fragment where the match with the glossary term name was found. The offset 0 (zero) represents the beginning of the input fragment. Positions are calculated by counting Unicode characters in the input fragment, not bytes in a serialized representation of the fragment.
-
length
Required integer representing the number of Unicode characters (not bytes in a serialized representation of the fragment) which matched the name of the term.
-
doc_id
Required string containing the PDQ identifier for the Glossary document, in the form CDR\d{10}.
-
dictionary
Optional string identifying a dictionary in which the term is reported. If more than one dictionary is reported for a term, a separate entry will be included in the array for each dictionary reported.
-
language
Required string identifying the language of the term. If the same string is used in more than one language for a term, and the client requests matches in those languages, then separate term elements will be included in the array for each such language.
-
first_occurrence
Required boolean flag indicating whether this match represents the first time the term name was matched in the input fragment.
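One way a client might use these members is to wrap each matched span in markup. The Python sketch below assumes a hypothetical annotate helper and a made-up span element with a data-doc-id attribute; neither is part of the service. It works from the end of the fragment backwards so that earlier offsets stay valid, and it collapses entries that share a span, since the same match can be reported once per dictionary or language. Python strings are indexed by Unicode code point, which is assumed here to correspond to the "Unicode characters" counted by start and length.

```python
from typing import Any, Dict, Iterable, Mapping, Tuple


def annotate(fragment: str, matches: Iterable[Mapping[str, Any]]) -> str:
    """Wrap every matched span of fragment in a (hypothetical) span element."""
    spans: Dict[Tuple[int, int], str] = {}
    for match in matches:
        # int() tolerates numbers serialized either as integers or as strings.
        key = (int(match["start"]), int(match["length"]))
        # Keep one entry per span even if several dictionaries or languages
        # report it.
        spans.setdefault(key, str(match["doc_id"]))

    result = fragment
    # Process from the last span to the first so earlier offsets stay valid.
    for (start, length), doc_id in sorted(spans.items(), reverse=True):
        end = start + length
        result = (
            f'{result[:start]}<span data-doc-id="{doc_id}">'
            f"{result[start:end]}</span>{result[end:]}"
        )
    return result


if __name__ == "__main__":
    demo = [
        {"start": 3, "length": 4, "doc_id": "CDR0000304766"},
        {"start": 12, "length": 4, "doc_id": "CDR0000304766"},
    ]
    print(annotate("<p>mama and mama</p>", demo))
```

The sketch relies on the documented behaviour that substrings of a longer match are not reported, so distinct spans are assumed not to overlap.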
Example
See https://github.com/NCIOCPL/cgov-digital-platform/issues/917 for a more substantial request/response example pair (the wiki won't let me attach files, it seems).
Request
{
  "fragment": "<p>mama Gerota\u2019s capsule breast cancer and mama</p>",
  "languages": [
    "es"
  ],
  "dictionaries": [
    "Cancer.gov"
  ]
}
Response
[
  {
    "start": "3",
    "length": "4",
    "doc_id": "CDR0000304766",
    "dictionary": "Cancer.gov",
    "language": "es",
    "first_occurrence": true
  },
  {
    "start": "43",
    "length": "4",
    "doc_id": "CDR0000304766",
    "dictionary": "Cancer.gov",
    "language": "es",
    "first_occurrence": false
  }
]
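To see how the offsets in this example map back onto the request, the Python sketch below decodes both documents and slices the fragment at each reported position. The function name matched_texts is hypothetical. The example response carries start and length as strings, so the sketch converts them with int(), which also works if they arrive as JSON numbers.

```python
import json
from typing import List, Tuple

# Raw string so that \u2019 reaches the JSON decoder as an escape sequence.
REQUEST = r"""
{
  "fragment": "<p>mama Gerota\u2019s capsule breast cancer and mama</p>",
  "languages": ["es"],
  "dictionaries": ["Cancer.gov"]
}
"""

RESPONSE = """
[
  {"start": "3", "length": "4", "doc_id": "CDR0000304766",
   "dictionary": "Cancer.gov", "language": "es", "first_occurrence": true},
  {"start": "43", "length": "4", "doc_id": "CDR0000304766",
   "dictionary": "Cancer.gov", "language": "es", "first_occurrence": false}
]
"""


def matched_texts(request_json: str, response_json: str) -> List[Tuple[str, bool]]:
    """Pair each matched substring with its first_occurrence flag."""
    fragment = json.loads(request_json)["fragment"]
    pairs = []
    for match in json.loads(response_json):
        start = int(match["start"])
        length = int(match["length"])
        # Python slices by code point, so U+2019 counts as one character.
        pairs.append((fragment[start:start + length], match["first_occurrence"]))
    return pairs


if __name__ == "__main__":
    print(matched_texts(REQUEST, RESPONSE))
    # [('mama', True), ('mama', False)]
```

Only the Spanish term "mama" is reported: the request restricts languages to "es", so the English names in the fragment produce no entries.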