APIv1 - Georgetown-IR-Lab/HealthSurveillanceFramework GitHub Wiki
The framework consists of three components. Each component's inputs and outputs are listed in the components section; the input and output files' formats are listed in the file formats section.
Example: component/example/concept_extraction_naive.py
Input:
- DocumentFiles
- Thesaurus (optional)
Output:
- ConceptPairs: one (concept, document id) pair for each concept detected
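As a rough illustration of this component's contract (not the code in component/example/concept_extraction_naive.py), the sketch below matches thesaurus phrases against tokenized document text. The function name `extract_concepts` is hypothetical, and three behaviors are assumptions: underscores in thesaurus phrases stand for spaces, matching is case-insensitive, and each concept is reported at most once per document.

```python
from collections import defaultdict


def extract_concepts(documents, thesaurus):
    """Return {concept_id: [document_id, ...]} using whole-token phrase matching.

    documents: {document_id: space-separated tokenized text}
    thesaurus: {concept_id: [phrase, ...]}
    """
    concept_docs = defaultdict(list)
    for doc_id, text in documents.items():
        # Pad with spaces so a phrase only matches on token boundaries.
        padded = " " + text.lower() + " "
        for concept_id, phrases in thesaurus.items():
            for phrase in phrases:
                # Assumption: "_" in a thesaurus phrase stands for a space.
                needle = " " + phrase.replace("_", " ").lower() + " "
                if needle in padded:
                    concept_docs[concept_id].append(doc_id)
                    break  # assumption: one pair per (concept, document)
    return dict(concept_docs)
```

For instance, with the hair loss thesaurus entry from the file formats section, a document whose text is "my hair loss is getting worse" would be paired with concept "26", while "balding" alone would not match "bald".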
Example: component/example/concept_aggregation_naive.py
Input:
- ConceptPairs: (Concept, document id) pairs (i.e., the concept extraction method's output)
- Method-specific data sources (optional): a resource needed by the method, such as Wikipedia or a thesaurus
Output:
- ConceptPairs: one (Higher-level concept, document id) pair for each concept in the input
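One possible shape for an aggregation method is sketched below. It is not the repository's concept_aggregation_naive.py; the function `aggregate_concepts` and the `parent_of` lookup (a stand-in for the method-specific data source) are hypothetical names chosen for illustration.

```python
def aggregate_concepts(concept_docs, parent_of):
    """Map each concept to a higher-level concept and merge the document lists.

    concept_docs: {concept_id: [document_id, ...]}
    parent_of: hypothetical {concept_id: higher_level_concept_id} lookup
    """
    aggregated = {}
    for concept_id, doc_ids in concept_docs.items():
        # Assumption: a concept with no known parent is kept unchanged.
        parent = parent_of.get(concept_id, concept_id)
        merged = aggregated.setdefault(parent, [])
        for doc_id in doc_ids:
            # Avoid listing a document twice under the same parent concept.
            if doc_id not in merged:
                merged.append(doc_id)
    return aggregated
```

The output keeps the same ConceptPairs structure as the input, so it can be passed directly to a trend detection method.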
Example: component/example/trend_detection_naive.py
Input:
- Document metadata
- ConceptPairs: (Higher-level concept, document id) pairs (i.e., the concept aggregation method's output)
Output:
- TrendTimes: the times at which each concept is trending
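To make the inputs and output concrete, here is a minimal sketch of a count-based detector. It is not the repository's trend_detection_naive.py. The function name `detect_trends`, the fixed one-day window, the `min_count` threshold, and the use of the document count as the strength are all assumptions made for illustration.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def detect_trends(concept_docs, doc_meta, min_count=2, window=SECONDS_PER_DAY):
    """Return {concept_id: [{"start", "end", "strength"}, ...]}.

    concept_docs: {concept_id: [document_id, ...]}
    doc_meta: {document_id: {"created_at": <Unix time string>, "author": ...}}
    """
    trend_times = {}
    for concept_id, doc_ids in concept_docs.items():
        # Count this concept's documents in each fixed-size time window.
        counts = Counter(
            int(float(doc_meta[doc_id]["created_at"])) // window
            for doc_id in doc_ids
            if doc_id in doc_meta
        )
        # Assumption: a window is "trending" when it holds at least min_count
        # documents, and the strength is simply that count.
        trend_times[concept_id] = [
            {
                "start": str(bucket * window),
                "end": str((bucket + 1) * window),
                "strength": count,
            }
            for bucket, count in sorted(counts.items())
            if count >= min_count
        ]
    return trend_times
```

A real method would likely compare counts against a baseline rather than a fixed threshold; the fixed threshold only keeps the sketch short.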
All files are in JSON format.
[ "document_file_1", ..., "document_file_n" ]
DocumentFiles: a list of JSON document files. Each document file may contain many documents, but each file should be small enough to load into memory.
Example: data/example/docfiles.json
{ "document_id_1": <document 1's text>, "document_id_2": <document 2's text>, ... }
Documents: each document's text is a single string of space-separated tokens (e.g., "I have the flu").
Example: data/example/docs.json
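The two formats above can be read together as follows. This is a sketch under one assumption: the entries in the DocumentFiles list are paths that can be opened as written (for example, relative to the working directory). The helper name `iter_documents` is hypothetical.

```python
import json


def iter_documents(docfiles_path):
    """Yield (document_id, text) pairs, loading one document file at a time."""
    with open(docfiles_path, encoding="utf-8") as f:
        document_files = json.load(f)  # a JSON list of document file paths
    for path in document_files:
        # Only one document file is held in memory at any moment.
        with open(path, encoding="utf-8") as f:
            documents = json.load(f)  # {document_id: text}
        for doc_id, text in documents.items():
            yield doc_id, text
```

Processing one file at a time is what makes the "small enough to load into memory" requirement apply per file rather than to the whole collection.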
{ "document_id_1": { "created_at": <date>, "author": <username> }, ... }
Document metadata: dates are strings in Unix time format, i.e., the number of seconds since the epoch (1970-01-01 00:00:00 UTC).
Usernames are strings.
Example: data/example/docmeta.json
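Because dates are stored as strings, they need converting before any date arithmetic. The sketch below shows one way to do this with the standard library; the helper name `created_at_utc` is hypothetical, and accepting fractional seconds is an assumption.

```python
from datetime import datetime, timezone


def created_at_utc(meta_entry):
    """Convert a metadata entry's Unix time string to a timezone-aware datetime."""
    # float() accepts both "1400000000" and "1400000000.5" style strings.
    seconds = float(meta_entry["created_at"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.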
{ "concept_id_1": ["phrase 1 expressing concept", "phrase 2 expressing concept", ...], ... }
For example, a thesaurus containing only a few phrases about the hair loss concept might look like this:
{ "26": ["bald", "alopecia", "hair_loss"] }
To use a dictionary rather than a thesaurus as input, assign exactly one phrase to each concept.
Example: data/example/thesaurus.json
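The dictionary case can be sketched as below. The helper name `dictionary_to_thesaurus` is hypothetical, and using each phrase's list position as its concept id is an assumption; any unique string ids would fit the format.

```python
def dictionary_to_thesaurus(phrases):
    """Wrap a flat list of phrases in the thesaurus format, one phrase per concept."""
    # Assumption: the concept id is the phrase's position, stored as a string
    # because JSON object keys are always strings.
    return {str(i): [phrase] for i, phrase in enumerate(phrases)}
```

Calling `dictionary_to_thesaurus(["flu", "bald"])` gives a mapping in which "0" lists only "flu" and "1" lists only "bald".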
{ "concept_id_X": ["document_id_123", "document_id_128", ...],
  "concept_id_Y": ...
}
ConceptPairs: each concept id is mapped to the list of ids of the documents in which the concept occurs.
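A method that produces individual (concept, document id) pairs can group them into this format along the following lines. The helper name `group_pairs` is hypothetical, and keeping repeated pairs as repeated document ids is an assumption.

```python
def group_pairs(pairs):
    """Group (concept_id, document_id) pairs into {concept_id: [document_id, ...]}."""
    grouped = {}
    for concept_id, doc_id in pairs:
        # Document ids keep the order in which their pairs were seen.
        grouped.setdefault(concept_id, []).append(doc_id)
    return grouped
```

The resulting dict can be written out with `json.dump` to produce a ConceptPairs file.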
{ "concept_id_X": [ { "start": <date>, "end": <date>, "strength": <strength> },
                    { "start": <date>, "end": <date>, "strength": <strength> },
                    ...
                  ],
  "concept_id_Y": [ { "start": <date>, "end": <date>, "strength": <strength> }, ... ]
}
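A consumer of this format might check whether a concept is trending at a given moment, as sketched below. The helper name `is_trending` is hypothetical. Two assumptions are made: the start and end dates use the same Unix time strings as the document metadata, and both interval endpoints are inclusive.

```python
def is_trending(trend_times, concept_id, timestamp):
    """Return True if timestamp (Unix seconds) falls inside any trend interval."""
    for interval in trend_times.get(concept_id, []):
        # Assumption: "start" and "end" are Unix time strings, both inclusive.
        if float(interval["start"]) <= timestamp <= float(interval["end"]):
            return True
    return False
```

A concept that is absent from the file is treated as never trending.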