Internal XML Format - VincTheSecond/rextractor GitHub Wiki

Here is a description of the elements used in internal XML documents:

document
Root element
No attributes.
document → metadata
Document metadata section. So far undefined.
No attributes.
document → body
Source text for extraction process.
No attributes.
document → body → text
One piece of text to process. Entities and relations are detected independently for each text element. The element may contain one or several sentences.
Attributes:
- id
  - Unique identifier of source text.
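As an illustration of the top-level layout described so far, the following minimal sketch parses a small document and collects its text elements by id. The sample ids and sentences are invented for the example, and the empty metadata and description elements are placeholders only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; ids and sentences are invented for illustration.
SAMPLE = """<document>
  <metadata/>
  <body>
    <text id="t1">The tenant pays rent to the landlord.</text>
    <text id="t2">The landlord maintains the property.</text>
  </body>
  <description/>
</document>
"""

root = ET.fromstring(SAMPLE)

# Map each text id to its content; every <text> element is processed independently.
texts = {t.get("id"): t.text for t in root.findall("./body/text")}

for text_id, content in texts.items():
    print(text_id, "->", content)
```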
document → description
Contains information about data sources, as well as all output data extracted from texts.
No attributes.
document → description → resources
Describes sources of input text.
No attributes.
document → description → resources → resource
Describes one source of input text. The text in a document → body → text element may be composed of several resource text snippets. Each resource snippet is located precisely by its character offsets in the input text.
Attributes:
- text_id -- id of the input text element document → body → text
- resource -- id of the resource text in the originally submitted document
- start -- character offset in the input text where the resource begins
- end -- character offset in the input text where the resource ends
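The offsets can be used to recover each resource snippet from its input text. The sketch below is only an illustration: it assumes 0-based offsets with an exclusive end, which this page does not state, and the sample ids and text are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; offsets are assumed to be 0-based with an exclusive end.
SAMPLE = """<document>
  <body>
    <text id="t1">First snippet. Second snippet.</text>
  </body>
  <description>
    <resources>
      <resource text_id="t1" resource="r1" start="0" end="14"/>
      <resource text_id="t1" resource="r2" start="15" end="30"/>
    </resources>
  </description>
</document>
"""

def resource_snippets(root):
    """Return {resource id: snippet} by slicing the referenced input text."""
    texts = {t.get("id"): t.text for t in root.findall("./body/text")}
    snippets = {}
    for res in root.findall("./description/resources/resource"):
        text = texts[res.get("text_id")]
        start, end = int(res.get("start")), int(res.get("end"))
        snippets[res.get("resource")] = text[start:end]  # assumption: end is exclusive
    return snippets

root = ET.fromstring(SAMPLE)
snippets = resource_snippets(root)
print(snippets)
```

If the real format uses an inclusive end offset, the slice would need to be `text[start:end + 1]` instead.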
document → description → chunks
List of text chunks marked in the document.
No attributes.
document → description → chunks → chunk
Definition of one text chunk.
Attributes:
- text_id -- id of the input text element document → body → text where the text chunk is marked
- chunk_id -- unique identifier of the text chunk
- start -- character offset in the input text where the text chunk begins
- end -- character offset in the input text where the text chunk ends
- nodes -- whitespace-separated (\s) list of tree nodes (see [PML language format](PML Description)) that appear in the text chunk
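A chunk element could be read into a small record as sketched below. The node identifiers (`n1`, `n2`, `n4`) are placeholders, since the actual PML node id format is not shown here, and the offsets are again assumed to be 0-based with an exclusive end.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical sample; node ids and offsets are invented for illustration.
SAMPLE = """<document>
  <body>
    <text id="t1">The tenant pays rent.</text>
  </body>
  <description>
    <chunks>
      <chunk text_id="t1" chunk_id="c1" start="0" end="10" nodes="n1 n2"/>
      <chunk text_id="t1" chunk_id="c2" start="16" end="20" nodes="n4"/>
    </chunks>
  </description>
</document>
"""

@dataclass
class Chunk:
    chunk_id: str
    text_id: str
    start: int
    end: int
    nodes: list  # tree node ids, split on whitespace

def read_chunks(root):
    """Parse every chunk element into a Chunk record."""
    return [
        Chunk(
            chunk_id=c.get("chunk_id"),
            text_id=c.get("text_id"),
            start=int(c.get("start")),
            end=int(c.get("end")),
            nodes=c.get("nodes", "").split(),  # str.split() handles any whitespace run
        )
        for c in root.findall("./description/chunks/chunk")
    ]

root = ET.fromstring(SAMPLE)
texts = {t.get("id"): t.text for t in root.findall("./body/text")}
chunks = read_chunks(root)

# Surface form of each chunk (assumption: end offset is exclusive).
surfaces = {c.chunk_id: texts[c.text_id][c.start:c.end] for c in chunks}
print(surfaces)
```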
document → description → entities
List of entities detected in the document. Each entity may consist of one or several text chunks, and several entities may share one text chunk. An entity can be detected either during the Entity detection process or during Relation detection. In the first case, the entity description contains a link to DBE, where several data fields are stored, including the ontological concept. In the second case, no DBE link is available and the ontological concept is given in the relation description.
No attributes.
document → description → entities → entity
Definition of one entity.
Attributes:
- entity_id -- unique identifier of the entity
- dbe_id -- link to [DBE](Database of Entities), if the entity was detected during entity detection
- chunk_ids -- list of identifiers of the chunks that make up the entity
- nodes -- whitespace-separated (\s) list of tree nodes (see [PML language format](PML Description)) that appear in the entity
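The link between entities and chunks could be checked as in the following sketch. Several things here are assumptions rather than documented behavior: the identifier attribute is taken to be spelled `entity_id`, `chunk_ids` is taken to be whitespace-separated like `nodes`, and a missing or empty `dbe_id` is treated as "no DBE link". All sample values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; e2 has no dbe_id, as for an entity created by relation detection.
SAMPLE = """<document>
  <description>
    <chunks>
      <chunk text_id="t1" chunk_id="c1" start="0" end="10" nodes="n1 n2"/>
      <chunk text_id="t1" chunk_id="c2" start="16" end="20" nodes="n4"/>
    </chunks>
    <entities>
      <entity entity_id="e1" dbe_id="dbe42" chunk_ids="c1" nodes="n1 n2"/>
      <entity entity_id="e2" chunk_ids="c1 c2" nodes="n1 n2 n4"/>
    </entities>
  </description>
</document>
"""

def read_entities(root):
    """Return {entity id: info}, verifying that every referenced chunk exists."""
    known = {c.get("chunk_id") for c in root.findall("./description/chunks/chunk")}
    entities = {}
    for e in root.findall("./description/entities/entity"):
        chunk_ids = e.get("chunk_ids", "").split()  # assumption: whitespace-separated
        missing = [cid for cid in chunk_ids if cid not in known]
        if missing:
            raise ValueError(f"entity {e.get('entity_id')} references unknown chunks: {missing}")
        entities[e.get("entity_id")] = {
            "chunk_ids": chunk_ids,
            "dbe_id": e.get("dbe_id") or None,  # None means no DBE link
        }
    return entities

root = ET.fromstring(SAMPLE)
entities = read_entities(root)
print(entities)
```

In this sample both entities share chunk `c1`, which the format allows.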
document → description → relations
List of relations detected in the document. A relation can be defined using existing entities (detected in the Entity detection process) or it can create new entities. If a new entity is created, its description in document → description → entities does not have a link to DBE.
No attributes.
document → description → relations → relation
Definition of one relation.
Attributes:
- relation_id -- unique identifier of the relation
- dbr_id -- link to [DBR](Database of Relations) with formal description of relation
- (subject|predicate|object)_ids -- identifiers of the entities in the subject, predicate, or object position
- (subject|predicate|object)_concept -- ontological concept of the entities in the corresponding position
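The `(subject|predicate|object)_*` pattern expands to six attributes per relation. A minimal sketch of reading them follows; the sample ids and concept names are invented, and the `*_ids` values are assumed to be whitespace-separated lists.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample relation; ids and concept names are invented for illustration.
SAMPLE = """<document>
  <description>
    <relations>
      <relation relation_id="r1" dbr_id="dbr7"
                subject_ids="e1" predicate_ids="e2" object_ids="e3 e4"
                subject_concept="Tenant" predicate_concept="Pay" object_concept="Rent"/>
    </relations>
  </description>
</document>
"""

ROLES = ("subject", "predicate", "object")

def read_relation(rel):
    """Collect the ids and concept for each of the three roles of one relation."""
    record = {"relation_id": rel.get("relation_id"), "dbr_id": rel.get("dbr_id")}
    for role in ROLES:
        record[role] = {
            "ids": rel.get(f"{role}_ids", "").split(),  # assumption: whitespace-separated
            "concept": rel.get(f"{role}_concept"),
        }
    return record

root = ET.fromstring(SAMPLE)
relations = [read_relation(r) for r in root.findall("./description/relations/relation")]
print(relations[0])
```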