Output Description File - VincTheSecond/rextractor GitHub Wiki
The XML document describes the entities and relations detected in the input document. Each entity is defined as a list of text chunks (element annotation in the HTML document), and each relation is defined as a set of entities. The elements used in the XML document are described below:
-
document
- Root element.
- No attributes.
-
document → metadata
- Document metadata section. Its content is not defined yet.
- No attributes.
-
document → entities
- List of entities detected in the input document.
- No attributes.
-
document → entities → entity
- Description of one entity detected in the document.
- Attributes:
- entity_id
- Unique id of the entity. This id is used to address the entity in relation definitions.
- dbe_id
- Identifier of the entry in the Database of Entities. Defined only if the entity was detected by the Entity component.
- chunk_ids
- List of chunks (element in HTML document) that make up the entity.
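A minimal sketch of reading the entity list with Python's standard `xml.etree.ElementTree`. The sample document, its attribute values, and the whitespace separator inside `chunk_ids` are assumptions made for illustration; the real output may delimit the list differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: ids, values, and the whitespace-separated
# chunk_ids format are assumptions, not real rextractor output.
SAMPLE = """
<document>
  <metadata/>
  <entities>
    <entity entity_id="e1" dbe_id="dbe_42" chunk_ids="c1 c2"/>
    <entity entity_id="e2" chunk_ids="c3"/>
  </entities>
  <relations/>
</document>
"""


def read_entities(xml_text):
    """Return a dict mapping entity_id to its dbe_id and list of chunk ids."""
    root = ET.fromstring(xml_text.strip())
    entities = {}
    for entity in root.findall("./entities/entity"):
        entities[entity.get("entity_id")] = {
            # dbe_id is absent (None) when the entity has no database entry.
            "dbe_id": entity.get("dbe_id"),
            # Assumed separator: whitespace.
            "chunk_ids": entity.get("chunk_ids", "").split(),
        }
    return entities


entities = read_entities(SAMPLE)
print(entities["e1"]["chunk_ids"])
print(entities["e2"]["dbe_id"])
```

Keying the result by `entity_id` keeps later lookups from relation definitions simple.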
-
document → entities → entity → dependency_tree
- Definition of the dependency tree for the entity. Each token of the entity has its own node in the tree. Each node (element ) is described by several attributes:
- form
- Original form of the token.
- lemma
- Base form of the token.
- ord
- Ordinal number (position) of the token in the entity.
- parent
- Ordinal number of the parent token.
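The node attributes above are enough to rebuild the tree in memory. The following is a sketch only: the child element name `node` in the sample is a placeholder (the code iterates over all children of `dependency_tree` and does not depend on it), and using `parent="0"` to mark the root is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag name "node" and the parent="0" root
# convention are assumptions made for this illustration.
SAMPLE_ENTITY = """
<entity entity_id="e1" chunk_ids="c1">
  <dependency_tree>
    <node form="Supreme" lemma="supreme" ord="1" parent="2"/>
    <node form="Courts" lemma="court" ord="2" parent="0"/>
  </dependency_tree>
</entity>
"""


def read_tree(entity_el):
    """Return a dict mapping each token's ord to its form, lemma, and parent ord."""
    tree_el = entity_el.find("dependency_tree")
    nodes = {}
    if tree_el is None:
        return nodes
    # Iterate over every child element, whatever its tag name is.
    for node in tree_el:
        nodes[int(node.get("ord"))] = {
            "form": node.get("form"),
            "lemma": node.get("lemma"),
            "parent": int(node.get("parent")),
        }
    return nodes


def children_of(nodes, parent_ord):
    """Return the ord numbers of all tokens whose parent is parent_ord."""
    return sorted(o for o, n in nodes.items() if n["parent"] == parent_ord)


entity_el = ET.fromstring(SAMPLE_ENTITY.strip())
nodes = read_tree(entity_el)
roots = children_of(nodes, 0)
lemmas = [nodes[o]["lemma"] for o in sorted(nodes)]
print(roots, lemmas)
```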
-
document → relations
- List of detected relations.
- No attributes.
-
document → relations → relation
- Description of one specific relation.
- Attributes:
- relation_id
- Unique relation identifier.
- dbr_id
- Id of the query that describes the relation and its RDF representation.
- (subject|predicate|object)_ids
- List of ids of the entities in the $1 position, where $1 stands for the matched role name (subject, predicate, or object).
- (subject|predicate|object)_concept
- Concept of the entity used in the RDF transformation.
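As an illustration of how the `(subject|predicate|object)_ids` attributes refer back to entities, the sketch below reads each relation and checks that every referenced id exists in the entity list. The sample values, the concept names, and the whitespace separator in the `*_ids` attributes are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: ids, concepts, and the whitespace separator
# in the *_ids attributes are assumptions made for this illustration.
SAMPLE = """
<document>
  <entities>
    <entity entity_id="e1" chunk_ids="c1"/>
    <entity entity_id="e2" chunk_ids="c2"/>
    <entity entity_id="e3" chunk_ids="c3"/>
  </entities>
  <relations>
    <relation relation_id="r1" dbr_id="q7"
              subject_ids="e1" predicate_ids="e2" object_ids="e3"
              subject_concept="Person" predicate_concept="owns"
              object_concept="Company"/>
  </relations>
</document>
"""

ROLES = ("subject", "predicate", "object")


def read_relations(xml_text):
    """Return one record per relation, with entity ids and concept for each role."""
    root = ET.fromstring(xml_text.strip())
    known = {e.get("entity_id") for e in root.findall("./entities/entity")}
    relations = []
    for rel in root.findall("./relations/relation"):
        record = {
            "relation_id": rel.get("relation_id"),
            "dbr_id": rel.get("dbr_id"),
        }
        for role in ROLES:
            ids = rel.get(f"{role}_ids", "").split()
            # Every id must point at an entity defined in the same document.
            missing = [i for i in ids if i not in known]
            if missing:
                raise ValueError(f"{record['relation_id']}: unknown entity ids {missing}")
            record[role] = {"ids": ids, "concept": rel.get(f"{role}_concept")}
        relations.append(record)
    return relations


relations = read_relations(SAMPLE)
print(relations[0]["subject"], relations[0]["object"])
```

Failing on an unknown id is a design choice for this sketch; a consumer could just as well skip such relations.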