Output Description File - VincTheSecond/rextractor GitHub Wiki

The XML document describes the entities and relations detected in the input document. Each entity is defined as a list of text chunks (element annotation in the HTML document), and each relation is defined as a set of entities. The elements used in the XML document are described below:

  • document

    • Root element.
    • No attributes.
  • document → metadata

    • Document metadata section. So far undefined.
    • No attributes.
  • document → entities

    • List of entities detected in the input document.
    • No attributes.
  • document → entities → entity

    • Description of one entity detected in the document.
    • Attributes:
      • entity_id
        • Unique id of the entity. It is used to refer to the entity in relation definitions.
      • dbe_id
        • Identifier of the entry in the Database of Entities. Defined only if the entity was detected by the Entity component.
      • chunk_ids
        • List of chunks (element in the HTML document) that make up the entity.
  • document → entities → entity → dependency_tree

    • Definition of the dependency tree for the entity. Each token of the entity has its own node in the tree. Each node (element) is described by several attributes:
      • form
        • Original form of the token.
      • lemma
        • Base form of the token.
      • ord
        • Ordinal number (position) of the token in the entity.
      • parent
        • Ordinal number (ord) of the parent token.
  • document → relations

    • List of detected relations.
    • No attributes.
  • document → relations → relation

    • Description of one specific relation.
    • Attributes:
      • relation_id
        • Unique relation identifier.
      • dbr_id
        • Id of the query that describes the relation and its RDF representation.
      • (subject|predicate|object)_ids
        • List of ids of the entities in the subject, predicate, or object position, respectively.
      • (subject|predicate|object)_concept
        • Concept of the entity used in RDF transformation.
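
To make the structure above concrete, here is a minimal Python sketch that reads such a file and resolves each relation's subject, predicate, and object ids to the text of the referenced entities. The element and attribute names are taken from the description above, but several details are assumptions made only for illustration: the `node` element name for dependency tree nodes, whitespace-separated id lists, `parent="0"` for the root token, and every value in the sample document. The actual output of the tool may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element and attribute names follow the description
# above; the <node> element name, the space-separated id lists, parent="0"
# for the root token, and all values are assumptions for illustration.
SAMPLE = """\
<document>
  <metadata/>
  <entities>
    <entity entity_id="e1" dbe_id="dbe_17" chunk_ids="c1 c2">
      <dependency_tree>
        <node form="Supreme" lemma="supreme" ord="1" parent="2"/>
        <node form="Court" lemma="court" ord="2" parent="0"/>
      </dependency_tree>
    </entity>
    <entity entity_id="e2" chunk_ids="c3">
      <dependency_tree>
        <node form="decides" lemma="decide" ord="1" parent="0"/>
      </dependency_tree>
    </entity>
    <entity entity_id="e3" chunk_ids="c4">
      <dependency_tree>
        <node form="appeals" lemma="appeal" ord="1" parent="0"/>
      </dependency_tree>
    </entity>
  </entities>
  <relations>
    <relation relation_id="r1" dbr_id="q5"
              subject_ids="e1" predicate_ids="e2" object_ids="e3"
              subject_concept="Court" predicate_concept="decides"
              object_concept="Appeal"/>
  </relations>
</document>
"""


def entity_text(entity):
    """Join the token forms of one entity in `ord` order."""
    nodes = entity.findall("./dependency_tree/node")
    nodes.sort(key=lambda n: int(n.get("ord")))
    return " ".join(n.get("form") for n in nodes)


def read_relations(xml_text):
    """Return one dict per relation, with entity ids resolved to entity text."""
    root = ET.fromstring(xml_text)
    # Index entities by entity_id, the key that relations use to refer to them.
    entities = {e.get("entity_id"): e for e in root.findall("./entities/entity")}
    result = []
    for rel in root.findall("./relations/relation"):
        row = {"relation_id": rel.get("relation_id"), "dbr_id": rel.get("dbr_id")}
        for role in ("subject", "predicate", "object"):
            ids = rel.get(role + "_ids", "").split()
            row[role] = [entity_text(entities[i]) for i in ids]
            row[role + "_concept"] = rel.get(role + "_concept")
        result.append(row)
    return result


relations = read_relations(SAMPLE)
for r in relations:
    print(r["relation_id"], r["subject"], r["predicate"], r["object"])
```

Indexing the entities by `entity_id` first keeps relation lookup simple, since every `(subject|predicate|object)_ids` attribute refers back to those ids. If the real files separate ids differently (for example with commas), only the `split()` call needs to change.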