Internal XML Format - VincTheSecond/rextractor GitHub Wiki
Here is a description of the elements used in internal XML documents:
-
document
-
Root element
-
No attributes.
-
document → metadata
-
Document metadata section. So far undefined.
-
No attributes.
-
document → body
-
Source text for extraction process.
-
No attributes.
-
document → body → text
-
One piece of text to process. Entities and relations are detected independently for each text element. This element may contain one or several sentences.
-
Attributes:
- id -- unique identifier of the source text
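
As a rough illustration of the skeleton described so far, the sketch below builds a small document and reads its text elements with Python's standard `xml.etree.ElementTree`. The element names follow the paths above; the id values, the sentences, and the helper name `texts_by_id` are invented for this example and are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element names follow the paths described
# above, but the ids and sentences are made up for illustration.
SAMPLE = """<document>
  <metadata/>
  <body>
    <text id="t1">Alice works for Acme.</text>
    <text id="t2">Bob lives in Prague. He moved there in 2010.</text>
  </body>
  <description/>
</document>"""


def texts_by_id(xml_string):
    """Return a dict mapping each text id to its content."""
    root = ET.fromstring(xml_string)  # root is the <document> element
    return {t.get("id"): (t.text or "") for t in root.findall("./body/text")}


texts = texts_by_id(SAMPLE)
for text_id, content in texts.items():
    print(text_id, "->", content)
```

Keying the texts by `id` is convenient because the elements under `description` refer back to them through `text_id`.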
-
document → description
-
Contains information about data sources, as well as all output data extracted from texts.
-
No attributes.
-
document → description → resources
-
Describes sources of input text.
-
No attributes.
-
document → description → resources → resource
-
Describes one source of input text. The text in the element document → body → text may be composed of several resource text snippets. Each resource snippet is specified here exactly by its character offsets in the input text.
-
Attributes:
- text_id -- id of the input text element document → body → text
- resource -- id of resource text from original submitted document
- start -- character offset with the position where resource begins in the input text
- end -- character offset with the position where resource ends in the input text
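
The following sketch shows how the resource attributes could be used to cut an input text back into its source snippets. It assumes that offsets are zero-based and that `end` is exclusive (Python slice semantics); the page does not state this, so treat it as an assumption. The resource ids `r1` and `r2` and the helper name `resource_snippets` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one input text composed of two resource snippets.
SAMPLE = """<document>
  <body>
    <text id="t1">Alice works for Acme. Bob lives in Prague.</text>
  </body>
  <description>
    <resources>
      <resource text_id="t1" resource="r1" start="0" end="21"/>
      <resource text_id="t1" resource="r2" start="22" end="42"/>
    </resources>
  </description>
</document>"""


def resource_snippets(xml_string):
    """Return (resource id, snippet) pairs in document order.

    Assumption: offsets are zero-based and `end` is exclusive.
    """
    root = ET.fromstring(xml_string)
    texts = {t.get("id"): (t.text or "") for t in root.findall("./body/text")}
    result = []
    for res in root.findall("./description/resources/resource"):
        text = texts[res.get("text_id")]
        start, end = int(res.get("start")), int(res.get("end"))
        result.append((res.get("resource"), text[start:end]))
    return result


snippets = resource_snippets(SAMPLE)
```

If the real documents use inclusive end offsets, the slice would need to be `text[start:end + 1]` instead.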
-
document → description → chunks
-
List of text chunks marked in the document.
-
No attributes.
-
document → description → chunks → chunk
-
Definition of one text chunk.
-
Attributes:
- text_id -- id of the input text element document → body → text in which the text chunk is marked
- chunk_id -- unique identifier of the text chunk
- start -- character offset in the input text where text chunk begins
- end -- character offset in the input text where text chunk ends
- nodes -- list of tree nodes (see [PML language format](PML Description)) separated by whitespace (\s) which appear in the text chunk
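
A minimal sketch of reading chunk definitions, assuming zero-based offsets with an exclusive `end` (not confirmed by this page). The `nodes` attribute is split on whitespace as described above. The node ids (`n1`, `n2`, ...), the `Chunk` class, and the helper `read_chunks` are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample; chunk ids, node ids and offsets are invented.
SAMPLE = """<document>
  <body>
    <text id="t1">Alice works for Acme.</text>
  </body>
  <description>
    <chunks>
      <chunk text_id="t1" chunk_id="c1" start="0" end="5" nodes="n1"/>
      <chunk text_id="t1" chunk_id="c2" start="6" end="15" nodes="n2 n3"/>
      <chunk text_id="t1" chunk_id="c3" start="16" end="20" nodes="n4"/>
    </chunks>
  </description>
</document>"""


@dataclass
class Chunk:
    chunk_id: str
    text_id: str
    surface: str   # the characters covered by the chunk
    nodes: list    # tree node identifiers appearing in the chunk


def read_chunks(xml_string):
    """Parse all chunk elements and attach the text span they cover."""
    root = ET.fromstring(xml_string)
    texts = {t.get("id"): (t.text or "") for t in root.findall("./body/text")}
    chunks = []
    for c in root.findall("./description/chunks/chunk"):
        text = texts[c.get("text_id")]
        start, end = int(c.get("start")), int(c.get("end"))  # end assumed exclusive
        nodes = c.get("nodes", "").split()  # split on any whitespace
        chunks.append(Chunk(c.get("chunk_id"), c.get("text_id"), text[start:end], nodes))
    return chunks


chunks = read_chunks(SAMPLE)
```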
-
document → description → entities
-
List of entities detected in the document. Each entity may consist of one or several text chunks, and several entities may share one text chunk. An entity can be detected either during the Entity detection process or during Relation detection. In the first case, the entity description contains a link to the DBE, which stores several data fields, including the ontological concept. In the second case, no DBE link is available and the ontological concept is given in the relation description.
-
No attributes.
-
document → description → entities → entity
-
Definition of one entity.
-
Attributes:
- enitity_id -- unique identifier of the entity
- dbe_id -- link to [DBE](Database of Entities), if the entity was detected during entity detection
- chunk_ids -- list of identifiers of the chunks which the entity consists of
- nodes -- list of tree nodes (see [PML language format](PML Description)) separated by whitespace (\s) which appear in the entity
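
The sketch below reads entity definitions and checks that every referenced chunk exists. Several details are assumptions rather than facts from this page: `chunk_ids` is treated as a whitespace-separated list (as documented for `nodes`), a missing `dbe_id` is taken to mean the entity came from relation detection, and the identifier attribute is read under the spelling used in the attribute list above. The sample values and the helper name `read_entities` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute name spelled exactly as in the attribute list above;
# adjust it if the real documents use a different spelling.
ENTITY_ID_ATTR = "enitity_id"

# Hypothetical sample; ids and the dbe_id value are invented.
SAMPLE = """<document>
  <description>
    <chunks>
      <chunk text_id="t1" chunk_id="c1" start="0" end="5" nodes="n1"/>
      <chunk text_id="t1" chunk_id="c2" start="16" end="20" nodes="n4"/>
    </chunks>
    <entities>
      <entity enitity_id="e1" dbe_id="dbe-1" chunk_ids="c1" nodes="n1"/>
      <entity enitity_id="e2" chunk_ids="c2" nodes="n4"/>
    </entities>
  </description>
</document>"""


def read_entities(xml_string):
    """Return a dict of entities keyed by identifier.

    Raises ValueError if an entity refers to an undefined chunk.
    """
    root = ET.fromstring(xml_string)
    known = {c.get("chunk_id") for c in root.findall("./description/chunks/chunk")}
    entities = {}
    for e in root.findall("./description/entities/entity"):
        chunk_ids = e.get("chunk_ids", "").split()  # assumed whitespace-separated
        missing = [cid for cid in chunk_ids if cid not in known]
        if missing:
            raise ValueError(f"unknown chunk ids: {missing}")
        entities[e.get(ENTITY_ID_ATTR)] = {
            "dbe_id": e.get("dbe_id"),  # None when no DBE link is present
            "chunk_ids": chunk_ids,
            "nodes": e.get("nodes", "").split(),
        }
    return entities


entities = read_entities(SAMPLE)
```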
-
document → description → relations
-
List of relations detected in the document. A relation can be defined using existing entities (detected in the Entity detection process) or it can create new entities. If a new entity is created, its description in document → description → entities does not have a link to the DBE.
-
No attributes.
-
document → description → relations → relation
-
Definition of one relation.
-
Attributes:
- relation_id -- unique identifier of the relation
- dbr_id -- link to [DBR](Database of Relations) with formal description of relation
- (subject|predicate|object)_ids -- identifiers of entities in the position of subject, predicate or object
- (subject|predicate|object)_concept -- ontological concept of entities specified above.
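
As an illustration, the pattern `(subject|predicate|object)_ids` can be read as three separate attributes, `subject_ids`, `predicate_ids` and `object_ids`, and likewise for `_concept`. The sketch below parses them under that reading and assumes the `_ids` values are whitespace-separated, which this page does not confirm. The sample ids, concept names, and the helper `read_relations` are hypothetical.

```python
import xml.etree.ElementTree as ET

ROLES = ("subject", "predicate", "object")

# Hypothetical sample; ids and concept names are invented.
SAMPLE = """<document>
  <description>
    <relations>
      <relation relation_id="r1" dbr_id="dbr-7"
                subject_ids="e1" subject_concept="Person"
                predicate_ids="e3" predicate_concept="Employment"
                object_ids="e2" object_concept="Organization"/>
    </relations>
  </description>
</document>"""


def read_relations(xml_string):
    """Return a list of relations with per-role entity ids and concepts."""
    root = ET.fromstring(xml_string)
    relations = []
    for r in root.findall("./description/relations/relation"):
        rel = {"relation_id": r.get("relation_id"), "dbr_id": r.get("dbr_id")}
        for role in ROLES:
            rel[role] = {
                "ids": r.get(f"{role}_ids", "").split(),  # assumed whitespace-separated
                "concept": r.get(f"{role}_concept"),
            }
        relations.append(rel)
    return relations


relations = read_relations(SAMPLE)
```

Because entities created during relation detection carry no DBE link, the `concept` values read here are the only place their ontological concept is available.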