FreeDict HOWTO – XML Markup - freedict/fd-dictionaries GitHub Wiki

XML Markup

FreeDict dictionaries are marked up using the XML version of the Text Encoding Initiative (TEI) guidelines, Chapter 9 (Dictionaries). This format has been widely adopted throughout the linguistics community and can serve as a basis for both human-readable dictionaries and machine-readable corpora. This section discusses the technical advantages and disadvantages of the format. For a more practical introduction to the format, please see the chapter on Editing TEI XML. Full instructions are given in the "Writing a FreeDict Dictionary" section. We are also developing and testing tools that may be suitable for automating most of this work.

A standard, content-based approach like this has many advantages:

Advantages of using the TEI XML markup format

  1. Inherits most of the advantages of XML, including:

    • content-based rather than layout-based
    • application independent
    • platform independent
    • further processing readily possible across the entire FreeDict collection
    • enables full use of existing or customised XML technologies
  2. Standardises input and output formats.

  3. Protects against obsolescence.

  4. TEI has comprehensive DTDs available.

    • The Dictionary DTD is just one of a very wide conceptual set.
    • Elements already exist for lexicographic, etymological, phonetic and other particularities of dictionaries.
    • The TEI XML combination allows processing, development and use beyond the immediate scope of the FreeDict translating dictionaries.
  5. TEI technologies are reasonably well understood and used in academic circles.
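
As a rough illustration of the "further processing" point above, the following sketch reads headwords and translations from a small TEI fragment using only Python's standard library. The sample entry and the helper name `extract_entries` are assumptions made for illustration; they are not taken from an actual FreeDict dictionary, and the elements shown (`entry`, `form`, `orth`, `sense`, `cit`, `quote`) are only one plausible shape of a TEI dictionary entry.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample data, invented for this sketch.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <entry>
        <form>
          <orth>Haus</orth>
        </form>
        <sense>
          <cit type="trans"><quote>house</quote></cit>
          <cit type="trans"><quote>home</quote></cit>
        </sense>
      </entry>
    </body>
  </text>
</TEI>
"""


def extract_entries(xml_text):
    """Return a list of (headword, [translations]) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        # The headword is the text of the first <orth> inside <form>.
        headword = entry.findtext("tei:form/tei:orth", default="", namespaces=TEI_NS)
        # Translations are the <quote> children of <cit type="trans">.
        translations = [
            quote.text
            for quote in entry.iterfind(".//tei:cit[@type='trans']/tei:quote", TEI_NS)
        ]
        result.append((headword, translations))
    return result


if __name__ == "__main__":
    for headword, translations in extract_entries(SAMPLE):
        print(headword, "->", ", ".join(translations))
```

Because the markup describes content rather than layout, the same few lines would work on any dictionary that follows the same entry structure, regardless of how it is later rendered.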

Like anything else, TEI XML also has disadvantages:

Disadvantages of using the TEI XML markup format

  1. High memory requirements, for storage as well as for processing.

  2. The TEI DTD is too permissive. It allows overly complex content models for its elements because it was written to capture as many existing texts as possible. Since almost all elements are allowed inside others, writing software to further process TEI data becomes complex. FreeDict therefore uses its own subset of the TEI DTD. This subset will be formally defined in this HOWTO once it is stable. Until then, it is only described.

  3. Because of its verbosity, XML data requires more than a plain text editor for easy maintenance. For example, you cannot add entries quickly when you have to type every tag by hand. There is no FreeDict-specific solution to this yet, but a number of solutions exist in the wider TEI community.
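
One way to reduce the manual tagging described in the last point is to generate the markup from a simpler input format. The sketch below is only an illustration under stated assumptions, not an existing FreeDict tool: the tab-separated input format, the function names `parse_line` and `build_entry`, and the choice of elements are all hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_line(line):
    """Split 'headword<TAB>translation; translation' into its parts.

    The input format is an assumption made for this sketch.
    """
    headword, _, rest = line.rstrip("\n").partition("\t")
    translations = [part.strip() for part in rest.split(";") if part.strip()]
    return headword.strip(), translations


def build_entry(headword, translations):
    """Serialise one dictionary entry as a TEI-style <entry> element."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    sense = ET.SubElement(entry, "sense")
    for translation in translations:
        cit = ET.SubElement(sense, "cit", {"type": "trans"})
        ET.SubElement(cit, "quote").text = translation
    # encoding="unicode" returns a str without an XML declaration;
    # special characters such as "&" are escaped automatically.
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    headword, translations = parse_line("Haus\thouse; home")
    print(build_entry(headword, translations))
```

Building the elements through an XML library rather than by string concatenation keeps the output well-formed and takes care of escaping, which is the part most easily broken when tags are typed by hand.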
