Cross Compilation - spencermountain/wtf_wikipedia GitHub Wiki

The cross compilation of wiki source converts the downloaded wiki source string into another format that can be handled in a

  • database with JSON output,
  • HTML or RevealJS,
  • Markdown,
  • LaTeX,
  • ...

for further editing and processing of the document. The library wtf_wikipedia offers cross compilation into the formats JSON, HTML, Markdown and LaTeX.

Plain Text Export

wtf_wikipedia also offers a plain text method text(), that returns only paragraphs of nice text and no junk. The following example fetches an article from the English Wikipedia and prints its plain text:

wtf.fetch('Toronto Blue Jays', 'en', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.text());
});

You can test these plaintext export features with the js-file ./bin/wtf.js by calling:

$ node ./bin/wtf.js --plaintext George_Clooney

Markdown Export

wtf_wikipedia also offers a markdown method, that returns the article converted into Markdown syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

wtf.fetch('3D Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.markdown());
});

You can test these Markdown export features with the js-file ./bin/wtf.js by calling:

$ node ./bin/wtf.js --markdown George_Clooney

HTML Export

wtf_wikipedia also offers an html method, that returns the article converted into HTML syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

wtf.fetch('3D Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.html());
});

You can test these HTML export features with the js-file ./bin/wtf.js by calling:

$ node ./bin/wtf.js --html George_Clooney

LaTeX Export

wtf_wikipedia also offers a latex method, that returns the article converted into LaTeX syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

wtf.fetch('3D_Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  // converts the Wikiversity article about "3D Modelling"
  // from the english domain https://en.wikiversity.org
  // https://en.wikiversity.org/wiki/3D_Modelling
  console.log(doc.latex());
});

You can test these LaTeX export features with the js-file ./bin/wtf.js by calling:

$ node ./bin/wtf.js --latex George_Clooney

JSON Export

wtf_wikipedia aggregates information about a wiki article and populates a JSON object. If you want to see what kind of JSON is generated, you can output the stringified JSON to the console with:

wtf.fetch('Swarm_Intelligence', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(JSON.stringify(doc.json(), null, 4));
});

Preprocessing of Wiki Markdown

The fetch() call downloads the wiki source from the MediaWiki API. The wtf_fetch module is available as a separate module, if you want to access the wiki source without parsing. The hard task is parsing the source, because the wiki source language contains fragments with different grammars (e.g. LaTeX syntax wrapped in <math>...</math> tags). After downloading, some preprocessing might be helpful to further improve the cross compilation of the source text from the MediaWiki.

  • A tokenizer will replace some content elements (e.g. math tags) by a token and assign a specialized handler for these types of content elements. The token is regarded as a word in a sentence and will not create conflicts with other parsing processes and incompatible syntax (e.g. within a mathematical expression in LaTeX a colon : is just a symbol denoting a mathematical operation, while outside the <math> tags, as the first character of a line, it represents an indentation). A minimal sketch of this idea is shown below.
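The following is a minimal sketch of such a tokenizer, not the library's implementation; the function names and the token format are assumptions made for illustration:

// Minimal tokenizer sketch (illustration only): replace <math>...</math>
// fragments by tokens before parsing and restore them afterwards with a
// specialized handler (here: kept as raw LaTeX).
function tokenizeMath(wikiSource) {
  var store = [];
  var text = wikiSource.replace(/<math>([\s\S]*?)<\/math>/g, function (m, inner) {
    store.push(inner);
    return '__MATH_TOKEN_' + (store.length - 1) + '__'; // token acts like a word
  });
  return { text: text, store: store };
}

function restoreMath(text, store) {
  return text.replace(/__MATH_TOKEN_(\d+)__/g, function (m, id) {
    return '<math>' + store[Number(id)] + '</math>';
  });
}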

Define new Export formats

This section explains how developers can extend the capabilities of wtf_wikipedia to additional export formats.

If you want to create new additional export formats, try PanDoc document conversion to get an idea of which formats can be useful and are currently supported by PanDoc (see https://pandoc.org). Select MediaWiki as input format in the PanDoc web interface, copy a MediaWiki source text into the left input textarea, select an output format and press convert.

Workflow for new Exports

We explain how to extend wtf_wikipedia with a new export format (e.g. LibreOffice Writer). wtf_wikipedia is able to export to

  • PlainText,
  • HTML,
  • Markdown and
  • LaTeX.

The following sections describe the definition of a new export format step by step:

  • (1) Create a GitHub or GitLab repository with a name that indicates the purpose of the package (e.g. wtf_wikipedia_odt, because odt is the file extension of LibreOffice Writer; from LibreOffice you can export to Microsoft Office, but not vice versa). Create a source directory for the new output format method in /src/ for all tree nodes in the Abstract Syntax Tree (AST):

    • /src/01-document/,
    • /src/02-section/,
    • /src/03-paragraph/,
    • /src/image/,
    • /src/reference/,
    • /src/table/,
    • /src/list/,
    • /src/infobox/,
    • /src/math/ (not implemented in version 7.2.2 yet)
  • (2) All the AST nodes mentioned in (1) need a new export method (e.g. toOdt()). Look at the other export methods in the repository wtf_wikipedia/src to see how these are defined, e.g. toLatex() for LaTeX or toHtml() for HTML, and adapt them to your new export format.

  • (3) @spencermountain created a mapper to the new format names that allows the functions to be called html(), latex() or markdown(). This allows the recursive call of the method for all tree nodes according to the new output format (e.g. odt()).

  • (4) Now we need to assign the new export format to the Document object and all other AST nodes, so that the extended format is available at the root node. First see how it is done in wtf_wikipedia/src/document/Document.js, e.g. by

  const wtf = require('wtf_wikipedia');
  wtf.Document.odt = function (options) {
    // ...
    return odt_zip_file;
  };

Make sure that all tree nodes of the Abstract Syntax Tree (AST) have an export method for ODT and extend the module exports at the very end of the Section class in /src/section/Section.js (at line 240 ff.):

odt : function(options) {
    options = setDefaults(options, defaults);
    return toOdt(this, options);
  },
  ...
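
For orientation, here is a minimal sketch of what such a toOdt() helper could look like; the ODT flat-XML elements and the accessors title()/paragraphs()/text() used here are assumptions for illustration, not the library's implementation:

// Minimal toOdt() sketch for a section node (illustration only; adapt the
// accessors and the generated markup to the actual AST node API).
const toOdt = function (section, options) {
  let out = '';
  if (section.title()) {
    out += '<text:h text:outline-level="2">' + section.title() + '</text:h>\n';
  }
  section.paragraphs().forEach(function (p) {
    out += '<text:p>' + p.text() + '</text:p>\n';
  });
  return out;
};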
  • (5) Create or extend the test script in the directory /tests. A test script for the format odf will be named odf.test.js. A test script for the HTML based presentation format RevealJS (reveal) will be named reveal.test.js. Look at other formats, e.g. html.test.js, to understand the testing mechanism (a sketch follows after this list). Basically you
    • export a defined text with wtf (e.g. wtf.latex(...)) and store it in the have variable,
    • define the desired output in the want variable,
    • and t.equal(have, want, "test-case-name") defines the comparison of have and want.
    • html_tidy(), latex_tidy(), ... remove comments and generate compressed equivalent code for a smarter t.equal comparison. These functions are defined in tests/tidy.js.
  • (6) Run test and build for the extended wtf_wikipedia.
  • (7, optional) Create a pull request on the original wtf_wikipedia repository on GitHub, maintained by Spencer Kelly, to share the code with the community.
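
As an illustration of the have/want pattern described in step (5), a minimal test sketch might look like this; the odt() method, the expected output string and the require path are assumptions, and the tape-style t.equal follows the existing tests:

// tests/odt.test.js - minimal test sketch (illustration only)
const test = require('tape');
const wtf = require('../src/index');   // adjust the path to the repository layout

test('sentence with a link to odt', (t) => {
  const doc = wtf('Hello [[world]]');           // parse a small wiki source string
  const have = doc.odt();                       // hypothetical new export method
  const want = '<text:p>Hello world</text:p>';  // assumed desired output
  t.equal(have, want, 'plain sentence with one internal link');
  t.end();
});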

Handling Relative Links and Inter-Wiki Links in Wiki Source

If a source text from Wikipedia or Wikiversity is exported, the exported file is in general removed from its relative link context. The library /src/lib/wikiconvert.js contains a JavaScript class for preprocessing the relative links.

General approach:

  • Wiki source text was fetched e.g. from the English Wikiversity, then
    • the language ID is en and
    • the domain ID is wikiversity
  • a relative link replacement should be defined like this:
    • Input Wiki Markdown:
This is text of the English Wikipedia article "My Article".
My [[my wiki link]], my [[/relative link/]] and my [[w:de:mein_link|inter-wiki link]] to 
the German Wikipedia is defined by those links.
  • Output HTML:
This is text of the English Wikipedia article "My Article".
My <a href="https://en.wikiversity.org/wiki/my_wiki_link">my wiki link</a>,
my <a href="https://en.wikiversity.org/wiki/My_article/relative link">relative link</a> and
my <a href="https://de.wikipedia.org/wiki/mein_link">inter-wiki link</a> to 
the German Wikipedia is defined by those links.
  • Depending on options the a-tag may be exported with target="_blank" to open a new window.
  • Inter-wiki links can be encoded by domain:language:article (e.g. w:de:my_article, which is short for wikipedia:de:my_article) to refer to content that is available in a specific language or wiki only (e.g. in the English Wikipedia only). The wiki ID used in wtf_wikipedia combines the language ID and the domain ID. The site map is stored in /src/data/site_map.js and contains all combinatoric options of language and domain.

The mapping of the wiki domain can be separated from the language abbreviation with a hash (associative array):

var domain_map = {};
domain_map["w"] = "wikipedia";
domain_map["wikipedia"] = "wikipedia";
domain_map["Wikipedia"] = "wikipedia";
domain_map["v"] = "wikiversity";
domain_map["wikiversity"] = "wikiversity";
domain_map["Wikiversity"] = "wikiversity";
domain_map["b"] = "wikibooks";
domain_map["wikibooks"] = "wikibooks";
domain_map["Wikibooks"] = "wikibooks";
...

The domain map is an associative array that maps a possible domain prefix of an inter-wiki link to an explicit part of the domain name. The explicit part of the domain name (e.g. wikipedia for the abbreviation w) is necessary to expand relative links to absolute links, especially when a converted wiki document is used outside the Wikipedia or Wikiversity server environment. The relative links [[Swarm Intelligence]] or [[Water]] in Wikiversity do not work anymore there. They must be expanded to https://en.wikiversity.org/wiki/Swarm_Intelligence or https://en.wikiversity.org/wiki/Water. This link conversion can be implemented by a setting in options, e.g. options.absolute_links=true, so that wtf_wikipedia ensures that the relative links still work when the exported file is displayed outside the MediaWiki server context (e.g. Wikipedia or Wikiversity).
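
A minimal sketch of such an expansion, using the domain_map above; the function name and the handling details are assumptions for illustration, not the wikiconvert.js implementation:

// Expand an inter-wiki, relative or plain wiki link to an absolute URL (sketch).
function expandLink(link, langId, domainId, article) {
  // inter-wiki link like "w:de:mein_link" -> domain prefix, language, article
  var parts = link.split(':');
  if (parts.length === 3 && domain_map[parts[0]]) {
    return 'https://' + parts[1] + '.' + domain_map[parts[0]] + '.org/wiki/' + parts[2];
  }
  // relative sub-page link like "/relative link/" -> append to the current article
  if (link.charAt(0) === '/') {
    var sub = link.replace(/^\/|\/$/g, '');
    return 'https://' + langId + '.' + domainId + '.org/wiki/' + article + '/' + sub;
  }
  // plain wiki link like "my wiki link" -> same wiki, underscores for spaces
  return 'https://' + langId + '.' + domainId + '.org/wiki/' + link.replace(/ /g, '_');
}

// expandLink('w:de:mein_link', 'en', 'wikiversity', 'My_Article')
//   -> "https://de.wikipedia.org/wiki/mein_link"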

Defining test cases for the new Format

Test cases are defined in the folder tests/ and have the ending .test.js (e.g. html.test.js for the HTML test cases). Just by naming the file with the ending .test.js, the test will be included in the NPM test call npm run test. Desired output can be generated for different formats with the PanDoc-Try web interface. Select MediaWiki as input format in the PanDoc-Try web interface and select the new format as output format (e.g. Reveal for a web-based presentation or Open Document Format to generate LibreOffice files based on a template file with all your styles).

Offline Use of Exported File

Media files like

  • images,
  • audio,
  • video files

can be displayed offline (without internet connectivity) if and only if the media files are stored locally on the device as well. The command line tool wget can be used for downloading the media files to the device. The files can be stored in corresponding subfolders (e.g. of the generated HTML file), for example in a subfolder export/my_html/:

  • export/my_html/images,
  • export/my_html/audio,
  • export/my_html/video

The selection of the subdirectory can be done with the following function, which checks the extension of the file and derives the subdirectory name from it:
function getExtensionOfFilename(pFilename) {
  // returns the file extension of a path or URL,
  // e.g. re.exec("/path.file/project/output.dzslides.html")[1] returns "html"
  var re = /(?:\.([^.]+))?$/;
  return re.exec(pFilename)[1];
}

function getMediaSubDir(pMediaLink) {
  var vExt = getExtensionOfFilename(pMediaLink);
  var vSubDir = "images";
  switch (vExt) {
    case "wav":
    case "mp3":
    case "mid":
      vSubDir = "audio";
      break;
    case "ogg":
    case "webm":
      vSubDir = "video";
      break;
    default:
      vSubDir = "images";
  }
  return vSubDir;
}
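
A short usage sketch (the media URL is a made-up example): derive the local target path for a media file before downloading it, e.g. with wget:

// derive the local subfolder from the file extension
var mediaLink = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.ogg';
var fileName = mediaLink.split('/').pop();                 // "Example.ogg"
var localPath = 'export/my_html/' + getMediaSubDir(fileName) + '/' + fileName;
console.log(localPath);                                    // "export/my_html/video/Example.ogg"
// download it into that folder, e.g.: wget -P export/my_html/video <mediaLink>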

Create Office Documents

If you try PanDoc document conversion, the key to generating Office documents is the export format ODF. LibreOffice can load and save the OpenDocument Format, and LibreOffice can also load and save Microsoft Office formats. So exporting to the Open Document Format is a good option to start with in wtf_wikipedia. The following descriptions summarize aspects that support developers in bringing the Office export format e.g. to a web-based environment like the ODF editor. The OpenDocument Format provides a comprehensive way forward for wtf_wikipedia to exchange documents from a MediaWiki source text reliably and effortlessly across different formats, products and devices. Regarding the different wikis of the Wikimedia Foundation as a content sink, e.g. the educational content in Wikiversity is no longer restricted to a single export format (like PDF), which opens up access to other specific editors, products or vendors for all your needs. With wtf_wikipedia and an ODF export format the users have the opportunity to choose the 'best fit' application for the wiki content. This section focuses on Office products.

Open Document Format ODT

Some important information to support Office documents in the future:

  • See WebODF for how to edit ODF documents on the web or display slides. A current limitation of WebODF is that it does not render mathematical expressions, but altering a document in the WebODF editor does not remove the mathematical expressions from the ODF file (state 2018/04/07). The missing rendering of mathematical expressions may be solved in the WebODF editor by using MathJax or KaTeX in the future.
  • The ODT format is the default export format of LibreOffice/OpenOffice. Supporting the Open Community Approach, open-source office products are used to avoid commercial dependencies for using the generated Office documents.
    • The ODT-Format of LibreOffice is basically a ZIP-File.
    • Unzip shows the folder structure within the ZIP-format. Create a subdirectory e.g. with the name zipout/ and call unzip mytext.odt -d zipout (Linux, MacOSX).
    • The main text content is stored in content.xml as the main file for defining the content of Office document
    • Remark: Zipping the folder content again will create a parsing error when you load the re-zipped office document in LibreOffice. This may be caused by an inappropriate order of entries in the generated ZIP file: the file mimetype must be the first file in the ZIP archive.
    • The best way to generate ODT files is to create an ODT template mytemplate.odt with LibreOffice, containing all the styles you want to apply to the document, and to place a marker at the specific content areas where you want to insert the content cross-compiled with wtf_wikipedia into content.xml. The file content.xml is then updated in the ODT ZIP file. Marker replacement is also possible in ODF files (see also the WebODF demos).
    • Images must be downloaded from the MediaWiki (e.g. with an NPM equivalent of wget for fetching the image, audio or video) and added to the folder structure in the ZIP. Create an ODT file with LibreOffice that contains an image and unzip the ODT file to learn how ODT stores the image in the ODT ZIP file.
  • JSZip: JSZip can be used to update and add certain files in a given ODT template (e.g. mytemplate.odt). Handling ZIP files is needed for a cross-compilation web app with wtf_wikipedia that runs in your browser and generates an editor environment for the cross-compiled wiki source text (like the WebODF editor). Updating the ODT template as a ZIP file can be handled with JSZip by replacing the content.xml in the ZIP archive (see the sketch after this list). content.xml can be generated with wtf_wikipedia once the ODF export format is added to /src/output/odf (ToDo: please create a pull request if you have done that).
  • LibreOffice Export: Loading ODT files in LibreOffice allows exporting the ODT format to
    • Office documents doc- and docx-format,
    • Text files (.txt),
    • HTML files (.html),
    • Rich Text files (.rtf),
    • PDF files (.pdf) and even
    • PNG files (.png).
  • Planning of the ODT support can be done in this README and the collaborative implementation can be organized with pull requests (PR).
  • Helpful Libraries: node-odt, odt
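
A minimal sketch of the JSZip approach mentioned above (assumptions: Node.js, the npm package jszip together with the built-in fs module, an existing template mytemplate.odt and an already generated content.xml string):

// Replace content.xml inside an ODT template (sketch, not the library's code).
const fs = require('fs');
const JSZip = require('jszip');

async function fillOdtTemplate(templatePath, contentXml, outPath) {
  const data = fs.readFileSync(templatePath);      // read mytemplate.odt
  const zip = await JSZip.loadAsync(data);         // open the ODT as a ZIP archive
  zip.file('content.xml', contentXml);             // replace the main content file
  const buffer = await zip.generateAsync({ type: 'nodebuffer' });
  // note: the ODF spec expects the mimetype entry first and uncompressed
  // (see the remark above) - verify that LibreOffice opens the generated file
  fs.writeFileSync(outPath, buffer);
}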

Word Export with Javascript Libraries

Create directory for new output format

First go to the subdirectory /src/output. We will show how a new export format can be added to wtf_wikipedia. Create a new subdirectory (e.g. /src/output/latex) to support a new export format. Copy the files

  • index.js,
  • infobox.js,
  • sentence.js,
  • table.js,
  • math.js (not supported in all formats of <2.6.1 - see ToDo)

from the subdirectory /src/output/html into the new subdirectory for the export format (e.g. /src/output/latex). Adapt these functions step by step, so that the exported code generates the sentences and tables in the appropriate syntax of the new format.

At the very end of the file /src/output/latex/index.js the new export function is defined. Alter the method name

const toHtml = function(str, options) {
  ....
}

to a method name of the new export format (e.g. for LaTeX the method name toLatex)

const toLatex = function(str, options) {
  ....
}

The code of this method can be reused in most cases (no alteration necessary).

Add the new output format as method

The new output format can be exported by wtf_wikipedia if a method is added to the file index.js. A new require command must be added next to the require commands of the other export formats that are already integrated in wtf_wikipedia:

const markdown = require('./output/markdown');
const html     = require('./output/html');
const latex    = require('./output/latex');

After adding the last line for the new export format, the code for cross compilation to LaTeX is available in the variable latex. The last step is to add the latex output format to the module exports. Therefore the method for the new output format must be added to the export hash of wtf_wikipedia at the very end of index.js by adding the line latex: latex, to the export hash:

module.exports = {
  fetch: fetch,
  plaintext: plaintext,
  markdown: markdown,
  html: html,
  latex: latex,
  version: version,
  custom: customize,
  parse: (str, obj) => {
    obj = obj || {};
    obj = Object.assign(obj, options); //grab 'custom' persistent options
    return parse(str, obj);
  }
};