Dictionary support - koreader/koreader GitHub Wiki

How to look up a word

KOReader supports dictionary lookup in EPUB and PDF/DJVU documents. To look up a word in the dictionary or Wikipedia, press and hold on it. To select several words, for lookup or for other functions, hold and drag across them.

How to install a dictionary

To use the dictionary lookup function, you first need to install one or more dictionaries in the StarDict format.

The StarDict-format dictionary files have suffixes *.idx, *.ifo or *.ifo.gz, *.dict or *.dict.dz.
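
As a quick sanity check before copying files to the device, the suffix list above can be turned into a small script. This is only a sketch in plain Lua, meant to be run on a desktop; the function name missing_stardict_files and the example path are made up for illustration, and none of this is part of KOReader.

```lua
-- Hypothetical helper (not part of KOReader): reports which StarDict
-- components are missing for one dictionary, using the suffix list above.
local REQUIRED = {
    { ".idx" },
    { ".ifo", ".ifo.gz" },
    { ".dict", ".dict.dz" },
}

local function file_exists(path)
    local f = io.open(path, "rb")
    if f then
        f:close()
        return true
    end
    return false
end

-- dir: folder holding the dictionary; base: file name without its suffix.
-- Returns a list describing each component for which no file was found.
local function missing_stardict_files(dir, base)
    local missing = {}
    for _, alternatives in ipairs(REQUIRED) do
        local found = false
        for _, suffix in ipairs(alternatives) do
            if file_exists(dir .. "/" .. base .. suffix) then
                found = true
                break
            end
        end
        if not found then
            missing[#missing + 1] = table.concat(alternatives, " or ")
        end
    end
    return missing
end

-- Example folder and name are assumptions; adjust them to your own files.
for _, what in ipairs(missing_stardict_files("data/dict/mydict", "mydict")) do
    print("missing: " .. what)
end
```

An empty result means at least one file was found for each of the three components; it does not check that the files are valid.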

The dictionaries need to be installed into one of these directories:

/sdcard/koreader/data/dict for Android
/mnt/private/koreader/data/dict for Cervantes
koreader/data/dict for Kindle
.adds/koreader/data/dict for Kobo
applications/koreader/data/dict for PocketBook
$HOME/.config/koreader/data/dict for Linux
$HOME/Library/Application Support/koreader/data/dict for macOS

Since v2020.04 you can override the directory where dictionaries are installed. This is useful to avoid duplicates if your device has more than one application that can read StarDict dictionaries. To do so, add the full path to defaults.custom.lua. For example: STARDICT_DATA_DIR = "/mnt/onboard/.adds/vlasovsoft/dictionary".

Where to find dictionaries

The reader.dict project (formerly "BoboTiG/ebook-reader-dict") provides StarDict versions of daily dumps of Wiktionary monolingual dictionaries for a variety of languages. It also provides non-free multilingual and universal dictionaries.
The WikDict project provides bilingual StarDict dictionaries (download link) based on Wiktionary for many language pairs.
This GitHub repository contains dictionaries based on Wiktionary from many languages to English, including English-English.
The DictInfo website provides outdated monolingual dictionaries based on Wiktionary.
The Firedict site contains a list of freely available dictionaries.
One can convert between different dictionary formats using PyGlossary.
Some freely available dictionaries can be converted to the StarDict format with stardicter. See also wiktionary-to-stardict.
It is also possible to convert dict.cc dictionaries to the StarDict format with dictcc-stardict.
You may also be able to use the DICT files used by the standard dictd daemon, found in the related dict packages that contain .dict files. Those files can be converted to the StarDict format using the /usr/lib/stardict-tools/dictd2dic command provided in the stardict-tools package, although it seems to fail to create the necessary metadata files such as the .ifo file.
You can download dictionaries from the internet within KOReader as shown here.
Fictionaries provides dictionaries for various speculative fiction books and series.

HTML encoding within StarDict dictionaries supported

You can use HTML encoded dictionaries, as described here.

Also, dictionaries can be tweaked with a custom CSS file, as described here and here. You can find sample files showing how to tweak them here. And some more discussion can be found here.

MuPDF is used to render the HTML dictionary results. If MuPDF fails to render the HTML, KOReader falls back to stripping the tags (keeping line feeds) and passes the result back to MuPDF.

KOReader can't easily fix up broken HTML, but you can add a .lua file in the dict directory with code to tweak the output before it is fed to MuPDF.

You need to be at ease with Lua, or you can just hack the samples @poire-z created for some French dictionaries. More details in #3585 (and #3606, #3611).

To strip inline CSS

You can strip inline CSS (or, more simply, keep MuPDF from interpreting it) with something like the following in <dictfilename>.lua:

return function(html)
    -- Rename every style attribute (in any letter case) so MuPDF ignores it.
    -- Case-sensitive variant: html = html:gsub(' style=', ' zzztyle=')
    html = html:gsub(' [Ss][Tt][Yy][Ll][Ee]=', ' zzztyle=')
    return html
end
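
To see what this substitution does before putting it on the device, it can be tried in a stock Lua interpreter on a desktop. The sketch below wraps it in a local function (strip_inline_css is a name chosen here for illustration) and runs it on a made-up HTML fragment:

```lua
local function strip_inline_css(html)
    -- Rename the attribute so the renderer no longer treats it as CSS.
    html = html:gsub(' [Ss][Tt][Yy][Ll][Ee]=', ' zzztyle=')
    return html
end

-- Made-up sample with the attribute in two different letter cases.
local before = '<p STYLE="color:red">unusual</p><span style="font-size:9px">adj.</span>'
print(strip_inline_css(before))
```

Both attributes come out renamed to zzztyle, while the tags and their text stay untouched.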

An example of how to apply this on a dictionary

Edit the .ifo file in the dictionary folder. It should contain a sametypesequence parameter; for CSS stripping to work, it must be set to sametypesequence=h.
Keep in mind that CSS stripping is a very powerful tool which can lead to enormous substitutions. To play it safe, check the output of the StarDict binary (sdcv) to find out what tags are used in the HTML layout. For example, from SSH or a terminal on the device, go to the koreader/ directory and call ./sdcv -02 data/dict quaint, where data/dict is the dictionary folder and quaint is the search query. The output should look like this:

[root@kindle koreader]# ./sdcv -02 data/dict/ quaint
Found 2 items, similar to quaint.
-->Longman Dictionary of Contemporary English 5th Ed. (En-En)
-->quaint

<k>quaint</k>
<c c="blue"><b>quaint</b></c> /kweɪnt/ <abr>BrE</abr> <rref>bre_quaint0205.wav</rref> <abr>AmE</abr> <rref>ame_quaint.wav</rref><i><c> adjective</c></i>
<blockquote><blockquote>[<c c="lightcoral">Date: </c><c c="darkgray">1100-1200</c>; <c c="lightcoral">Language: </c><c c="darkgray">Old French</c>; <c c="lightcoral">Origin: </c><c c="darkgray">cointe</c><c c="darkgray"> </c><i><c c="lightseagreen">&apos;clever&apos;</c></i><c c="darkgray">, from </c><c c="darkgray">Latin</c><c c="darkgray"> </c><c c="darkgray">cognitus</c><c c="darkgray"> </c><i><c c="lightseagreen">&apos;known&apos;</c></i>]</blockquote></blockquote>
<blockquote><blockquote> unusual and attractive, especially in an old-fashioned way: </blockquote></blockquote>
<blockquote><blockquote><blockquote><blockquote>  <rref>exa_p008-000464505.wav</rref> <ex>a quaint little village in Yorkshire</ex></blockquote></blockquote></blockquote></blockquote>

Several things can be learned from the output. One: the main tag for paragraphs is <blockquote>. Two: the main tag for colored text is <c c="color">, which is not standard CSS coloring. Moreover, the colors are written out as names instead of hex RGB values, so they might be completely ignored by KOReader. Three: there are references to .wav sound files, which are of no use in KOReader. In dictionary applications that support such references, they are shown as small speaker icons that act as buttons to play the sound. In KOReader, however, they are rendered as plain text exactly as in the HTML source, e.g. bre_quaint0205.wav. Four: the query word is repeated in the <k> tag.
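
To take stock of the markup before writing any substitutions, the tag names in a definition can be counted with a few lines of Lua. This is a sketch to run on a desktop against text copied from the sdcv output; count_tags is a hypothetical helper name, not a KOReader function, and the sample string is a shortened piece of the output above.

```lua
-- Hypothetical helper: counts opening tags by name in an HTML-like string.
local function count_tags(html)
    local counts = {}
    -- "<" followed by a letter starts an opening tag; closing tags
    -- begin with "</" and are therefore skipped.
    for name in html:gmatch("<(%a[%w_%-]*)") do
        counts[name] = (counts[name] or 0) + 1
    end
    return counts
end

local sample = '<k>quaint</k><c c="blue"><b>quaint</b></c> '
    .. '<rref>ame_quaint.wav</rref><i><c> adjective</c></i>'
local counts = count_tags(sample)

-- Print in alphabetical order so the listing is stable.
local names = {}
for name in pairs(counts) do
    names[#names + 1] = name
end
table.sort(names)
for _, name in ipairs(names) do
    print(name, counts[name])
end
```

Any tag in the listing that is not ordinary HTML (here k, c and rref) is a candidate for a substitution in the dictionary's .lua file.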

After you figure out what you would like to replace, create a .lua file with exactly the same name as the .ifo file (before the file extension). Here is an example of such a file. It replaces the color markup with standard inline CSS colors, replaces .wav references with a Unicode speaker icon (to distinguish sound examples from the word explanation), removes the <k> tag together with the word inside it, and makes sure images point to the right path, relative to the ...koreader/data/dict/DICTNAME/res/ directory.

return function(html)
    -- Replace .wav sound references with a speaker icon.
    html = html:gsub('<rref[^>]*>[^<]*%.wav</rref>', '🔊')
    -- Drop the <k> tag that repeats the query word.
    html = html:gsub('<k[^>]*>[^<]*</k>', '')
    -- Turn <c> color markup into <span> elements with inline CSS.
    html = html:gsub('<c>', '<span>')
    html = html:gsub('</c>', '</span>')
    html = html:gsub('<c c="', '<span style="color:')
    -- Map named colors to hex RGB values.
    html = html:gsub('"color:indigo"', '"color:#4B0082"')
    html = html:gsub('"color:darkgray"', '"color:#A9A9A9"')
    html = html:gsub('"color:lightcoral"', '"color:#F08080"')
    html = html:gsub('"color:lightseagreen"', '"color:#20B2AA"')
    html = html:gsub('"color:darkgoldenrod"', '"color:#B8860B"')
    -- Turn the remaining <rref> references (.jpg images) into <img> tags.
    html = html:gsub('<rref[^>]*>', '<img src="/')
    html = html:gsub('%.jpg</rref>', '.jpg">')
    return html
end
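
The per-color lines above can also be driven by a lookup table, which makes it easier to add colors as you come across them in a dictionary. The following is a sketch of that variant, not the filter used for this example dictionary; the COLORS table and the convert_colors name are illustrative, the hex values are the ones used above, and color names missing from the table are passed through unchanged.

```lua
-- Named colors and their hex RGB values; extend as needed.
local COLORS = {
    indigo = "#4B0082",
    darkgray = "#A9A9A9",
    lightcoral = "#F08080",
    lightseagreen = "#20B2AA",
    darkgoldenrod = "#B8860B",
}

local function convert_colors(html)
    -- <c c="name"> becomes <span style="color:#hex">; unknown names are kept.
    html = html:gsub('<c c="([^"]*)">', function(name)
        return '<span style="color:' .. (COLORS[name:lower()] or name) .. '">'
    end)
    -- A bare <c> and every </c> map to plain span tags.
    html = html:gsub('<c>', '<span>')
    html = html:gsub('</c>', '</span>')
    return html
end

print(convert_colors('<c c="darkgray">Old French</c><i><c> adjective</c></i>'))
```

To use this in a dictionary's .lua file, the file would need to end by returning a function, for example by calling convert_colors from inside the returned function alongside the other substitutions.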

If you want to tweak the text output with CSS, create a .css file with the same name as the .ifo and .lua files (before the file extension). For this particular example, the CSS file looks like:

blockquote{
    margin-left: 1.0rem;
    margin-right: 0.5rem;
    text-align: justify;
}

Here is a screenshot of how it was before with sametypesequence=x by default, and after making it sametypesequence=h and adding .lua and .css:

Dictionary lookups in scanned pages

KOReader has a built-in OCR engine for recognizing words in scanned PDF/DJVU pages. To use OCR on scanned pages, you need to install the appropriate Tesseract trained data set and, if your language is other than English or Chinese, add new document languages to koreader/defaults.custom.lua.

Download language data files for Tesseract 4.00+ and copy the appropriate language data file (e.g. eng.traineddata in the tesseract-fast repository for English and spa.traineddata for Spanish) into koreader/data/tessdata.
To add new languages, open koreader/defaults.custom.lua and add languages via their ISO 3-letter code (important, this needs to match the training data filename!) to the DKOPTREADER_CONFIG_DOC_LANGS_CODE array:

DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "chi_sim"}    -- language code, make sure you have corresponding training data

For example, the code for Kazakh is kaz and the code for Russian is rus. If you are unsure of the code for your language, look at the tessdata filenames first.
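
For instance, assuming kaz.traineddata and rus.traineddata have already been copied into koreader/data/tessdata, the entry in defaults.custom.lua might look like this. Keeping the two default codes is an assumption made for this sketch; drop any you do not need.

```lua
-- Each code must match a <code>.traineddata file in koreader/data/tessdata.
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "chi_sim", "kaz", "rus"}
```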

If you've never customized any advanced settings before, the file will not exist. In that case, just follow the directions in the next paragraph: any modified entries will appear in bold and will automatically be added to the file on exit (this also helps make sure that the file is syntactically sound).

If you don't need to add new entries, and simply want to modify the existing ones, you can also go to Tools > More tools > Advanced settings in the file-manager's top menu, and find the DKOPTREADER_CONFIG_DOC_LANGS_CODE entry there.

The Forced OCR option makes KOReader ignore any built-in text layer that comes with a PDF/DJVU document and use only OCR with the installed tessdata instead.

Sorting how dictionaries are displayed

You can configure the order of dictionaries in the interface below.

Tap the name of a dictionary (not the checkbox) to select it; you can then move it up or down using the buttons at the bottom of the screen.

More info can be found here.

Tips and tricks

To look up a word in the dictionary, press and hold on the word. If you press and hold for more than 3 seconds, it will open a menu with more options, as described here.

The dictionary supports a history of searched words, accessible through the menu. More info can be found here (with images).

You can cancel any search with a tap. More on this here.