Highlight locators - readmill/API GitHub Wiki
This is the documentation for v1 of the Readmill API, which is deprecated and will be discontinued on 2012-12-16. Please upgrade to v2; the new developer documentation is at developers.readmill.com.
Locators
Data structure
Locators are represented with a JSON structure that might look something like this:
{
position: 0.738,
pre: "i am the text just before the highlighted text",
mid: "i am the highlighted text",
post: "i am the text just after the highlighted text",
xpath: {
start: "//*[@class='starttag']",
end: "/*[@class='endtag']",
},
file_id: "chapter-2"
}
Valid locators
There are 4 locator types:
position: a number representing how far through the book the location is. Valid values are between 0.0 (beginning) and 1.0 (end).
pre/mid/post: the text of the highlight (mid), plus the text preceding the highlight (pre) and the text following the highlight (post). Also referred to in this document as text locators.
mid MUST always be present.
The text should not include any language-specific formatting information (i.e. it should be regular, human-readable text). pre, mid and post will be normalized before being stored on Readmill.
xpath: a pair of XPaths which locate the start and end of the highlight within the book. If xpath is present, its start and end children MUST both be present.
file_id: an ID which can be used by the client to accurately locate the highlight within the book.
A valid locator MUST include the mid locator and MAY (and should) also use one or more of the other locator methods.
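The rules above can be checked on the client before a locator is sent. The following is a minimal sketch only: `validateLocator` is a hypothetical helper, not part of the Readmill API, and the error messages are made up for illustration.

```javascript
// Hypothetical client-side check of the rules listed above.
// Returns an array of problems; an empty array means the locator passes.
function validateLocator(locator) {
  var errors = [];

  // mid MUST always be present.
  if (typeof locator.mid !== 'string' || locator.mid.length === 0) {
    errors.push('mid MUST be present');
  }

  // position, if given, must lie between 0.0 (beginning) and 1.0 (end).
  if (locator.position !== undefined) {
    var inRange = typeof locator.position === 'number' &&
                  locator.position >= 0 && locator.position <= 1;
    if (!inRange) {
      errors.push('position must be a number between 0.0 and 1.0');
    }
  }

  // If xpath is present, both of its children MUST be present.
  if (locator.xpath !== undefined) {
    if (!locator.xpath || !locator.xpath.start || !locator.xpath.end) {
      errors.push('xpath requires both start and end');
    }
  }

  return errors;
}
```

Calling `validateLocator({ mid: "some text", position: 0.738 })` returns an empty array, while a locator without `mid` returns one error.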
Text locators (pre, mid, post)
The text in pre, post and mid is normalized to make it more widely usable. The normalization is a simple, but rather effective, way to enable two versions of a book to share a highlight even if the exact formatting of the books is slightly different.
Consider for example the book Metamorphosis, which can be downloaded for free in several formats from a variety of sources (for example Project Gutenberg).
A highlight made in the HTML version might look like this:
{
mid: "Just by chance one day, rather than any real curiosity, she opened the door to Gregor's room and found herself face to face with him"
}
With this locator we have, of course, a good chance of finding the highlighted text again in the same document:
var textToFind = highlight.mid; // Just by chance one day, rather than any real curiosity, she opened the door to...
var canFindHighlight = document.body.innerText.indexOf(textToFind) > -1; // => true
However, in a reader that uses the txt format that same passage may look like this:
...
structure that made her able to withstand the hardest of things in\n
her long life, wasn't really repelled by Gregor.  Just by chance one\n
day, rather than any real curiosity, she opened the door to Gregor's\n
room and found herself face to face with him.  He was taken totally\n
...
Unless we disregard the extraneous \n characters and the double spaces after each full stop, we won't be able to match our highlighted string even though it's the same book. The same is true for the reverse case: a user who highlights text in the txt format won't be able to view their highlight in an ePub reader showing the HTML version.
For this reason Readmill always normalizes whitespace received in pre, mid and post locators. This ensures a higher level of reusability of the data between different formats and readers, at the cost of a bit more work in the client implementation.
Any app that would like to show highlights should match locator data against a normalized version of the book text. The exact steps needed to normalize text are described below.
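As a rough illustration of the matching described above, the sketch below normalizes both the book text and the locator before comparing them. It reuses the whitespace definition given later in this document; `containsHighlight` is a hypothetical helper name, and the sample passage assumes the txt version uses line breaks and double spaces after full stops.

```javascript
// Whitespace characters as defined later in this document.
var WHITESPACE_CHARS = ' \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000';

// Replace every run of whitespace with a single space.
function normalizeWhitespace(text) {
  var reWhitespace = new RegExp('[' + WHITESPACE_CHARS + ']+', 'g');
  return text.replace(reWhitespace, ' ');
}

// Hypothetical helper: true if the highlight can be found in the book text
// once both sides are normalized.
function containsHighlight(bookText, locator) {
  var normalizedBook = normalizeWhitespace(bookText);
  var normalizedMid = normalizeWhitespace(locator.mid);
  return normalizedBook.indexOf(normalizedMid) > -1;
}

// Passage as it might appear in the txt version.
var txtPassage =
  "her long life, wasn't really repelled by Gregor.  Just by chance one\n" +
  "day, rather than any real curiosity, she opened the door to Gregor's\n" +
  "room and found herself face to face with him.  He was taken totally\n";

var highlight = {
  mid: "Just by chance one day, rather than any real curiosity, she opened the door to Gregor's room and found herself face to face with him"
};

var rawMatch = txtPassage.indexOf(highlight.mid) > -1;          // false: line breaks get in the way
var normalizedMatch = containsHighlight(txtPassage, highlight); // true
```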
Normalizing text for use with text locators
The simple explanation of how text should be normalized to match the data used in text locators is that any sequence of one or more whitespace characters should be replaced with a single space (ASCII 0x20).
Unfortunately the exact definition of "whitespace" is a bit muddy when talking about groups of characters. Almost all regular expression engines have a predefined class for whitespace, \s, but the exact range of characters matched by \s differs between them (see for example JavaScript (Mozilla), JavaScript (IE), Ruby, Perl, iOS).
To ensure that two implementations end up with the exact same result no matter the method (or regular expression engine), we have to agree on exactly which characters we mean when we say "whitespace". At Readmill we chose to include a large set of whitespace characters to filter out as many irrelevant characters as possible.
The precise definition of whitespace used for normalizing highlights is (note the leading space, ASCII 0x20):
[ \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000]
The exact definition shown in a number of languages:
# ruby
WHITESPACE_CHARS = ' \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000'
def normalize_text(text)
  text.gsub(/[#{WHITESPACE_CHARS}]+/, ' ')
end
// javascript
var WHITESPACE_CHARS = ' \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000';
function normalizeWhitespace(text) {
  var reWhitespace = new RegExp('[' + WHITESPACE_CHARS + ']+', 'g');
  return text.replace(reWhitespace, ' ');
}
// Java
String normalizeWhitespace(String text) {
    final String WHITESPACE_CHARS = " \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000";
    // Wrap the characters in a character class so that any run of them is replaced by one space.
    return text.replaceAll("[" + WHITESPACE_CHARS + "]+", " ");
}
// Objective-C
// The escapes are double-backslashed so that the regular expression engine, not the compiler, interprets them.
- (NSString *)stringByNormalizingWhitespace
{
    return [self stringByReplacingOccurrencesOfString:@"[ \\f\\n\\r\\t\\u00a0\\u0020\\u1680\\u180e\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000]+"
                                           withString:@" "
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, [self length])];
}
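A match found in normalized text gives offsets into the normalized string, not into the original book text that the reader displays. One possible way to bridge that gap is to record, while normalizing, where each normalized character came from. This is only a sketch under the assumption that the client works with plain strings; `normalizeWithOffsets` and `locateHighlight` are hypothetical names.

```javascript
// javascript
// Single-character version of the whitespace definition above.
var WHITESPACE_RE = /[ \f\n\r\t\u00a0\u0020\u1680\u180e\u2028\u2029\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000]/;

// Normalizes text and records, for every normalized character,
// the index of the character it came from in the original text.
function normalizeWithOffsets(text) {
  var normalized = '';
  var offsets = [];
  var inWhitespaceRun = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (WHITESPACE_RE.test(ch)) {
      // Only the first character of a whitespace run produces output.
      if (!inWhitespaceRun) {
        normalized += ' ';
        offsets.push(i);
        inWhitespaceRun = true;
      }
    } else {
      normalized += ch;
      offsets.push(i);
      inWhitespaceRun = false;
    }
  }
  return { text: normalized, offsets: offsets };
}

// Returns { start, end } offsets into the original bookText (end exclusive),
// or null if the highlight text cannot be found.
function locateHighlight(bookText, mid) {
  var book = normalizeWithOffsets(bookText);
  var needle = normalizeWithOffsets(mid).text;
  if (needle.length === 0) return null;
  var index = book.text.indexOf(needle);
  if (index === -1) return null;
  return {
    start: book.offsets[index],
    end: book.offsets[index + needle.length - 1] + 1
  };
}
```

The returned range can then be passed to `bookText.slice(start, end)` to recover the passage exactly as it appears in the original file, line breaks included. This sketch only uses `mid`; a client could additionally compare `pre` and `post` when `mid` occurs more than once.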