# Customizing ConcordanceCrawler
## Simple usage
How can you customize ConcordanceCrawler? The easiest way is to use the `LoggingCrawler` class:
```python
from ConcordanceCrawler.core.logging_crawler import LoggingCrawler

crawler = LoggingCrawler()
```
It gives you an infinite generator of concordances. This class logs the progress of crawling, catches exceptions and errors that occur while visiting links, logs them and continues with the other links.
You can also use the `ConcordanceCrawler.core.concordance_crawler.ConcordanceCrawler` class, which only crawls, without logging and error handling.
Once you create your `ConcordanceCrawler` or `LoggingCrawler` object, you can configure it to your own needs and then crawl concordances as in the following snippet. It will crawl 20 concordances with the word *Praha*.
```python
for _, x in zip(range(20), crawler.yield_concordances(['Praha'])):
    print(x)
```
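Because `yield_concordances` returns an infinite generator, you can equivalently take a fixed number of concordances with `itertools.islice`; a small sketch assuming the same `crawler` object as above:

```python
from itertools import islice

# Take exactly 20 concordances of 'Praha' from the infinite generator.
for concordance in islice(crawler.yield_concordances(['Praha']), 20):
    print(concordance)
```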
## Example code

Click here to see an example of using a customized ConcordanceCrawler.
## Documentation
### class LoggingCrawler
You may use its "public" methods:
- `yield_concordances(words)` -- returns a generator object yielding concordances
  - `words`: a nonempty list of strings. They are the words whose concordances should be crawled. It should be a single word, or a single word together with its other forms. The first one is considered the dictionary form; links will be searched only with this form (see the sketch after this list).
- `setup(**kwargs)` -- sets up the configurable parts of (Logging|Concordance)Crawler
  - `**kwargs`: key=value pairs; keys are configurable attribute names of (Logging|Concordance)Crawler, values are their new values.
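A minimal sketch of `yield_concordances` with a word and some of its other forms; the word list here is only illustrative:

```python
from ConcordanceCrawler.core.logging_crawler import LoggingCrawler

crawler = LoggingCrawler()

# 'go' is taken as the dictionary form and used for finding links;
# the remaining items are treated as its other forms (illustrative list).
words = ['go', 'goes', 'went', 'going']

for _, concordance in zip(range(5), crawler.yield_concordances(words)):
    print(concordance)
```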
### How to set up your crawler
A trivial example:

```python
crawler = ConcordanceCrawler()
crawler.setup(filter_link=lambda _: True)
```

This replaces the filter that every link must pass before it is visited with a trivial filter that lets everything through.
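For a slightly less trivial sketch, you could restrict crawling to a single domain via `filter_link`; the domain and function name below are only illustrative:

```python
from urllib.parse import urlparse

def only_wikipedia(url):
    # Accept only links pointing to en.wikipedia.org (illustrative choice).
    return urlparse(url).netloc.endswith('en.wikipedia.org')

crawler.setup(filter_link=only_wikipedia)
```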
Here is a complete description of the configurable attributes. Pass them as keyword arguments to the `setup` method. All of them (except the first two) are functions. Their expected behavior in ConcordanceCrawler is as follows (a combined example customizing several of them follows the list):
- `allow_logging` -- boolean value; if set to True, your LoggingCrawler will log the progress of crawling. It's useful to see whether your crawler is working or is stuck. Default value is True.
- `bazgen` -- bazword generator object. It's used for generating links by the default `get_links` function. It's an object inheriting from the `AbstractBazwordGenerator` class in the `ConcordanceCrawler.core.bazwords` module.
- `get_links(keyword)` -- gets a keyword, returns a list of links. A link is a string.
  By default it finds them on Bing.com using the bazword generator.
- `filter_link(url)` -- filters the links that will be visited.
  By default it excludes links ending in '.docx', '.doc', '.pdf', '.ppt', '.pptx', '.odt', or '.img', because they probably contain a non-text document. ConcordanceCrawler is not adapted for extraction from these formats.
- `get_raw_html(url)` -- extracts the document at the given url.
  - `url`: the link to download
  - returns: a pair, the document residing at the given url (as a string) and the http response headers (as returned by the `requests` library).

  By default it uses the `requests` library and `stopit` to stop execution if it takes more than 60 seconds (because some documents can be really long). To every request's http header it adds a User-Agent value of some real existing browser. It doesn't catch any errors from `requests`.
- `norm_encoding(document, headers)` -- normalizes the encoding of a raw document, or returns None if that is impossible.
  What is implemented: in fact it only returns the same document or None. If the document is html and there is a metatag specifying an allowed encoding (which is another parameter), or the encoding is given in the http response header, and these two values do not disagree, then the document is returned, None otherwise.
- `predict_format(normed_document)` -- returns the probable format of the document as a string
- `accept_format(format)` -- returns True if the format returned by `predict_format` is acceptable, False otherwise; if it is not acceptable, extraction of the document will be skipped.

  These two methods are not implemented in ConcordanceCrawler and do nothing.
- `get_visible_text(raw_html)` -- assumes that `raw_html` on the input is html. It tries to exclude all html tags and text invisible to humans viewing the document in a web browser.
  - `raw_html`: the raw document extracted from the Internet. We assume it is html text, but sometimes this assumption can be invalid.
  - returns: a string containing only the visible parts of the document
- `language_filter(text)` -- gets a text. Returns True if the language of the text should be accepted, False otherwise.
  Implemented: it filters English texts. It compares the 1-, 2- and 3-grams of the given text with those of a really long English Wikipedia text.
- `sentence_segmentation(text)` -- gets a text, returns a list of the sentences contained in the text.
  Implemented: it uses the `segtok` library improved by one additional rule.
- `sentence_filter(sentence)` -- returns True/False meaning whether the sentence can be included in the corpus or not.
  By default it does nothing and all sentences pass. The optional exclusion of concordances containing any non-ASCII character is done this way.
- `concordance_filtering(sentence, target_words)`
  - `sentence`: the sentence to search in
  - `target_words`: a list of target words
  - returns: None if none of the target words is present in the sentence, or a triple (word, start, end):
    - word is the target word which was found in the sentence
    - start is its starting index in the sentence
    - end is the first index behind the word occurrence in the sentence

  Example: `concordance_filtering('a b c', ['d', 'e', 'b'])` returns `('b', 2, 3)`.

  Implemented: in ConcordanceCrawler it's implemented in two ways, the first uses a simple regular expression and the other uses automatic lemmatizing via `textblob`.
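As a combined sketch of customizing several of these hooks together, the function bodies below are only illustrative replacements, not the library defaults:

```python
import re

from ConcordanceCrawler.core.logging_crawler import LoggingCrawler

def my_filter_link(url):
    # Illustrative: visit only http(s) links that do not point to a PDF.
    return url.startswith(('http://', 'https://')) and not url.endswith('.pdf')

def my_sentence_filter(sentence):
    # Illustrative: accept only reasonably short, ASCII-only sentences.
    return len(sentence) < 200 and sentence.isascii()

def my_concordance_filtering(sentence, target_words):
    # Illustrative: return (word, start, end) for the first target word found
    # as a whole word (case-insensitive), or None if no target word occurs.
    for word in target_words:
        match = re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.IGNORECASE)
        if match:
            return (word, match.start(), match.end())
    return None

crawler = LoggingCrawler()
crawler.setup(
    allow_logging=True,
    filter_link=my_filter_link,
    sentence_filter=my_sentence_filter,
    concordance_filtering=my_concordance_filtering,
)

for _, concordance in zip(range(10), crawler.yield_concordances(['Praha'])):
    print(concordance)
```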