Getting Started
Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
- URL parsing
- Document content extraction (data mining etc)
- Recursive website crawling (indexing, statistical analysis etc)
Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
Check out this demo search engine - built using Wgit, Sinatra and MongoDB - deployed to fly.io. Try searching for something that's Ruby related like "Matz" or "Rails".
Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
Currently, the supported range of MRI Ruby versions is:
ruby '~> 3.0'
i.e. Ruby 3.0 and above, up to but not including Ruby 4.0. Wgit will probably work fine with older versions, but it's best to upgrade if possible.
$ bundle add wgit
$ gem install wgit
$ wgit
Calling the installed executable will start a REPL session.
require 'wgit'
crawler = Wgit::Crawler.new # Uses Typhoeus -> libcurl underneath. It's fast!
url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
doc.class # => Wgit::Document
doc.class.public_instance_methods(false).sort # => [
# :==, :[], :at_css, :at_xpath, :author, :author=, :base, :base=, :base_url, :content, :css,
# :description, :description=, :empty?, :external_links, :extract, :html, :inspect,
# :internal_absolute_links, :internal_links, :keywords, :keywords=, :links, :links=, :nearest_fragment,
# :no_index?, :parser, :score, :search, :search!, :size, :stats, :text, :text=, :title, :title=, :to_h,
# :to_json, :url, :xpath
# ]
doc.url # => "https://wikileaks.org/What-is-Wikileaks.html"
doc.title # => "WikiLeaks - What is WikiLeaks"
doc.stats # => {
# :url=>44, :html=>28133, :title=>17, :keywords=>0,
# :links=>35, :text=>67, :text_bytes=>13735
# }
doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
doc.text # => ["The Courage Foundation is an international organisation that <snip>", ...]
results = doc.search 'corruption' # Searches doc.text for the given query.
results.first # => "ial materials involving war, spying and corruption.
# It has so far published more"
Below are some practical examples of Wgit in use.
See the Wgit::Indexer#index_www documentation and source code for an already-built example of a WWW HTML indexer. It will crawl any external URLs (in the database) and index their HTML for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it!
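As a rough sketch (assuming a configured MongoDB instance, as covered in the Database Example below, and that Wgit::Indexer.new accepts a database instance), kicking off the indexer yourself might look something like this:
require 'wgit'
# A rough sketch only - it assumes the database has already been configured
# (see the Database Example below) and that Wgit::Indexer.new accepts a database instance.
Wgit::Database.adapter_class = Wgit::Database::MongoDB
db = Wgit::Database.new # Uses ENV['WGIT_CONNECTION_STRING'] by default.
indexer = Wgit::Indexer.new(db)
indexer.index_www # Crawls and indexes the WWW until you stop it (or it runs out of URLs).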
See the Database Example for information on how to configure a database for use with Wgit.
Wgit uses itself to download and save webpages to disk (used in tests). See the script here and edit it for your own purposes.
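As a rough sketch of the same idea (the URL and output filename below are illustrative only), saving a crawled page's HTML to disk might look like:
require 'wgit'
# Illustrative only: crawl a page and write its raw HTML to a local file.
crawler = Wgit::Crawler.new
doc = crawler.crawl Wgit::Url.new('https://example.com')
File.write 'example.com.html', doc.html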
The broken_link_finder gem uses Wgit under the hood to find and report a website's broken links. Check out its repository for more details.
This gist takes a URL, crawls it and extracts the meaningful content without annoyances like cookie banners, popups, ads etc. It writes a clean HTML file for you to open in any browser.
Perform a Bing search (using the q URL param) and index the query and results to a database for future processing. This example uses the Wgit::DSL for convenience, but you can use the API to do the same thing if you'd prefer.
require 'wgit'
include Wgit::DSL
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
start 'https://www.bing.com/search?q=aid+workers'
extract :query, "//input[@id='sb_form_q']/@value"
extract(
:results, "//li[@class='b_algo']",
singleton: false, text_content_only: false
) do |results|
# Map the results into a preferable format.
results.map do |result|
{
title: result.at_xpath('./h2/a').content,
url: result.at_xpath('./div/div/cite').content,
text: result.at_xpath('./div/p').content,
}
end
end
index
The below script downloads the contents of the first CSS link found on Facebook's index page.
require 'wgit'
require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
crawler = Wgit::Crawler.new
doc = crawler.crawl 'https://www.facebook.com'.to_url
# Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
href = doc.at_xpath "//link[@rel='stylesheet']/@href"
href.content # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
href.class # => Nokogiri::XML::Attr
css = crawler.crawl href.content.to_url
css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
The below script downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketers for search engine optimisation (SEO), for example.
require 'wgit'
require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
my_pages_missing_keywords = []
competitor_urls = [
'http://altitudejunkies.com',
'http://www.mountainmadness.com',
'http://www.adventureconsultants.com'
].to_urls
crawler = Wgit::Crawler.new
crawler.crawl(*competitor_urls) do |doc|
# If there are keywords present in the web document.
if doc.keywords.respond_to? :-
puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
end
end
if my_pages_missing_keywords.empty?
puts 'Your pages are missing no keywords, nice one!'
else
puts 'Your pages compared to your competitors are missing the following keywords:'
puts my_pages_missing_keywords.uniq
end
The next example requires a configured database instance. The use of a database with Wgit is entirely optional, however, and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database, for searching etc.).
Several different DBMSs can be used with Wgit through the use of an "adapter" class. See this wiki article for more information.
Wgit employs a simple data model which is database agnostic:
Collection | Purpose | Wgit Class |
---|---|---|
urls | Stores URLs to be crawled at a later date | Wgit::Url |
documents | Stores web documents after they've been crawled | Wgit::Document |
Wgit provides an in-memory DB, mainly used for testing and experimenting with. Using the Wgit::Database::InMemory adapter class is very simple:
require 'wgit'
Wgit::Database.adapter_class = Wgit::Database::InMemory
# Use Wgit...
The Wgit::Database::InMemory class contains the source code for this database adapter.
The default DB that Wgit will use is MongoDB. The rest of this example will assume you're using MongoDB.
See MongoDB Atlas for a (small) free account or provide your own MongoDB instance. Take a look at the mongo-wgit Docker image for an already-configured example database, the source of which can be found in the ./docker directory of this repository.
The name of the database can be anything you like, but remember to correctly identify the database using its name in the connection string.
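For example, a hypothetical connection string for a local MongoDB instance, where the database is named 'crawler', could be set via the environment variable that Wgit reads by default:
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'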
The Wgit::Database::MongoDB class contains the source code for this database adapter.
The following versions of MongoDB are currently supported (but older versions will likely work just fine):
Gem | Database |
---|---|
~> 2.19 | >= 4.0 |
Running the following Wgit code will programmatically configure your database:
# By default Wgit will use the MongoDB class but we set it here explicitly for completeness.
Wgit::Database.adapter_class = Wgit::Database::MongoDB
db = Wgit::Database.new '<connection_string>'
# Mongo requires certain indexes to exist for searching, so we create them (once).
db.create_collections
db.create_unique_indexes
Wgit::Model.set_default_search_fields(db) # Creates a search index in Mongo using the Wgit::Model search fields.
Or take a look at the mongo_init.js file for the equivalent JavaScript commands.
Note: The text search index lists all document fields to be searched by MongoDB when calling Wgit::Database#search. Therefore, you should append to this list any other fields that you want searched. For example, if you extend the API, then you might want to search your new fields in the database by adding them to the index above. This can be done programmatically with:
# We set the search fields via the Wgit::Model passing the db param so Wgit's search methods abide by these fields too.
Wgit::Model.set_search_fields({ my_field: 2, ... }, db) # my_field has a search weight/priority of 2.
# OR
Wgit::Model.set_search_fields([:my_field, ...], db) # With a default weight of 1.
The following script demonstrates how to use Wgit to index and then search HTML documents stored in a database. If you're running the code for yourself, remember to replace the database connection string with your own.
require 'wgit'
### CONNECT TO THE DATABASE ###
Wgit::Database.adapter_class = Wgit::Database::MongoDB
# In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
db = Wgit::Database.new '<connection_string>'
### SEED SOME DATA ###
# Here we create our own document rather than crawling the web (which works in the same way).
# We provide the web page's URL and HTML Strings.
doc = Wgit::Document.new(
'http://test-url.com',
"<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
)
db.upsert doc
### SEARCH THE DATABASE ###
# Searching the database returns Wgit::Document's which have fields containing the query.
query = 'cow'
results = db.search query
# By default, the MongoDB ranking applies i.e. results.first has the most hits.
# Because results is an Array of Wgit::Document's, we can custom sort/rank e.g.
# `results.sort_by! { |doc| doc.url.crawl_duration }` ranks via page load times with
# results.first being the fastest. Any Wgit::Document attribute can be used, including
# those you define yourself by extending the API.
top_result = results.first
top_result.class # => Wgit::Document
doc.url == top_result.url # => true
### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
# Searching each result gives the matching text snippets from that Wgit::Document.
top_result.search(query).first # => "How now brown cow."
### SEED URLS TO BE CRAWLED LATER ###
db.upsert top_result.external_links
urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_links.
Document serialising in Wgit is the means of downloading a web page and serialising parts of its content into accessible Wgit::Document attributes/methods. For example, Wgit::Document#author will return the webpage's xpath value of meta[@name='author'].
There are two ways to extend the Document serialising behaviour of Wgit for your own means:
- Add additional textual content to Wgit::Document#text.
- Define Wgit::Document instance methods for specific HTML elements.
Below describes these two methods in more detail. Some of this functionality is also covered in the How To Extract Content article.
Wgit contains a set of Wgit::HTMLToText.text_elements defining which HTML elements contain text on a page; this text is in turn serialised. Once serialised, you can process the text content via methods like Wgit::Document#text and Wgit::Document#search etc.
The below code example shows how to extract additional text from a webpage:
require 'wgit'
# The default text_elements cover most visible page text but let's say we
# have a <table> element with text content that we want.
Wgit::HTMLToText.text_elements[:table] = :block # or :inline
doc = Wgit::Document.new(
'http://some_url.com',
<<~HTML
<html>
<p>Hello world!</p>
<table>My table</table>
</html>
HTML
)
# Now every crawled Document#text will include <table> text content.
doc.text # => ["Hello world!", "My table"]
doc.search('table') # => ["My table"]
Note: This only works for textual page content. For more control over the serialised elements themselves, see below.
Wgit provides some default extractors to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next.
Therefore, you can define a Document extractor for each HTML element (or elements) that you want to extract and serialise into a Wgit::Document instance variable, equipped with a getter and setter method. Once an extractor is defined, all subsequently crawled Documents will contain your extracted content.
Here's how to add a Document extractor to serialise a specific page element:
require 'wgit'
# Let's get all the page's <table> elements.
Wgit::Document.define_extractor(
:tables, # Wgit::Document#tables will return the page's tables.
'//table', # The xpath to extract the tables.
singleton: false, # True returns the first table found, false returns all.
text_content_only: false, # True returns the table text, false returns the Nokogiri object.
) do |tables|
# Here we can inspect/manipulate the tables before they're set as Wgit::Document#tables.
tables
end
# Our Document has a table which we're interested in. Note it doesn't matter how the Document
# is initialised e.g. manually (as below) or via Wgit::Crawler methods etc.
doc = Wgit::Document.new(
'http://some_url.com',
<<~HTML
<html>
<p>Hello world! Welcome to my site.</p>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Socrates</td><td>101</td></tr>
<tr><td>Plato</td><td>106</td></tr>
</table>
<p>I hope you enjoyed your visit :-)</p>
</html>
HTML
)
# Call our newly defined method to obtain the table data we're interested in.
tables = doc.tables
# Both the collection and each table within the collection are plain Nokogiri objects.
tables.class # => Nokogiri::XML::NodeSet
tables.first.class # => Nokogiri::XML::Element
# Note, the Document's stats now include our 'tables' extractor.
doc.stats # => {
# :url=>19, :html=>242, :links=>0, :text=>8, :text_bytes=>91, :tables=>1
# }
See the Wgit::Document.define_extractor docs for more information.
Extractor Notes:
- It's recommended that extracted URLs be mapped into Wgit::Url objects. Wgit::Url's are treated as Strings when being inserted into the database. See the sketch after this list for an example.
- A Wgit::Document extractor (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. String, Array etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure the data you want inserted can be inserted.
- Once inserted into the Database, you can search a Wgit::Document's extractor attributes by updating the Wgit::Model's text search index. See the Database Example for more information.
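As a hedged sketch of the first note above (the :image_sources extractor name and its xpath are illustrative, not part of Wgit), extracted Strings can be mapped into Wgit::Url objects inside the extractor block:
require 'wgit'
# Illustrative only: an :image_sources extractor whose block maps each extracted
# src String into a Wgit::Url object (&. guards against pages with no matches).
Wgit::Document.define_extractor(
  :image_sources, '//img/@src',
  singleton: false, text_content_only: true
) do |srcs|
  srcs&.map { |src| Wgit::Url.new(src) }
end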
Below are some general points to keep in mind when using Wgit:
- All absolute Wgit::Url's must be prefixed with an appropriate protocol e.g. https:// etc.
- By default, up to 5 URL redirects will be followed; this is configurable however (see the sketch after this list).
- IRIs (URLs containing non-ASCII characters) are supported and will be normalised/escaped prior to being crawled.
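For instance, a minimal sketch of tightening the redirect limit (assuming Wgit::Crawler exposes a redirect_limit option) might look like:
require 'wgit'
# Assumption: Wgit::Crawler accepts a redirect_limit keyword argument.
crawler = Wgit::Crawler.new redirect_limit: 2 # Follow at most 2 redirects per crawl.
doc = crawler.crawl Wgit::Url.new('https://example.com')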