Getting Started
Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
- URL parsing
- Document content extraction (data mining etc)
- Recursive website crawling (indexing, statistical analysis etc)
Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
Check out this demo search engine - built using Wgit, Sinatra and MongoDB - deployed to fly.io. Try searching for something that's Ruby related like "Matz" or "Rails".
Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
Currently, the supported range of MRI Ruby versions is:
ruby '~> 3.0'
i.e. Ruby 3.0 and above, up to but not including Ruby 4.0. Wgit will probably work fine with older versions, but it's best to upgrade if possible.
$ bundle add wgit
$ gem install wgit
$ wgit
Calling the installed executable will start a REPL session.
require 'wgit'
crawler = Wgit::Crawler.new # Uses Typhoeus -> libcurl underneath. It's fast!
url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
doc.class # => Wgit::Document
doc.class.public_instance_methods(false).sort # => [
# :==, :[], :at_css, :at_xpath, :author, :author=, :base, :base=, :base_url, :content, :css,
# :description, :description=, :empty?, :external_links, :extract, :html, :inspect,
# :internal_absolute_links, :internal_links, :keywords, :keywords=, :links, :links=, :nearest_fragment,
# :no_index?, :parser, :score, :search, :search!, :size, :stats, :text, :text=, :title, :title=, :to_h,
# :to_json, :url, :xpath
# ]
doc.url # => "https://wikileaks.org/What-is-Wikileaks.html"
doc.title # => "WikiLeaks - What is WikiLeaks"
doc.stats # => {
# :url=>44, :html=>28133, :title=>17, :keywords=>0,
# :links=>35, :text=>67, :text_bytes=>13735
# }
doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
doc.text # => ["The Courage Foundation is an international organisation that <snip>", ...]
results = doc.search 'corruption' # Searches doc.text for the given query.
results.first # => "ial materials involving war, spying and corruption.
# It has so far published more"
Below are some practical examples of Wgit in use.
See the Wgit::Indexer#index_www documentation and source code for an already-built example of a WWW HTML indexer. It will crawl any external URLs (in the database) and index their HTML for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it!
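As a rough sketch (assuming a configured MongoDB instance, as covered in the Database Example below, and that Wgit::Indexer.new accepts a database instance), kicking off the indexer yourself might look something like this:
require 'wgit'
# A rough sketch only - it assumes the database has already been configured
# (see the Database Example below) and that Wgit::Indexer.new accepts a database instance.
Wgit::Database.adapter_class = Wgit::Database::MongoDB
db = Wgit::Database.new # Uses ENV['WGIT_CONNECTION_STRING'] by default.
indexer = Wgit::Indexer.new(db)
indexer.index_www # Crawls and indexes the WWW until you stop it (or it runs out of URLs).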
See the Database Example for information on how to configure a database for use with Wgit.
Wgit uses itself to download and save webpages to disk (used in tests). See the script here and edit it for your own purposes.
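As a rough sketch of the same idea (the URL and output filename below are illustrative only), saving a crawled page's HTML to disk might look like:
require 'wgit'
# Illustrative only: crawl a page and write its raw HTML to a local file.
crawler = Wgit::Crawler.new
doc = crawler.crawl Wgit::Url.new('https://example.com')
File.write 'example.com.html', doc.html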
The broken_link_finder gem uses Wgit under the hood to find and report a website's broken links. Check out its repository for more details.
This gist takes a URL, crawls it and extracts the meaningful content without annoyances like cookie banners, popups, ads etc. It writes a clean HTML file for you to open in any browser.
Perform a Bing search (using the q URL param) and index the query and results to a database for future processing. This example uses the Wgit::DSL for convenience, but you can use the API to do the same thing if you'd prefer.
require 'wgit'
include Wgit::DSL
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
start 'https://www.bing.com/search?q=aid+workers'
extract :query, "//input[@id='sb_form_q']/@value"
extract(
:results, "//li[@class='b_algo']",
singleton: false, text_content_only: false
) do |results|
# Map the results into a preferable format.
results.map do |result|
{
title: result.at_xpath('./h2/a').content,
url: result.at_xpath('./div/div/cite').content,
text: result.at_xpath('./div/p').content,
}
end
end
index
The below script downloads the contents of the first CSS link found on Facebook's index page.
require 'wgit'
require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
crawler = Wgit::Crawler.new
doc = crawler.crawl 'https://www.facebook.com'.to_url
# Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
href = doc.at_xpath "//link[@rel='stylesheet']/@href"
href.content # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
href.class # => Nokogiri::XML::Attr
css = crawler.crawl href.content.to_url
css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
The below script downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketers for search engine optimisation (SEO), for example.
require 'wgit'
require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
my_pages_missing_keywords = []
competitor_urls = [
'http://altitudejunkies.com',
'http://www.mountainmadness.com',
'http://www.adventureconsultants.com'
].to_urls
crawler = Wgit::Crawler.new
crawler.crawl(*competitor_urls) do |doc|
# If there are keywords present in the web document.
if doc.keywords.respond_to? :-
puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
end
end
if my_pages_missing_keywords.empty?
puts 'Your pages are missing no keywords, nice one!'
else
puts 'Your pages compared to your competitors are missing the following keywords:'
puts my_pages_missing_keywords.uniq
end
The next example requires a configured database instance. The use of a database with Wgit is entirely optional, however, and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database, for searching etc.).
Several different DBMSs can be used with Wgit through the use of an "adapter" class. See this wiki article for more information.
Wgit employs a simple data model which is database agnostic:
Collection | Purpose | Wgit Class |
---|---|---|
urls | Stores URLs to be crawled at a later date | Wgit::Url |
documents | Stores web documents after they've been crawled | Wgit::Document |
Wgit provides an in-memory DB, mainly used for testing and experimenting with. Using the Wgit::Database::InMemory adapter class is very simple:
require 'wgit'
Wgit::Database.adapter_class = Wgit::Database::InMemory
# Use Wgit...
The Wgit::Database::InMemory class contains the source code for this database adapter.
The default DB that Wgit will use is MongoDB. The rest of this example will assume you're using MongoDB.
See MongoDB Atlas for a (small) free account or provide your own MongoDB instance. Take a look at the mongo-wgit Docker image for an already-configured example database, the source of which can be found in the ./docker directory of this repository.
The name of the database can be anything you like, but remember to correctly identify the database using its name in the connection string.
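For example, a hypothetical connection string for a local MongoDB instance, where the database is named 'crawler', could be set via the environment variable that Wgit reads by default:
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'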
The Wgit::Database::MongoDB class contains the source code for this database adapter.
The following versions of MongoDB are currently supported (but older versions will likely work just fine):
Gem | Database |
---|---|
~> 2.19 | >= 4.0 |
Running the following Wgit code will programmatically configure your database:
# By default Wgit will use the MongoDB class but we set it here explicitly for completeness.
Wgit::Database.adapter_class = Wgit::Database::MongoDB
db = Wgit::Database.new '<connection_string>'
# Mongo requires certain indexes to exist for searching, so we create them (once).
db.create_collections
db.create_unique_indexes
Wgit::Model.set_default_search_fields(db) # Creates a search index in Mongo using the Wgit::Model search fields.
Or take a look at the mongo_init.js file for the equivalent JavaScript commands.
Note: The text search index lists all document fields to be searched by MongoDB when calling Wgit::Database#search. Therefore, you should append to this list any other fields that you want searched. For example, if you extend the API, then you might want to search your new fields in the database by adding them to the index above. This can be done programmatically with:
# We set the search fields via the Wgit::Model passing the db param so Wgit's search methods abide by these fields too.
Wgit::Model.set_search_fields({ my_field: 2, ... }, db) # my_field has a search weight/priority of 2.
# OR
Wgit::Model.set_search_fields([:my_field, ...], db) # With a default weight of 1.
The following script demonstrates how to use Wgit to index and then search HTML documents stored in a database. If you're running the code for yourself, remember to replace the database connection string with your own.
require 'wgit'
### CONNECT TO THE DATABASE ###
Wgit::Database.adapter_class = Wgit::Database::MongoDB
# In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
db = Wgit::Database.new '<connection_string>'
### SEED SOME DATA ###
# Here we create our own document rather than crawling the web (which works in the same way).
# We provide the web page's URL and HTML Strings.
doc = Wgit::Document.new(
'http://test-url.com',
"<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
)
db.upsert doc
### SEARCH THE DATABASE ###
# Searching the database returns Wgit::Document's which have fields containing the query.
query = 'cow'
results = db.search query
# By default, the MongoDB ranking applies i.e. results.first has the most hits.
# Because results is an Array of Wgit::Document's, we can custom sort/rank e.g.
# `results.sort_by! { |doc| doc.url.crawl_duration }` ranks via page load times with
# results.first being the fastest. Any Wgit::Document attribute can be used, including
# those you define yourself by extending the API.
top_result = results.first
top_result.class # => Wgit::Document
doc.url == top_result.url # => true
### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
# Searching each result gives the matching text snippets from that Wgit::Document.
top_result.search(query).first # => "How now brown cow."
### SEED URLS TO BE CRAWLED LATER ###
db.upsert top_result.external_links
urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_links.
Document serialising in Wgit is the means of downloading a web page and serialising parts of its content into accessible Wgit::Document attributes/methods. For example, Wgit::Document#author will return the webpage's xpath value of meta[@name='author'].
There are two ways to extend the Document serialising behaviour of Wgit for your own means:
- Add additional textual content to Wgit::Document#text.
- Define Wgit::Document instance methods for specific HTML elements.
Below describes these two methods in more detail. Some of this functionality is also covered in the How To Extract Content article.
Wgit contains a set of Wgit::HTMLToText.text_elements defining which HTML elements contain text on a page; this text is in turn serialised. Once serialised, you can process the text content via methods like Wgit::Document#text and Wgit::Document#search etc.
The below code example shows how to extract additional text from a webpage:
require 'wgit'
# The default text_elements cover most visible page text but let's say we
# have a <table> element with text content that we want.
Wgit::HTMLToText.text_elements[:table] = :block # or :inline
doc = Wgit::Document.new(
'http://some_url.com',
<<~HTML
<html>
<p>Hello world!</p>
<table>My table</table>
</html>
HTML
)
# Now every crawled Document#text will include <table> text content.
doc.text # => ["Hello world!", "My table"]
doc.search('table') # => ["My table"]
Note: This only works for textual page content. For more control over the serialised elements themselves, see below.
Wgit provides some default extractors to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next.
Therefore, you can define a Document extractor for each HTML element (or elements) that you want to extract and serialise into a Wgit::Document instance variable, equipped with a getter and setter method. Once an extractor is defined, all subsequently crawled Documents will contain your extracted content.
Here's how to add a Document extractor to serialise a specific page element:
require 'wgit'
# Let's get all the page's <table> elements.
Wgit::Document.define_extractor(
:tables, # Wgit::Document#tables will return the page's tables.
'//table', # The xpath to extract the tables.
singleton: false, # True returns the first table found, false returns all.
text_content_only: false, # True returns the table text, false returns the Nokogiri object.
) do |tables|
# Here we can inspect/manipulate the tables before they're set as Wgit::Document#tables.
tables
end
# Our Document has a table which we're interested in. Note it doesn't matter how the Document
# is initialised e.g. manually (as below) or via Wgit::Crawler methods etc.
doc = Wgit::Document.new(
'http://some_url.com',
<<~HTML
<html>
<p>Hello world! Welcome to my site.</p>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Socrates</td><td>101</td></tr>
<tr><td>Plato</td><td>106</td></tr>
</table>
<p>I hope you enjoyed your visit :-)</p>
</html>
HTML
)
# Call our newly defined method to obtain the table data we're interested in.
tables = doc.tables
# Both the collection and each table within the collection are plain Nokogiri objects.
tables.class # => Nokogiri::XML::NodeSet
tables.first.class # => Nokogiri::XML::Element
# Note, the Document's stats now include our 'tables' extractor.
doc.stats # => {
# :url=>19, :html=>242, :links=>0, :text=>8, :text_bytes=>91, :tables=>1
# }
See the Wgit::Document.define_extractor docs for more information.
Extractor Notes:
- It's recommended that extracted URLs be mapped into Wgit::Url objects. Wgit::Url's are treated as Strings when being inserted into the database. See the sketch after this list for an example.
- A Wgit::Document extractor (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. String, Array etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure the data you want inserted can be inserted.
- Once inserted into the Database, you can search a Wgit::Document's extractor attributes by updating the Wgit::Model's text search index. See the Database Example for more information.
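As a hedged sketch of the first note above (the :image_sources extractor name and its xpath are illustrative, not part of Wgit), extracted Strings can be mapped into Wgit::Url objects inside the extractor block:
require 'wgit'
# Illustrative only: an :image_sources extractor whose block maps each extracted
# src String into a Wgit::Url object (&. guards against pages with no matches).
Wgit::Document.define_extractor(
  :image_sources, '//img/@src',
  singleton: false, text_content_only: true
) do |srcs|
  srcs&.map { |src| Wgit::Url.new(src) }
end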
Below are some general points to keep in mind when using Wgit:
- All absolute Wgit::Url's must be prefixed with an appropriate protocol e.g. https:// etc.
- By default, up to 5 URL redirects will be followed; this is configurable however (see the sketch after this list).
- IRIs (URLs containing non-ASCII characters) are supported and will be normalised/escaped prior to being crawled.
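For instance, a minimal sketch of tightening the redirect limit (assuming Wgit::Crawler exposes a redirect_limit option) might look like:
require 'wgit'
# Assumption: Wgit::Crawler accepts a redirect_limit keyword argument.
crawler = Wgit::Crawler.new redirect_limit: 2 # Follow at most 2 redirects per crawl.
doc = crawler.crawl Wgit::Url.new('https://example.com')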