How To Extract Content - michaeltelford/wgit GitHub Wiki

This article describes the different methods of extracting content from a crawled HTML document using Wgit.

Wgit extracts document content using Nokogiri's underlying functionality. You can select content using either CSS or XPath. XPath is the recommended choice because it is typically more expressive than CSS for advanced queries, e.g. matching on the suffix of an attribute's value. For this reason, some methods of extracting content only support XPath as a selector.

How To Write / Test XPath

Once we have the URL of the page we want to crawl, we must decide which content we're interested in and which XPath will extract it from the crawled document.

The easiest way is to use the console pane in your browser's developer tools. There you can try out XPath queries and inspect their results.

Tip: On Chrome / Firefox, you can use the XPath shorthand: $x('//p');

Remember that, by default, Wgit doesn't parse a page's JavaScript. It's therefore often worth checking that the page source (not the HTML shown in the developer tools) contains the content you want to extract; if it doesn't, consider enabling JavaScript parsing.

Now that we know our XPath query, there are several methods of using it to extract our content, each providing a different level of granularity:

  • Wgit::Document.define_extractor
  • Wgit::Document#extract
  • Wgit::Document#xpath / #css
  • Wgit::Document#parser

Wgit::Document.define_extractor

Document extractors are the most abstract and recommended method of extracting content from a crawled Wgit::Document. Take this example:

require 'wgit'
require 'wgit/core_ext'

Wgit::Document.define_extractor(
  :syntax, '//code',
  singleton: true, text_content_only: true
)

crystal = Wgit::Crawler.new.crawl 'https://crystal-lang.org/'.to_url

puts crystal.syntax

Which prints a nice code snippet of Crystal's Ruby-esque syntax:

# A very basic HTTP server
require "http/server"

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world, got #{context.request.path}!"
end

puts "Listening on http://127.0.0.1:8080"
server.listen(8080)

Document extractors have the following characteristics to consider:

  • Wgit::Document.define_extractor only supports XPath as a selector.
  • Each defined extractor takes a symbol as its name, which becomes an instance variable on the doc containing the extracted content. The xpath parameter is used to extract the content.
  • Each defined extractor takes two optional named parameters, which control how the content is returned:
    • Setting singleton: true returns the first result found, otherwise all are returned.
    • Setting text_content_only: true returns the inner text content of the element(s), otherwise the Nokogiri object(s) are returned (which can be useful for further processing).
    • Both of these parameters default to true, so they can be omitted until an override is desired.
  • Each defined extractor can also take a block, used to transform/format the extracted content before its instance variable gets set.
  • Once defined, extractors fire for all crawled documents, extracting the content if found. That's why define_extractor is a class (not instance) method on Wgit::Document.
  • Defined extractors are useful for indexing (saving documents to a database) because they define instance vars which will automatically be inserted into the database. These vars are then deserialised when documents are loaded from the database, meaning your extracted data will persist throughout the life-cycle of your application.

Check out the Wgit::Document.define_extractor docs for more information.

Wgit::Document#extract

If you'd rather extract content from a specific Wgit::Document instance as a one-off (rather than extracting from every document crawled), then you can do the following:

require 'wgit'
require 'wgit/core_ext' # Adds String#to_url

crystal = Wgit::Crawler.new.crawl 'https://crystal-lang.org/'.to_url

# Extract the text of the first <code> element from this document only.
syntax = crystal.extract('//code', singleton: true, text_content_only: true)

puts syntax

This will print the same output as the above example, which uses Wgit::Document.define_extractor.

Points to note about #extract:

  • Only supports XPath as a selector.
  • The singleton: and text_content_only: params work the same as with Wgit::Document.define_extractor.
  • Returns the extracted content. There's no instance variable defined as with Wgit::Document.define_extractor. This means the content won't be inserted into the database if saved (indexed).
  • Wgit::Document.define_extractor calls Wgit::Document#extract under the hood when a document is initialised (from a crawl etc.).

Wgit::Document#xpath / #css

If you'd rather work with Nokogiri results directly, you can call any of the following methods, each of which calls the Nokogiri method of the same name underneath:

  • Wgit::Document#xpath - Returns all XPath selected results
  • Wgit::Document#at_xpath - Returns the first XPath selected result
  • Wgit::Document#css - Returns all CSS selected results
  • Wgit::Document#at_css - Returns the first CSS selected result

Using these methods, you have your choice of selector; but you must process the raw Nokogiri results yourself.

Wgit::Document#parser

If you want access to the actual Nokogiri object (that handles the HTML parsing for every Wgit::Document) then you can call:

  • Wgit::Document#parser which returns an instance of Nokogiri::HTML::Document.