How To Use The DSL
The `Wgit::DSL` module provides wrapper methods around the API for convenience, and its use is optional.
require 'wgit'
require 'json'
include Wgit::DSL
start 'http://quotes.toscrape.com/tag/humor/'
follow "//li[@class='next']/a/@href"
extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
extract :authors, "//div[@class='quote']/span/small", singleton: false
quotes = []
crawl_site do |doc|
  doc.quotes.zip(doc.authors).each do |arr|
    quotes << {
      quote:  arr.first,
      author: arr.last
    }
  end
end
puts JSON.generate(quotes)
The DSL can be quicker and simpler to use than the API because it abstracts away some of the boilerplate code, e.g. instantiating classes. Using the DSL, you can crawl, index and search the web. But some functionality - such as URL parsing - is only possible using the API.
The Wgit DSL is typically used for quickly writing scripts that extract data from the web, either as experiments or by non-technical users. Anything that's possible with the DSL is also possible using Wgit's API classes. Often, when using Wgit in another library or application, it's cleaner and more flexible to use the API, but the choice is yours.
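To illustrate the URL parsing mentioned above, here's a minimal sketch using the API's `Wgit::Url` class. The method names shown (`to_host`, `to_domain`, `relative?`) are assumptions based on the `Wgit::Url` yardocs, so verify the exact names and return formats against your installed version:

```ruby
require 'wgit'

# Parse a URL with the API's Wgit::Url class (there's no DSL equivalent).
# NOTE: to_host, to_domain and relative? are assumed from the Wgit::Url yardocs.
url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')

puts url.to_host   # e.g. "quotes.toscrape.com"
puts url.to_domain # e.g. "toscrape.com"
puts url.relative? # => false
```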
When you `include Wgit::DSL`, you include its defined methods and instance vars. All DSL instance vars and constants are prefixed with `dsl_` to avoid conflicts. It's up to you, however, to ensure the DSL methods don't override other definitions in your code. If in doubt, use the Wgit API instead, which is namespaced under `Wgit::`.
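For example, if a DSL method name such as `start` or `follow` would clash with your own code, you can call the namespaced API directly instead. A minimal sketch, reusing the crawl target from the example above:

```ruby
require 'wgit'

# Use the namespaced API classes rather than the DSL's mixed-in methods.
crawler = Wgit::Crawler.new
doc = crawler.crawl(Wgit::Url.new('http://quotes.toscrape.com/tag/humor/'))

puts doc.title
```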
Check out the DSL's yardocs for the full list of available methods.
An alternative way of using the DSL is to subclass `Wgit::Base`, which `extend`s `Wgit::DSL` underneath. The syntax is similar to that of the Kimurai framework. This approach provides an additional layer of abstraction over the typical DSL usage shown above.
require 'wgit'
require 'json'

class QuotesCrawler < Wgit::Base
  mode   :crawl_site
  start  'http://quotes.toscrape.com/tag/humor/'
  follow "//li[@class='next']/a/@href"

  extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
  extract :authors, "//div[@class='quote']/span/small", singleton: false

  def parse(doc)
    doc.quotes.zip(doc.authors).each do |arr|
      yield({
        quote:  arr.first,
        author: arr.last
      })
    end
  end
end

if __FILE__ == $0
  quotes = []
  QuotesCrawler.run { |quote| quotes << quote }
  puts JSON.generate(quotes)
end
How it works:

- You must call the `start` DSL method to define the URL(s) to crawl.
- Your crawler class must define a `#parse(doc)` method which can optionally `yield` some data. You then have access to this data via a block when you `run` your crawler.
- Any defined `extract`ors will be callable on the `doc` passed to `parse`, which is called for every crawled URL/page.
- The `mode` DSL method specifies which `Wgit::Crawler`/`Indexer` method to call, defaulting to `crawl`, which crawls a single URL/page.
- When crawling a site, all internal `<a>` href URLs are followed by default; you can override this with a `follow` XPath.
- If indexing a site (crawling and then saving it to a database), don't forget to set `ENV['WGIT_CONNECTION_STRING']`.
- Define `#initialize`, `#setup` and `#teardown` methods as needed inside your class. These methods are called before and after the crawl.
- Call `self.class.<dsl_method>` as needed from inside your class's instance methods, e.g. `self.class.last_response` etc.
- The `run` method returns the created instance of your class for convenience. You can use this to query your class after the crawl has completed.
Here's another example of a class-based DSL crawler (using some of the above points):
require "wgit"
# Suppress the index logging.
Wgit.logger.level = Logger::WARN
# Set your database's connection string.
ENV['WGIT_CONNECTION_STRING'] = "mongodb://rubyapp:abcdef@localhost/crawler"
$url = "https://txti.es"
class WebsiteIndexer < Wgit::Base
  mode  :index_site
  start $url

  attr_reader :page_count, :total_time

  def initialize
    @page_count = 0
    @total_time = 0
  end

  def setup
    puts "Starting to index #{$url}..."
    puts
  end

  def parse(doc)
    @total_time += self.class.last_response.total_time

    return if doc.empty?

    puts_info(doc)
    @page_count += 1
  end

  def teardown
    puts "Finished indexing #{$url} (#{@page_count} pages in #{@total_time})"
  end

  private

  def puts_info(doc)
    puts doc.title || "No title"
    puts doc.description&.[](0..100) || "No description"
    puts doc.stats
    puts doc.url
    puts
  end
end
if __FILE__ == $0
  indexer = WebsiteIndexer.run
  puts "On average, one page was indexed every #{indexer.total_time / indexer.page_count} seconds"
end
Running this script will insert the crawled pages into the database and output:
Starting to index https://txti.es...
txti - Fast web pages for everybody
No description
{:url=>15, :html=>3706, :text=>9, :text_bytes=>192, :links=>4, :title=>35, :author=>14}
https://txti.es
About txti
No description
{:url=>21, :html=>1834, :text=>14, :text_bytes=>826, :links=>4, :title=>10}
https://txti.es/about
How to use txti
No description
{:url=>19, :html=>3804, :text=>49, :text_bytes=>2456, :links=>7, :title=>15}
https://txti.es/how
txti - Terms of Service
No description
{:url=>21, :html=>11589, :text=>42, :text_bytes=>10481, :links=>1, :title=>23}
https://txti.es/terms
Made via txti.es:
Images in txti
All images will be centered and start on a new line (so text doesn't flow around them.
{:url=>22, :html=>2335, :text=>8, :text_bytes=>658, :links=>3, :description=>203, :title=>17, :author=>7}
https://txti.es/images
Made via txti.es:
Images in txti
All images will be centered and start on a new line (so text doesn't flow around them.
{:url=>29, :html=>2155, :text=>4, :text_bytes=>536, :links=>1, :description=>203, :title=>17, :author=>7}
https://txti.es/images/images
Finished indexing https://txti.es (6 pages in 0.913512)
On average, one page was indexed every 0.152252 seconds
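Once a site has been indexed, its documents can be searched from the database. Below is a minimal, hedged sketch; it assumes the DSL's `search` method (listed in the yardocs) and reuses the connection string from above, so verify the exact method signature and return value against your installed version:

```ruby
require 'wgit'
include Wgit::DSL

# Reuse the same connection string as the indexer above.
ENV['WGIT_CONNECTION_STRING'] = "mongodb://rubyapp:abcdef@localhost/crawler"

# NOTE: `search` is assumed from the Wgit::DSL yardocs - it queries the indexed
# documents in the database for the given text and returns the matching docs.
results = search 'images'
results.each { |doc| puts doc.url }
```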