# How To Handle Redirects
The `Wgit::Crawler` class provides a means of crawling an HTTP endpoint and returning its response content for serialization. By default, the `crawl_*` methods will follow HTTP URL redirects (e.g. 301s), but this is configurable. This article describes the different redirect behaviors and how to use them in your application.
`Wgit::Crawler#crawl_url` crawls a single HTTP URL, whereas `#crawl_urls` can crawl several by calling `#crawl_url` underneath. Both default to the named parameter `follow_redirects: true`, which follows all redirects regardless of their `Location` header value. Override this value to configure the redirect crawl behavior. For example:
```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url = Wgit::Url.new 'http://twitter.com' # Redirects to HTTPS.

# The `crawl` method is an alias for `crawl_urls`.
crawler.crawl url, follow_redirects: false
# => nil (because the crawl failed, due to an illegal redirect)

crawler.last_response.status
# => 301
```
From the logs of the crawl (having set `Wgit.logger.level = Logger::DEBUG` beforehand):
```
[wgit] [http] Request: http://twitter.com
[wgit] [http] Response: 301 (0 bytes in 0.33 seconds)
[wgit] Wgit::Crawler#fetch('http://twitter.com') exception: Redirect not allowed: https://twitter.com/
```
The above examples show how to disallow and allow all redirects, regardless of their destination. But what if we want to limit where we redirect to? The solution is to pass a `Symbol` to `follow_redirects:`. Sticking with the above example, let's say we only want to allow redirects within the same host:
```ruby
url = Wgit::Url.new 'http://twitter.com' # Redirects to HTTPS at the same host.

crawler.crawl url, follow_redirects: :host # Notice the :host value.
# => Wgit::Document (because the crawl succeeded)

crawler.last_response.status
# => 200
```
The logs show the redirect happening under the hood:
```
[wgit] [http] Request: http://twitter.com
[wgit] [http] Response: 301 (0 bytes in 0.255 seconds)
[wgit] [http] Request: https://twitter.com/
[wgit] [http] Response: 200 (386372 bytes in 0.831 seconds)
```
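For comparison, a stricter scope such as `:origin` would block this same redirect, because a URL's origin includes its scheme (and port, if present). Below is a minimal plain-Ruby sketch of that comparison using only the standard `uri` library; the `origin_of` helper is our own illustration, not part of Wgit:

```ruby
require 'uri'

# Illustrative helper (not part of Wgit): build an origin string from
# scheme + host, appending the port only when it's non-default.
def origin_of(url)
  u = URI.parse(url)
  origin = "#{u.scheme}://#{u.host}"
  origin += ":#{u.port}" unless u.port == u.default_port
  origin
end

original = origin_of('http://twitter.com')   # => "http://twitter.com"
redirect = origin_of('https://twitter.com/') # => "https://twitter.com"

original == redirect
# => false, so `follow_redirects: :origin` would disallow this redirect,
# even though `:host` allows it (both hosts are "twitter.com").
```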
So what other Symbols can we pass to `follow_redirects:`, and how do they affect the redirect logic? The answer lies in the `Wgit::Url#relative?` method and its `opts:` named parameter values. See its docs for the most up-to-date options. At the time of writing, any one of these `follow_redirects:` Symbol values can be used:

```ruby
[:origin, :host, :domain, :brand]
```
The Symbol value is used to call that method on the redirected-to `Wgit::Url`; the result is then compared to that of the original URL (having called the same method on it). For example:
```ruby
# Below demonstrates a simplified version of the Crawler redirect logic.
url = Wgit::Url.new 'http://twitter.com'
url_redirect = Wgit::Url.new 'https://twitter.com'

a = url.host
# => "twitter.com"

b = url_redirect.host
# => "twitter.com"

a == b
# => true (allowing the redirect to take place)
```
The below values would be used when comparing the original and redirected-to URLs, depending on which Symbol you pass:
```ruby
url = Wgit::Url.new 'http://www.example.com/public'

url.origin # => "http://www.example.com" + port if present
url.host   # => "www.example.com"
url.domain # => "example.com"
url.brand  # => "example"
```
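To see how the choice of scope changes the outcome, consider a redirect between subdomains. The sketch below re-implements simplified `host`/`domain` comparisons in plain Ruby; note these are naive stand-ins that assume a single-part TLD like `.com`, unlike Wgit's real URL parsing:

```ruby
require 'uri'

# Naive illustrations of the comparisons (not Wgit's implementation).
def host_of(url)
  URI.parse(url).host
end

def domain_of(url)
  # Assumes a single-part TLD; real-world parsing is more involved.
  host_of(url).split('.').last(2).join('.')
end

original = 'http://www.example.com/public'
redirect = 'http://blog.example.com/posts'

host_of(original)   == host_of(redirect)
# => false ("www.example.com" vs "blog.example.com")

domain_of(original) == domain_of(redirect)
# => true (both "example.com")

# So `follow_redirects: :host` would block this redirect,
# while `follow_redirects: :domain` would allow it.
```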
It should be noted that `Wgit::Crawler#crawl_site` does not accept the `follow_redirects:` named parameter, because `#crawl(url, follow_redirects: :host)` is used under the hood when crawling the pages of a given host. Otherwise, you might end up crawling pages outside of the site/host you originally specified.
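That rationale can be sketched in plain Ruby: when crawling a site, only links whose host matches the start URL should be followed, which is exactly what scoping redirects to `:host` guarantees. The `same_host?` helper below is our own illustration using the standard `uri` library, not Wgit code:

```ruby
require 'uri'

# Illustrative helper (not part of Wgit): does a link stay on the
# same host as the URL we started crawling from?
def same_host?(start_url, link)
  URI.parse(start_url).host == URI.parse(link).host
end

start_url = 'http://example.com'
links = [
  'http://example.com/about',  # same host - stays in the site crawl
  'https://example.com/login', # same host (the scheme may differ)
  'http://other.com/page'      # different host - outside the site
]

links.select { |link| same_host?(start_url, link) }
# => ["http://example.com/about", "https://example.com/login"]
```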