How To Handle Redirects - michaeltelford/wgit GitHub Wiki

The Wgit::Crawler class provides a means of crawling an HTTP endpoint and returning its response content for serialization.

By default, the crawl_* methods will follow HTTP URL redirects (e.g. 301s) but this is configurable. This article describes the different redirect behaviors and how to use them in your application.

Wgit::Crawler#crawl_url crawls a single HTTP URL, whereas #crawl_urls crawls several by calling #crawl_url underneath. Both default to the named parameter follow_redirects: true, following all redirects regardless of their Location header value. Override this value to configure the redirect behavior. For example:

require 'wgit'

crawler = Wgit::Crawler.new
url = Wgit::Url.new 'http://twitter.com' # Redirects to HTTPS.

# The `crawl` method is an alias for `crawl_urls`.
crawler.crawl url, follow_redirects: false
# => nil (because the crawl failed, due to a disallowed redirect)

crawler.last_response.status
# => 301

From the logs of the crawl (having set Wgit.logger.level = Logger::DEBUG beforehand):

[wgit] [http] Request:  http://twitter.com
[wgit] [http] Response: 301 (0 bytes in 0.33 seconds)
[wgit] Wgit::Crawler#fetch('http://twitter.com') exception: Redirect not allowed: https://twitter.com/

The above examples show how to disallow and allow all redirects, regardless of their destination. But what if we want to limit where we redirect to? The solution is to pass a Symbol to follow_redirects:. Sticking with the above example, let's say we only want to allow redirects within the same host:

url = Wgit::Url.new 'http://twitter.com' # Redirects to HTTPS at the same host.

crawler.crawl url, follow_redirects: :host # Notice the :host value.
# => Wgit::Document (because the crawl succeeded)

crawler.last_response.status
# => 200

The logs show the redirect happening under the hood:

[wgit] [http] Request:  http://twitter.com
[wgit] [http] Response: 301 (0 bytes in 0.255 seconds)
[wgit] [http] Request:  https://twitter.com/
[wgit] [http] Response: 200 (386372 bytes in 0.831 seconds)

So what other Symbols can we pass to follow_redirects: and how do they affect the redirect logic?

The answer lies in the Wgit::Url#relative? method and its opts: named parameter values. See its docs for the most up-to-date options.

At the time of writing, any one of these follow_redirects: Symbol values can be used:

[:origin, :host, :domain, :brand]

The Symbol value is used to call that method on the redirected-to Wgit::Url, the value of which is then compared to the original URL (having called the same method on it). For example:

# Below demonstrates a simplified version of the Crawler redirect logic.
url          = Wgit::Url.new 'http://twitter.com'
url_redirect = Wgit::Url.new 'https://twitter.com'

a = url.host
# => "twitter.com"

b = url_redirect.host
# => "twitter.com"

a == b
# => true (allowing the redirect to take place)

The below values would be used when comparing the original and redirected-to URLs, depending on which Symbol you pass:

url = Wgit::Url.new 'http://www.example.com/public'

url.origin # => "http://www.example.com" + port if present
url.host   # => "www.example.com"
url.domain # => "example.com"
url.brand  # => "example"
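The comparison logic can be sketched using only Ruby's standard library. Note this is a simplified illustration, not Wgit's actual implementation, and it covers only :origin and :host (the :domain and :brand checks require second-level domain parsing, which Wgit handles internally):

```ruby
require 'uri'

# Simplified sketch: should a redirect from `original` to `redirect`
# be followed, given the `scope` Symbol?
def redirect_allowed?(original, redirect, scope)
  a, b = URI(original), URI(redirect)

  case scope
  when :origin then [a.scheme, a.host, a.port] == [b.scheme, b.host, b.port]
  when :host   then a.host == b.host
  else true # follow_redirects: true follows everything.
  end
end

redirect_allowed?('http://twitter.com', 'https://twitter.com/', :host)
# => true (same host)

redirect_allowed?('http://twitter.com', 'https://twitter.com/', :origin)
# => false (the scheme and port differ)
```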

It should be noted that Wgit::Crawler#crawl_site does not accept the follow_redirects: named parameter, because #crawl(url, follow_redirects: :host) is used under the hood when crawling the pages of a given host. Otherwise, you might end up crawling pages outside of the site/host you originally specified.