How To Crawl Authenticated Webpages - michaeltelford/wgit GitHub Wiki

Using Wgit, it's possible to crawl webpages that are only accessible to authenticated users. This article describes how and provides an example.

High Level Steps

  1. Log in to the website you want to crawl using a web browser -> This will set the necessary session cookies on your machine
  2. Export the session cookies for that website to a file (in the Netscape cookie text format) -> cookies.txt
  3. Pass the cookies.txt file to a Wgit::Crawler instance -> Now Wgit is authenticated as you
  4. Use Wgit as normal, crawling and extracting authenticated content as desired

Log in using a web browser

In theory, you can use any browser you want, as long as you can export the cookies after logging in. First, go to the domain you want to crawl, e.g. github.com, and log in to your user account (this is the account Wgit will crawl as on your behalf).

Export the session cookies

You'll likely need to install an extension to export the session cookies. For example, if using Firefox, you can install the Export Cookies extension. This extension will export your session cookies for a given domain to a text file, formatted in the Netscape cookie text format. Record the file path of your cookies text file.
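
Before handing the exported file to Wgit, it can be worth checking that it contains what you expect. The following is a minimal sketch, not part of Wgit: `parse_netscape_cookies` is a hypothetical helper that assumes the usual Netscape layout of seven tab-separated fields per line (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, value), with `#` starting a comment and `#HttpOnly_` being the prefix curl uses to mark HttpOnly cookies.

```ruby
# Hypothetical helper (not part of Wgit): parse Netscape-format cookie text
# into an array of hashes, one per cookie.
def parse_netscape_cookies(text)
  text.each_line.filter_map do |line|
    # curl prefixes HttpOnly cookies with "#HttpOnly_"; strip it so the
    # line is not mistaken for a comment.
    line = line.chomp.delete_prefix("#HttpOnly_")
    next if line.strip.empty? || line.start_with?("#")

    fields = line.split("\t", -1)
    next unless fields.size == 7 # Skip anything that isn't a cookie line

    domain, subdomains, path, secure, expiry, name, value = fields
    {
      domain: domain,
      include_subdomains: subdomains == "TRUE",
      path: path,
      secure: secure == "TRUE",
      expires: expiry.to_i, # Unix timestamp; 0 means a session cookie
      name: name,
      value: value
    }
  end
end
```

For example, `parse_netscape_cookies(File.read(path)).map { |c| c[:domain] }.uniq` lists the domains the export covers, so you can confirm the site you logged in to is among them.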

Pass the session cookies to Wgit

require "wgit"
require "wgit/core_ext"

# Set the path to your cookies text file
cookies_file = "/Users/<user>/Downloads/cookies.txt"

# Configure Wgit to pass the cookies file to Typhoeus -> libcurl
crawler = Wgit::Crawler.new(typhoeus_opts: {
  cookiefile: cookies_file,  # Required: Sends cookies from this file with all requests
  cookiejar:  cookies_file   # Optional: Saves any new cookies to this file
})

# Use Wgit as you would normally, now able to crawl authenticated content
url = Wgit::Url.new "https://github.com/<user>/<private_repo>/"
crawler.crawl(url) { |doc| puts doc.title } # Replace the block body with your own processing

Use Wgit as normal, but authenticated as you

Some things to remember about passing your session cookies around:

  • Cookies expire, so you'll likely have to refresh them periodically by repeating the above steps.
  • Cookies (especially session cookies) can contain sensitive information. They are essentially your online identity, so use them responsibly.
  • Never upload or share your cookies text file with anyone. It's also a good idea to add the cookies file to your .gitignore.
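
Since expired cookies are a likely cause of a crawl silently losing its authentication, a quick check before crawling can help. This is a minimal sketch rather than a Wgit feature: `expired_cookie_names` is a hypothetical helper that assumes the Netscape format stores the expiry as a Unix timestamp in the fifth tab-separated field and the cookie name in the sixth, with an expiry of 0 meaning a session cookie.

```ruby
# Hypothetical helper (not part of Wgit): return the names of cookies in
# Netscape-format text whose expiry time has already passed.
def expired_cookie_names(cookies_text, now: Time.now)
  cookies_text.each_line.filter_map do |line|
    # curl marks HttpOnly cookies with a "#HttpOnly_" prefix; strip it first.
    line = line.chomp.delete_prefix("#HttpOnly_")
    next if line.strip.empty? || line.start_with?("#")

    fields = line.split("\t", -1)
    next unless fields.size == 7

    expiry = fields[4].to_i
    # An expiry of 0 marks a session cookie with no fixed expiry, so skip it.
    next if expiry.zero?

    fields[5] if expiry <= now.to_i
  end
end
```

Calling `expired_cookie_names(File.read(path))` on your cookies text file before a crawl returns an empty array when nothing has expired. A non-empty result suggests it's time to log in again and re-export.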