How To Crawl Authenticated Webpages - michaeltelford/wgit GitHub Wiki
Using Wgit, it's possible to crawl webpages that are only accessible to authenticated users. This article describes how and provides an example.
- Log in to the website you want to crawl using a web browser -> This sets the necessary session cookies on your machine
- Export the session cookies for that website to a file (in the Netscape cookie text format) -> cookies.txt
- Pass the cookies.txt file to a Wgit::Crawler instance -> Now Wgit is authenticated as you
- Use Wgit as normal, crawling and extracting authenticated content as desired
You can in theory use any browser you want, as long as you can export the cookies after logging in. First, go to the domain you want to crawl, e.g. github.com, and log in to your user account (this is the account Wgit will crawl as on your behalf).
You'll likely need to install an extension to export the session cookies. For example, if using Firefox, you can install the Export Cookies extension. This extension will export your session cookies for a given domain to a text file, formatted in the Netscape cookie text format. Record the file path of your cookies text file.
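To sanity-check the exported file before handing it to Wgit, it can help to see what the Netscape cookie text format contains. The sketch below is a hypothetical helper (parse_netscape_cookies is not part of Wgit) and assumes the common layout of seven tab-separated fields per line, with the #HttpOnly_ prefix that libcurl uses to mark HttpOnly cookies. Extensions may differ slightly in what they write, so treat it as illustrative rather than definitive.

```ruby
# Fields of one cookie line in the Netscape cookie text format, in order.
COOKIE_FIELDS = %i[domain include_subdomains path secure expires name value].freeze

# Prefix libcurl uses to mark HttpOnly cookies (otherwise it looks like a comment).
HTTP_ONLY_PREFIX = "#HttpOnly_"

# Hypothetical helper (not part of Wgit): parses cookies.txt content
# into an array of hashes, skipping comments, blanks and malformed lines.
def parse_netscape_cookies(text)
  text.each_line.filter_map do |line|
    line = line.chomp
    http_only = line.start_with?(HTTP_ONLY_PREFIX)
    line = line.delete_prefix(HTTP_ONLY_PREFIX)
    next if line.strip.empty? || line.start_with?("#")

    parts = line.split("\t", -1) # -1 keeps a trailing empty value
    next unless parts.size == COOKIE_FIELDS.size

    cookie = COOKIE_FIELDS.zip(parts).to_h
    cookie[:include_subdomains] = cookie[:include_subdomains] == "TRUE"
    cookie[:secure] = cookie[:secure] == "TRUE"
    cookie[:expires] = cookie[:expires].to_i # Unix timestamp
    cookie[:http_only] = http_only
    cookie
  end
end
```

For example, parse_netscape_cookies(File.read(cookies_file)).map { |c| c[:name] } lists the cookie names in the export, which is a quick way to confirm the session cookie for your domain is actually present.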
require "wgit"
require "wgit/core_ext"
# Set the path to your cookies text file
cookies_file = "/Users/<user>/Downloads/cookies.txt"
# Configure Wgit to pass the cookies file to Typhoeus -> libcurl
crawler = Wgit::Crawler.new(typhoeus_opts: {
  cookiefile: cookies_file, # Required: Sends cookies from this file with all requests
  cookiejar: cookies_file   # Optional: Saves any new cookies to this file
})
# Use Wgit as you would normally, now able to crawl authenticated content
url = Wgit::Url.new "https://github.com/<user>/<private_repo>/"
crawler.crawl(url) { |doc| ... }
Some things to remember about passing your session cookies around:
- Cookies expire, so you'll likely have to refresh them periodically by repeating the steps above.
- Cookies (especially session cookies) can contain sensitive information. They are essentially your online user identity, so use them responsibly.
- Never upload or share your cookies text file with anyone. It's also a good idea to add the cookies file to your .gitignore etc.
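The points above can be partly automated. The following is a minimal sketch using only Ruby's standard library. Both helpers (expired_cookie_names and protect_cookies_file) are hypothetical names, not part of Wgit. It assumes the seven-field, tab-separated Netscape format and treats an expiry of 0 as a session cookie with no fixed expiry.

```ruby
# Hypothetical helpers, not part of Wgit.

# Returns the names of cookies in a Netscape-format file whose expiry has passed.
# Assumption: an expiry of 0 marks a session cookie and is never reported.
def expired_cookie_names(cookies_path, now: Time.now)
  File.readlines(cookies_path, chomp: true).filter_map do |line|
    line = line.delete_prefix("#HttpOnly_") # libcurl's HttpOnly marker
    next if line.start_with?("#")           # Skip comments

    fields = line.split("\t", -1)
    next unless fields.size == 7             # Skip blank or malformed lines

    expires = fields[4].to_i
    fields[5] if expires.positive? && expires < now.to_i
  end
end

# Restricts the cookies file to its owner and adds its file name to a
# .gitignore. Returns true if an entry was added, false if already present.
def protect_cookies_file(cookies_path, gitignore_path: ".gitignore")
  File.chmod(0o600, cookies_path) # Owner read/write only (limited effect on Windows)

  entry = File.basename(cookies_path)
  content = File.exist?(gitignore_path) ? File.read(gitignore_path) : ""
  return false if content.lines.map(&:chomp).include?(entry)

  File.open(gitignore_path, "a") do |f|
    f.puts unless content.empty? || content.end_with?("\n") # Finish a partial last line
    f.puts(entry)
  end
  true
end
```

If expired_cookie_names returns a non-empty list before a crawl, that is a hint to log in again and re-export the cookies. Note that ignoring by file name matches any file called cookies.txt in the repository, which is usually what you want here.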