# How To Prevent Indexing

When indexing (crawling and saving HTML to a Database), the rules set in a website's robots.txt file should always be honoured.

Wgit does this by default when using any `Wgit::Indexer#index_*` method.

If performing your own indexing using Wgit without calling one of these methods, we strongly encourage you to honour the indexing rules set out in the robots.txt file.

You can do this by studying the `Wgit::Indexer` class and by taking a look at the documentation and source code of the `Wgit::RobotsParser` class, which takes the contents of a robots.txt file and parses them into rules that apply to the Wgit indexing library. You can use this handy parser in your own indexing methods.
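Below is a rough sketch of what that might look like. The URL is a placeholder, and the `Wgit::RobotsParser` method names used (`#no_index?`, `#allow_paths`, `#disallow_paths`) are taken from the class's documentation, so double-check them against the version of Wgit you have installed:

```ruby
require "net/http"
require "wgit"

# Placeholder site; swap in the site you're about to index.
robots_url = "https://example.com/robots.txt"
contents   = Net::HTTP.get(URI(robots_url))

# Parse the file's contents into rules Wgit understands.
parser = Wgit::RobotsParser.new(contents)

if parser.no_index?
  puts "robots.txt disallows all wgit indexing, skipping this site"
else
  # Use the parsed rules to decide which pages to crawl and save.
  puts "Allowed paths:    #{parser.allow_paths.inspect}"
  puts "Disallowed paths: #{parser.disallow_paths.inspect}"
end
```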


If you have a site and want to update your own robots.txt file with rules that apply to Wgit, see below.

At present Wgit only parses the following keys in a robots.txt file:

- `User-agent`
- `Allow`
- `Disallow`

All other keys are ignored.

Only rules applying to `User-agent: *` or `User-agent: wgit` will be followed; rules for any other user agent are ignored.

Treat the `wgit` rule block as a blacklist for pages you don't want indexed. If you don't mind Wgit indexing your entire site, then don't include a `wgit` block at all.
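Both `Allow` and `Disallow` lines inside a `wgit` (or `*`) block are parsed. The block below is purely illustrative (the paths are made up); how overlapping `Allow`/`Disallow` rules are resolved is down to `Wgit::RobotsParser`, so check its docs if you depend on that behaviour:

```text
User-agent: wgit
Disallow: /admin
Allow: /admin/public
```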

You can make Wgit ignore your site completely by adding this to your site's `robots.txt`:

```text
User-agent: wgit
Disallow: *
```

To disallow certain pages from being indexed, for example the `/login` and `/account` pages, add this to your site's `robots.txt`:

```text
User-agent: wgit
Disallow: /login
Disallow: /account
```

Unsupported syntax includes:

- `Disallow: /blah$` - The `$` will be ignored, making the line: `Disallow: /blah`.

In addition to having a robots.txt file on your webserver, you can tell Wgit to ignore certain pages of your site using `noindex`, set via either:

- A response header of `X-Robots-Tag: noindex` (see the sketch after this list)
- An HTML meta tag of `<meta name="robots" content="noindex">` or `<meta name="wgit" content="noindex">` (the former applies to all crawlers, the latter applies only to Wgit)
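As a rough illustration of the response header approach, here's a minimal Rack sketch (the app and page content are hypothetical) that serves every response with `X-Robots-Tag: noindex`. In a real site you'd set this header in your framework or webserver config, and only for the pages you want excluded:

```ruby
# config.ru - a bare-bones Rack app that marks its responses as non-indexable
# via the X-Robots-Tag response header (lowercase header keys suit Rack 3+).
run lambda { |_env|
  headers = {
    "content-type" => "text/html",
    "x-robots-tag" => "noindex" # tells Wgit (and other crawlers) not to index this page
  }

  [200, headers, ["<p>This page won't be saved by Wgit's indexer.</p>"]]
}
```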

There may be times when you want to ignore a site's robots.txt file; for example, when indexing your own websites before you've had a chance to update their robots.txt files.

You can achieve this with:

```ruby
require "wgit"

ENV["WGIT_IGNORE_ROBOTS_TXT"] = "true"

# Use Wgit as normal...
# indexer.index(url)
```