# How To Prevent Indexing
When indexing (crawling and saving HTML to a database), the rules set in a website's `robots.txt` file should always be honoured. Wgit does this by default when using any `Wgit::Indexer#index_*` method.
If you're performing your own indexing with Wgit without calling one of these methods, we strongly encourage you to honour the indexing rules set out in the `robots.txt` file. You can do this by studying the `Wgit::Indexer` class and by looking at the documentation and source code of the `Wgit::RobotsParser` class, which takes the contents of a `robots.txt` file and parses them into rules that apply to the Wgit indexing library. You can use this handy parser in your own indexing methods, as sketched below.
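For example, here's a rough sketch of honouring a site's `robots.txt` in your own indexing code. The `#disallow_paths` accessor is an assumption based on the class description above, and the URL is a placeholder; check the `Wgit::RobotsParser` docs for your installed version before relying on this.

```ruby
require 'net/http'
require 'wgit'

# Fetch and parse the site's robots.txt contents (example.com is just a placeholder).
contents = Net::HTTP.get(URI('https://example.com/robots.txt'))
parser   = Wgit::RobotsParser.new(contents)

# Assumption: #disallow_paths returns the parsed Disallow paths that apply to Wgit.
page_path  = '/login'
disallowed = parser.disallow_paths.any? { |path| page_path.start_with?(path.delete('*')) }

puts "Skipping #{page_path}, it's disallowed by robots.txt" if disallowed
```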
If you have a site and want to update your `robots.txt` file with rules that apply to Wgit, see below.
At present, Wgit only parses the following keys in a `robots.txt` file:

- `User-agent`
- `Allow`
- `Disallow`

All other keys are ignored.
Only rules applying to `User-agent: *` or `User-agent: wgit` will be followed; rules aimed at any other indexer will be ignored. Treat the `wgit` rule block as a blacklist of pages you don't want indexed. If you don't mind Wgit indexing your entire site, then don't include a `wgit` block at all.
You can make Wgit ignore your site completely by adding this to your site:

```robots.txt
User-agent: wgit
Disallow: *
```
To disallow certain pages from being indexed, for example `/login` and `/account`, add this to your site:

```robots.txt
User-agent: wgit
Disallow: /login
Disallow: /account
```
Unsupported syntax includes:

- `Disallow: /blah$` - the `$` will be ignored, making the line `Disallow: /blah`.
In addition to having a `robots.txt` file on your webserver, you can tell Wgit to ignore certain pages of your site using the `noindex` value with either:

- A response header of `X-Robots-Tag: noindex`
- An HTML meta tag of `<meta name="robots" content="noindex">` or `<meta name="wgit" content="noindex">` (the former applies to all crawlers, the latter applies only to Wgit)
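If you're crawling pages yourself (outside of the `Wgit::Indexer#index_*` methods), you can check for the meta tag before saving a page. This is a rough sketch, not verbatim library usage; it assumes `Wgit::Document#xpath` is available for querying the parsed HTML, and uses a placeholder URL.

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
doc     = crawler.crawl(Wgit::Url.new('https://example.com')) # placeholder URL

# Look for a robots/wgit meta tag whose content contains 'noindex'.
meta_content = doc.xpath("//meta[@name='robots' or @name='wgit']/@content")
noindex      = meta_content.any? { |attr| attr.value.include?('noindex') }

puts noindex ? 'Page is marked noindex, skipping save' : 'Safe to index'
```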
There may be times when you want to ignore a site's `robots.txt` file; for example, when indexing your own websites before you've updated their `robots.txt` files.
You can achieve this with:

```ruby
ENV["WGIT_IGNORE_ROBOTS_TXT"] = "true"

# Use Wgit as normal...
# indexer.index(url)
```