KeepOutOfIndex - compuphase/sphider-pdo GitHub Wiki

Keeping pages from being indexed

There are four ways to prevent files from being indexed, and one additional way to keep sections of an HTML page from being indexed:

  • robots.txt
  • ext.txt
  • must include / must not include
  • the nofollow link attribute
  • sphider_noindex comments

robots.txt

The most common way to prevent pages from being indexed is the robots.txt standard: either put a robots.txt file in the root directory of the server, or add the necessary meta tags to the page headers. For details, see the robots.txt specification.

If you create rules specific to Sphider-PDO, set the user-agent to Sphider.
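For example, a minimal robots.txt that keeps Sphider out of one directory while leaving other crawlers unrestricted could look like this (the /private/ path is just a placeholder):

User-agent: Sphider
Disallow: /private/

User-agent: *
Disallow: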

Standard rules in a robots.txt file disallow complete directories from being spidered. The paths should always start from the (virtual) root, which means that the "disallow" value always starts with a slash. For example:

Disallow: /cgi-bin/
Disallow: /private/

If the value does not start with a "/", a URL is disallowed if the value appears anywhere inside the URL. For example, if the following rule is set:

Disallow: opera

URLs like:

  • http://www.example.com/operations/
  • http://www.example.com/finance/cooperative-bank.html

are both skipped.

To skip all files with a particular extension, add a rule with the extension and put a "$" at the end. The "$" matches the end of the URL, so the rule disallows a file only if its name ends with the extension in the rule. For example, to disallow all files with the extension ".txt", use:

Disallow: .txt$

ext.txt

This file resides in the admin subdirectory of the Sphider installation. It contains a list of file extensions that should not be indexed; extensions of binary files (images, zip archives) are typically in this list.
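As a sketch, the list might contain one extension per line, along these lines (the exact format and the default entries shipped with Sphider-PDO may differ):

zip
gif
jpg
exe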

Must include / must not include

A powerful option Sphider supports is defining a "must include" / "must not include" string list for a site (click on Advanced options in the Index screen for this). Any URL containing a string in the "must not include" list is ignored. Any URL that does not contain at least one string in the "must include" list is likewise ignored.

The strings in the list should be separated by a newline (the Enter key). For example, to prevent a forum on your site from being indexed, you might add www.yoursite.com/forum to the "must not include" list. All URLs containing that string will then be ignored and won't be indexed. Perl-style regular expressions are also supported in place of literal strings.

Any string that starts with a "*" is treated as a regular expression, so "*/[a]+/" denotes a string with one or more a's in it.
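As an illustration, a "must not include" list that combines a literal string with a regular expression might look like this (the session-id pattern is hypothetical and follows the delimiter convention of the example above):

www.yoursite.com/forum
*/PHPSESSID=/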

Ignoring links (nofollow)

Sphider respects the rel="nofollow" attribute in <a href="..."> tags. For example, if a page contains the link <a href="foo.html" rel="nofollow">, Sphider ignores the link to foo.html. As a result, foo.html will not be indexed (unless other pages link to foo.html without the nofollow attribute).
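For example, in the fragment below only bar.html is queued for indexing; foo.html is skipped, assuming no other page links to it (the file names are hypothetical):

<a href="foo.html" rel="nofollow">private page</a>
<a href="bar.html">public page</a>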

Ignoring parts of a page

Sphider includes an option to exclude parts of a page from being indexed. This can be used, for example, to prevent search results from being flooded when certain keywords appear in the same part of most pages (such as a header, footer, or menu). Any part of a page between

<!--sphider_noindex-->

and

<!--/sphider_noindex-->

tags is not indexed. However, links in such a section are still followed.
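For example, a page could wrap its navigation menu like this (the menu markup itself is illustrative):

<!--sphider_noindex-->
<ul class="menu">
  <li><a href="index.html">Home</a></li>
  <li><a href="about.html">About</a></li>
</ul>
<!--/sphider_noindex-->

The menu text ("Home", "About") is then excluded from the index, but the links to index.html and about.html are still followed.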
