HTML Form GET or POST - internetarchive/heritrix3 GitHub Wiki
To use the HTML Form GET or POST credential, supply a domain,
http-method, login-uri, and form-items.
Before a URI is scheduled for crawling, Heritrix looks for
preconditions. Examples of preconditions include capturing the DNS
record of the server that hosts the URI and fetching the robots.txt
file. The HTML Form credentials are also processed as a precondition.
If there are HTML Form credentials for a particular CrawlServer in the
Credential Store, the URI specified in the HTML Form credential
login-uri field is scheduled as a precondition for the site, after the
DNS and robots.txt preconditions.
For the domain, see the domain field of the RFC2617 (BASIC and DIGEST Auth) credential.
The login-uri is a relative or absolute URI to which the HTML Form
submits. It is not necessarily the page that contains the HTML Form;
rather it is the ACTION URI to which the form submits.
Form-items are a listing of HTML Form key/value pairs. The submit button usually must be included in the form-items.
An HTML Form GET or POST credential configuration is illustrated
below. (Note that this bean must appear in the
CredentialStore's 'credentials' map, either via a
bean-reference, or by being defined inline there.)
<bean id="credential"
      class="org.archive.modules.credential.HtmlFormCredential">
  <property name="domain" value="example.com" />
  <property name="loginUri" value="http://example.com/login"/>
  <property name="formItems">
    <map>
      <entry key="login" value="mylogin"/>
      <entry key="password" value="mypassword"/>
      <entry key="submit" value="submit"/>
    </map>
  </property>
</bean>

Note
- For a site with an HTML Form credential, a login is performed against all listed HTML Form credential login-uris after the DNS and robots.txt preconditions are fulfilled. The crawler will only view sites that have HTML Form credentials from a logged-in perspective. There is currently no way for a single Heritrix job to crawl a site in an unauthenticated state and then re-crawl the site in an authenticated state. (You would have to do this in two separately configured job launches.)
- The form login is only run once. Heritrix continues crawling regardless of whether the login succeeds. There is no way of telling Heritrix to retry authentication if the first attempt is not successful. Neither is there a means for the crawler to report successful or failed authentications. The crawl operator should examine the logs to determine whether authentication succeeded.
- Some sites' login forms may have form items with dynamic names, or necessary extra hidden fields whose required value changes for each visitor. This HtmlFormCredential mechanism has no support for successfully submitting such forms.
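
As mentioned earlier, the credential bean only takes effect once it is listed in the CredentialStore's 'credentials' map. The following is a minimal sketch of that wiring with the credential defined inline. The bean id `credentialStore`, the map key `exampleFormLogin`, and the `httpMethod` property with the value `POST` are assumptions made for illustration and are not taken from the example above; check them against the crawler-beans.cxml shipped with your Heritrix version.

```xml
<!-- Sketch only: the bean id and the map key are illustrative assumptions. -->
<bean id="credentialStore"
      class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <!-- The key is an arbitrary label for this credential. -->
      <entry key="exampleFormLogin">
        <bean class="org.archive.modules.credential.HtmlFormCredential">
          <property name="domain" value="example.com"/>
          <property name="loginUri" value="http://example.com/login"/>
          <!-- Assumed property name and value for the http-method setting. -->
          <property name="httpMethod" value="POST"/>
          <property name="formItems">
            <map>
              <entry key="login" value="mylogin"/>
              <entry key="password" value="mypassword"/>
              <!-- The submit button usually must be included. -->
              <entry key="submit" value="submit"/>
            </map>
          </property>
        </bean>
      </entry>
    </map>
  </property>
</bean>
```

If the credential is instead kept as a separate top-level bean with id `credential`, as in the earlier example, the inline definition can be replaced by a bean-reference of the form `<entry key="exampleFormLogin" value-ref="credential"/>`.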