CM130SPIchanges - googlegsa/manager.v3 GitHub Wiki

Introduction

In preparation for the upcoming 1.3.0 release of the Connector Manager and Connectors, we want to make sure that all of the Connectors are compatible with changes to the Connector Manager since the last release. There have been several small changes to the SPI, some additional functionality made available for the Connectors, and some clarification of flow of control.

Details

ConnectorFactory for use by ConnectorType.validateConfig()

A ConnectorFactory is now passed to ConnectorType.validateConfig(), which may use it to construct Connector instances for the purpose of validation. The ConnectorFactory uses the same mechanism to create the Connector instance that the Connector Manager uses to create the "normal" running instances. However, the instances created by the ConnectorFactory are considered transient: they are not scheduled for traversal or used to authorize search results.
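
As a rough illustration of the idea, the sketch below builds a transient instance through a factory and turns any failure into a validation message. The types are local stand-ins so that the example compiles on its own; the method name makeConnector(), the "server" property, and the ValidationSketch helper are assumptions made for illustration and should be checked against the SPI Javadoc.

```java
import java.util.Map;

// Stand-in types so the sketch compiles on its own. The real ones live in
// com.google.enterprise.connector.spi and may differ in detail.
class RepositoryException extends Exception {
  RepositoryException(String message) { super(message); }
}

interface Connector {}

interface ConnectorFactory {
  // Assumed method name: builds a transient Connector from a config map.
  Connector makeConnector(Map<String, String> config) throws RepositoryException;
}

class ValidationSketch {
  /** Returns null if the config is usable, otherwise an error message. */
  static String validate(Map<String, String> config, ConnectorFactory factory) {
    try {
      // The instance is transient: it is never scheduled for traversal.
      Connector probe = factory.makeConnector(config);
      return (probe == null) ? "Connector could not be created" : null;
    } catch (RepositoryException e) {
      return "Invalid configuration: " + e.getMessage();
    }
  }

  /** Hypothetical factory that rejects configs without a "server" entry. */
  static ConnectorFactory demoFactory() {
    return config -> {
      if (!config.containsKey("server")) {
        throw new RepositoryException("missing server");
      }
      return new Connector() {};
    };
  }
}
```

A real validateConfig() would typically go further and exercise the transient instance, for example by attempting a login against the repository.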

For additional information see:

ConnectorType.validateConfig() May Return a Modified Configuration

ConnectorType.validateConfig() may now return a modified configuration in the ConfigureResponse if desired. That modified configuration will be saved and used to create the running connector instance.
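
One plausible use is normalizing a value the administrator typed in. The sketch below uses a local stand-in for ConfigureResponse that carries only a message and a config map; the "serverUrl" property name and the convention that a null map means "unchanged" are assumptions made for this example, not documented SPI behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the SPI's ConfigureResponse; only the fields this sketch needs.
class ConfigureResponseSketch {
  final String message;                  // null means "no error"
  final Map<String, String> configData;  // null means "config unchanged" (assumed)

  ConfigureResponseSketch(String message, Map<String, String> configData) {
    this.message = message;
    this.configData = configData;
  }
}

class NormalizingValidator {
  // "serverUrl" is a hypothetical property name.
  static ConfigureResponseSketch validateConfig(Map<String, String> config) {
    String raw = config.get("serverUrl");
    String url = (raw == null) ? "" : raw.trim();
    // Strip trailing slashes so later URL concatenation is predictable.
    while (url.endsWith("/")) {
      url = url.substring(0, url.length() - 1);
    }
    if (url.isEmpty()) {
      return new ConfigureResponseSketch("serverUrl is required", null);
    }
    if (url.equals(raw)) {
      return new ConfigureResponseSketch(null, null);
    }
    // Return a modified copy; the caller's map is left untouched.
    Map<String, String> modified = new HashMap<>(config);
    modified.put("serverUrl", url);
    return new ConfigureResponseSketch(null, modified);
  }
}
```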

For additional information see:

Exception Handling in TraversalManager, DocumentList, Document, Property, Value

The handling of exceptions thrown during document traversal and feeding has been greatly improved. In previous releases, exceptions thrown during traversal would often result in loops or hangs, usually halting traversal progress. Connectors should only ever throw RepositoryExceptions out of these interfaces. However, we now provide a new subclass of RepositoryException, called RepositoryDocumentException, that is handled differently. In short, throwing a RepositoryDocumentException will force the Connector Manager to skip the document currently being processed and proceed to the next one. Throwing any other RepositoryException will instruct the Connector Manager to abandon the current batch of documents and retry later. The Connector must also properly handle a call to DocumentList.checkpoint() after an exception is thrown.
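
The difference between the two exception types can be sketched as a simplified batch loop. This is not the Connector Manager's actual code: the exception classes and DocumentListSketch are local stand-ins, documents are reduced to strings, and the "skip:" and "fail:" prefixes are a hypothetical way to trigger each exception.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-ins for the SPI exception classes named in the text.
class RepositoryException extends Exception {
  RepositoryException(String message) { super(message); }
}

class RepositoryDocumentException extends RepositoryException {
  RepositoryDocumentException(String message) { super(message); }
}

// Simplified stand-in for DocumentList: documents are plain strings here.
interface DocumentListSketch {
  String nextDocument() throws RepositoryException; // null when exhausted
}

class BatchLoopSketch {
  /** Feeds documents into 'fed'; returns false if the batch was abandoned. */
  static boolean processBatch(DocumentListSketch docs, List<String> fed) {
    while (true) {
      try {
        String doc = docs.nextDocument();
        if (doc == null) {
          return true; // batch finished normally
        }
        fed.add(doc);
      } catch (RepositoryDocumentException e) {
        // Problem with one document only: skip it and keep going.
      } catch (RepositoryException e) {
        // Wider problem: abandon the rest of the batch and retry later.
        return false;
      }
    }
  }

  /** Hypothetical source: "skip:" and "fail:" items throw the two exception types. */
  static DocumentListSketch listOf(String... items) {
    Iterator<String> it = Arrays.asList(items).iterator();
    return () -> {
      if (!it.hasNext()) {
        return null;
      }
      String item = it.next();
      if (item.startsWith("skip:")) {
        throw new RepositoryDocumentException(item);
      }
      if (item.startsWith("fail:")) {
        throw new RepositoryException(item);
      }
      return item;
    };
  }
}
```

In either case the Connector should expect checkpoint() to be called afterward, so the checkpoint it returns must reflect only the documents it has actually handed over.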

For more information, see:

Returning Null DocumentList vs Empty DocumentList from TraversalManager

Previous versions of the Connector Manager handled a null return value and an empty DocumentList (non-null, but zero items) returned from TraversalManager.startTraversal() and TraversalManager.resumeTraversal() identically. This version of the Connector Manager makes a subtle distinction between the two. A null return value is interpreted as before: no new content is available for indexing, so sleep for a few minutes and try again. An empty DocumentList is interpreted differently: although no suitable documents have been found yet, the Connector is performing a time-consuming search for appropriate content. The Connector Manager will call checkpoint() and reschedule the Connector for an immediate call to resumeTraversal(). This allows the Connector to time-slice or monitor a time-consuming search for content without running afoul of the Connector Manager's time-out of work threads. Connectors that return an empty DocumentList when they should be returning null will effectively run in a busy loop.
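
The three cases can be summarized in a small sketch, with a List standing in for DocumentList. The NextStep enum and both helper methods are hypothetical names invented for this example; they only restate the rules described above.

```java
import java.util.ArrayList;
import java.util.List;

class NullVersusEmptySketch {
  enum NextStep { SLEEP_THEN_RETRY, CHECKPOINT_AND_RESUME_NOW, FEED_DOCUMENTS }

  /** Manager side: how the three possible return values are told apart. */
  static NextStep interpret(List<String> batch) {
    if (batch == null) {
      return NextStep.SLEEP_THEN_RETRY;          // nothing new to index
    }
    if (batch.isEmpty()) {
      return NextStep.CHECKPOINT_AND_RESUME_NOW; // search still in progress
    }
    return NextStep.FEED_DOCUMENTS;
  }

  /** Connector side: what to return at the end of one slice of a long search. */
  static List<String> endOfSlice(List<String> found, boolean searchFinished) {
    if (!found.isEmpty()) {
      return found;
    }
    // Return null only when the search is really done. An empty list asks
    // for an immediate resumeTraversal() call.
    return searchFinished ? null : new ArrayList<String>();
  }
}
```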

For more information, see:

New "google:title" Property

The named link that the GSA presents in search results is usually a title or headline that the GSA extracts from the document content. At this time, the GSA does not make use of other meta-data supplied by the Connector to display this title, so if the feed has no content or the GSA cannot extract a meaningful title from the supplied content, it instead displays the URL to the document in the search result. Unfortunately, the URLs of documents from Connector Feeds are usually uninformative to the viewer.

The Connector Manager defines a new canonical metadata field, "google:title", available as SpiConstants.PROPNAME_TITLE. At this point, the GSA makes no special use of this field. However, if the Connector Manager receives a metadata-and-content feed with no actual "google:content" field, it will create stub content consisting of an HTML title fragment. This fools the current GSA versions into displaying that title in the search results.
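
The fallback can be pictured roughly as follows. The exact markup the Connector Manager generates is not stated here, so the HTML fragment, the escaping, and the empty-string result when no title is present are all assumptions; the property map is a simplified stand-in for a Document.

```java
import java.util.Map;

class TitleStubSketch {
  // Property names as given in the text.
  static final String PROPNAME_TITLE = "google:title";
  static final String PROPNAME_CONTENT = "google:content";

  /** Escapes the characters that would break a title element. */
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  /** Returns the real content if present, otherwise a stub built from the title. */
  static String contentFor(Map<String, String> props) {
    String content = props.get(PROPNAME_CONTENT);
    if (content != null) {
      return content;
    }
    String title = props.get(PROPNAME_TITLE);
    if (title == null) {
      return ""; // assumed: nothing to build a stub from
    }
    // Assumed shape of the stub; the real fragment may differ.
    return "<html><title>" + escape(title) + "</title></html>";
  }
}
```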

In the future the GSA may make more direct use of the google:title field, so even if your Connector does provide content, it should still present the document name/title/headline/subject as google:title.

For more information see:

TraversalContext and TraversalContextAware

The Connector Manager now provides a TraversalContext implementation to Connectors so that they may better determine what types of document content to provide during a traversal. Connectors may use the information provided by the TraversalContext to limit content provided for indexing, based upon document size or mime-type.

For instance, the Connector might use TraversalContext information to:

  • Provide a Document with meta-data and full content.
  • Provide a Document with meta-data but supply content in an alternate format (such as HTML or PDF).
  • Provide a Document with meta-data and summarized content.
  • Provide a Document with meta-data but no content.
  • Skip a Document entirely.

If a Connector's TraversalManager implementation adds the com.google.enterprise.connector.spi.TraversalContextAware interface, the Connector Manager will then call the setTraversalContext() method, supplying a TraversalContext for the Connector to use, before calling any methods in the TraversalManager interface.

If a TraversalContext is provided, the Connector's TraversalManager may then use it to tailor its Document feed. For instance, the TraversalContext could be used to determine whether or not to supply a "google:content" property for a Document, based upon the document size or mime-type. Note that the TraversalContext interface has changed slightly from its previous (unimplemented) version.
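
A minimal sketch of such a decision is shown below. The two interfaces are local stand-ins so the example compiles without the SPI jar; the method names maxDocumentSize() and mimeTypeSupportLevel() and the "greater than zero means supported" convention are assumptions that should be verified against the TraversalContext Javadoc. ContentPolicySketch and fixedContext() are hypothetical helpers.

```java
// Stand-ins for the SPI interfaces; method names are assumed, see above.
interface TraversalContext {
  long maxDocumentSize();
  int mimeTypeSupportLevel(String mimeType); // assumed: > 0 means supported
}

interface TraversalContextAware {
  void setTraversalContext(TraversalContext traversalContext);
}

class ContentPolicySketch implements TraversalContextAware {
  private TraversalContext context; // stays null if none is ever supplied

  @Override
  public void setTraversalContext(TraversalContext traversalContext) {
    this.context = traversalContext;
  }

  /** Decides whether to attach "google:content" for a document. */
  boolean shouldSendContent(long sizeBytes, String mimeType) {
    if (context == null) {
      return true; // no guidance available, so send content as before
    }
    return sizeBytes <= context.maxDocumentSize()
        && context.mimeTypeSupportLevel(mimeType) > 0;
  }

  /** Hypothetical fixed context, useful for trying the policy out. */
  static TraversalContext fixedContext(final long maxSize, final String supportedType) {
    return new TraversalContext() {
      @Override public long maxDocumentSize() { return maxSize; }
      @Override public int mimeTypeSupportLevel(String mimeType) {
        return supportedType.equals(mimeType) ? 1 : 0;
      }
    };
  }
}
```

A document that fails the check need not be dropped: the Connector can still feed its metadata without content, or supply the content in an alternate format.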

For additional information, see:

Connector Configuration Storage

This version of the Connector Manager moves the stored Connector schedule and traversal state (checkpoint) from the Java Preferences to files stored in the Connector instance directory (found under $TOMCAT_HOME/webapps/connector-manager/WEB-INF/connectors). This is the same directory in which the Connector's configuration properties file and optional connectorInstance.xml file are stored.

The presence of these two additional files is unlikely to affect the Connectors. The files are named $CONNECTOR_NAME_schedule.txt and $CONNECTOR_NAME_state.txt, where $CONNECTOR_NAME is the name of the Connector instance.

For more information, see:

Password Encryption with EncryptedPropertyPlaceholderConfigurer

All properties in the Connector's configuration properties file whose key contains the substring "password" (case-insensitive match) are now encrypted by default. In the past, only properties with the key "Password" were encrypted. Connectors using the EncryptedPropertyPlaceholderConfigurer are unlikely to notice the change.

The names of new configuration properties should be chosen with this rule in mind. For instance, it now allows a Connector to maintain separate passwords for different repository services. One side effect is that the Livelink Connector configuration now has an encrypted boolean property, because that property's name happens to contain the substring "password".
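
The old and new matching rules can be restated as two predicates. This is only a sketch of the rule as described above, not the Connector Manager's implementation, and the property names used in the usage note are hypothetical.

```java
import java.util.Locale;

class SensitiveKeySketch {
  /** New rule: any key containing "password", ignoring case. */
  static boolean isEncryptedNow(String key) {
    return key.toLowerCase(Locale.ROOT).contains("password");
  }

  /** Old rule: only the exact key "Password". */
  static boolean wasEncryptedBefore(String key) {
    return key.equals("Password");
  }
}
```

Under the new rule, hypothetical keys such as "adminPassword" and "usePasswordAuth" are both encrypted, even though the second is a boolean flag rather than a secret.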

For more information, see:

SMB Search URLs

Previous versions of the Connector Manager would reject google:searchurl metadata that used the "smb:" scheme for the URL. This has been fixed.

For additional information, see:

AuthorizationResponse.equals() and AuthorizationResponse.hashCode()

The AuthorizationResponse.equals() and AuthorizationResponse.hashCode() methods have been changed to include the AuthorizationResponse.valid member in the computations. In previous versions of the Connector Manager, only the AuthorizationResponse.docid member was used in AuthorizationResponse.equals() and AuthorizationResponse.hashCode().

The change is subtle, but AuthorizationResponse instances { "1234", true } and { "1234", false } are now considered unequal, where they would have been considered equal in the past.
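
The new contract can be illustrated with a stand-in class. The field names follow the text, but the constructor argument order and the exact hash computation are assumptions; the point is only that both members now take part in equals() and hashCode().

```java
import java.util.Objects;

// Stand-in for AuthorizationResponse showing the new equality contract.
final class AuthorizationResponseSketch {
  private final String docid;
  private final boolean valid;

  AuthorizationResponseSketch(String docid, boolean valid) {
    this.docid = docid;
    this.valid = valid;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof AuthorizationResponseSketch)) {
      return false;
    }
    AuthorizationResponseSketch other = (AuthorizationResponseSketch) obj;
    // Both members take part; previously only docid was compared.
    return valid == other.valid && Objects.equals(docid, other.docid);
  }

  @Override
  public int hashCode() {
    // Must stay consistent with equals(), so it covers the same members.
    return Objects.hash(docid, valid);
  }
}
```

One practical consequence is that a HashSet holding responses for the same docid with different valid values now keeps both entries instead of collapsing them into one.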

For more information, see: