Scanners - nsoft/jesterj GitHub Wiki

Scanners are an entry point for data into the ingestion DAG. There may be more than one entry point, though this is intended for re-use of logic with multiple data sources and not for any form of merging/joining/sorting operations.

Available Scanners

JdbcScanner

This scanner polls a JDBC connection using a supplied query and scrolls through the results. When new or changed rows are detected, it creates a Document object containing a field for each column in the results.

SimpleFileScanner

This scanner initially walks the filesystem from the specified root and examines all sub-folders looking for files. Each file is loaded into the ingestion DAG as a Document. After a configured polling interval has elapsed, and if there is not a scan still in progress, the file system will be walked again.
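The walk-then-poll behavior described above can be sketched in plain Java. This is only an illustration under stated assumptions, not JesterJ's SimpleFileScanner: the class PollingWalkSketch and its methods tryBeginScan, endScan and walk are hypothetical names invented for this example. It shows a gate that refuses to start a new walk while one is in progress or before the interval has elapsed since the previous scan start, plus a recursive walk that collects regular files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; this is not JesterJ's SimpleFileScanner.
final class PollingWalkSketch {
  private final long intervalMs; // plays the role of the polling interval
  private boolean scanning;      // true while a walk is in progress
  private boolean everStarted;   // false until the first scan has begun
  private long lastStartMs;      // start time of the most recent scan

  PollingWalkSketch(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Returns true, and marks a scan as running, if a scan may start at nowMs. */
  synchronized boolean tryBeginScan(long nowMs) {
    if (scanning) {
      return false; // never overlap a scan that is still in progress
    }
    if (everStarted && nowMs - lastStartMs < intervalMs) {
      return false; // interval is measured between scan start times
    }
    scanning = true;
    everStarted = true;
    lastStartMs = nowMs;
    return true;
  }

  /** Marks the current scan as finished. */
  synchronized void endScan() {
    scanning = false;
  }

  /** Walks root and all sub-folders, returning every regular file found. */
  static List<Path> walk(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }
}
```

A caller would typically invoke tryBeginScan(System.currentTimeMillis()) from a scheduled task, run walk when it returns true, and call endScan in a finally block so that a failed walk does not block later scans.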

Important Scanner Options

  • scanFreqMS(long) - this is the polling interval, and represents the minimum amount of time required to elapse before a new scan is executed. If a scan is still in progress or thread scheduling is delayed, the actual start time of a subsequent scan may be greater than this value, but it should never be less. The interval is measured between scan start times.
  • retryErroredDocsUpTo(int) - This determines how many times a document that encountered an error will be retried. After this number of retries the document will be considered DEAD with respect to that destination and will not be retried unless a manual update to the internal Cassandra is performed or (if document hashing is turned on) the content of the document has changed.
  • rememberScannedIds(boolean) - If set to false, every scan will re-feed every document. If set to true, documents scanned will be recorded in Cassandra and subsequent scans will ignore documents already indexed. This behavior is modified by detectChangesViaHashing(boolean).
  • detectChangesViaHashing(boolean) - This modifies the behavior of rememberScannedIds(true) such that when a previously indexed document exhibits a change in content (but no change in file name or other documentId), the new version of the document will be indexed. Since this change detection is performed via MD5 hashing of the raw bytes of the file (or database result, etc.), it implies significantly more CPU and disk access. When this is set to true and scanned IDs are being remembered, JesterJ will be fault tolerant, meaning that if the JVM/pod/VM/power cord fails, restarting JesterJ with an identical startup command will resume and complete processing of any documents that were in flight at the time of the failure. Note that the scanner will restart from the beginning of the scan, so if new versions of documents have arrived, those new versions should be picked up correctly. The downside of this is that all document content must be hashed again, so the initial phase of a restart may have a period of high CPU and no output to your search index until documents not previously encountered are found again. Fast Resume (to avoid this high CPU period and gap in output on restart) is an expected feature for future versions.
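How rememberScannedIds and detectChangesViaHashing interact can be modeled with a small, self-contained sketch. This is a hypothetical model of the semantics described above, not JesterJ's implementation: the class ScanDecisionSketch and its methods are invented for illustration, an in-memory map stands in for the records kept in Cassandra, and retry/DEAD handling is left out. It also assumes that hashing has no effect when IDs are not remembered, which the text above does not state explicitly.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the option semantics; not JesterJ's implementation.
final class ScanDecisionSketch {
  private final boolean rememberScannedIds;
  private final boolean detectChangesViaHashing;
  // Stand-in for the records kept in Cassandra: document id -> content hash.
  private final Map<String, String> seen = new HashMap<>();

  ScanDecisionSketch(boolean rememberScannedIds, boolean detectChangesViaHashing) {
    this.rememberScannedIds = rememberScannedIds;
    this.detectChangesViaHashing = detectChangesViaHashing;
  }

  /** MD5 of the raw bytes as 32 lowercase hex characters. */
  static String md5Hex(byte[] raw) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      return String.format("%032x", new BigInteger(1, md.digest(raw)));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 should be available on every JVM", e);
    }
  }

  /** Returns true if the document should be fed into the DAG on this scan. */
  boolean shouldFeed(String id, byte[] raw) {
    if (!rememberScannedIds) {
      return true; // every scan re-feeds every document
    }
    // Assumption: hashing work is only done when change detection is on.
    String hash = detectChangesViaHashing ? md5Hex(raw) : "";
    String previous = seen.put(id, hash);
    if (previous == null) {
      return true; // first time this id has been seen
    }
    // A known id is re-fed only when hashing is on and the content changed.
    return detectChangesViaHashing && !previous.equals(hash);
  }
}
```

In this model the cost noted above is visible directly: with hashing enabled, every document's bytes are read and digested on every scan, even when nothing is re-fed.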

Writing Scanners

Writing a scanner is more complicated than writing a Document processor. The scanner generally needs to provide code to actively poll a system for changes, listen for events, or perhaps both. Scanners will be an area of ongoing development, so you can expect to see changes. Some changes across major versions may require you to adjust your code. That said, the following may help future-proof your Scanner:

  • Extend ScannerImpl.java and ScannerImpl.Builder.java (both of which are abstract)
  • Scanners MUST set an id on the document.
  • It is a good idea to use an ID that your scanner could parse to locate and re-feed the document. This is expected to be required for fault tolerance features in 1.0
  • Please note the calls to super class in the builder methods such as batchSize() and the use of a getObject getter. You should override this getObject method to ensure that your builder's object is updated.
  • The 1.0 release may move some of the infrastructure you see in SimpleFileWatchScanner up to the parent class.
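As an illustration of the advice above about IDs that a scanner can parse to locate a document again, here is a minimal sketch for a file-based scanner. The class LocatableIdSketch and its methods idFor and locate are hypothetical names for this example and are not part of JesterJ. The sketch assumes that a file URI is an acceptable document ID.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of JesterJ.
final class LocatableIdSketch {

  /** Builds an id from the absolute, normalized location of the file. */
  static String idFor(Path file) {
    return file.toAbsolutePath().normalize().toUri().toString();
  }

  /** Parses an id produced by idFor back into the path it came from. */
  static Path locate(String id) {
    return Paths.get(URI.create(id));
  }
}
```

Because the ID round-trips to a path, a scanner holding only the ID has enough information to find and re-feed the same file later.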

Getting help writing scanners

Please open a discussion in the General Help section if you need help.