DataSpider 102: Crawl IDs and database integrity - modakanalytics/dataspider GitHub Wiki

Database Integrity

Even though Kosh is not intended to be a system of record for meta-data, the design has been made robust for maintaining a consistent and trusted state. This is done by tagging DataSpider-produced output with a crawl id, which allows global state consistency to be maintained efficiently. The process is discussed below.

Crawl IDs

Two of the most important fields in the Kosh tables are crawl_id and prev_crawl_id. Although it is good in general to view the glass as being half full (or even fuller), things happen when processing in complex networked systems, and it is nice to be able to revert to a trusted state in the event the glass spills. Each time the meta-data for a source is obtained by a DataSpider crawl, that data is associated with a crawl_id. The various records are also tagged with the prev_crawl_id, which identifies the preceding crawl.
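As a rough sketch of the tagging idea, the snippet below uses a hypothetical, much-simplified table (the real Kosh schema is not shown on this page); the table and column names other than crawl_id and prev_crawl_id are illustrative assumptions.

```python
import sqlite3

# Hypothetical, simplified stand-in for a Kosh meta-data table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE source_metadata (
        source_name   TEXT,
        column_name   TEXT,
        crawl_id      INTEGER,  -- crawl that produced this record
        prev_crawl_id INTEGER   -- the preceding crawl
    )
""")

def record_crawl(conn, source, columns, crawl_id, prev_crawl_id):
    """Tag every record produced by a crawl with crawl_id and prev_crawl_id."""
    conn.executemany(
        "INSERT INTO source_metadata VALUES (?, ?, ?, ?)",
        [(source, col, crawl_id, prev_crawl_id) for col in columns],
    )

record_crawl(conn, "sales_db.orders", ["order_id", "amount"],
             crawl_id=2, prev_crawl_id=1)
rows = conn.execute(
    "SELECT column_name, crawl_id, prev_crawl_id FROM source_metadata"
).fetchall()
print(rows)  # every record carries both identifiers
```

Because every record carries both identifiers, the set of rows written by any one crawl can be located with a single indexed lookup, which is what makes the rollback discussed below cheap.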

Rollbacks

The power that these two fields provide is to allow the Kosh database to be rolled back to the state it was in following the last successful DataSpider crawl. There is an assumption that a new DataSpider crawl will not be initiated until the database has been successfully rolled back after the failed crawl. Since the data in Kosh is Type 2, and since it is meta-data rather than billions and billions of rows, the rollback process can be accomplished quite efficiently.
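A rollback of this kind can be sketched as follows. This is a hedged illustration, not the actual Kosh implementation: the table, the is_current Type 2 flag, and the rollback_crawl helper are all assumptions made for the example, which simply deletes the failed crawl's rows and re-marks the versions they had superseded as current.

```python
import sqlite3

# Hypothetical sketch: roll a Type 2 meta-data table back to the state
# left by the last successful crawl. Schema is illustrative, not Kosh's.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE source_metadata (
        column_name   TEXT,
        crawl_id      INTEGER,
        prev_crawl_id INTEGER,
        is_current    INTEGER   -- Type 2 flag: 1 = current version
    )
""")
# State after successful crawl 1, then a failed crawl 2 that
# superseded the order_id record.
conn.executemany("INSERT INTO source_metadata VALUES (?, ?, ?, ?)", [
    ("order_id", 1, 0, 0),   # superseded by the failed crawl
    ("amount",   1, 0, 1),
    ("order_id", 2, 1, 1),   # written by the failed crawl
])

def rollback_crawl(conn, failed_crawl_id):
    """Delete the failed crawl's rows and restore the prior versions."""
    conn.execute("DELETE FROM source_metadata WHERE crawl_id = ?",
                 (failed_crawl_id,))
    # Any column left with no current version gets its latest
    # surviving version re-marked as current.
    conn.execute("""
        UPDATE source_metadata SET is_current = 1
        WHERE is_current = 0
          AND column_name NOT IN (
              SELECT column_name FROM source_metadata WHERE is_current = 1
          )
    """)
    conn.commit()

rollback_crawl(conn, failed_crawl_id=2)
rows = conn.execute(
    "SELECT column_name, crawl_id, is_current FROM source_metadata "
    "ORDER BY column_name"
).fetchall()
print(rows)
```

After the rollback, only crawl 1's records remain and both are current again, i.e. the table is back in its last trusted state. Because the Type 2 history preserves superseded rows rather than overwriting them, the failed crawl's writes can be undone without any external backup.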