DataSpider 103: CDC for metadata, don't let Type2 data take over the world - modakanalytics/dataspider GitHub Wiki
Change Data Capture - CDC
Just to calm the nerves of any database administrators who might be reading this, DataSpider is not expecting CDC logs to be turned on for system metadata tables across the enterprise. Our approach is this:
- Each time DataSpider crawls a source and captures the metadata, it does so completely.
- The captured metadata is stored in staging tables that have a structure similar to the metadata tables.
- For each entity's metadata (for instance, the metadata about a specific table), the staging table value is compared against the corresponding record in the Kosh metadata table. There are a few possible results of this comparison:
  - If the information is in the Kosh metadata tables (from previous crawl data) and not in the staging tables (from current crawled data), then the record in the metadata table is closed out with a soft delete under Type2 rules (it is marked as no longer valid).
  - If the information in the metadata table is the same as in the staging table, then the record is left as is.
  - If the information in the staging table (from current crawled data) differs from the Kosh metadata table (from previous crawl data), then the record in the metadata table is closed out. This is basically a variant of the first case.
- After these checks and adjustments are made, the remaining records in the staging table are moved into the metadata table.
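
The comparison steps above can be sketched in a few lines of Python. This is only an illustration, not DataSpider's actual implementation: the in-memory list and dict stand in for the Kosh metadata and staging tables, and the names `MetadataRecord`, `apply_crawl`, `valid_from` and `valid_to` are hypothetical. It also assumes that unchanged entries are dropped from staging, so that only new or changed records remain to be moved at the end.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MetadataRecord:
    """One Type2 row in a hypothetical metadata table."""
    entity_key: str                       # e.g. a table name
    attributes: dict                      # crawled metadata for the entity
    valid_from: datetime
    valid_to: Optional[datetime] = None   # None means the row is current


def apply_crawl(metadata, staging, crawl_time):
    """Merge one complete crawl (staging) into the Type2 metadata rows."""
    staging = dict(staging)  # work on a copy; handled entries are removed
    current = {r.entity_key: r for r in metadata if r.valid_to is None}

    for key, record in current.items():
        if key not in staging:
            # Case 1: entity no longer exists at the source, so soft delete.
            record.valid_to = crawl_time
        elif staging[key] == record.attributes:
            # Case 2: unchanged, keep the row (assumption: drop it from staging).
            del staging[key]
        else:
            # Case 3: changed, close the old row; the new version stays in staging.
            record.valid_to = crawl_time

    # Final step: whatever remains in staging is new or changed, so move it over.
    for key, attributes in staging.items():
        metadata.append(MetadataRecord(key, attributes, valid_from=crawl_time))
    return metadata


# Two crawls: "tmp" disappears, "orders" changes, "users" stays the same.
t0, t1 = datetime(2024, 1, 1), datetime(2024, 1, 2)
rows = apply_crawl([], {"orders": {"cols": 5}, "users": {"cols": 3}, "tmp": {"cols": 1}}, t0)
rows = apply_crawl(rows, {"orders": {"cols": 6}, "users": {"cols": 3}}, t1)
for row in rows:
    print(row.entity_key, row.attributes, row.valid_from, row.valid_to)
```

Because unchanged entities never produce a new row, the history grows only when something actually changes, even though every crawl captures the metadata completely.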