DataSpider 101: Meta-data crawling made easy

DataSpider is a meta-data crawler for relational databases (currently Oracle, SQL Server, PostgreSQL, and MySQL) that is tightly integrated with the Kosh data model. DataSpider is itself meta-data driven: it follows prescriptive information entered into Kosh when a datastore is first registered. The primary purpose of DataSpider is to ensure that the meta-data stored in Kosh reflects the current state of the various enterprise datastores. The typical life cycle of the process is as follows:

  1. Prerequisites
    • Population of basic information into Kosh. This consists of just the basic connection information (see the first sketch after this list):
      • Location (host/port) information for the datastore
      • Type of database (Oracle, SQL Server, etc.)
      • Schemas in the datastore that should be subject to crawling
    • Access agreement from the owners of the databases to be crawled. One scenario we employed was a global user account with read-only access to the database system tables, which is sufficient to capture all of the meta-data needed for proper ingestion. Note: for actual ingestion to be orchestrated by the Bots, global read access to the data itself was also needed; this can be granted in a much more restrictive fashion if required.
    • Installation of the appropriate JDBC driver for each database type. Drivers are not bundled with DataSpider since there may be licensing issues in some cases.
    • Determination of basic operational settings, such as how frequently each datastore should be crawled - typically a daily crawl of the meta-data is desirable.
  2. DataSpider reads the meta-data noted above (crawling frequency, schemas, etc.) and begins to crawl the sources of information. The result of a crawl is to populate the Kosh database with information on the following (see the JDBC sketch after this list):
    • Tables in the Schemas being crawled
    • Column information for each of the Tables found during the crawl. This includes name, type, precision, and scale.
    • Optional statistics on each of the Tables found in the Schemas crawled.
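
As a concrete illustration of the prerequisite step, the sketch below shows the kind of prescriptive record that might be entered into Kosh for one datastore. The field names and values here are hypothetical, not Kosh's actual schema; they simply mirror the items listed under Prerequisites above.

```json
{
  "datastore_name": "sales_dw",
  "database_type": "postgresql",
  "host": "db-host.example.com",
  "port": 5432,
  "schemas_to_crawl": ["public", "finance"],
  "crawl_frequency": "daily",
  "collect_table_statistics": true
}
```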
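
DataSpider's internals are not shown on this page, but the standard JDBC DatabaseMetaData API is the natural way to gather exactly the table and column information described above. The following is a minimal, self-contained sketch of such a crawl, assuming a PostgreSQL source and a "public" schema marked for crawling; the connection URL, credentials, and schema name are hypothetical, and a real crawl would write the results into Kosh rather than print them.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class CrawlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; in DataSpider these would come
        // from the datastore record previously entered into Kosh.
        String url = "jdbc:postgresql://db-host.example.com:5432/sales";
        try (Connection conn = DriverManager.getConnection(url, "crawler", "secret")) {
            DatabaseMetaData md = conn.getMetaData();
            // Enumerate the tables in a schema marked for crawling.
            try (ResultSet tables = md.getTables(null, "public", "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    System.out.println("table: " + table);
                    // For each table, capture column name, type, precision, and scale.
                    try (ResultSet cols = md.getColumns(null, "public", table, "%")) {
                        while (cols.next()) {
                            System.out.printf("  %s %s precision=%d scale=%d%n",
                                    cols.getString("COLUMN_NAME"),
                                    cols.getString("TYPE_NAME"),
                                    cols.getInt("COLUMN_SIZE"),
                                    cols.getInt("DECIMAL_DIGITS"));
                        }
                    }
                }
            }
        }
    }
}
```

Note that with JDBC 4.0 and later, placing the vendor's driver jar on the classpath is enough; DriverManager locates the driver automatically, which is why driver installation (see Prerequisites) is the only database-specific setup step.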