DataSpider 101: Meta-data crawling made easy
DataSpider is a meta-data crawler for relational databases (currently Oracle, SQL Server, PostgreSQL, and MySQL) that is tightly integrated with the Kosh data model. DataSpider is also meta-data driven: it reads prescriptive information recorded when the basic datastore information is entered into Kosh. The primary purpose of DataSpider is to ensure that the meta-data stored in Kosh reflects the current state of the various enterprise datastores. The typical life cycle of the process is as follows:
- Prerequisites
- Population of basic information into Kosh. This is really just basic connection information (a minimal registration sketch appears at the end of this page):
- Location (host/port) information for the datastore
- Type of database (Oracle, SQL Server, etc.)
- Schemas in the datastore that should be subject to crawling
- Access agreement from the owners of the databases to be crawled. One scenario we employed was a global user account with read-only access to the database system tables, which is enough to capture all of the meta-data needed for proper ingestion. Note: for actual ingestion orchestrated by the Bots, global access to read the data itself was also needed; this can be granted in a much more restrictive fashion if necessary.
- Installation of the appropriate database JDBC drivers. These are not bundled with DataSpider since there may be licensing issues in some cases.
- Determination of basic operational data, such as how frequently each datastore should be crawled. Typically a daily crawl of the meta-data is desirable (see the scheduling sketch at the end of this page).
- DataSpider reads the crawl-frequency and other prescriptive meta-data noted above and begins crawling the sources of information (a JDBC crawl sketch appears at the end of this page). The result of a crawl is to populate the Kosh database with:
- Tables in the Schemas being crawled
- Column information for each of the Tables found during the crawl. This includes name, type, precision, and scale.
- Optional statistics on each of the Tables found in the Schemas crawled.
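
To make the prerequisite registration step concrete, here is a minimal sketch of the basic datastore information described above. The class and field names are illustrative assumptions, not the actual Kosh schema.

```java
import java.util.List;

// Illustrative only: names and types here are assumptions, not the actual Kosh schema.
public class DatastoreRegistration {
    final String host;          // location: database host
    final int port;             // location: listener port
    final String dbType;        // e.g. "oracle", "sqlserver", "postgresql", "mysql"
    final List<String> schemas; // schemas that should be subject to crawling

    DatastoreRegistration(String host, int port, String dbType, List<String> schemas) {
        this.host = host;
        this.port = port;
        this.dbType = dbType;
        this.schemas = schemas;
    }

    public static void main(String[] args) {
        // Hypothetical PostgreSQL datastore with two schemas marked for crawling.
        DatastoreRegistration reg = new DatastoreRegistration(
                "db.example.com", 5432, "postgresql", List.of("sales", "hr"));
        System.out.println(reg.dbType + " @ " + reg.host + ":" + reg.port);
    }
}
```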
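
The wiki does not show DataSpider's actual scheduler, but the daily-crawl frequency mentioned above could be driven by something as simple as a `ScheduledExecutorService`; the sketch below assumes a 24-hour period.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-crawl the datastore's meta-data once every 24 hours, starting immediately.
        // The Runnable body is a placeholder for the real crawl logic.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("crawling datastore meta-data..."),
                0, 24, TimeUnit.HOURS);
    }
}
```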
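
Finally, a sketch of what a crawl pass might look like over JDBC. The wiki does not document DataSpider's internals, so this uses the standard `java.sql.DatabaseMetaData` API; the URL, credentials, and schema name are hypothetical. With the appropriate driver jar on the classpath (the installation prerequisite above), JDBC 4+ drivers self-register and `DriverManager` selects one from the URL prefix.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SchemaCrawl {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details; the read-only account mirrors the
        // access-agreement prerequisite above.
        String url = "jdbc:postgresql://db.example.com:5432/sales";
        try (Connection conn = DriverManager.getConnection(url, "crawler_ro", "secret")) {
            DatabaseMetaData md = conn.getMetaData();
            // Tables in the schema being crawled ("sales" is an assumed schema name).
            try (ResultSet tables = md.getTables(null, "sales", "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    // Column information for each table: name, type, precision, scale.
                    try (ResultSet cols = md.getColumns(null, "sales", table, "%")) {
                        while (cols.next()) {
                            System.out.printf("%s.%s type=%s precision=%d scale=%d%n",
                                    table,
                                    cols.getString("COLUMN_NAME"),
                                    cols.getString("TYPE_NAME"),
                                    cols.getInt("COLUMN_SIZE"),
                                    cols.getInt("DECIMAL_DIGITS"));
                        }
                    }
                }
            }
        }
    }
}
```

In a real crawl the rows printed here would presumably be written into Kosh, with the optional table statistics gathered in a separate pass.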