Data Governance

Data Loss Prevention

| API | How to Recover After Middleware Downtime (Sequenceless CSV Recovery) | Duplicate Prevention | Example CSV Row | Example Idempotency Key | Idempotency Mapping |
|---|---|---|---|---|---|
| Cookies | Re-authenticate users and rebuild sessions from identity provider CSV exports; order irrelevant | Unique session IDs; invalidate old tokens | session_id,abc123,user_id,42,issued,2026-01-17T17:05Z | idemp:cookies:abc123 | session_id → user_id |
| CORS | Import CSV of authorized origins; apply policies regardless of row order | Request IDs checked against audit trail | request_id,789,origin,https://example.com,status,allowed | idemp:cors:789 | request_id → origin |
| WebAuthn | Re-issue challenges from CSV of credential bindings; verify against identity store | Nonce values ensure one-time use | user_id,42,challenge_nonce,xyz456,verified,true | idemp:webauthn:xyz456 | challenge_nonce → user_id |
| WebRTC | Re-establish sessions using CSV of peer metadata; renegotiate without relying on sequence | Sequence numbers embedded in rows prevent replay | stream_id,12,packet_seq,345,timestamp,2026-01-17T17:06Z | idemp:webrtc:12:345 | stream_id + packet_seq → timestamp |
| WebSockets | Replay missed messages from CSV broker dump; process by ID, not order | Message IDs enforce uniqueness | message_id,555,channel,updates,payload,order_shipped | idemp:websocket:555 | message_id → payload |
| Server-Sent Events | Resume streams using CSV of events with IDs; replay by ID rather than sequence | Event IDs skip duplicates | event_id,999,type,notification,data,new_comment | idemp:sse:999 | event_id → data |
| Fetch API | Re-apply operations from CSV logs; reconcile against backend ground truth | Idempotency keys prevent duplication | op_id,1234,method,POST,resource,/orders,status,success | idemp:fetch:1234 | op_id → resource |
| Service Workers | Re-fetch authoritative assets listed in CSV; order irrelevant | Checksums validate uniqueness | asset_id,css001,version,2.0,checksum,sha256:abcd1234 | idemp:sw:css001:sha256:abcd1234 | asset_id + checksum → version |
| Web Payments API | Re-submit transactions from CSV export; reconcile against processor records | Transaction IDs enforce one entry | txn_id,pay777,amount,49.99,currency,USD,status,completed | idemp:payment:pay777 | txn_id → amount + currency |
| Webhooks | Replay webhook deliveries from CSV provider dump; verify signatures per row | Signature validation and idempotency keys | webhook_id,gh123,event,push,repo,myrepo,signature,sha256:efgh5678 | idemp:webhook:gh123 | webhook_id → event + repo |
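
A minimal sketch of the replay pattern, using the webhook row shape from the table (rows as alternating field,value cells); the in-memory set stands in for a durable idempotency-key store, and the handler is a placeholder:

```python
import csv

seen = set()  # stands in for a durable idempotency-key store (e.g., Redis)

def replay_webhooks(path):
    """Replay a provider CSV dump; order-independent, duplicates skipped."""
    with open(path, newline="") as f:
        for cells in csv.reader(f):
            row = dict(zip(cells[::2], cells[1::2]))    # field,value pairs
            key = f"idemp:webhook:{row['webhook_id']}"  # key format per table
            if key in seen:
                continue  # duplicate delivery: already processed
            seen.add(key)
            # per-row signature verification omitted for brevity
            handle(row)

def handle(row):
    print(f"replayed {row['event']} on {row['repo']}")
```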

Documentation

DataRails

DATABASE DOCUMENTATION TOOLKIT (1996 EDITION)

This guide outlines how to consolidate modular database documentation into a physical, reproducible format using tools available in 1996. It supports traceability, institutional memory, and operational hygiene.


A. ARTIFACT TYPES

A1. UML DIAGRAMS

  • Multiple diagrams showing how different parts of the database relate
  • Printed on letter-size paper, organized by domain or subsystem

A2. SPREADSHEETS

  • Tables listing schema details: table names, data types, constraints, relationships
  • Printed in landscape format for readability
  • Include version and author in footer

A3. PDF SNAPSHOTS

  • Printed versions of database structure at key points in time
  • Used for archival reference and compliance

A4. REFERENCE MATERIALS

  • Dictionary: defines technical terms
  • Glossary: explains domain-specific language
  • Thesaurus: maps synonyms and related concepts

A5. CHANGE LOG

  • Printed log of updates: who made them, when, and why
  • Includes timestamps, initials, and reason for change

B. BINDER STRUCTURE

B1. TAB DIVIDERS

  • Use labeled tabs for each artifact type:
    • "Diagrams"
    • "Schema Tables"
    • "Snapshots"
    • "Glossary"
    • "Change Log"

B2. COVER SHEETS

  • Each section begins with a versioned cover sheet:
    • Title
    • Version number
    • Date
    • Responsible author

B3. PAGE FOOTERS

  • Every page includes:
    • Document ID
    • Revision number
    • Section code (e.g., A2 for Spreadsheets)

C. CROSS-REFERENCING

C1. MASTER INDEX

  • Printed index at the front of the binder
  • Lists all sections and page numbers

C2. REFERENCE CODES

  • Use internal references like:
    • "See Table 3.2 in Schema Tables"
    • "Refer to Diagram D4 in Section A1"

D. DISTRIBUTION & REDUNDANCY

D1. PHYSICAL COPIES

  • Maintain two binders:
    • One active copy for daily use
    • One archival copy in secure storage

D2. TEAM ACCESS

  • Photocopy key sections for individual binders
  • Use interoffice mail or courier for remote teams

D3. UPDATE MEMOS

  • Include printed memos with update instructions
  • Note affected sections and version changes

E. UPDATE PROTOCOL

E1. DOCUMENTATION STEWARD

  • Assign one person to manage updates and binder integrity

E2. REQUEST FORMS

  • Use printed forms to submit documentation changes
  • Include reason, affected section, and proposed revision

E3. MONTHLY REVIEWS

  • Schedule monthly binder audits
  • Reprint and replace outdated sections

Roles

+---------------------------------+
|               CEO               |
|  - Sets governance vision       |
|  - Prioritizes compliance       |
|  - Favors Top-Down models       |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|       Chief Data Officer        |
|  - Defines governance rules     |
|  - Balances control vs agility  |
|  - Coordinates Federated models |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|         Data Architects         |
|  - Lineage-aware modeling       |
|  - Top-Down or Bottom-Up        |
|  - Document schema & flows      |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|         Data Engineers          |
|  - Build ingestion pipelines    |
|  - Often favor Bottom-Up        |
|  - Post-hoc documentation       |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|       ML/Analytics Teams        |
|  - Use Adaptive governance      |
|  - Prioritize experiments       |
|  - Document features ad hoc     |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|        Business Analysts        |
|  - Interface with dashboards    |
|  - Need readable lineage        |
|  - Rely on Federated clarity    |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|        IT Support / Ops         |
|  - Maintain infrastructure      |
|  - Enforce access controls      |
|  - Reference compliance docs    |
+----------------+----------------+
                 |
                 v
+---------------------------------+
|             Janitor             |
|  - Main arbiter of security     |
|  - Cleans up data mess          |
|  - Ultimate visibility          |
+---------------------------------+

Examples

National Student Clearinghouse:
  Description: The National Student Clearinghouse offers the NextGen API, which gives institutions a secure, real-time way to automate transcript ordering and electronic delivery between the Clearinghouse and the institution's student information system (SIS).
  API Documentation: https://help.studentclearinghouse.org/pdp/knowledge-base/submitting-data-files-through-api/
  GitHub Repository: https://github.com/NationalStudentClearinghouse

Bureau of Labor Statistics:
  Description: The Bureau of Labor Statistics (BLS) provides a Public Data API that allows developers to retrieve published historical time series data in JSON format or as an Excel spreadsheet. The API supports both GET and POST requests and is available in two versions: Version 2.0 (requires registration) and Version 1.0 (open for public use).
  API Documentation: https://www.bls.gov/developers/home.htm
  GitHub Repository: https://github.com/dsagher/Bureau-of-Labor-Statistics-API-Project
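
A minimal sketch of a Version 2.0 request, following the payload shape in the BLS developer docs; the series ID is just a common CPI example, and the registration key is a placeholder:

```python
import json
import urllib.request

payload = {
    "seriesid": ["CUUR0000SA0"],            # CPI-U, all items (example series)
    "startyear": "2023",
    "endyear": "2024",
    "registrationkey": "YOUR_BLS_API_KEY",  # required for v2 only
}
req = urllib.request.Request(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # POST with a JSON body
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for series in data["Results"]["series"]:
    for obs in series["data"]:
        print(obs["year"], obs["periodName"], obs["value"])
```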

Tools

open_source_tools:
  - name: "OpenRefine"
    description: "A powerful tool for working with messy data and improving its quality. It allows users to clean, transform, and enrich data through a user-friendly interface."
    url: "http://openrefine.org/"

  - name: "Data Quality Tool Kit (DQTK)"
    description: "A suite of tools for assessing and improving data quality, including data profiling, data cleansing, and data validation."
    url: "https://github.com/open-dq/data-quality-toolkit"

  - name: "Apache Griffin"
    description: "An open-source Data Quality framework that provides a comprehensive set of tools for data quality management, including data lineage, data quality measurement, and data quality monitoring."
    url: "https://griffin.apache.org/"
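
For a feel of the profiling, cleansing, and validation steps these tools automate, here is a minimal pandas sketch (the file and column names are hypothetical, and a real tool does far more):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input

# Profiling: missing values per column and exact-duplicate rows
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Cleansing: normalize a key field, then drop exact duplicates
df["email"] = df["email"].str.strip().str.lower()
df = df.drop_duplicates()

# Validation: flag rows that fail a simple rule
invalid = df[~df["email"].str.contains("@", na=False)]
print("invalid emails:", len(invalid))
```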

Videos

  1. Apache Atlas Introduction: Need for Governance and Metadata Management
  2. Installation & Configuration of Apache ATLAS Part 1
  3. Installation & Configuration of Apache ATLAS Part 2
  4. Data Governance using Apache ATLAS
  5. Apache Atlas: A Hands-on Course
  6. Apache Atlas Wiki

Prerequisites

data_governance_limits:
  data_quality: "Automation relies on high-quality data. Inaccurate or incomplete data can lead to errors and poor decision-making."
  complexity: "Data governance involves complex processes and policies, making automation difficult with diverse data sources and systems."
  human_oversight: "Human oversight is necessary for complex decision-making, exception handling, and ensuring compliance with regulations."
  integration: "Integrating automated tools with existing systems and processes can be challenging, especially with legacy systems."
  scalability: "Maintaining the scalability of automated governance tools as data volumes grow can be difficult."
  security: "Ensuring automated processes are secure and comply with data protection regulations is crucial."
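
The human_oversight limit is the easiest to make concrete: automation should decide only clear-cut cases and escalate the rest. A toy sketch, with a completeness score standing in for a real quality metric:

```python
def quality_score(record: dict) -> float:
    """Fraction of fields present (a stand-in for a real metric)."""
    return sum(v is not None for v in record.values()) / len(record)

review_queue = []

def triage(record: dict) -> str:
    score = quality_score(record)
    if score >= 0.9:
        return "accept"          # clearly fine: safe to automate
    if score <= 0.3:
        return "reject"          # clearly broken: safe to automate
    review_queue.append(record)  # ambiguous: route to a human
    return "escalated"

print(triage({"id": 1, "email": "a@b.co", "phone": None, "name": "Ada"}))
```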

Apache Atlas

Comparison Chart

| Tool | License Type | Key Features |
|---|---|---|
| Apache Atlas | Apache License 2.0 | Metadata management, data lineage tracking, data cataloging |
| Amundsen | Apache License 2.0 | Data discovery, metadata management, collaboration tools |
| DataHub | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
| Magda | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
| Open Metadata | Apache License 2.0 | Metadata management, data cataloging, data lineage tracking |
| Egeria | Apache License 2.0 | Metadata management, data lineage tracking, data cataloging |
| Truedat | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
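
Apache Atlas, the tool this page focuses on, exposes a v2 REST API. A sketch of a basic search against a local sandbox install, assuming the default port and demo credentials (replace both for real use):

```python
import base64
import json
import urllib.request

# Default sandbox values; host, port, and credentials are assumptions
url = "http://localhost:21000/api/atlas/v2/search/basic?typeName=hive_table"
token = base64.b64encode(b"admin:admin").decode()

req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

for entity in result.get("entities", []):
    print(entity["typeName"], entity["attributes"].get("qualifiedName"))
```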

ENV

environment_variables:
  - METADATA_CLIENT_HEAP: "1024m"
  - JAVA_HOME: "/path/to/your/java"
  - LOG_DIR: "/path/to/your/logs"
  - METADATA_COLLECTOR_ENABLED: true
  - KNOX_ENABLED: true
  - LDAP_ENABLED: true
  - TLS_ENABLED: true
  - KERBEROS_ENABLED: true
  - METADATA_OPTS: "-Xmx1024m"
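
How these variables are consumed depends on the deployment; one sketch, assuming an Atlas-style start script (the script path follows Atlas's bin/atlas_start.py, and the values above are placeholders):

```python
import os
import subprocess

# Export the governance-related variables, then launch the service
env = dict(
    os.environ,
    METADATA_CLIENT_HEAP="1024m",
    JAVA_HOME="/path/to/your/java",
    LOG_DIR="/path/to/your/logs",
    METADATA_COLLECTOR_ENABLED="true",
    KNOX_ENABLED="true",
    LDAP_ENABLED="true",
    TLS_ENABLED="true",
    KERBEROS_ENABLED="true",
    METADATA_OPTS="-Xmx1024m",
)
subprocess.run(["bin/atlas_start.py"], env=env, check=True)
```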
