Data Governance - sgml/signature GitHub Wiki
Tools
open_source_tools:
  - name: "OpenRefine"
    description: "A tool for working with messy data and improving its quality. It lets users clean, transform, and enrich data through a browser-based interface."
    url: "http://openrefine.org/"
  - name: "Data Quality Tool Kit (DQTK)"
    description: "A suite of tools for assessing and improving data quality, including data profiling, data cleansing, and data validation."
    url: "https://github.com/open-dq/data-quality-toolkit"
  - name: "Apache Griffin"
    description: "An open-source data quality framework for big data that supports batch and streaming modes and covers data quality definition, measurement, and monitoring."
    url: "https://griffin.apache.org/"
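The profiling step these tools perform can be illustrated with a minimal sketch. It is not taken from any of the tools listed; `profile_column` and the sample rows are hypothetical, and blank strings are assumed to count as missing values.

```python
from collections import Counter


def is_missing(value):
    """Assumption: None and blank strings both count as missing."""
    return value is None or str(value).strip() == ""


def profile_column(rows, column):
    """Return simple quality metrics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Every repeat beyond the first occurrence counts as a duplicate.
        "duplicates": sum(c - 1 for c in counts.values()),
    }


# Hypothetical sample data for illustration only.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": None},
]
report = profile_column(sample, "email")
print(report)
```

A report like this is the kind of input a cleansing or validation step would act on, for example by flagging columns whose missing count exceeds an agreed threshold.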
Videos
- Apache Atlas Introduction: Need for Governance and Metadata Management
- Installation & Configuration of Apache ATLAS Part 1
- Installation & Configuration of Apache ATLAS Part 2
- Data Governance using Apache ATLAS
- Apache Atlas: A Hands-on Course
- Apache Atlas Wiki
Prerequisites
data_governance_limits:
  data_quality: "Automation relies on high-quality data. Inaccurate or incomplete data leads to errors and poor decision-making."
  complexity: "Data governance involves complex processes and policies, which makes automation difficult across diverse data sources and systems."
  human_oversight: "Human oversight is necessary for complex decision-making, exception handling, and ensuring compliance with regulations."
  integration: "Integrating automated tools with existing systems and processes can be challenging, especially with legacy systems."
  scalability: "Keeping automated governance tools scalable as data volumes grow can be difficult."
  security: "Automated processes must be secure and must comply with data protection regulations."
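The data quality and human oversight limits above can be sketched as a simple routing rule: complete records pass automatically, partially complete ones go to a person, and the rest are rejected. This is one possible arrangement, not a prescribed policy; `route_record`, `REQUIRED_FIELDS`, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical required metadata fields for a governed record.
REQUIRED_FIELDS = ["name", "owner", "classification"]


def route_record(record, required_fields=REQUIRED_FIELDS, min_completeness=0.5):
    """Decide whether a record can be handled without human review."""
    filled = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    completeness = filled / len(required_fields)
    if filled == len(required_fields):
        return "auto_approve"   # nothing missing, automation is safe
    if completeness >= min_completeness:
        return "human_review"   # exception handling stays with a person
    return "reject"             # too incomplete to act on


complete = {"name": "orders", "owner": "sales", "classification": "internal"}
partial = {"name": "orders", "owner": "sales", "classification": ""}
sparse = {"name": "orders"}

print(route_record(complete), route_record(partial), route_record(sparse))
```

The threshold is deliberately a parameter, since where to draw the line between automation and review is a policy decision rather than a technical one.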
Apache Atlas
Comparison Chart
| Tool | License Type | Key Features |
|---|---|---|
| Apache Atlas | Apache License 2.0 | Metadata management, data lineage tracking, data cataloging |
| Amundsen | Apache License 2.0 | Data discovery, metadata management, collaboration tools |
| DataHub | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
| Magda | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
| OpenMetadata | Apache License 2.0 | Metadata management, data cataloging, data lineage tracking |
| Egeria | Apache License 2.0 | Metadata management, data lineage tracking, data cataloging |
| Truedat | Apache License 2.0 | Data cataloging, metadata management, data lineage tracking |
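For Apache Atlas, the first entry in the chart, metadata search and lineage are exposed through its v2 REST API. The sketch below only builds request URLs for the basic search and lineage endpoints and makes no network calls. The host `http://localhost:21000`, the `hive_table` type, and the GUID are placeholder assumptions, and the helper names are hypothetical; check the endpoint parameters against the documentation for your Atlas version.

```python
from urllib.parse import quote, urlencode


def basic_search_url(base_url, type_name, query=None, limit=10):
    """Build a URL for Atlas basic search (GET /api/atlas/v2/search/basic)."""
    params = {"typeName": type_name, "limit": limit}
    if query:
        params["query"] = query
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{urlencode(params)}"


def lineage_url(base_url, guid, direction="BOTH", depth=3):
    """Build a URL for entity lineage (GET /api/atlas/v2/lineage/{guid})."""
    params = {"direction": direction, "depth": depth}
    path = quote(guid, safe="")  # escape the GUID before placing it in the path
    return f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{path}?{urlencode(params)}"


# Placeholder host and identifiers, for illustration only.
ATLAS = "http://localhost:21000"
search = basic_search_url(ATLAS, "hive_table", query="sales")
lineage = lineage_url(ATLAS + "/", "abc-123")
print(search)
print(lineage)
```

Keeping URL construction separate from the HTTP call makes the request shape easy to inspect and test before pointing it at a real server with credentials.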
ENV
environment_variables:
  - METADATA_CLIENT_HEAP: "1024m"
  - JAVA_HOME: "/path/to/your/java"
  - LOG_DIR: "/path/to/your/logs"
  - METADATA_COLLECTOR_ENABLED: true
  - KNOX_ENABLED: true
  - LDAP_ENABLED: true
  - TLS_ENABLED: true
  - KERBEROS_ENABLED: true
  - METADATA_OPTS: "-Xmx1024m"
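Before starting a service with these settings, it can help to check that the placeholder paths were replaced and that the flags parse as booleans. The following is a minimal sketch that reuses the variable names from the listing above; `check_env`, `parse_bool`, and `heap_from_opts` are hypothetical helpers, not part of any tool's startup scripts, and the accepted boolean spellings are an assumption.

```python
import re

BOOLEAN_FLAGS = [
    "METADATA_COLLECTOR_ENABLED",
    "KNOX_ENABLED",
    "LDAP_ENABLED",
    "TLS_ENABLED",
    "KERBEROS_ENABLED",
]


def parse_bool(value):
    """Assumption: these spellings mean 'enabled'; anything else is false."""
    return str(value).strip().lower() in ("1", "true", "yes", "on")


def heap_from_opts(opts):
    """Extract the -Xmx value from a JVM options string, or None if absent."""
    match = re.search(r"-Xmx(\d+[kKmMgG]?)", opts or "")
    return match.group(1).lower() if match else None


def check_env(env):
    """Return (flags, problems) for a mapping of environment variables."""
    problems = []
    for name in ("JAVA_HOME", "LOG_DIR"):
        value = env.get(name, "")
        if not value or value.startswith("/path/to/"):
            problems.append(f"{name} is unset or still a placeholder")
    flags = {name: parse_bool(env.get(name, "false")) for name in BOOLEAN_FLAGS}
    return flags, problems


# Values copied from the listing; pass os.environ instead in real use.
example_env = {
    "METADATA_CLIENT_HEAP": "1024m",
    "JAVA_HOME": "/path/to/your/java",
    "LOG_DIR": "/path/to/your/logs",
    "METADATA_COLLECTOR_ENABLED": "true",
    "KNOX_ENABLED": "true",
    "LDAP_ENABLED": "true",
    "TLS_ENABLED": "true",
    "KERBEROS_ENABLED": "true",
    "METADATA_OPTS": "-Xmx1024m",
}
flags, problems = check_env(example_env)
print(flags)
print(problems)
print(heap_from_opts(example_env["METADATA_OPTS"]))
```

With the listing's values unchanged, both path variables are reported as placeholders, which is the reminder the check is meant to give.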