Module : Extract, Load, Transform - waidyanatha/dongcha GitHub Wiki
Introduction
The wrangler app is central to ETL tasks:
- Extracts or streams data in various formats from any source, mainly using `utils/etl/loads` Apache Spark workloads:
  - Spark file workloads (e.g., CSV, TXT, PDF, JSON)
  - Spark RDBMS workloads (e.g., PostgreSQL, MySQL, etc.)
  - Spark NoSQL workloads for unstructured data (e.g., MongoDB, CouchDB, etc.)
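The three workload families above can be sketched as one helper that picks the Spark reader format and options for each source kind. This is a minimal, hypothetical illustration, not the actual `utils/etl/loads` API; the connection URLs and option keys are placeholders you would replace with real values.

```python
# Hypothetical sketch: map a source kind to the format and options
# that would be passed to a Spark DataFrameReader. All URLs and
# option values below are illustrative placeholders.
def reader_config(kind: str) -> dict:
    """Return a {format, options} config for spark.read."""
    configs = {
        # file workloads (CSV shown; TXT/JSON are analogous)
        "csv": {
            "format": "csv",
            "options": {"header": "true", "inferSchema": "true"},
        },
        # RDBMS workloads over JDBC (PostgreSQL shown)
        "postgres": {
            "format": "jdbc",
            "options": {
                "url": "jdbc:postgresql://localhost:5432/mydb",
                "driver": "org.postgresql.Driver",
                "dbtable": "public.my_table",
            },
        },
        # NoSQL workloads (MongoDB via the Spark connector)
        "mongodb": {
            "format": "mongodb",
            "options": {"connection.uri": "mongodb://localhost:27017"},
        },
    }
    return configs[kind]

# Usage with a live SparkSession (not run here):
# cfg = reader_config("postgres")
# df = spark.read.format(cfg["format"]).options(**cfg["options"]).load()
```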
- Transforms the data, using `utils/etl/transform`, into a format that makes domain and functional sense:
  - The data is extracted and stored in a cleansed and raw form.
  - Raw data is further cleaned, transformed, catalogued, and historically archived.
  - Historic data is available for further curation and for data mining (AI/ML), visual analytics, and data mart services.
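The raw-to-cleansed step above can be illustrated with a small sketch. The real pipeline operates on Spark DataFrames; this pure-Python version on lists of dicts is only meant to show the idea (drop incomplete rows, normalise column names), and the function name `cleanse` is hypothetical.

```python
# Hypothetical cleansing step: the actual utils/etl/transform works on
# Spark DataFrames; plain dicts are used here to keep the sketch runnable.
def cleanse(records: list[dict]) -> list[dict]:
    """Drop rows with missing values and normalise keys to snake_case."""
    cleaned = []
    for row in records:
        if any(value is None for value in row.values()):
            continue  # incomplete rows stay in the raw layer only
        cleaned.append(
            {key.strip().lower().replace(" ", "_"): value
             for key, value in row.items()}
        )
    return cleaned

raw = [
    {"First Name": "Ann", "Age": 34},
    {"First Name": None, "Age": 51},   # dropped: missing value
]
print(cleanse(raw))  # → [{'first_name': 'Ann', 'age': 34}]
```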
- The ETL processes are usually automated with Airflow using DAG files.
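An Airflow DAG for the extract-then-transform flow could look like the sketch below. The DAG id, schedule, and task callables are all hypothetical placeholders, not the project's actual DAG files.

```python
# Hypothetical Airflow DAG sketching how the ETL steps might be chained;
# dag_id, schedule, and the callables are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # e.g., invoke utils/etl/loads to read from the source


def transform():
    ...  # e.g., invoke utils/etl/transform to cleanse and catalogue


with DAG(
    dag_id="wrangler_etl",            # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    t_extract >> t_transform          # transform runs after extract succeeds
```

The `>>` operator declares the task dependency, so Airflow schedules `transform` only once `extract` has completed.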