2021 Q1

  1. Support for Synapse spark-pool: https://github.com/ja-guzzle/guzzle_common/issues/442
  2. Purview:
    • Let Purview scan the data assets (the sources and targets), and use guzzle jobs to define the lineage
    • Study how it supports ADF - we are still not sure if it is based on run-time, i.e. when the job runs it tries to identify the sources and targets it is using and then creates the lineage
    • Study ADF Dataflows - from what we could find it is supported, including column-level lineage
    • There is complexity around file-name patterns with wildcards and recursion, and around how Purview groups multiple files as one dataset (from the research we did, it groups purely by file name, not file content). Example: part-01, part-02, ... are grouped as one dataset even if the content is very different, and the schema is taken from one of them; however, if emp and employee hold similar data they are still treated as separate datasets.
    • We can think about this completely fresh and ignore what we have done for Atlas
    • We still need to support scanning SQL and getting all tables, but the lineage graph should be formed by Purview as soon as we link a source and sink dataset with a guzzle job (see the sketch after this list)
    • Pipelines and batches are out - only activities will be brought in
    • We need to bring in the YML and the link to the guzzle UI (like Atlas)
    • The aim is exploratory, not a definite spec we want to deliver. Once we study this and see what is possible, we define the spec, the UI, and other integration changes
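To make the exploration concrete, here is a minimal sketch (the account name, token acquisition, entity type names, and qualifiedNames are all placeholder assumptions, not a decided design) of registering a guzzle job as an Atlas-style Process entity through Purview's Atlas-compatible REST API. Linking the inputs and outputs of the Process is what makes Purview form the lineage edge between already-scanned datasets:

```python
# A hedged sketch, not Guzzle code: register a guzzle job as a lineage Process
# in Purview via its Apache Atlas-compatible catalog API.
import requests

PURVIEW_ACCOUNT = "my-purview-account"  # hypothetical account name
CATALOG = f"https://{PURVIEW_ACCOUNT}.purview.azure.com/catalog/api/atlas/v2"
TOKEN = "..."  # AAD bearer token for Purview; acquisition omitted here

def ref(type_name: str, qualified_name: str) -> dict:
    """Reference an already-scanned dataset entity by its unique attribute."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}

process_entity = {
    "entity": {
        "typeName": "Process",  # a custom guzzle_job subtype could be defined instead
        "attributes": {
            "qualifiedName": "guzzle://jobs/ingest_customer@dev",  # hypothetical
            "name": "ingest_customer",
            # Linking source and sink datasets is what draws the lineage edge:
            "inputs": [ref("azure_datalake_gen2_path",
                           "https://lake.dfs.core.windows.net/raw/customer")],
            "outputs": [ref("azure_sql_dw_table",
                            "mssql://synapse.database.windows.net/dw/dbo/customer")],
        },
    }
}

resp = requests.post(f"{CATALOG}/entity", json=process_entity,
                     headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
```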
  1. Making jobs work in the analytics cluster: https://github.com/ja-guzzle/guzzle_common/issues/411
  2. GitHub issues: https://github.com/ja-guzzle/guzzle_common/issues/640
  3. Review the approach for the init script: https://github.com/ja-guzzle/guzzle_common/issues/397
  4. Upgrade utility: https://github.com/ja-guzzle/guzzle_common/issues/560
  5. Timezone (a small sketch follows these points):
    • Audit logs in the guzzle repo - always in UTC (it does not matter who makes or updates the entry, whether it is Spark or JVMs on the Guzzle machine)
    • Running logs - also in UTC; it does not matter where they get generated
    • The timestamps in the user data, like current_timestamp for effective_start/end, refresh date, etc.
    • Business date that we generate in the init batch UI - we can leave it to the local machine date-time
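A minimal sketch of these rules (the helper names are assumptions): log timestamps are always generated in UTC regardless of where the code runs, while the suggested business date may come from the local machine clock:

```python
from datetime import datetime, timezone, date

def audit_timestamp() -> datetime:
    # Always UTC, whether the entry comes from Spark or a JVM on the Guzzle machine.
    return datetime.now(timezone.utc)

def suggested_business_date() -> date:
    # Per the note above, the init-batch UI can use the local machine date-time.
    return datetime.now().date()

print(audit_timestamp().isoformat())   # e.g. 2021-02-01T09:30:00.123456+00:00
print(suggested_business_date())       # local calendar date
```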
  6. Exception handling
    • The key focus is to avoid generic messages like "Key not found: threshold" for ingestion
    • Subset of items from here: https://github.com/ja-guzzle/guzzle_common/issues?q=is%3Aissue+is%3Aopen+exception+handling
    • Detailed exception handling of every condition, based on the APIs/libs we are calling, can be a very complex task for the existing code; we can take it up in the future
    • Going forward we can try to handle as many specific known exceptions as possible in the code, with proper user-friendly messages (applies to all)
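For illustration, a minimal sketch (the helper and error type are hypothetical, not existing Guzzle code) of turning a raw `Key not found` into a message that tells the user exactly what to fix:

```python
class GuzzleConfigError(Exception):
    """User-facing configuration error carrying job and section context."""

def get_required(config: dict, key: str, job: str, section: str):
    try:
        return config[key]
    except KeyError:
        # Replace the generic "Key not found: <key>" with an actionable message.
        raise GuzzleConfigError(
            f"Ingestion job '{job}': required property '{key}' is missing from "
            f"the '{section}' section. Add it to the job config and rerun."
        ) from None

# get_required(cfg, "threshold", "ingest_customer", "validation")
```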

  1. Parameter handling: https://github.com/ja-guzzle/guzzle_common/issues/591 (maybe a param which was being manipulated in Groovy had an issue)
    • Special params like batch_id, business_date, etc. should not be overridable; overrides should be handled gracefully, stating which effective values get picked up
    • The handling of this in the context of rerunning an individual activity (we exclude certain params like log file and job instance id), getting the previous business date, and resume for pipelines and stages
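A minimal sketch of guarding the special params (the reserved set and helper name are assumptions drawn from the notes above, not Guzzle's actual implementation):

```python
import logging

RESERVED = {"batch_id", "business_date", "job_instance_id", "log_file"}

def effective_params(system: dict, user: dict) -> dict:
    """Overlay user params, but let the Guzzle-computed (reserved) values win."""
    for key in RESERVED & user.keys():
        # Don't fail the run; state clearly which value is effective instead.
        logging.warning("Param '%s' is reserved; ignoring override %r, using %r",
                        key, user[key], system.get(key))
    return {**user, **system}  # system (reserved) values take precedence
```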
  2. Remove this feature: https://github.com/ja-guzzle/guzzle_common/issues/601
  3. If the default compute in guzzle.yml is not present (or deleted accidentally), the API does not come up. Review the usage of default_compute and consider the following: make it user-specific. For data sampling, if the user hasn't overridden or chosen one, we take it from guzzle.yml. The Spark override uses this to retrieve the DBR and node types; if a job is submitted with default Spark we may be using this.
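One possible resolution order, sketched below (field and function names are assumptions): per-user preference first, then guzzle.yml, and a clear error instead of the whole API failing to start when neither points to an existing environment:

```python
def resolve_default_compute(user_prefs: dict, guzzle_yml: dict,
                            environments: dict) -> str:
    """Pick a default compute for sampling / default-Spark job submission."""
    for candidate in (user_prefs.get("default_compute"),
                      guzzle_yml.get("default_compute")):
        if candidate and candidate in environments:
            return candidate
    raise ValueError(
        "No valid default_compute configured; set it in guzzle.yml (or per "
        "user) to one of: " + ", ".join(sorted(environments))
    )
```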
  4. UI handling for fields which take SQL, column names, and literals (with and without quotes) - how to distinguish them: https://github.com/ja-guzzle/guzzle_common/issues/533
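One possible convention, sketched here purely as a discussion aid (an assumption, not a decided spec): quoted values are literals, bare identifiers are column names, and everything else is treated as a SQL expression:

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def classify_field(value: str) -> str:
    v = value.strip()
    if (v.startswith("'") and v.endswith("'")) or \
       (v.startswith('"') and v.endswith('"')):
        return "literal"      # 'ACTIVE' or "ACTIVE"
    if IDENTIFIER.match(v):
        return "column"       # status
    return "sql"              # upper(status), status || '_x', ...
```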

  1. Ingestion and Processing
    • Handling of special column names
    • Table names with [ ] / qualifiers for the supported first-class datastores: Delta, Hive, Synapse, and files; for JDBC we can default it to MySQL
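The quoting characters themselves are standard per engine; only the helper shape below is an assumption. Delta, Hive, and MySQL quote identifiers with backticks, while Synapse (T-SQL) uses square brackets:

```python
def quote_identifier(name: str, datastore: str) -> str:
    """Quote a column/table identifier for the given first-class datastore."""
    if datastore in ("delta", "hive", "mysql"):  # JDBC defaulted to MySQL
        return "`" + name.replace("`", "``") + "`"
    if datastore == "synapse":
        return "[" + name.replace("]", "]]") + "]"
    return name  # files: pass column names through as-is

# quote_identifier("order value", "synapse") -> [order value]
# quote_identifier("select", "hive")         -> `select`
```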
  2. Guzzle resolves schema at runtime, strictly by column names. However, at design time, if someone wants to review the effective mapping, we want to provide the ability to show the effective mapping from source to target for ingestion and processing jobs (for SQL we just take the column list of the SQL as-is). This is a little similar to sampling, where we apply source rules and get the effective source schema; now we take it forward, apply the transformations, and finally show the mapping to target (and reject).
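A minimal sketch of that design-time view (column names here are illustrative): resolve source columns, post-transformation, against target columns strictly by name and report what maps, what falls out, and what stays unfilled:

```python
def effective_mapping(source_cols: list, target_cols: list) -> dict:
    target_set = set(target_cols)
    source_set = set(source_cols)
    return {
        "mapped": [c for c in source_cols if c in target_set],
        "unmapped_source": [c for c in source_cols if c not in target_set],
        "unfilled_target": [c for c in target_cols if c not in source_set],
    }

# effective_mapping(["id", "name", "src_only"], ["id", "name", "created_at"])
# -> mapped: id, name | unmapped_source: src_only | unfilled_target: created_at
```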
  3. Ability to explicitly specify the file names: https://github.com/ja-guzzle/ingestion/issues/191
  4. Processing
    • For Delta today we use the Hive endpoint to use the Spark engine; instead we want to provide an option in Delta processing for the user to select whether they want Spark DataFrames or a SQL template (see the sketch below)
    • Bring some alignment / restrictions in the UI for the template-based ETL
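A minimal PySpark sketch of the two execution options (the config keys are assumptions, not Guzzle's actual schema): the same processing step either runs a rendered SQL template or goes through the DataFrame API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-processing").getOrCreate()

def run_step(cfg: dict):
    if cfg.get("engine") == "sql_template":
        # SQL-template path: run the rendered template directly on Delta tables.
        return spark.sql(cfg["sql"])
    # DataFrame path (default).
    return (spark.table(cfg["source_table"])
                 .filter(cfg["filter"])
                 .selectExpr(*cfg["columns"]))

result = run_step({
    "engine": "dataframe",
    "source_table": "raw.customer",
    "filter": "load_date = current_date()",
    "columns": ["id", "upper(name) AS name"],
})
result.write.format("delta").mode("append").saveAsTable("curated.customer")
```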

UI

  1. Whether there is a false "modified" indicator
  2. Or stating that the config has changed (in admin) as we move to another menu
  3. Things hidden or not editable
  4. Human