APP : Module : Packages

Introduction

Key Components

APP.CFG

The utils app.cfg file holds the primary configuration requirements, setting the respective properties for each module and package.
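
A minimal sketch of reading such a configuration, assuming app.cfg follows the standard INI format that Python's configparser reads; the section and option names below are hypothetical placeholders, not dongcha's actual keys:

    from configparser import ConfigParser

    # Load app.cfg, assuming it is INI-formatted. The section and
    # option names below are hypothetical placeholders.
    config = ConfigParser()
    config.read("app.cfg")

    # Look up a hypothetical storage setting with a default fallback.
    store_mode = config.get("DATASTORE", "storeMode", fallback="local-fs")
    print(store_mode)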

Modules

ETL

Load

Files Read/Write

The FilesRW package provides two simple functions:

  1. The read_data function reads a set of files into the dtype specified by as_type, such as dict, dataframe, list, etc.
  2. The write_data function writes data of a given dtype into a file of the desired format, such as CSV, JSON, TXT, and so on.

Files can be read from and written to your local HDD or cloud storage; the storeMode attribute defines where the data is physically located.
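
Since the wiki does not show the call signature, here is a self-contained pandas sketch of the same contract: a hypothetical as_type argument selects the returned dtype and the file extension selects the format. dongcha's FilesRW generalises this idea across local and cloud storage via storeMode.

    import pandas as pd

    # A sketch of the read_data/write_data contract: as_type selects
    # the returned dtype, and the file extension selects the format.
    def read_data(file_path: str, as_type: str = "dataframe"):
        df = pd.read_csv(file_path)
        if as_type == "dict":
            return df.to_dict(orient="records")
        if as_type == "list":
            return df.values.tolist()
        return df  # default: pandas dataframe

    def write_data(data, file_path: str):
        df = pd.DataFrame(data)
        if file_path.endswith(".json"):
            df.to_json(file_path, orient="records")
        else:
            df.to_csv(file_path, index=False)

    # Round-trip example: write a CSV, then read it back as a dict.
    write_data([{"a": 1, "b": 2}], "sample.csv")
    print(read_data("sample.csv", as_type="dict"))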

The FilesRW Abstraction page describes the implementation details.

NoSQL workloads

Spark Database workloads

Transform

Clean-n-Enrich

ML

Machine Learning (ML) utilities are common, reusable functional packages. They take a dataset and return results as instructed by the ML class methods.

Cluster

There are three packages: two separate point (cloud) clustering from graph clustering (or subgraph community detection), and the third computes cluster quality measures for either point or graph clustering.

Point

The package provides a consistent, common structure for all clustering categories: simply supply the dataset and the clustering category to receive clustered and labelled data (a sketch follows the list below). In the points.py package, the main function cluster_n_label_data must be provided with:

  • a dataset without any NaN values, either as a
    • pandas dataframe with, at least, two columns to build a numpy ndarray, or a
    • numpy ndarray
  • a clustering category name, such as k-means, denclue, dbscan, and so on
  • columns (optional): if there are non-numeric columns, the columns attribute can specify the list of numeric columns to consider
  • KWARGS to set other cluster category-specific properties, such as n_clusters, max_iters, epsilon
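
As a hedged illustration of what such a call produces, here is a self-contained scikit-learn sketch in which DBSCAN stands in for one clustering category and the labels are appended to the dataframe; dongcha's points.py wires the equivalent up behind the single cluster_n_label_data call, so this is not its actual implementation.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN

    # Build a small two-column dataframe with no NaN values,
    # matching the input contract described above.
    rng = np.random.default_rng(42)
    df = pd.DataFrame(rng.random((100, 2)), columns=["x", "y"])

    # Select the numeric columns to cluster (the 'columns' attribute)
    # and fit the chosen clustering category with its specific kwargs
    # (here DBSCAN with an epsilon-style parameter).
    X = df[["x", "y"]].to_numpy()
    df["label"] = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

    print(df.head())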

DimReduc

Dimensionality Reduction (dimreduc)

NatLang

Natural Language (NatLang) related workloads. The module inherits Natural Language Processing (NLP) and Natural Language Generation (NLG) functions and primarily extends:

    from sentence_transformers import SentenceTransformer, util
    from nltk.corpus import stopwords
    from nltk import trigrams, bigrams, edit_distance
    from collections import Counter
    from difflib import SequenceMatcher
    import re

NLP

The package offers sentence cleansing, encoding, semantic similarity scoring, n-gram analysis, and other NLP workloads.
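
A minimal sketch of the semantic similarity scoring workload, built on the sentence_transformers imports listed above; the model name here is an example choice, not necessarily the one dongcha uses.

    from sentence_transformers import SentenceTransformer, util

    # Encode two sentences and score their semantic similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        ["load data into the warehouse", "ETL workloads move data"],
        convert_to_tensor=True,
    )
    score = util.cos_sim(embeddings[0], embeddings[1])
    print(float(score))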

BOW

A package that creates and updates a domain-specific Bag Of Words (BOW) to support NLP workloads.
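
A minimal sketch of the create-and-update idea using the Counter import listed above; dongcha's BOW package presumably adds persistence and domain handling on top of this, so treat the helper below as hypothetical.

    import re
    from collections import Counter

    # Build and update a domain-specific bag of words with a Counter.
    def update_bow(bow: Counter, text: str) -> Counter:
        tokens = re.findall(r"[a-z]+", text.lower())
        bow.update(tokens)
        return bow

    bow = Counter()
    update_bow(bow, "ETL pipelines load and transform data")
    update_bow(bow, "transform data for ML pipelines")
    print(bow.most_common(3))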

Statistics

Timeseries

All packages in this module manage timeseries functions; therefore, at least one of the attributes must be year, month, day, datetime, or timestamp based.

Rollingstats

A PySpark function for computing the rolling mean, standard deviation, or sum of a timeseries dataset. The simple_moving_stats function must be instructed with the following (a sketch of the underlying window computation follows the list):

  1. the name of the DATEATTRIBUTE that defines the time attribute
  2. a WINLENGTH and WINUNIT that define the window length; e.g. 7-day, 30-day, etc.
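
A self-contained PySpark sketch of the window computation such a function wraps, assuming a 7-day window over a date attribute; swap F.avg for F.stddev or F.sum to get the other statistics. This illustrates the technique, not simple_moving_stats itself.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2024-01-01", 10.0), ("2024-01-02", 12.0), ("2024-01-08", 9.0)],
        ["date", "value"],
    ).withColumn("date", F.to_date("date"))

    # rangeBetween operates on long values, so order by the epoch
    # seconds of the date attribute; a 7-day window is 7 * 86400 s.
    win = (
        Window.orderBy(F.col("date").cast("timestamp").cast("long"))
        .rangeBetween(-7 * 86400, 0)
    )
    df = df.withColumn("sma_7d", F.avg("value").over(win))
    df.show()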

LIB