# APP : Module : Packages
## Introduction

## Key Components

### APP.CFG
The primary requirements associated with the utils `app.cfg` are setting the respective:
- modules and packages
- storage configurations
- logging
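As a hedged illustration only (the actual section and key names in `app.cfg` are not shown on this page), such a configuration might be read with Python's `configparser`:

```python
# Hypothetical app.cfg layout read with Python's configparser; the section
# and key names are assumptions, not the actual dongcha schema.
import configparser

SAMPLE_CFG = """
[MODULES]
packages = etl,ml

[STORAGE]
storeMode = local

[LOGGING]
level = INFO
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)
print(config["STORAGE"]["storeMode"])  # -> local
```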
## Modules

### ETL

#### Load

##### Files Read/Write
The FilesRW class provides two simple functions:
- `read_data` reads a set of files into the desired `as_type`, a specified dtype such as dict, dataframe, list, etc.
- `write_data` writes data of a given dtype into a file of the desired format, such as CSV, JSON, TXT, and so on
The file read and write can be from your local HDD or cloud storage. The `storeMode` attribute defines where the data is physically located.
The FilesRW Abstraction page describes the implementation details.
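Below is a minimal, runnable sketch of what a FilesRW-style abstraction looks like, assuming the `read_data`/`write_data` behaviour described above; it covers the local `storeMode` only, and the real class's signatures, import path, and cloud-storage support are defined in the FilesRW Abstraction, not here:

```python
# Runnable sketch of a FilesRW-style abstraction, limited to the local
# storeMode; the real class also supports cloud storage, so the names
# and signatures here are assumptions, not the actual dongcha interface.
import json
import pandas as pd

class FilesRWSketch:
    def __init__(self, storeMode="local"):
        if storeMode != "local":
            raise NotImplementedError("this sketch covers the local HDD mode only")
        self.storeMode = storeMode

    def read_data(self, file_path, as_type="dataframe"):
        """Read a file into the desired as_type (dataframe, dict, ...)."""
        if as_type == "dataframe":
            return pd.read_csv(file_path)
        if as_type == "dict":
            with open(file_path) as f:
                return json.load(f)
        raise ValueError(f"unsupported as_type: {as_type}")

    def write_data(self, data, file_path):
        """Write data to CSV or JSON based on the file extension."""
        if file_path.endswith(".csv"):
            pd.DataFrame(data).to_csv(file_path, index=False)
        elif file_path.endswith(".json"):
            with open(file_path, "w") as f:
                json.dump(data, f)
        else:
            raise ValueError("this sketch only writes .csv and .json files")

rw = FilesRWSketch(storeMode="local")
rw.write_data({"a": [1, 2], "b": [3, 4]}, "demo.csv")
print(rw.read_data("demo.csv", as_type="dataframe"))
```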
##### NoSQL workloads

##### Spark Database workloads

#### Transform

##### Clean-n-Enrich

### ML
Machine Learning (ML) utilities are common and reusable functional packages. They simply take a dataset and return the results as instructed by the ML class methods.
#### Cluster
There are three packages: two separate point (cloud) clustering from graph clustering (or subgraph community detection), and the third computes the cluster quality measures for either point or graph clustering.
##### Point
The package provides a consistent and common structure for all clustering categories. It makes it easy to simply give the data set and instruct the clustering category to receive clustered and labeled data. In the `points.py` package, the main function `cluster_n_label_data` must be provided with the following (see the sketch after this list):
- a dataset without any NaN values, either as a `pandas` dataframe with, at least, two numeric columns to build a `numpy` ndarray, or a `numpy` ndarray directly
- a clustering category name, like `k-means`, `denclue`, `dbscan`, and so on
- columns (optional): if there are non-numeric columns, then the columns attribute can be used to specify a list of numeric columns to consider
- KWARGS to instruct other cluster category-specific properties, such as `n_clusters`, `max_iters`, `epsilon`
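The following is an illustrative sketch only of what a `cluster_n_label_data`-style call can do, re-implemented here with scikit-learn; beyond the attribute names listed above, the signature and internals are assumptions, not the actual `points.py` implementation:

```python
# Illustrative sketch: a minimal scikit-learn re-implementation of what a
# cluster_n_label_data-style function does; the real points.py internals
# are not shown on this wiki page, so treat the signature as an assumption.
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans

def cluster_n_label_data(data, cluster_method="k-means", columns=None, **kwargs):
    """Cluster the numeric columns of `data` and return it with a label column."""
    df = pd.DataFrame(data)                # accepts a pandas dataframe or ndarray
    cols = columns or df.select_dtypes("number").columns.tolist()
    X = df[cols].to_numpy()                # must be NaN-free, per the rules above
    if cluster_method == "k-means":
        model = KMeans(n_clusters=kwargs.get("n_clusters", 3))
    elif cluster_method == "dbscan":
        model = DBSCAN(eps=kwargs.get("epsilon", 0.5))
    else:
        raise ValueError(f"unsupported cluster_method: {cluster_method}")
    df["label"] = model.fit_predict(X)     # clustered and labeled data
    return df

labeled = cluster_n_label_data(
    pd.DataFrame({"x": [0.1, 0.2, 5.0, 5.1], "y": [0.0, 0.1, 5.0, 4.9]}),
    cluster_method="dbscan",
    epsilon=1.0,
)
print(labeled)  # two DBSCAN clusters, labeled 0 and 1
```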
#### DimReduc

Dimensionality Reduction (`dimreduc`)
#### NatLang

Natural Language (NatLang) related workloads. It inherits Natural Language Processing (NLP) and Natural Language Generation (NLG) functions. It primarily extends:
```python
from sentence_transformers import SentenceTransformer, util
from nltk.corpus import stopwords
from nltk import trigrams, bigrams, edit_distance
from collections import Counter
from difflib import SequenceMatcher
import re
```
##### NLP

The package offers various sentence cleansing, encoding, semantic similarity scoring, n-gram analysis, and other NLP workloads.
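As a hedged example of the semantic similarity scoring described above, built directly on the `sentence_transformers` imports listed earlier (the dongcha wrapper function names are not shown on this page):

```python
# Hedged illustration of semantic similarity scoring using the
# sentence_transformers imports listed above; the dongcha function
# wrapping this is not documented here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # an assumed, common model choice
sentences = ["the delivery truck arrived late", "the shipment was delayed"]
embeddings = model.encode(sentences, convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity
print(f"semantic similarity: {score:.3f}")
```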
##### BOW

A package that can create and update a domain-specific Bag Of Words (BOW) to support NLP workloads.
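A minimal sketch of creating and updating such a BOW with the `Counter` and `stopwords` imports listed earlier; the BOW package's own API is not documented on this page, so `build_bow` is a hypothetical name:

```python
# Minimal sketch of building a domain-specific BOW using Counter and
# nltk stopwords; build_bow is a hypothetical name, not the dongcha API.
import re
from collections import Counter
from nltk.corpus import stopwords

def build_bow(documents, lang="english"):
    """Count non-stopword tokens across documents into a bag of words."""
    stop = set(stopwords.words(lang))  # requires nltk.download("stopwords")
    bow = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        bow.update(t for t in tokens if t not in stop)
    return bow

bow = build_bow(["The pump failed twice", "Replace the failed pump seal"])
print(bow.most_common(3))  # [('pump', 2), ('failed', 2), ('twice', 1)]
```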
#### Statistics

##### Timeseries

All packages are designed for managing timeseries functions; therefore, at least one of the attributes must be year, month, day, datetime, or timestamp based.
###### Rollingstats

A `pyspark` function for computing the rolling mean, standard deviation, or sum of a timeseries dataset. The `simple_moving_stats` function must be instructed with the following (a hedged sketch follows the list):
- the name of the `DATEATTRIBUTE` that defines the time attribute
- a `WINLENGH` and `WINUNIT` that define the window length; e.g., 7-day, 30-day, etc.