Package : Loader : RDBM Class
```python
from dongcha.etl.loader import sparkRDBM
```
Introduction
How to use the utils app ETL module Loader package, sparkDBwls.py.
Currently supported RDBMS
- PostgreSQL - all CRUD functions are working and tested
- BigQuery - waiting for the Spark 3.4 connectors (jars) to be released
The package builds on the pyspark database read/write features and is designed for spark partitions and multi-node implementations. It supports any RDBMS, such as PostgreSQL, MySQL, MS SQL Server, etc., and has been tested with PostgreSQL. An implementation requires defining the:
- database connection parameters in the utils config file; namely, the `app.cfg` `[DATABASE]` section
- pyspark session parameters in the relevant utils config file; i.e., the `app.cfg` `[SPARK]` section
- necessary JAR files for the respective RDBMS, made available in the `$SPARK_HOME/jars/` folder. See below for more details.
The package features include functions to CRUD a database in two ways:
- dbTable - defining the schema and table name
- query - defining an SQL query, such as one with multi-table joins and WHERE statements
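As a rough illustration (the table, column, and schema names here are hypothetical), the two modes differ only in which attribute you pass to the CRUD functions:
```python
# dbTable mode: a (schema-qualified) table name as a plain string
db_table = "mydbschema.mytable"

# query mode: a complete SQL statement, e.g. a multi-table join with a WHERE clause
select = (
    "SELECT a.*, b.label "
    "FROM mydbschema.table_a a "
    "JOIN mydbschema.table_b b ON a.id = b.id "
    "WHERE a.created_dt > '2022-01-01'"
)
```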
Prerequisites
- APP.CFG settings must be defined with the necessary DATABASE section
  - If using the cloud, additional cloud-service-specific sections must be configured
- PostgreSQL instance
  - version >= 11.0 installation
  - In addition to the default pg_admin user, a separate user account for the application, with full CRUD privileges to all schemas and databases, must be created and registered in the APP.CFG file.
- BigQuery - TBD
Utils APP.CFG settings
The necessary parameters for the two config sections are discussed below.
Database settings
```ini
[DATABASE]
#--database types: mysql, postgresql (default: postgres)
dbtype = PostgreSQL
#--hostIP default 127.0.0.1
dbhostip = 127.0.0.1
#--port default 5432
dbport = 5432
#--database driver
# postgresql for apache spark: 'org.postgresql.Driver'
dbdriver = org.postgresql.Driver
#--database name
dbname = mydbname
#--schema name
dbschema = mydbschema
#--username and password to connect
# default db_user=postgres, db_pswd = postgres
dbuser = mydbuser
dbpswd = mydbpswd
```
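These values presumably feed into the class's JDBC connection URL (the `self._dbConnURL` property). As a rough sketch, not the class's actual code, the URL for the settings above would typically be `jdbc:postgresql://127.0.0.1:5432/mydbname`; the snippet below assembles it with Python's standard configparser (the `app.cfg` path is assumed):
```python
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("app.cfg")   # hypothetical path to the utils config file

db = cfg["DATABASE"]
# a typical PostgreSQL JDBC URL built from the [DATABASE] settings
db_conn_url = f"jdbc:{db['dbtype'].lower()}://{db['dbhostip']}:{db['dbport']}/{db['dbname']}"
print(db_conn_url)    # jdbc:postgresql://127.0.0.1:5432/mydbname
```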
Spark settings
Prerequisites
- install and setup spark: https://computingforgeeks.com/how-to-install-apache-spark-on-ubuntu-debian/
- also install findspark by running `python3 -m pip install findspark`
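findspark simply points Python at the Spark installation before pyspark is imported; a minimal check, assuming Spark lives under /opt/spark as in the [SPARK] settings below:
```python
import findspark
findspark.init("/opt/spark")   # $SPARK_HOME; omit the argument to use the SPARK_HOME env variable

import pyspark
print(pyspark.__version__)
```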
```ini
[SPARK]
#--settings to connect to the database to perform workloads
# $SPARK_HOME directory
homedir = /opt/spark/
# $SPARK_HOME/bin directory
bindir = /opt/spark/bin
# spark config type; default spark.jars
config = spark.jars
# to download Postgres JDBC drivers: https://jdbc.postgresql.org/
jardir = /opt/spark/jars/postgresql-42.5.0.jar
# master can be local[*], mesos, or yarn
master = local[1]
# partitions can be any number based on the number of clusters and cores
partitions = 2
# default format is jdbc
format = jdbc
# savemode can be append or overwrite
savemode = Append
```
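These settings map onto a standard pyspark session. A minimal sketch of an equivalent session built by hand (the appName is hypothetical; the class constructs its own):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dongcha-loader-example")                              # hypothetical appName
    .master("local[1]")                                             # master
    .config("spark.jars", "/opt/spark/jars/postgresql-42.5.0.jar")  # config = spark.jars, jardir
    .getOrCreate()
)
```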
Setting Class Properties
There are three ways to set the class properties. The hierarchy for setting the property values is in the order listed below; i.e., a parameter passed through the function overwrites the set value, and if it is not defined, the class will attempt to use the value defined in the config file.
- Passing the property values to the CRUD function as key-value pairs; e.g. `kwargs['DBTYPE']='PostgreSQL'`
- Simply using the class property setter of the property; e.g. `clsDB.dbType = 'PostgreSQL'`
- If defined in the app.cfg, the property is set with the value defined in the config
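A sketch of the three options, assuming clsDB is an instance of the loader class (instantiation arguments omitted):
```python
# 1. pass the value as a key-value pair to the CRUD function call
sdf = clsDB.read_data_from_table(db_table="mydbschema.mytable", DBTYPE="PostgreSQL")

# 2. use the class property setter before calling the function
clsDB.dbType = "PostgreSQL"

# 3. set nothing; the dbtype value from the app.cfg [DATABASE] section is used
```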
Class properties
DB parameters
```python
self._dbType      # set/get db type; e.g. PostgreSQL, MySQL
self._dbDriver    # set/get db driver; default 'org.postgresql.Driver'
self._dbHostIP    # set/get db IP address; default localhost = 127.0.0.1
self._dbPort      # set/get db TCP/IP port; default 5432
self._dbName      # set/get db name
self._dbSchema    # set/get db schema name
self._partitions  # set/get number of partitions; default = 2
self._dbUser      # set/get db schema username
self._dbPswd      # set/get db schema password
self._dbConnURL   # set/get db connection url
```
Session parameters
```python
self._homeDir   # set/get the $SPARK_HOME path
self._binDir    # set/get the $SPARK_HOME/bin path
self._config    # set/get the config type; default 'spark.jars'
self._jarDir    # set/get the $SPARK_HOME/jars file; default postgresql-42.5.0.jar
self._appName   # set/get spark appName; default is constructed from package identifiers
self._master    # set/get master = local[*], mesos, or yarn
self._rwFormat  # set/get read/write format; default 'jdbc'
self._session   # set/get spark session
self._saveMode  # set/get savemode = append or overwrite
```
Functions (@ClassMethods)
There are four key functions that support CRUD on the database tables, reading views, and calling DB stored procedures.
Read data from the table(s)
```python
@classmethod
def read_data_from_table(
    self,
    select:str="",
    db_table:str="",
    db_column:str="",
    lower_bound=None,
    upper_bound=None,
    **kwargs) -> DataFrame:
```
- It is mandatory that either the `select` or the `db_table` attribute is specified
  - the select attribute takes an SQL query statement as a single string
  - the db_table attribute takes the table name, with or without the schema as a prefix; if the schema is not specified, the function defaults to the schema name set in the class property
- The `db_column` attribute specifies the column to use when applying the partitioning; it is optional
- If the `lower_bound` and `upper_bound` are specified, they are applied to the defined partitioning column; also optional. The implementer is expected to know or learn more about partition columns, lower_bound, and upper_bound
- The `**kwargs` can be used to
  - set the `self.session` and `self.dbConnURL` parameters; if unspecified, default values are used
  - set the `self.partitions` parameter; if unspecified, default values are used
  - set the
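A usage sketch of both read modes, assuming clsDB is an instance of the loader class; the table, column, and bound values are hypothetical:
```python
# dbTable mode with an optional partitioning column and bounds
sdf = clsDB.read_data_from_table(
    db_table="mydbschema.mytable",
    db_column="id",        # column used for partitioning
    lower_bound=1,
    upper_bound=100000,
)

# query mode: a single SQL string, e.g. a multi-table join with a WHERE clause
_query = (
    "SELECT a.*, b.label "
    "FROM mydbschema.table_a a "
    "JOIN mydbschema.table_b b ON a.id = b.id "
    "WHERE a.created_dt > '2022-01-01'"
)
sdf = clsDB.read_data_from_table(select=_query)
print(sdf.count())
```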
Insert data into a table
```python
@classmethod
def insert_sdf_into_table(
    self,
    save_sdf,
    db_table:str,
    **kwargs):
```
- The `save_sdf` attribute defines the data set; if it is not a pyspark dataframe, the `self.data` property will attempt to convert it to a pyspark dataframe. Hence, you can provide a pandas dataframe as well
- The `db_table` attribute takes the table name, with or without the schema as a prefix; if the schema is not specified, the function defaults to the schema name set in the class property
- The `**kwargs` can be used to
  - set the `self.session` and `self.dbConnURL` parameters; if unspecified, default values are used
- set the