01) Environment Setup - roachlong/example-ml-flow GitHub Wiki

Clone the Repository

First install git if you don't already have it. Instructions for Mac, Windows or Linux can be found here. Then open a Mac Terminal or Windows PowerShell in your workspace folder (or wherever you keep your local repositories) and execute the following commands.

git clone https://github.com/roachlong/example-ml-flow.git
cd example-ml-flow
git status

Docker

For Mac you can use brew, but if you're on Apple Silicon you may need additional virtualization software such as colima.

brew install colima
brew install docker
docker --version
colima start
colima status
brew install docker-compose

For Windows you can install Docker Desktop and follow the instructions outlined here.

MinIO

We'll use a containerized instance of MinIO as the sink for CDC data. First we need to pull the MinIO Docker image.

docker pull minio/minio

Then download the MinIO command line client. On Mac you can use brew install minio/stable/mc. On Windows you can download the client from here, move it into your local filesystem (e.g. C:\Users\myusername\minio) and add that location to your system Path environment variable. Once done, open a new terminal window and execute mc --help to confirm.

Cockroach

If we're executing the PoC as a standalone lab we can install and run a single-node instance of CockroachDB on our laptops. For Mac you can install CRDB with brew install cockroachdb/tap/cockroach. For Windows you can download and extract the latest binary from here, then add the location of the cockroach.exe file (e.g. C:\Users\myname\AppData\Roaming\cockroach) to your Windows Path environment variable.

Then open a new Mac Terminal or PowerShell window and execute the following command to launch your single node database.

cockroach start-single-node --insecure --store=./data

Then open a browser to http://localhost:8080 to view the dashboard for your local cockroach instance.
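
As an alternative to the browser check, the node can be probed from a script. The following is a minimal sketch and not part of the repository: it assumes the insecure single-node instance started above is listening on the default HTTP port 8080 and queries CockroachDB's /health endpoint. The function name crdb_is_up is hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def crdb_is_up(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the node's HTTP health endpoint answers with 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as "not up".
        return False


if __name__ == "__main__":
    print("cockroach reachable:", crdb_is_up())
```

If this prints False, check that the cockroach start-single-node process is still running in its terminal window.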

Python

On Mac we can install Python with brew install python and on Windows you can follow the instructions here to download and run the installer.

Then we need to install the following packages, ideally inside a Python virtual environment.

python3 -m venv venv
# and either, on Mac or Linux
venv/bin/pip3 install psycopg
venv/bin/pip3 install faker
venv/bin/pip3 install numpy
# or, on Windows
venv/Scripts/pip3 install psycopg
venv/Scripts/pip3 install faker
venv/Scripts/pip3 install numpy
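
To confirm the packages landed in the virtual environment rather than in the system Python, a short check like the one below may help. It is a sketch, not part of the repository, and the helper name missing_packages is hypothetical. It only tests whether each module can be located; it does not import it.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the three packages installed above.
    required = ["psycopg", "faker", "numpy"]
    missing = missing_packages(required)
    print("missing:", missing if missing else "none")
```

Run it with the interpreter from the virtual environment (venv/bin/python on Mac or Linux, venv/Scripts/python on Windows) so that it inspects the same site-packages the pip3 commands wrote to.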

Jupyter Notebook

On Mac we can use brew to install jupyter and its dependencies, along with the additional python packages that are required.

brew install jupyter
brew install libomp
python -m pip install pandas
python -m pip install matplotlib
python -m pip install seaborn
python -m pip install scikit-learn
python -m pip install xgboost
python -m pip install imbalanced-learn
brew services start jupyterlab

On Windows we can request the Anaconda installation from here and follow the installation instructions that they will email you. Once installed you can launch Jupyter Notebook from the Anaconda Navigator app or open an Anaconda Prompt and execute the jupyter notebook command.
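
Jupyter may run a different Python interpreter than the one pip installed into, so it can be worth checking the packages from inside a notebook cell. The sketch below is an illustration, not part of the repository, and the helper name installed_versions is hypothetical. It looks packages up by their pip distribution names, which differ from the import names for scikit-learn (sklearn) and imbalanced-learn (imblearn).

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each pip distribution name to its installed version, or None."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None
    return versions


if __name__ == "__main__":
    notebook_deps = ["pandas", "matplotlib", "seaborn",
                     "scikit-learn", "xgboost", "imbalanced-learn"]
    for name, version in installed_versions(notebook_deps).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```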

dbworkload

This is a tool, written in Python by one of our colleagues, that we use to simulate data flowing into cockroach. We can install the tool with pip3 install "dbworkload[postgres]", and then add it to our path. On Mac or Linux with Bash you can use:

echo -e '\nexport PATH=`python3 -m site --user-base`/bin:$PATH' >> ~/.bashrc 
source ~/.bashrc

For Windows you can add the location of the dbworkload.exe file (e.g. C:\Users\myname\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_abcdefghijk99\LocalCache\local-packages\Python39\Scripts) to your Windows Path environment variable. The pip command above should provide the exact path to your local python executables.

Then execute dbworkload --help to confirm the installation setup.
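
Since several of the steps above depend on a tool being on your path, a single check across all of them can save time. This is a minimal sketch using only the standard library. The helper name locate_tools is hypothetical, and the tool list is simply the set of commands used on this page.

```python
import shutil


def locate_tools(names):
    """Map each command name to its resolved path on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    tools = ["git", "docker", "mc", "cockroach", "dbworkload"]
    for name, path in locate_tools(tools).items():
        print(f"{name:12} {path or 'not found on PATH'}")
```

Any tool reported as not found usually means the terminal was opened before the path was updated, so open a new window and try again.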

Sample Dataset

We're going to use Brandon Harris' Sparkov Data Generation tool, which was added as a submodule of the example-ml-flow repository for demonstration purposes. Because it is a submodule, it needs to be initialized separately after you've cloned this repository.

git submodule update --init --recursive

We'll generate some sample data in the data/generated folder, creating variables in the terminal shell window to limit the scope of the data. On Mac variables are assigned like my_var="example" and on Windows we precede the variable assignment with a $ symbol, as in $my_var="example".

So on Mac

customers=10
days=10
start_date=$(date +"%m-%d-%Y")
end_date=$(date -v "+${days}d" +"%m-%d-%Y")
cd Sparkov_Data_Generation
../venv/bin/python ./datagen.py -n ${customers} -o ../data/generated ${start_date} ${end_date}

And on Windows

$customers=10
$days=10
$start_date=(Get-Date).ToString('MM-dd-yyyy')
$end_date=(Get-Date).AddDays($days).ToString('MM-dd-yyyy')
cd Sparkov_Data_Generation
../venv/Scripts/python ./datagen.py -n ${customers} -o ../data/generated ${start_date} ${end_date}
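
The two variants above differ only in how the shell computes the dates. As a platform-independent illustration, the same MM-DD-YYYY window and argument list can be built in Python. This is a sketch, not part of the repository: the helper names date_window and datagen_args are hypothetical, and the arguments mirror the commands above rather than any documented datagen.py interface.

```python
from datetime import date, timedelta


def date_window(days, start=None):
    """Return (start_date, end_date) as MM-DD-YYYY strings, `days` apart."""
    start = start or date.today()
    end = start + timedelta(days=days)
    fmt = "%m-%d-%Y"
    return start.strftime(fmt), end.strftime(fmt)


def datagen_args(customers, days, start=None):
    """Build the argument list passed to datagen.py in the commands above."""
    start_date, end_date = date_window(days, start)
    return ["./datagen.py", "-n", str(customers),
            "-o", "../data/generated", start_date, end_date]


if __name__ == "__main__":
    # Prints the arguments only; it does not run the generator.
    print(" ".join(datagen_args(customers=10, days=10)))
```

For example, a 10 day window starting on January 25, 2024 yields 01-25-2024 and 02-04-2024.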