Transfer Processed Data to Delta Lake Repository - 3C-SCSU/Avatar GitHub Wiki
Set-Up:
- Set up a virtual machine on Google Cloud
- Use Ubuntu 20.04.6 as the operating system
- Access the VM via SSH
Create two directories:
- raw_data/: Stores unprocessed brainwave data
- processed_data/: Stores cleaned data
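The two directories can be created with a short Python snippet (paths are relative to the working directory on the VM):

```python
from pathlib import Path

# Directories used by the pipeline: raw input and cleaned output.
for name in ("raw_data", "processed_data"):
    Path(name).mkdir(parents=True, exist_ok=True)  # idempotent: safe to re-run
```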
File Shuffler:
- Grant required permissions for the File Shuffler tool
- Download and set up the File Shuffler to ensure data privacy during processing
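The exact permission bits depend on how the File Shuffler is installed; a minimal sketch, assuming its entry point is a script named `file_shuffler.py` (hypothetical name, adjust to the real location):

```python
import os
import stat

# Hypothetical path to the File Shuffler entry point; adjust to the real install location.
tool = "file_shuffler.py"
open(tool, "a").close()  # ensure the file exists for this demonstration

# Owner gets read/write/execute; group gets read/execute.
os.chmod(tool, stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP)
```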
Environment Configuration:
- Install Java to ensure compatibility with PySpark
- Set the JAVA_HOME environment variable
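PySpark locates the JVM through JAVA_HOME, which can also be set from Python before starting a session. The path below assumes the Ubuntu OpenJDK 11 package and should be adjusted to the actual install location (check with `readlink -f $(which java)`):

```python
import os

# Assumed default path for OpenJDK 11 on Ubuntu 20.04; adjust for your JVM.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")
```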
Set up PySpark:
- Install and configure PySpark
- Set up the PySpark session
- Define a data schema for input and processed data
- Add error handling to manage failures during processing
Data Processing:
Use the input file data/brainwave_data.csv with the following fields:
- id: unique ID for each record
- signal_type: type of brainwave signal
- value: values for each signal
Additional steps:
- Process and validate data before transferring
- Navigate to /deployment/data for setting up and running processing tasks
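Before handing the file to Spark, a lightweight pre-flight check can confirm the CSV has the expected columns and parseable values. This is a plain-Python sketch of the "validate data before transferring" step:

```python
import csv

EXPECTED = ["id", "signal_type", "value"]

def validate_csv(path):
    """Return (ok, problems) after checking the header and per-row values."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != EXPECTED:
            return False, [f"unexpected header: {reader.fieldnames}"]
        seen_ids = set()
        for lineno, row in enumerate(reader, start=2):
            if row["id"] in seen_ids:
                problems.append(f"line {lineno}: duplicate id {row['id']}")
            seen_ids.add(row["id"])
            try:
                float(row["value"])
            except (TypeError, ValueError):
                problems.append(f"line {lineno}: non-numeric value {row['value']!r}")
    return not problems, problems
```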
Python Script:
Develop a Python script to handle data transfer from processed_data/:
- Write data securely to the Delta Lake repository
- Validate the data transfer
- Run it with python3 /home/Avatar/Deployment/my_transfer.py to confirm
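The transfer script can follow this shape. It is a sketch, not the project's actual my_transfer.py: it assumes the delta-spark package is installed and uses the standard Delta Lake writer API (`format("delta")`); the row-count comparison stands in for the validation step:

```python
def transfer_to_delta(spark, source_csv, delta_path):
    """Read processed CSV data, write it to a Delta Lake table, and verify the transfer."""
    df = spark.read.csv(source_csv, header=True, inferSchema=True)
    # Standard Delta Lake write; requires delta-spark on the Spark classpath.
    df.write.format("delta").mode("append").save(delta_path)
    # Validate the transfer by re-reading and comparing row counts.
    written = spark.read.format("delta").load(delta_path)
    if written.count() < df.count():
        raise RuntimeError("Delta write appears incomplete")
    return written.count()
```

`mode("append")` is used so repeated runs add new records rather than overwriting the table; switch to `overwrite` if each run should replace the data.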
Successful Execution:
Data transfers securely to the Delta Lake repository.

Future Development:
- Add a GUI for user-friendly integration
- Enable real-time monitoring of data transfer processes
- Automate processing tasks
Demonstration Video: