Transfer Processed Data to Delta Lake Repository - 3C-SCSU/Avatar GitHub Wiki

Set-Up:

  1. Set up a virtual machine on Google Cloud
  2. Use Ubuntu 20.04.6 as the operating system
  3. Access the VM via SSH

Create two directories:

  1. raw_data/: Stores unprocessed brainwave data
  2. processed_data/: Stores cleaned data
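The two directories above can be created from Python as well as from the shell; the directory names are from this page, while the base path is an assumption to adjust for your VM:

```python
from pathlib import Path

# Base path is an assumption; adjust to your VM's home directory.
BASE = Path.home() / "Avatar"

# raw_data/ holds unprocessed brainwave data; processed_data/ holds cleaned data.
for name in ("raw_data", "processed_data"):
    (BASE / name).mkdir(parents=True, exist_ok=True)
```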

File Shuffler:

  1. Grant required permissions for the File Shuffler tool
  2. Download and set up the File Shuffler to ensure data privacy during processing
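The File Shuffler tool itself is not shown on this page; as a rough illustration of the idea, a minimal sketch of randomizing file order (the function name and approach are assumptions, not the actual tool) might look like:

```python
import random
from pathlib import Path

def shuffle_files(directory, seed=None):
    """Return the files in `directory` in a randomized order, so that
    processing order reveals nothing about collection order."""
    files = sorted(Path(directory).iterdir())  # deterministic starting point
    rng = random.Random(seed)
    rng.shuffle(files)
    return files
```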

Environment Configuration:

  1. Install Java to ensure compatibility with PySpark
  2. Set the JAVA_HOME environment variable
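JAVA_HOME is normally exported in ~/.bashrc, but it can also be set from Python before the Spark session starts. The path below is an assumption for a default OpenJDK 11 install on Ubuntu 20.04; verify yours with `readlink -f $(which java)`:

```python
import os

# Assumed OpenJDK 11 location on Ubuntu 20.04; adjust to your actual install.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")

print(os.environ["JAVA_HOME"])
```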

Set up PySpark:

  1. Install and configure PySpark
  2. Set up the PySpark session
  3. Define a data schema for input and processed data
  4. Add error handling to manage failures during processing

Data Processing:

Use the input file data/brainwave_data.csv, which contains the following fields:

  1. id: unique ID for each record
  2. signal_type: type of brainwave signal
  3. value: measured value for each signal

Then:

  1. Process and validate the data before transferring
  2. Navigate to /deployment/data to set up and run processing tasks
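Validation before transfer can be sketched in plain Python; the column names are from the field list above, while the specific checks (unique integer id, numeric value) are assumptions:

```python
import csv
import io

def validate_rows(csv_text):
    """Return (valid, errors): rows whose id is a unique integer and whose
    value parses as a float; everything else is reported, not dropped silently."""
    valid, errors, seen_ids = [], [], set()
    for lineno, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        try:
            rid = int(row["id"])
            val = float(row["value"])
        except (KeyError, ValueError):
            errors.append((lineno, "bad id or value"))
            continue
        if rid in seen_ids:
            errors.append((lineno, "duplicate id"))
            continue
        seen_ids.add(rid)
        valid.append({"id": rid, "signal_type": row["signal_type"], "value": val})
    return valid, errors

# Illustrative sample data, not taken from the real brainwave_data.csv.
sample = "id,signal_type,value\n1,alpha,0.42\n1,beta,0.10\n2,gamma,oops\n"
good, bad = validate_rows(sample)
# good keeps the one clean row; bad reports the duplicate id and the bad value
```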

Python Script:

Develop a Python script to handle the data transfer from processed_data/:

  1. Write data securely to the Delta Lake repository
  2. Validate data transfer
  3. Confirm by running python3 /home/Avatar/Deployment/my_transfer.py
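A hedged sketch of what my_transfer.py might contain: the function name, the Delta path handling, and the use of the delta-spark package are all assumptions, not the actual script at /home/Avatar/Deployment/my_transfer.py. Imports are deferred into the function so the sketch can be read without pyspark and delta-spark installed:

```python
def transfer_to_delta(csv_path, delta_path):
    """Read validated CSV from processed_data/, write it as a Delta table,
    then read it back to confirm the row count (step 2: validate the transfer).
    Requires pyspark plus the delta-spark package (pip install delta-spark)."""
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = (
        SparkSession.builder.appName("avatar-delta-transfer")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.format("delta").mode("append").save(delta_path)

    # Validate: the table read back should contain at least the rows just written.
    written = spark.read.format("delta").load(delta_path).count()
    assert written >= df.count()
    return written
```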

Successful Execution:

Data is transferred securely to the Delta Lake repository.

Future Development:

  1. Add a GUI for user-friendly integration
  2. Enable real-time monitoring of data transfer processes
  3. Automate processing tasks

Demonstration Video:

Watch the video