Transfer Processed Data to Delta Lake Repository - 3C-SCSU/Avatar GitHub Wiki
Set-Up:
- Set up a virtual machine on Google Cloud
- Use Ubuntu 20.04.6 as the operating system
- Access the VM via SSH
Create two directories:
- raw_data/: Stores unprocessed brainwave data
- processed_data/: Stores cleaned data
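The two directories can be created with a short Python snippet (paths are relative to the working directory on the VM):

```python
from pathlib import Path

# Directories used by the pipeline: raw input and cleaned output.
for name in ("raw_data", "processed_data"):
    Path(name).mkdir(parents=True, exist_ok=True)  # idempotent: safe to re-run
```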
File Shuffler:
- Grant required permissions for the File Shuffler tool
- Download and set up the File Shuffler to ensure data privacy during processing
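The exact permission bits depend on how the File Shuffler is installed; a minimal sketch, assuming its entry point is a script named `file_shuffler.py` (hypothetical name, adjust to the real location):

```python
import os
import stat

# Hypothetical path to the File Shuffler entry point; adjust to the real install location.
tool = "file_shuffler.py"
open(tool, "a").close()  # ensure the file exists for this demonstration

# Owner gets read/write/execute; group gets read/execute.
os.chmod(tool, stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP)
```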
Environment Configuration:
- Install Java to ensure compatibility with PySpark
- Set the JAVA_HOME environment variable
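PySpark locates the JVM through JAVA_HOME, which can also be set from Python before starting a session. The path below assumes the Ubuntu OpenJDK 11 package and should be adjusted to the actual install location (check with `readlink -f $(which java)`):

```python
import os

# Assumed default path for OpenJDK 11 on Ubuntu 20.04; adjust for your JVM.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")
```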
Set up PySpark:
- Install and configure PySpark
- Set up the PySpark session
- Define a data schema for input and processed data
- Add error handling to manage failures during processing
Data Processing:
Use the input file data/brainwave_data.csv with the following fields:
- id: unique ID for each record
- signal_type: type of brainwave signal
- value: values for each signal
Additional steps:
- Process and validate data before transferring
- Navigate to /deployment/data for setting up and running processing tasks
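Before handing the file to Spark, a lightweight pre-flight check can confirm the CSV has the expected columns and parseable values. This is a plain-Python sketch of the "validate data before transferring" step:

```python
import csv

EXPECTED = ["id", "signal_type", "value"]

def validate_csv(path):
    """Return (ok, problems) after checking the header and per-row values."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != EXPECTED:
            return False, [f"unexpected header: {reader.fieldnames}"]
        seen_ids = set()
        for lineno, row in enumerate(reader, start=2):
            if row["id"] in seen_ids:
                problems.append(f"line {lineno}: duplicate id {row['id']}")
            seen_ids.add(row["id"])
            try:
                float(row["value"])
            except (TypeError, ValueError):
                problems.append(f"line {lineno}: non-numeric value {row['value']!r}")
    return not problems, problems
```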
Python Script:
Develop a Python script to handle data transfer from processed_data/:
- Write data securely to the Delta Lake repository
- Validate the data transfer
- Run it with python3 /home/Avatar/Deployment/my_transfer.py to confirm
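The transfer script can follow this shape. It is a sketch, not the project's actual my_transfer.py: it assumes the delta-spark package is installed and uses the standard Delta Lake writer API (`format("delta")`); the row-count comparison stands in for the validation step:

```python
def transfer_to_delta(spark, source_csv, delta_path):
    """Read processed CSV data, write it to a Delta Lake table, and verify the transfer."""
    df = spark.read.csv(source_csv, header=True, inferSchema=True)
    # Standard Delta Lake write; requires delta-spark on the Spark classpath.
    df.write.format("delta").mode("append").save(delta_path)
    # Validate the transfer by re-reading and comparing row counts.
    written = spark.read.format("delta").load(delta_path)
    if written.count() < df.count():
        raise RuntimeError("Delta write appears incomplete")
    return written.count()
```

`mode("append")` is used so repeated runs add new records rather than overwriting the table; switch to `overwrite` if each run should replace the data.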
Successful Execution:
Data transfers securely to the Delta Lake repository.

Future Development:
- Add a GUI for user-friendly integration
- Enable real-time monitoring of data transfer processes
- Automate processing tasks
Demonstration Video: