How to deploy data processing code on GCP
Prerequisite
- Set up a GCP Dataproc cluster (a sketch of one way to do this follows)
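If a cluster does not exist yet, the sketch below shows one possible way to create a small one with the gcloud CLI. The cluster name, region, and single-node shape are placeholders and assumptions, not settings taken from this project.

```bash
# Minimal sketch (assumed values): create a single-node Dataproc cluster for testing.
# Replace my-cluster and us-central1 with your own cluster name and region.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --single-node
```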
Set up the Google Cloud CLI
- Download and install: follow the instructions at https://cloud.google.com/sdk/docs/install
- Download the file, unzip it, and run `./google-cloud-sdk/install.sh` (see Troubleshooting if the gcloud folder cannot be created)
- Run `./google-cloud-sdk/bin/gcloud init` and follow the prompts in the terminal
- Test
  - Run `gcloud --version`
  - If it says gcloud is not recognized, add the SDK to your PATH and reload the shell:
    - `echo 'export PATH="$PATH:User/path/to/google-cloud-sdk/bin"' >> ~/.zshrc`
    - `source ~/.zshrc`
  - Use `echo $PATH` to check that the path was added, then try again
- Troubleshooting
Submit jobs
Copy files
- Find the staging bucket for your cluster (see the sketch after this list)
- Copy a file: `gsutil -m cp -r data/raw.csv gs://your-bucket-name/data/`
- Copy a folder: `gsutil -m cp -r data/raw_folder gs://your-bucket-name/data`
- Note: to copy a folder, it must contain at least one file
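The page does not show a command for locating the bucket; one possible way, assuming the cluster name and region used in the submit example below and that the describe output exposes the bucket as `config.configBucket`, is:

```bash
# Print the staging bucket recorded in the cluster's config (assumed field path:
# config.configBucket). Cluster name and region match the submit example below.
gcloud dataproc clusters describe cluster-yingyu \
    --region=us-central1 \
    --format="value(config.configBucket)"

# Optionally confirm the upload landed where expected.
gsutil ls gs://your-bucket-name/data/
```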
Submit
```bash
gcloud dataproc jobs submit pyspark gs://dataproc-staging-your-bucket-name/final_project/src/process_data.py \
    --cluster=cluster-yingyu \
    --region=us-central1 \
    --properties spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    -- --year_month 202201 --country US --data_path gs://dataproc-staging-your-bucket-name//final_project/data/
```
In the last line, the standalone `--` separates gcloud's own flags from arguments defined by the user for the script: everything after it is passed through to `process_data.py`. If `--data_path` is not specified, the script falls back to its default root folder on HDFS.
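As a generic, hedged sketch of that pattern (the capitalized placeholders and the trimmed argument list are illustrative, not taken from this project): flags before the bare `--` configure the Dataproc job, and everything after it is handed to the script's own argument parsing.

```bash
# Generic shape of a PySpark job submission; BUCKET, CLUSTER_NAME, and REGION
# are placeholders. Flags before "--" go to gcloud/Dataproc; arguments after
# "--" are forwarded to the script unchanged.
gcloud dataproc jobs submit pyspark gs://BUCKET/path/to/script.py \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    -- --year_month 202201 --country US
# Without --data_path, the script is assumed to fall back to its default input
# location, which on the cluster resolves to HDFS.
```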