How to deploy data processing code on GCP
Prerequisite
- Set up a GCP Dataproc cluster (a sketch of one way to do this follows)
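If a cluster does not exist yet, the sketch below shows one possible way to create a small one with the gcloud CLI. The cluster name, region, and single-node shape are placeholders and assumptions, not settings taken from this project.

```bash
# Minimal sketch (assumed values): create a single-node Dataproc cluster for testing.
# Replace my-cluster and us-central1 with your own cluster name and region.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --single-node
```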
Set up the Google Cloud CLI
- Download and install: follow the instructions at https://cloud.google.com/sdk/docs/install
- Download the file, unzip it, and run `./google-cloud-sdk/install.sh` (see Troubleshooting if the gcloud folder cannot be created)
- Run `./google-cloud-sdk/bin/gcloud init` and follow the prompts in the terminal
- Test
  - Run `gcloud --version`
  - If it says gcloud is not recognized, add the SDK to your PATH and reload the shell:
    - `echo 'export PATH="$PATH:User/path/to/google-cloud-sdk/bin"' >> ~/.zshrc`
    - `source ~/.zshrc`
  - Use `echo $PATH` to check that the path was added, then try again
- Troubleshooting
Submit jobs
Copy files
- Find the staging bucket for your cluster (see the sketch after this list)
- Copy a file: `gsutil -m cp -r data/raw.csv gs://your-bucket-name/data/`
- Copy a folder: `gsutil -m cp -r data/raw_folder gs://your-bucket-name/data`
- Note: to copy a folder, it must contain at least one file
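The page does not show a command for locating the bucket; one possible way, assuming the cluster name and region used in the submit example below and that the describe output exposes the bucket as `config.configBucket`, is:

```bash
# Print the staging bucket recorded in the cluster's config (assumed field path:
# config.configBucket). Cluster name and region match the submit example below.
gcloud dataproc clusters describe cluster-yingyu \
    --region=us-central1 \
    --format="value(config.configBucket)"

# Optionally confirm the upload landed where expected.
gsutil ls gs://your-bucket-name/data/
```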
Submit
```bash
gcloud dataproc jobs submit pyspark gs://dataproc-staging-your-bucket-name/final_project/src/process_data.py \
    --cluster=cluster-yingyu \
    --region=us-central1 \
    --properties spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    -- --year_month 202201 --country US --data_path gs://dataproc-staging-your-bucket-name//final_project/data/
```
In the last line, the standalone `--` separates gcloud's own flags from arguments defined by the user for the script: everything after it is passed through to `process_data.py`. If `--data_path` is not specified, the script falls back to its default root folder on HDFS.
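As a generic, hedged sketch of that pattern (the capitalized placeholders and the trimmed argument list are illustrative, not taken from this project): flags before the bare `--` configure the Dataproc job, and everything after it is handed to the script's own argument parsing.

```bash
# Generic shape of a PySpark job submission; BUCKET, CLUSTER_NAME, and REGION
# are placeholders. Flags before "--" go to gcloud/Dataproc; arguments after
# "--" are forwarded to the script unchanged.
gcloud dataproc jobs submit pyspark gs://BUCKET/path/to/script.py \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    -- --year_month 202201 --country US
# Without --data_path, the script is assumed to fall back to its default input
# location, which on the cluster resolves to HDFS.
```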