How to deploy data processing code on GCP

Prerequisite

  • Set up a GCP Dataproc cluster
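If a cluster does not exist yet, it can also be created from the command line. The following is a minimal sketch only, reusing the cluster name and region that appear later on this page; machine types, image version, and other options should be adjusted to your project.

    gcloud dataproc clusters create cluster-yingyu \
        --region=us-central1 \
        --single-node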

Set up the Google Cloud CLI
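The steps below assume the gcloud and gsutil command-line tools are already installed. A minimal sketch of the usual setup, where your-project-id is a placeholder for your own project:

    # Authenticate and pick a default project (interactive)
    gcloud init

    # Or set the project explicitly
    gcloud config set project your-project-id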

Submit jobs

Copy files

  • Find your bucket for that cluster (see the sketch after this list)
  • Copy a file
    gsutil -m cp data/raw.csv gs://your-bucket-name/data/
    
  • Copy a folder
    gsutil -m cp -r data/raw_folder gs://your-bucket-name/data
    
  • Note: to copy a folder, it must contain at least one file; Cloud Storage has no real directories, so an empty folder is skipped
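To locate the staging bucket Dataproc created for the cluster, one option (an illustrative sketch, using the field name from the Dataproc cluster resource) is to read it from the cluster description, or simply to list the buckets in the project:

    # Show the cluster's staging bucket
    gcloud dataproc clusters describe cluster-yingyu \
        --region=us-central1 \
        --format="value(config.configBucket)"

    # Or list all buckets in the current project
    gsutil ls

After copying, running gsutil ls gs://your-bucket-name/data/ confirms that the files arrived.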

Submit

gcloud dataproc jobs submit pyspark gs://dataproc-staging-your-bucket-name/final_project/src/process_data.py \
    --cluster=cluster-yingyu \
    --region=us-central1 \
    --properties spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    -- --year_month 202201 --country US --data_path gs://dataproc-staging-your-bucket-name/final_project/data/

In the last line, the standalone -- marks the end of gcloud's own flags; everything after it is passed as user-defined arguments to the script. If --data_path is not specified, the script falls back to a default root folder on the cluster's HDFS.
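To check on a submitted job, the standard Dataproc job commands can be used; JOB_ID below is a placeholder for the ID printed by the submit command.

    # List recent jobs in the region
    gcloud dataproc jobs list --region=us-central1

    # Wait for a specific job and stream its driver output
    gcloud dataproc jobs wait JOB_ID --region=us-central1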