GCP Dataproc

Dataproc is GCP's managed big data service for running Hadoop and Spark clusters. Hadoop and Spark are open-source frameworks that handle data processing for big data applications in a distributed manner: they provide massive storage for data while also providing enormous processing power to handle concurrent processing tasks.

Dataproc moves away from persistent clusters towards ephemeral clusters. Because Dataproc integrates well with Cloud Storage, when we need to run a job we can spin up a cluster very quickly, process our data, and store the results on Cloud Storage in the same region. Then, we can simply delete the cluster.
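A minimal sketch of this ephemeral workflow with the gcloud CLI; the cluster name, region, bucket, and job file below are placeholders:

```bash
# Spin up a small ephemeral cluster
gcloud dataproc clusters create ephemeral-cluster \
    --region=europe-west1 \
    --num-workers=2

# Run a PySpark job that reads from and writes to Cloud Storage in the same region
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/transform.py \
    --cluster=ephemeral-cluster \
    --region=europe-west1 \
    -- gs://my-bucket/input/ gs://my-bucket/output/

# Delete the cluster once the job finishes; the results remain in Cloud Storage
gcloud dataproc clusters delete ephemeral-cluster \
    --region=europe-west1 --quiet
```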

The underlying Dataproc infrastructure is built on Compute Engine, which means we can choose from several machine types depending on our budget and take advantage of both predefined and custom machine types. Costs can be reduced further by using preemptible instances.
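For example, machine types are selected per node group at creation time; the types shown here are only illustrative choices:

```bash
# Pick predefined machine types for the master and workers.
# Custom machine types use the Compute Engine naming scheme,
# e.g. custom-6-23040 for 6 vCPUs and 22.5 GB of memory.
gcloud dataproc clusters create sized-cluster \
    --region=europe-west1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-highmem-8 \
    --num-workers=4
```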

In a Dataproc cluster, there are different classes of machines:

  • Master nodes: These machines assign and synchronize tasks on the worker nodes and process the results.
  • Worker nodes: These machines process the data. They can be expensive due to their high CPU and memory specifications.
  • Preemptible worker nodes: These are secondary worker nodes and are optional. They do the same work but lower the per-hour compute costs for non-critical data processing (see the example after this list).
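A sketch of adding preemptible secondary workers, both at creation time and by resizing an existing cluster later; names and counts are placeholders:

```bash
# Create a cluster with two primary workers and two preemptible secondary workers
gcloud dataproc clusters create batch-cluster \
    --region=europe-west1 \
    --num-workers=2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible

# Scale the secondary workers up or down later without touching the primary workers
gcloud dataproc clusters update batch-cluster \
    --region=europe-west1 \
    --num-secondary-workers=6
```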

When we create a new cluster, we can select different cluster modes:

  • Standard: This includes one master node and N worker nodes. In the event of a Compute Engine failure, in-flight jobs will fail and the filesystem will be inaccessible until the master node reboots.
  • High availability: This includes three master nodes and N worker nodes. This is designed to allow uninterrupted operations, despite a Compute Engine failure or reboots.
  • Single node: This combines the master and worker roles on a single machine. It is not suitable for large-scale data processing and should be used for PoCs or small-scale, non-critical data processing (see the examples after this list).
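The cluster mode maps to creation flags; a sketch with placeholder names and region:

```bash
# Standard mode: one master (the default) and N workers
gcloud dataproc clusters create std-cluster \
    --region=europe-west1 --num-workers=2

# High availability mode: three masters and N workers
gcloud dataproc clusters create ha-cluster \
    --region=europe-west1 --num-masters=3 --num-workers=2

# Single node mode: master and workers combined on one machine
gcloud dataproc clusters create dev-cluster \
    --region=europe-west1 --single-node
```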

By default, when we create a cluster, standard Apache Hadoop ecosystem components will be automatically installed on the cluster:

  • Apache Spark
  • Apache Hadoop
  • Apache Pig
  • Apache Hive
  • Python
  • Java
  • Hadoop Distributed File System (HDFS)
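The exact component versions are determined by the cluster's image version, and additional components can be enabled at creation time. A hedged example; the image version and component list are only illustrative:

```bash
# Pin the Dataproc image version and enable optional components on top of the defaults
gcloud dataproc clusters create analytics-cluster \
    --region=europe-west1 \
    --image-version=2.1-debian11 \
    --optional-components=JUPYTER,ZEPPELIN \
    --enable-component-gateway
```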

IAM

  • Dataproc Editor: This has full control over Dataproc.
  • Dataproc Viewer: This has rights to get and list Dataproc machine types, regions, zones, and projects.
  • Dataproc Worker: This is for service accounts only and provides the minimum permissions necessary to operate with Dataproc.
  • Dataproc Admin: This role has the same permissions as Editor and can additionally get and set Dataproc IAM policies.
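These roles are granted through standard IAM policy bindings; a sketch, where the project, user, and service account names are placeholders:

```bash
# Allow a user to create and manage Dataproc resources in the project
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" \
    --role="roles/dataproc.editor"

# Grant the worker role to the service account that the cluster VMs run as
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```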

Migrating HDFS Data from On-Premises to Google Cloud

https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-data

There are two different migration models to consider when transferring HDFS data to the cloud:

  • push: the copy jobs run on the on-premises cluster and push the data to Cloud Storage.
  • pull: the copy jobs run on an ephemeral Dataproc cluster in Google Cloud and pull the data from the on-premises cluster.
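In both models the copy itself is typically a Hadoop DistCp job writing through the Cloud Storage connector; a minimal sketch, assuming a placeholder namenode address and bucket:

```bash
# Copy a directory from on-premises HDFS to Cloud Storage with DistCp.
# Push model: run this on the on-premises cluster.
# Pull model: run this on an ephemeral Dataproc cluster that can reach the on-premises namenode.
hadoop distcp \
    hdfs://onprem-namenode:8020/user/data \
    gs://my-migration-bucket/user/data
```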