Hadoop - bobbae/gcp GitHub Wiki

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.

Basic introduction to Apache Hadoop

https://www.youtube.com/watch?v=OoEpfb6yga8

MapReduce job example

This example shows how to create a small three-node Hadoop cluster and submit a sample MapReduce job to it.
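To give a feel for what a MapReduce job does, here is a minimal sketch of a word count in plain Python. It runs in a single process and does not use Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration and only mimic the phases a real cluster would run in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run on different nodes and the shuffle moves data across the network; here everything happens in memory, which is only reasonable for small inputs.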

MrJob

mrjob lets you write MapReduce jobs in Python 2.7/3.4+ and run them on several platforms.

Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.

https://www.youtube.com/watch?v=cMziv1iYt28

Using Apache Hive on Dataproc.

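Apache Hive is often compared to BigQuery: both expose a SQL-like interface for querying large datasets, although BigQuery is a fully managed, serverless service.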

Migrating from Hive to BigQuery

https://cloud.google.com/blog/products/data-analytics/apache-hive-to-bigquery

Hadoop Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

https://www.youtube.com/watch?v=Hve24pRW_Ps

Hive vs Pig vs SQL

https://www.whizlabs.com/blog/hive-vs-pig-vs-sql/

Pig Latin SQL Challenge

A comparison of doing ETL in SQL and in Pig Latin, giving a more detailed feel for why one might prefer one or the other when solving common, real-world problems:

http://www.olric.org/2019/09/pig-latin-sql-challenge-or-window.html?m=1
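As a rough illustration of the two styles, the sketch below computes the same per-customer totals twice: once as a single declarative SQL statement (using Python's built-in `sqlite3` as a stand-in for Hive or another SQL engine), and once as a chain of named steps that loosely mirrors Pig Latin's FILTER, GROUP, and FOREACH operators. The table, column names, and sample rows are invented for this example, and the second function is plain Python, not Pig Latin.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, category, amount)
ROWS = [
    ("alice", "books", 12.0),
    ("bob", "books", 5.0),
    ("alice", "games", 30.0),
    ("bob", "games", 0.0),
]


def totals_sql(rows, min_amount):
    """Declarative style: one query states the desired result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "WHERE amount >= ? GROUP BY customer ORDER BY customer",
            (min_amount,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


def totals_dataflow(rows, min_amount):
    """Dataflow style: each step names an intermediate relation."""
    # Comparable to a Pig FILTER step
    filtered = [row for row in rows if row[2] >= min_amount]
    # Comparable to a Pig GROUP step
    grouped = defaultdict(list)
    for customer, _category, amount in filtered:
        grouped[customer].append(amount)
    # Comparable to a Pig FOREACH ... GENERATE step with SUM
    summed = [(customer, sum(amounts)) for customer, amounts in grouped.items()]
    return sorted(summed)


if __name__ == "__main__":
    print(totals_sql(ROWS, 1.0))
    print(totals_dataflow(ROWS, 1.0))
```

The SQL version is shorter, while the step-by-step version exposes each intermediate result, which can make long ETL pipelines easier to inspect and debug.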

Sawzall

A perspective on Sawzall, a DSL (domain-specific language) that runs over Google's MapReduce, and Pig, a DSL that runs over Hadoop MapReduce.

Big Data Hadoop Tutorial

https://www.guru99.com/bigdata-tutorials.html