Hadoop Components - salmanbaig8/imp GitHub Wiki

URL: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0111EN+2016/courseware/70992a516f564b6a816181bda269b71f/eb01da7f58f5449d8c0fc14e43662571/

The MapReduce philosophy: processes huge datasets for certain kinds of distributable problems using a large number of nodes.

- Map: the master node partitions the input into smaller sub-problems and distributes the sub-problems to the worker nodes
- Reduce: the master node then takes the answers to all the sub-problems and combines them in some way to produce the output
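The map/reduce split above can be sketched in plain Python (a toy single-process simulation of the idea, not Hadoop's API; the function names and word-count task are made up for illustration):

```python
from collections import defaultdict

def map_phase(partition):
    """Worker-side map: emit (word, 1) pairs for one input partition."""
    return [(word, 1) for line in partition for word in line.split()]

def reduce_phase(pairs):
    """Master-side reduce: combine the sub-answers into the final output."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "big cluster"]
# The master partitions the input into smaller sub-problems (one line each here)
partitions = [[line] for line in lines]
# Each worker maps its partition; the master reduces the combined results
mapped = [pair for part in partitions for pair in map_phase(part)]
result = reduce_phase(mapped)
# result: {'big': 3, 'data': 1, 'cluster': 2}
```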

How to use Pig and Hive in a Hadoop environment:

- They translate high-level languages into MapReduce jobs
- Offer significant reductions in program size over Java
- Provide points of extension to cover gaps in functionality
- Provide interoperability with other languages
- Neither supports random reads/writes or low-latency queries

Pig: developed at Yahoo; a dataflow language; can operate on complex, nested data structures; schema optional; relationally complete; Turing complete when extended with Java UDFs.

Running Pig:

- Script: `pig scriptfile.pig`
- Grunt: `pig` (launches the interactive command-line shell)
- Embedded: call into Pig from Java
- Execution environments: local or distributed
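A minimal Pig Latin dataflow in the style described above (the file name, field names, and output path are hypothetical; run it with `pig scriptfile.pig`):

```pig
-- Load a tab-separated log file; the AS schema clause is optional in Pig
logs = LOAD 'access.log' AS (user:chararray, url:chararray);

-- Dataflow style: each statement transforms the previous relation
grouped = GROUP logs BY user;
counts  = FOREACH grouped GENERATE group AS user, COUNT(logs) AS hits;

STORE counts INTO 'hits_per_user';
```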

Hive: developed at Facebook; a declarative language (SQL dialect); schema is not optional, but the same data can have many schemas; relationally complete; Turing complete when extended with Java UDFs.

Running Hive:

- Interactive shell: `hive`
- Script: `hive -f myscript`
- Inline: `hive -e 'SELECT * FROM myTable'`
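For comparison, the same kind of job in Hive's declarative SQL dialect (table and column names are hypothetical); any of the three run modes above will accept it:

```sql
-- Hive requires a schema up front; CREATE TABLE declares it over data in HDFS
CREATE TABLE logs (user STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Declarative: state the result you want, Hive compiles it to MapReduce jobs
SELECT user, COUNT(*) AS hits
FROM logs
GROUP BY user;
```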

Moving data into Hadoop using Flume and Sqoop:

Flume: a service for moving large amounts of data around a cluster soon after the data is produced.

- Primary use case: gathering log files from every machine in a cluster and transferring the data to a centralized persistent store, e.g. HDFS
- Stream-oriented dataflow: source --> logical node --> sink, e.g. tail(access.log) --> logical node --> HDFS
- Tiers: agent tier, collector tier, storage tier
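In current Flume (Flume NG) the source --> node --> sink flow is wired up in an agent properties file; a sketch of the tail-to-HDFS example above (the agent name, channel name, and paths are hypothetical):

```properties
# One agent wired as source -> channel -> sink,
# mirroring tail(access.log) --> logical node --> HDFS
agent1.sources = tail-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Source: follow the log file as new lines are produced
agent1.sources.tail-src.type = exec
agent1.sources.tail-src.command = tail -F /var/log/access.log
agent1.sources.tail-src.channels = mem-ch

# Channel: in-memory buffer between source and sink
agent1.channels.mem-ch.type = memory

# Sink: the centralized persistent store (HDFS)
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/logs
agent1.sinks.hdfs-sink.channel = mem-ch
```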

Sqoop: transfers data between Hadoop and relational databases; uses MapReduce to import and export the data.
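Typical Sqoop invocations (the connection string, table names, and HDFS directories are hypothetical); each command launches a MapReduce job under the hood:

```shell
# Import a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table order_totals \
  --export-dir /data/order_totals
```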

Scheduling and controlling Hadoop job execution using Oozie:

- A workflow is a collection of actions arranged in a directed acyclic graph (DAG); there are control dependencies between actions
- Workflows are written in hPDL (an XML dialect)
- Workflow actions start jobs in remote systems; the remote systems call back to Oozie to notify it that the action has completed
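A minimal hPDL workflow with a single action; the workflow and node names are hypothetical, and `${jobTracker}`/`${nameNode}` are placeholders resolved from the job properties:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="count-step"/>
    <action name="count-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <!-- The remote system's callback tells Oozie the action finished;
             Oozie then follows the control dependency to the next node -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```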