Hadoop - taoualiw/My-Knowledge-Base GitHub Wiki

Hadoop

Hadoop is a framework that addresses the two main problems of Big Data:

  • Storing : HDFS, a distributed file system
  • Processing : YARN, which manages cluster resources for parallel, scalable processing

Hadoop Ecosystem:

  • MapReduce : software platform that provides the processing logic, using parallel distributed algorithms : Map : filter, group, sort / Reduce : aggregate, summarize
  • Pig : scripting tool (programming language + environment)
  • Apache Hive : SQL-like query interface
  • Mahout & Spark MLlib : machine learning
  • Spark : real-time data analysis
  • HBase : open-source NoSQL database
  • ZooKeeper & Ambari : management and coordination of jobs and services
  • Oozie : a clock, scheduler
  • Sqoop/Flume : ingestion of data into HDFS
  • HDFS
  • YARN
  • Kafka/Storm
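
The Map and Reduce steps from the MapReduce entry above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and a real cluster would run the map and reduce tasks in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases (hypothetical helper, runs in one process)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data needs big storage", "hadoop stores big data"]
    for word, count in word_count(sample).items():
        print(word, count)
```

In this sketch the grouping and sorting happen in an explicit shuffle step between Map and Reduce, so that the reduce step only has to aggregate.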

