Hadoop - taoualiw/My-Knowledge-Base GitHub Wiki
Hadoop is a framework that addresses the two main problems of Big Data:
- Storing: HDFS, distributed storage
- Processing: YARN, parallel and scalable processing
- MapReduce: software framework that provides the processing logic, using parallel distributed algorithms. Map: filter, group, sort / Reduce: aggregate, summarize
- Pig: scripting tool (programming language, Pig Latin, plus its execution environment)
- Apache Hive: SQL-like querying (HiveQL)
- Mahout & Spark MLlib: machine learning
- Spark: real-time data analysis
- HBase: open-source NoSQL database
- ZooKeeper & Ambari: management and coordination of jobs and services
- Oozie: a clock, i.e. a workflow scheduler
- Sqoop/Flume: ingestion of data into HDFS
- HDFS
- YARN
- Kafka/Storm
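
The Map / Reduce split described in the MapReduce item above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for the example. In this sketch the grouping and sorting of keys is done in a separate `shuffle` step that sits between the map and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or block) would be mapped on a different node;
    # here everything runs sequentially in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    print(word_count(["big data big cluster", "data node"]))
```

Because each `map_phase` call depends only on its own line, and each `reduce_phase` call only on its own key, both steps can be spread across machines, which is what makes the model scale.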