Hadoop Intro - salmanbaig8/imp GitHub Wiki

Hadoop handles unstructured and semi-structured data. Hadoop is not suitable for OLTP (online transaction processing), OLAP (online analytical processing), or DSS (decision support systems).

Big Data: large collections of data, also known as data sets, that grow very large. How big is big data? Facebook processes 600 TB every day, Twitter 7 TB every day. Typical goals are to add BI/analytics functionality and to derive information from data in motion (IBM tool: InfoSphere Streams). Hadoop can be used for data at rest as well as data in motion.

Tools used for Hadoop:
- Eclipse: IDE for Java
- Lucene: text search engine library written in Java
- HBase: the Hadoop database
- Hive: data warehousing tool used to extract, transform, and load (ETL) data and store it in Hadoop files
- Pig: high-level language that generates MapReduce code to analyze large data sets
- Spark: cluster computing framework
- ZooKeeper: centralized configuration service and naming registry for large distributed systems
- Apache Ambari: manages and monitors Hadoop clusters through a web UI
- Avro: data serialization system
- UIMA (Unstructured Information Management Architecture): architecture for the development, discovery, composition, and deployment of analytics for unstructured data
- YARN: large-scale operating system for big data applications
- MapReduce: software framework for easily writing applications that process large amounts of data
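To make the MapReduce entry above more concrete, here is a minimal in-process sketch of the map, shuffle, and reduce steps using the classic word-count example. It is plain Python and does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially in one process (illustration only)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "Hadoop processes big data"]))
```

Each map call only looks at its own line, and each reduce call only looks at one key, which is what lets the work be spread over many machines.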

Hadoop is not good for:
- processing transactions (random access)
- work that cannot be parallelized
- low-latency data access
- processing small files
- intensive calculations with little data
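The small-files point can be illustrated with a rough estimate of how many input splits (and therefore map tasks) a job would get. This is a simplified sketch, not Hadoop's actual split logic: it assumes a 128 MB block size, one split per block, and that a split never spans two files. The function name `estimate_splits` and the file sizes are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; the real value is configurable per cluster


def estimate_splits(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Estimate input splits under a simplified model:
    each non-empty file yields ceil(size / block_size) splits,
    and a split never covers more than one file."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb if size > 0)


# The same 10,000 MB of data stored two different ways (hypothetical sizes).
many_small_files = [1] * 10_000   # 10,000 files of 1 MB each
one_large_file = [10_000]         # a single 10,000 MB file

if __name__ == "__main__":
    print("small files:", estimate_splits(many_small_files))
    print("large file: ", estimate_splits(one_large_file))
```

Under these assumptions the small-file layout needs 10,000 splits while the single large file needs 79, so most of the job's time would go to task startup overhead rather than to processing data.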

Using the cloud, a Hadoop cluster can be set up on demand.