Hadoop concepts - PSJoshi/Notes GitHub Wiki
What is big data
Every day human beings eat, sleep, work, play, and produce data, lots and lots of data. According to IBM, the human race generates 2.5 quintillion (2.5 billion billion) bytes of data every day. The term "big data" has become a common catchphrase. Simply put, when people talk about big data, they mean the ability to take large portions of this data, analyze it, and turn it into something useful. But big data is much more than that. It's about:
- taking vast quantities of data, often from multiple sources
- collecting multiple kinds of data at the same time, including data that changes over time, without first transforming it into a specific format or making it consistent
- analyzing the data in a way that allows for ongoing analysis of the same data pools for different purposes
- doing all of that quickly, even in real time.
The IT industry came up with the "three Vs" to describe these facets: volume (the vast quantities), variety (the different kinds of data and the fact that data changes over time), and velocity (speed).
Big data vs. the data warehouse
The data warehouse was purpose-built to analyze specific data for specific purposes. The data was structured and converted to specific formats through extract, transform, and load (ETL), with the original data essentially destroyed in the process, for that specific purpose and no other. This ETL approach limited analysis to specific data for specific analyses. That was fine when all your data lived in your transaction systems, but not so much in today's Internet-connected world with data coming from everywhere. By contrast, in big data the data does not need to be permanently changed (transformed) for analysis. This non-destructive approach means organizations can analyze the same pools of data for different purposes, and can analyze data from sources gathered for different purposes. Big data systems let you work with unstructured data largely as it comes, but the type of query results you get is nowhere near the sophistication of the data warehouse.
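The contrast between destructive ETL and big data's non-destructive, schema-on-read style can be sketched in a few lines of Python. This is an illustrative toy (the record fields and view functions are invented for the example, not any particular system's API): the raw records are kept unchanged, and each analysis derives its own view at read time.

```python
import json

# Raw events kept exactly as they arrived -- never transformed in place.
raw_events = [
    '{"user": "alice", "action": "purchase", "amount": 30.0}',
    '{"user": "bob", "action": "view", "page": "/home"}',
    '{"user": "alice", "action": "view", "page": "/deals"}',
]

def revenue_view(events):
    """One analysis: total purchase revenue (schema applied at read time)."""
    records = (json.loads(e) for e in events)
    return sum(r["amount"] for r in records if r["action"] == "purchase")

def traffic_view(events):
    """A different analysis over the same untouched raw data."""
    records = (json.loads(e) for e in events)
    return [r["page"] for r in records if r["action"] == "view"]

print(revenue_view(raw_events))   # the same raw pool serves both queries
print(traffic_view(raw_events))
```

Because nothing is transformed in place, adding a third analysis later requires no re-ingestion; with ETL, fields dropped during the transform step would be gone for good.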
Technology behind big data
To deliver the four required facets of big data (volume, variety, non-destructive use, and speed) took several technology breakthroughs: a distributed file system (Hadoop's HDFS), a method to make sense of disparate data on the fly (first Google's MapReduce, and more recently Apache Spark), and cloud/Internet infrastructure for accessing and moving the data as needed. Until about a dozen years ago, it wasn't possible to manipulate more than a relatively small amount of data at any one time. Limitations on the amount and location of data storage, computing power, and the ability to handle disparate data formats from multiple sources made the task all but impossible. In 2004, Google published its MapReduce paper. MapReduce simplifies dealing with large data sets by first mapping the data to a series of key/value pairs, then performing calculations on similar keys to reduce them to a single value, processing each chunk of data in parallel on hundreds or thousands of low-cost machines. Google's papers inspired the open-source breakthrough that made big data practical: Apache Hadoop, which consists of two key services:
- reliable data storage using the Hadoop Distributed File System (HDFS)
- high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data—and run large-scale, high-performance processing jobs—in spite of system changes or failures. Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-ons, cross-integration, and custom implementations of technology. There are many useful Hadoop projects like:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high-throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The MapReduce framework is broken down into two functional areas:
- Map, a function that parcels out work to different nodes in the distributed cluster.
- Reduce, a function that collates the work and resolves the results into a single value.
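The two phases above can be illustrated with the classic word-count example, sketched here in plain Python. Real Hadoop jobs are typically written in Java against the MapReduce API and run across many machines; this toy version just mimics the map, shuffle, and reduce steps in a single process.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input chunk.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values under the same key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, each document (or file block) would be mapped on a different node, and the shuffle would move pairs over the network so that all values for one key land on the same reducer.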
One of MapReduce’s primary advantages is that it is fault-tolerant, which it accomplishes by monitoring each node in the cluster; each node is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
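The heartbeat-and-reassign mechanism described above can be sketched as follows. The node names, timeout value, and reassignment policy here are invented for illustration; real Hadoop tracks task attempts through its master services (the JobTracker in classic MapReduce, the ResourceManager in YARN) rather than a loop like this.

```python
HEARTBEAT_TIMEOUT = 30  # seconds; illustrative value, not Hadoop's default

def find_silent_nodes(last_heartbeat, now):
    """Nodes whose last heartbeat is older than the allowed interval."""
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

def reassign(assignments, silent, spare_node):
    """Move every task off the presumed-dead nodes onto a healthy spare."""
    for node in silent:
        for task in assignments.pop(node, []):
            assignments.setdefault(spare_node, []).append(task)
    return assignments

# The master's view of the cluster at time now=100:
assignments = {"node-a": ["task-1"], "node-b": ["task-2", "task-3"]}
last_heartbeat = {"node-a": 95, "node-b": 40}  # node-b went silent at t=40

silent = find_silent_nodes(last_heartbeat, now=100)
assignments = reassign(assignments, silent, spare_node="node-c")
print(assignments)  # node-b's tasks end up on node-c
```

Because inputs are replicated in HDFS, reassigned tasks can reread their data from another replica, so the job as a whole survives the lost node.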
Even with Hadoop, you still need a way to store and access the data. That's typically done via a NoSQL database such as MongoDB, CouchDB, or Cassandra, which specialize in handling unstructured or semi-structured data distributed across multiple machines. Still, having massive amounts of data stored in a NoSQL database across clusters of machines isn't much good until you do something with it. That's where big data analytics comes in. Tools like Tableau, Splunk, and Jasper BI let you parse that data to identify patterns, extract meaning, and reveal new insights.
Ref - https://www.infoworld.com/article/3220044/big-data/what-is-big-data-everything-you-need-to-know.html