Hadoop Architecture and HDFS - salmanbaig8/imp GitHub Wiki

URL : https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0111EN+2016/courseware/f7e45d019b4648c0a31cca0f46abd34a/ffdb5d5ddec8463f9afbbf83e3874408/

PRE 2.2 Arch: Terminology:

- Node: simply a computer
- Rack: a collection of nodes (typically 30 to 40, e.g. rack1, rack2) connected to the same network switch; network bandwidth between any 2 nodes in the same rack is greater than between 2 nodes on different racks
- Hadoop cluster (cluster): a collection of racks

2 main components:

- Distributed file system: HDFS (Hadoop Distributed File System), or IBM Spectrum Scale
- MapReduce engine: framework for performing calculations on the data in the file system; has a built-in resource manager and scheduler

HDFS:

- Runs on top of an existing file system
- Not POSIX compliant
- Designed to tolerate a high component failure rate (reliability is achieved through replication)
- Designed to handle very large files with large streaming data access patterns; no random access
- Uses blocks to store a file or parts of a file

HDFS file blocks:

- Not the same as OS file blocks; an HDFS block is made up of multiple OS blocks
- Default is 64 MB (128 MB is recommended, and is the BigInsights default)
- Blocks of data are replicated to multiple nodes (so if node 1 crashes, node 2 still holds a copy)
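A quick way to see the block math: the number of HDFS blocks a file occupies is its size divided by the block size, rounded up (only the last block is partially filled; it is not padded on disk). A minimal shell sketch, using a made-up 1000 MB file:

```shell
# Hypothetical example: how many 128 MB blocks does a 1000 MB file need?
FILE_MB=1000     # made-up file size
BLOCK_MB=128     # block size (the BigInsights default)
# ceiling division: a partial final block still counts as a block
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "$BLOCKS"   # prints 8
```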

MapReduce (from Google):

- Processes huge datasets for certain kinds of distributable problems using a large number of nodes
- A MapReduce program consists of map and reduce functions
- Allows for distributed processing of the map and reduce operations (tasks run in parallel)
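The map → shuffle → reduce flow can be mimicked with the classic Unix-pipeline analogy, word count: `tr` plays the map, `sort` the shuffle, and `uniq -c` the reduce (the input line is made up; this is an analogy, not real Hadoop):

```shell
printf 'to be or not to be\n' |
  tr ' ' '\n' |   # map: emit one word (key) per line
  sort |          # shuffle: bring identical keys together
  uniq -c         # reduce: count each group of identical keys
```

In a real cluster the map and reduce steps each run in parallel across many nodes; the pipeline only shows the data flow.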

Types of nodes for HDFS in a cluster:

- DataNode (many): stores blocks; blocks from different files can be stored on the same DataNode; reports the list of blocks it stores to the NameNode
- NameNode: keeps metadata in memory; responsible for identifying which DataNode holds which data
- Checkpoint node, backup node
- JobTracker: manages MapReduce jobs in the cluster; one per Hadoop cluster; receives jobs requested by clients; schedules and monitors MapReduce jobs on TaskTrackers; attempts to direct a task to the TaskTracker where the data resides
- TaskTracker (many): runs the MapReduce tasks in JVMs; has a set number of slots to run tasks; communicates with the JobTracker via heartbeat messages; reads blocks from DataNodes

ARCH 2.2:

- Provides YARN (MapReduce v2)
- Resource manager and scheduler external to any framework
- DataNodes still exist
- JobTracker and TaskTracker no longer exist

2 main ideas for YARN:

- Provides generic scheduling and resource management: supports more than just MapReduce, and more than just batch processing
- More efficient scheduling and workload management

High availability: 2 NameNodes, but only 1 is active; JournalNodes: 3, or some other odd number

NameNodes: each has its own namespace; a block pool gives access to multiple DataNodes

HDFS replication (writing data):

1. Client -> NameNode, which checks that the file doesn't already exist and that the client has permission to write it
2. NameNode determines the DataNode where the 1st block is written; if the client is itself running on a DataNode, the block is written there
3. The data is replicated and a pipeline is created between DataNodes; the replica and the final node are placed in the same rack
4. A success acknowledgment is sent back from the 3rd node to the 2nd, then the 1st, then to the client
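The write pipeline and the reverse acknowledgment path can be sketched as a toy shell function (pure simulation with made-up node names; no real HDFS involved):

```shell
# Simulates the 3-replica write described above:
# data flows client -> node1 -> node2 -> node3,
# acks flow back node3 -> node2 -> node1 -> client.
write_pipeline() {
  for node in node1 node2 node3; do
    echo "data -> $node"
  done
  for node in node3 node2 node1; do
    echo "ack  <- $node"
  done
  echo "client: write acknowledged"
}
write_pipeline
```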

HDFS command line: syntax is `hdfs dfs <command>`; ex: `hdfs dfs -ls`

All FS shell commands take path URIs as arguments: `scheme://authority/path`

Scheme: for HDFS it is `hdfs`; for the local filesystem it is `file`

ex: `hdfs dfs -cp file:///sampleData/spark/myfile.txt hdfs://rvm.svl.ibm.com:8020/user/spark/test/myfile.txt`

Scheme and authority are optional; defaults are taken from the core-site.xml configuration file

Supports most POSIX-like commands, plus some HDFS-specific ones: copyFromLocal, copyToLocal, get, getmerge, put, setrep

- copyFromLocal / put: local FS to HDFS
- copyToLocal / get: HDFS to local FS
- getmerge: gets all files from directories that match the source pattern, merges and sorts them into a single file on the local FS
- setrep: sets the replication factor of a file; can be executed recursively to change an entire tree; can be told to wait until the target replication level is achieved
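Typical invocations of the HDFS-specific commands above might look like the following (paths are placeholders, and the commands assume a running cluster, so they are shown for illustration only):

```shell
# local FS -> HDFS (copyFromLocal works the same way)
hdfs dfs -put myfile.txt /user/spark/test/

# HDFS -> local FS (copyToLocal works the same way)
hdfs dfs -get /user/spark/test/myfile.txt .

# merge all files under an HDFS directory into a single local file
hdfs dfs -getmerge /user/spark/test/ merged.txt

# set the replication factor to 3; -w waits until the target level is reached,
# and passing a directory applies the change to the whole tree
hdfs dfs -setrep -w 3 /user/spark/test/
```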
