Hello readers! This is the 4th article in our tutorial series on Hadoop. In this article, we will learn about Hadoop's basic building blocks, along with their internal structure and working.
What is Hadoop?
Hadoop is a framework to store and process very large data sets in a distributed environment made up of commodity hardware. It stores data in a distributed file system and, unlike traditional clusters, it performs computation on the nodes where the data is stored.
1. Hadoop can store enormous amounts of unstructured data in a short time.
2. Hadoop's distributed computing model processes big data very fast. The processing power of a Hadoop cluster depends on the number of nodes you use: the more nodes, the more processing power.
3. The Hadoop framework is designed so that it can be scaled both vertically and horizontally.
4. Unlike traditional relational data stores, Hadoop is flexible enough to store any type of data without any pre-processing.
5. Hadoop is highly fault tolerant. The presence of replicated data in the cluster protects it against hardware failure.
Where is it used, and how?
Beyond extracting relevant and interesting knowledge from data, spam filtering and similar tasks, Hadoop is now being used by many organisations:
1. One of the most popular uses of Hadoop is building web-based recommendation engines. Facebook's "people you may know", LinkedIn's "jobs you may be interested in", and the "people also viewed" recommendations on Flipkart, Amazon, Myntra and eBay all rely on Hadoop.
2. It helps us create a data lake, which stores data in its original format to give data scientists an unrefined view so that they can understand and analyze it properly.
3. Hadoop also plays a crucial role as a data warehouse; tools built on top of Hadoop (Flume, Sqoop, Hive) provide Extract-Transform-Load (ETL) facilities.
4. Many companies use Hadoop as cheap data storage; low-cost commodity hardware makes it affordable.
Challenges in Hadoop:
1. A major challenge is that Hadoop is not well suited to interactive analytical tasks; it is built for simple batch-style jobs.
2. Data security is another prime challenge in Hadoop. The Kerberos authentication protocol is used to secure the Hadoop environment.
3. There is no tool in the Hadoop ecosystem to check data standardization and data quality.
4. A very basic problem is that it is difficult to find people who have Java skills and the ability to do productive work with MapReduce.
Hadoop 2.0 has 4 basic building blocks, as mentioned below:
1. Hadoop Common: the basic libraries and utilities used by Hadoop.
2. Hadoop Distributed File System (HDFS): a scalable distributed file system used to store data. It stores data on multiple nodes, maintaining replicas, without any pre-processing.
3. MapReduce: a framework to process huge amounts of data in parallel on a cluster of commodity hardware.
4. Yet Another Resource Negotiator (YARN): the resource manager for the processes running on Hadoop.
Fig 1: Hadoop 1.0 and Hadoop 2.0 basic building blocks
Let us discuss the Hadoop 2.0 building blocks shown in the figure above.
1. Hadoop Common: Hadoop Common is the core of the Hadoop framework and provides its most essential services. Along with that, it abstracts away the underlying OS and its file system. The essential JARs and scripts required to start and stop Hadoop are also part of Hadoop Common. It also contains documentation for the other Hadoop projects.
2. HDFS: HDFS is a distributed file system used to store very large files. It stores files in a distributed manner: depending on the size of a file, it divides the file into blocks and stores those blocks across the cluster. It provides fault tolerance by replicating those blocks across the cluster. The default replica count is three, but you can change it. For fault tolerance, HDFS stores one replica in the same rack (we will discuss this in the next post on HDFS), so that the failure of a node can be overcome quickly and processing can continue, and stores another replica on a different rack.
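As an illustration, the replica count mentioned above is controlled by the standard dfs.replication property in hdfs-site.xml; this is only a minimal sketch, and the value of 3 shown is simply the usual default:

<!-- hdfs-site.xml: how many copies of each block HDFS keeps -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replica count; adjust to your fault-tolerance needs -->
</property>

The replication factor of an existing file can also be changed with the hdfs dfs -setrep command.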
In Hadoop 1.0, HDFS consists of a NameNode, DataNodes and a Secondary NameNode. The NameNode acts as the master (a powerful server); its job is to maintain the metadata of HDFS in the fsimage and edit logs. The metadata contains the namespace information and block information of HDFS. It does not store the actual data set, but it knows where those data blocks are stored. The NameNode is a very critical part of Hadoop: its failure makes the whole cluster inaccessible. Because the NameNode acts as the master of HDFS, it is a single point of failure, and hence it must be built from reliable hardware. DataNodes act as slaves and are affordable computers (commodity hardware).
DataNodes are responsible for storing the actual data in HDFS (so they are configured with large hard disks). They must stay in continuous communication with the NameNode to indicate their availability in the cluster. To do so, they send a heartbeat signal (which indicates that the DataNode is alive) to the NameNode. On startup, a DataNode sends a heartbeat along with its block metadata to the NameNode. The failure of a DataNode does not affect the availability of data to a job, because the NameNode handles it by locating a replica of the data block elsewhere in the cluster and providing that replica so the job can continue.
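For reference, the heartbeat interval described above is tunable; a minimal sketch using the standard dfs.heartbeat.interval property in hdfs-site.xml (its usual default is 3 seconds):

<!-- hdfs-site.xml: how often each DataNode sends a heartbeat to the NameNode -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds between heartbeats -->
</property>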
Secondary NameNode is actually a misleading name, and it confuses newcomers. The name suggests that it is a backup node for the NameNode, but it is not. The Secondary NameNode's task is to keep the fsimage up to date by merging the edit logs into it; the edit log records the sequence of file system changes made since the NameNode started. Because the Secondary NameNode stores this checkpoint, it is also known as the checkpoint node.
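How often the Secondary NameNode performs this merge is also configurable; a minimal sketch using the standard dfs.namenode.checkpoint.period property (its usual default is 3600 seconds, i.e. one hour):

<!-- hdfs-site.xml: interval between two consecutive checkpoints of fsimage and edit logs -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds; the Secondary NameNode merges the edit logs into the fsimage this often -->
</property>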
3. MapReduce: MapReduce is a framework for processing large amounts of data on a cluster of commodity hardware, in parallel. A MapReduce job is made up of map tasks and reduce tasks. The map task receives its input as <key, value> pairs and generates intermediate output in the same <key, value> format. The MapReduce library is responsible for consolidating the intermediate values for each key and providing them as input to the reduce task. When the reduce task receives this input, it merges the values belonging to each intermediate key and performs the required operation on them. The reduce task always runs after the map tasks complete.
A program written with the MapReduce framework automatically executes in parallel on the cluster. The system takes care of partitioning the data, scheduling the tasks across the cluster, handling failures, and managing inter- and intra-machine communication. The programmer never needs to know about the parallelism of the program or where it is executing. The MapReduce framework is powerful enough to process petabytes of data on a cluster of thousands of nodes.
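To make the <key, value> flow concrete, here is a sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API; the class name and the input/output paths taken from the command line are illustrative, not part of any fixed Hadoop convention:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: reads one line at a time and emits a <word, 1> pair for every word
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // intermediate output: <word, 1>
      }
    }
  }

  // Reduce task: receives <word, [1, 1, ...]> after the shuffle and emits <word, total>
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final output: <word, count>
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits <word, 1> pairs, the framework groups all values for the same word during the shuffle, and the reducer sums them, mirroring the map/shuffle/reduce flow described above.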
4. YARN: YARN was introduced in second-generation Hadoop (hadoop-2.0) by the Apache open source software foundation. YARN is a resource management framework used to manage and provide resources to applications at execution time. YARN provides APIs and daemon processes to write generic MapReduce applications, and it is also capable of running applications that do not follow the MapReduce model. An application can be a simple MapReduce job or a Directed Acyclic Graph (DAG) generated by the various tools that run their queries on Hadoop; one example is Hive.
The basic idea behind YARN is to decouple resource management, job scheduling and job monitoring into separate daemons. This brought the ResourceManager and the per-application ApplicationMaster into the picture. Resource scheduling, monitoring and management are done by the ResourceManager. The NodeManager is a daemon that runs on every machine/node; its responsibility is to launch, manage and monitor containers, and to report the same to the ResourceManager.
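As an example of the resources a NodeManager offers to the ResourceManager for containers, the following yarn-site.xml snippet is a minimal sketch using the standard yarn.nodemanager.resource properties; the values shown are illustrative, not recommendations:

<!-- yarn-site.xml: resources this node makes available for containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory (MB) that containers on this node may use -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value> <!-- total virtual cores that containers on this node may use -->
</property>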
One of the key differences between YARN and MR1 is the ApplicationMaster. The execution of the application is driven by the ApplicationMaster: it obtains containers from the ResourceManager and executes the application in them. The ApplicationMaster is the first process started after the application is launched in a container (known as a managed AM), and it exits after all of the application's processes/tasks complete.
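A few standard YARN shell commands (assuming a running cluster; the application ID is a placeholder) let you observe applications and their ApplicationMasters:

yarn application -list                       # list running applications and their states
yarn application -status <ApplicationId>     # show progress and the tracking URL of one application
yarn logs -applicationId <ApplicationId>     # fetch aggregated container logs (requires log aggregation to be enabled)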
Here we come to the end of the article. We learned about the what, why and where of Hadoop and its basic building blocks. In the next article, we will discuss HDFS and MapReduce in detail. Please share your views and feedback in the comment section below, and stay tuned for more articles on Hadoop.