The major components involved in running the Hadoop ecosystem on a cluster are:
1. Hadoop Distributed File System (HDFS): HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is highly fault-tolerant, provides high throughput, and is well suited for applications with large data sets that need streaming access to their data.
2. MapReduce: MapReduce is the ‘heart’ of Hadoop and consists of two parts, ‘map’ and ‘reduce’. Both are programs for processing data: ‘map’ processes the input first to produce intermediate output, which is then processed by ‘reduce’ to generate the final output. MapReduce thus allows the map and reduce operations to run in a distributed fashion across the cluster.
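As a rough analogy (not Hadoop itself, just an illustration of the map-sort-reduce flow), a word count can be written as a Unix pipeline in which the first stage plays the role of ‘map’ and the last plays the role of ‘reduce’:
# ‘map’: emit one word per line; sort: group identical words; ‘reduce’: count each group
cat input.txt | tr -s ' ' '\n' | sort | uniq -c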
In this tutorial, I will describe how to set up and run a Hadoop cluster. We will build the cluster using three Ubuntu machines.
The following are the roles in which nodes may act in our cluster:
1. NameNode: The NameNode is the master node on which the JobTracker runs, and it holds the file system metadata. It maintains and manages the blocks that are present on the DataNodes. It should run on reliable hardware, as it is a single point of failure in HDFS.
2. SecondaryNameNode: Periodically fetches the namespace image and edit log from the NameNode and merges them into a new checkpoint, protecting against metadata loss. There is exactly one SecondaryNameNode in each cluster.
3. JobTracker: The JobTracker is a daemon, running here on the NameNode machine, that accepts and tracks MapReduce job submissions in Hadoop. It assigns tasks to the different TaskTrackers. In a Hadoop cluster there is only one JobTracker but many TaskTrackers. It is the single point of failure for the MapReduce service: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, on the basis of which it decides whether an assigned task has completed.
4. DataNode: DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.
5. TaskTracker: The TaskTracker is also a daemon that runs on the DataNodes and manages the execution of individual tasks on a slave node. When a client submits a job, the JobTracker initializes it, divides the work, and assigns it to different TaskTrackers to perform the MapReduce tasks. While performing this work, each TaskTracker keeps communicating with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.
In our case, one machine in the cluster is designated as the NameNode, SecondaryNameNode and JobTracker; this is the master. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves.
Step 1: Download the Ubuntu image from the Ubuntu website, download VMware Player, and configure the Ubuntu image (allow the guest machine to share the IP address of the host machine).
Step 2: Download the Hadoop binaries from the Apache website and extract them to your home folder.
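For example (this assumes a Hadoop 1.x release; substitute the version and mirror you actually use):
# download a Hadoop 1.x tarball from the Apache archive and unpack it into the home folder
cd ~
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xzf hadoop-1.2.1.tar.gz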
Step 3: On the master machine, change the hostname to master and edit the hosts file as shown below.
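For example (the IP addresses below are placeholders; use your machines' real addresses):
# set the hostname on the master (repeat on each slave with its own name)
sudo hostname master
# add entries like these to /etc/hosts on every node
192.168.1.100   master
192.168.1.101   slave1
192.168.1.102   slave2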
Step 4: Allow passwordless SSH connectivity between the machines, for example as shown below.
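One way to do this (assuming the same user, here rajkrrsingh, exists on every node, and using the hostnames from the example above) is to generate a key on the master and copy it to each machine:
# on the master: create an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P ""
# copy the public key to the master itself and to each slave
ssh-copy-id rajkrrsingh@master
ssh-copy-id rajkrrsingh@slave1
ssh-copy-id rajkrrsingh@slave2
# verify that login works without a password
ssh slave1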
Step 5: On the master machine, go to hadoop_home_folder/conf/masters and add the master's hostname to the file.
Step 6: On the master machine, go to hadoop_home_folder/conf/slaves and add the slaves' hostnames to the file, as in the example below.
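With the hostnames assumed above, the two files would contain:
# conf/masters
master
# conf/slaves
slave1
slave2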
Step 7: Change the JAVA_HOME path in conf/hadoop-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
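The java-6-oracle path above is simply what is installed on my machines; if you are unsure which path to use, check which JVMs are available:
# list installed JVMs and confirm that java runs
ls /usr/lib/jvm/
java -version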
Step 8: Edit conf/core-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/rajkrrsingh/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Step 9: Edit conf/hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>
</configuration>
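Make sure the directories referenced in core-site.xml and hdfs-site.xml exist and are writable by the Hadoop user, for example:
# create the temporary and storage directories used in the configuration above
mkdir -p /home/rajkrrsingh/tmp
mkdir -p /home/rajkrrsingh/namenodeanddatanode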
Step 10: Edit conf/mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
Step 11: Repeat the same steps on the slave machines.
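Instead of editing every file by hand on each slave, you can also copy the configuration from the master (this assumes the Hadoop folder sits at the same path, here ~/hadoop-1.2.1, on every node):
# push the configuration files from the master to each slave
scp conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml conf/hadoop-env.sh rajkrrsingh@slave1:~/hadoop-1.2.1/conf/
scp conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml conf/hadoop-env.sh rajkrrsingh@slave2:~/hadoop-1.2.1/conf/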
Step 12: Format the NameNode using the following command:
bin/hadoop namenode -format
Step 13: Now we are all set. Let's start the Hadoop cluster: first invoke the bin/start-dfs.sh command, followed by the bin/start-mapred.sh command.
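To verify that everything came up, run jps on each node; with the role assignment above, the master should show the NameNode, SecondaryNameNode and JobTracker daemons, and each slave should show a DataNode and a TaskTracker:
# list the running Hadoop daemons on this node (PIDs will differ)
jps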