Monday, June 3, 2013

Installation of MapR on Ubuntu through Ubuntu Partner Archive.


In continuation of my previous post about the availability of Hadoop for Ubuntu through the Ubuntu Partner Archive: you can now install Hadoop using the apt-get command.

Prerequisites:
CPU : 64-bit
OS : Red Hat, CentOS, SUSE, or Ubuntu
Memory : 4 GB minimum, more in production
Disk : Raw, unformatted drives and partitions
DNS : Hostname, reaches all other nodes
Users : Common users across all nodes; Keyless ssh
Java : Must run Java
Other : NTP, Syslog, PAM
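
A few of these prerequisites can be checked quickly from the shell; a minimal sketch (generic commands, nothing MapR-specific):

uname -m          # should print x86_64 on a 64-bit CPU
java -version     # confirms a working Java runtime
free -m           # shows available memory (4 GB minimum)
hostname -f       # the fully qualified hostname should resolve on every node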

Steps to install Hadoop:
Edit the /etc/apt/sources.list file and add the MapR repositories as follows:

deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional
deb http://package.mapr.com/releases/ecosystem/ubuntu binary/
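
If you prefer to append these entries from the shell, a minimal sketch (assuming sudo rights) is:

echo "deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional" | sudo tee -a /etc/apt/sources.list
echo "deb http://package.mapr.com/releases/ecosystem/ubuntu binary/" | sudo tee -a /etc/apt/sources.list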


Update your package index using the following command:
sudo apt-get update

Now invoke the following command to install Hadoop:
sudo apt-get install mapr-single-node

That's it, start hadooping.

Ubuntu and Hadoop: the perfect match | Canonical

Sunday, June 2, 2013

Installation of HBase in a fully distributed environment

In this post we will see how to install HBase in a fully distributed environment. Before that, we need to look at all the components involved in a fully distributed HBase configuration.

HDFS: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is highly fault-tolerant, provides high throughput, and is well suited to applications with large data sets that need streaming access to their data.

HBase Master: HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the namenode.

Region Servers: HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.

Zookeeper: A distributed Apache HBase (TM) installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you. It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.
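
For example, if you run your own ZooKeeper ensemble and do not want HBase to start or stop it, a minimal sketch of the relevant line in conf/hbase-env.sh is:

# let an externally managed ZooKeeper ensemble handle start/stop
export HBASE_MANAGES_ZK=false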

In the coming example we have two Ubuntu machine images configured in VMware Player, both up and running Hadoop. If you face trouble configuring the Hadoop cluster, you can follow the post http://www.rajkrrsingh.blogspot.in/2013/06/install-and-configure-2-node-hadoop.html.

Consider a scenario in which we have one master and two slave nodes. On the master, edit the /etc/hosts file as follows:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.129  regionserver1.hdcluster.com  regionserver1
192.168.92.130  regionserver2.hdcluster.com  regionserver2

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Edit the /etc/hosts file on the slave with IP address 192.168.92.129 as follows:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.129  regionserver1.hdcluster.com  regionserver1

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Edit the /etc/hosts file on the slave with IP address 192.168.92.130 as follows:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.130  regionserver2.hdcluster.com  regionserver2

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Download the HBase binaries on the master machine and extract them to the home folder.
Edit the conf/hbase-env.sh file as follows:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export HBASE_MANAGES_ZK=true


Now edit conf/hbase-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2010 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
 <property> 
      <name>hbase.master</name> 
      <value>192.168.92.128:60000</value>
 </property> 
 <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/user/hbase</value>
 </property>

 <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
 </property>

 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>master,regionserver1,regionserver2</value>
 </property>

 <property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/rajkrrsingh/zookeeperdatadir</value>
 </property>

 <property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2222</value>
 </property>
</configuration>

Copy the HBase folder to both regionservers; that completes our cluster configuration. You can now start the cluster from the master using the start-hbase.sh command, as sketched below.
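
A minimal sketch, run from the master, assuming passwordless ssh and the home-directory layout used above (the hbase folder name is an example):

# list the regionservers, one per line, replacing the default localhost entry
echo "regionserver1" >  ~/hbase/conf/regionservers
echo "regionserver2" >> ~/hbase/conf/regionservers

# copy the configured HBase folder to each regionserver, then start the cluster
scp -r ~/hbase regionserver1:~/
scp -r ~/hbase regionserver2:~/
~/hbase/bin/start-hbase.sh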

Installing Apache HBase on Ubuntu in standalone mode

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through the REST, Avro, or Thrift gateway APIs.

In the following steps we will set up HBase on Ubuntu in standalone mode.

Step 1: Download the HBase binaries from any of the available mirrors listed at http://www.apache.org/dyn/closer.cgi/hbase/

Step 2: Extract the contents to a local directory, preferably the home directory; a quick download-and-extract sketch is given below.
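
For example, picking a mirror and release from the link above (a sketch; replace <mirror> and <version> with your choices):

cd ~
wget http://<mirror>/hbase-<version>.tar.gz
tar xzf hbase-<version>.tar.gz
cd hbase-<version>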


Step 3: In conf/hbase-env.sh, set JAVA_HOME to the location where Java is installed and change HBASE_HEAPSIZE to 1000 MB; the relevant lines are sketched below.
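
A sketch of these two lines (the JAVA_HOME path is just an example for Oracle Java 6 on Ubuntu; HBASE_HEAPSIZE is in MB):

export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export HBASE_HEAPSIZE=1000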


Step 4: Change conf/hbase-site.xml; a sketch is given below.
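
A minimal standalone hbase-site.xml looks roughly like this; the local paths are examples and can be any writable directories (in standalone mode hbase.rootdir may point at the local filesystem):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>hbase.rootdir</name>
  <value>file:///home/rajkrrsingh/hbase-data</value>
 </property>
 <property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/rajkrrsingh/zookeeperdatadir</value>
 </property>
</configuration>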


Step 5: Copy the Hadoop jar from your Hadoop installation (hadoop-core-1.0.4.jar) to ${HBASE_HOME}/lib so that the HBase and Hadoop client versions match
Step 6: Copy ${HADOOP_HOME}/lib/commons-configuration-*.jar to ${HBASE_HOME}/lib/
Step 7: Now it's done; start the HBase server using the following command: bin/start-hbase.sh
Step 8: Run the jps command to see which services are running; you will find the following daemons up and running:

HRegionServer
HMaster
HQuorumPeer
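
To verify the setup you can open the HBase shell and create a throwaway table (the table and column family names below are arbitrary):

bin/hbase shell
  create 'testtable', 'cf'
  put 'testtable', 'row1', 'cf:greeting', 'hello'
  scan 'testtable'
  exit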



Install and configure a 2-node Hadoop cluster using an Ubuntu image


The major components involved in running the Hadoop ecosystem on a cluster are:
1. Hadoop Distributed File System (HDFS): HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is highly fault-tolerant, provides high throughput, and is well suited to applications with large data sets that need streaming access to their data.

2. MapReduce: MapReduce is the ‘heart’ of Hadoop and consists of two parts, ‘map’ and ‘reduce’. Maps and reduces are programs for processing data: ‘map’ processes the data first to produce intermediate output, which is then processed by ‘reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.

In this tutorial, I will describe how to set up and run a Hadoop cluster. We will build the cluster using two Ubuntu machines (one master and one slave).

Following are the capacities in which nodes may act in our cluster:

1. NameNode: The NameNode is the master node (on which the JobTracker also runs in our setup) and holds the HDFS metadata. It maintains and manages the blocks that are present on the DataNodes. It should be a highly available machine, as it is a single point of failure in HDFS.

2. SecondaryNameNode: Downloads periodic checkpoints from the NameNode for fault tolerance. There is exactly one SecondaryNameNode in each cluster.

3. JobTracker: The JobTracker is a daemon that runs on the NameNode for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different TaskTrackers. In a Hadoop cluster there is only one JobTracker but many TaskTrackers. It is the single point of failure for the Hadoop MapReduce service; if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which it decides whether an assigned task is completed or not.

4. DataNode: DataNodes are the slaves deployed on each machine that provide the actual storage. They are responsible for serving read and write requests from clients.

5. TaskTracker: The TaskTracker is also a daemon, running on the DataNodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes it, divides the work, and assigns pieces to different TaskTrackers to run as MapReduce tasks. While performing these tasks, each TaskTracker keeps communicating with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns that task to another TaskTracker in the cluster.

In our case, one machine in the cluster is designated as the NameNode, SecondaryNameNode, and JobTracker; this is the master. The other machine acts as both DataNode and TaskTracker; it is the slave.

Step 1: Download the Ubuntu image from the Ubuntu website, download VMware Player, and configure the Ubuntu image (allow the guest machine to share the IP address of the host machine).

Step 2: Download the Hadoop binaries from the Apache website and extract them to your home folder.
Step 3: On the master machine, change the hostname to master and edit the /etc/hosts file; a sketch is given below.
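
A sketch of /etc/hosts on the master, reusing the example addresses from the HBase post above (the IPs and the hostname slave are placeholders; adjust them to your own machines):

127.0.0.1       localhost
192.168.92.128  master
192.168.92.129  slave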



Step 4: Allow ssh connectivity between both machines (a passwordless ssh sketch is given after step 7).
Step 5: On the master machine, go to hadoop_home_folder/conf/masters and add the master hostname to the file.
Step 6: On the master machine, go to hadoop_home_folder/conf/slaves and add the slave's hostname to the file.
Step 7: Change the JAVA_HOME path in conf/hadoop-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
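
For step 4, a minimal sketch of setting up passwordless ssh from the master to the slave (assuming the hostname slave and OpenSSH installed on both machines):

ssh-keygen -t rsa -P ""
ssh-copy-id slave
ssh slave          # should log in without prompting for a password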

Step 8: Edit core-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/rajkrrsingh/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>


</configuration>

Step 9: Edit hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>


</configuration>
Step 10: Edit mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>


</configuration>

Step 11: Repeat the same steps on the slave machine.
Step 12: Format the NameNode using the following command:
bin/hadoop namenode -format

Step 13: Now we are all set; let's start the Hadoop cluster. First invoke the bin/start-dfs.sh command, followed by the bin/start-mapred.sh command.
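
Once both scripts have run, a quick way to verify the cluster (a sketch; the daemon names are for Hadoop 1.x) is to run jps on each machine and then try the bundled wordcount example:

jps   # on the master you should see NameNode, SecondaryNameNode and JobTracker
jps   # on the slave you should see DataNode and TaskTracker

bin/hadoop fs -put conf input
bin/hadoop jar hadoop-examples-*.jar wordcount input output
bin/hadoop fs -cat output/part-r-00000 | head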