Sunday, November 24, 2013

HBase compatibility with Hadoop 2.x

HBase releases downloaded from the mirror site are compiled against a specific Hadoop version. For example, hbase-0.94.7 and hbase-0.92.2 are compatible with Hadoop 1.0.x; they won't work with Hadoop 2.x.

If you have Hadoop 2.x installed on your system, you need to download the HBase source and compile it against the Hadoop 2.x version.

Let's make it work with Hadoop 2.x.

Download and install the latest version of Maven

Install subversion
$sudo yum install subversion

Checkout HBase code
$svn co http://svn.apache.org/repos/asf/hbase/tags/0.94.7 hbase-0.94.7

Now go to the HBase directory and build an HBase tarball; use the -Dhadoop.profile option to compile against Hadoop 2.x:
$export MAVEN_OPTS="-Xmx2g"
$mvn clean install javadoc:aggregate site assembly:assembly -DskipTests -Prelease -Dhadoop.profile=2.0

Verify that the *.tar.gz file is produced in the target directory.

Tuesday, November 19, 2013

Apache Oozie Workflow : Configure and Running a MapReduce job

In this post I will demonstrate how to configure an Oozie workflow. Let's develop a simple MapReduce program using Java; if you find any difficulty in doing it, download the code from my git location: Download

Please follow my earlier post to install and run the Oozie server, then create a job directory, say SimpleOozieMR, with the following directory structure

---SimpleOozieMR
----workflow
-----lib
------workflow.xml

Copy your Hadoop job jar and the related jars into the lib folder.
Now let's configure our workflow.xml and keep it in the workflow directory as shown:
<workflow-app name="WorkFlowPatentCitation" xmlns="uri:oozie:workflow:0.1">
    <start to="JavaMR-Job"/>
    <action name="JavaMR-Job">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.rjkrsinghhadoop.App</main-class>
            <arg>${citationIn}</arg>
            <arg>${citationOut}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>

Now configure your properties file PatentCitation.properties as follows
nameNode=hdfs://master:8020
jobTracker=master:8021
queueName=default
citationIn=citationIn-hdfs
citationOut=citationOut-hdfs
oozie.wf.application.path=${nameNode}/user/rks/oozieworkdir/SimpleOozieMR/workflow

Let's create a shell script which will run your first Oozie job:
#!/bin/sh
#
export OOZIE_URL="http://localhost:11000/oozie"
#copy your input data to the hdfs
hadoop fs -copyFromLocal /home/rks/CitationInput.txt citationIn-hdfs
#copy SimpleOozieMR to hdfs
hadoop fs -put /home/rks/SimpleOozieMR SimpleOozieMR
#running the oozie job
cd /usr/lib/oozie/bin/
oozie job -config /home/rks/SimpleOozieMR/PatentCitation.properties -run
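The same workflow can also be submitted from Java through the Oozie client API instead of the CLI. Below is a minimal sketch; the class name OozieJobSubmitter and the local path to the properties file are assumptions for illustration:

package com.rjkrsinghhadoop;

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieJobSubmitter {

    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // reuse the same job properties that the CLI submission uses
        Properties conf = client.createConfiguration();
        conf.load(new FileInputStream("/home/rks/SimpleOozieMR/PatentCitation.properties"));

        // submit and start the workflow, then poll until it leaves the RUNNING state
        String jobId = client.run(conf);
        System.out.println("Submitted workflow " + jobId);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("Workflow finished with status " + client.getJobInfo(jobId).getStatus());
    }
}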

Apache Oozie : Getting Started

Apache Oozie Introduction:

--- Started by Yahoo!, currently managed as an Apache open source project.
--- Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
-- MapReduce
-- Pig, Hive
-- Streaming
-- Standard Applications
--- Oozie is a scalable, reliable and extensible system.

--- User specifies the action flow as a Directed Acyclic Graph (DAG)
--- DAG: a collection of vertices and directed edges configured so that one may not traverse the same vertex twice
--- Each node signifies either a job or a script; execution and branching can be parameterized by time, decision, data availability, file size etc.
--- Client specifies the process flow in a workflow XML
--- Oozie is an extra level of abstraction between the user and Hadoop
--- Oozie has its own server application which talks to its own database (Apache Derby (default), MySQL, Oracle, etc.)
--- User must load the required components into HDFS prior to execution: input data, workflow XML, JARs, resource files.





Interaction with Oozie through command line
$oozie job --oozie http://localhost:11000/oozie -config /user/rks/spjob/job.properties -run 


Web Interface





Installation
-Download Oozie from the Apache Oozie official site
-Download ExtJS
-Configure core-site.xml
-Restart the namenode
-Copy Hadoop jars into a directory
-Extract ExtJS into Oozie's webapp
-Run oozie-setup.sh
-Relocate the newly generated war file
-Configure oozie-site.xml
-Initialize the database
-Start the Oozie server

That's it; in the next course of action we will run a MapReduce job configured with Oozie. Stay tuned.






Saturday, November 16, 2013

WebHDFS REST API -- Overview


Hadoop supports all the HDFS operations through the underlying Java implementation of HDFS commands like ls, mkdir, cat, rm, merge etc.
Most of the time the need arises to access HDFS from external applications rather than from within the Hadoop cluster, and the external system can be written in a programming language other than Java.

To support access to HDFS from external applications, Hadoop provides the WebHDFS REST API, which is based on the common HTTP methods GET, PUT, POST and DELETE. User operations like OPEN, GETFILESTATUS and LISTSTATUS use HTTP GET; others like CREATE, MKDIRS, RENAME and SETPERMISSION rely on HTTP PUT; APPEND is based on HTTP POST, while DELETE uses HTTP DELETE on HDFS.
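To make this concrete, here is a minimal Java sketch that issues one of the HTTP GET operations (LISTSTATUS) against WebHDFS; the NameNode host master, the web port 50070 and the directory /user/rks are assumptions to adapt to your cluster:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListStatus {

    public static void main(String[] args) throws Exception {
        // WebHDFS URLs have the form http://<namenode-host>:<http-port>/webhdfs/v1/<path>?op=<OPERATION>
        URL url = new URL("http://master:50070/webhdfs/v1/user/rks?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // the response body is a JSON document describing the directory contents
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        conn.disconnect();
    }
}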


WebHDFS REST API OVERVIEW



In order to configure WebHDFS, update hdfs-site.xml as follows:
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

Now restart your namenode to access WebHDFS.

In the coming post I will write about a Java client to access the WebHDFS REST API... stay tuned.

Monday, November 11, 2013

Mapreduce : Writing output to multiple files using MultipleOutputFormat

Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-00000, part-00001, etc. There is sometimes a need to have more control over the naming of the files or to produce multiple files per reducer. MapReduce comes with two libraries to help you do this: MultipleOutputFormat and MultipleOutputs.

MultipleOutputFormat
MultipleOutputFormat allows you to write data to multiple files whose names are derived from the output keys and values. MultipleOutputFormat is an abstract class with two concrete subclasses, MultipleTextOutputFormat and MultipleSequenceFileOutputFormat, which are the multiple file equivalents of TextOutputFormat and SequenceFileOutputFormat. MultipleOutputFormat provides a few protected methods that subclasses can override to control the output filename. In Example 7-5, we create a subclass of MultipleTextOutputFormat to override the generateFileNameForKeyValue() method to return the station ID, which we extracted from the record value.

-- reference Hadoop Definitive guide
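As a rough sketch of what such a subclass looks like (not the book's exact Example 7-5; it assumes the station ID is the first tab-separated field of the record value and uses the old mapred API):

package com.rajkrrsingh.mr.hadoop;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class StationMultipleTextOutputFormat
        extends MultipleTextOutputFormat<NullWritable, Text> {

    @Override
    protected String generateFileNameForKeyValue(NullWritable key, Text value, String name) {
        // name the output file after the station ID instead of the default part-nnnnn
        return value.toString().split("\t")[0];
    }
}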

In this example I will demonstrate how to write output data to multiple files. You can find the code for this example at the following git location

We have sample customer data with the attributes customer no, customer name, region, and company. We will write customers from the same region to the same file, along with their other attributes.
customer no,customer name,region,company
.................................................................
9899821411,"Burke, Honorato U.",Alaska,Eu Incorporated
9899821422,"Bell, Emily R.",Arizona,Ut Eros Non Company
9899821379,"Hewitt, Chelsea Y.",PA,Egestas Aliquam Fringilla LLP
9899821387,"Baldwin, Merrill H.",VT,Rhoncus Proin Corp.
9899821392,"Bradshaw, Uma H.",OH,Nam Nulla Associates
9899821453,"Pollard, Boris G.",Hawaii,Consequat Corp.
9899821379,"Avila, Velma D.",OR,Sodales LLC

Create your Mapper Class as follows:
package com.rajkrrsingh.mr.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultipleOutputMapper extends Mapper<LongWritable, Text, Text, Text> {
 
 private Text txtKey = new Text("");
 private Text txtValue = new Text("");
 
 @Override
 protected void map(LongWritable key, Text value,Context context)
   throws IOException, InterruptedException {
  if(value.toString().length() > 0) {
   // split on commas outside double quotes so names like "Burke, Honorato U." stay intact
   String[] custArray = value.toString().split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
   // key on the region so that customers of the same region land in the same output file
   txtKey.set(custArray[2].toString());
   txtValue.set(custArray[0].toString()+"\t"+custArray[1].toString()+"\t"+custArray[3].toString());
   context.write(txtKey, txtValue);
  }
 }

}

Setup your reduce class:
package com.rajkrrsingh.mr.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiOutputReducer extends Reducer<Text, Text, Text, Text>{
 
 private MultipleOutputs multipleOutputs;
 
 @Override
 protected void setup(Context context) throws IOException, InterruptedException {
  multipleOutputs  = new MultipleOutputs(context);
 }
 
 @Override
 protected void reduce(Text key, Iterable<Text> values,Context context)
   throws IOException, InterruptedException {
  for(Text value : values) {
   multipleOutputs.write(key, value, key.toString());
  }
 }
 
 @Override
 protected void cleanup(Context context)
   throws IOException, InterruptedException {
  multipleOutputs.close();
 }

}

setup your driver class as follows:
package com.rajkrrsingh.mr.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class App extends Configured implements Tool
{
    public static void main( String[] args ) throws Exception
    {
       int exitCode = ToolRunner.run(new Configuration(),new App(),args);
       System.exit(exitCode);
    }

 @Override
 public int run(String[] args) throws Exception {
  if(args.length != 2) {
   System.out.println("Two params are required to execute App <input-path> <output-path>");
   return -1;
  }
  
  Job job = new Job(getConf());
  job.setJobName("MultipleOutputFormat example");
  job.setJarByClass(App.class);
  LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  
  job.setMapperClass(MultipleOutputMapper.class);
  job.setMapOutputKeyClass(Text.class);
  job.setReducerClass(MultiOutputReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  job.setNumReduceTasks(5);
  
  boolean success = job.waitForCompletion(true);
  return success ? 0 : 1;
 }
}

Create the job jar using the Maven assembly plugin and run it on your Hadoop cluster:
$bin/hadoop jar /home/rks/MultipleOutputExample/target/MultipleOutputExample.jar com.rajkrrsingh.mr.hadoop.App 
/user/rks/input /user/rks/output

Monday, October 28, 2013

Hadoop : Joining two datasets using Map Side Join

It's inevitable that you'll come across data analyses where you need to pull in data from
different sources. For example, given our customer order data sets, you may want to find out
which customers placed certain orders, along with their detailed information such as address. You'll have to look at customer
data (cust_data.txt) as well as customer order data (cust_order.txt). In the database world it would just be a matter of joining two tables, and most databases automatically take care of the join processing for you. Unfortunately, joining data in Hadoop is more involved, and there are several possible approaches with different trade-offs.
You can download the complete code from my github.
In this example I will demonstrate how to do a map-side join using the distributed cache.

Input Data : order_custid.txt
..............................................................................

781571544 S9876126
781571545 S9876127
781571546 S9876128
781571547 S9876129
781571548 S9876130
781571549 S9876131
781571550 S9876132
781571551 S9876133
781571552 S9876134
our customer dataset
Customer Data : cust_data.txt
.............................................................................

781571544 Smith,John      (248)-555-9430  jsmith@aol.com
781571545 Hunter,April    (815)-555-3029  april@showers.org
781571546 Ching,Iris      (305)-555-0919  iching@zen.org
781571547 Doe,John        (212)-555-0912  jdoe@morgue.com
781571548 Jones,Tom       (312)-555-3321  tj2342@aol.com
781571549 Smith,John      (607)-555-0023  smith@pocahontas.com
781571550 Crosby,Dave     (405)-555-1516  cros@csny.org
781571551 Johns,Pam       (313)-555-6790  pj@sleepy.com
781571552 Jetter,Linda    (810)-555-8761  netless@earthlink.net

We will create our mapper class; in its setup method we will parse through the order_custid file available in the distributed cache and keep custid and order no in a HashMap.

package com.rajkrrsingh.hadoop.mapjoin;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
 
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
 
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
 
        private static HashMap<String, String> CustIdOrderMap = new HashMap<String, String>();
        private BufferedReader brReader;
        private String orderNO = "";
        private Text outKey = new Text("");
        private Text outValue = new Text("");
 
        enum MYCOUNTER {
                RECORD_COUNT, FILE_EXISTS, FILE_NOT_FOUND, SOME_OTHER_ERROR
        }
 
        @Override
        protected void setup(Context context) throws IOException,
                        InterruptedException {
 
                Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context
                                .getConfiguration());
 
                for (Path eachPath : cacheFilesLocal) {
                        if (eachPath.getName().toString().trim().equals("order_custid.txt")) {
                                context.getCounter(MYCOUNTER.FILE_EXISTS).increment(1);
                                setupOrderHashMap(eachPath, context);
                        }
                }
 
        }
 
        private void setupOrderHashMap(Path filePath, Context context)
                        throws IOException {
 
                String strLineRead = "";
 
                try {
                        brReader = new BufferedReader(new FileReader(filePath.toString()));
 
                        while ((strLineRead = brReader.readLine()) != null) {
                                String custIdOrderArr[] = strLineRead.toString().split("\\s+");
                                CustIdOrderMap.put(custIdOrderArr[0].trim(),        custIdOrderArr[1].trim());
                        }
                } catch (FileNotFoundException e) {
                        e.printStackTrace();
                        context.getCounter(MYCOUNTER.FILE_NOT_FOUND).increment(1);
                } catch (IOException e) {
                        context.getCounter(MYCOUNTER.SOME_OTHER_ERROR).increment(1);
                        e.printStackTrace();
                }finally {
                        if (brReader != null) {
                                brReader.close();
 
                        }
 
                }
        }
 
        @Override
        public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
 
                context.getCounter(MYCOUNTER.RECORD_COUNT).increment(1);
 
                if (value.toString().length() > 0) {
                        String custDataArr[] = value.toString().split("\\s+");
 
                        try {
                                orderNO = CustIdOrderMap.get(custDataArr[0].toString());
                        } finally {
                                // get() returns null when there is no order for this customer
                                orderNO = ((orderNO == null || orderNO
                                                .equals("")) ? "NOT-FOUND" : orderNO);
                        }
 
                        outKey.set(custDataArr[0].toString());
 
                        outValue.set(custDataArr[1].toString() + "\t"
                                        + custDataArr[2].toString() + "\t"
                                        + custDataArr[3].toString() + "\t" + orderNO);
 
                }
                context.write(outKey, outValue);
                orderNO = "";
        }
}

Setup Driver class as follows:
package com.rajkrrsingh.hadoop.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class App extends Configured implements Tool
{
    public static void main( String[] args ) throws Exception
    {
            int exitCode = ToolRunner.run(new Configuration(),new App(), args);
            System.exit(exitCode);
        }
    @Override
    public int run(String[] args) throws Exception {
            if(args.length !=2 ){
                    System.err.println("Usage : App -files <location-to-cust-id-and-order-file> <input path> <output path>");
                    System.exit(-1);
            }
            Job job = new Job(getConf());
            job.setJobName("Map Side Join");
            job.setJarByClass(App.class);
            FileInputFormat.addInputPath(job,new Path(args[0]) );
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MapSideJoinMapper.class);
            job.setNumReduceTasks(0);
            
            boolean success = job.waitForCompletion(true);
                return success ? 0 : 1;
            
    }
}
Here we have set 0 reducers so that the output of the map can be written directly to the output file.
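The driver above expects the lookup file to be shipped through the -files generic option used in the command below. If you prefer to register it programmatically, a minimal variant of the driver could add the file to the distributed cache itself; the class name MapSideJoinDriver and the HDFS path of order_custid.txt are assumptions:

package com.rajkrrsingh.hadoop.mapjoin;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapSideJoinDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapSideJoinDriver(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage : MapSideJoinDriver <input path> <output path>");
            return -1;
        }
        Job job = new Job(getConf());
        job.setJobName("Map Side Join");
        job.setJarByClass(MapSideJoinDriver.class);

        // register the lookup file with the distributed cache; it must already exist in HDFS
        DistributedCache.addCacheFile(new URI("/user/rks/cache/order_custid.txt"),
                job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MapSideJoinMapper.class);
        job.setNumReduceTasks(0);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}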
Now create a jar and run the following command on your Hadoop cluster
$bin/hadoop jar /home/rks/example/HadoopMapSideJoin/target/HadoopMapSideJoin.jar com.rajkrrsingh.hadoop.mapjoin -files /home/rks/Downloads/order_custid.txt input output

That's what we need, and here is the result.
output::
.................................................................................
781571544 Smith,John (248)-555-9430 jsmith@aol.com S9876126
781571545 Hunter,April (815)-555-3029 april@showers.org S9876127
781571546 Ching,Iris (305)-555-0919 iching@zen.org S9876128
781571547 Doe,John (212)-555-0912 jdoe@morgue.com S9876129
781571548 Jones,Tom (312)-555-3321 tj2342@aol.com S9876130
781571549 Smith,John (607)-555-0023 smith@pocahontas.com S9876131
781571550 Crosby,Dave (405)-555-1516 cros@csny.org S9876132
781571551 Johns,Pam (313)-555-6790 pj@sleepy.com S9876133
781571552 Jetter,Linda (810)-555-8761 netless@earthlink.net S9876134

Monday, October 21, 2013

Hadoop : Merging Small tar files to the Sequence File

The Hadoop Distributed File System (HDFS) is a distributed file system. It is mainly designed for batch processing of large volume of data. The default block size of HDFS is 64MB. When data is represented in files significantly smaller than the default block size the performance degrades dramatically. Mainly there are two reasons for producing small files. One reason is some files are pieces of a larger logical file (e.g. - log files). Since HDFS has only recently supported appends, these unbounded files are saved by writing them in chunks into HDFS. Other reason is some files cannot be combined together into one larger file and are essentially small. e.g. - A large corpus of images where each image is a distinct file.

Solution to the small files by merging them into a Sequence File:

A sequence file is a Hadoop-specific archive file format similar to tar and zip. The concept behind it is to merge the file set using key and value pairs, and the files created this way are known as 'Hadoop Sequence Files'. In this method the file name is used as the key and the file content is used as the value.

In the proposed solution we will demonstrate how to write small files to a sequence file, along with a sequence file reader which will list the file names contained in the sequence file:

Setting up a local file system:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LocalSetup {

    private FileSystem fileSystem;
    private Configuration config;

    
    public LocalSetup() throws Exception {
        config = new Configuration();

        
        config.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");

        fileSystem = FileSystem.get(config);
        if (fileSystem.getConf() == null) {
                throw new Exception("LocalFileSystem configuration is null");
        }
    }

    
    public Configuration getConf() {
        return config;
    }

    
    public FileSystem getLocalFileSystem() {
        return fileSystem;
    }
}

In the next course of action we will set up a class which will read files with the .tar.gz, .tgz, or .tar.bz2 extensions and write them to the sequence file, with the key being the name of the file and the value being the content of the file:
import org.apache.tools.bzip2.CBZip2InputStream;
import org.apache.tools.tar.TarEntry;
import org.apache.tools.tar.TarInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;



public class TarToSeqFile {

    private File inputFile;
    private File outputFile;
    private LocalSetup setup;

    
    public TarToSeqFile() throws Exception {
        setup = new LocalSetup();
    }

    
    public void setInput(File inputFile) {
        this.inputFile = inputFile;
    }

    public void setOutput(File outputFile) {
        this.outputFile = outputFile;
    }

    public void execute() throws Exception {
        TarInputStream input = null;
        SequenceFile.Writer output = null;
        try {
            input = openInputFile();
            output = openOutputFile();
            TarEntry entry;
            while ((entry = input.getNextEntry()) != null) {
                if (entry.isDirectory()) { continue; }
                String filename = entry.getName();
                byte[] data = TarToSeqFile.getBytes(input, entry.getSize());
                
                Text key = new Text(filename);
                BytesWritable value = new BytesWritable(data);
                output.append(key, value);
            }
        } finally {
            if (input != null) { input.close(); }
            if (output != null) { output.close(); }
        }
    }

    private TarInputStream openInputFile() throws Exception {
        InputStream fileStream = new FileInputStream(inputFile);
        String name = inputFile.getName();
        InputStream theStream = null;
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            theStream = new GZIPInputStream(fileStream);
        } else if (name.endsWith(".tar.bz2") || name.endsWith(".tbz2")) {
            fileStream.skip(2);
            theStream = new CBZip2InputStream(fileStream);
        } else {
            theStream = fileStream;
        }
        return new TarInputStream(theStream);
    }

    private SequenceFile.Writer openOutputFile() throws Exception {
        Path outputPath = new Path(outputFile.getAbsolutePath());
        return SequenceFile.createWriter(setup.getLocalFileSystem(), setup.getConf(),
                                         outputPath,
                                         Text.class, BytesWritable.class,
                                         SequenceFile.CompressionType.BLOCK);
    }

    
    private static byte[] getBytes(TarInputStream input, long size) throws Exception {
        if (size > Integer.MAX_VALUE) {
            throw new Exception("A file in the tar archive is too large.");
        }
        int length = (int)size;
        byte[] bytes = new byte[length];

        int offset = 0;
        int numRead = 0;

        while (offset < bytes.length &&
               (numRead = input.read(bytes, offset, bytes.length - offset)) >= 0) {
            offset += numRead;
        }

        if (offset < bytes.length) {
            throw new IOException("A file in the tar archive could not be completely read.");
        }

        return bytes;
    }

    
    public static void main(String[] args) {
        if (args.length != 2) {
            exitWithHelp();
        }

        try {
            TarToSeqFile me = new TarToSeqFile();
            me.setInput(new File(args[0]));
            me.setOutput(new File(args[1]));
            me.execute();
        } catch (Exception e) {
            e.printStackTrace();
            exitWithHelp();
        }
    }

    public static void exitWithHelp() {
        System.err.println("Usage:  <tarfile> TarToSeqFile  <output>\n\n" +
                           "<tarfile> may be GZIP or BZIP2 compressed, must have a\n" +
                           "recognizable extension .tar, .tar.gz, .tgz, .tar.bz2, or .tbz2.");
        System.exit(1);
    }
}

In this way we can write files to a single sequence file. To test it further, we will read from the sequence file and list its keys as output:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;


public class SeqKeyList {

    private String inputFile;
    private LocalSetup setup;

    public SeqKeyList() throws Exception {
        setup = new LocalSetup();
    }

    public void setInput(String filename) {
        inputFile = filename;
    }

    
    public void execute() throws Exception {
        Path path = new Path(inputFile);
        SequenceFile.Reader reader = 
            new SequenceFile.Reader(setup.getLocalFileSystem(), path, setup.getConf());

        try {
            System.err.println("Key type is " + reader.getKeyClassName());
            System.err.println("Value type is " + reader.getValueClassName());
            if (reader.isCompressed()) {
                System.err.println("Values are compressed.");
            }
            if (reader.isBlockCompressed()) {
                System.err.println("Records are block-compressed.");
            }
            System.err.println("Compression type is " + reader.getCompressionCodec().getClass().getName());
            System.err.println("");

            Writable key = (Writable)(reader.getKeyClass().newInstance());
            while (reader.next(key)) {
                System.out.println(key.toString());
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            exitWithHelp();
        }

        try {
            SeqKeyList me = new SeqKeyList();
            me.setInput(args[0]);
            me.execute();
        } catch (Exception e) {
            e.printStackTrace();
            exitWithHelp();
        }
    }

    
    public static void exitWithHelp() {
        System.err.println("Usage: SeqKeyList   <sequence-file>\n" +
                           "Prints a list of keys in the sequence file, one per line.");
        System.exit(1);
    }
}


Hadoop : How to read and write a Map file

MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.

Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of MapFile.Writer, then call the append() method to add entries in order. (Attempting to add entries out of order will result in an IOException.) Keys must be instances of WritableComparable, and values must be Writable—contrast this to SequenceFile, which can use any serialization framework for its entries.

Iterating through the entries in order in a MapFile is similar to the procedure for a SequenceFile: you create a MapFile.Reader, then call the next() method until it returns false, signifying that no entry was read because the end of the file was reached
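As a small illustration of that iteration pattern, the sketch below walks a map file in key order with MapFile.Reader.next(); it assumes Text keys and Text values, as produced by the converter further down:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileIterator {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] is the map file directory, which holds the data and index files
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        try {
            Text key = new Text();
            Text value = new Text();
            // next() fills in key and value and returns false at the end of the file
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}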

our data sample:
#custId orderNo
965412 S986512
965413 S986513
965414 S986514
965415 S986515
965416 S986516

Now configure your dependencies in pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rjkrsinghhadoop</groupId>
  <artifactId>MapFileConverter</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>MapFileConverter</name>
  <url>http://maven.apache.org</url>
  <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.7</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.0.4</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging-api</artifactId>
            <version>1.0.4</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.0.4</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.2</version>
        </dependency>
    </dependencies>

<!--
    <repositories>
        <repository>
            <id>libdir</id>
            <url>file://${basedir}/lib</url>
        </repository>
    </repositories>
-->

    <build>
        <finalName>exploringhadoop</finalName>
        <plugins>
                        <plugin>
                                <groupId>org.apache.maven.plugins</groupId>
                                <artifactId>maven-compiler-plugin</artifactId>
                                <configuration>
                                        <source>1.6</source>
                                        <target>1.6</target>
                                </configuration>
                        </plugin>
                        <plugin>
                                <artifactId>maven-assembly-plugin</artifactId>
                                <configuration>
                                        <finalName>${project.name}-${project.version}</finalName>
                                        <appendAssemblyId>true</appendAssemblyId>
                                        <descriptors>
                                                <descriptor>src/main/assembly/assembly.xml</descriptor>
                                        </descriptors>
                                </configuration>
                        </plugin>
        </plugins>
    </build>
</project>

Create MapFileConverter.java, which will be responsible for converting the text file to a map file.
package com.rjkrsinghhadoop;


import  java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FSDataInputStream;

public class MapFileConverter {

  @SuppressWarnings("deprecation")
        public static void main(String[] args) throws IOException{
                
                Configuration conf = new Configuration();
                FileSystem fs;
                
                try {
                        fs = FileSystem.get(conf);
                        
                        Path inputFile = new Path(args[0]);
                        Path outputFile = new Path(args[1]);

                      Text txtKey = new Text();
                          Text txtValue = new Text();

                        String strLineInInputFile = "";
                        String lstKeyValuePair[] = null;
                        MapFile.Writer writer = null;
                        
                        FSDataInputStream inputStream = fs.open(inputFile);

                        try {
                                writer = new MapFile.Writer(conf, fs, outputFile.toString(),
                                                txtKey.getClass(), txtValue.getClass());
                                writer.setIndexInterval(1);
                                while (inputStream.available() > 0) {
                                        strLineInInputFile = inputStream.readLine();
                                        lstKeyValuePair = strLineInInputFile.split("\\t");
                                        txtKey.set(lstKeyValuePair[0]);
                                        txtValue.set(lstKeyValuePair[1]);
                                        writer.append(txtKey, txtValue);
                                }
                        } finally {
                                IOUtils.closeStream(writer);
                                System.out.println("Map file created successfully!!");
                  }
        } catch (IOException e) {
                        e.printStackTrace();
                }        
        }
}

To look up a value based on the provided key we will use MapFileReader.

package com.rjkrsinghhadoop;


import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;

public class MapFileReader {

  
        @SuppressWarnings("deprecation")
        public static void main(String[] args) throws IOException {

                Configuration conf = new Configuration();
                FileSystem fs = null;
                Text txtKey = new Text(args[1]);
                Text txtValue = new Text();
                MapFile.Reader reader = null;

                try {
                        fs = FileSystem.get(conf);

                        try {
                                reader = new MapFile.Reader(fs, args[0].toString(), conf);
                                reader.get(txtKey, txtValue);
                        } catch (IOException e) {
                                e.printStackTrace();
                        }

                } catch (IOException e) {
                        e.printStackTrace();
                }
                System.out.println("The value for Key "+ txtKey.toString() +" is "+ txtValue.toString());
        }
}

To ship your code in a jar file we will need an assembly descriptor; create an assembly.xml in the resources folder as follows:
<assembly
    xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
    <id>job</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
            <excludes>
                <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
            </excludes>
        </dependencySet>
        <dependencySet>
            <unpack>false</unpack>
            <scope>system</scope>
            <outputDirectory>lib</outputDirectory>
            <excludes>
                <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
            </excludes>
        </dependencySet>
    </dependencySets>
    <fileSets>
        <fileSet>
            <directory>${basedir}/target/classes</directory>
            <outputDirectory>/</outputDirectory>
            <excludes>
                <exclude>*.jar</exclude>
            </excludes>
        </fileSet>
    </fileSets>
</assembly>

Now run mvn assembly:assembly, which will create a jar file in the target directory, ready to be run on your Hadoop cluster.

Hadoop : How to read and write Sequence File using mapreduce


A sequence file is a Hadoop-specific archive file format similar to tar and zip. The concept behind it is to merge the file set using key and value pairs, and the files created this way are known as 'Hadoop Sequence Files'. In this method the file name is used as the key and the file content is used as the value.

A sequence file consists of a header followed by one or more records. The first three bytes of a sequence file are the bytes SEQ, which act as a magic number, followed by a single byte representing the version number. The header contains other fields including the names of the key and value classes, compression details, user-defined metadata, and the sync marker. Recall that the sync marker is used to allow a reader to synchronize to a record boundary from any position in the file. Each file has a randomly generated sync marker, whose value is stored in the header. Sync markers appear between records in the sequence file. They are designed to incur less than a 1% storage overhead, so they don’t necessarily appear between every pair of records (such is the case for short records).



The internal format of the records depends on whether compression is enabled, and if it is, whether it is record compression or block compression.

If no compression is enabled (the default), then each record is made up of the record length (in bytes), the key length, the key, and then the value. The length fields are written as four-byte integers adhering to the contract of the writeInt() method of java.io.DataOutput. Keys and values are serialized using the Serialization defined for the class being written to the sequence file.
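Before moving to the MapReduce-based approach, here is a minimal sketch of reading such a file directly with SequenceFile.Reader; it assumes Text keys and Text values and takes the file path as the first argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirectSequenceFileReader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
            Text key = new Text();
            Text value = new Text();
            // next() deserializes the next record into key and value, returning false at end of file
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}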

In this sample code I will demonstrate how to read and write a sequence file. The complete code is available on my Git repo

we will use the the following sample data:
#custId orderNo
965412 S986512
965413 S986513
965414 S986514
965415 S986515
965416 S986516

configure the hadoop related dependencies in the pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rjkrsinghhadoop</groupId>
  <artifactId>SequenceFileReaderWriter</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>SequenceFileReaderWriter</name>
  <url>http://maven.apache.org</url>
  <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.7</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.0.4</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging-api</artifactId>
            <version>1.0.4</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.0.4</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.2</version>
        </dependency>
    </dependencies>

<!--
    <repositories>
        <repository>
            <id>libdir</id>
            <url>file://${basedir}/lib</url>
        </repository>
    </repositories>
-->

    <build>
        <finalName>exploringhadoop</finalName>
        <plugins>
   <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
     <source>1.6</source>
     <target>1.6</target>
    </configuration>
   </plugin>
   <plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
     <finalName>${project.name}-${project.version}</finalName>
     <appendAssemblyId>true</appendAssemblyId>
     <descriptors>
      <descriptor>src/main/assembly/assembly.xml</descriptor>
     </descriptors>
    </configuration>
   </plugin>
        </plugins>
    </build>
</project>

Now create a mapper class as follows:

package com.rjkrsinghhadoop;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SequenceFileWriterMapper extends Mapper<Text,Text,Text,Text> {
        
        
        @Override
        protected void map(Text key, Text value,Context context)         throws IOException, InterruptedException {
                context.write(key, value);                
        }

}

Create a Java class SequenceFileWriterApp which will write a text file to a sequence file:

package com.rjkrsinghhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileWriterApp 
{
    public static void main( String[] args ) throws Exception
    {
            if(args.length !=2 ){
                    System.err.println("Usage : Sequence File Writer Utility <input path> <output path>");
                    System.exit(-1);
            }
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJarByClass(SequenceFileWriterApp.class);
            job.setJobName("SequenceFileWriter");
            
            FileInputFormat.addInputPath(job,new Path(args[0]) );
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            
            job.setMapperClass(SequenceFileWriterMapper.class);
            
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);
            
            
            System.exit(job.waitForCompletion(true) ? 0:1);
    }
}

To read a sequence file and convert it back to a text file we need a SequenceFileReader:
package com.rjkrsinghhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SequenceFileReader  {


  public static void main(String[] args) throws Exception {
          if(args.length !=2 ){
                System.err.println("Usage : Sequence File Writer Utility <input path> <output path>");
                System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(SequenceFileReader.class);
        job.setJobName("SequenceFileReader");
        
        FileInputFormat.addInputPath(job,new Path(args[0]) );
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        job.setMapperClass(SequenceFileWriterMapper.class);
        
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        
        
        System.exit(job.waitForCompletion(true) ? 0:1);
}
}

To ship your code in a jar file we will need an assembly descriptor; create an assembly.xml in the resources folder as follows:
<assembly
    xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
    <id>job</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
            <excludes>
                <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
            </excludes>
        </dependencySet>
        <dependencySet>
            <unpack>false</unpack>
            <scope>system</scope>
            <outputDirectory>lib</outputDirectory>
            <excludes>
                <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
            </excludes>
        </dependencySet>
    </dependencySets>
    <fileSets>
        <fileSet>
            <directory>${basedir}/target/classes</directory>
            <outputDirectory>/</outputDirectory>
            <excludes>
                <exclude>*.jar</exclude>
            </excludes>
        </fileSet>
    </fileSets>
</assembly>

Now run mvn assembly:assembly, which will create a jar file in the target directory, ready to be run on your Hadoop cluster.

Monday, September 30, 2013

How to Convert TSV to JSON using command line

Often the need arises to convert a provided dataset to the JSON format.
Here I am demonstrating a simple and useful approach to achieve that.

We have sample input data as follows:

ord_status.tsv
5 2 3
111 109 2
21 12 9

It's basically data from some e-commerce application which states the order qty, ship qty and back-order qty of a particular item. Now it's time to convert it to a JSON document.

$: export FIELDS=ord_qty,shp_qty,backord_qty
$: cat ord_status.tsv| ruby -rjson -ne 'puts ENV["FIELDS"].split(",").zip($_.strip.split("\t")).inject({}){|h,x| h[x[0]]=x[1];h}.to_json'

here is the outcome
{"ord_qty":"5","shp_qty":"2","backord_qty":"3"}
{"ord_qty":"111","shp_qty":"109","backord_qty":"2"}
{"ord_qty":"21","shp_qty":"12","backord_qty":"9"}

Thursday, September 19, 2013

Spring Roo with MongoDB Persistence

Spring Roo is an open source software tool that uses convention-over-configuration principles to provide rapid application development of Java-based enterprise software. The resulting applications use common Java technologies such as Spring Framework, Java Persistence API, Java Server Pages, Apache Maven and AspectJ. Spring Roo is a member of the Spring portfolio of projects.

Roo focuses on higher productivity, stock-standard Java APIs, high usability, avoiding engineering trade-offs and facilitating easy Roo removal.

MongoDB is a leading NoSQL open-source document database. Written in C++, MongoDB supports document-oriented storage; its other major features include full index support, replication and HA, optimized querying, auto-sharding and GridFS support for storing large files.
Added features include an aggregation framework to aggregate query results on large sets of unstructured Big Data.

Spring Roo now supports MongoDB persistence. In this tutorial I will demonstrate building a test application using Spring Roo and MongoDB persistence.

Prerequisite:
1. Spring Roo installed and added to the Environment variable
2. Maven 2.2+
3. MongoDB 2.4 installed, with mongod running on port 27017

Let's open a Roo console by typing roo:
____  ____  ____
   / __ \/ __ \/ __ \
  / /_/ / / / / / / /
 / _, _/ /_/ / /_/ /
/_/ |_|\____/\____/    1.2.4.RELEASE [rev 75337cf]


Welcome to Spring Roo. For assistance press TAB or type "hint" then hit ENTER.
roo>

At the prompt, execute the following commands in order:

project --topLevelPackage com.rajkrrsingh.roomongoapp
mongo setup --databaseName personDB
entity mongo --class ~.model.Person --testAutomatically
field string --fieldName name --notNull
repository mongo --interface ~.repository.PersonRepository --entity ~.model.Person
web mvc setup
web mvc scaffold --class ~.web.PersonController
perform package
quit

Now, after quitting the Roo console, type the following command; it will deploy and run your application on Tomcat:

mvn tomcat:run

Open a web browser and access the link http://localhost:8080/roomongoapp/; your sample application is up and running. Now it's time to look into the Mongo database.
Access the mongo console and look for the personDB database:

use personDB
switched to db personDB
> show collections
person
system.indexes
> db.person.find().pretty()
{
        "_id" : "101",
        "_class" : "com.rajkrrsingh.roomongoapp.model.Person",
        "name" : "Rajkumar Singh"
}
{
        "_id" : "102",
        "_class" : "com.rajkrrsingh.roomongoapp.model.Person",
        "name" : "Sharad Singh"
}

The inserted records are there in the database.

Monday, September 2, 2013

Apache Pig : Writing a Java UDF for Pig

In this tutorial we will see how to write a user defined function for Pig. Suppose we have sample data in the form of a tab-separated document as follows: the first column depicts the name of the customer, the second column represents the location of the customer, and the third column gives the customer's credit rating on a scale of 10. We need to filter out the customers who scored a bad rating, e.g. less than 5.

Amit  Noida 5
Ajay Delhi 8
Abhi Lucknow 3
Dev Punjab 7
Deepak Bihar 2

Let's create a Maven Java project using the following command:

>mvn archetype:generate -DgroupId=com.rajkrrsingh.pig.udf -DartifactId=JavaUDF
 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
The above command will create a new Java project with the name JavaUDF. Open pom.xml in the project directory and add the following dependencies to it.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rajkrrsingh.pig.udf</groupId>
  <artifactId>JavaUDF</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>JavaUDF</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <!-- TODO: make sure Hadoop version is compatible -->
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
   <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
   <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
   <version>1.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/jar.xml</descriptor>
          </descriptors>
          <finalName>pig-examples</finalName>
          <outputDirectory>${project.build.directory}/../..</outputDirectory>
          <appendAssemblyId>false</appendAssemblyId>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Now from the command line execute
mvn eclipse:eclipse
Import the project into Eclipse using Import from existing project, create a Java package, and add the following class to it.
package com.rajkrrsingh.pig.udf;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsGoodCreditRating extends FilterFunc {

 @Override
 public Boolean exec(Tuple args) throws IOException {
  if (args == null || args.size() == 0) {
        return false;
      }
      try {
        Object object = args.get(0);
        if (object == null) {
          return false;
        }
        int i = (Integer) object;
        if(i>5){
         return true;
        }else{
         return false;
        }
      } catch (ExecException e) {
        throw new IOException(e);
     }
 }
 
 
}
Create the jar file using the assembly plugin and move it to your cluster. In the next step we will write a Pig script.
CreditScore.pig
REGISTER JavaUDF.jar;
records = LOAD 'sample.txt' AS (name:chararray, location:chararray, creditrating:int);
filter_records = FILTER records BY com.rajkrrsingh.pig.udf.IsGoodCreditRating(creditrating);
grouped_records = GROUP filter_records BY location;
DUMP grouped_records;

Run the script using pig CreditScore.pig and get the result.

Apache PIG : Installation and Running PIG on multi node cluster (Ubuntu)

Pig installation is very straightforward. If you want to configure Pig on a multi-node Hadoop cluster, there is no need to install any specific API or utility; Pig launches jobs and interacts with your Hadoop filesystem from your node itself.

Prerequisite : Java 6 (install and set JAVA_HOME properly)

Get the binaries to install Pig from the official Apache Pig website mentioned here.
Download the binaries; I am using Ubuntu so it's better to use wget:

#wget http://www.dsgnwrld.com/am/pig/pig-0.11.1/pig-0.11.1.tar.gz

After the download completes, extract the tarball as follows:
#tar xzf pig-0.11.1.tar.gz

Now it's time to add the Pig binaries to your command-line path:
#export PIG_HOME=/home/rajkrrsingh/pig-0.11.1
#export PATH=$PATH:$PIG_HOME/bin

By setting the environment variables you are able to run Pig in the local environment, but to run Pig on the cluster you still need to provide some information to the Pig runtime about your Hadoop installation so that it can pick up the cluster information from hdfs-site.xml, mapred-site.xml and core-site.xml.

By setting PIG_CLASSPATH you can provide the cluster information to Pig:
export PIG_CLASSPATH="/home/rajkrrsingh/hadoop-1.0.4/conf"

That's all that is needed to install Pig on your cluster. Now it's time to run Pig using the following command:
#pig -x mapreduce
or
#pig

grunt>

Wednesday, August 28, 2013

Apache Hadoop and Spring Data : Configuring mapreduce Job

Spring for Apache Hadoop simplifies developing Apache Hadoop by providing a unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem project such as Spring Integration and Spring Batch enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.

In this tutorial I am going to demonstrate how to configure a MapReduce job with Spring. The complete source code is available at the GitHub location.

I assume that your Hadoop Cluster is up and running:

Let's set up a simple Java project using Maven and add the following dependencies to the pom.xml.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.rajkrrsingh.hadoop</groupId>
  <artifactId>HadoopSpringData</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>HadoopSpringData</name>
  <url>http://maven.apache.org</url>

  <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <apache.hadoop.version>1.0.4</apache.hadoop.version>
        <slf4j.version>1.6.1</slf4j.version>
        <spring.version>3.1.2.RELEASE</spring.version>
        <spring.data.hadoop.version>1.0.0.RELEASE</spring.data.hadoop.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <!-- Apache Commons -->
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>

        <!-- Spring Framework -->
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-beans</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context-support</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>cglib</groupId>
            <artifactId>cglib</artifactId>
            <version>2.2.2</version>
        </dependency>
        <!-- Spring Data Apache Hadoop -->
        <dependency>
            <groupId>org.springframework.data</groupId>
            <artifactId>spring-data-hadoop</artifactId>
            <version>${spring.data.hadoop.version}</version>
        </dependency>
        <!-- Apache Hadoop Core -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>${apache.hadoop.version}</version>
        </dependency>
        <!-- Logging dependencies -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <!-- Testing Dependencies -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.9</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-core</artifactId>
            <version>1.8.5</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <finalName>SpringHadoopJob</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.2</version>
                <configuration>
                    <descriptors>
                        <descriptor>src/main/assembly/assembly.xml</descriptor>
                    </descriptors>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.3.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>com.rajkrrsingh.hadoop.MaxTemperature</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-site-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <reportPlugins>
                        <!-- Cobertura Plugin -->
                        <plugin>
                            <groupId>org.codehaus.mojo</groupId>
                            <artifactId>cobertura-maven-plugin</artifactId>
                            <version>2.5.1</version>
                        </plugin>
                    </reportPlugins>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Let us take the Max Temperature MapReduce example from Hadoop: The Definitive Guide and set up the Mapper, Reducer and main class.
Here is our code for the Mapper:
package com.rajkrrsingh.hadoop;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Set up the Reducer as follows:

package com.rajkrrsingh.hadoop;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {
  
  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {
    
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}


In our main class we execute the job through the application context, so we need to set up an ApplicationContext. Before that, we provide the application-specific properties file as follows.
application.properties
fs.default.name=hdfs://master:54310
mapred.job.tracker=master:54311

input.path=/user/rajkrrsingh/SAMPLE
output.path=/user/rajkrrsingh/MaxTempOut

In the application.properties file we provide the namenode location, the job tracker location, and the input and output paths for the data. The input data can be downloaded from Here.
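Before launching the job, make sure the input path from application.properties exists in HDFS and the output path does not. A rough sketch, assuming the downloaded sample data is a local file named sample.txt:

$hadoop fs -mkdir /user/rajkrrsingh/SAMPLE
$hadoop fs -copyFromLocal sample.txt /user/rajkrrsingh/SAMPLE/
$hadoop fs -rmr /user/rajkrrsingh/MaxTempOut    # remove any previous output; ignore the error if it does not exist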

Now we need to set up our ApplicationContext. In the configuration below, the Hadoop configuration is supplied through the hdp:configuration element, and the job is set up under the job element, where we provide the input and output paths, the job driver class, and our Mapper and Reducer.

The last piece is the job runner, which invokes the configured job; if you have multiple MapReduce jobs you can configure them all in the same applicationContext.xml.

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.1.xsd">

    <!-- Configures the properties file. -->
    <context:property-placeholder location="classpath:application.properties" />

    <!-- Configures Apache Hadoop -->
    <hdp:configuration>
        fs.default.name=${fs.default.name}
        mapred.job.tracker=${mapred.job.tracker}
    </hdp:configuration>

    <!-- Configures the Max Temp job. -->
    <hdp:job id="MaxTempJob"
             input-path="${input.path}"
             output-path="${output.path}"
             jar-by-class="com.rajkrrsingh.hadoop.MaxTemperature"
             mapper="com.rajkrrsingh.hadoop.MaxTemperatureMapper"
             reducer="com.rajkrrsingh.hadoop.MaxTemperatureReducer"/>

    <!-- Configures the job runner that runs the Hadoop jobs. -->
    <hdp:job-runner id="MaxTempJobRunner" job-ref="MaxTempJob" run-at-startup="true"/>
</beans>


Let's load the application context in our job driver class, MaxTemperature.java, as follows.
package com.rajkrrsingh.hadoop;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
   ApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
}
}

Now it's time to package our jar and run it on the cluster. For that I have provided the assembly.xml under the resources folder; run mvn assembly:assembly, which will create a zip file in which you will find SpringDataHadoop.jar.

Now, on the namenode, run your jar as follows:

$hadoop jar SpringDataHadoop.jar

After successful completion, check the output directory.
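For example (part-r-00000 is the usual reducer output file name and may differ in your run):

$hadoop fs -ls /user/rajkrrsingh/MaxTempOut
$hadoop fs -cat /user/rajkrrsingh/MaxTempOut/part-r-00000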

Monday, August 26, 2013

Import TSV file from HDFS to HBase using Identity Mapper


In this tutorial I am going to demonstrate how to import a tab-separated file stored on HDFS into an HBase database.

Let's start by creating a table in HBase.

Step 1 : Create a table in HBase named orders, with the column families 'ship_to_address', 'ord_date', 'ship_date', 'item', 'status' and 'price':
create 'orders','ship_to_address','ord_date','ship_date','item','status','price'

Here is our input TSV file stored on HDFS (the columns are tab-separated):

USA NY New York 28-07-2013 29-07-2013 Toner shipped 200$
USA California San Fransico 29-07-2013 29-07-2013 Cati in process 150$
USA NY Rochester 28-07-2013 28-07-2013 Toner shipped 200$
USA NY Syracuse 21-07-2013 23-07-2013 Paper shipped 80$
USA NY Albany 21-07-2013 21-07-2013 Paper failed 80$
USA California Long Beach 26-07-2013 28-07-2013 Toner shipped 200$

Step 2 : Write your identity-style Mapper class as follows:
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

public class ImportFromTSVMapper extends
        Mapper<LongWritable, Text, ImmutableBytesWritable, Writable> {

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException {
        try {
            // the input is tab separated; the first column is assumed to be the row key,
            // followed by country, state, city, order date, ship date, item, status and price
            String[] arr = line.toString().split("\t");
            Put put = new Put(arr[0].getBytes());
            put.add("ship_to_address".getBytes(), "country".getBytes(), Bytes.toBytes(arr[1]));
            put.add("ship_to_address".getBytes(), "state".getBytes(), Bytes.toBytes(arr[2]));
            put.add("ship_to_address".getBytes(), "city".getBytes(), Bytes.toBytes(arr[3]));
            put.add("ord_date".getBytes(), "ord_date".getBytes(), Bytes.toBytes(arr[4]));
            put.add("ship_date".getBytes(), "ship_date".getBytes(), Bytes.toBytes(arr[5]));
            put.add("item".getBytes(), "item".getBytes(), Bytes.toBytes(arr[6]));
            put.add("status".getBytes(), "status".getBytes(), Bytes.toBytes(arr[7]));
            put.add("price".getBytes(), "price".getBytes(), Bytes.toBytes(arr[8]));
            context.write(new ImmutableBytesWritable(arr[0].getBytes()), put);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Step 3 : Write your main class to configure the MapReduce job:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ImportTSVFile {
    
    
   public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String table = "orders";  // matches the table created in step 1
        String input = "/home/rajkrrsingh/mspfeed/ordfeed";
        String column = "";

        conf.set("conf.column", column);
        Job job = new Job(conf, "Import from hdfs to hbase");
        job.setJarByClass(ImportTSVFile.class);
        job.setMapperClass(ImportFromTSVMapper.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, table);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Writable.class);
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(input));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}


The result can be verified from the HBase shell; your TSV file has now been imported into the HBase table.
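A rough end-to-end run sketch (the jar name ImportTSVFile.jar is an assumption, and the HBase jars are put on the client classpath via hbase classpath):

$hadoop fs -mkdir /home/rajkrrsingh/mspfeed
$hadoop fs -copyFromLocal ordfeed /home/rajkrrsingh/mspfeed/ordfeed
$export HADOOP_CLASSPATH=$(hbase classpath)
$hadoop jar ImportTSVFile.jar ImportTSVFile
$echo "scan 'orders'" | hbase shell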


Monday, June 3, 2013

Installation of MapR on Ubuntu through Ubuntu Partner Archive.


In continuation of my previous post about the availability of Hadoop for Ubuntu through the Ubuntu Partner Archive, you can now install Hadoop using the apt-get command.

pre-requisites:
CPU : 64-bit
OS : Red Hat, CentOS, SUSE, or Ubuntu
Memory : 4 GB minimum, more in production
Disk : Raw, unformatted drives and partitions
DNS : Hostname, reaches all other nodes
Users : Common users across all nodes; Keyless ssh
Java : Must run Java
Other : NTP, Syslog, PAM

Steps to install Hadoop:
Edit the sources.list file and add the MapR repositories as follows:

deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional
deb http://package.mapr.com/releases/ecosystem/ubuntu binary/


Update your package index using the following command:
sudo apt-get update

Now invoke the following command to install Hadoop:
sudo apt-get install mapr-single-node

That's it, start hadooping.
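A couple of quick checks that the repository and packages are visible on the node (these only query apt and dpkg):

$apt-cache search mapr
$dpkg -l | grep mapr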

Ubuntu and Hadoop: the perfect match | Canonical

Sunday, June 2, 2013

Installation of HBase in a fully distributed environment

In this post we will see how to install HBase in a fully distributed environment. Before that, let's look at all the components involved in a fully distributed HBase configuration.

HDFS: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is highly fault-tolerant, offers high throughput, and is suitable for applications with large data sets and streaming access to file system data.

HBase Master: HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the namenode.

Region Servers:HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.

Zookeeper: A distributed Apache HBase (TM) installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you. It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.

In the coming example we have Ubuntu machine images configured in VMPlayer, all up and running Hadoop. If you are facing trouble configuring the Hadoop cluster you can follow the post http://www.rajkrrsingh.blogspot.in/2013/06/install-and-configure-2-node-hadoop.html.

Consider a scenario in which we have one master and two slave nodes. Now, on the master, edit the /etc/hosts file as follows:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.129  regionserver1.hdcluster.com  regionserver1
192.168.92.130  regionserver2.hdcluster.com  regionserver2

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Edit the /etc/hosts file on the slave with IP address 192.168.92.129:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.129  regionserver1.hdcluster.com  regionserver1

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Edit the /etc/hosts file on the slave with IP address 192.168.92.130:
127.0.0.1 localhost
192.168.92.128  master.hdcluster.com  master
192.168.92.130  regionserver2.hdcluster.com  regionserver2

#127.0.1.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Download the HBase binaries on the master machine and extract them to the home folder.
Edit the conf/hbase-env.sh file as follows:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export HBASE_MANAGES_ZK=true


Now edit conf/hbase-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2010 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
 <property>
  <name>hbase.master</name>
  <value>192.168.92.128:60000</value>
 </property>

 <property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/user/hbase</value>
 </property>

 <property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
 </property>

 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>master,regionserver1,regionserver2</value>
 </property>

 <property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/rajkrrsingh/zookeeperdatadir</value>
 </property>

 <property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2222</value>
 </property>
</configuration>

Copy the HBase folder to the regionservers; that completes our cluster configuration. You can now start the cluster using the start-hbase.sh command.
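A rough sketch of starting from the master and checking the daemons on each node (assuming HBASE_MANAGES_ZK=true as configured above, so ZooKeeper peers run on every quorum member):

$bin/start-hbase.sh
$jps
# on the master you should see HMaster and HQuorumPeer, alongside the Hadoop daemons
$ssh regionserver1 jps
# on each regionserver you should see HRegionServer and HQuorumPeer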

Installing Apache HBase on Ubuntu in standalone mode

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.

In the coming steps we will set up HBase on Ubuntu in standalone mode.

Step 1: Download the HBase binaries from any of the available mirrors listed at http://www.apache.org/dyn/closer.cgi/hbase/

Step 2: Extract the contents to a local directory, preferably your home directory.


Step 3: In conf/hbase-env.sh, set JAVA_HOME to the location where Java is installed and set HBASE_HEAPSIZE to 1000 MB.


Step 4: Edit conf/hbase-site.xml (at a minimum, set hbase.rootdir so HBase does not keep its data under /tmp).


Step 5: Copy hadoop-1.0.4.jar from your Hadoop installation to the ${HBASE_HOME}/lib folder.
Step 6: Copy ${HADOOP_HOME}/lib/commons-configuration-*.jar to ${HBASE_HOME}/lib/
Step 7: That's it; start the HBase server using the following command: bin/start-hbase.sh
Step 8: Run the jps command to see which services are running; you will find the following daemons up and running:

HRegionServer
HMaster
HQuorumPeer
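Once the daemons are up, a quick smoke test from the HBase shell confirms the installation; the table and column family names below are just examples:

$bin/hbase shell <<'EOF'
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
EOF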



Install and configure a 2-node Hadoop cluster using Ubuntu images


The major components involved in running the Hadoop ecosystem on a cluster are:
1. Hadoop Distributed File System (HDFS): HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is highly fault-tolerant, offers high throughput, and is suitable for applications with large data sets and streaming access to file system data.

2. MapReduce:-Map Reduce is the ‘heart‘ of Hadoop that consists of two parts – ‘map’ and ‘reduce’. Maps and reduces are programs for processing data. ‘Map’ processes the data first to give some intermediate output which is further processed by ‘Reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.

In this tutorial, I will describe how to set up and run a Hadoop cluster. We will build the cluster using Ubuntu virtual machines: one master and one or more slaves.

Following are the roles in which nodes may act in our cluster:

1. NameNode: The NameNode is the master node of HDFS; it holds the filesystem metadata and maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and a single point of failure in HDFS.

2. SecondaryNameNode: Periodically checkpoints the NameNode metadata (merging the edit log into the fsimage) for fault tolerance. There is exactly one SecondaryNameNode in each cluster.

3. JobTracker: - Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.

4. DataNode: -Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

5. TaskTracker: -Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

In our case, one machine in the cluster is designated as the NameNode, SecondaryNameNode and JobTracker; this is the master. The rest of the machines in the cluster act as both DataNode and TaskTracker; they are the slaves.

Step 1: Download the Ubuntu image from the Ubuntu website, download VMPlayer and configure the Ubuntu image (allow the guest machine to share the IP address of the host machine).

Step 2: Download the Hadoop binaries from the Apache website and extract them to your home folder.
Step 3: On the master machine, change the hostname to master and edit the /etc/hosts file (the original screenshot is not reproduced here; a sketch follows):
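A rough example of the /etc/hosts entries on both machines; the IP addresses are placeholders for your own VMs, and any 127.0.1.1 <hostname> line should be commented out so the daemons do not bind to loopback:

$cat <<'EOF' | sudo tee -a /etc/hosts
192.168.92.128  master
192.168.92.129  slave
EOF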



Step 4: Set up passwordless SSH connectivity between the machines.
Step 5: On the master machine, go to hadoop_home_folder/conf/masters and add the master hostname to the file.
Step 6: On the master machine, go to hadoop_home_folder/conf/slaves and add the slave hostnames to the file.
Step 7: Set the JAVA_HOME path in conf/hadoop-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

Step 8: Edit core-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/rajkrrsingh/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>


</configuration>

Step 9: Edit hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/rajkrrsingh/namenodeanddatanode</value>
</property>


</configuration>
Step 10: Edit mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>


</configuration>

Step 11: Repeat the same configuration steps on the slave machines.
Step 12: Format the namenode using the following command:
bin/hadoop namenode -format

Step 13: Now we are all set. Let's start the Hadoop cluster: first invoke the bin/start-dfs.sh command, followed by the bin/start-mapred.sh command.
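A quick way to verify the cluster once both scripts have finished (the expected daemons match the roles described above; jps also prints process ids):

# on the master
$jps
# expect NameNode, SecondaryNameNode and JobTracker
# on each slave
$jps
# expect DataNode and TaskTracker
$bin/hadoop dfsadmin -report
# should list the live datanodes and their capacity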