Monday, September 30, 2013

How to Convert TSV to JSON using the command line

Quite often a dataset is provided as TSV and we need to convert it to the JSON format.
Here I am demonstrating a simple and useful approach to achieve that.

We have sample input data as follows:

ord_status.tsv
5 2 3
111 109 2
21 12 9

It's basically data from an e-commerce application that states the order qty, ship qty and back-order qty of a particular item. Now it's time to convert it to JSON documents:

$: export FIELDS=ord_qty,shp_qty,backord_qty
$: cat ord_status.tsv| ruby -rjson -ne 'puts ENV["FIELDS"].split(",").zip($_.strip.split("\t")).inject({}){|h,x| h[x[0]]=x[1];h}.to_json'

The one-liner zips the field names from $FIELDS with each tab-separated row, builds a hash, and prints it as JSON. Here is the outcome:
{"ord_qty":"5","shp_qty":"2","backord_qty":"3"}
{"ord_qty":"111","shp_qty":"109","backord_qty":"2"}
{"ord_qty":"21","shp_qty":"12","backord_qty":"9"}

Thursday, September 19, 2013

Spring Roo with MongoDB Persistence

Spring Roo is an open source software tool that uses convention-over-configuration principles to provide rapid application development of Java-based enterprise software. The resulting applications use common Java technologies such as Spring Framework, Java Persistence API, Java Server Pages, Apache Maven and AspectJ. Spring Roo is a member of the Spring portfolio of projects.

Roo focuses on higher productivity, stock-standard Java APIs, high usability, avoiding engineering trade-offs and facilitating easy Roo removal.

MongoDB is a leading open-source NoSQL document database. Written in C++, it supports document-oriented storage; other major features include full index support, replication and HA, optimized querying, auto-sharding, and GridFS support to store large files.
It also includes an aggregation framework to aggregate query results over large sets of unstructured Big Data.

Spring Roo now supports MongoDB persistence. In this tutorial I will demonstrate building a test application using Spring Roo with MongoDB persistence.

Prerequisites:
1. Spring Roo installed and added to the PATH environment variable
2. Maven 2.2+
3. MongoDB 2.4 installed, with mongod running on port 27017

Let's open a Roo console by typing roo:
    ____  ____  ____
   / __ \/ __ \/ __ \
  / /_/ / / / / / / /
 / _, _/ /_/ / /_/ /
/_/ |_|\____/\____/    1.2.4.RELEASE [rev 75337cf]


Welcome to Spring Roo. For assistance press TAB or type "hint" then hit ENTER.
roo>

At the prompt, execute the following commands in order; they create the project, set up MongoDB persistence against the personDB database, define a Person entity with a name field, add a Spring Data repository, and scaffold the web layer (a rough sketch of the generated entity follows the list):

project --topLevelPackage com.rajkrrsingh.roomongoapp
mongo setup --databaseName personDB
entity mongo --class ~.model.Person --testAutomatically
field string --fieldName name --notNull
repository mongo --interface ~.repository.PersonRepository --entity ~.model.Person
web mvc setup
web mvc scaffold --class ~.web.PersonController
perform package
quit
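
For reference, the entity mongo and field string commands above produce a Person class roughly like the sketch below (a sketch only; the exact annotations, imports and generated AspectJ ITDs depend on the Roo version), while repository mongo creates a PersonRepository interface that Roo backs with Spring Data MongoDB:

package com.rajkrrsingh.roomongoapp.model;

import javax.validation.constraints.NotNull;

import org.springframework.roo.addon.javabean.RooJavaBean;
import org.springframework.roo.addon.layers.repository.mongo.RooMongoEntity;
import org.springframework.roo.addon.tostring.RooToString;

// Roughly what Roo generates; getters, setters and the id field live in the *_Roo_*.aj ITDs.
@RooJavaBean
@RooToString
@RooMongoEntity
public class Person {

    @NotNull
    private String name;
}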

After quitting the Roo console, type the following command; it will deploy and run your application on Tomcat:

mvn tomcat:run

Open a web browser and access http://localhost:8080/roomongoapp/; your sample application is up and running. Now it's time to look into the Mongo database.
Open the mongo console and switch to the personDB database:

> use personDB
switched to db personDB
> show collections
person
system.indexes
> db.person.find().pretty()
{
        "_id" : "101",
        "_class" : "com.rajkrrsingh.roomongoapp.model.Person",
        "name" : "Rajkumar Singh"
}
{
        "_id" : "102",
        "_class" : "com.rajkrrsingh.roomongoapp.model.Person",
        "name" : "Sharad Singh"
}

The records inserted through the application are there in the database.
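
If you want to insert documents from code rather than through the scaffolded pages, the generated repository can be used like any Spring Data repository. A hypothetical helper bean (PersonLoader is my own name, not something Roo generates) might look like this:

package com.rajkrrsingh.roomongoapp;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.rajkrrsingh.roomongoapp.model.Person;
import com.rajkrrsingh.roomongoapp.repository.PersonRepository;

// Hypothetical helper: saves a Person through the Roo-generated repository
// and prints every document currently in the person collection.
@Component
public class PersonLoader {

    @Autowired
    private PersonRepository personRepository;

    public void loadAndList() {
        Person person = new Person();
        person.setName("Rajkumar Singh"); // setter comes from Roo's @RooJavaBean ITD
        personRepository.save(person);

        for (Person each : personRepository.findAll()) {
            System.out.println(each.getName());
        }
    }
}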

Monday, September 2, 2013

Apache Pig : Writing a Java UDF for Pig

In this tutorial we will write a user-defined function (UDF) for Pig. Suppose we have sample data in the form of a tab-separated document as follows: the first column is the name of the customer, the second column is the location of the customer, and the third column is the customer's credit rating on a scale of 10. We need to filter out the customers with a bad rating, i.e. keep only those rated above 5.

Amit  Noida 5
Ajay Delhi 8
Abhi Lucknow 3
Dev Punjab 7
Deepak Bihar 2

Let's create a Maven Java project using the following command:

>mvn archetype:generate -DgroupId=com.rajkrrsingh.pig.udf -DartifactId=JavaUDF \
 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
The above command will create a new Java project named JavaUDF. Open pom.xml in the project directory and add the following dependencies to it.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rajkrrsingh.pig.udf</groupId>
  <artifactId>JavaUDF</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>JavaUDF</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <!-- TODO: make sure Hadoop version is compatible -->
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
      <version>1.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/jar.xml</descriptor>
          </descriptors>
          <finalName>pig-examples</finalName>
          <outputDirectory>${project.build.directory}/../..</outputDirectory>
          <appendAssemblyId>false</appendAssemblyId>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Now from the command line execute
mvn eclipse:eclipse
Import the project into Eclipse using Import > Existing Projects into Workspace, then create a Java package and add the following class to it.
package com.rajkrrsingh.pig.udf;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsGoodCreditRating extends FilterFunc {

    @Override
    public Boolean exec(Tuple args) throws IOException {
        if (args == null || args.size() == 0) {
            return false;
        }
        try {
            Object object = args.get(0);
            if (object == null) {
                return false;
            }
            int rating = (Integer) object;
            return rating > 5;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
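Since the pom already pulls in JUnit and Hamcrest, the filter logic can be checked without a cluster. A minimal test might look like this (the test class name is my own choice):

package com.rajkrrsingh.pig.udf;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.junit.Test;

public class IsGoodCreditRatingTest {

    private final IsGoodCreditRating udf = new IsGoodCreditRating();

    // Builds a single-field tuple holding the given credit rating.
    private Tuple tupleOf(Integer rating) throws Exception {
        Tuple tuple = TupleFactory.getInstance().newTuple(1);
        tuple.set(0, rating);
        return tuple;
    }

    @Test
    public void ratingAboveFiveIsGood() throws Exception {
        assertTrue(udf.exec(tupleOf(8)));
    }

    @Test
    public void ratingOfFiveOrBelowIsBad() throws Exception {
        assertFalse(udf.exec(tupleOf(5)));
        assertFalse(udf.exec(tupleOf(2)));
    }

    @Test
    public void missingInputIsRejected() throws Exception {
        assertFalse(udf.exec(null));
        assertFalse(udf.exec(tupleOf(null)));
    }
}

Run it with mvn test before packaging the jar.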
Create the jar file using the assembly plugin (mvn package; the configuration above expects an assembly descriptor at src/main/assembly/jar.xml and names the output pig-examples.jar, so rename the jar or adjust the REGISTER line accordingly) and move it to your cluster. In the next step we will write a Pig script.
CreditScore.pig
REGISTER JavaUDF.jar;
records = LOAD 'sample.txt' AS (name:chararray, location:chararray, creditrating:int);
filter_records = FILTER records BY com.rajkrrsingh.pig.udf.IsGoodCreditRating(creditrating);
grouped_records = GROUP filter_records BY location;
DUMP grouped_records;

Run the script using pig CreditScore.pig and get the result; with the sample data above, only Ajay (rating 8) and Dev (rating 7) pass the filter, so the dump shows one group for Delhi and one for Punjab.

Apache Pig : Installation and Running Pig on a Multi-Node Cluster (Ubuntu)

Pig installation is very straightforward. If you want to configure Pig on a multi-node Hadoop cluster, there is no need to install any specific API or utility on every node; Pig launches jobs and interacts with your Hadoop filesystem from your own node.

Prerequisite : Java 6 (install and set JAVA_HOME properly)

Get the binaries to install Pig from the official Apache Pig website mentioned here.
Download the binaries; I am using Ubuntu, so it's better to use wget:

#wget http://www.dsgnwrld.com/am/pig/pig-0.11.1/pig-0.11.1.tar.gz

After the download completes, extract the tarball as follows:
#tar xzf pig-0.11.1.tar.gz

Now it's time to add the Pig binaries to your command-line path:
#export PIG_HOME=/home/rajkrrsingh/pig-0.11.1
#export PATH=$PATH:$PIG_HOME/bin

By setting these environment variables you are able to run Pig in local mode, but to run Pig on the cluster you still need to give the Pig runtime some information about your Hadoop installation so that it can pick up the cluster configuration from hdfs-site.xml, mapred-site.xml and core-site.xml.

By setting PIG_CLASSPATH you can point Pig at the cluster configuration:
export PIG_CLASSPATH="/home/rajkrrsingh/hadoop-1.0.4/conf"

That's all that is needed to install Pig on your cluster. Now it's time to run Pig using the following command:
#pig -x mapreduce
or
#pig

grunt>