Monday, September 2, 2013

Apache Pig: Writing a Java UDF for Pig

In this tutorial we will write a user-defined function (UDF) for Pig in Java. Suppose we have sample data in a tab-separated file as follows. The first column is the customer's name, the second column is the customer's location, and the third column is the customer's credit rating on a scale of 10. We need to filter out the customers with a bad rating, i.e. less than 5. The fields in the file must be separated by a single tab character, since tab is the default delimiter Pig's LOAD uses.

Amit Noida 5
Ajay Delhi 8
Abhi Lucknow 3
Dev Punjab 7
Deepak Bihar 2
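
Before involving Pig at all, it can help to see which rows are expected to survive the filter. The following is only a minimal plain-Java sketch, not part of the Pig job: the class name CreditRatingSketch, its methods, and the constant MIN_GOOD_RATING are hypothetical, and the cutoff of 5 is an assumption taken from the "less than 5" wording above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CreditRatingSketch {

    // Assumed cutoff: the description treats ratings below 5 as bad.
    public static final int MIN_GOOD_RATING = 5;

    /** Returns true when the third tab-separated field is a rating of at least MIN_GOOD_RATING. */
    public static boolean isGoodRating(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            return false; // malformed row: treat as not good
        }
        try {
            return Integer.parseInt(fields[2].trim()) >= MIN_GOOD_RATING;
        } catch (NumberFormatException e) {
            return false; // non-numeric rating: treat as not good
        }
    }

    /** Returns the names (first field) of the rows that pass isGoodRating. */
    public static List<String> goodCustomers(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            if (isGoodRating(line)) {
                names.add(line.split("\t")[0]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The sample data from above, with explicit tab separators.
        List<String> sample = Arrays.asList(
                "Amit\tNoida\t5",
                "Ajay\tDelhi\t8",
                "Abhi\tLucknow\t3",
                "Dev\tPunjab\t7",
                "Deepak\tBihar\t2");
        System.out.println(goodCustomers(sample)); // prints [Amit, Ajay, Dev]
    }
}
```

Under that cutoff, Abhi and Deepak are dropped and the other three customers are kept.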

Let's create a Maven Java project using the following command:

>mvn archetype:generate -DgroupId=com.rajkrrsingh.pig.udf -DartifactId=JavaUDF
 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
The above command creates a new Java project named JavaUDF. Open pom.xml in the project directory and update it as follows to add the Pig, JUnit, and Hamcrest dependencies and the assembly plugin.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rajkrrsingh.pig.udf</groupId>
  <artifactId>JavaUDF</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>JavaUDF</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <!-- TODO: make sure Hadoop version is compatible -->
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
      <version>1.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/jar.xml</descriptor>
          </descriptors>
          <finalName>pig-examples</finalName>
          <outputDirectory>${project.build.directory}/../..</outputDirectory>
          <appendAssemblyId>false</appendAssemblyId>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Now, from the command line, run the following in the project directory:
mvn eclipse:eclipse
Import the project into Eclipse using Import > Existing Projects into Workspace, then add the following class to the com.rajkrrsingh.pig.udf package.
package com.rajkrrsingh.pig.udf;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

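/**
 * Pig filter UDF: keeps a record only when its credit rating is 5 or higher.
 */
public class IsGoodCreditRating extends FilterFunc {

    @Override
    public Boolean exec(Tuple args) throws IOException {
        // Reject empty input instead of failing the job.
        if (args == null || args.size() == 0) {
            return false;
        }
        try {
            Object rating = args.get(0);
            // A null field is treated as a bad rating.
            if (rating == null) {
                return false;
            }
            // Ratings below 5 are bad, so 5 and above pass the filter.
            return ((Integer) rating) >= 5;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}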
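Build the JAR file with mvn package (the assembly plugin is bound to the package phase and expects the descriptor src/main/assembly/jar.xml to exist) and move it to your cluster. Note that the POM above sets finalName to pig-examples, so make sure the file name in the REGISTER line below matches the JAR you actually copy. In the next step we will write a Pig script.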
CreditScore.pig
REGISTER JavaUDF.jar;
records = LOAD 'sample.txt' AS (name:chararray, location:chararray, creditrating:int);
filter_records = FILTER records BY com.rajkrrsingh.pig.udf.IsGoodCreditRating(creditrating);
grouped_records = GROUP filter_records BY location;
DUMP grouped_records;

Run the script with pig CreditScore.pig to see the result.
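
To get a feel for what the FILTER and GROUP steps compute on the sample data, the same logic can be mimicked in plain Java. This is only an illustrative sketch under stated assumptions: GroupByLocationSketch, Customer, sampleRows, and filterAndGroup are hypothetical names, the cutoff of 5 follows the "less than 5" wording at the top of the post, and the map collects only customer names, whereas Pig's GROUP produces a bag of full tuples per location.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByLocationSketch {

    /** Minimal stand-in for one row of the relation (name, location, creditrating). */
    public static final class Customer {
        final String name;
        final String location;
        final int rating;

        public Customer(String name, String location, int rating) {
            this.name = name;
            this.location = location;
            this.rating = rating;
        }
    }

    /** The five sample rows from the top of the post. */
    public static List<Customer> sampleRows() {
        return Arrays.asList(
                new Customer("Amit", "Noida", 5),
                new Customer("Ajay", "Delhi", 8),
                new Customer("Abhi", "Lucknow", 3),
                new Customer("Dev", "Punjab", 7),
                new Customer("Deepak", "Bihar", 2));
    }

    /** Drops rows rated below minRating, then collects the remaining names per location. */
    public static Map<String, List<String>> filterAndGroup(List<Customer> rows, int minRating) {
        // TreeMap keeps the locations sorted so the printed result is deterministic.
        Map<String, List<String>> byLocation = new TreeMap<>();
        for (Customer c : rows) {
            if (c.rating >= minRating) {
                byLocation.computeIfAbsent(c.location, k -> new ArrayList<>()).add(c.name);
            }
        }
        return byLocation;
    }

    public static void main(String[] args) {
        System.out.println(filterAndGroup(sampleRows(), 5));
        // prints {Delhi=[Ajay], Noida=[Amit], Punjab=[Dev]}
    }
}
```

Each surviving customer lives in a different location here, so every group holds a single entry. The order in which Pig prints the groups is not modeled by this sketch.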