In this tutorial we will see how to write a user-defined function (UDF) for Pig. Suppose we have sample data in the form of a tab-separated document, as follows. The first column holds the name of the customer, the second column the customer's location, and the third column the customer's credit rating on a scale of 10. We need to filter out the customers who scored a bad rating, e.g. less than 5.
Amit	Noida	5
Ajay	Delhi	8
Abhi	Lucknow	3
Dev	Punjab	7
Deepak	Bihar	2
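Before moving to Pig, the filtering rule itself can be sketched in plain Java against the sample rows. This is only an illustration: the class `CreditFilterSketch`, its method names, and the constant `MIN_GOOD_RATING` are hypothetical and not part of the tutorial's project, and the cut-off of 5 is an assumption taken from the "less than 5 is bad" rule in the introduction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the tutorial's Maven project.
final class CreditFilterSketch {

    // Assumed cut-off: ratings below 5 count as bad, per the introduction.
    static final int MIN_GOOD_RATING = 5;

    static boolean isGoodRating(int rating) {
        return rating >= MIN_GOOD_RATING;
    }

    // Returns the names of customers whose rating passes the cut-off.
    static List<String> goodCustomers(List<String> tabSeparatedLines) {
        List<String> names = new ArrayList<>();
        for (String line : tabSeparatedLines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip rows that do not have all three columns
            }
            int rating = Integer.parseInt(fields[2].trim());
            if (isGoodRating(rating)) {
                names.add(fields[0]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "Amit\tNoida\t5",
            "Ajay\tDelhi\t8",
            "Abhi\tLucknow\t3",
            "Dev\tPunjab\t7",
            "Deepak\tBihar\t2");
        System.out.println(goodCustomers(sample)); // prints [Amit, Ajay, Dev]
    }
}
```

The Pig UDF built in the rest of this tutorial plays the role of `isGoodRating`, while Pig itself takes care of reading and splitting the rows.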
Let's create a Maven Java project using the following command:
mvn archetype:generate -DgroupId=com.rajkrrsingh.pig.udf -DartifactId=JavaUDF -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

The above command creates a new Java project named JavaUDF. Open pom.xml in the project directory and add the following dependencies to it.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.rajkrrsingh.pig.udf</groupId>
  <artifactId>JavaUDF</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>JavaUDF</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <!-- TODO: make sure Hadoop version is compatible -->
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
      <version>1.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/jar.xml</descriptor>
          </descriptors>
          <finalName>pig-examples</finalName>
          <outputDirectory>${project.build.directory}/../..</outputDirectory>
          <appendAssemblyId>false</appendAssemblyId>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Now, from the command line, execute:
mvn eclipse:eclipse

Import the project into Eclipse using Import from existing project, create a Java package, and add the following class to it.
package com.rajkrrsingh.pig.udf;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

/** Filter UDF that keeps only tuples whose first field is a good credit rating. */
public class IsGoodCreditRating extends FilterFunc {

    @Override
    public Boolean exec(Tuple args) throws IOException {
        // An empty or missing tuple carries no rating, so reject it.
        if (args == null || args.size() == 0) {
            return false;
        }
        try {
            Object object = args.get(0);
            // A null rating (for example, a field that could not be parsed) is rejected.
            if (object == null) {
                return false;
            }
            int rating = (Integer) object;
            // Ratings below 5 are bad, as stated in the introduction, so 5 and above pass.
            return rating >= 5;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

Create the jar file using the assembly plugin and move it to your cluster. In the next step we will write a Pig script.
CreditScore.pig
REGISTER JavaUDF.jar;
records = LOAD 'sample.txt' AS (name:chararray, location:chararray, creditrating:int);
filter_records = FILTER records BY com.rajkrrsingh.pig.udf.IsGoodCreditRating(creditrating);
grouped_records = GROUP filter_records BY location;
DUMP grouped_records;
Run the script using pig CreditScore.pig and get the result.
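To get a feel for what the FILTER and GROUP steps of the script compute, they can be imitated in plain Java on the sample data. This is a rough sketch, not Pig's actual output format: the class `GroupByLocationSketch` and its method are hypothetical, the cut-off of 5 is an assumption based on the "less than 5 is bad" rule from the introduction, and the sorted map is used only to make the printed order predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Pig script's FILTER and GROUP steps.
final class GroupByLocationSketch {

    // Each row is {name, location, creditRating}, mirroring the script's schema.
    static Map<String, List<String>> groupGoodByLocation(List<String[]> rows, int minGoodRating) {
        // TreeMap keeps locations sorted so the printed result is deterministic.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            int rating = Integer.parseInt(row[2]);
            if (rating >= minGoodRating) { // FILTER step
                // GROUP step: collect customer names under their location.
                groups.computeIfAbsent(row[1], key -> new ArrayList<>()).add(row[0]);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> sample = List.of(
            new String[] {"Amit", "Noida", "5"},
            new String[] {"Ajay", "Delhi", "8"},
            new String[] {"Abhi", "Lucknow", "3"},
            new String[] {"Dev", "Punjab", "7"},
            new String[] {"Deepak", "Bihar", "2"});
        // Assumed cut-off of 5: ratings below 5 are treated as bad.
        System.out.println(groupGoodByLocation(sample, 5));
        // prints {Delhi=[Ajay], Noida=[Amit], Punjab=[Dev]}
    }
}
```

In the sample every location occurs once, so each group holds a single customer. With more data, customers sharing a location would end up in the same group.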