Monday, September 2, 2013

Apache Pig : Installation and Running Pig on a Multi-Node Cluster (Ubuntu)

Pig installation is very straightforward. Even if you want to configure Pig on a multi-node Hadoop cluster, there is no need to install any extra API or utility on the other nodes; Pig launches jobs and interacts with your Hadoop filesystem from your own node.

Prerequisite: Java 6 (installed, with JAVA_HOME set properly)

Get the binaries from the official Apache Pig download page mentioned here.
Download the tarball; I am using Ubuntu, so it is convenient to use wget:

#wget http://www.dsgnwrld.com/am/pig/pig-0.11.1/pig-0.11.1.tar.gz

After the download completes, extract the tarball as follows:
#tar xzf pig-0.11.1.tar.gz

Now it's time to add the Pig binaries to your command-line path:
#export PIG_HOME=/home/rajkrrsingh/pig-0.11.1
#export PATH=$PATH:$PIG_HOME/bin
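These exports only last for the current shell session. To make them permanent you can append them to ~/.bashrc; the path below assumes the same install location as above:

```shell
# Persist the Pig environment by appending the exports to ~/.bashrc
# (install path assumed to match the steps above).
echo 'export PIG_HOME=/home/rajkrrsingh/pig-0.11.1' >> ~/.bashrc
echo 'export PATH=$PATH:$PIG_HOME/bin' >> ~/.bashrc

# Reload the profile so the current shell also picks up the variables.
. ~/.bashrc
```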

By setting these environment variables you can run Pig in local mode, but to run Pig on the cluster you still need to tell the Pig runtime about your Hadoop installation so that it can pick up the cluster information from core-site.xml, hdfs-site.xml and mapred-site.xml.

By setting PIG_CLASSPATH you can point Pig at the cluster configuration:
export PIG_CLASSPATH="/home/rajkrrsingh/hadoop-1.0.4/conf"
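Pig reads the NameNode and JobTracker addresses from the files in that conf directory. As a rough sketch of what it is looking for, the two key properties in a Hadoop 1.x setup would look something like this (the hostname master and the ports are placeholders for your own cluster):

```xml
<!-- core-site.xml: address of the HDFS NameNode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

<!-- mapred-site.xml: address of the JobTracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
```

If these files are missing or empty, Pig silently falls back to local mode, so it is worth checking them when jobs do not show up on the cluster.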

That's all that is needed to install Pig on your cluster. Now it's time to run Pig using the following command:
#pig -x mapreduce
or
#pig

grunt>
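Once you see the grunt> prompt, a short Pig Latin script makes an easy smoke test that the cluster is wired up. The file name below is only an illustration; put any colon-delimited file (for example /etc/passwd) into HDFS first with hadoop fs -put:

```pig
-- Upload a sample file first:  hadoop fs -put /etc/passwd passwd
A = LOAD 'passwd' USING PigStorage(':');  -- split each line on ':'
B = FOREACH A GENERATE $0 AS username;    -- keep only the first field
DUMP B;                                   -- launches a MapReduce job and prints the usernames
```

If the DUMP runs as a MapReduce job (you will see a job id in the console output), Pig is talking to your cluster rather than running locally.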