Tuesday, November 19, 2013

Apache Oozie: Getting Started

Apache Oozie Introduction:

--- Started at Yahoo, currently maintained as an Apache open source project.
--- Oozie is a workflow scheduler system to manage Apache Hadoop jobs, such as:
-- MapReduce
-- Pig, Hive
-- Streaming
-- Standard Applications
--- Oozie is a scalable, reliable and extensible system.

--- The user specifies the action flow as a Directed Acyclic Graph (DAG).
--- A DAG is a collection of vertices and directed edges, arranged so that no path following the edges visits the same vertex twice.
--- Each node represents either a job or a script.
--- Execution and branching can be parameterized by time, decision, data availability, file size, etc.
--- The client specifies the process flow in workflow XML.
--- Oozie is an extra level of abstraction between the user and Hadoop.
--- Oozie has its own server application, which talks to its own database (Apache Derby by default; MySQL, Oracle, etc. can also be used).
--- The user must load the required components into HDFS before execution: input data, workflow XML, JARs, and resource files.
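
The points above can be sketched as a minimal workflow definition staged for HDFS. This is an illustration only, not a definitive setup: the workflow name `demo-wf`, the action name `mr-node`, the local directory `spjob`, and the input/output paths are hypothetical, and the schema version `uri:oozie:workflow:0.4` is assumed to match the installed Oozie release. The upload step runs only when a `hadoop` client is on the PATH.

```shell
#!/bin/sh
# Hypothetical local staging directory for the workflow application.
mkdir -p spjob

# Quoted heredoc delimiter: ${...} is left for Oozie to resolve, not the shell.
cat > spjob/workflow.xml <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="demo-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/rks/spjob/input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/rks/spjob/output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

# Stage the application into HDFS only if a Hadoop client is available.
# /user/rks is assumed to exist already.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -put spjob /user/rks/ || echo "upload to HDFS failed" >&2
fi
```

The nodes form a small DAG: `start` leads to `mr-node`, which leads to `end` on success or to the `fail` kill node on error, so no node is visited twice.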

Interaction with Oozie through command line
$ oozie job -oozie http://localhost:11000/oozie -config /user/rks/spjob/job.properties -run
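
The `-config` option points at a properties file on the local file system. A minimal sketch of such a file follows. `nameNode` and `jobTracker` are conventional variable names that the workflow XML can reference, the host and port values are placeholders for a single-node setup, and the local directory `spjob` is hypothetical. Only `oozie.wf.application.path` is a property name defined by Oozie itself.

```shell
#!/bin/sh
mkdir -p spjob

# Quoted heredoc delimiter keeps ${nameNode} literal so Oozie resolves it.
cat > spjob/job.properties <<'EOF'
# Placeholder endpoints; adjust to the cluster.
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
# HDFS directory that holds workflow.xml.
oozie.wf.application.path=${nameNode}/user/rks/spjob
EOF

# With OOZIE_URL exported, the -oozie option can be omitted.
export OOZIE_URL=http://localhost:11000/oozie

# Submit and start the job only if the Oozie client is installed.
if command -v oozie >/dev/null 2>&1; then
    oozie job -config spjob/job.properties -run || echo "job submission failed" >&2
fi
```

On success the client prints a job ID, which can then be passed to `oozie job -info <job-id>` to check progress.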

Web Interface

-Download Oozie from the official Apache Oozie site
-Download ExtJS
-Configure core-site.xml
-Restart the NameNode
-Copy the Hadoop JARs into a directory
-Extract ExtJS into Oozie's webapp
-Run oozie-setup.sh
-Relocate the newly generated WAR file
-Configure oozie-site.xml
-Initialize the database
-Start the Oozie server
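
A few of the steps above can be sketched as shell commands. This is a rough outline under stated assumptions, not a full install script: `oozie` is assumed to be the OS user that runs the Oozie server, `OOZIE_HOME` is a hypothetical variable pointing at the unpacked distribution, and the script names are those shipped with Oozie 3.3 and 4.x releases. The arguments to `oozie-setup.sh` differ between releases, so that step is left out here.

```shell
#!/bin/sh
# Fragment to merge into Hadoop's core-site.xml, inside <configuration>.
# It lets the "oozie" user (an assumption) submit jobs on behalf of other users.
cat > core-site-oozie-fragment.xml <<'EOF'
<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>
EOF

# Run the server-side steps only when an Oozie distribution is present.
if [ -n "${OOZIE_HOME:-}" ] && [ -x "$OOZIE_HOME/bin/oozied.sh" ]; then
    # Initialize the database schema.
    "$OOZIE_HOME/bin/ooziedb.sh" create -sqlfile oozie.sql -run
    # Start the Oozie server.
    "$OOZIE_HOME/bin/oozied.sh" start
    # Ask the server for its status.
    "$OOZIE_HOME/bin/oozie" admin -oozie http://localhost:11000/oozie -status
fi
```

The wildcard values are convenient for a single-node test setup. On a shared cluster they would normally be narrowed to specific hosts and groups.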

That's it. Next, we will run a MapReduce job configured using Oozie. Stay tuned.