Thursday, October 26, 2017

Apache Druid Installation Issue on HDP

During the installation I hit a few issues where the install failed for various reasons. I have documented some of the hurdles I faced and how I overcame them. Note that there is an issue with the Superset installation when you select SQLite as the storage, so please select MySQL or Postgres as the Superset storage in the Ambari installation wizard.

Issue 1: Requires: openblas-devel

Druid broker installation failed with the following exception:

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/yum -d 0 -e 0 -y install superset_2_6_2_0_205' returned 1. Error: Package: superset_2_6_2_0_205-0.15.0.2.6.2.0-205.x86_64 (HDP-2.6)
           Requires: openblas-devel
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Resolution:

wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-10.noarch.rpm
rpm -ivh epel-release-7-10.noarch.rpm 
yum install openblas-devel.x86_64 -y
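
To confirm the fix took effect before retrying the install from Ambari, a quick sanity check (a minimal sketch; adjust to your EPEL mirror) is:

# confirm the EPEL repo is registered and openblas-devel is now installed
yum repolist enabled | grep -i epel
yum list installed openblas-devel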

Issue 2: Druid broker start failed with the following exception

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/jdk64/jdk1.8.0_112/bin/java -cp /usr/lib/ambari-agent/DBConnectionVerification.jar:/usr/hdp/current/druid-broker/extensions/mysql-metadata-storage/* org.apache.ambari.server.DBConnectionVerification 'jdbc:mysql://localhost:3306/druid?createDatabaseIfNotExist=true' druid [PROTECTED] com.mysql.jdbc.Driver' returned 1. ERROR: Unable to connect to the DB. Please check DB connection properties.
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

Resolution:

wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.37/mysql-connector-java-5.1.37.jar
ambari-server setup --jdbc-db=mysql --jdbc-driver=<location where you downloaded mysql connector>
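
For example, assuming the connector jar was downloaded to /tmp (adjust the path to wherever you saved it), the command looks like this; restart ambari-server afterwards so the driver is registered:

ambari-server setup --jdbc-db=mysql --jdbc-driver=/tmp/mysql-connector-java-5.1.37.jar
ambari-server restart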

Issue 3: Druid services failed to start with the following exception

File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 303, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/jdk64/jdk1.8.0_112/bin/java -cp /usr/lib/ambari-agent/DBConnectionVerification.jar:/usr/hdp/current/druid-broker/extensions/mysql-metadata-storage/mysql-connector-java-5.1.37.jar org.apache.ambari.server.DBConnectionVerification 'jdbc:mysql://rk262secure.hdp.local:3306/druid?createDatabaseIfNotExist=true' druid [PROTECTED] com.mysql.jdbc.Driver' returned 1. ERROR: Unable to connect to the DB. Please check DB connection properties.
java.sql.SQLException: Access denied for user 'druid'@'rk262secure.hdp.local' (using password: YES)

Resolution:

mysql> CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE superset DEFAULT CHARACTER SET utf8;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE USER 'superset'@'%' IDENTIFIED BY 'druid';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE USER 'druid'@'%' IDENTIFIED BY 'druid';
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON *.* TO 'druid'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON *.* TO 'superset'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.00 sec)
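
As a quick sanity check you can verify that both users can reach their databases from the Druid/Superset host (hostname and passwords as created above):

mysql -h rk262secure.hdp.local -u druid -pdruid -e "USE druid; SELECT 1;"
mysql -h rk262secure.hdp.local -u superset -pdruid -e "USE superset; SELECT 1;"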

Friday, October 6, 2017

Monitoring Kafka Broker JMX using Jolokia JVM Agent

Download the Jolokia JVM agent from the following location:
https://jolokia.org/download.html
wget http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.7/jolokia-jvm-1.3.7-agent.jar

mv jolokia-jvm-1.3.7-agent.jar agent.jar
Here is a small shell script to get the metrics you are interested in:
#!/bin/sh

url=`java -jar agent.jar start $1 | tail -1`

memory_url="${url}read/kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"
used=`wget -q -O - "${memory_url}/MeanRate" | grep MeanRate`
echo $used
java -jar agent.jar --quiet stop $1
Now we are good to run this and start seeing messages (19876 is the PID of the Kafka broker):
./script1.sh 19876 
{"timestamp":1507308779,"status":200,"request":{"mbean":"kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool","attribute":"MeanRate","type":"read"},"value":0.9999940178320261}

Tuesday, September 5, 2017

Kafka Mirror Maker - from source non-kerberized cluster to kerberized cluster

Env:

source cluster:
HDP242
un-secure
hostname: rksnode1

destination cluster:
HDP242
Kerberos enabled
hostname: rksnode2

Source cluster consumer settings

cat /tmp/sourceClusConsumer.properties 
zookeeper.connect=rksnode1:2181
bootstrap.servers=rksnode1:6667
zk.connectiontimeout.ms=1000000
consumer.timeout.ms=-1
group.id=mirror-maker-group
shallow.iterator.enable=true
mirror.topics.whitelist=app_log
enable.auto.commit=true
#security.protocol=PLAINTEXT

Destination cluster producer settings

cat /tmp/targetClusProducer.properties 
bootstrap.servers=rksnode2.hdp.local:6667
#zk.connect=rksnode2:2181
producer.type=async
compression.codec=0
serializer.class=kafka.serializer.DefaultEncoder
max.message.size=10000000
queue.time=1000
queue.enqueueTimeout.ms=-1
security.protocol=PLAINTEXTSASL

Running mirror maker

Now log in to the target cluster (rksnode2 in this case) and run MirrorMaker as follows:
//get the kerberos ticket
kinit -kt /etc/security/keytabs/kafka.service.keytab kafka/rksnode2.hdp.local@EXAMPLE.COM
//run mirror maker
/usr/hdp/current/kafka-broker/bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config /tmp/sourceClusConsumer.properties  --producer.config /tmp/targetClusProducer.properties --whitelist test --new.consumer
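
To verify that messages are actually being mirrored, you can consume from the target cluster with a console consumer (a minimal sketch; the properties file name is arbitrary, and the exact flags may differ slightly between Kafka versions, e.g. older releases need --new-consumer together with --bootstrap-server):

echo "security.protocol=PLAINTEXTSASL" > /tmp/secureConsumer.properties
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --new-consumer --bootstrap-server rksnode2.hdp.local:6667 --topic test --from-beginning --consumer.config /tmp/secureConsumer.properties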

Wednesday, August 2, 2017

Creating Custom UDF with LLAP

Creating and running temporary functions is discouraged when running queries on LLAP for security reasons: since many users share the same LLAP instances, it can create conflicts. You can still create temporary functions using add jar and hive.llap.execution.mode=auto.

With the exclusive LLAP execution mode (hive.llap.execution.mode=only) you will run into a ClassNotFoundException; hive.llap.execution.mode=auto allows some parts of the query (the map tasks) to run in Tez containers.

Here are the steps to create a permanent custom function in LLAP (steps tested on HDP 2.6.0):

  1. Create a jar for the UDF function (in this case I am using a simple UDF):
git clone https://github.com/rajkrrsingh/SampleCode
mvn clean package
  2. Upload the target/SampleCode.jar to the node where HSI is running (in my case I copied it to the /tmp directory).
  3. Add the jar to hive_aux_jars: go to Ambari --> Hive --> Config --> hive-interactive-env template
      export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:/tmp/SampleCode.jar
  4. Add the jar to the Auxillary JAR list: go to Ambari --> Hive --> Config --> Auxillary JAR list
Auxillary JAR list=/tmp/SampleCode.jar
  5. Restart LLAP.
  6. Create the permanent custom function: connect to HSI using beeline
create FUNCTION CustomLength as 'com.rajkrrsingh.hiveudf.CustomLength';
describe function CustomLength;
select CustomLength(description) from sample_07 limit 1;
  7. Check where SampleCode.jar gets localized:
[root@hdp26 container_e06_1501140901077_0019_01_000002]# pwd
/hadoop/yarn/local/usercache/hive/appcache/application_1501140901077_0019/container_e06_1501140901077_0019_01_000002
[root@hdp26 container_e06_1501140901077_0019_01_000002]# find . -iname sample*
./app/install/lib/SampleCode.jar

Sunday, June 18, 2017

Spark Kafka Integration in a Kerberized Environment

Sample Application

using direct stream
 import kafka.serializer.StringDecoder;
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.kafka._
 
 
 object SparkKafkaConsumer2 {
 
   def main(args: Array[String]) {
 
     // TODO: Print out line in log of authenticated user
     val Array(brokerlist, group, topics, numThreads) = args
     var kafkaParams = Map(
       "bootstrap.servers"->"rks253secure.hdp.local:6667",
       "key.deserializer" ->"org.apache.kafka.common.serialization.StringDeserializer",
       "value.deserializer" ->"org.apache.kafka.common.serialization.StringDeserializer",
       "group.id"-> "test",
       "security.protocol"->"PLAINTEXTSASL",
       "auto.offset.reset"-> "smallest"
     )
 
     val sparkConf = new SparkConf().setAppName("KafkaWordCount")
     val ssc = new StreamingContext(sparkConf, Seconds(100))
 
     val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder] (ssc, kafkaParams, Set(topics))
 
     // TODO: change to be a variable
     kafkaStream.saveAsTextFiles("/tmp/streaming_output")
     ssc.start()
     ssc.awaitTermination()
   }
 }
using createStream
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

object Kafka_Word_Count {

 def main(args: Array[String]) {
   
   val conf = new SparkConf().setAppName("KafkaWordCount")
     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .set("spark.driver.allowMultipleContexts", "true")

   val ssc = new StreamingContext(conf, Seconds(3))
   val groupID = "test"
   val numThreads = "2"
   val topic = "kafkatopic"
   val topicMap = topic.split(",").map((_, numThreads.toInt)).toMap

   val kafkaParams = Map[String, String](
     "zookeeper.connect" -> "rks253secure.hdp.local:2181",
     "group.id" -> groupID,
     "zookeeper.connection.timeout.ms" -> "10000",
     "security.protocol"->"PLAINTEXTSASL"
   )

   val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicMap,
     StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
   lines.print()
   ssc.start()
   ssc.awaitTermination()
 }
}

kafka_jaas.conf (for spark local mode)

KafkaServer { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
keyTab="/etc/security/keytabs/kafka.service.keytab" 
storeKey=true 
useTicketCache=false 
serviceName="kafka" 
principal="kafka/rks253secure.hdp.local@EXAMPLE.COM"; 
}; 
KafkaClient { 
com.sun.security.auth.module.Krb5LoginModule required 
useTicketCache=true 
renewTicket=true 
serviceName="kafka"; 
}; 
Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
keyTab="/etc/security/keytabs/kafka.service.keytab" 
storeKey=true 
useTicketCache=false 
serviceName="zookeeper" 
principal="kafka/rks253secure.hdp.local@EXAMPLE.COM"; 
};

kafka_jaas.conf (for spark yarn client mode)

KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/kafka.service.keytab"
serviceName="kafka"
principal="kafka/rks253secure.hdp.local@EXAMPLE.COM";
};
Client {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   keyTab="/etc/security/keytabs/kafka.service.keytab"
   storeKey=true
   useTicketCache=false
   serviceName="zookeeper"
   principal="kafka/rks253secure.hdp.local@EXAMPLE.COM";
};

Spark Submit command

kinit as the kafka user:

spark-submit --files /etc/kafka/conf/kafka_jaas.conf,/etc/security/keytabs/kafka.service.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/etc/kafka/conf/kafka_jaas.conf" --driver-java-options "-Djava.security.auth.login.config=/etc/kafka/conf/kafka_jaas.conf" --class SparkKafkaConsumer2 --master local[2] \
/tmp/SparkKafkaSampleApp-1.0-SNAPSHOT-jar-with-dependencies.jar "rks253secure.hdp.local:6667" test kafkatopic 1 
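
For yarn-client mode the same submit works with one change: the jaas file shipped via --files lands in each executor's working directory, so the executor JVM option should point at the local file name. A sketch (assuming the kafka_jaas.conf for yarn client mode shown above, and that the keytab path inside it is readable on every NodeManager host; otherwise reference the localized ./kafka.service.keytab):

spark-submit --master yarn-client \
  --files /etc/kafka/conf/kafka_jaas.conf,/etc/security/keytabs/kafka.service.keytab \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=/etc/kafka/conf/kafka_jaas.conf" \
  --class SparkKafkaConsumer2 \
  /tmp/SparkKafkaSampleApp-1.0-SNAPSHOT-jar-with-dependencies.jar "rks253secure.hdp.local:6667" test kafkatopic 1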

Saturday, May 27, 2017

Spark LLAP Setup for Spark Thrift Server

ENV HDP-2.6.0.3-8

Download the spark-llap assembly jar from http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/

Add the following in Custom spark-thrift-sparkconf:

spark_thrift_cmd_opts --jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.executor.extraClassPath /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.sql.hive.hiveserver2.url jdbc:hive2://hostname1.hwxblr.com:10500/;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=${user}
spark.hadoop.hive.zookeeper.quorum hostname1.hwxblr.com:2181

Add the following to Custom spark-defaults:

spark.sql.hive.hiveserver2.url jdbc:hive2://hostname1.hwxblr.com:10500/;principal=hive/_HOST@EXAMPLE.COM
spark.jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.hadoop.hive.zookeeper.quorum hostname1.hwxblr.com:2181
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.executor.extraClassPath /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar

Start the thrift server from Ambari and run a query as follows:

beeline -u "jdbc:hive2://hostname3.hwxblr.com:10015/;principal=hive/_HOST@EXAMPLE.COM" -e "select * from test;"

If your query fails with the following exception, check that the spark-llap assembly is available on the executors' classpath (revisit spark.executor.extraClassPath):

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hostname1.hwxblr.com): java.lang.NullPointerException
	at org.apache.hadoop.hive.llap.tez.LlapProtocolClientProxy.<init>(LlapProtocolClientProxy.java:94)
	at org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient.<init>(LlapTaskUmbilicalExternalClient.java:119)
	at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getRecordReader(LlapBaseInputFormat.java:143)
	at org.apache.hadoop.hive.llap.LlapRowInputFormat.getRecordReader(LlapRowInputFormat.java:51)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:240)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:388)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
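
A quick way to rule this out is to confirm that the assembly jar actually exists at the configured path on every host that runs executors (path taken from the configuration above):

ls -l /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar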

Sunday, May 21, 2017

Steps to set up a KDC before installing Kerberos through Ambari on a Hortonworks cluster

ENV

OS: CentOS 7
REALM: EXAMPLE.COM (update accordingly)
AS and KDC are running on host rks253secure.hdp.local (update accordingly)

install required packages

yum install -y krb5-server krb5-workstation pam_krb5
cd  /var/kerberos/krb5kdc

modify kadm acls

cat kadm5.acl 
*/admin@EXAMPLE.COM	*

modify kdc conf

cat kdc.conf 
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 EXAMPLE.COM = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }

modify krb5.conf on the node where the Ambari server is running

cat /etc/krb5.conf

[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = EXAMPLE.COM
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  EXAMPLE.COM = {
    admin_server = rks253secure.hdp.local
    kdc = rks253secure.hdp.local
  }

create KDC database

kdb5_util create -s -r EXAMPLE.COM

Start and Enable Kerberos

systemctl start krb5kdc kadmin
systemctl enable krb5kdc kadmin

create principal root/admin@EXAMPLE.COM

# kadmin.local
kadmin.local: addprinc root/admin
kadmin.local: quit

test if you are able to get a TGT after supplying the password

kinit root/admin@EXAMPLE.COM
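
If the kinit succeeds, klist should show a krbtgt/EXAMPLE.COM@EXAMPLE.COM ticket for root/admin:

klist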

Now start the Ambari Enable Kerberos wizard, which will ask you to supply the KDC and admin server hostnames and the REALM to get started.


Thursday, May 18, 2017

How to Configure and Run Storm AutoHDFS plugin (sample Application)

Add these configurations to custom storm-site.

nimbus.autocredential.plugins.classes ["org.apache.storm.hdfs.common.security.AutoHDFS"]
nimbus.credential.renewers.classes ["org.apache.storm.hdfs.common.security.AutoHDFS"]
hdfs.keytab.file  /etc/security/keytabs/hdfs.headless.keytab
hdfs.kerberos.principal hdfs-s253_kerb@LAB.HORTONWORKS.NET
nimbus.credential.renewers.freq.secs 518400

nimbus.childopts -Xmx1024m _JAAS_PLACEHOLDER -javaagent:/usr/hdp/current/storm-nimbus/contrib/storm-jmxetric/lib/jmxetric-1.0.4.jar=host=localhost,port=8649,wireformat31x=true,mode=multicast,config=/usr/hdp/current/storm-nimbus/contrib/storm-jmxetric/conf/jmxetric-conf.xml:/etc/hadoop/conf/hdfs-site.xml:/etc/hadoop/conf/core-site.xml:/etc/hbase/conf/hbase-site.xml,process=Nimbus_JVM

add the following in storm-env

export STORM_EXT_CLASSPATH=/usr/hdp/current/hbase-client/lib/:/usr/hdp/current/hadoop-mapreduce-client/:/usr/hdp/current/hadoop-client

remove the hadoop-aws*.jar from the classpath.

Add core-site.xml, core-default.xml, and hdfs-site.xml to the storm-hdfs jar.

add the following snippet to core-site.xml

<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>

[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar core-site.xml 
[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar hdfs-site.xml
[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar core-default.xml

jar tvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar | egrep 'core-site|hdfs-site'
  5538 Wed May 03 14:25:04 UTC 2017 core-site.xml
  8047 Tue Jan 17 14:14:10 UTC 2017 hdfs-site.xml

copy updated storm-hdfs to storm lib directory

[root@r253secure contrib]# cp storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar ../lib/

copy core-site.xml and hdfs-site.xml to the storm lib folder

copy hadoop-hdfs.jar to the storm lib folder.

Sample code to test your topology

https://github.com/rajkrrsingh/sample-storm-hdfs-app

Monday, May 8, 2017

oozie spark shell action example

workflow directory on HDFS

 hadoop fs -ls /tmp/sparkOozieShellAction/
Found 4 items
-rw-r--r--   3 oozie hdfs        178 2017-05-08 07:00 /tmp/sparkOozieShellAction/job.properties
drwxr-xr-x   - oozie hdfs          0 2017-05-08 07:01 /tmp/sparkOozieShellAction/lib
-rw-r--r--   3 oozie hdfs        279 2017-05-08 07:12 /tmp/sparkOozieShellAction/spark-pi-job.sh
-rw-r--r--   3 oozie hdfs        712 2017-05-08 07:34 /tmp/sparkOozieShellAction/workflow.xml

oozie spark share lib

[oozie@rk253 ~]$ hadoop fs -ls /user/oozie/share/lib/lib_20170508043956/spark
Found 8 items
-rw-r--r--   3 oozie hdfs     339666 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-api-jdo-3.2.6.jar
-rw-r--r--   3 oozie hdfs    1890075 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-core-3.2.10.jar
-rw-r--r--   3 oozie hdfs    1809447 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-rdbms-3.2.9.jar
-rw-r--r--   3 oozie hdfs        167 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/hive-site.xml
-rw-r--r--   3 oozie hdfs      22440 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/oozie-sharelib-spark-4.2.0.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs      44846 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/py4j-0.9-src.zip
-rw-r--r--   3 oozie hdfs     357563 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/pyspark.zip
-rw-r--r--   3 oozie hdfs  188897932 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/spark-assembly-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar

job.properties

[oozie@rk253 ~]$ cat job.properties 
nameNode= hdfs://rk253.openstack:8020 
jobTracker= rk253.openstack:8050 
oozie.wf.application.path=/tmp/sparkOozieShellAction/ 
oozie.use.system.libpath=true 

workflow.xml

[oozie@rk253 ~]$ cat workflow.xml 
<workflow-app name="WorkFlowForShellAction" xmlns="uri:oozie:workflow:0.4">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>spark-pi-job.sh
            </exec>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
		<file>/tmp/sparkOozieShellAction/spark-pi-job.sh#spark-pi-job.sh</file>
	    <capture-output/>
        </shell>
    <ok to="end"/>
    <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>"Killed job due to error"</message>
    </kill>
    <end name="end"/>
</workflow-app>

spark-pi-job.sh

[oozie@rk253 ~]$ cat spark-pi-job.sh 
/usr/hdp/2.5.3.0-37/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar 10

run oozie job

oozie job -oozie http://rk253:11000/oozie/ -config job.properties -run 
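
The submit command prints a job id, which you can use to follow the workflow (the job id below is a placeholder):

oozie job -oozie http://rk253:11000/oozie/ -info <job-id>
oozie job -oozie http://rk253:11000/oozie/ -log <job-id>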

Sunday, May 7, 2017

oozie spark action example

directory structure on HDFS

[oozie@rk253 ~]$ hadoop fs -lsr /tmp/sparkOozieAction
lsr: DEPRECATED: Please use 'ls -R' instead.
-rwxrwxrwx   3 oozie hdfs        167 2017-05-08 05:01 /tmp/sparkOozieAction/job.properties
drwxrwxrwx   - oozie hdfs          0 2017-05-08 05:04 /tmp/sparkOozieAction/lib
-rwxrwxrwx   3 oozie hdfs  110488188 2017-05-08 04:58 /tmp/sparkOozieAction/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs       1571 2017-05-08 05:46 /tmp/sparkOozieAction/workflow.xml

oozie share lib

[oozie@rk253 ~]$ hadoop fs -ls /user/oozie/share/lib/lib_20170508043956/spark
Found 8 items
-rw-r--r--   3 oozie hdfs     339666 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-api-jdo-3.2.6.jar
-rw-r--r--   3 oozie hdfs    1890075 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-core-3.2.10.jar
-rw-r--r--   3 oozie hdfs    1809447 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-rdbms-3.2.9.jar
-rw-r--r--   3 oozie hdfs        167 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/hive-site.xml
-rw-r--r--   3 oozie hdfs      22440 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/oozie-sharelib-spark-4.2.0.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs      44846 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/py4j-0.9-src.zip
-rw-r--r--   3 oozie hdfs     357563 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/pyspark.zip
-rw-r--r--   3 oozie hdfs  188897932 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/spark-assembly-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar

job.properties

[oozie@rk253 ~]$ cat job.properties 
nameNode= hdfs://rk253.openstack:8020 
jobTracker= rk253.openstack:8050 
oozie.wf.application.path=/tmp/sparkOozieAction/ 
oozie.use.system.libpath=true 
master=yarn-client

workflow.xml

[oozie@rk253 ~]$ cat workflow.xml 
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5"> 
        <start to="spark-action"/> 
        <action name="spark-action"> 
                <spark xmlns="uri:oozie:spark-action:0.1"> 
                        <job-tracker>${jobTracker}</job-tracker> 
                        <name-node>${nameNode}</name-node> 
                        <configuration> 
                        </configuration> 
                        <master>${master}</master> 
                        <name>spark pi job</name> 
                        <class>org.apache.spark.examples.SparkPi</class> 
                        <jar>${nameNode}/tmp/sparkOozieAction/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar</jar> 
                        <spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 1</spark-opts> 
                        <arg>10</arg> 
                </spark> 
                <ok to="end"/> 
                <error to="kill"/> 
        </action> 
        <kill name="kill"> 
                <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> 
        </kill> 
        <end name="end"/> 
</workflow-app> 

run

oozie job -oozie http://rk253:11000/oozie/ -config job.properties -run