Saturday, May 27, 2017

Spark LLAP Setup for Spark Thrift Server

ENV HDP-2.6.0.3-8

Download the spark-llap assembly jar from http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/ and place it in /usr/hdp/current/spark-client/lib/ (the path used in the configuration below).
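
For example, assuming the standard Maven repository layout and the HDP 2.6.0.3-8 build referenced in this post (verify the exact path against the repository listing first), the download and copy look roughly like this:

# the version directory below is an assumption derived from the jar name used later in this post
wget http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.6.0.3-8/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
cp spark-llap-1.0.0.2.6.0.3-8-assembly.jar /usr/hdp/current/spark-client/lib/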

Add the following to Custom spark-thrift-sparkconf:

spark_thrift_cmd_opts --jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.executor.extraClassPath /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.sql.hive.hiveserver2.url jdbc:hive2://hostname1.hwxblr.com:10500/;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=${user}
spark.hadoop.hive.zookeeper.quorum hostname1.hwxblr.com:2181

Add the following to Custom spark-defaults:

spark.sql.hive.hiveserver2.url jdbc:hive2://hostname1.hwxblr.com:10500/;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=${user}
spark.jars /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar
spark.hadoop.hive.zookeeper.quorum hostname1.hwxblr.com:2181
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.executor.extraClassPath /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar

Start the Spark Thrift Server from Ambari and run a query as follows:

beeline -u "jdbc:hive2://hostname3.hwxblr.com:10015/;principal=hive/_HOST@EXAMPLE.COM" -e "select * from test;"

If your query fails with the following exception, check that the spark-llap assembly jar is available on the executors' classpath (revisit spark.executor.extraClassPath):

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hostname1.hwxblr.com): java.lang.NullPointerException
	at org.apache.hadoop.hive.llap.tez.LlapProtocolClientProxy.<init>(LlapProtocolClientProxy.java:94)
	at org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient.<init>(LlapTaskUmbilicalExternalClient.java:119)
	at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getRecordReader(LlapBaseInputFormat.java:143)
	at org.apache.hadoop.hive.llap.LlapRowInputFormat.getRecordReader(LlapRowInputFormat.java:51)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:240)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:388)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
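
A quick way to confirm the cause is to verify that the assembly jar actually exists at the configured path on every node that runs executors; a minimal sketch (the host list is a placeholder, substitute your worker nodes):

# placeholder host list: replace with the nodes where Spark executors run
for h in hostname1.hwxblr.com hostname3.hwxblr.com; do
  ssh $h "ls -l /usr/hdp/current/spark-client/lib/spark-llap-1.0.0.2.6.0.3-8-assembly.jar"
done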

Sunday, May 21, 2017

Steps to set up a KDC before enabling Kerberos through Ambari on a Hortonworks cluster

ENV

#### OS centos7
#### REALM EXAMPLE.COM (update accordingly)
#### The KDC and admin server are running on rks253secure.hdp.local (update accordingly)

Install the required packages:

yum install -y krb5-server krb5-workstation pam_krb5
cd  /var/kerberos/krb5kdc

Modify the kadmin ACLs:

cat kadm5.acl 
*/admin@EXAMPLE.COM	*

Modify kdc.conf:

cat kdc.conf 
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 EXAMPLE.COM = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }

Modify krb5.conf on the node where the Ambari server is running.

cat /etc/krb5.conf

[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = EXAMPLE.COM
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  EXAMPLE.COM = {
    admin_server = rks253secure.hdp.local
    kdc = rks253secure.hdp.local
  }

Create the KDC database:

kdb5_util create -s -r EXAMPLE.COM

Start and enable the KDC and kadmin services:

systemctl start krb5kdc kadmin
systemctl enable krb5kdc kadmin
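
Optionally confirm that both services came up:

# both units should report active (running)
systemctl status krb5kdc kadmin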

Create the principal root/admin@EXAMPLE.COM:

# kadmin.local
kadmin.local: addprinc root/admin
kadmin.local: quit
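
The same principal can also be created non-interactively, which is handy for scripting (the password below is only a placeholder):

# scripted alternative to the interactive session above; replace the placeholder password
kadmin.local -q "addprinc -pw ChangeMe123 root/admin"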

Test that you can obtain a TGT after supplying the password:

kinit root/admin@EXAMPLE.COM
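
If the kinit succeeds, klist should show a ticket-granting ticket for the realm:

# expect an entry like krbtgt/EXAMPLE.COM@EXAMPLE.COM in the output
klist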

Now run the Enable Kerberos wizard from the Ambari server; it will ask you to supply the KDC and admin server host names and the REALM.


Thursday, May 18, 2017

How to Configure and Run Storm AutoHDFS Plugin (Sample Application)

Add these configurations to Custom storm-site:

nimbus.autocredential.plugins.classes ["org.apache.storm.hdfs.common.security.AutoHDFS"]
nimbus.credential.renewers.classes ["org.apache.storm.hdfs.common.security.AutoHDFS"]
hdfs.keytab.file  /etc/security/keytabs/hdfs.headless.keytab
hdfs.kerberos.principal hdfs-s253_kerb@LAB.HORTONWORKS.NET
nimbus.credential.renewers.freq.secs 518400

nimbus.childopts -Xmx1024m _JAAS_PLACEHOLDER -javaagent:/usr/hdp/current/storm-nimbus/contrib/storm-jmxetric/lib/jmxetric-1.0.4.jar=host=localhost,port=8649,wireformat31x=true,mode=multicast,config=/usr/hdp/current/storm-nimbus/contrib/storm-jmxetric/conf/jmxetric-conf.xml:/etc/hadoop/conf/hdfs-site.xml:/etc/hadoop/conf/core-site.xml:/etc/hbase/conf/hbase-site.xml,process=Nimbus_JVM

Add the following to storm-env:

export STORM_EXT_CLASSPATH=/usr/hdp/current/hbase-client/lib/:/usr/hdp/current/hadoop-mapreduce-client/:/usr/hdp/current/hadoop-client

Remove hadoop-aws*.jar from the classpath.
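
One way to do this, assuming the hadoop-aws jar sits under the Hadoop directories referenced by STORM_EXT_CLASSPATH (the paths below are illustrative), is to locate it first and then move it aside:

# find any hadoop-aws jars that would land on the Storm classpath
find /usr/hdp/current/hadoop-client /usr/hdp/current/hadoop-mapreduce-client -name 'hadoop-aws*.jar'
# move the matches out of the way rather than deleting them
mv /usr/hdp/current/hadoop-mapreduce-client/hadoop-aws*.jar /tmp/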

Add core-site.xml, core-default.xml, and hdfs-site.xml to the storm-hdfs jar.

Add the following snippet to core-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>

[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar core-site.xml 
[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar hdfs-site.xml
[root@r253secure contrib]# jar uvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar core-default.xml

jar tvf storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar | egrep 'core-site|hdfs-site'
  5538 Wed May 03 14:25:04 UTC 2017 core-site.xml
  8047 Tue Jan 17 14:14:10 UTC 2017 hdfs-site.xml

Copy the updated storm-hdfs jar to the Storm lib directory:

[root@r253secure contrib]# cp storm-hdfs/storm-hdfs-1.0.1.2.5.3.0-37.jar ../lib/

Copy core-site.xml and hdfs-site.xml to the Storm lib folder.

Copy hadoop-hdfs.jar to the Storm lib folder as well.
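
A rough sketch of these two copy steps (the Storm lib path and the hadoop-hdfs jar location are assumptions, adjust them to your installation):

# copy the cluster configuration files into the Storm lib directory
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /usr/hdp/current/storm-nimbus/lib/
# copy the hadoop-hdfs client jar as well; the exact file name depends on the HDP build
cp /usr/hdp/current/hadoop-hdfs-client/hadoop-hdfs-*.jar /usr/hdp/current/storm-nimbus/lib/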

Sample code to test your topology

https://github.com/rajkrrsingh/sample-storm-hdfs-app

Monday, May 8, 2017

Oozie Spark Shell Action Example

Workflow directory on HDFS:

 hadoop fs -ls /tmp/sparkOozieShellAction/
Found 4 items
-rw-r--r--   3 oozie hdfs        178 2017-05-08 07:00 /tmp/sparkOozieShellAction/job.properties
drwxr-xr-x   - oozie hdfs          0 2017-05-08 07:01 /tmp/sparkOozieShellAction/lib
-rw-r--r--   3 oozie hdfs        279 2017-05-08 07:12 /tmp/sparkOozieShellAction/spark-pi-job.sh
-rw-r--r--   3 oozie hdfs        712 2017-05-08 07:34 /tmp/sparkOozieShellAction/workflow.xml

Oozie Spark sharelib:

[oozie@rk253 ~]$ hadoop fs -ls /user/oozie/share/lib/lib_20170508043956/spark
Found 8 items
-rw-r--r--   3 oozie hdfs     339666 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-api-jdo-3.2.6.jar
-rw-r--r--   3 oozie hdfs    1890075 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-core-3.2.10.jar
-rw-r--r--   3 oozie hdfs    1809447 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-rdbms-3.2.9.jar
-rw-r--r--   3 oozie hdfs        167 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/hive-site.xml
-rw-r--r--   3 oozie hdfs      22440 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/oozie-sharelib-spark-4.2.0.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs      44846 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/py4j-0.9-src.zip
-rw-r--r--   3 oozie hdfs     357563 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/pyspark.zip
-rw-r--r--   3 oozie hdfs  188897932 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/spark-assembly-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar

job.properties

[oozie@rk253 ~]$ cat job.properties 
nameNode= hdfs://rk253.openstack:8020 
jobTracker= rk253.openstack:8050 
oozie.wf.application.path=/tmp/sparkOozieShellAction/ 
oozie.use.system.libpath=true 

workflow.xml

[oozie@rk253 ~]$ cat workflow.xml 
<workflow-app name="WorkFlowForShellAction" xmlns="uri:oozie:workflow:0.4">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>spark-pi-job.sh</exec>
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
		<file>/tmp/sparkOozieShellAction/spark-pi-job.sh#spark-pi-job.sh</file>
	    <capture-output/>
        </shell>
    <ok to="end"/>
    <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>"Killed job due to error"</message>
    </kill>
    <end name="end"/>
</workflow-app>

spark-pi-job.sh

[oozie@rk253 ~]$ cat spark-pi-job.sh 
/usr/hdp/2.5.3.0-37/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar 10

Run the Oozie job:

oozie job -oozie http://rk253:11000/oozie/ -config job.properties -run 
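
The -run command prints a workflow job id; progress can then be checked with the standard Oozie CLI (the id below is a placeholder):

# replace the id with the one returned by -run
oozie job -oozie http://rk253:11000/oozie/ -info 0000001-170508062540168-oozie-oozi-W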

Sunday, May 7, 2017

Oozie Spark Action Example

Directory structure on HDFS:

[oozie@rk253 ~]$ hadoop fs -lsr /tmp/sparkOozieAction
lsr: DEPRECATED: Please use 'ls -R' instead.
-rwxrwxrwx   3 oozie hdfs        167 2017-05-08 05:01 /tmp/sparkOozieAction/job.properties
drwxrwxrwx   - oozie hdfs          0 2017-05-08 05:04 /tmp/sparkOozieAction/lib
-rwxrwxrwx   3 oozie hdfs  110488188 2017-05-08 04:58 /tmp/sparkOozieAction/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs       1571 2017-05-08 05:46 /tmp/sparkOozieAction/workflow.xml

Oozie sharelib:

[oozie@rk253 ~]$ hadoop fs -ls /user/oozie/share/lib/lib_20170508043956/spark
Found 8 items
-rw-r--r--   3 oozie hdfs     339666 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-api-jdo-3.2.6.jar
-rw-r--r--   3 oozie hdfs    1890075 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-core-3.2.10.jar
-rw-r--r--   3 oozie hdfs    1809447 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/datanucleus-rdbms-3.2.9.jar
-rw-r--r--   3 oozie hdfs        167 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/hive-site.xml
-rw-r--r--   3 oozie hdfs      22440 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/oozie-sharelib-spark-4.2.0.2.5.3.0-37.jar
-rw-r--r--   3 oozie hdfs      44846 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/py4j-0.9-src.zip
-rw-r--r--   3 oozie hdfs     357563 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/pyspark.zip
-rw-r--r--   3 oozie hdfs  188897932 2017-05-08 04:42 /user/oozie/share/lib/lib_20170508043956/spark/spark-assembly-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar

job.properties

[oozie@rk253 ~]$ cat job.properties 
nameNode= hdfs://rk253.openstack:8020 
jobTracker= rk253.openstack:8050 
oozie.wf.application.path=/tmp/sparkOozieAction/ 
oozie.use.system.libpath=true 
master=yarn-client

workflow.xml

[oozie@rk253 ~]$ cat workflow.xml 
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5"> 
        <start to="spark-action"/> 
        <action name="spark-action"> 
                <spark xmlns="uri:oozie:spark-action:0.1"> 
                        <job-tracker>${jobTracker}</job-tracker> 
                        <name-node>${nameNode}</name-node> 
                        <configuration> 
                        </configuration> 
                        <master>${master}</master> 
                        <name>spark pi job</name> 
                        <class>org.apache.spark.examples.SparkPi</class> 
                        <jar>${nameNode}/tmp/sparkOozieAction/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar</jar> 
                        <spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 1</spark-opts> 
                        <arg>10</arg> 
                </spark> 
                <ok to="end"/> 
                <error to="kill"/> 
        </action> 
        <kill name="kill"> 
                <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> 
        </kill> 
        <end name="end"/> 
</workflow-app> 

Run the Oozie job:

oozie job -oozie http://rk253:11000/oozie/ -config job.properties -run
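
As with the shell action above, the id returned by -run can be used to check status and pull logs (the id shown is a placeholder):

# status of the workflow
oozie job -oozie http://rk253:11000/oozie/ -info 0000002-170508062540168-oozie-oozi-W
# console log of the action
oozie job -oozie http://rk253:11000/oozie/ -log 0000002-170508062540168-oozie-oozi-W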