Sunday, November 29, 2015

Amazon EMR : Creating a Spark Cluster and Running a Job

Amazon Elastic MapReduce (EMR) is an Amazon Web Service (AWS) for data processing and analysis. Amazon EMR offers an expandable, low-configuration service as an easier alternative to running an in-house compute cluster.
In this example we will spin up a Spark cluster and run a Spark job that crunches Apache logs and keeps only the error lines.

Prerequisites:
AWS account
Install and configure the AWS CLI tool
Create the default EMR roles (example commands below)
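For reference, the CLI setup and role creation boil down to the two commands below (the access key, secret key, and region that aws configure prompts for are your own):
aws configure                  # prompts for access key id, secret access key, default region and output format
aws emr create-default-roles   # creates the default EMR service and EC2 instance roles if they do not already exist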

Spark Job
Follow these steps to create the sample job:
mkdir SampleSparkApp
cd SampleSparkApp
mkdir -p src/main/scala
cd src/main/scala
vim SimpleApp.scala

package com.example.project

/**
 * @author rsingh
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "s3://rks-clus-data/log.txt" // input log file on S3
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val errors = logData.filter(line => line.contains("error"))
    errors.saveAsTextFile("s3://rks-clus-data/error-log.txt") // written as a directory of part files
  }
}

cd -
vim build.sbt

name := "Spark Log Job"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.5.0","org.apache.spark" %% "spark-streaming" % "1.5.0")

Now build the project using sbt:
sbt package

After a successful build, the jar will be available at target/scala-2.10/spark-log-job_2.10-1.0.jar.
Upload the job jar to your S3 bucket:
aws s3 cp target/scala-2.10/spark-log-job_2.10-1.0.jar s3://rks-clus-data/job-jars/

Upload the sample log file to your S3 bucket:
aws s3 cp log.txt s3://rks-clus-data/
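If you don't have an Apache log handy, a tiny made-up stand-in is enough to exercise the job, since it only filters on the substring "error"; the contents below are invented for illustration:
cat > log.txt <<'EOF'
[Sun Nov 29 10:01:12 2015] [notice] Apache started
[Sun Nov 29 10:02:30 2015] [error] File does not exist: /var/www/favicon.ico
[Sun Nov 29 10:03:45 2015] [error] client denied by server configuration
EOF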

Create the job step definition as follows (the last two arguments are passed through to the application; SimpleApp above ignores them because its input and output paths are hard-coded):
cat step.json
[
  {
    "Name": "SampleSparkApp",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "com.example.project.SimpleApp",
      "s3://rks-clus-data/job-jars/spark-log-job_2.10-1.0.jar",
      "s3://rks-clus-data/log.txt",
      "s3://rks-clus-data"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]

Now spin up an Amazon EMR cluster with the auto-terminate option:
    aws emr create-cluster \
    --name "Single Node Spark Cluster" \
    --instance-type m3.xlarge \
    --release-label emr-4.2.0 \
    --instance-count 1 \
    --use-default-roles \
    --applications Name=Spark \
    --steps file://step.json \
    --auto-terminate

The above command will spin up a Spark cluster on EMR and run the job. The cluster will terminate automatically once the step completes, irrespective of success or failure.
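The create-cluster call prints a cluster id (of the form j-XXXXXXXXXXXXX). If you would rather watch the run than wait blindly, the standard CLI calls below do the trick (the cluster id is a placeholder):
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX --query 'Steps[].{Name:Name,State:Status.State}'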

Tuesday, November 24, 2015

Apache Storm Sample Application

These are baby steps to build and run a Storm application.
Prerequisites: Java 7, Maven
Create a default Maven archetype:
$ mvn archetype:generate
$ cd StormSampleApp/
Update pom.xml as follows: https://github.com/rajkrrsingh/StormSampleApp/blob/master/pom.xml
Create a driver class composed of the Spout definition, Bolt definitions, and the topology configuration, as in:
https://github.com/rajkrrsingh/StormSampleApp/blob/master/src/main/java/com/rajkrrsingh/storm/SampleTopology.java
$ mvn package
$ storm jar target/StormSampleApp-1.0-SNAPSHOT-jar-with-dependencies.jar com.rajkrrsingh.storm.SampleTopology
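Once the topology is submitted to the cluster, the stock storm client commands can be used to check on it and take it down again (the topology name is whatever your driver registered; storm list shows it):
$ storm list
$ storm kill <topology-name>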

Saturday, November 21, 2015

Apache Drill : Run query through Microsoft Excel

In this short demo I will show how to set up Drill with Microsoft Excel and execute a query. In this example I am using Microsoft Excel 2011.
1. Download Apache Drill from https://drill.apache.org/download/
2. Extract it to some location and start Drill in embedded mode using $DRILL_HOME/bin/drill-embedded
3. Download and configure the MapR Drill ODBC driver according to your environment: https://drill.apache.org/docs/installing-the-driver-on-mac-os-x/
4. Update the ODBC data source configuration as follows:
ConnectionType=Direct
HOST=localhost
PORT=31010
ZKQuorum=
ZKClusterID=
5. Open MS Excel, go to Data -> Get External Data -> New Database Query -> select the Drill data source.
6. Click Test Query to run your query.

Friday, November 20, 2015

Apache Drill : Saving query output to CSV, Parquet, and JSON

A quick way to save query results in different formats (JSON, Parquet, CSV) depending on your needs:
0: jdbc:drill:zk=> alter session set `store.format`='csv';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+
1 row selected (0.13 seconds)
0: jdbc:drill:zk=> create table dfs.tmp.`emp_csv` as select * from cp.`default`.`employee.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1155                       |
+-----------+----------------------------+
1 row selected (0.444 seconds)
0: jdbc:drill:zk=> alter session set `store.format`='parquet';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+
1 row selected (0.133 seconds)
0: jdbc:drill:zk=> create table dfs.tmp.`emp_parquet` as select * from cp.`default`.`employee.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1155                       |
+-----------+----------------------------+
1 row selected (0.826 seconds)
0: jdbc:drill:zk=> alter session set `store.format`='json';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+
1 row selected (0.116 seconds)
0: jdbc:drill:zk=> create table dfs.tmp.`emp_json` as select * from cp.`default`.`employee.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1155                       |
+-----------+----------------------------+
1 row selected (0.322 seconds)
0: jdbc:drill:zk=> 

//output
[root@RKS ~]# hadoop fs -ls /tmp/emp*
Found 1 items
-rwxr-xr-x   3 mapr mapr     175674 2015-11-21 01:58 /tmp/emp_csv/0_0_0.csv
Found 1 items
-rwxr-xr-x   3 mapr mapr     563564 2015-11-21 02:02 /tmp/emp_json/0_0_0.json
Found 1 items
-rwxr-xr-x   3 mapr mapr      52179 2015-11-21 02:01 /tmp/emp_parquet/0_0_0.parquet
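A quick way to spot-check what Drill actually wrote is to read a few lines of the CSV output back (path taken from the listing above):
hadoop fs -cat /tmp/emp_csv/0_0_0.csv | head -5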

Wednesday, November 18, 2015

Sqoop Import Error due to 'must contain '$CONDITIONS' in WHERE clause'

Sqoop import as Parquet files fails with the following exception:
$sqoop import --connect jdbc:mysql://127.0.0.1:3306/mysql --username root  --query 'select * from user where  $CONDITION'  --target-dir  /user/mapr/import-dir --as-parquetfile  -m 1

ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from user where 1=1 $CONDITION] must contain '$CONDITIONS' in WHERE clause
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:300)
 at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1833)
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter
 at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1651)
 at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
 at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
The right way is to use $CONDITIONS (plural) and, when the query is wrapped in double quotes, escape the dollar sign so the shell does not expand it:
sqoop import --connect jdbc:mysql://127.0.0.1:3306/metastore --username root  --query "select * from TBLS where \$CONDITIONS"   --target-dir  /user/mapr/import_dir --as-parquetfile  -m 1
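Equivalently, single quotes keep $CONDITIONS literal so no escaping is needed; both forms work, the point is that the shell must not expand the variable before Sqoop sees it:
sqoop import --connect jdbc:mysql://127.0.0.1:3306/metastore --username root --query 'select * from TBLS where $CONDITIONS' --target-dir /user/mapr/import_dir --as-parquetfile -m 1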

Tuesday, November 17, 2015

Load CSV data into Hive partitioned tables

1. Create an external partitioned table at some location:

CREATE EXTERNAL TABLE `csvtable`(
`col1` string,
`col2` string,
`col3` string,
`col4` string)
PARTITIONED BY (
`year` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/mycsvtable';

2. Create a directory under it named after your partition:
hadoop fs -mkdir /mycsvtable/year=2001

3. Copy some CSV files to the directory:
hadoop fs -put sample.csv /mycsvtable/year=2001

4. Hive still does not know about this partition because the metastore has not been updated yet; if you query
csvtable now it will return no results. To resolve this, alter the table so that the metastore is updated accordingly:
alter table csvtable add partition (year='2001');
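Alternatively, if you have dropped in several partition directories at once, MSCK REPAIR TABLE lets Hive discover them in one go instead of adding each partition by hand (a sketch, assuming the hive CLI is on your path):
hive -e "MSCK REPAIR TABLE csvtable;"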

Now you are good to go.

Running Zeppelin notebook on MapR cluster

Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more. These are the minimal steps to get it running on a MapR 5.0 cluster.
Ensure Java is installed, then:
yum install git
yum install npm

# install maven
wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn

git clone https://github.com/apache/incubator-zeppelin.git
cd incubator-zeppelin/
mvn clean package -Pmapr50 -DskipTests

# update conf/zeppelin-env.sh
export SPARK_HOME=
export HADOOP_HOME=
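
For reference, on the MapR 5.0 layout used elsewhere in these posts the two variables end up pointing at something like the paths below; treat them as an example and adjust to your own installation:
export SPARK_HOME=/opt/mapr/spark/spark-1.4.1      # Spark install location on a MapR node (adjust as needed)
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.7.0   # Hadoop install location on a MapR node (adjust as needed)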

#optional configuration
conf/zeppelin-env.sh
conf/zeppelin-site.xml

#Run
bin/zeppelin-daemon.sh start
Open a web browser and go to http://hostname:8080/

#To Stop
bin/zeppelin-daemon.sh stop

Tuesday, November 10, 2015

Hive metastore upgrade manually

Env: Hive 0.12 running Derby in embedded mode, upgraded to Hive 1.0

Download derbytools-10.4.2.0.jar.
Dump the current schema:
java -cp derbytools-10.4.2.0.jar:$HIVE_HOME/lib/derby-10.4.2.0.jar org.apache.derby.tools.dblook -d 'jdbc:derby:metastore_db' > schema-12.sql
(metastore_db is a relative location)

Upgrade Hive to 1.0 but don't start it yet.
Copy the following files from $HIVE_HOME/scripts/metastore/upgrade/derby/:
upgrade-0.12.0-to-0.13.0.derby.sql
018-HIVE-6757.derby.sql
017-HIVE-6458.derby.sql
016-HIVE-6386.derby.sql
hive-txn-schema-0.13.0.derby.sql
019-HIVE-7784.derby.sql
upgrade-0.13.0-to-0.14.0.derby.sql


Download derbytools-10.10.1.1.jar and start the ij tool:
java -cp derbytools-10.10.1.1.jar:derby-10.10.1.1.jar org.apache.derby.tools.ij
connect 'jdbc:derby:metastore_db';
run 'upgrade-0.12.0-to-0.13.0.derby.sql';
run 'upgrade-0.13.0-to-0.14.0.derby.sql';

Now start Hive; the old tables are visible and can be queried without any issue.
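If your Hive 1.0 installation ships the schematool utility, it can be used as a sanity check to report the metastore schema version it now sees (shown here under that assumption):
$HIVE_HOME/bin/schematool -dbType derby -info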

Thursday, November 5, 2015

Building Spark Notebook for MapR 5.0

Env : MapR 5.0
Spark : 1.4.1

git clone https://github.com/andypetrella/spark-notebook.git

// add mapr repo to sbt proxy repo
vim ~/.sbt/repositories
[repositories]
local
maven-central
typesafe: http://repo.typesafe.com/typesafe/releases/
typesafe-snapshots: http://repo.typesafe.com/typesafe/snapshots/
mapr: http://repository.mapr.com/maven

cd spark-notebook
sbt -Dspark.version=1.4.1 -Dhadoop.version=2.7.0-mapr-1506 -Dwith.hive=true -Dwith.parquet=true clean dist

Add the environment variables:

$ export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
$ export EXTRA_CLASSPATH=/opt/mapr/lib/commons-configuration-1.6.jar:/opt/mapr/lib/hadoop-auth-2.7.0.jar:/opt/mapr/lib/maprfs-5.0.0-mapr.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/zookeeper-3.4.5-mapr-1503.jar

Copy the Spark assembly jar to /apps/spark:
hadoop fs -put /opt/mapr/spark/spark-1.4.1/lib/spark-assembly-1.4.1-hadoop2.5.1-mapr-1501.jar /apps/spark/spark-assembly.jar


Now start the notebook using sbt run:
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000

(Server started, use Ctrl+D to stop and go back to the console...)
This message indicates that everything went smoothly and the notebook has started on port 9000. To access the notebook, open a browser and go to http://<hostname>:9000 and you are good to go.
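A quick way to confirm the server is listening without opening a browser is a plain HTTP request from the same host, for example:
curl -s http://localhost:9000/ | head -n 5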