Sunday, November 29, 2015

Amazon EMR : Creating a Spark Cluster and Running a Job

Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) offering for data processing and analysis. Amazon EMR provides a scalable, low-configuration service as an easier alternative to running an in-house compute cluster.
In this example, let's spin up a Spark cluster and run a Spark job that crunches Apache logs and keeps only the error lines.

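AWS Account
Install and configure the AWS CLI tool (aws configure).
Create the default EMR roles (aws emr create-default-roles); the --use-default-roles option used later relies on them.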

Spark Job
Follow these steps to create a sample job:
mkdir SampleSparkApp
cd SampleSparkApp
mkdir -p src/main/scala
cd src/main/scala
vim SimpleApp.scala

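package com.example.project

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

/**
 * Filters an Apache log down to the lines that contain "error".
 * @author rsingh
 */
object SimpleApp {
  def main(args: Array[String]) {
    // Input log in S3; replace the bucket with your own
    val logFile = "s3://rks-clus-data/log.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    // Keep only the lines containing the substring "error" (case-sensitive)
    val errors = logData.filter(line => line.contains("error"))
    // Assumption: the original listing is cut off here. Counting the matches is a
    // minimal action that forces the lazy filter to run; the result goes to the driver log.
    println("Lines with error: " + errors.count())
    sc.stop()
  }
}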

cd -
vim build.sbt

name := "Spark Log Job"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.5.0","org.apache.spark" %% "spark-streaming" % "1.5.0")

Now build the project using sbt:
sbt package

After a successful build, the jar will be available at target/scala-2.10/spark-log-job_2.10-1.0.jar
Upload the job jar to the S3 bucket:
aws s3 cp target/scala-2.10/spark-log-job_2.10-1.0.jar s3://rks-clus-data/job-jars/
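
If no Apache log is at hand, a small test file can be created locally before the upload. The sketch below writes a hypothetical three-line log.txt (the log lines are invented for illustration) and counts the lines containing "error", which is the same case-sensitive substring test the Spark job applies. The count gives a value to compare the job's result against.

```shell
# Hypothetical sample data: three Apache-style log lines, two of which contain "error".
# Note: this overwrites any log.txt in the current directory.
cat > log.txt <<'EOF'
[Sun Nov 29 10:00:01 2015] [error] [client 10.0.0.1] File does not exist: /var/www/favicon.ico
[Sun Nov 29 10:00:02 2015] [notice] Apache configured -- resuming normal operations
[Sun Nov 29 10:00:03 2015] [error] [client 10.0.0.2] client denied by server configuration
EOF

# Same predicate as the Spark filter: lines containing the literal string "error".
error_count=$(grep -c 'error' log.txt)
echo "error lines: $error_count"
```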

Upload the sample logs to your S3 bucket:
aws s3 cp log.txt s3://rks-clus-data/

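Create the job steps as follows. The step assumes the EMR 4.x convention of invoking spark-submit through command-runner.jar, and points at the jar uploaded above.
cat step.json
[{
  "Name": "SampleSparkApp",
  "Type": "CUSTOM_JAR", "Jar": "command-runner.jar",
  "Args": ["spark-submit", "--deploy-mode", "cluster",
           "--class", "com.example.project.SimpleApp",
           "s3://rks-clus-data/job-jars/spark-log-job_2.10-1.0.jar"],
  "ActionOnFailure": "TERMINATE_CLUSTER"
}]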

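Because a malformed step.json only fails once the cluster request is submitted, it can help to syntax-check the file first. The helper below is a minimal sketch: validate_steps is a hypothetical name, and it assumes python3 is on the PATH. It checks only that the file parses as JSON, not that it matches the EMR step schema.

```shell
# validate_steps FILE: exits 0 if FILE parses as JSON, non-zero otherwise.
# Assumption: python3 is installed; json.tool ships with the Python standard library.
validate_steps() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}
```

For example, `validate_steps step.json && echo "step.json parses as JSON"` prints the message only when the file is well-formed.
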
Now spin up an Amazon EMR cluster with the auto-terminate option:
    aws emr create-cluster \
    --name "Single Node Spark Cluster" \
    --instance-type m3.xlarge \
    --release-label emr-4.2.0 \
    --instance-count 1 \
    --use-default-roles \
    --applications Name=Spark \
    --steps file://step.json \
    --auto-terminate

The above command will spin up a Spark cluster on EMR, run the job step, and then terminate the cluster automatically, irrespective of success or failure.
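
To follow the cluster while it runs, the cluster ID printed by create-cluster can be passed to the describe and list calls of the AWS CLI. The two helpers below are a sketch; the function names cluster_state and step_states are hypothetical, and they assume the AWS CLI is configured as described earlier.

```shell
# cluster_state CLUSTER_ID: prints the cluster's current state
# (for example STARTING, RUNNING, TERMINATED).
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}

# step_states CLUSTER_ID: prints one "name<TAB>state" line per step.
step_states() {
  aws emr list-steps --cluster-id "$1" \
    --query 'Steps[].[Name,Status.State]' --output text
}
```

Usage would look like `cluster_state j-XXXXXXXXXXXXX`, where the ID is a placeholder for the value returned when the cluster was created.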
