Sunday, November 29, 2015

Amazon EMR: Creating a Spark Cluster and Running a Job

Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) offering for data processing and analysis. EMR provides an expandable, low-configuration service as an easier alternative to running an in-house compute cluster.
In this example, let's spin up a Spark cluster and run a Spark job that crunches Apache logs and filters out only the error lines.

Prerequisites:
an AWS account
the AWS CLI tool, installed and configured
the EMR default roles (see the commands below)
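Assuming a fresh machine, the following two commands cover the CLI setup and the default roles (the access keys and region prompted for are account-specific):

aws configure                  # prompts for access key, secret key, region and output format
aws emr create-default-roles   # creates the EMR_DefaultRole and EMR_EC2_DefaultRole IAM roles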

Spark Job
Follow these steps to create a sample job:
mkdir SampleSparkApp
cd SampleSparkApp
mkdir -p src/main/scala
cd src/main/scala
vim SimpleApp.scala

package com.example.project

/**
 * @author rsingh
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "s3://rks-clus-data/log.txt" // input log file on S3
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val errors = logData.filter(line => line.contains("error"))
    // saveAsTextFile writes a directory of part files at this path
    errors.saveAsTextFile("s3://rks-clus-data/error-log.txt")
  }
}

cd -
vim build.sbt

name := "Spark Log Job"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-streaming" % "1.5.0"
)
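Plain sbt package does not bundle dependencies into the jar, but if the job is ever built as a fat jar with sbt-assembly, the Spark artifacts should be excluded, since EMR already provides them on the cluster classpath. A minimal variant of the dependency block for that case:

// "provided" keeps Spark out of an assembled fat jar; EMR supplies it at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.5.0" % "provided"
)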

* Now build the project using sbt:
sbt package

After a successful build, the jar will be available at target/scala-2.10/spark-log-job_2.10-1.0.jar.
Upload the job jar to your S3 bucket:
aws s3 cp target/scala-2.10/spark-log-job_2.10-1.0.jar s3://rks-clus-data/job-jars/
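Optionally, a quick listing confirms the jar landed where the step definition below expects it:

aws s3 ls s3://rks-clus-data/job-jars/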

Upload the sample logs to your S3 bucket:
aws s3 cp log.txt s3://rks-clus-data/

Create the job step as follows:
cat step.json
[
  {
    "Name": "SampleSparkApp",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "com.example.project.SimpleApp",
      "s3://rks-clus-data/job-jars/spark-log-job_2.10-1.0.jar",
      "s3://rks-clus-data/log.txt",
      "s3://rks-clus-data"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]

The last two arguments are passed to SimpleApp's main method; the job above ignores them because the input and output paths are hardcoded.

Now spin up an Amazon EMR cluster with the auto-terminate option:
    aws emr create-cluster \
    --name "Single Node Spark Cluster" \
    --instance-type m3.xlarge \
    --release-label emr-4.2.0 \
    --instance-count 1 \
    --use-default-roles \
    --applications Name=Spark \
    --steps file://step.json \
    --auto-terminate

The above command will spin up a Spark cluster on EMR and run the job. The cluster will terminate automatically once the step finishes, irrespective of success or failure.
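The create-cluster command prints the new cluster id (of the form j-XXXXXXXXXXXXX). Substituting that id below (the id shown is a placeholder) lets you watch the run and fetch the results:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX   # overall cluster state
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX         # status of the spark-submit step
aws s3 ls s3://rks-clus-data/error-log.txt/             # job output, written as part files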
