Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) offering for data processing and analysis. Amazon EMR provides an expandable, low-configuration service as an easier alternative to running an in-house compute cluster.
In this example, let's spin up a Spark cluster and run a Spark job that crunches Apache logs and filters out only the error lines.
Prerequisites:
An AWS account
Install and configure the AWS CLI tool
Create the default EMR roles (see the commands below)
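The last two prerequisites can be handled from the command line. A minimal sketch, assuming the AWS CLI is already installed: aws configure prompts for your access key, secret key, and default region, and create-default-roles creates the EMR_DefaultRole and EMR_EC2_DefaultRole roles if they do not already exist.

aws configure
aws emr create-default-roles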
Spark Job
Follow these steps to create a sample job.
mkdir SampleSparkApp
cd SampleSparkApp
mkdir -p src/main/scala
cd src/main/scala
vim SimpleApp.scala

package com.example.project

/**
 * @author rsingh
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "s3://rks-clus-data/log.txt" // input log file on S3
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val errors = logData.filter(line => line.contains("error"))
    errors.saveAsTextFile("s3://rks-clus-data/error-log.txt")
  }
}

cd -
vim build.sbt

name := "Spark Log Job"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.5.0", "org.apache.spark" %% "spark-streaming" % "1.5.0")

Now build the project using sbt:

sbt package

After a successful build the jar will be available at target/scala-2.10/spark-log-job_2.10-1.0.jar. Upload the job jar to the S3 bucket:
aws s3 cp target/scala-2.10/spark-log-job_2.10-1.0.jar s3://rks-clus-data/job-jars/
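It can be worth confirming that the compiled class name in the jar matches what the step definition will reference later (com.example.project.SimpleApp). A quick check, assuming the standard JDK jar tool is on the path:

jar tf target/scala-2.10/spark-log-job_2.10-1.0.jar | grep SimpleApp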
Upload the sample logs to your S3 bucket:
aws s3 cp log.txt s3://rks-clus-data/
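Before launching the cluster you can confirm that both the jar and the log file landed in the bucket, for example:

aws s3 ls --recursive s3://rks-clus-data/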
Create the job step definition as follows:
cat step.json
[
  {
    "Name": "SampleSparkApp",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "com.example.project.SimpleApp",
      "s3://rks-clus-data/job-jars/spark-log-job_2.10-1.0.jar",
      "s3://rks-clus-data/log.txt",
      "s3://rks-clus-data"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]
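As an aside, the same step file could also be submitted to an already running EMR cluster with add-steps (the cluster id below is a placeholder; in this walkthrough we instead launch a fresh cluster that runs the step and shuts itself down):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://step.json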
Now spin up an Amazon EMR cluster with the auto-terminate option:
aws emr create-cluster \
  --name "Single Node Spark Cluster" \
  --instance-type m3.xlarge \
  --release-label emr-4.2.0 \
  --instance-count 1 \
  --use-default-roles \
  --applications Name=Spark \
  --steps file://step.json \
  --auto-terminate
The above command will spin up a Spark cluster on EMR and run the job. The cluster terminates automatically once the step completes, irrespective of success or failure.
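Because SimpleApp writes its result with saveAsTextFile, the filtered error lines land as part files under the error-log.txt prefix in the bucket. Once the run has finished you can check the cluster state and list the output, for example (the cluster id comes from the create-cluster output; j-XXXXXXXXXXXXX is a placeholder):

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
aws s3 ls --recursive s3://rks-clus-data/error-log.txt/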