Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) offering for data processing and analysis. Amazon EMR provides an expandable, low-configuration service as an easier alternative to running an in-house compute cluster.
In this example, let's spin up a Spark cluster on EMR and run a Spark job that crunches Apache logs and filters out only the error lines.
Prerequisites:
AWS Account
install and configure the AWS CLI tool
create default roles (commands for these two steps are shown right after this list)
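The CLI setup and the default roles can usually be handled with the two commands below (a minimal sketch; aws configure will prompt for your access key, secret key, default region and output format):
# configure credentials and a default region for the CLI
aws configure
# create EMR_DefaultRole and EMR_EC2_DefaultRole, used later by --use-default-roles
aws emr create-default-roles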
Spark Job
follow these steps to create a sample job
mkdir SampleSparkApp
cd SampleSparkApp
mkdir -p src/main/scala
cd src/main/scala
vim SimpleApp.scala
package com.example.project

/**
 * @author rsingh
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // input and output locations are hardcoded here, so the extra
    // arguments passed in step.json are ignored by this job
    val logFile = "s3://rks-clus-data/log.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // load the log file and keep only the lines containing "error"
    val logData = sc.textFile(logFile, 2).cache()
    val errors = logData.filter(line => line.contains("error"))
    // saveAsTextFile writes a directory of part files under this S3 prefix
    errors.saveAsTextFile("s3://rks-clus-data/error-log.txt")
    sc.stop()
  }
}
cd -
vim build.sbt
name := "Spark Log Job"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-streaming" % "1.5.0"
)
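Since EMR already provides Spark at runtime via spark-submit, it is common to mark the Spark artifacts with the provided scope; with plain sbt package this makes no difference to the jar, but it keeps an eventual sbt-assembly fat jar from bundling Spark (an optional tweak, not required for this walkthrough):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.5.0" % "provided"
)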
now build the project using sbt
sbt package
after a successful build the jar will be available at target/scala-2.10/spark-log-job_2.10-1.0.jar
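Before uploading, you can optionally verify that the compiled class made it into the jar (a quick sanity check, assuming the JDK jar tool is on your PATH):
jar tf target/scala-2.10/spark-log-job_2.10-1.0.jar | grep SimpleApp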
upload the job jar to your s3 bucket
aws s3 cp target/scala-2.10/spark-log-job_2.10-1.0.jar s3://rks-clus-data/job-jars/
upload the sample logs to your s3 bucket
aws s3 cp log.txt s3://rks-clus-data/
create the job step definition as follows
cat step.json
[
  {
    "Name": "SampleSparkApp",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "com.example.project.SimpleApp",
      "s3://rks-clus-data/job-jars/spark-log-job_2.10-1.0.jar",
      "s3://rks-clus-data/log.txt",
      "s3://rks-clus-data"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]
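If you already have an EMR cluster running with Spark installed, the same step file can be submitted to it instead of creating a new cluster (a sketch; j-XXXXXXXXXXXXX is a placeholder for your own cluster id):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://step.json
Otherwise, continue with a fresh, self-terminating cluster as below.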
now spin up an Amazon EMR cluster with the auto-terminate option
aws emr create-cluster \
--name "Single Node Spark Cluster" \
--instance-type m3.xlarge \
--release-label emr-4.2.0 \
--instance-count 1 \
--use-default-roles \
--applications Name=Spark \
--steps file://step.json \
--auto-terminate
The above command will spin up a Spark cluster on EMR and run the job. The cluster will terminate automatically after the step finishes, irrespective of success or failure.
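create-cluster prints the new cluster id; you can use it to follow progress and then pull the filtered output down from S3 (a sketch, assuming the job wrote its output under s3://rks-clus-data/error-log.txt as in SimpleApp.scala; j-XXXXXXXXXXXXX is a placeholder):
# watch the cluster and its step
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX
# saveAsTextFile produced a directory of part files; list and download them
aws s3 ls s3://rks-clus-data/error-log.txt/
aws s3 cp s3://rks-clus-data/error-log.txt/ ./error-log --recursive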