Friday, October 2, 2015

Apache Spark: Reading and Writing Sequence Files

Writing a Sequence file:
scala> val data = sc.parallelize(List(("key1", 1), ("Kay2", 2), ("Key3", 2)))
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:27
scala> data.saveAsSequenceFile("/tmp/seq-output")

The output can be verified with the hadoop fs -ls and -text commands:
[root@maprdemo sample-data]# hadoop fs -lsr /tmp/seq-output
lsr: DEPRECATED: Please use 'ls -R' instead.
-rwxr-xr-x   1 root root          0 2015-10-02 01:12 /tmp/seq-output/_SUCCESS
-rw-r--r--   1 root root        102 2015-10-02 01:12 /tmp/seq-output/part-00000
-rw-r--r--   1 root root        119 2015-10-02 01:12 /tmp/seq-output/part-00001
[root@maprdemo sample-data]# hadoop fs -text /tmp/seq-output/part-00001
Kay2 2
Key3 2
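
Each partition of the RDD is written out as its own part file, which is why two part files appear in the listing above. saveAsSequenceFile also accepts an optional compression codec if you want the output compressed at write time. A minimal sketch, with the output path and codec chosen just for illustration:

import org.apache.hadoop.io.compress.GzipCodec

// Same RDD as before, but the part files are Gzip-compressed on write.
data.saveAsSequenceFile("/tmp/seq-output-gz", Some(classOf[GzipCodec]))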

Reading a Sequence file:

The Hadoop key and value classes (Text and IntWritable) are passed to sequenceFile, and each Writable is converted back to a plain Scala type with toString and get:

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable
scala> val result = sc.sequenceFile("/tmp/seq-output/part-00001", classOf[Text], classOf[IntWritable]). map{case (x, y) => (x.toString, y.get())}
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:29

scala> result.collect
res14: Array[(String, Int)] = Array((Kay2,2), (Key3,2))
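
Instead of naming a single part file and the Writable classes explicitly, sequenceFile can also be pointed at the whole output directory and typed with plain Scala types, letting Spark's implicit WritableConverters handle the Text/IntWritable conversion. A minimal sketch, assuming the /tmp/seq-output directory written above:

// Reads every part file under the directory; keys and values are
// converted from Text/IntWritable to String/Int automatically.
val all = sc.sequenceFile[String, Int]("/tmp/seq-output")
all.collect

Either way, convert the Hadoop Writables to plain Scala values (as the map with toString and get does above) before caching or collecting, because Hadoop's record reader reuses the same Writable objects across records.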
