Writing a Sequence file:
scala> val data = sc.parallelize(List(("key1", 1), ("Kay2", 2), ("Key3", 2)))
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:27
scala> data.saveAsSequenceFile("/tmp/seq-output")
The output can be verified with the hadoop fs -ls -R and hadoop fs -text commands:
[root@maprdemo sample-data]# hadoop fs -lsr /tmp/seq-output
lsr: DEPRECATED: Please use 'ls -R' instead.
-rwxr-xr-x   1 root root     0 2015-10-02 01:12 /tmp/seq-output/_SUCCESS
-rw-r--r--   1 root root   102 2015-10-02 01:12 /tmp/seq-output/part-00000
-rw-r--r--   1 root root   119 2015-10-02 01:12 /tmp/seq-output/part-00001
[root@maprdemo sample-data]# hadoop fs -text /tmp/seq-output/part-00001
Kay2    2
Key3    2
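As a side note, saveAsSequenceFile also accepts an optional Hadoop compression codec as a second argument. A minimal sketch, assuming Gzip is available on the cluster and using a hypothetical /tmp/seq-output-gz path:

scala> import org.apache.hadoop.io.compress.GzipCodec
scala> // hypothetical output path; writes the same pairs, compressed with Gzip
scala> data.saveAsSequenceFile("/tmp/seq-output-gz", Some(classOf[GzipCodec]))

The compressed output can be inspected the same way, since hadoop fs -text decompresses recognized codecs before printing.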
Reading a Sequence file:
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable
scala> val result = sc.sequenceFile("/tmp/seq-output/part-00001", classOf[Text], classOf[IntWritable]).map { case (x, y) => (x.toString, y.get()) }
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:29
scala> result.collect
res14: Array[(String, Int)] = Array((Kay2,2), (Key3,2))
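Note that sequenceFile also accepts a directory, so you can read all the part files at once, and the typed overload lets Spark pick the Writable classes and convert them for you. A minimal sketch, assuming the /tmp/seq-output directory written above:

scala> val all = sc.sequenceFile[String, Int]("/tmp/seq-output")
scala> all.collect

With the typed overload the keys and values come back as plain Scala String and Int, so the explicit map over Text and IntWritable is no longer needed.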