Writing a Sequence file:
scala> val data = sc.parallelize(List(("key1", 1), ("Kay2", 2), ("Key3", 2)))
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:27

scala> data.saveAsSequenceFile("/tmp/seq-output")
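saveAsSequenceFile works on pair RDDs whose key and value types Spark can convert to Hadoop Writables (String to Text, Int to IntWritable, and so on). It also accepts an optional compression codec. As a minimal sketch, assuming the same spark-shell session (the path /tmp/seq-output-gz is just an example):

// Sketch: save the same pairs Gzip-compressed; the codec argument is optional.
import org.apache.hadoop.io.compress.GzipCodec

val data = sc.parallelize(List(("key1", 1), ("Kay2", 2), ("Key3", 2)))
data.saveAsSequenceFile("/tmp/seq-output-gz", Some(classOf[GzipCodec]))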
The output can be verified using the hadoop fs ls command:
[root@maprdemo sample-data]# hadoop fs -lsr /tmp/seq-output
lsr: DEPRECATED: Please use 'ls -R' instead.
-rwxr-xr-x 1 root root   0 2015-10-02 01:12 /tmp/seq-output/_SUCCESS
-rw-r--r-- 1 root root 102 2015-10-02 01:12 /tmp/seq-output/part-00000
-rw-r--r-- 1 root root 119 2015-10-02 01:12 /tmp/seq-output/part-00001
[root@maprdemo sample-data]# hadoop fs -text /tmp/seq-output/part-00001
Kay2    2
Key3    2
Reading a Sequence file:
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable

scala> val result = sc.sequenceFile("/tmp/seq-output/part-00001", classOf[Text], classOf[IntWritable]).
         map{case (x, y) => (x.toString, y.get())}
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:29

scala> result.collect
res14: Array[(String, Int)] = Array((Kay2,2), (Key3,2))
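Reading a single part file works, but usually you want the whole output directory back so no partition is missed; sc.sequenceFile accepts the directory path directly. A minimal sketch under the same assumptions as the example above:

// Sketch: read every part file under the output directory at once.
// Mapping Text/IntWritable to String/Int also avoids Hadoop's Writable-reuse pitfall.
import org.apache.hadoop.io.{IntWritable, Text}

val all = sc.sequenceFile("/tmp/seq-output", classOf[Text], classOf[IntWritable])
  .map { case (k, v) => (k.toString, v.get()) }
all.collect().foreach(println)  // expected contents (order may vary): (key1,1), (Kay2,2), (Key3,2)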