scala - Key/Value pair RDD
I have a question about key/value pair RDDs.
I have 5 files in the c:/download/input folder.
The files contain dialogue from films. The files are as follows:
movie_horror_conjuring.txt
movie_comedy_eurotrip.txt
movie_horror_insidious.txt
movie_sci-fi_interstellar.txt
movie_horror_evildead.txt
I am trying to read the files in the input folder using sc.wholeTextFiles(), which gives key/value pairs as follows:
(c:/download/input/movie_horror_conjuring.txt,values)
I am trying to do an operation where I have to group the input files of each genre together using groupByKey(), i.e. the values of all the horror movies together, all the comedy movies together, and so on.
Is there any way I can generate the key/value pairs as (horror, values)
instead of (c:/download/input/movie_horror_conjuring.txt,values)?
val ipfile = sc.wholeTextFiles("c:/download/input")
val output = ipfile.groupByKey().map(t => (t._1, t._2))
The above code gives me the following output:
(c:/download/input/movie_horror_conjuring.txt,values)
(c:/download/input/movie_comedy_eurotrip.txt,values)
(c:/download/input/movie_horror_insidious.txt,values)
(c:/download/input/movie_sci-fi_interstellar.txt,values)
(c:/download/input/movie_horror_evildead.txt,values)
whereas I need the output as follows:
(horror, (values1, values2, values3))
(comedy, (values1))
(sci-fi, (values1))
I tried map and split operations to remove the folder path from the key and keep only the file name, but I am not able to attach the corresponding values to the files.
I would also like to know how I can get the line count of values1, values2, values3, etc.
My final output should be like
(horror, 100)
where 100 is the sum of the line counts: values1 = 40 lines, values2 = 30 lines, values3 = 30 lines, and so on.
Try this:
// Use the second "_"-separated token of the path (the genre) as the key.
val output = ipfile.map { case (k, v) => (k.split("_")(1), v) }.groupByKey()
output.collect
Let me know if this works for you!
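One caveat: splitting the whole key on "_" assumes that no underscore appears in the directory part of the path. Below is a minimal sketch of a more defensive key function, written against plain Scala collections so it can be tried without a Spark context. GenreKeys, genreOf and the sample contents are hypothetical names and data used only for illustration, and the sketch assumes file names follow the movie_<genre>_<title>.txt pattern from the question.

```scala
object GenreKeys {
  // Hypothetical helper: extracts the genre from a path such as
  // "c:/download/input/movie_horror_conjuring.txt".
  def genreOf(path: String): String = {
    // Drop the directory part so underscores in folder names are ignored.
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    fileName.split("_") match {
      case Array(_, genre, _*) => genre
      case _                   => "unknown" // name does not match the pattern
    }
  }

  // Made-up (path, content) pairs standing in for the wholeTextFiles output.
  val files: Seq[(String, String)] = Seq(
    ("c:/download/input/movie_horror_conjuring.txt", "values1"),
    ("c:/download/input/movie_comedy_eurotrip.txt", "values1"),
    ("c:/download/input/movie_horror_insidious.txt", "values2")
  )

  // Local stand-in for map + groupByKey: genre -> contents of its files.
  val grouped: Map[String, Seq[String]] =
    files
      .groupBy { case (path, _) => genreOf(path) }
      .map { case (genre, pairs) => genre -> pairs.map(_._2) }
}
```

On the RDD itself, the same function should be usable in place of the inline split, e.g. ipfile.map { case (k, v) => (GenreKeys.genreOf(k), v) }.groupByKey().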
Update:
To get the output in the format (horror, 100):
// Count the newline characters in each file, then sum the counts per genre.
val output = ipfile.map { case (k, v) => (k.split("_")(1), v.count(_ == '\n')) }.reduceByKey(_ + _)
output.collect
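Note that counting '\n' characters counts line terminators, so a file whose last line has no trailing newline comes out one line short. Here is a small sketch of a counter that accounts for this, again in plain Scala so it runs without Spark. LineCounts, lineCount and the sample data are hypothetical and only illustrate the idea.

```scala
object LineCounts {
  // Hypothetical helper: number of lines in a text, counting a final
  // line even when it is not terminated by '\n'.
  def lineCount(text: String): Int = {
    val newlines = text.count(_ == '\n')
    if (text.nonEmpty && !text.endsWith("\n")) newlines + 1 else newlines
  }

  // Made-up (genre, content) pairs, as they would look after the key
  // has already been reduced to the genre.
  val byGenre: Seq[(String, String)] = Seq(
    ("horror", "line 1\nline 2\n"),        // 2 lines, trailing newline
    ("horror", "line 1\nline 2\nline 3"),  // 3 lines, no trailing newline
    ("comedy", "only line")                // 1 line
  )

  // Local stand-in for map + reduceByKey(_ + _): total lines per genre.
  val totals: Map[String, Int] =
    byGenre
      .groupBy(_._1)
      .map { case (genre, pairs) => genre -> pairs.map(p => lineCount(p._2)).sum }
}
```

In the Spark version, this would amount to replacing v.count(_ == '\n') with LineCounts.lineCount(v) inside the map.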