java - Apache Spark: How can I use an RDD inside of my RDD loop?


Is it possible to run a loop in Apache Spark in a parallelized way, and to work with RDDs inside that loop?

I want to use a correlation function on the Cartesian product of my values. Because the Cartesian product is large, I was thinking about parallelizing the loop.

But if I parallelize the loop, can I still work with RDDs inside it? Collecting and iterating is not an option.

Example:

Let's say I have products, with a date and a price per product, in a DataFrame data.

The RDD materials is the list of products I have.

DataFrame data;
JavaRDD<String> materials;

JavaPairRDD<String, String> cartesian = materials.cartesian(materials);
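For reference, this is roughly how data and materials are built (a simplified sketch; the JSON source path and the exact column names "product", "date", "price" are placeholders for my real setup):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Simplified setup: read the product data and extract the
// distinct product names. The source file is a placeholder.
DataFrame data = sqlContext.read().json("products.json");
JavaRDD<String> materials = data.select("product").distinct()
    .toJavaRDD()
    .map(new Function<Row, String>() {
        public String call(Row row) throws Exception {
            return row.getString(0); // the product name
        }
    });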

In the next step I want to loop through the combinations of the cartesian RDD and filter the DataFrame with the tuple values.

What I was thinking of is this:

import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.stat.Statistics;
import scala.Tuple2;

cartesian.foreach(new VoidFunction<Tuple2<String, String>>() {
    public void call(Tuple2<String, String> arg0) throws Exception {
        // Filter the DataFrame to the first product of the pair
        // and extract its price column as a JavaRDD<Double>.
        JavaRDD<Double> rdd1 = data.where(data.col("product").equalTo(arg0._1))
            .toJavaRDD().map(new Function<Row, Double>() {
                public Double call(Row arg1) throws Exception {
                    return arg1.getDouble(2);
                }
            });

        // Same for the second product of the pair.
        JavaRDD<Double> rdd2 = ...

        Statistics.corr(rdd1, rdd2);
    }
});

This throws a memory leak error on the line where I try to create rdd1. Is there a possible way to do this? Is there a way to work with RDDs inside a parallelized loop?

And how would I display the result of the correlation calculation? Would an accumulator be the correct solution for this?
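For a single, fixed pair of products the calculation works fine on the driver, since Statistics.corr returns a plain double that can be printed directly. A minimal sketch of that working case (the product names "A" and "B" are placeholders):

// Driver-side correlation for one fixed pair of products.
// Statistics.corr returns a plain double, so no accumulator
// is needed to read this single result back.
JavaRDD<Double> pricesA = data.where(data.col("product").equalTo("A"))
    .toJavaRDD().map(new Function<Row, Double>() {
        public Double call(Row row) throws Exception {
            return row.getDouble(2); // the price column
        }
    });
JavaRDD<Double> pricesB = data.where(data.col("product").equalTo("B"))
    .toJavaRDD().map(new Function<Row, Double>() {
        public Double call(Row row) throws Exception {
            return row.getDouble(2);
        }
    });
double corr = Statistics.corr(pricesA, pricesB);
System.out.println("corr(A, B) = " + corr);

The problem is doing this for every pair in the cartesian RDD at once.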

