java - Apache Spark: How can I use an RDD inside of my RDD loop?
Is it possible to run a loop in Apache Spark in parallel, and to work with RDDs inside that loop?
I want to apply a correlation function to the Cartesian product of some values. Because the Cartesian product is large, I was thinking of parallelizing the loop itself.
But if I parallelize the loop, can I still work with RDDs inside it? Collecting the product and iterating over it on the driver is not an option.
Example:
Let's say I have products, with a date and a price for each product, in a DataFrame called data.
The RDD materials is the list of products I have.
DataFrame data;
JavaRDD<String> materials;
JavaPairRDD<String, String> cartesian = materials.cartesian(materials);
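For reference, data and materials could be built roughly like this (a sketch only; the input file and schema are simplified placeholders, not my real data source):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("corr"));
SQLContext sqlContext = new SQLContext(sc);

// one row per (product, date, price) observation
DataFrame data = sqlContext.read().json("products.json");

// the distinct product names
JavaRDD<String> materials = data.select("product").distinct()
    .toJavaRDD().map(new Function<Row, String>() {
        public String call(Row row) throws Exception {
            return row.getString(0);
        }
    });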
In the next step I want to loop through the combinations in the cartesian RDD and filter the DataFrame by the tuple values.
What I was thinking of is this:
cartesian.foreach(new VoidFunction<Tuple2<String, String>>() {
    public void call(Tuple2<String, String> arg0) throws Exception {
        JavaRDD<Double> rdd1 = data.where(data.col("product").equalTo(arg0._1))
            .toJavaRDD().map(new Function<Row, Double>() {
                public Double call(Row arg1) throws Exception {
                    return arg1.getDouble(2); // column 2 holds the price
                }
            });
        JavaRDD<Double> rdd2 = ...
        Statistics.corr(rdd1, rdd2);
    }
});
This produces a memory leak error on the line where I try to create rdd1. Is there a possible way to do this? Is there a way to work with RDDs inside a parallelized loop?
Also, how can I display the result of the correlation calculation? Would an accumulator be the correct solution for this?
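To illustrate the accumulator idea, this is roughly what I have in mind (a sketch only, using the Spark 1.x accumulator API; the per-pair correlation is stubbed out because computing it is exactly the part that fails above):

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

// Sketch: aggregate the correlations with an accumulator (Spark 1.x API).
// The per-pair value is a placeholder, since the nested-RDD step is unsolved.
final Accumulator<Double> corrSum = sc.accumulator(0.0);
cartesian.foreach(new VoidFunction<Tuple2<String, String>>() {
    public void call(Tuple2<String, String> pair) throws Exception {
        double corr = 0.0; // would be Statistics.corr(rdd1, rdd2) for this pair
        corrSum.add(corr); // an accumulator can only aggregate into one value
    }
});
System.out.println("Sum of correlations: " + corrSum.value());

Since an accumulator only aggregates into a single value, I suppose I would have to map each tuple to its correlation and collect that instead if I need the result per pair.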