What is the cost of a Spark shuffle?
Data information:
- your data has n lines;
- each line is b bytes long;
- each line belongs to 1 of g groups, given by a numeric variable `thegroup`;
- initially, the groups are randomly spread across partitions.
Cluster information:
- your `sparkcontext` has e executors;
- each executor has n nodes;
- transferring 1 kB between executors costs pingtime (aggregation time included);
- there is 1 output cable and 1 input cable on each executor.
Your mission: run `groupby(thegroup)` using Spark, if and only if it won't take too long.
Big problem: can you estimate how much time t this operation is going to take?
Wild guesses so far: I imagine t to be:
- increasing in n·(n·e)⁻¹ · log(n·(n·e)⁻¹)
- idea: since there are n·(n·e)⁻¹ lines on each node, they might have to be sorted first;
- increasing in b, obviously;
- increasing in pingtime;
- increasing in g, though I have no idea how: perhaps increasing in g² while g < n;
- decreasing in n.
I need an estimate of the order of magnitude of t; I'm still missing terms (like the relationship between e and g).
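To make the guesswork concrete, here is a minimal back-of-the-envelope sketch in Python. The function name, the parameter names, and the way the terms are combined are all my own assumptions based on the guesses above (a per-node sort term plus a serialized network-transfer term), not an established Spark cost model:

```python
import math

def estimate_shuffle_time(n_lines, bytes_per_line, nodes_per_executor,
                          n_executors, pingtime_per_kb, n_groups):
    """Rough shuffle-cost sketch; units and constants are hand-wavy.

    Combines two of the guessed terms:
    - a local sort on each node, O(k log k) in the k lines it holds;
    - a network term: with 1 output cable per executor, each executor
      serially ships roughly (e-1)/e of its data to the others.
    The n_groups term is left out on purpose; its contribution is
    exactly the open part of the question.
    """
    lines_per_node = n_lines / (nodes_per_executor * n_executors)
    # Per-node sort term (the n·(n·e)⁻¹ · log(n·(n·e)⁻¹) guess).
    sort_cost = lines_per_node * math.log(max(lines_per_node, 2.0))
    # Network term: data per executor, in kB, times pingtime, scaled by
    # the fraction (e-1)/e that must leave the executor.
    kb_per_executor = n_lines * bytes_per_line / n_executors / 1024.0
    network_cost = (pingtime_per_kb * kb_per_executor
                    * (n_executors - 1) / n_executors)
    return sort_cost + network_cost
```

This at least behaves monotonically the way the guesses predict: doubling b or pingtime doubles the network term, while adding executors shrinks both terms.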