hdfs - What is the cost of a Spark shuffle?


data information:

  • your data has N lines;
  • each line is b bytes long;
  • each line belongs to 1 of G groups, given by the numeric variable theGroup;
  • initially, the groups are randomly sorted across the partitions.

cluster information:

  • your SparkContext has E executors;
  • each executor has n nodes;
  • transferring 1 kB from one executor to another costs pingTime (aggregation time included);
  • there is 1 output cable and 1 input cable on each executor.

your mission: groupBy(theGroup) using Spark, if it's not too long to do.
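
For concreteness, here is a minimal PySpark sketch of the operation in question; the session settings, sizes and data generation below are assumptions for illustration, only groupBy(theGroup) itself comes from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    N, G, E, n = 10_000_000, 1_000, 8, 4   # assumed figures, purely for illustration

    spark = (SparkSession.builder
             .appName("shuffle-cost")
             .config("spark.executor.instances", str(E))   # E executors
             .config("spark.executor.cores", str(n))       # n task slots ("nodes") per executor
             .getOrCreate())

    # N lines, each assigned to one of G groups at random, so the groups start
    # randomly spread across partitions; "payload" stands in for the b bytes per line
    df = (spark.range(N)
          .withColumn("theGroup", (F.rand() * G).cast("int"))
          .withColumn("payload", F.rand()))

    # the operation whose cost is being estimated: a full shuffle keyed on theGroup
    grouped = df.groupBy("theGroup").count()
    grouped.show(5)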

the big problem: what's your estimate of how much time t this operation is going to take?


wild guesses so far: I imagine t to be:

  • increasing in N (nE)⁻¹ log( N (nE)⁻¹ )
    • idea: because there are N (nE)⁻¹ lines on each node and they might have to be sorted first
  • increasing in b, obviously
  • increasing in pingTime
  • increasing in G, though I have no idea how: perhaps increasing in G² while G < N
  • decreasing in n

i need an estimate of the order of magnitude of t, and I'm still missing terms (like the relationship between E and G).
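
To make the guesses concrete, here is a rough Python sketch that just encodes the terms above as one formula. The N (nE)⁻¹ log( N (nE)⁻¹ ) sort term and the pingTime-per-kB transfer cost come from the question; the (E-1)/E shuffle fraction, the constants c1, c2, c3, and treating the G dependence as an optional G² term are all assumptions:

    import math

    def shuffle_time_estimate(N, b, G, E, n, ping_time, c1=1e-8, c2=1.0, c3=0.0):
        """Order-of-magnitude guess for t; every constant here is made up."""
        lines_per_node = N / (n * E)
        # local sort/aggregation on each node: ~ N/(nE) * log(N/(nE)) steps
        sort_term = c1 * lines_per_node * math.log(lines_per_node)
        # each executor holds ~ N*b/E bytes; assume hash partitioning pushes a
        # (E-1)/E fraction of that through its single output cable at ping_time per kB
        kb_sent_per_executor = (N * b / E) * (E - 1) / E / 1024
        transfer_term = c2 * ping_time * kb_sent_per_executor
        # dependence on G is unknown; the wild guess above is ~ G² while G < N
        group_term = c3 * G * G
        return sort_term + transfer_term + group_term

    # example with made-up numbers: 10M lines of 100 bytes, 1000 groups,
    # 8 executors with 4 nodes each, 1 ms per kB transferred
    print(shuffle_time_estimate(N=10_000_000, b=100, G=1_000,
                                E=8, n=4, ping_time=0.001))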

