What is the cost of a Spark shuffle?
Data information:
- your data has n lines;
- each line is b bytes long;
- each line belongs to 1 of g groups, given by a numeric variable `thegroup`;
- initially, the groups are randomly spread across partitions.
Cluster information:
- your `sparkcontext` has e executors;
- each executor has n nodes;
- transferring 1 kB between executors costs pingtime (aggregation time included);
- there is 1 output cable and 1 input cable on each executor.
Your mission: run `groupby(thegroup)` using Spark, if and only if it won't take too long.
Big problem: can you estimate how much time t this operation is going to take?
Wild guesses so far: I imagine t to be:
- increasing in n·(n·e)⁻¹ · log(n·(n·e)⁻¹)
- idea: since there are n·(n·e)⁻¹ lines on each node, they might have to be sorted first;
- increasing in b, obviously;
- increasing in pingtime;
- increasing in g, though I have no idea how: perhaps increasing in g² while g < n;
- decreasing in n.
I need an estimate of the order of magnitude of t; I'm still missing terms (like the relationship between e and g).
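To make the guesswork concrete, here is a minimal back-of-the-envelope sketch in Python. The function name, the parameter names, and the way the terms are combined are all my own assumptions based on the guesses above (a per-node sort term plus a serialized network-transfer term), not an established Spark cost model:

```python
import math

def estimate_shuffle_time(n_lines, bytes_per_line, nodes_per_executor,
                          n_executors, pingtime_per_kb, n_groups):
    """Rough shuffle-cost sketch; units and constants are hand-wavy.

    Combines two of the guessed terms:
    - a local sort on each node, O(k log k) in the k lines it holds;
    - a network term: with 1 output cable per executor, each executor
      serially ships roughly (e-1)/e of its data to the others.
    The n_groups term is left out on purpose; its contribution is
    exactly the open part of the question.
    """
    lines_per_node = n_lines / (nodes_per_executor * n_executors)
    # Per-node sort term (the n·(n·e)⁻¹ · log(n·(n·e)⁻¹) guess).
    sort_cost = lines_per_node * math.log(max(lines_per_node, 2.0))
    # Network term: data per executor, in kB, times pingtime, scaled by
    # the fraction (e-1)/e that must leave the executor.
    kb_per_executor = n_lines * bytes_per_line / n_executors / 1024.0
    network_cost = (pingtime_per_kb * kb_per_executor
                    * (n_executors - 1) / n_executors)
    return sort_cost + network_cost
```

This at least behaves monotonically the way the guesses predict: doubling b or pingtime doubles the network term, while adding executors shrinks both terms.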