concurrency - What is the right way to make a barrier in distributed tensorflow?


During distributed training I want to synchronize after each epoch, do some calculations on the chief worker, and then proceed or stop training depending on those calculations. I need a barrier to do this.

I don't see anything similar in the documentation, so I implemented a solution based on queues (similar to how gradients are stored and applied in distributed training):

import tensorflow as tf

def build_barrier(tasks, task_index, barrier_name):
    # One shared queue per task, visible to every worker.
    queues = []
    for i, task in enumerate(tasks):
        with tf.device('%s/cpu:0' % task):
            with tf.name_scope(barrier_name):
                queues.append(
                    tf.FIFOQueue(
                        len(tasks),
                        [tf.float32],
                        shapes=[()],
                        name=str(i),
                        shared_name=str(i)))

    # Enqueue a token into every task's queue, then block until this task's
    # own queue holds one token from each participant.
    with tf.control_dependencies([queue.enqueue(1.) for queue in queues]):
        return queues[task_index].dequeue_many(len(tasks))

The idea is to create one queue per worker. To 'signal', a worker pushes a token into each queue; to 'join', it dequeues as many tokens from its own queue as there are tasks to synchronize.
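For illustration, here is a minimal sketch of how such a barrier might be driven from each worker's training loop (the task list, master address, and epoch count are placeholders):

worker_tasks = ['/job:worker/task:0', '/job:worker/task:1', '/job:worker/task:2']
task_index = 0   # index of the current worker
num_epochs = 10  # placeholder

epoch_barrier = build_barrier(worker_tasks, task_index, 'epoch_barrier')

with tf.train.MonitoredTrainingSession(master='grpc://localhost:2222',
                                       is_chief=(task_index == 0)) as sess:
    for epoch in range(num_epochs):
        # ... run this epoch's training steps ...
        # Enqueue a token into every queue, then block until this worker's
        # queue contains one token from every task.
        sess.run(epoch_barrier)
        # All workers have reached this point; the chief can now run its
        # per-epoch calculations and decide whether to stop.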

The question is: is this the right way, or is there a better way?

Your solution is quite similar to SyncReplicasOptimizer. SyncReplicasOptimizer uses a sync token queue to simulate a barrier, plus an accumulator per variable to accumulate and average gradient updates. It is typical bulk synchronous parallelism, and it also has the additional job of implementing stale synchronous parallelism in TensorFlow.
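For reference, a rough sketch of the usual tf.train.SyncReplicasOptimizer setup in TF 1.x; the tiny model, replica counts, and master address below are placeholders:

import tensorflow as tf

# Placeholder model and cluster parameters; replace with the real ones.
num_workers = 3
task_index = 0
is_chief = (task_index == 0)

x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(x * w - 1.0))
global_step = tf.train.get_or_create_global_step()

base_opt = tf.train.GradientDescentOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=num_workers,  # gradients to accumulate per step
    total_num_replicas=num_workers)
train_op = sync_opt.minimize(loss, global_step=global_step)

# The hook wires up the sync token queue that acts as the per-step barrier.
sync_hook = sync_opt.make_session_run_hook(is_chief)

with tf.train.MonitoredTrainingSession(master='grpc://localhost:2222',
                                       is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op, feed_dict={x: [[1.0]]})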

In addition, TensorFlow provides a barrier in the newest version; you can check that for more information.

