concurrency - What is the right way to make a barrier in distributed TensorFlow?


During distributed training I want to synchronize after each epoch, do some calculations on the chief worker, and then proceed or stop training depending on those calculations. I need a barrier to do so.

I don't see anything similar in the documentation, so I implemented a solution based on queues (similar to how gradients are stored and applied in distributed training):

import tensorflow as tf

def build_barrier(tasks, task_index, barrier_name):
    # One shared FIFO queue per task, each pinned to that task's CPU.
    queues = []
    for i, task in enumerate(tasks):
        with tf.device('%s/cpu:0' % task):
            with tf.name_scope(barrier_name):
                queues.append(
                    tf.FIFOQueue(
                        len(tasks),
                        tf.float32,
                        shapes=[()],
                        name=str(i),
                        shared_name=str(i)))

    # 'Signal': enqueue one token into every task's queue.
    # 'Join': dequeue len(tasks) tokens from this task's own queue.
    with tf.control_dependencies([queue.enqueue(1.) for queue in queues]):
        return queues[task_index].dequeue_many(len(tasks))

The idea is to create one queue per worker. To 'signal', each task pushes a token into every queue, and to 'join', it dequeues from its own queue as many tokens as there are tasks to synchronize.
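For context, here is a minimal sketch of how such a barrier could be driven from each worker's training loop; the cluster size, session target, and the per-epoch training and chief-only evaluation steps are placeholders for my actual setup:

    import tensorflow as tf

    tasks = ['/job:worker/task:%d' % i for i in range(3)]  # hypothetical 3-worker cluster
    task_index = 0                                         # this worker's index

    # Every worker builds the same shared queues; each gets its own dequeue_many op.
    barrier = build_barrier(tasks, task_index, 'epoch_barrier')

    with tf.Session('grpc://localhost:2222') as sess:      # master address is a placeholder
        for epoch in range(10):
            # ... run one epoch of local training here ...
            sess.run(barrier)   # push one token into every queue, then block until
                                # this task has received a token from every task
            if task_index == 0:
                pass            # chief-only calculations decide whether to stop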

The question is: is this the right way to do it, or is there a better way?

Your solution is quite similar to SyncReplicasOptimizer. SyncReplicasOptimizer uses a sync-token queue to simulate a barrier, plus an accumulator per variable to accumulate and average the gradient updates. It is typical bulk synchronous parallelism, although it also has the additional job of implementing stale synchronous parallelism in TensorFlow.
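For reference, a rough sketch of how SyncReplicasOptimizer is typically wired up in TF 1.x; `loss`, `global_step`, `task_index`, `server`, and the worker count are placeholders for whatever your training setup defines:

    import tensorflow as tf

    num_workers = 3                          # hypothetical cluster size
    is_chief = (task_index == 0)

    opt = tf.train.SyncReplicasOptimizer(
        tf.train.GradientDescentOptimizer(0.1),
        replicas_to_aggregate=num_workers,   # gradients accumulated per variable
        total_num_replicas=num_workers)
    train_op = opt.minimize(loss, global_step=global_step)

    # The hook manages the sync-token queue that acts as the per-step barrier.
    hook = opt.make_session_run_hook(is_chief, num_tokens=num_workers)

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=is_chief,
                                           hooks=[hook]) as sess:
        while not sess.should_stop():
            sess.run(train_op)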

In addition, TensorFlow provides a Barrier op in its newest version; you can check its documentation for more information.

