
Tensorflow Adam multi-GPU gradients

I'm trying to set up a multi-GPU network in TensorFlow using the Adam optimizer.

I'm starting from the cifar10_multi_gpu code, but when compute_gradients is called for the second tower it also returns entries for the first tower's variables (with None gradients), and that breaks the averaging over the two towers. This is the code for the two towers:

for i, d in enumerate(devs):
    with tf.device(d):
        with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
            loss = tower_loss(scope)
            tf.get_variable_scope().reuse_variables()
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
            # Gradients are computed over *all* trainable variables,
            # not just the ones belonging to this tower.
            grads = opt.compute_gradients(loss)
            print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads)))
            tower_grads.append(grads)

Looking at the gradients this prints, the first tower produces:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>) 
1: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>) 
2: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>) 
3: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>) 
4: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>) 
5: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>) 
6: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>) 
7: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>) 
8: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>) 
9: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>) 
10: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>) 
11: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>) 
12: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>) 
13: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>) 
14: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>) 
15: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>) 
16: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>) 
17: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>) 
18: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>) 

Each tower is constructed by the following (inside tower_loss):

stream, target = placeholder_inputs(FLAGS.batch_size * tf_model.ANGLES / FLAGS.num_gpus)
logits = tf_model.inference_noisy_simulate(stream)
_ = tf_model.loss(logits, target)
losses = tf.get_collection('losses', scope)
total_loss = tf.add_n(losses, name='total_loss')

and the second tower produces this:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>) 
1: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>) 
2: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>) 
3: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>) 
4: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>) 
5: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>) 
6: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>) 
7: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>) 
8: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>) 
9: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>) 
10: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>) 
11: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>) 
12: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>) 
13: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>) 
14: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>) 
15: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>) 
16: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>) 
17: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>) 
18: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>) 
19: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c178c50>) 
20: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfbb490>) 
21: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfda950>) 
22: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf91bd0>) 
23: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfcb590>) 
24: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39e90>) 
25: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf499d0>) 
26: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf14fd0>) 
27: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39150>) 
28: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd8d0>) 
29: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf23110>) 
30: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf04610>) 
31: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebdc50>) 
32: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd310>) 
33: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96e10>) 
34: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96990>) 
35: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be52c90>) 
36: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf56f50>) 

I'm wondering how to get rid of the Nones that the first tower leaves in the second tower's list, but without targeting indices by hand, so that it still works with more towers.
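One scope-agnostic way to drop such entries, shown here only as a sketch (not part of the original post), is to filter each tower's list before averaging; the answer below instead avoids producing them at all by restricting compute_gradients to the tower's own variables.

    # Sketch: keep only the pairs that actually carry a gradient for this tower.
    grads = [(g, v) for g, v in opt.compute_gradients(loss) if g is not None]
    tower_grads.append(grads)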

Answer


I already found the problem. I was using a trainable variable for the learning rate (I wanted to track the learning rate, but that did not seem possible), and I now also pass the list of variables that Adam should compute gradients for. I'm not sure this is the right way, but it seems to work.

with tf.Graph().as_default(), tf.device('/cpu:0'):
    devs = ['/job:prs/task:0/gpu:0', '/job:worker/task:0/gpu:0']
    global_step = tf.get_variable('global_step', [],
                                  initializer=tf.constant_initializer(0),
                                  trainable=False)
    num_batches_per_epoch = dt_fdr.FLS_PER_ANGLE / FLAGS.batch_size
    # lr = tf.Variable(tf.constant(FLAGS.learning_rate, dtype=tf.float32))
    opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
    tower_grads = []
    for i in xrange(FLAGS.num_gpus):
        with tf.device(devs[i]):
            with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
                loss = tower_loss(scope)
                tf.get_variable_scope().reuse_variables()
                summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                # Restrict the gradients to the variables created in this
                # tower's scope, so no None entries for other towers appear.
                grads = opt.compute_gradients(
                    loss, tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
                tower_grads.append(grads)
    # Synchronous update: average the per-tower gradients.
    grads = average_gradients(tower_grads)
    # summaries.append(tf.scalar_summary('learning_rate', lr))
    for grad, var in grads:
        if grad is not None:
            summaries.append(
                tf.histogram_summary(var.op.name + '/gradients', grad))
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
    for var in tf.trainable_variables():
        summaries.append(tf.histogram_summary(var.op.name, var))

    train_op = apply_gradient_op

    saver = tf.train.Saver(tf.all_variables())
    summary_op = tf.merge_summary(summaries)
    init = tf.initialize_all_variables()

    sess = tf.Session("grpc://nelson-lab:2500",
                      config=tf.ConfigProto(
                          allow_soft_placement=True,
                          log_device_placement=FLAGS.log_device_placement))
    sess.run(init)
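The average_gradients helper is not shown above; a minimal sketch in the spirit of the cifar10 multi-GPU example (TF 0.x API, and assuming every tower yields gradients for the same variables in the same order) looks like this:

    def average_gradients(tower_grads):
        # tower_grads: one list of (gradient, variable) pairs per tower.
        average_grads = []
        for grad_and_vars in zip(*tower_grads):
            # Stack this variable's gradient from every tower and average them.
            grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
            grad = tf.reduce_mean(tf.concat(0, grads), 0)
            # Pair the averaged gradient with the first tower's variable.
            average_grads.append((grad, grad_and_vars[0][1]))
        return average_grads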

I would also be curious to hear whether anyone else has tried training on two GPUs with Adam.

Regards


Thanks. The learning rate can be declared as a non-trainable variable:

    lr = tf.Variable(FLAGS.learning_rate, trainable=False)
    opt = tf.train.AdamOptimizer(lr)

Also, I'm working on parallelizing training across multiple GPUs. Have you had any success with your implementation? –


In addition, if you have a plan to update the learning rate during the training phase, declare it as above. Thanks for sharing your implementation. The learning rate can then be updated with

sess.run(tf.assign(lr, new_lr))
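As an illustrative usage sketch (the placeholder and the names below are assumptions, not from the comment), the non-trainable learning-rate variable can be reassigned between training steps:

    # Sketch: lowering the learning rate during training (TF 0.x-style API).
    lr = tf.Variable(FLAGS.learning_rate, trainable=False)
    new_lr = tf.placeholder(tf.float32, shape=[])   # hypothetical feed point
    update_lr = tf.assign(lr, new_lr)
    opt = tf.train.AdamOptimizer(lr)
    # ... build the rest of the graph, then during training:
    sess.run(update_lr, feed_dict={new_lr: FLAGS.learning_rate * 0.1})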