First published in the column RC&DL

TensorFlow Learning (3)

Brief notes on a few APIs

tf.add_to_collection

tf.add_to_collection(
    name,
    value
)

The first argument is the name of the collection; the second is the value to add, which can be a single variable or a list of variables.

tf.get_collection

tf.get_collection(
    key,
    scope=None
)

The first argument, key, is the name of the collection; the scope argument hasn't been needed so far, so I'll set it aside for now.

TensorFlow provides collections to tame large numbers of variables: variables of the same type, or ones you choose to group under the same name, are stored in a named collection. This makes operating on them much simpler, since tf.get_collection(name) retrieves them all for further processing.
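Under the hood, a collection is essentially a per-graph mapping from a name to a list of values. A minimal pure-Python sketch of that idea (illustrative only, not TensorFlow's actual implementation):

```python
# Minimal sketch of the collection idea: a named registry of values.
# Illustrative only -- not TensorFlow's actual implementation.
_collections = {}

def add_to_collection(name, value):
    # Appending is cumulative: repeated calls keep extending the list.
    _collections.setdefault(name, []).append(value)

def get_collection(key):
    # An unknown key returns an empty list, matching tf.get_collection.
    return _collections.get(key, [])

add_to_collection("varlist", "var1")
add_to_collection("varlist", "var2")
print(get_collection("varlist"))  # ['var1', 'var2']
```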

import tensorflow as tf

v1 = tf.Variable([1, 2 ,3], name="var1")
v2 = tf.Variable([2, 3, 4], name="var2")

tf.add_to_collection("varlist", [v1, v2])
collection1 = tf.get_collection("varlist")
print(collection1)    #[[<tf.Variable 'var1:0' shape=(3,) dtype=int32_ref>, <tf.Variable 'var2:0' shape=(3,) dtype=int32_ref>]]

with tf.variable_scope("s1"):
    var1 = tf.get_variable("v1", shape=[1, 3], initializer=tf.constant_initializer([1, 2, 3]))
    tf.add_to_collection("collection", var1)
with tf.variable_scope("s2"):
    var1 = tf.get_variable("v1", shape=[1, 3], initializer=tf.constant_initializer([2, 3, 4]))
    tf.add_to_collection("collection", var1)
collection2 = tf.get_collection("collection")
print(collection2)    # [<tf.Variable 's1/v1:0' shape=(1, 3) dtype=float32_ref>, <tf.Variable 's2/v1:0' shape=(1, 3) dtype=float32_ref>]

tf.get_collection then returns the list of these variables.
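For the record, the scope argument skipped earlier filters the returned items by name: tf.get_collection(key, scope) keeps only items whose name matches the scope prefix (treated as a regular expression). A rough pure-Python sketch of that filtering, using hypothetical items named in the TF style ("s1/v1:0"):

```python
import re

# Hypothetical collection items with TF-style names such as "s1/v1:0".
class Item:
    def __init__(self, name):
        self.name = name

collection = [Item("s1/v1:0"), Item("s2/v1:0")]

def get_collection_scoped(items, scope):
    # tf.get_collection treats `scope` as a regex matched against
    # the beginning of each item's name.
    pattern = re.compile(scope)
    return [it for it in items if pattern.match(it.name)]

print([it.name for it in get_collection_scoped(collection, "s1")])
# ['s1/v1:0']
```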

tf.add_n

tf.add_n(
    inputs,
    name=None
)

inputs is a list of tensors (or variables); the result is the element-wise sum of all elements in the list.

a1 = tf.Variable([10], dtype=tf.float32, name="a1")
a2 = tf.Variable([20], dtype=tf.float32, name="a2")
a3 = tf.Variable([30], dtype=tf.float32, name="a3")

tf.add_to_collection("a", [a1, a2, a3])
a_list = tf.get_collection("a")[0]

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(tf.add_n(a_list)))                # [60.]

tf.train.Optimizer

This class defines the API for adding ops that train a model. You don't use it directly; instead you instantiate one of its subclasses, such as GradientDescentOptimizer, AdagradOptimizer, or MomentumOptimizer.

These are the familiar gradient-descent and Adam optimizer methods, and usage is simple:

train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

The minimize() method minimizes the loss function by updating the variables in var_list.

minimize(
    loss,
    global_step=None,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    name=None,
    grad_loss=None
)

minimize() = compute_gradients() + apply_gradients()
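To see this decomposition numerically, here is a pure-Python sketch (no TensorFlow) that minimizes loss(w) = (w - 3)^2 with plain gradient descent, keeping the "compute" and "apply" steps separate to mirror the split above:

```python
# Sketch: minimize() = compute_gradients() + apply_gradients(),
# demonstrated on loss(w) = (w - 3)**2 with plain gradient descent.
def compute_gradient(w):
    # d/dw (w - 3)^2 = 2 * (w - 3)
    return 2.0 * (w - 3.0)

def apply_gradient(w, grad, lr=0.1):
    # Standard gradient-descent update step.
    return w - lr * grad

w = 0.0
for _ in range(100):
    g = compute_gradient(w)   # first half of minimize()
    w = apply_gradient(w, g)  # second half of minimize()
print(round(w, 4))  # converges toward 3.0
```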

compute_gradients

compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)

apply_gradients

apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)

minimize() is actually a combination of these two methods. The first, compute_gradients(), computes the gradients of the loss with respect to the trainable variables in var_list. The second, apply_gradients(), back-propagates the gradients obtained from compute_gradients() to the weights and biases to update the parameters.

Together these two methods implement the BP step. If you don't need to do anything extra with the gradients, minimize() is recommended as the one-step option. If you need to smooth, clip, or otherwise post-process the gradients, use the two-method approach.

# Create an optimizer.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Source code of compute_gradients():

  def compute_gradients(self, loss, var_list=None,
                        gate_gradients=GATE_OP,
                        aggregation_method=None,
                        colocate_gradients_with_ops=False,
                        grad_loss=None):
    """Compute gradients of `loss` for the variables in `var_list`.

    This is the first part of `minimize()`.  It returns a list
    of (gradient, variable) pairs where "gradient" is the gradient
    for "variable".  Note that "gradient" can be a `Tensor`, an
    `IndexedSlices`, or `None` if there is no gradient for the
    given variable.

    Args:
      loss: A Tensor containing the value to minimize.
      var_list: Optional list or tuple of `tf.Variable` to update to minimize
        `loss`.  Defaults to the list of variables collected in the graph
        under the key `GraphKeys.TRAINABLE_VARIABLES`.
      gate_gradients: How to gate the computation of gradients.  Can be
        `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
      aggregation_method: Specifies the method used to combine gradient terms.
        Valid values are defined in the class `AggregationMethod`.
      colocate_gradients_with_ops: If True, try colocating gradients with
        the corresponding op.
      grad_loss: Optional. A `Tensor` holding the gradient computed for `loss`.

    Returns:
      A list of (gradient, variable) pairs. Variable is always present, but
      gradient can be `None`.

    Raises:
      TypeError: If `var_list` contains anything else than `Variable` objects.
      ValueError: If some arguments are invalid.
      RuntimeError: If called with eager execution enabled and if `grad_loss`
        is not `None` or `loss` is not callable.

    @compatibility(eager)
    When eager execution is enabled, `loss` should be a Python function that
    takes elements of `var_list` as arguments and computes the value to be
    minimized. If `var_list` is None, `loss` should take no arguments.
    Gradient computation is done with respect to the elements of `var_list` if
    not None, else with respect to any trainable variables created during the
    execution of the `loss` function.
    `gate_gradients`, `aggregation_method`, `colocate_gradients_with_ops` and
    `grad_loss` are ignored when eager execution is enabled.
    @end_compatibility
    """
    if context.in_eager_mode():
      if grad_loss is not None:
        raise RuntimeError(
            "`grad_loss` argument to Optimizer.compute_gradients "
            "not supported when eager execution is enabled.")
      if not callable(loss):
        raise RuntimeError(
            "`loss` passed to Optimizer.compute_gradients should "
            "be a function when eager execution is enabled.")
      # TODO(agarwal): consider passing parameters to the `loss` function.
      if var_list is None:
        return backprop.implicit_grad(loss)()
      else:
        var_list = nest.flatten(var_list)
        grads = backprop.gradients_function(loss)(*var_list)
        grads_and_vars = list(zip(grads, var_list))
        return grads_and_vars
    if gate_gradients not in [Optimizer.GATE_NONE, Optimizer.GATE_OP,
                              Optimizer.GATE_GRAPH]:
      raise ValueError("gate_gradients must be one of: Optimizer.GATE_NONE, "
                       "Optimizer.GATE_OP, Optimizer.GATE_GRAPH.  Not %s" %
                       gate_gradients)
    self._assert_valid_dtypes([loss])
    if grad_loss is not None:
      self._assert_valid_dtypes([grad_loss])
    if var_list is None:
      var_list = (
          variables.trainable_variables() +
          ops.get_collection(ops.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))
    else:
      var_list = nest.flatten(var_list)
    # pylint: disable=protected-access
    var_list += ops.get_collection(ops.GraphKeys._STREAMING_MODEL_PORTS)
    # pylint: enable=protected-access
    processors = [_get_processor(v) for v in var_list]
    if not var_list:
      raise ValueError("No variables to optimize.")
    var_refs = [p.target() for p in processors]
    grads = gradients.gradients(
        loss, var_refs, grad_ys=grad_loss,
        gate_gradients=(gate_gradients == Optimizer.GATE_OP),
        aggregation_method=aggregation_method,
        colocate_gradients_with_ops=colocate_gradients_with_ops)
    if gate_gradients == Optimizer.GATE_GRAPH:
      grads = control_flow_ops.tuple(grads)
    grads_and_vars = list(zip(grads, var_list))
    self._assert_valid_dtypes(
        [v for g, v in grads_and_vars
         if g is not None and v.dtype != dtypes.resource])
    return grads_and_vars

Looking at the source, in eager mode the gradient computation comes from:

grads = backprop.gradients_function(loss)(*var_list)
grads_and_vars = list(zip(grads, var_list))

(in graph mode the corresponding call is gradients.gradients(...) instead). backprop.gradients_function differentiates the loss with respect to var_list:

gradients = \frac{\partial(loss)}{\partial(var\_list)}

The gradients are then zipped with their corresponding variables, returning a list of (gradient, variable) tuples. At this point you can apply custom operations to the gradients, such as averaging them across GPUs in the synchronous single-machine multi-GPU setting. The code below comes from the cifar10 multi-GPU example, with a few small changes to fit my own code.

def average_gradients(tower_grads):
    """Calculate the average gradient for each shared variable across all towers.

    Note that this function provides a synchronization point across all towers.

    Args:
      tower_grads: List of lists of (gradient, variable) tuples. The outer list
        is over individual gradients. The inner list is over the gradient
        calculation for each tower.
    Returns:
       List of pairs of (gradient, variable) where the gradient has been averaged
       across all towers.
    """
    average_grads = []
    # tower_grads comes from optimizer.compute_gradients on each tower:
    # a list of (gradient, variable) lists, one list per tower.
    for grad_and_vars in zip(*tower_grads):
        # Each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ..., (grad0_gpuN, var0_gpuN))
        grads = []
        for g, _ in grad_and_vars:
            if g is None:
                continue
            # Add a 0th dimension to the gradient to represent the tower.
            expanded_g = tf.expand_dims(g, 0)
            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)
        # Average over the 'tower' dimension.
        grad = tf.concat(axis=0, values=grads)
        grad = tf.reduce_mean(grad, 0)

        # The Variables are redundant because they are shared across towers,
        # so just return the first tower's pointer to the Variable.
        v = grad_and_vars[0][1]
        average_grads.append((grad, v))
    return average_grads
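The key trick above is zip(*tower_grads), which regroups the per-tower lists into per-variable tuples before averaging. A plain-Python illustration with made-up gradient numbers:

```python
# tower_grads: one (gradient, variable) list per tower/GPU.
# The numbers here are made up purely for illustration.
tower_grads = [
    [(1.0, "w"), (2.0, "b")],   # gradients from tower 0
    [(3.0, "w"), (4.0, "b")],   # gradients from tower 1
]

average_grads = []
for grad_and_vars in zip(*tower_grads):
    # grad_and_vars == ((g_tower0, var), (g_tower1, var)) for one variable.
    grads = [g for g, _ in grad_and_vars if g is not None]
    avg = sum(grads) / len(grads)
    # All towers share the variable, so take the first tower's reference.
    average_grads.append((avg, grad_and_vars[0][1]))

print(average_grads)  # [(2.0, 'w'), (3.0, 'b')]
```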

Once you have the new gradients, re-zip them with their variables and feed them into the BP update step, which is the apply_gradients() method.

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """Apply gradients to variables.

    This is the second part of `minimize()`. It returns an `Operation` that
    applies gradients.

    Args:
      grads_and_vars: List of (gradient, variable) pairs as returned by
        `compute_gradients()`.
      global_step: Optional `Variable` to increment by one after the
        variables have been updated.
      name: Optional name for the returned operation.  Default to the
        name passed to the `Optimizer` constructor.

    Returns:
      An `Operation` that applies the specified gradients. If `global_step`
      was not None, that operation also increments `global_step`.

    Raises:
      TypeError: If `grads_and_vars` is malformed.
      ValueError: If none of the variables have gradients.
    """
    # This is a default implementation of apply_gradients() that can be shared
    # by most optimizers.  It relies on the subclass implementing the following
    # methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().

    grads_and_vars = tuple(grads_and_vars)  # Make sure repeat iteration works.
    if not grads_and_vars:
      raise ValueError("No variables provided.")
    converted_grads_and_vars = []
    for g, v in grads_and_vars:
      if g is not None:
        try:
          # Convert the grad to Tensor or IndexedSlices if necessary.
          g = ops.convert_to_tensor_or_indexed_slices(g)
        except TypeError:
          raise TypeError(
              "Gradient must be convertible to a Tensor"
              " or IndexedSlices, or None: %s" % g)
        if not isinstance(g, (ops.Tensor, ops.IndexedSlices)):
          raise TypeError(
              "Gradient must be a Tensor, IndexedSlices, or None: %s" % g)
      p = _get_processor(v)
      converted_grads_and_vars.append((g, v, p))

    converted_grads_and_vars = tuple(converted_grads_and_vars)
    var_list = [v for g, v, _ in converted_grads_and_vars if g is not None]
    if not var_list:
      raise ValueError("No gradients provided for any variable: %s." %
                       ([str(v) for _, _, v in converted_grads_and_vars],))
    with ops.control_dependencies(None):
      self._create_slots([_get_variable_for(v) for v in var_list])
    update_ops = []
    with ops.name_scope(name, self._name) as name:
      self._prepare()
      for grad, var, processor in converted_grads_and_vars:
        if grad is None:
          continue
        # We colocate all ops created in _apply_dense or _apply_sparse
        # on the same device as the variable.
        # TODO(apassos): figure out how to get the variable name here.
        scope_name = var.op.name if context.in_graph_mode() else ""
        with ops.name_scope("update_" + scope_name), ops.colocate_with(var):
          update_ops.append(processor.update_op(self, grad))
      if global_step is None:
        apply_updates = self._finish(update_ops, name)
      else:
        with ops.control_dependencies([self._finish(update_ops, "update")]):
          with ops.colocate_with(global_step):
            apply_updates = state_ops.assign_add(global_step, 1, name=name)

      if context.in_graph_mode():
        if isinstance(apply_updates, ops.Tensor):
          apply_updates = apply_updates.op
        train_op = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
        if apply_updates not in train_op:
          train_op.append(apply_updates)

      return apply_updates

A few notes from my own understanding:

p = _get_processor(v)
converted_grads_and_vars.append((g, v, p))
converted_grads_and_vars = tuple(converted_grads_and_vars)

These lines pair each gradient and variable with a processor obtained from _get_processor(v). As I understand it, the processor encapsulates an efficient way to update that kind of variable; each processor provides an update_op function that performs the actual variable update.

The update_op method

Variable update rule:

weights = weights - learning\_rate \cdot \frac{\partial(loss)}{\partial(weights)}

Here the optimizer argument comes from the Optimizer instance itself (self), and g is the gradient obtained from compute_gradients().
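The update rule above in minimal Python form, applied element-wise to a list of weights (a sketch of plain SGD, not the processor's actual update_op):

```python
# Plain SGD update: weights <- weights - learning_rate * gradient.
def sgd_update(weights, grads, learning_rate=0.01):
    return [w - learning_rate * g for w, g in zip(weights, grads)]

weights = [0.5, -1.0, 2.0]
grads = [1.0, -2.0, 0.5]
new_weights = sgd_update(weights, grads, learning_rate=0.1)
print(new_weights)  # approximately [0.4, -0.8, 1.95]
```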

This post is long enough; the next one will cover where TensorFlow's gradient-descent code lives.

Edited on 2018-07-30