An In-Depth Look at Residual Networks: ResNet V2 (with Source Code)

The paper Identity Mappings in Deep Residual Networks (commonly referred to as ResNet V2) is a follow-up that improves on Deep Residual Learning for Image Recognition (commonly referred to as ResNet V1).

For background on ResNet V1, see the earlier post 残差网络ResNet V1.


0 Abstract

The abstract sums up the paper: if the forward signal and the backward gradient are propagated directly from one block to the next, without passing through ReLU or other operations, the network performs better:

In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings.

1 Introduction

The paper first gives the general form of a residual unit:

\mathbf{y}_l = h(\mathbf{x}_l) + F(\mathbf{x}_l, W_l), \quad \mathbf{x}_{l+1} = f(\mathbf{y}_l)

In ResNet V1, h(\mathbf{x}_l) = \mathbf{x}_l is an identity mapping, and f is the ReLU activation function.

The idea of ResNet V2 is that the identity mapping should hold not only within a single residual unit, but throughout the entire network:

In this paper, we analyze deep residual networks by focusing on creating a "direct" path for propagating information: not only within a residual unit, but through the entire network.

To create such a direct path, two conditions must hold: h(\mathbf{x}_l) = \mathbf{x}_l and f(\mathbf{y}_l) = \mathbf{y}_l, i.e. both h and f are identity mappings. These two conditions are examined in detail in Section 3 and Section 4, respectively.
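
Spelling out what the two conditions buy us: substituting them into the general form above collapses each unit into a plain addition,

\mathbf{x}_{l+1} = f(\mathbf{y}_l) = \mathbf{y}_l = h(\mathbf{x}_l) + F(\mathbf{x}_l, W_l) = \mathbf{x}_l + F(\mathbf{x}_l, W_l)

which is the starting point of the analysis in the next section.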


2 Analysis of Deep Residual Networks

If both conditions hold, the relation between any two layers follows by applying the formula above recursively:

\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i)
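
To see where this comes from, unroll the recursion one step at a time:

\mathbf{x}_{l+2} = \mathbf{x}_{l+1} + F(\mathbf{x}_{l+1}, W_{l+1}) = \mathbf{x}_l + F(\mathbf{x}_l, W_l) + F(\mathbf{x}_{l+1}, W_{l+1})

and continue until layer L, which yields the sum above.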

Two properties follow from this formula:

  1. A deep unit \mathbf{x}_L can be written as a shallow unit \mathbf{x}_l plus a sum of residual functions \sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i);
  2. Taking l = 0, \mathbf{x}_L = \mathbf{x}_0 + \sum_{i=0}^{L-1} F(\mathbf{x}_i, W_i), i.e. any unit \mathbf{x}_L is the sum of the network input \mathbf{x}_0 and all preceding residual functions.

This is a very useful property for backpropagation:

\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_L} \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_L} + \frac{\partial \varepsilon}{\partial \mathbf{x}_L} \left( \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} F(\mathbf{x}_i, W_i) \right)

We can see that \frac{\partial \varepsilon}{\partial \mathbf{x}_L}, the gradient at the deep layer, is added directly to the gradient of the shallow layer without passing through any weight layers, so it cannot be scaled down by small weights; for the total gradient to vanish, the second term would have to be exactly -1 for every sample in a mini-batch, which in general does not happen. Gradient vanishing is therefore well controlled:

This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.
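
A toy numerical sketch of this point (my own illustration, not an experiment from the paper): treat each layer as a scalar map with a small weight w, and compare how a gradient of 1.0 at the deepest layer reaches the bottom through L plain layers versus L residual layers with identity shortcuts.

w, L = 0.1, 50          # made-up toy values: a small weight and 50 layers

plain_grad = 1.0        # plain net: x_{l+1} = w * x_l, so each layer multiplies the gradient by w
residual_grad = 1.0     # residual net: x_{l+1} = x_l + w * x_l, so the per-layer factor is (1 + w)
for _ in range(L):
    plain_grad *= w
    residual_grad *= (1 + w)

print(plain_grad)       # ~1e-50: vanishes
print(residual_grad)    # ~117: the identity path keeps the gradient alive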


The two formulas above show that both forward and backward propagation have very nice properties in ResNet V2:

Eqn.(4) and Eqn.(5) suggest that the signal can be directly propagated from any unit to another, both forward and backward.

Note that the quote says "from any unit to another", not merely between two adjacent units.


The implementation (in resnet_v2.py of TensorFlow-Slim) is very similar to that of ResNet V1. The extra twist is that nothing (no ReLU, no BN) is applied after the addition; instead a pre-activation layer preact (batch norm + ReLU on the unit input) feeds the convolutions on the residual branch, and also the 1×1 shortcut convolution when the depth changes, while the identity shortcut carries the raw, un-activated input:

with tf.variable_scope(scope, 'bottleneck_v2', [inputs]) as sc:
    depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
    # Pre-activation: BN + ReLU applied to the unit input; it feeds the residual
    # branch (and the shortcut convolution when the depth changes).
    preact = slim.batch_norm(inputs, activation_fn=tf.nn.relu, scope='preact')
    if depth == depth_in:
      # Identity shortcut: the raw, un-activated input is carried through
      # (subsampled when stride > 1).
      shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
    else:
      # Depth changes: project with a 1x1 conv on preact, with no BN/ReLU of its own.
      shortcut = slim.conv2d(preact, depth, [1, 1], stride=stride,
                             normalizer_fn=None, activation_fn=None,
                             scope='shortcut')

    # Residual branch: 1x1 -> 3x3 -> 1x1 bottleneck, starting from preact.
    residual = slim.conv2d(preact, depth_bottleneck, [1, 1], stride=1,
                           scope='conv1')
    residual = resnet_utils.conv2d_same(residual, depth_bottleneck, 3, stride,
                                        rate=rate, scope='conv2')
    # The last conv has no BN/ReLU, so the unit output below is a pure addition.
    residual = slim.conv2d(residual, depth, [1, 1], stride=1,
                           normalizer_fn=None, activation_fn=None,
                           scope='conv3')

    # No after-addition activation: the sum goes straight into the next unit.
    output = shortcut + residual

    return slim.utils.collect_named_outputs(outputs_collections,
                                            sc.name,
                                            output)
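
Because each unit's output is an un-activated sum, the full network in resnet_v2.py applies one extra batch norm + ReLU (the postnorm layer) after the last block, before pooling and the classifier. A minimal usage sketch, assuming the TF-Slim models package (the nets/ directory containing resnet_v2.py) is on the Python path:

import tensorflow as tf
from nets import resnet_v2  # assumed location, as in the TF-Slim models repository

slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 224, 224, 3])
with slim.arg_scope(resnet_v2.resnet_arg_scope()):
    # 50-layer pre-activation ResNet; logits for 1000 ImageNet classes.
    logits, end_points = resnet_v2.resnet_v2_50(images,
                                                num_classes=1000,
                                                is_training=False)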

3 On the Importance of Identity Skip Connections

Section 3 first analyzes the case h(\mathbf{x}_l) = \lambda_l \mathbf{x}_l, i.e. a constant scaling on the shortcut, which easily leads to exploding or vanishing gradients, so it is discarded.
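
To see why, repeat the recursion of Section 2 with the scaled shortcut (absorbing the scalars acting on the residual functions into a modified \hat{F}, as the paper does):

\mathbf{x}_L = \left( \prod_{i=l}^{L-1} \lambda_i \right) \mathbf{x}_l + \sum_{i=l}^{L-1} \hat{F}(\mathbf{x}_i, W_i)

so the gradient flowing along the shortcut path is multiplied by \prod_{i=l}^{L-1} \lambda_i: for \lambda_i > 1 this factor grows exponentially, for \lambda_i < 1 it vanishes, forcing the signal through the weight layers and hampering optimization.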

3.1 Experiments on Skip Connections

This subsection analyzes the five variants shown in (b)–(f) of the figure in the paper.

Constant scaling: two cases are considered. In the first, only the shortcut is scaled (by 0.5); in the second, the residual branch is scaled as well, which resembles highway gating with frozen gates. The former fails to converge; the latter converges but with a higher error rate than ResNet V1;

Exclusive gating: borrowed from Highway Networks; I have not read that paper, so I will not go into it here;

Shortcut-only gating: a simplified version of exclusive gating, also skipped;

1\times 1 convolutional shortcut: replace the identity shortcut with a 1\times1 convolution;

Dropout shortcut: apply dropout on the shortcut; a brief code sketch of a few of these variants follows below.
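
As a rough illustration of how some of these variants change the shortcut path, here is a sketch in the same slim style as the code above (my own paraphrase reusing the shortcut/residual/preact names from that snippet; the 0.5 scale and the 0.5 dropout ratio are the values reported in the paper, everything else is illustrative, not code from the paper or from resnet_v2.py):

# Constant scaling: the shortcut (and optionally the residual) is scaled by 0.5.
output = 0.5 * shortcut + residual

# 1x1 convolutional shortcut: a 1x1 conv replaces the identity mapping.
shortcut = slim.conv2d(preact, depth, [1, 1], stride=stride,
                       normalizer_fn=None, activation_fn=None,
                       scope='shortcut_conv')
output = shortcut + residual

# Dropout shortcut: dropout (keep_prob = 0.5) applied on the shortcut.
output = slim.dropout(shortcut, keep_prob=0.5) + residual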

3.2 Discussions

Among all the variants analyzed, the identity mapping works best:

These experiments suggest that keeping a "clean" information path is helpful for easing optimization.

The paper also analyzes the cause: it is not that the network lacks representational power (adding gating or 1\times1 convolutions actually increases representational power); it is an optimization problem (the paper does not go into detail about what exactly the issue is):

However, their (the gating and 1*1 convolutional shortcuts) training error is higher than that of identity shortcuts, indicating that the degradation of these models is caused by optimization issues, instead of representational abilities.

4 On the Usage of Activation Functions

4.1 Experiments on Activation

BN after addition: move BN to after the addition, as shown in the figure;

ReLU before addition: this forces the residual F to be non-negative, whereas it should also be able to take negative values;

Post-activation or pre-activation?: applying the activation asymmetrically, so that it only affects the residual branch of the next unit, is equivalent to using it as the pre-activation of that next unit; this leads to the ReLU-only and full pre-activation designs, of which full pre-activation (BN + ReLU before each convolution) performs best (see the sketch after this list);
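
To make the ordering concrete, here is a sketch of the two extremes for one residual unit with two 3\times3 convolutions, in the slim style used above (my own paraphrase, not code from the paper; scope names are made up, and an outer arg_scope is assumed in which slim.conv2d uses slim.batch_norm as its normalizer, as resnet_arg_scope does):

# Original post-activation ordering (ResNet V1): ReLU after the addition.
with tf.variable_scope('unit_post_activation'):
  residual = slim.conv2d(x, depth, [3, 3], scope='conv1')    # conv -> BN -> ReLU
  residual = slim.conv2d(residual, depth, [3, 3],
                         activation_fn=None, scope='conv2')  # conv -> BN
  out_post = tf.nn.relu(x + residual)                        # after-addition ReLU

# Full pre-activation ordering (ResNet V2): BN -> ReLU before each conv,
# nothing touches the signal after the addition.
with tf.variable_scope('unit_pre_activation'):
  preact = slim.batch_norm(x, activation_fn=tf.nn.relu, scope='preact1')
  residual = slim.conv2d(preact, depth, [3, 3], normalizer_fn=None,
                         activation_fn=None, scope='conv1')
  preact = slim.batch_norm(residual, activation_fn=tf.nn.relu, scope='preact2')
  residual = slim.conv2d(preact, depth, [3, 3], normalizer_fn=None,
                         activation_fn=None, scope='conv2')
  out_pre = x + residual                                     # identity after addition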

4.2 Analysis

The analysis attributes the gains to two effects: first, because f is now an identity mapping, optimization becomes easier; second, using BN as pre-activation improves the regularization of the model.


5 Results

Nothing much worth discussing here.


6 Conclusions

Nothing much worth discussing here.


[Complete]
