[Paper Reading] Investigating the Real Reason Batch Normalization Works

The paper is How Does Batch Normalization Help Optimization?, a NIPS 2018 paper that examines why BN works.

For background on BN, see 深入解读Inception V2之Batch Normalization(附源码) (an in-depth look at Batch Normalization in Inception V2, with source code).

I also found an excellent article, 深度学习中的Normalization模型 - 极市博客 (Normalization models in deep learning), written by Zhang Junlin, chief scientist at Weibo.

There is a Chinese translation of the paper: How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift).

There is also a discussion thread in which the paper's authors answer questions: [R] How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift).


0 Abstract

As discussed in 深入解读Inception V2之Batch Normalization(附源码), the Inception V2 paper argues that BN works because it prevents the so-called "internal covariate shift":

The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift".

This paper offers a different view: BN works because it makes the optimization landscape smoother:

Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.

For background on the loss landscape, see SGD在两层神经网络上是怎么收敛的? (how SGD converges on a two-layer network; worth a read), which mentions Visualizing the Loss Landscape of Neural Nets. I recently read that paper as well, in [论文阅读]损失函数可视化及其对神经网络的指导作用. Its main idea is to project the high-dimensional parameter space down to two or three dimensions; looking at the projection gives an intuitive picture of the network, for example the figure below:


1 Introduction

Here, at last, is an explanation of ICS ("internal covariate shift"), which I never quite understood when reading Inception V2. Although it is only informal, it says that updates to the preceding layers change the distribution of the inputs fed to the following layer:

Informally, ICS refers to the change in the distribution of layer inputs caused by updates to the preceding layers.

2 Batch normalization and internal covariate shift

The paper's description of BN is concise: to stabilize the distribution of a layer's inputs, each activation is normalized (over the mini-batch) to zero mean and unit variance:

Broadly speaking, BatchNorm is a mechanism that aims to stabilize the distribution (over a mini-batch) of inputs to a given network layer during training. This is achieved by augmenting the network with additional layers that set the first two moments (mean and variance) of the distribution of each activation to be zero and one respectively.

However, the mean and variance differ from one mini-batch to the next, and the activations keep changing as training updates the parameters; so, to preserve the model's expressive power, BN also introduces a trainable scale and shift (a minimal sketch follows after the quote below):

Then, the batch normalized inputs are also typically scaled and shifted based on trainable parameters to preserve model expressivity. This normalization is applied before the non-linearity of the previous layer.
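To make the mechanism concrete, here is a minimal NumPy sketch of the training-time BatchNorm forward pass described above; the trainable scale gamma and shift beta are exactly the parameters the quote refers to. The function name and shapes are my own choices, not the paper's code.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch
    to zero mean / unit variance, then apply the trainable scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # first two moments set to 0 and 1
    return gamma * x_hat + beta              # restore expressivity

# toy usage: a mini-batch with non-zero mean and non-unit variance
x = np.random.randn(32, 4) * 3.0 + 1.0
y = batchnorm_forward(x, np.ones(4), np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))         # ~0 and ~1 per feature
```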

The paper compares training with and without BatchNorm:

The effect is clear: with BN, training converges faster and reaches higher accuracy. The paper then examines how the layer activation distributions change:

In Layer #3 of the Standard model, the median of the distribution is not at 0, which is where ICS shows up. The Standard + BatchNorm model, in contrast, is bell-shaped, the typical Gaussian curve, and its median stays at 0 as training proceeds (2000, 6000, etc. are iteration counts).

Based on these observations, the paper raises two questions: is BN's effectiveness actually related to ICS, and does BN actually eliminate or reduce ICS?

(1) Is the effectiveness of BatchNorm indeed related to internal covariate shift?
(2) Is BatchNorm's stabilization of layer input distributions even effective in reducing ICS?

The two subsections of this section address these two questions in turn.

2.1 Does BatchNorm's performance stem from controlling internal covariate shift?

The idea of this subsection: if BN works by correcting the shift, then artificially re-introducing a shift right after BN should destroy that benefit. Does BN plus an artificial shift still work as well as plain BN? The paper runs the following experiment, injecting a random noise distribution after each BN layer to create a shift on purpose:

We train networks with random noise injected after BatchNorm layers. Specifically, we perturb each activation for each sample in the batch using i.i.d. noise sampled from a non-zero mean and non-unit variance distribution. We emphasize that this noise distribution changes at each time step.

Figure 2 shows the results:

Note that the noise is time-varying, changing randomly over time, so the network cannot learn to cancel it out (a small sketch of this injection follows below).
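A minimal sketch (my own construction, not the authors' code) of injecting this kind of time-varying noise after a BatchNorm output; the noise mean and standard deviation are re-drawn at every step, so the network cannot simply learn to undo them.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_batchnorm_output(bn_out, rng):
    """Perturb each activation of each sample with i.i.d. noise drawn from a
    non-zero-mean, non-unit-variance distribution that changes every call."""
    shift = rng.uniform(-1.0, 1.0)    # non-zero mean, re-sampled each step
    scale = rng.uniform(1.0, 2.0)     # non-unit std, re-sampled each step
    return bn_out + rng.normal(loc=shift, scale=scale, size=bn_out.shape)

# at each training step, applied right after the BN layer:
bn_out = rng.normal(size=(32, 4))     # stand-in for a BN layer's output
perturbed = noisy_batchnorm_output(bn_out, rng)
```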

The finding is that BN with the added noise performs just as well:

Observe that the performance difference between models with BatchNorm layers, and "noisy" BatchNorm layers is almost non-existent. Also, both these networks perform much better than standard networks.

However, looking at Figure 2, the layer activation distributions of the noisy-BN network are even less stable (the center of the distribution swings wildly left and right):

Moreover, the "noisy" BatchNorm network has qualitatively less stable distributions than even the standard, non-BatchNorm network, yet it still performs better in terms of training.

Zooming in on the figure makes this more obvious, especially for Layer #13:

The paper therefore argues that BN's benefit is unrelated to ICS (even with ICS artificially increased, BN still works just as well), which answers the first question of Section 2:

Clearly, these findings are hard to reconcile with the claim that the performance gain due to BatchNorm stems from increased stability of layer input distributions.

2.2 Is BatchNorm reducing internal covariate shift?

To quantify how much a layer's gradient changes when the preceding layers update their parameters, the paper gives the following definition (Definition 2.1):

The only difference is whether the parameters W_1,...,W_{i-1} are taken at step t or at step t+1; ICS is then defined as \parallel G_{t,i}-G'_{t,i}\parallel_2 , i.e., the Euclidean distance between the two gradients (a small sketch of this measurement follows below).
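A hedged sketch of what this definition measures, on a tiny two-layer linear network with a squared loss so the gradients can be written by hand: G is layer 2's gradient before the preceding layer updates, G' is the same gradient after W_1 has taken its step. Shapes, learning rate, and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))    # one sample, 2 targets
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))  # two linear layers
lr = 0.1

def grad_W2(W1, W2):
    """Gradient of 0.5*||W2 @ W1 @ x - y||^2 with respect to W2."""
    h = W1 @ x
    return (W2 @ h - y) @ h.T

def grad_W1(W1, W2):
    """Gradient of the same loss with respect to W1."""
    return W2.T @ (W2 @ W1 @ x - y) @ x.T

G = grad_W2(W1, W2)                       # layer 2's gradient at step t
W1_next = W1 - lr * grad_W1(W1, W2)       # the preceding layer takes its update
G_prime = grad_W2(W1_next, W2)            # same gradient, W1 now at step t+1
print(np.linalg.norm(G - G_prime))        # ICS as defined above: ||G - G'||_2
```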

To isolate the effect of non-linear activation functions, the experiments use a 25-layer deep linear network (DLN); the results are shown in Figure 3:

One metric is the \ell_2 -Difference, the Euclidean distance from Definition 2.1; 0 is ideal, meaning the two gradient vectors are identical. The other is Cos\ Angle ; 1 is ideal, meaning the two vectors point in the same direction, i.e., the angle between them is 0 (both computations are shown in the sketch below).
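The two metrics can be computed from G and G' (for example, the ones in the sketch above) as follows; 0 for the \ell_2 -difference and 1 for the cosine angle would mean the two gradients are identical.

```python
import numpy as np

def ell2_difference(G, G_prime):
    """Euclidean distance between the two (flattened) gradients; 0 is ideal."""
    return np.linalg.norm(G - G_prime)

def cos_angle(G, G_prime):
    """Cosine of the angle between the two gradients; 1 is ideal."""
    g, gp = G.ravel(), G_prime.ravel()
    return float(g @ gp / (np.linalg.norm(g) * np.linalg.norm(gp)))
```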

The results show that BN does not reduce ICS; if anything, it increases it to some degree. This answers the second question of Section 2:

This evidence suggests that, from optimization point of view, controlling the distributions of layer inputs as done in BatchNorm, might not even reduce the internal covariate shift.

3 Why does BatchNorm work?

3.1 The smoothing effect of BatchNorm

The paper argues that BN works by making the loss landscape smoother:

Indeed, we identify the key impact that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make its landscape significantly more smooth.

Footnote 4 in Appendix A notes that ResNets have an effect similar to BatchNorm, so here is a ResNet figure to illustrate what a smoother loss landscape looks like:

The loss landscape with residual connections is clearly much smoother. The paper quantifies this smoothness using the L -Lipschitz condition, |f(x_1)-f(x_2)|\leq L \parallel x_1-x_2\parallel . For L -Lipschitz functions see 利普希茨連續 (Lipschitz continuity); my simple intuition is that L is an upper bound on the absolute value of the slope (a rough numerical check follows below).
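A rough numerical check of this intuition (my own toy example, not from the paper): sample pairs of points and take the largest ratio |f(x_1)-f(x_2)| / |x_1-x_2| as an empirical lower bound on L.

```python
import numpy as np

def empirical_lipschitz(f, xs):
    """Largest |f(x1)-f(x2)| / |x1-x2| over all sampled pairs: an empirical
    lower bound on the Lipschitz constant L of a 1-D function."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            best = max(best, abs(f(xs[i]) - f(xs[j])) / abs(xs[i] - xs[j]))
    return best

xs = np.linspace(-3.0, 3.0, 50)
print(empirical_lipschitz(np.sin, xs))   # close to 1, since |sin'(x)| = |cos(x)| <= 1
```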

In addition, the paper uses a stronger smoothness notion that bounds how fast the slope itself can change, \beta -smoothness: \parallel \nabla f(x_1)-\nabla f(x_2)\parallel\leq\beta\parallel x_1-x_2\parallel . Footnote 3 notes that \beta -smoothness cannot be bounded by a single global constant:

It is worth noting that, due to the existence of non-linearities, one should not expect the \beta-smoothness to be bounded in an absolute, global sense.

The paper attributes this to the non-linearities. Here is a simple example: for ReLU, f(x)=\max(x,0) , so f'(x)=\begin{cases} 0, & x<0 \\ 1, & x\geq0 \end{cases} , and f''(0)=\lim_{\Delta x\rightarrow0}\frac{f'(\Delta x)-f'(-\Delta x)}{2\Delta x}=\lim_{\Delta x\rightarrow0}\frac{1-0}{2\Delta x}=\infty , i.e., positive infinity (a numerical check of this blow-up follows below).
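The same blow-up can be checked numerically: across the kink of ReLU at 0, the gradient-difference ratio grows like 1/(2\Delta x) , so no finite \beta works globally. A tiny illustration of my own:

```python
relu_grad = lambda x: float(x >= 0)   # derivative of max(x, 0)

for dx in (1e-1, 1e-3, 1e-6):
    x1, x2 = dx, -dx                  # two points straddling the kink at 0
    ratio = abs(relu_grad(x1) - relu_grad(x2)) / abs(x1 - x2)
    print(dx, ratio)                  # equals 1 / (2*dx), unbounded as dx -> 0
```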

What the paper means by smoother is that both of these quantities are small:

That is, the loss changes at a smaller rate and the magnitudes of the gradients are smaller too.

The paper cites the view of Visualizing the Loss Landscape of Neural Nets: a non-smooth loss landscape is bad for training and demands carefully chosen learning rates and initializations:

This makes gradient descent-based training algorithms unstable, e.g., due to exploding or vanishing gradients, and thus highly sensitive to the choice of the learning rate and initialization.

3.2 Exploration of the optimization landscape

The paper plots the loss, the gradient, and the \beta -smoothness over the course of training, as shown in the figure below:

Note that Figures 4(a) and 4(b) show shaded regions rather than single curves: at each step there is a maximum and a minimum value, which the paper calls the variation. I don't fully understand what this means; it feels different from my picture of the loss landscape. My best guess is sketched below.
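My best guess at what the shaded band means (an assumption on my part, not something spelled out in the figure itself): at each training step, walk from the current parameters along the gradient direction with a range of step sizes, record the loss at each point, and plot the min-to-max range. A sketch of that measurement on a toy quadratic:

```python
import numpy as np

def loss_variation_along_gradient(loss_fn, grad_fn, params, step_sizes):
    """Move along the current gradient direction with several step sizes and
    return the (min, max) of the losses seen: one point of the shaded band."""
    g = grad_fn(params)
    losses = [loss_fn(params - eta * g) for eta in step_sizes]
    return min(losses), max(losses)

loss_fn = lambda w: float(np.sum(w ** 2))          # toy quadratic loss
grad_fn = lambda w: 2 * w
w = np.array([1.0, -2.0])
print(loss_variation_along_gradient(loss_fn, grad_fn, w, np.linspace(0.05, 0.4, 8)))
```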

3.3 Is BatchNorm the best (only?) way to smoothen the landscape?

Having attributed BN's benefit to smoothing the loss landscape, the natural next step is to use landscape smoothness as the design criterion for an algorithm that improves on BN.

After normalizing with an \ell_p -norm, the distributions are no longer Gaussian and the shift is still severe, yet the results are just as good (a rough sketch of one such scheme follows below):
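A rough sketch of one such \ell_p -style scheme, purely as an illustration: centre each activation over the batch, then divide by a p-norm-based spread instead of the standard deviation. This is my guess at the general shape of such a scheme; the paper's exact formulations may differ.

```python
import numpy as np

def lp_style_norm(x, gamma, beta, p=1, eps=1e-5):
    """Centre each feature over the batch, then divide by a p-norm-based
    spread measure instead of the standard deviation (illustrative only)."""
    centred = x - x.mean(axis=0)
    spread = (np.abs(centred) ** p).mean(axis=0) ** (1.0 / p)
    return gamma * centred / (spread + eps) + beta

x = np.random.randn(32, 4) * 2.0 + 0.5
out = lp_style_norm(x, np.ones(4), np.zeros(4), p=1)
print(out.mean(axis=0), out.std(axis=0))   # zero mean; spread near, but not exactly, 1
```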

The paper's conclusion is that BN's effect is not unique:

We observe that all the normalization strategies offer comparable performance to BatchNorm.

So it may be possible to design an even better normalization scheme:

Therefore, it might be valuable to perform a principled exploration of the design space of similar normalization schemes as it can lead to even better performance.

4 Theoretical Analysis

This section gives the theoretical analysis.

The analysis uses a deep neural network (DNN) with no non-linear activation functions, so the whole mapping is linear; that said, the paper points out that the analysis does not require the entire network to be linear:

note that our analysis does not necessitate that the entire network be fully linear.

4.1 Setup

The comparison uses the networks shown in Figure 7: on the left, a plain linear DNN; on the right, the same network with a BN layer added (a minimal sketch of this comparison follows below):
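A minimal sketch of the kind of comparison Figure 7 sets up: the same deep linear network run with and without a BatchNorm-style normalization after each layer (the depth, widths, and the omission of trainable gamma/beta are my own simplifications).

```python
import numpy as np

def deep_linear_net(x, weights, use_bn=False, eps=1e-5):
    """A deep linear network; optionally normalize each layer's output over
    the mini-batch, mimicking the BN-augmented network in Figure 7."""
    h = x
    for W in weights:
        h = h @ W
        if use_bn:
            h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 16)) * 0.5 for _ in range(5)]
x = rng.normal(size=(32, 16))
print(deep_linear_net(x, weights).std(), deep_linear_net(x, weights, use_bn=True).std())
```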

4.2 Theoretical Results

There are too many formulas here and they make my head hurt; reason tells me to save this part for later.


5 Related work

This section covers alternatives to BN and some of BN's other benefits; see 深度学习中的Normalization模型 - 极市博客 for more.


6 Conclusions

Nothing much to say here.


[Completed]
