Reading the PyTorch Source Code to Learn RNNs (1)


PyTorch's RNN implementation comes in two flavors: 1) a GPU version and 2) a CPU version. Since the GPU version simply calls cuDNN's RNN API, we will skip it here. This article walks through how PyTorch 0.2.0 implements the CPU version of the RNN model.

RNN, or more precisely torch.nn.RNN, implements the simple recurrent neural network (SRNN) proposed by Jeffrey Elman in 1990, more widely known as the Elman network.

The RNN hidden state is computed as follows:

\[h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{t-1} + b_{hh})\]

The model contains two weight matrices, w_{ih} and w_{hh}, and two bias vectors, b_{ih} and b_{hh}.

So that's all an RNN is. Whenever RNNs come up, papers and blog posts pile on words like sequence, recursion, and recurrence, but at its core the model contains only these four parameters, no matter how long the sequence is. Simple indeed~ The so-called training process is just the search for the optimal values of these four parameters! (Note: strictly speaking, this holds for a single-layer, unidirectional RNN.)
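You can check this directly through the public nn.RNN API. A minimal sketch (the sizes 10 and 20 are made up for illustration):

import torch.nn as nn

# A single-layer, unidirectional RNN has exactly four learnable tensors.
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1)
for name, param in rnn.named_parameters():
    print(name, tuple(param.shape))
# weight_ih_l0 (20, 10)
# weight_hh_l0 (20, 20)
# bias_ih_l0 (20,)
# bias_hh_l0 (20,)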


Before going further, let me raise three small questions.

  1. PyTorch's RNN supports only two activation functions: tanh and ReLU. Why not sigmoid or the various ReLU variants? As mentioned above, PyTorch's GPU version is not implemented from scratch but delegates to cuDNN, and cuDNN's RNN supports only tanh and ReLU. For consistency, the CPU version of the RNN also supports only tanh and ReLU.
  2. Many blog posts, when explaining RNNs, say something like "at each time step the RNN takes x_t and the previous hidden state h_{t-1} as input, and outputs the current hidden state h_t and the model output y_t". But as the hidden state formula shows, the RNN outputs only h_t; there is no such thing as y_t. People who say this are conflating two things: when an RNN is applied to a concrete task, say sentiment classification, we usually do not map h_t directly to the class label. Instead, a fully connected layer is attached after h_t, and the output of that fully connected layer is the predicted class y_t. So keep the RNN's h_t and the classifier's y_t apart.
  3. Anyone who has used PyTorch probably knows that, although different layers take inputs of different dimensionality, the first dimension is usually batch_size: torch.nn.Linear takes (batch_size, in_features), torch.nn.Conv2d takes (batch_size, C_{in}, H_{in}, W_{in}). The RNN input, however, is (seq_len, batch_size, input_size), with batch_size in the second dimension! You can swap batch_size and seq_len simply by setting batch_first to True, but why isn't the RNN input batch first by default? Same reason as above: cuDNN's RNN API puts batch_size in the second dimension! Going one step further, why does cuDNN do that? Because batch first means that when the model input (a tensor) is laid out in memory, the first sequence is stored first, then the second sequence, and so on; with seq_len first, the first element of every sequence is stored first, then the second element of every sequence, and so on. The difference between the two is shown in the figure below:

batch first vs seq_len first

With seq_len first, the input elements of different sequences at the same time step are adjacent in memory, which is what makes truly batched computation possible.
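As a quick illustration of the two layouts through the public API (a sketch with arbitrary sizes, written against the current tensor API rather than the 0.2.0 Variable API):

import torch
import torch.nn as nn

seq_len, batch_size, input_size, hidden_size = 7, 3, 10, 20

# Default layout: (seq_len, batch_size, input_size)
rnn = nn.RNN(input_size, hidden_size)
out, h_n = rnn(torch.randn(seq_len, batch_size, input_size))
print(out.size())     # (7, 3, 20)

# batch_first=True: (batch_size, seq_len, input_size)
rnn_bf = nn.RNN(input_size, hidden_size, batch_first=True)
out_bf, h_n_bf = rnn_bf(torch.randn(batch_size, seq_len, input_size))
print(out_bf.size())  # (3, 7, 20)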


Before reading the RNN source code, let's review one piece of Python: closures. See the linked article.
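In case the link is not handy, here is a minimal closure example: the inner function keeps a reference to a variable of the enclosing function even after the enclosing function has returned.

def make_adder(n):
    def add(x):        # `add` captures `n` from the enclosing scope: a closure
        return x + n
    return add

add3 = make_adder(3)   # make_adder has returned, but n=3 lives on inside add3
print(add3(10))        # 13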


Okay, let's start reading the RNN source code (torch/nn/modules/rnn.py).

In PyTorch, RNN, LSTM, and GRU all inherit from the same base class, RNNBase, and the three differ only slightly in their constructors (__init__):

Take RNN as an example:

class RNN(RNNBase):
    def __init__(self, *args, **kwargs):
        if 'nonlinearity' in kwargs:
            if kwargs['nonlinearity'] == 'tanh':
                mode = 'RNN_TANH'
            elif kwargs['nonlinearity'] == 'relu':
                mode = 'RNN_RELU'
            else:
                raise ValueError("Unknown nonlinearity '{}'".format(
                    kwargs['nonlinearity']))
            del kwargs['nonlinearity']
        else:
            mode = 'RNN_TANH'

        super(RNN, self).__init__(mode, *args, **kwargs)


The constructor does only one thing: it declares whether the RNN's mode is RNN_TANH or RNN_RELU. No other methods are defined.
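A quick sketch of how the nonlinearity argument maps onto the mode (mode is stored on the module by RNNBase, as the forward code below also shows; the sizes are arbitrary):

import torch.nn as nn

rnn_tanh = nn.RNN(10, 20)                       # default nonlinearity='tanh'
rnn_relu = nn.RNN(10, 20, nonlinearity='relu')
print(rnn_tanh.mode, rnn_relu.mode)             # RNN_TANH RNN_RELU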


Next, let's look at RNNBase. The core of its __init__():

# For every layer and every direction of the RNN, create a set of parameters: w_ih, w_hh, b_ih, b_hh.
# All parameters are then attached to the module as attributes via setattr().
self._all_weights = []
for layer in range(num_layers):
    for direction in range(num_directions):
        # compute current layer's input size
        layer_input_size = input_size if layer == 0 else hidden_size * num_directions
        
        w_ih = Parameter(torch.Tensor(hidden_size, layer_input_size))
        w_hh = Parameter(torch.Tensor(hidden_size, hidden_size))
        b_ih = Parameter(torch.Tensor(hidden_size))
        b_hh = Parameter(torch.Tensor(hidden_size))
        layer_params = (w_ih, w_hh, b_ih, b_hh) # current layer's params

        suffix = '_reverse' if direction == 1 else ''
        param_names = ['weight_ih_l{}{}', 'weight_hh_l{}{}'] 
        if bias:
            param_names += ['bias_ih_l{}{}', 'bias_hh_l{}{}']
        # current layer's param name, e.g. [weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0]
        param_names = [x.format(layer, suffix) for x in param_names]

        for name, param in zip(param_names, layer_params):
            setattr(self, name, param)  # equivalent to self.<name> = param: add the parameter as an instance attribute
        self._all_weights.append(param_names)

self.reset_parameters()
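To see the naming scheme this loop produces, here is a sketch that instantiates a 2-layer bidirectional RNN and lists its parameter names and shapes (sizes are made up; note that layer 1's w_ih has 40 input features, because layer_input_size = hidden_size * num_directions once layer 0 is bidirectional):

import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)
for name, p in rnn.named_parameters():
    print(name, tuple(p.shape))
# weight_ih_l0 (20, 10), weight_hh_l0 (20, 20), bias_ih_l0 (20,), bias_hh_l0 (20,),
# weight_ih_l0_reverse (20, 10), ... then for layer 1: weight_ih_l1 (20, 40), weight_hh_l1 (20, 20), ...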

The reset_parameters() method initializes the RNN's parameters:

def reset_parameters(self):
    stdv = 1.0 / math.sqrt(self.hidden_size)
    for weight in self.parameters():
        init.uniform_(weight, -stdv, stdv)

As you can see, all the weights w and biases b are randomly initialized from a uniform distribution. Why initialize them this way? In the thread Weight initialization when using ReLUs, I found the advice on neural network initialization that PyTorch core developer Soumith Chintala gave at the time (September 2014):

“I initialized my weights with a uniform distribution, mean 0 and std-deviation such that the output neurons would be reasonably bounded for the next layer (so this depended on fanin and fanout)”

“anyways, for most practical purposes, I found the torch defaults to work well.

For conv layers:

stdv = 1/math.sqrt(self.kW*self.kH*self.nInputPlane)

For linear layers:

stdv = 1./math.sqrt(inputSize)”

And an RNN is, at its core, just linear layers.
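A minimal check (written against the current public API) that every parameter indeed lands in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)]:

import math
import torch.nn as nn

hidden_size = 20
rnn = nn.RNN(input_size=10, hidden_size=hidden_size)
stdv = 1.0 / math.sqrt(hidden_size)
for name, p in rnn.named_parameters():
    assert -stdv <= p.min().item() and p.max().item() <= stdv, name
print("all parameters lie in [-{:.4f}, {:.4f}]".format(stdv, stdv))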

// If you are interested in how neural network parameters are initialized, I strongly recommend reading these two threads: 1) Weight initialization when using ReLUs 2) weight initialization discussion


Moving on to RNNBase's forward method. RNNs process all kinds of sequences (a sentence, an article), and these sequences usually differ in length, i.e. they are variable-length sequences. For now we only analyze the simplest case: all sequences within a batch have the same length.

def forward(self, input, hx=None):
    batch_sizes = None # input is not a PackedSequence, so batch_sizes stays None
    max_batch_size = input.size(0) if self.batch_first else input.size(1) # batch_size
    
    if hx is None: # the caller may omit the hidden state; an all-zero one is created automatically
        num_directions = 2 if self.bidirectional else 1
        hx = torch.autograd.Variable(input.data.new(self.num_layers *
                                                    num_directions,
                                                    max_batch_size,
                                                    self.hidden_size).zero_())
        if self.mode == 'LSTM': # h_0, c_0
            hx = (hx, hx)
   
    flat_weight = None # if cpu

    func = self._backend.RNN( # self._backend = thnn_backend # backend = THNNFunctionBackend(), FunctionBackend
        self.mode,
        self.input_size,
        self.hidden_size,
        num_layers=self.num_layers,
        batch_first=self.batch_first,
        dropout=self.dropout,
        train=self.training,
        bidirectional=self.bidirectional,
        batch_sizes=batch_sizes,
        dropout_state=self.dropout_state,
        flat_weight=flat_weight
    )
    output, hidden = func(input, self.all_weights, hx)

    return output, hidden

As you can see, when running an RNN you do not have to pass h_0; PyTorch will automatically create an all-zero h_0.
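A short sketch (current tensor API, arbitrary sizes) showing that omitting h_0 gives the same result as passing an all-zero h_0 yourself:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2)
x = torch.randn(5, 3, 10)                 # (seq_len, batch_size, input_size)

out1, h1 = rnn(x)                         # h_0 omitted -> created as all zeros
h_0 = torch.zeros(2, 3, 20)               # (num_layers * num_directions, batch, hidden)
out2, h2 = rnn(x, h_0)                    # explicit all-zero h_0

print(torch.allclose(out1, out2), torch.allclose(h1, h2))  # True True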

The most important part of forward, the part that actually carries out the forward computation, is the following two statements:

func = self._backend.RNN( 
        self.mode,
        self.input_size,
        self.hidden_size,
        num_layers=self.num_layers,
        batch_first=self.batch_first,
        dropout=self.dropout,
        train=self.training,
        bidirectional=self.bidirectional,
        batch_sizes=batch_sizes,
        dropout_state=self.dropout_state,
        flat_weight=flat_weight
    )
output, hidden = func(input, self.all_weights, hx)

Remember the closures mentioned earlier? Here func is a closure. Why? Just look at the source of the backend RNN function (torch/nn/_functions/rnn.py):

def RNN(*args, **kwargs):
    def forward(input, *fargs, **fkwargs):
        func = AutogradRNN(*args, **kwargs) # if no gpu, RNN=AutogradRNN
        # this func is itself a closure too
        return func(input, *fargs, **fkwargs) 

    return forward

Aha~ the func mentioned above is indeed a closure: it is the forward function defined inside RNN. Calling func(input, self.all_weights, hx) is therefore equivalent to calling AutogradRNN(mode, input_size, hidden_size, ...)(input, self.all_weights, hx).

Note that the func inside RNN's forward is itself a closure as well~

Let's keep going with AutogradRNN and see how the RNN model is actually implemented:

def AutogradRNN(mode, input_size, hidden_size, num_layers=1, batch_first=False,
                dropout=0, train=True, bidirectional=False, batch_sizes=None,
                dropout_state=None, flat_weight=None):

    if mode == 'RNN_RELU':
        cell = RNNReLUCell
    elif mode == 'RNN_TANH':
        cell = RNNTanhCell
    elif mode == 'LSTM':
        cell = LSTMCell
    elif mode == 'GRU':
        cell = GRUCell
    else:
        raise Exception('Unknown mode: {}'.format(mode))

    rec_factory = Recurrent
    if bidirectional:
        layer = (rec_factory(cell), rec_factory(cell, reverse=True)) # (forward returned by Recurrent, forward returned by Recurrent with reverse=True)
    else:
        layer = (rec_factory(cell),) # Recurrent(RNNTanhCell)
    # func is another closure o..o
    func = StackedRNN(layer, 
                      num_layers,
                      (mode == 'LSTM'),
                      dropout=dropout,
                      train=train)

    def forward(input, weight, hidden):
        if batch_first and batch_sizes is None:
            input = input.transpose(0, 1) # even if the input is batch_first, it is converted to seq_len-first internally

        nexth, output = func(input, hidden, weight)

        if batch_first and batch_sizes is None:
            output = output.transpose(0, 1)

        return output, nexth

    return forward

AutogradRNN once again wraps the code that really performs the RNN computation in a closure... One thing to note: even if the RNN's input is batch first, it is converted to seq_len first internally.
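In other words, batch_first only changes the layout at the interface, not the computation. A sketch (current public API; the load_state_dict step simply gives the two modules identical weights):

import torch
import torch.nn as nn

rnn_seq = nn.RNN(10, 20)                          # expects (seq_len, batch, input)
rnn_bat = nn.RNN(10, 20, batch_first=True)        # expects (batch, seq_len, input)
rnn_bat.load_state_dict(rnn_seq.state_dict())     # same weights in both modules

x = torch.randn(5, 3, 10)                         # seq_len-first data
out_seq, h_seq = rnn_seq(x)
out_bat, h_bat = rnn_bat(x.transpose(0, 1))       # same data, batch-first layout

print(torch.allclose(out_seq, out_bat.transpose(0, 1)))  # True
print(torch.allclose(h_seq, h_bat))                      # True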


So let's move on to StackedRNN:

def StackedRNN(inners, num_layers, lstm=False, dropout=0, train=True):

    num_directions = len(inners) # 2 or 1
    total_layers = num_layers * num_directions

    def forward(input, hidden, weight):
        assert(len(weight) == total_layers)
        next_hidden = []

        if lstm:
            hidden = list(zip(*hidden))

        for i in range(num_layers):
            all_output = []
            for j, inner in enumerate(inners):
                l = i * num_directions + j

                hy, output = inner(input, hidden[l], weight[l]) # calls the forward returned by Recurrent()
                next_hidden.append(hy)
                all_output.append(output)

            input = torch.cat(all_output, input.dim() - 1)

            if dropout != 0 and i < num_layers - 1: # dropout only applies between layers of a multi-layer RNN, on the layer's output
                input = F.dropout(input, p=dropout, training=train, inplace=False)

        if lstm:
            next_h, next_c = zip(*next_hidden)
            next_hidden = (
                torch.cat(next_h, 0).view(total_layers, *next_h[0].size()),
                torch.cat(next_c, 0).view(total_layers, *next_c[0].size())
            )
        else:
            next_hidden = torch.cat(next_hidden, 0).view(
                total_layers, *next_hidden[0].size())

        return next_hidden, input

    return forward


Yay~ we have finally found the code that actually runs the forward computation.

Namely, the following lines:

for i in range(num_layers):
    all_output = []
    for j, inner in enumerate(inners):
        l = i * num_directions + j

        hy, output = inner(input, hidden[l], weight[l]) # calls the forward returned by Recurrent()
        next_hidden.append(hy)
        all_output.append(output)

    input = torch.cat(all_output, input.dim() - 1)

    if dropout != 0 and i < num_layers - 1: # dropout only applies between layers of a multi-layer RNN, on the layer's output
        input = F.dropout(input, p=dropout, training=train, inplace=False)
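Note the torch.cat(all_output, input.dim() - 1): the per-direction outputs are concatenated along the last dimension, so a bidirectional layer's output has hidden_size * 2 features, and next_hidden grows to num_layers * num_directions entries. A sketch checking this through the public API (arbitrary sizes):

import torch
import torch.nn as nn

x = torch.randn(5, 3, 10)                         # (seq_len, batch, input_size)
uni = nn.RNN(10, 20, num_layers=2)
bi = nn.RNN(10, 20, num_layers=2, bidirectional=True)

out_uni, h_uni = uni(x)
out_bi, h_bi = bi(x)
print(out_uni.size())  # (5, 3, 20) -- one direction
print(out_bi.size())   # (5, 3, 40) -- forward and backward outputs cat'ed on the last dim
print(h_bi.size())     # (4, 3, 20) -- total_layers = num_layers * num_directions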

For every layer and every direction, the function returned by Recurrent is called to compute one forward pass:

def Recurrent(inner, reverse=False):
    def forward(input, hidden, weight):
        output = []
        steps = range(input.size(0) - 1, -1, -1) if reverse else range(input.size(0)) # steps=[seq_len-1, ...,1,0] or [0,1,...,seq_len-1]
        for i in steps:
            hidden = inner(input[i], hidden, *weight)
            # hack to handle LSTM
            output.append(hidden[0] if isinstance(hidden, tuple) else hidden)

        if reverse:
            output.reverse()
        output = torch.cat(output, 0).view(input.size(0), *output[0].size())

        return hidden, output

    return forward

And the code that actually computes the hidden state at each time step is:

def RNNReLUCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    hy = F.relu(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
    return hy

def RNNTanhCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    hy = F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
    return hy
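To convince ourselves that this single cell really is the whole story, here is a sketch (current public API, made-up sizes) that applies the tanh cell formula by hand to a length-1 sequence and compares it with nn.RNN's output:

import torch
import torch.nn as nn
import torch.nn.functional as F

rnn = nn.RNN(input_size=4, hidden_size=3)         # single layer, tanh
x = torch.randn(1, 2, 4)                          # seq_len=1, batch=2, input_size=4
h0 = torch.zeros(1, 2, 3)

out, h1 = rnn(x, h0)

# Apply the cell once by hand, mirroring RNNTanhCell:
hy = torch.tanh(F.linear(x[0], rnn.weight_ih_l0, rnn.bias_ih_l0) +
                F.linear(h0[0], rnn.weight_hh_l0, rnn.bias_hh_l0))

print(torch.allclose(out[0], hy))                 # True (up to floating-point tolerance)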


OK, at this point we have worked out the RNN's forward computation~ There are still a few special cases we haven't covered; we'll continue next time.


(A flowchart will be inserted here.)


To be continued...
