PyTorch RNN: variable-length inputs and padding

When using RNNs (including LSTM, GRU, etc.) for NLP tasks, the shorter sentences within a batch generally need to be padded to a common length, and these padding tokens should not contribute to the output, hidden state, loss, and so on.

TensorFlow already supports variable-length input for RNNs: you simply pass the length of each sequence in the batch when calling the RNN, e.g. RNN(input, length...). In PyTorch this takes a bit more work, especially when the input is a pair of sequences; a brief walkthrough follows.


In what follows, input, x, x_1 and x_2 all refer to a batch of sequences.


1. When there is only one input, i.e. forward takes a single x: def forward(self, x): ..., things are simple: just wrap x with torch.nn.utils.rnn.pack_padded_sequence(input, lengths) before calling the RNN. Note that the sequences in x must be sorted by length in descending order. For example:

x_emb = self.emb(x)
x_emb_p = torch.nn.utils.rnn.pack_padded_sequence(x_emb, xlen, batch_first=True)
out_pack, (ht, ct) = self.rnn(x_emb_p, None)

The returned ht and ct are then the hidden state and cell state with the padded positions excluded, both of type Variable. The returned output, however, is a PackedSequence; you can use:

out = torch.nn.utils.rnn.pad_packed_sequence(out_pack, batch_first=True)

to convert it into a tuple: (padded sequence, sequence lengths).
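Putting the pieces together, here is a minimal, self-contained sketch of case 1 (the vocabulary, sizes and batch below are made up for illustration; the batch is assumed to be already sorted by length in descending order):

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

emb = nn.Embedding(100, 32, padding_idx=0)       # hypothetical vocab / embedding size
rnn = nn.LSTM(32, 64, batch_first=True)

# a padded batch of 3 sequences, already sorted by length (descending); 0 is the padding id
x = Variable(torch.LongTensor([[5, 6, 7, 8],
                               [3, 4, 9, 0],
                               [2, 1, 0, 0]]))
xlen = [4, 3, 2]                                 # true lengths, descending

x_emb = emb(x)                                   # (batch, seq_len, emb_size)
x_emb_p = pack_padded_sequence(x_emb, xlen, batch_first=True)
out_pack, (ht, ct) = rnn(x_emb_p, None)          # padded positions never enter the LSTM
out, lengths = pad_packed_sequence(out_pack, batch_first=True)
# out: (3, 4, 64); positions beyond each true length are filled with zeros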


2. When the input is a pair, i.e. forward takes two inputs: def forward(self, x_1, x_2): ..., and the sequences in x_1 and x_2 correspond one-to-one as pairs (for example, in a semantic-similarity task you often feed in two sentences, i.e. a pair, and compute their semantic similarity). Here pack_padded_sequence alone no longer does the job: as mentioned above, it requires the sequences within a batch to be sorted by length in descending order, and sorting x_1 and x_2 independently would break the pairwise correspondence between them. Instead, proceed as follows:

  • Sort the sequences in x_1 by length: x_1 --> x_1_sorted
  • Wrap x_1_sorted with pack_padded_sequence and feed it into the RNN
  • Unsort the resulting ht and ct (the inverse of the sort: [2,1,3] -sort-> [1,2,3] -unsort-> [2,1,3]) so that each state tensor goes back to its original position; see the short sketch after this list. For the output, first apply torch.nn.utils.rnn.pad_packed_sequence(output_pack) and then unsort.
  • Do the same for x_2; the resulting output, ht and ct of x_1 and x_2 are then aligned with each other again.
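The unsort index is simply the inverse permutation of the sort index and can be obtained by applying np.argsort twice; a tiny sketch with made-up lengths:

import numpy as np

x1_len = np.array([3, 5, 2])             # made-up lengths
x1_sort_idx = np.argsort(-x1_len)        # sorts lengths descending: [1, 0, 2]
x1_unsort_idx = np.argsort(x1_sort_idx)  # inverse permutation:      [1, 0, 2]
assert (x1_len[x1_sort_idx][x1_unsort_idx] == x1_len).all()  # original order restored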

An example forward pass:

def forward(self, x_1, x_2, x1_len, x2_len):
    # x_1, x_2: padded batches of token ids; x1_len, x2_len: numpy arrays of the true sequence lengths
    """sort"""
    x1_sort_idx = np.argsort(-x1_len)
    x_1 = x_1[x1_sort_idx]
    x1_len = x1_len[x1_sort_idx]
    x1_unsort_idx = np.argsort(x1_sort_idx)

    x2_sort_idx = np.argsort(-x2_len)
    x_2 = x_2[x2_sort_idx]
    x2_len = x2_len[x2_sort_idx]
    x2_unsort_idx = np.argsort(x2_sort_idx)

    """to Variable"""
    x_1 = Variable(torch.LongTensor(x_1))
    x_2 = Variable(torch.LongTensor(x_2))

    """embedding"""
    x_1_emb = self.emb(x_1)
    x_2_emb = self.emb(x_2)

    """pack"""
    x_1_emb_p = torch.nn.utils.rnn.pack_padded_sequence(x_1_emb, x1_len, batch_first=True)
    x_2_emb_p = torch.nn.utils.rnn.pack_padded_sequence(x_2_emb, x2_len, batch_first=True)

    """encode"""
    out_1_pack, (ht_1, ct_1) = self.rnn(x_1_emb_p, None)
    out_2_pack, (ht_2, ct_2) = self.rnn(x_2_emb_p, None)
    """unpack: out"""
    # out_2 = torch.nn.utils.rnn.pad_packed_sequence(out_2_pack, batch_first=True)  # (padded sequence, lengths)
    # out_2 = out_2[0]  # includes the states produced at padded positions: they are all zeros

    state_1 = ht_1[-1, :, :]  # last layer's hidden state: (batch, hidden_size)
    state_2 = ht_2[-1, :, :]
    """unsort:state"""
    state_1 = state_1[torch.LongTensor(x1_unsort_idx)]
    state_2 = state_2[torch.LongTensor(x2_unsort_idx)]
    .......

The overall flow is:

batch seq -> sort -> pad and pack -> process using RNN -> unpack -> unsort

To make this more convenient, I wrapped the whole pipeline up so that, just like TensorFlow's RNN, you can encode variable-length input by passing in the input and the lengths directly:

# coding=utf-8
import torch
import torch.nn as nn
import numpy as np

class LSTM_V(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, bias=True, batch_first=True, dropout=0,
                 bidirectional=False, only_use_last_hidden_state=False):
        """
        LSTM which can hold variable length sequence, use like TensorFlow's RNN(input, length...).

        :param input_size:The number of expected features in the input x
        :param hidden_size:The number of features in the hidden state h
        :param num_layers:Number of recurrent layers.
        :param bias:If False, then the layer does not use bias weights b_ih and b_hh. Default: True
        :param batch_first:If True, then the input and output tensors are provided as (batch, seq, feature)
        :param dropout:If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer
        :param bidirectional:If True, becomes a bidirectional RNN. Default: False
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.batch_first = batch_first
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.only_use_last_hidden_state = only_use_last_hidden_state
        self.LSTM = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bias=bias,
            batch_first=batch_first,
            dropout=dropout,
            bidirectional=bidirectional
        )

    def run(self, x, x_len):
        """
        sequence -> sort -> pad and pack ->process using RNN -> unpack ->unsort

        :param x: Variable
        :param x_len: numpy list
        :return:
        """
        """sort"""
        x_sort_idx = np.argsort(-x_len)
        x_unsort_idx = torch.LongTensor(np.argsort(x_sort_idx))
        x_len = x_len[x_sort_idx]
        x = x[torch.LongTensor(x_sort_idx)]  # note: indexing dim 0 here assumes batch_first=True
        """pack"""
        x_emb_p = torch.nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=self.batch_first)
        """process using RNN"""
        out_pack, (ht, ct) = self.LSTM(x_emb_p, None)
        """unsort: h"""
        ht = torch.transpose(ht, 0, 1)[
            x_unsort_idx]  # (num_layers * num_directions, batch, hidden_size) -> (batch, ...)
        ht = torch.transpose(ht, 0, 1)

        if self.only_use_last_hidden_state:
            return ht
        else:
            """unpack: out"""
            out = torch.nn.utils.rnn.pad_packed_sequence(out_pack, batch_first=self.batch_first)  # (padded sequence, lengths)
            out = out[0]  # padded output; positions beyond each true length are zeros
            """unsort: out c"""
            out = out[x_unsort_idx]  # as above, indexing dim 0 assumes batch_first=True
            ct = torch.transpose(ct, 0, 1)[
                x_unsort_idx]  # (num_layers * num_directions, batch, hidden_size) -> (batch, ...)
            ct = torch.transpose(ct, 0, 1)

            return out, (ht, ct)

Usage: create an LSTM_V object and call .run(x, x_len). The sequences in x do not need to be sorted; just pass in the lengths of the sequences.
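For example (hypothetical sizes, with a random tensor standing in for a real embedded batch):

import torch
import numpy as np
from torch.autograd import Variable

lstm = LSTM_V(input_size=32, hidden_size=64, batch_first=True)

x = Variable(torch.randn(3, 5, 32))    # padded, already-embedded batch: (batch, max_seq_len, input_size)
x_len = np.array([2, 5, 3])            # true lengths, in the original (unsorted) batch order

out, (ht, ct) = lstm.run(x, x_len)
# out: (3, 5, 64), rows beyond each true length are zeros
# ht, ct: (num_layers * num_directions, 3, 64), restored to the original batch order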

-----

P.S. Written in a hurry.

Edited on 2017-10-07