Notes on the Transformer Model


*Image-heavy*

Our reading group has recently been discussing the "Attention Is All You Need" series of papers, and I didn't fully understand the Transformer model. I then went through many Zhihu notes and blog posts and still couldn't figure out where Q, K, and V come from. I was lucky to finally find the Harvard NLP group's PyTorch implementation, which cleared up half of it:

The Annotated Transformer (nlp.seas.harvard.edu)

A few days ago I also came across a newly published blog post:

The Illustrated Transformer (jalammar.github.io)

which walks through the details of the Transformer with figures. So I decided to merge the two sources and write down the points I found hard to understand at the time, without dwelling on the details of specific NLP tasks. Note: the Harvard NLP code targets PyTorch 0.3; running it on 0.4 requires minor changes.



1. The Overall Architecture

The overall architecture is easy to understand, even though the figure above looks complicated. Simplified: an encoder on the left reads the input, and a decoder on the right produces the output:

My first question at the time was how the output of the encoder on the left is combined with the decoder on the right, since the decoder has N layers. Drawing another diagram makes it intuitive:

In other words, the encoder's output is combined with every layer of the decoder.

The internal structure of the Encoder and Decoder:

2. Details: Multi-Head Attention and Scaled Dot-Product Attention

First, let's understand where Q, K, V in Scaled Dot-Product Attention come from:

My understanding: given an input X, three linear transformations turn X into Q, K, and V.

The original blogger's illustrations show this very clearly; since drawing the grids myself is tedious, I screenshotted them directly:

Input: two words, Thinking and Machines. The embedding layer turns them into two vectors X1, X2 of shape [1x4]. Multiplying each with the three matrices Wq, Wk, Wv of shape [4x3] gives the six vectors {q1, q2}, {k1, k2}, {v1, v2}, each of shape [1x3].
The dot product of {q1, k1} gives a score of 112; the dot product of {q1, k2} gives a score of 96.
The scores are then scaled by dividing by 8 (the square root of the key dimension used in the paper, d_k = 64); the paper explains this keeps the gradients more stable, an engineering consideration. A softmax over the scaled scores [14, 12] gives the weights [0.88, 0.12].
Multiplying the weights [0.88, 0.12] by the values [v1, v2] gives weighted values; summing them gives z1, the output of this layer for the first word. Take a moment to feel this: Q and K are used to compute how much weight 'Thinking' puts on 'Thinking' and on 'Machines'; those weights are multiplied by the V vectors of 'Thinking' and 'Machines', and the weighted V's are summed to give the output Z for each word.
The previous example worked on single vectors; this figure shows the same computation in matrix form. The input is a [2x4] matrix (the word embeddings), each weight matrix is [4x3], and multiplying them gives Q, K, V.
Q is multiplied by the transpose of K and divided by the square root of d_k; a softmax turns the scores into weights that sum to 1, and multiplying by V gives the output Z. This Z for 'Thinking' already takes its surrounding word ('Machines') into account.
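To make the shapes concrete, here is a minimal sketch of the matrix version just described (the numbers are random; only the shapes match the figure, and the division uses sqrt(d_k) = sqrt(3) rather than the 8 from the paper's d_k = 64 setting):

import torch
import torch.nn.functional as F

X = torch.randn(2, 4)                    # 2 words, embedding size 4
Wq, Wk, Wv = (torch.randn(4, 3) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv         # each is [2 x 3]

scores = Q @ K.t() / K.size(-1) ** 0.5   # [2 x 2] word-to-word scores
weights = F.softmax(scores, dim=-1)      # each row sums to 1
Z = weights @ V                          # [2 x 3], one output vector per word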

\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Look at this formula: QK^T actually forms a word-to-word attention map! (After the softmax, each row becomes a set of weights summing to 1.) For example, if the input is the sentence "i have a dream" with 4 words in total, this produces a 4x4 attention map.

This way, every word has a weight with respect to every other word.

Note that in the encoder this is called self-attention, while in the decoder it is called masked self-attention.

The "masked" part means that when doing language modeling (or tasks like translation), the model is not allowed to see future information.

The mask covers the gray region along the diagonal with zeros, so the model cannot see future information.

In other words: "i", as the first word, can only attend to "i" itself. "have", as the second word, attends to "i" and "have". "a", as the third word, attends to the three preceding words "i", "have", "a". Only at the last word, "dream", is there attention over all 4 words of the sentence.

After the softmax it looks like this, with each row summing to 1.
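A minimal sketch of such a mask, in the spirit of the Annotated Transformer's subsequent_mask helper (1 means the position may be attended to, 0 means it is hidden); it is reused in a later example:

import torch

def subsequent_mask(size):
    # lower-triangular matrix: row i has ones only in columns 0..i,
    # so word i can attend to itself and earlier words, never to future ones
    return torch.tril(torch.ones(1, size, size)).type(torch.uint8)

subsequent_mask(4)  # for "i have a dream": a 4x4 lower-triangular mask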


One problem with self-attention: if the input sentence is very long, it forms an NxN attention map, which blows up memory... So you either reduce the batch size and train on multiple GPUs, or truncate the input length; another option is to run a convolution over K and V to reduce their length.

Applying a strided convolution to K and V (a stride of (n, 1) skips along the seq_len dimension only) reduces seq_len without reducing hid_dim, so the final output Z keeps the same shape as before (because Q is unchanged). Masking becomes trickier in this case; the authors use local attention.
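A rough illustration of shortening K with a strided convolution, using nn.Conv1d for simplicity (the tensor2tensor code linked at the end uses a 2-D convolution with stride (n, 1), so this only shows the shape effect):

import torch
import torch.nn as nn

batch, seq_len, hid_dim = 2, 100, 64
K = torch.randn(batch, seq_len, hid_dim)

# Conv1d expects (batch, channels, length), so treat hid_dim as channels
conv = nn.Conv1d(hid_dim, hid_dim, kernel_size=3, stride=3)
K_short = conv(K.transpose(1, 2)).transpose(1, 2)
print(K_short.shape)  # torch.Size([2, 33, 64]): seq_len shrinks, hid_dim stays the same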


Multi-Head Attention simply repeats the process above H times and then combines the output Z's.

(1) After obtaining the 8 output Z's, concatenate them. (2) To make the output match the input shape, multiply by a linear projection W0, which gives (3) the final Z.

PyTorch code:

The implementation makes heavy use of PyTorch's view in many places.
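The code below also needs a few imports and a small clone helper that the post does not show; here is what I assume, mirroring the clones function from the Annotated Transformer:

import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def clone(module, N):
    # stack N identical (deep-copied) layers in a ModuleList
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])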

''' ======== Multi-Head Attention ========'''

class MutiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MutiHeadAttention, self).__init__()
        self.d_k = d_model // h
        self.d_v = d_model // h
        self.d_model = d_model
        self.h = h
        self.W_QKV = clone(nn.Linear(d_model, d_model, bias=False), 3)
        self.W_0 = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)

        # Q.size(): (batch, seq_len, d_model)
        n_batch = Q.size(0)
        # 1) project the inputs: (QW_i, KW_i, VW_i), then split into h heads
        Q, K, V = \
            [linear(x).view(n_batch, -1, self.h, self.d_k).transpose(1, 2)
             for linear, x in zip(self.W_QKV, (Q, K, V))]
        # 2) headi = Attention()
        X = Attention(Q, K, V, mask=mask, dropout=self.dropout)
        # 3) Concat(head1, ..., head_h)
        X = X.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
        # 4) *W0
        X = self.W_0(X)
        return X

''' ======== Scaled Dot-Product Attention ========'''

def Attention(Q, K, V, mask=None, dropout=None):
    '''
    Attention(Q, K, V) = softmax((QK^T)/sqrt(dk))V
    '''
    # Q, K, V have shape (batch, h, seq_len, d_k); dk is the per-head dimension
    dk = K.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(dk)  # (batch, h, seq_len, seq_len)
    if mask is not None:
        # positions where mask == 0 are hidden (large negative value before the softmax)
        scores = scores.masked_fill(mask == 0, -1e9)
    weight = F.softmax(scores, dim=-1)  # softmax over the right-most dimension: rows sum to 1
    if dropout is not None:
        weight = dropout(weight)
    res = torch.matmul(weight, V)  # (batch, h, seq_len, d_k)
    return res
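A quick shape check of the two pieces above (using the imports and clone helper assumed earlier):

mha = MutiHeadAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model)
out = mha(x, x, x)             # self-attention: Q = K = V = x
print(out.shape)               # torch.Size([2, 10, 512]) -- same shape as the input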

With this detail understood, what remains is:

  1. The structure of the Transformer: it consists of an encoder, a decoder, an output layer after the decoder (generator), plus 2 embedding layers (embed).
''' ========== Transformer ========== '''

class Transformer(nn.Module):
    def __init__(self, Encoder, Decoder, src_embed, tgt_embed, Generator):
        super(Transformer, self).__init__()
        self.encoder = Encoder
        self.decoder = Decoder
        self.generator = Generator
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encoding(src, src_mask)
        out = self.decoding(memory, src_mask, tgt, tgt_mask)
        return out

    def encoding(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decoding(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

2. The structure of the Encoder: the Encoder has N layers in total, and each layer works as follows: the input goes through multi-head attention, then add & norm, then a feed-forward network, then another add & norm. Add & norm is simply layer normalization combined with a residual connection; a sketch of the AddNorm module used below follows.
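The AddNorm module itself is not shown in the post; here is a minimal sketch consistent with the description above, following the Annotated Transformer's SublayerConnection (which, for simplicity, applies the norm to the sublayer's input rather than its output):

''' ======== Add & Norm (sketch, not in the original post) ======= '''

class AddNorm(nn.Module):
    # Residual connection around a sublayer, with layer normalization.
    # The norm comes before the sublayer ("pre-norm"), which is why Encoder
    # and Decoder apply one more LayerNorm at the very end.
    def __init__(self, size, dropout):
        super(AddNorm, self).__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))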

''' ======== Encoder layer ======= '''

class EncoderLayer(nn.Module):
    def __init__(self, size, attention, feed_forward, dropout=0.1):
        super(EncoderLayer, self).__init__()

        self.feed_forward = feed_forward
        self.multi_head_attention = attention
        self.add_norm_1 = AddNorm(size, dropout)
        self.add_norm_2 = AddNorm(size, dropout)
        self.size = size

    def forward(self, x, mask):
        output = self.add_norm_1(x, lambda x: self.multi_head_attention(x, x, x, mask))
        output = self.add_norm_2(output, self.feed_forward)
        return output

''' ======== Encoder ======= '''

class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clone(layer, N) # clone the layer for N times
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

3. The Decoder: the Decoder also has N layers, and each layer works as follows: the (shifted) output goes through masked multi-head attention and add & norm, then a second multi-head attention (over the encoder output) and add & norm, then a feed-forward network and a final add & norm to produce the output.

''' ======== Decoder layer ======= '''

class DecoderLayer(nn.Module):
    def __init__(self, size, self_attention, src_attention, feed_forward, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.add_norm_1 = AddNorm(size, dropout)
        self.add_norm_2 = AddNorm(size, dropout)
        self.add_norm_3 = AddNorm(size, dropout)
        self.muti_head_attention = src_attention
        self.masked_muti_head_attention = self_attention
        self.feed_forward = feed_forward

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.add_norm_1(x, lambda x: self.masked_muti_head_attention(x, x, x, tgt_mask))
        x = self.add_norm_2(x, lambda x: self.muti_head_attention(x, m, m, src_mask))
        output = self.add_norm_3(x, self.feed_forward)
        return output

''' ======== Decoder ======= '''

class Decoder(nn.Module):
    def __init__(self, DecoderLayer, N):
        super(Decoder, self).__init__()
        self.layers = clone(DecoderLayer, N) # clone layer for N times
        self.norm = nn.LayerNorm(DecoderLayer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

4. Generator: the output layer is simply a linear projection followed by a softmax (log-softmax in the code).

''' ======== Output Linear + Softmax ======= '''

class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        res = F.log_softmax(self.proj(x), dim=-1)
        return res
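To see how the pieces fit together, here is a hedged sketch of assembling a small model from the classes above. The positional encoding and the position-wise feed-forward network are not shown in this post, so a bare nn.Embedding and an inline two-layer MLP stand in for them; make_toy_model and its defaults are my own illustrative choices, not the Annotated Transformer's make_model.

''' ======== Putting it together (sketch) ======= '''

def make_toy_model(src_vocab, tgt_vocab, N=2, d_model=64, d_ff=256, h=4, dropout=0.1):
    c = copy.deepcopy
    attention = MutiHeadAttention(h, d_model, dropout)
    feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
    return Transformer(
        Encoder(EncoderLayer(d_model, c(attention), c(feed_forward), dropout), N),
        Decoder(DecoderLayer(d_model, c(attention), c(attention), c(feed_forward), dropout), N),
        nn.Embedding(src_vocab, d_model),   # + positional encoding in the real model
        nn.Embedding(tgt_vocab, d_model),   # + positional encoding in the real model
        Generator(d_model, tgt_vocab))

model = make_toy_model(src_vocab=100, tgt_vocab=100)
src = torch.randint(0, 100, (1, 7))      # (batch, src_len)
tgt = torch.randint(0, 100, (1, 5))      # (batch, tgt_len)
src_mask = torch.ones(1, 1, 7)           # nothing to hide on the source side
tgt_mask = subsequent_mask(5)            # hide future target positions (helper sketched earlier)
out = model(src, tgt, src_mask, tgt_mask)    # (1, 5, d_model)
log_probs = model.generator(out)             # (1, 5, tgt_vocab)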


Running the code:

Since the Harvard group's code is based on PyTorch 0.3, it runs on PyTorch 0.4 after minor modifications. I then tried it with 4 GPUs in parallel, training for 10 epochs, which took a bit under 2 hours at roughly 22,000 tokens per second.

Epoch Step: 1 Loss: 9.184584 Tokens per Sec: 1428.097987
Epoch Step: 51 Loss: 8.581259 Tokens per Sec: 22881.229523
Epoch Step: 401 Loss: 5.100830 Tokens per Sec: 22435.471005
Epoch Step: 1 Loss: 4.619835 Tokens per Sec: 35756.163202
4.603617369766699

Epoch Step: 1 Loss: 4.910212 Tokens per Sec: 11098.529115
Epoch Step: 401 Loss: 4.215462 Tokens per Sec: 20742.150697
Epoch Step: 1 Loss: 3.598128 Tokens per Sec: 35914.361411
3.641148615486463

Epoch Step: 1 Loss: 3.718708 Tokens per Sec: 7785.488002
Epoch Step: 401 Loss: 2.940014 Tokens per Sec: 21459.424470
Epoch Step: 1 Loss: 3.128638 Tokens per Sec: 35194.173030
3.177007715838013

Epoch Step: 1 Loss: 3.165454 Tokens per Sec: 9554.168255
Epoch Step: 401 Loss: 3.662171 Tokens per Sec: 22810.586728
Epoch Step: 1 Loss: 2.850439 Tokens per Sec: 35899.257282
2.9338041949162217

Epoch Step: 1 Loss: 3.266672 Tokens per Sec: 10821.854679
Epoch Step: 401 Loss: 1.319928 Tokens per Sec: 22326.982653
Epoch Step: 1 Loss: 2.873256 Tokens per Sec: 35833.428723
2.896158274551431

Epoch Step: 1 Loss: 3.084807 Tokens per Sec: 8893.051986
Epoch Step: 401 Loss: 2.127793 Tokens per Sec: 22557.649552
Epoch Step: 1 Loss: 2.450163 Tokens per Sec: 36131.422063
2.533294169177738

Epoch Step: 1 Loss: 1.723834 Tokens per Sec: 11001.281288
Epoch Step: 401 Loss: 3.326911 Tokens per Sec: 22412.767403
Epoch Step: 1 Loss: 2.420858 Tokens per Sec: 34084.542174
2.5214455853650186

Epoch Step: 1 Loss: 2.098521 Tokens per Sec: 11321.194600
Epoch Step: 401 Loss: 2.315263 Tokens per Sec: 21979.691604
Epoch Step: 1 Loss: 2.251485 Tokens per Sec: 36769.537334
2.350589816463498

Epoch Step: 1 Loss: 2.297822 Tokens per Sec: 8245.470223
Epoch Step: 401 Loss: 2.347269 Tokens per Sec: 23856.052394
Epoch Step: 1 Loss: 2.213846 Tokens per Sec: 34757.977969
2.3151114787357896

Translation:    They 're seeing the dirty technology that 's the dirty life of New York .
Target: So 1860 , they are seeing this dirty technology that is going to choke the life out of New York .

Translation:    And really every day , we fall under the main rule of the ongoing and ongoing and the prisoner of their human rights , the laws of their laws and laws .
Target: And every day , every day we wake up with the rule of the militias and their continuous violations of human rights of prisoners and their disrespect of the rule of law .

Translation:    Because even though we do the same picture changes , our perspective , our perspective , and as they can always see new milestones , and I can see how they see how they deal with their eyes and how they deal with everything they see it .
Target: Because while we take the same photo , our perspectives change , and she reaches new milestones , and I get to see life through her eyes , and how she interacts with and sees everything .

Translation:    If there 's a photographers and there 's a light there , and there 's a nice tube , and we want to go back to a client , " Cameron is now a picture , and then we 're going to go back and go back , and then , and this arm , and then you just
Target: So if the photographer is right there and the light is right there , like a nice <unk> , and the client says , " Cameron , we want a walking shot , " well then this leg goes first , nice and long , this arm goes back , this arm goes forward , the head is at three quarters , and you just go back and forth , just do that , and then you look back at your imaginary friends , 300 , 400 , 500 times .

Summary:

Other details, such as the positional embeddings, the optimizer warm-up, and the learning-rate schedule, feel fairly intuitive to me, so I won't write them up; these notes focus on understanding the Transformer model itself. Google has since built on it for many tasks, for example generating Wikipedia summaries and generative pre-training. Those later models drop the encoder and use only the decoder part.



Update: the Transformer equations for language modeling

The problem:

\[P(w^1,...,w^{n}) = \prod_{j=1}^{n}p(w^j|w^1,...,w^{j-1})\]

Model architecture:

\begin{align*} &h_0 = UW_{\text{embed}} + W_{\text{position}}\\ &h_l = \text{transformer-block}(h_{l-1}) \forall l \in [1,n]\\ &P(u) = \text{softmax}(h_nW_{\text{embed}}^T) \end{align*}

where U = (u_{-k},...,u_{-1}) is the context vector of tokens (as one-hot encodings), n is the number of layers, W_{\text{embed}} is the token embedding matrix, and W_{\text{position}} is the position embedding matrix. Let d_n be the length of the input and d_{\text{model}} be the embedding dimension. Then W_{\text{embed}} \in \mathbb{R}^{d_{\text{ntokens}} \times d_{\text{model}}} and U \in \mathbb{R}^{d_n \times d_{\text{ntokens}}}, so h_0 \in \mathbb{R}^{d_n \times d_{\text{model}}}.
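A minimal sketch of h_0 with concrete shapes (illustrative sizes; an nn.Embedding look-up on token indices is equivalent to multiplying one-hot rows of U by W_embed):

import torch
import torch.nn as nn

d_ntokens, d_model, d_n = 1000, 512, 8     # vocab size, model dim, context length
W_embed = nn.Embedding(d_ntokens, d_model) # rows are the token embeddings
W_position = torch.zeros(d_n, d_model)     # learned in the real model

u = torch.randint(0, d_ntokens, (d_n,))    # context tokens u_{-k}, ..., u_{-1} as indices
h_0 = W_embed(u) + W_position              # (d_n, d_model), i.e. U W_embed + W_position
logits = h_0 @ W_embed.weight.t()          # (d_n, d_ntokens); the model uses h_n here, after the blocks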

Transformer block:

\begin{align*} &\text{transformer-block:}\\ &\text{input: } h_{in}\\ &\text{output: } h_{out}\\ &h_{mid} = \text{LayerNorm}(h_{in} + \text{MultiHead}(h_{in}))\\ &h_{out} = \text{LayerNorm}(h_{mid} + \text{FFN}(h_{mid}))\\ \end{align*}

where h_{in}, h_{out} \in \mathbb{R}^{d_n \times d_{\text{model}}}. d_n is the length of the input and d_{\text{model}} is the dimension of the model (e.g., the embedding dimension). LayerNorm is layer normalization.

MultiHead Attention:

\begin{align*} &\text{MultiHead}(h) = \text{Concat}[head_1,...,head_m] W^O\\ &\text{where } head_i = \text{Attention}(Q_i, K_i, V_i)\\ &\text{where } Q_i, K_i, V_i = hW_i^Q, hW_i^K, hW_i^V \end{align*}

where m is the number of heads, h \in \mathbb{R}^{d_n \times d_{\text{model}}} is the input, W_i^Q \in \mathbb{R}^{d_{\text{model}}\times d_k}, W_i^K \in \mathbb{R}^{d_{\text{model}}\times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}}\times d_v}, and W^O \in \mathbb{R}^{m \cdot d_v \times d_{\text{model}}}. The output of each attention head is head_i \in \mathbb{R}^{d_n \times d_v}, and the output of MultiHead is in \mathbb{R}^{d_n \times d_{\text{model}}}.

Self-Attention:

\[\text{Attention}(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V\]

where Q, K \in \mathbb{R}^{d_n \times d_k} and V \in \mathbb{R}^{d_n \times d_v}.

Position-wise Feed-Forward Network:

\[\text{FFN}(h) = \text{ReLU}(hW_1 + b_1)W_2 + b_2\]

where W_1 \in \mathbb{R}^{d_{\text{model}}\times d_{\text{ff}}}, W_2 \in \mathbb{R}^{d_{\text{ff}}\times d_{\text{model}}}, and h \in \mathbb{R}^{d_n \times d_{\text{model}}}.
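These equations map almost directly onto code. Here is a sketch of one transformer-block for the language model, reusing the MutiHeadAttention module from earlier in the post (it keeps the post-LayerNorm order exactly as the equations are written, whereas the encoder/decoder code above normalizes before each sublayer; shapes here include a batch dimension):

''' ======== transformer-block for language modeling (sketch) ======= '''

class TransformerBlockLM(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout=0.1):
        super(TransformerBlockLM, self).__init__()
        self.masked_multi_head = MutiHeadAttention(h, d_model, dropout)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)

    def forward(self, h_in, mask):
        # h_mid = LayerNorm(h_in + MultiHead(h_in)); the mask hides future positions
        h_mid = self.norm_1(h_in + self.masked_multi_head(h_in, h_in, h_in, mask))
        # h_out = LayerNorm(h_mid + FFN(h_mid))
        return self.norm_2(h_mid + self.ffn(h_mid))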



Attention after convolution

The authors are very considerate... link to the code:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py#L4448

