[Paper Reading] Efficient Estimation of Word Representations in Vector Space

Overview

This paper, by Mikolov (the creator of word2vec), evaluates the quality of w2v word vectors by comparing different models on different tasks; its conclusions help cut down on exploratory experiments.

1 On large corpora, CBOW and skip-gram produce high-quality word vectors that are no weaker than those from NN language models, at a drastically lower computational cost.

2 Comparing the performance of the various models, skip-gram is overall better than CBOW.

Authors: Tomas Mikolov, Kai Chen, et al.

Affiliation: Google Inc., Mountain View, CA

Keywords: Word Representation, Word Embedding, Neural Network, Syntactic Similarity, Semantic Similarity

Link: arxiv.org/pdf/1301.3781

Problem: how can high-quality word representations be learned from massive corpora?

Learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary


Main Content

Previous approach: an NN-based model that, given the preceding N words, predicts the probability of the next word; the weight matrix learned as a by-product serves as the word vector representation.

A very popular model architecture for estimating neural network language model (NNLM) was proposed in , where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model

Drawback: the computation is far too expensive.

these architectures were significantly more computationally expensive for training.

Models

  • Two earlier word-representation models also exist: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.

(LSA's results are mediocre and LDA is too computationally expensive, so both are set aside; the focus here is on word vectors learned by neural networks.)

To make comparisons easier, a training cost is defined: O = E × T × Q

where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 − 50 and T up to one billion.

All models are trained using stochastic gradient descent and backpropagation.

(SGD and backpropagation really are the bedrock.)

  • Feedforward Neural Net Language Model (NNLM):

The traditional NNLM has four layers: input, projection, hidden, and output. The computational cost is dominated by the projection-to-hidden computation and by the normalization over the full vocabulary V at the output layer; per training example this works out to Q = N × D + N × D × H + H × V, with the last term reduced to H × log2(V) when hierarchical softmax is used.
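As a rough sketch (not the paper's implementation), the forward pass of such an NNLM might look like the following; all sizes and names are illustrative assumptions, and the full softmax over V at the end is exactly the expensive part:

```python
import numpy as np

# Illustrative sizes only (not the paper's settings)
V, N, D, H = 10000, 4, 100, 500

rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (V, D))        # projection (embedding) matrix
W_h = rng.normal(0, 0.1, (N * D, H))  # projection -> hidden weights
W_o = rng.normal(0, 0.1, (H, V))      # hidden -> output weights (the expensive part)

def nnlm_forward(context_ids):
    """Predict a distribution over the next word from the N previous word ids."""
    x = P[context_ids].reshape(-1)      # input -> projection: concatenate N vectors (N*D values)
    h = np.tanh(x @ W_h)                # projection -> hidden: ~N*D*H multiplications
    logits = h @ W_o                    # hidden -> output: ~H*V multiplications
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()          # full softmax over the vocabulary V

p_next = nnlm_forward(np.array([12, 7, 42, 3]))  # N = 4 previous word ids
print(p_next.shape)  # (10000,)
```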

To address NNLM's high computational complexity, the authors propose two models:

  • CBOW - Continuous Bag-of-Words Model

Take the context words before and after the current word and form a bag of words (using a bag means only presence matters, not how near or far a word is), then average the context vectors; this completes the input-to-projection step.

weight matrix between the input and the projection layer is shared. (Because of the bag-of-words treatment and the averaging, the weight matrix can be shared.) The computational cost of this stage is averaging the N word vectors: N × D.

If hierarchical softmax is used, the corresponding Huffman tree has depth log2(V), and a logistic regression is evaluated at each internal node along the path, so this stage costs D × log2(V).

So the overall complexity is Q = N × D + D × log2(V).
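A minimal sketch of the CBOW input-to-projection step described above, with toy sizes and a single shared embedding matrix (none of this is the paper's actual code); the hierarchical-softmax output layer, which contributes the D × log2(V) term, is omitted:

```python
import numpy as np

# Illustrative sizes only
V, D = 10000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))  # one input->projection matrix shared by all context positions

def cbow_projection(context_ids):
    """Bag of words: order and distance are ignored; the N context vectors are simply averaged (N*D work)."""
    return W_in[context_ids].mean(axis=0)  # shape (D,), fed to the output layer

h = cbow_projection(np.array([5, 9, 120, 7]))  # e.g. 2 words before and 2 words after the target
print(h.shape)  # (100,)
```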

  • Continuous Skip-gram Model

Skip-gram is CBOW turned around: one word is used to predict the C surrounding words, so the hierarchical softmax must be computed C times (once for each surrounding word).

Since there is no averaging step, the input-to-projection cost is just D (only one word, so no averaging as in CBOW), while the projection-to-output cost is D × log2(V). Taking the C surrounding words into account, the total complexity is C × (D + D × log2(V)).

(So skip-gram is computationally more expensive than CBOW; see the quick calculation below.)
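To make the comparison concrete, here is a back-of-the-envelope evaluation of the per-example cost Q for the three architectures (all with hierarchical softmax), plugging illustrative hyperparameter values into the formulas above; the numbers are arbitrary, not the paper's settings:

```python
import math

# Illustrative hyperparameters (not the paper's exact settings)
V = 1_000_000   # vocabulary size
D = 300         # vector dimensionality
H = 500         # NNLM hidden layer size
N = 8           # context words for NNLM / CBOW
C = 10          # skip-gram window (words predicted per center word)

log2V = math.log2(V)

Q_nnlm = N * D + N * D * H + H * log2V  # projection + hidden + hierarchical softmax
Q_cbow = N * D + D * log2V              # average N vectors + one tree traversal
Q_sg   = C * (D + D * log2V)            # one projection + C tree traversals

print(f"NNLM:      {Q_nnlm:,.0f}")   # ~1,212,000 -- dominated by the N*D*H term
print(f"CBOW:      {Q_cbow:,.0f}")   # ~8,400
print(f"skip-gram: {Q_sg:,.0f}")     # ~62,800 -- several times CBOW, far below NNLM
```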


Model Comparison

Evaluation Method and Test Data

Even today, most general-purpose evaluations still follow this approach.

compute vector X = vector("biggest") − vector("big") + vector("small").

France is to Paris as Germany is to Berlin. Concretely, the question pairs are constructed from the following angles:

we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions
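A minimal sketch of how one of these analogy questions can be scored: form X = vector("biggest") − vector("big") + vector("small") and return the vocabulary word whose vector is closest to X by cosine similarity, excluding the question words themselves. The toy vocabulary and random vectors below are placeholders, not trained embeddings:

```python
import numpy as np

# Placeholder embeddings; in practice E comes from a trained CBOW/skip-gram model
rng = np.random.default_rng(0)
vocab = ["big", "bigger", "biggest", "small", "smaller", "smallest", "france", "paris"]
E = rng.normal(0, 1, (len(vocab), 50))
word2id = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via X = E[b] - E[a] + E[c] and cosine similarity."""
    x = E[word2id[b]] - E[word2id[a]] + E[word2id[c]]
    sims = (E @ x) / (np.linalg.norm(E, axis=1) * np.linalg.norm(x) + 1e-9)
    for w in (a, b, c):                      # exclude the question words themselves
        sims[word2id[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

print(analogy("big", "biggest", "small"))    # ideally "smallest" with real trained vectors
```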

Corpus

We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to 1 million most frequent words. (Only the 1 million most frequent words are kept; for a deduplicated Chinese vocabulary this number would certainly not be enough.)

It is clear that adding more dimensions or adding more training data provides diminishing improvements.


Model Accuracy Comparison

Looking mainly at the highlighted (red-boxed) entries in the results table, we can see:

1 With the same corpus and vector dimensionality, skip-gram performs outstandingly on semantic questions and also very well on syntactic ones.

2 CBOW does quite well on syntactic questions (though still weaker than skip-gram), but its performance on semantic questions is much worse.

Question to ponder: why does skip-gram clearly outperform CBOW?


Training Time Comparison

(Skip-gram's training time also depends on C.)

(1 With C = 5 fixed and all else basically equal, skip-gram takes roughly three times as long to train as CBOW.

2 For CBOW, training time grows roughly in proportion to corpus size and roughly in proportion to dimensionality.

3 For skip-gram, training time grows roughly in proportion to corpus size, and with dimensionality at roughly a 1.5× ratio.)


Going all out, one more round of comparison:

1 On large corpora, training high-dimensional word vectors with NNLM is practically infeasible; the cost is simply too high.

2 With large corpora and high dimensionality, CBOW's syntactic performance starts to be no weaker than skip-gram's.


:)
