The relationship between maximum likelihood and other loss functions


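Different loss functions can be derived from maximum likelihood (ML; maximizing the likelihood is equivalent to maximizing the log-likelihood):

in classification (both binary and multi-class), it yields the cross-entropy loss;
in regression, it yields the mean squared error (MSE) loss;
maximum likelihood is equivalent to minimizing the KL divergence.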

How ML relates to a particular loss function depends on how the probability is interpreted. In regression, p(y|x) is a Gaussian distribution centered on the model's prediction, i.e. the error is Gaussian with zero mean. In classification, p(y_k|x) is the probability that x belongs to class y_k (a multinomial distribution).

\begin{equation} \hat{\vec{\theta}}_{ML}=\arg\max_{\vec{\theta}} \sum_{i=1}^{n}\log p(\vec{y}_i|\vec{x}_i;\vec{\theta}) \tag{1} \end{equation}

How should p(\vec{y}|\vec{x};\vec{\theta}) be interpreted?

In regression

In regression, \vec{y} is a one-dimensional vector containing a single value y. The regression setting assumes:

p({y}|\vec{x};\vec{\theta})=\mathcal{N}({y}; \hat{{y}}, \sigma^2) \tag{2}

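where \hat{y}=f_{\vec{\theta}}(\vec{x}).

That is, p(y|\vec{x};\vec{\theta}) says that y follows a Gaussian distribution with mean \hat{y} and variance \sigma^2, where \sigma is assumed to be a fixed constant. Substituting (2) into (1) shows that maximum likelihood estimation is equivalent to minimizing the mean squared error: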

\hat{\vec{\theta}}_{MSE}=\arg\min_{\vec{\theta}} \frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \tag{3}
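
The step from (1) and (2) to (3) can be spelled out as follows. This is a sketch under the assumption stated above that \sigma is a fixed constant; y_i and \hat{y}_i=f_{\vec{\theta}}(\vec{x}_i) denote the target and prediction for the i-th sample. Taking the logarithm of the Gaussian density in (2) for one sample gives

\log p(y_i|\vec{x}_i;\vec{\theta}) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i-\hat{y}_i)^2}{2\sigma^2}

and summing over the n samples as in (1) gives

\sum_{i=1}^{n}\log p(y_i|\vec{x}_i;\vec{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2

The first term and the factor 1/(2\sigma^2) do not depend on \vec{\theta}, so maximizing this sum is the same as minimizing \sum_{i=1}^{n}(y_i-\hat{y}_i)^2. Dividing by n does not change the minimizer.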

For details, see Section 1.2.5 of Pattern Recognition and Machine Learning.
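
The equivalence can also be checked numerically. The sketch below is only an illustration, not part of the original argument: the toy data, the one-parameter model f_theta(x) = theta * x, and the helper names gaussian_nll and mse are all assumptions made for this example. It scans a grid of theta values and compares the minimizer of the Gaussian negative log-likelihood with the minimizer of the MSE.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of y under N(y_hat, sigma^2), summed over samples."""
    n = y.shape[0]
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((y - y_hat) ** 2) / (2 * sigma**2)

def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return np.mean((y - y_hat) ** 2)

# Toy data (assumption): true slope 2.0, Gaussian noise with std 0.3.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.3, size=200)

# Hypothetical one-parameter model f_theta(x) = theta * x, scanned over a grid.
thetas = np.linspace(0.0, 4.0, 401)
nll_values = np.array([gaussian_nll(y, t * x, sigma=0.3) for t in thetas])
mse_values = np.array([mse(y, t * x) for t in thetas])

theta_ml = thetas[np.argmin(nll_values)]    # maximum likelihood estimate on the grid
theta_mse = thetas[np.argmin(mse_values)]   # least-squares estimate on the grid
print(theta_ml, theta_mse)
```

Because the negative log-likelihood is a constant plus a positive multiple of the MSE, the two grid searches select the same theta. Changing the fixed sigma rescales the likelihood curve but does not move its minimum.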

In classification

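p(\vec{y}|\vec{x};\vec{\theta}) gives the probability that sample \vec{x} belongs to each class; these probabilities can be produced by a softmax function. The class label follows a multinomial distribution, so the likelihood can be written down directly, and taking the negative logarithm of the likelihood gives the cross-entropy loss.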
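
To make this step concrete, here is a sketch of the derivation for K classes. The notation is introduced for illustration and is not from the original text: y_{ik} is the k-th component of the one-hot label \vec{y}_i, and \hat{y}_{ik} is the softmax output for class k on sample \vec{x}_i. The likelihood of one sample is

p(\vec{y}_i|\vec{x}_i;\vec{\theta}) = \prod_{k=1}^{K} \hat{y}_{ik}^{\,y_{ik}}

so the log-likelihood in (1) becomes

\sum_{i=1}^{n}\log p(\vec{y}_i|\vec{x}_i;\vec{\theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log \hat{y}_{ik}

Negating the right-hand side gives the cross-entropy loss -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log \hat{y}_{ik}, so maximizing the log-likelihood is the same as minimizing the cross-entropy.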

For details, see Section 4.3.4 of Pattern Recognition and Machine Learning.
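
A small numerical check of the same statement, as a sketch only: the random logits and labels and the helper names softmax, categorical_nll and cross_entropy are assumptions made for this example. It computes the negative log-likelihood by picking out the predicted probability of the true class, and the cross-entropy against one-hot labels, and the two come out equal.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_nll(probs, labels):
    """Negative log-likelihood: -sum_i log p(y_i | x_i) for integer class labels."""
    n = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), labels]))

def cross_entropy(probs, one_hot):
    """Cross-entropy between one-hot targets and predicted probabilities."""
    return -np.sum(one_hot * np.log(probs))

# Toy inputs (assumption): 5 samples, 3 classes, random logits and labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
labels = rng.integers(0, 3, size=5)
one_hot = np.eye(3)[labels]

probs = softmax(logits)
nll = categorical_nll(probs, labels)
ce = cross_entropy(probs, one_hot)
print(nll, ce)
```

The one-hot target zeroes out every term except the log-probability of the true class, which is why the two quantities coincide.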

The cross-entropy loss and the KL divergence are related as follows:

D_{KL}(Y\|\hat Y) = H(Y,\hat Y)-H(Y)

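The first term on the right-hand side is the cross-entropy; the second term, the entropy of Y, does not depend on the model. Minimizing the cross-entropy is therefore equivalent to minimizing the KL divergence, which in turn is equivalent to maximum likelihood.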
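
The identity and its consequence can be checked on a small discrete example. This is an illustrative sketch: the distribution p (playing the role of Y), the candidate model distributions (playing the role of \hat Y), and the helper names entropy, cross_entropy and kl_divergence are assumptions made for this example.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return np.sum(p * np.log(p / q))

# Fixed "true" distribution and a few candidate model distributions (assumptions).
p = np.array([0.7, 0.2, 0.1])
candidates = [
    np.array([0.5, 0.3, 0.2]),
    np.array([0.6, 0.25, 0.15]),
    np.array([0.7, 0.2, 0.1]),
]

ce_values = [cross_entropy(p, q) for q in candidates]
kl_values = [kl_divergence(p, q) for q in candidates]

# The gap between cross-entropy and KL divergence is H(p) for every candidate.
gaps = [ce - kl for ce, kl in zip(ce_values, kl_values)]
print(gaps, entropy(p))

# Both criteria therefore pick the same candidate.
best_by_ce = int(np.argmin(ce_values))
best_by_kl = int(np.argmin(kl_values))
print(best_by_ce, best_by_kl)
```

Since the two quantities differ by a constant that does not involve the candidate, they rank the candidates identically.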

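Section 5.5 of Goodfellow's Deep Learning and Section 4.3.4 of PRML also state that ML and minimizing the KL divergence are equivalent, but neither gives a detailed derivation. The following links give derivations from different angles: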

Edited 2018-02-18 11:10
