FM+FTRL: Algorithm Principles and Engineering Implementation


Preface

The previous article covered the principles and engineering implementation of LR+FTRL. In real projects, modeling is often done with LR + hand-crafted feature crosses + FTRL. This approach requires manual feature engineering and depends heavily on experience; moreover, feature crosses that are effective offline are not necessarily effective online, and features that work today may not work tomorrow. It has therefore gradually been replaced by models that learn feature interactions automatically. Two families of such models are common in industry: the FM family and the tree family. This article covers the basic member of the FM family: the Factorization Machine (FM) model.

Feature Interactions

We use a polynomial model to describe the effect of feature interactions. In a polynomial model, the interaction between features x_i and x_j is expressed by the product term x_ix_j. For simplicity, we restrict the discussion to second-order polynomial models:

y=w_{0} + \sum_{i=1}^{n}w_{i}x_{i} + \sum_{i=1}^{n}\sum_{j=i+1}^{n}w_{i,j}x_{i}x_{j} \tag{1}

The first two terms of Eq. (1) form a standard linear model; the third term corresponds to feature interactions. In a traditional polynomial model, the coefficients w_{i,j} are mutually independent, so learning a reliable value for w_{i,j} requires the combination x_{i}x_{j} to appear sufficiently often in the training data. Real recommendation and advertising data, however, are extremely sparse, so models built on plain polynomial interactions often fail in practice; SVMs with high-order kernels suffer from the same problem.

How, then, can the second-order parameters be trained? Matrix factorization offers one answer. In model-based collaborative filtering, a rating matrix is factorized into a user matrix and an item matrix, so that every user and every item is represented by a latent vector. For example, represent each user as a two-dimensional vector and each item as a two-dimensional vector; the dot product of the two vectors is the user's rating of the item.
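
A toy NumPy sketch of this idea (the numbers are made up purely for illustration):

import numpy as np

# 3 users and 4 items, each embedded as a 2-d latent vector
user_vecs = np.array([[0.8, 0.1],
                      [0.2, 0.9],
                      [0.5, 0.5]])      # shape (3, 2)
item_vecs = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.7, 0.3],
                      [0.4, 0.6]])      # shape (4, 2)

# entry (u, i) of the reconstructed rating matrix is the dot product <user_u, item_i>
ratings = user_vecs @ item_vecs.T       # shape (3, 4)
print(ratings)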

Since the coefficients w_{i,j} form an n×n real symmetric matrix W, W has n linearly independent eigenvectors, which can be normalized into a set of unit-length vectors. A real symmetric matrix therefore decomposes as:

W=Q\Lambda Q^{T}\tag{2}

where Q is an orthogonal matrix and \Lambda is a real diagonal matrix. Provided W is positive semidefinite (which always holds for a sufficiently large latent dimension k), W can further be written as W=V^{T}V, taking V=\Lambda^{1/2}Q^{T}, where the j-th column of V ( v_{j} ) is the latent vector of the j-th feature x_{j}. In other words, the coefficient of the interaction term x_{i}x_{j} is w_{i,j}=<v_{i},v_{j}>.
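
This factorization is easy to check numerically. A small sketch (k equals n here so that an exact factorization of a positive semidefinite W exists; FM later restricts k ≪ n):

import numpy as np

n = 5
A = np.random.randn(n, n)
W = A @ A.T                                # a random positive semidefinite symmetric W

# W = Q diag(lam) Q^T, then V = diag(sqrt(lam)) Q^T gives W = V^T V
lam, Q = np.linalg.eigh(W)
V = np.diag(np.sqrt(np.clip(lam, 0, None))) @ Q.T   # column j of V is the latent vector v_j

print(np.allclose(W, V.T @ V))             # True: w_{i,j} = <v_i, v_j>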

The interaction coefficients of FM therefore take the following form:

\hat{y}(x) = w_{0} + \sum_{i=1}^{n}w_{i}x_{i} + \sum_{i=1}^{n}\sum_{j=i+1}^{n}<v_{i},v_{j}>x_{i}x_{j} \tag{3}

where v_{i} = (v_{i1},v_{i2},\dots,v_{ik})^T and <v_{i},v_{j}> denotes the dot product of the latent vectors. Each latent vector has length k (k ≪ n), i.e., k factors describing the feature. This factorization of the parameters means the interaction coefficients are no longer independent: for example, the coefficients of x_{i}x_{j} and x_{i}x_{h} are <v_{i},v_{j}> and <v_{i},v_{h}>, which share the common factor v_{i}. Consequently, every sample containing a non-zero combination involving x_{i} (i.e., there is some j\neq{i} with x_{i}x_{j}\neq{0}) contributes to learning the latent vector v_{i}, which largely mitigates the effect of data sparsity. In the plain polynomial model, by contrast, w_{h,i} and w_{i,j} are independent.

For example, suppose we want to predict the target for the interaction of features A and ST, but no training sample has A and ST active at the same time. If w_{a,st} were an independent parameter, we would learn w_{a,st}=0. With FM, however, v_{a} and v_{st} can still be learned from other combinations (such as <v_{a},v_{ti}> and <v_{b},v_{st}>), so w_{a,st}\neq{0}.

FM therefore handles feature interactions on high-dimensional sparse data well.

Parameter Optimization

Equation (3) is a general fitting equation: paired with different loss functions it can solve classification, regression, or ranking problems, e.g., MSE for regression and hinge loss or cross-entropy loss for classification.

At first glance, evaluating the interaction term of Eq. (3) costs O(kn^{2}), but it can be reduced to O(kn) as follows:

\begin{align*} &\sum_{i=1}^{n}\sum_{j=i+1}^{n}<v_{i},v_{j}>x_{i}x_{j}\\ &=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}<v_{i},v_{j}>x_{i}x_{j}-\frac{1}{2}\sum_{i=1}^{n}<v_{i},v_{i}>x_{i}x_{i}\\ &=\frac{1}{2}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{f=1}^{k}v_{i,f}v_{j,f}x_{i}x_{j}-\sum_{i=1}^{n}\sum_{f=1}^{k}v_{i,f}^{2}x_{i}^{2}\right)\\ &=\frac{1}{2}\sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n}v_{i,f}x_{i}\right)\left(\sum_{j=1}^{n}v_{j,f}x_{j}\right)-\sum_{i=1}^{n}v_{i,f}^{2}x_{i}^{2}\right)\\ &=\frac{1}{2}\sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n}v_{i,f}x_{i}\right)^{2}-\sum_{i=1}^{n}v_{i,f}^{2}x_{i}^{2}\right) \end{align*}\tag{4}

The first step of Eq. (4) uses the symmetry of the dot product: view the double sum as summing the strict upper triangle (excluding the diagonal) of a real symmetric matrix, which equals the sum of all matrix elements minus the trace, divided by 2.
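
The equivalence is easy to verify numerically with random data; a quick sketch:

import numpy as np

n, k = 10, 4
x = np.random.randn(n)
V = np.random.randn(n, k)          # row i of V is the latent vector v_i

# naive O(k n^2): sum <v_i, v_j> x_i x_j over the strict upper triangle
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# factorized O(k n) form of Eq. (4)
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)

print(np.allclose(naive, fast))    # True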

The cost of FM's interaction term is thus reduced to O(kn). In practice, k is usually set to a small value such as 2 or 4; restricting k curbs FM's expressive power and improves generalization. With k fixed, FM's time complexity grows linearly with the number of features. Combining Eq. (3) and Eq. (4), the FM model becomes:

\begin{align*} \hat{y}(x) &= w_{0} + \sum_{i=1}^{n}w_{i}x_{i} + \frac{1}{2}\sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n}v_{i,f}x_{i}\right)^{2}-\sum_{i=1}^{n}v_{i,f}^{2}x_{i}^{2}\right) \end{align*}\tag{5}

The parameter space of Eq. (5) is \theta=\{w_{0},w_{1},\dots,w_{n},v_{1,1},\dots,v_{n,k}\}. Taking binary classification as an example, it suffices to wrap Eq. (5) in a sigmoid function:

\begin{align*} P(Y=1|X) &=\pi(x) = \frac{\exp(\hat{y}(x|\theta))}{1+\exp(\hat{y}(x|\theta))} \end{align*}\tag{6}

\begin{align*} P(Y=0|X) &=1-\pi(x) = \frac{1}{1+\exp(\hat{y}(x|\theta))} \end{align*}\tag{7}
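
Putting Eq. (5) and Eq. (6) together, the forward pass is a few lines of NumPy (a minimal sketch; the function and variable names are illustrative):

import numpy as np

def fm_predict(x, w0, w, V):
    """Eq. (5): raw FM score for one sample x of shape (n,); V has shape (n, k)."""
    linear = w0 + w @ x
    pair = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)
    return linear + pair

def fm_prob(x, w0, w, V):
    """Eq. (6): P(Y=1|x) = sigmoid(y_hat)."""
    return 1.0 / (1.0 + np.exp(-fm_predict(x, w0, w, V)))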

The likelihood function is:

\prod_{i=1}^{N}[\pi(x_{i})]^{y_{i}}[1-\pi(x_{i})]^{1-y_{i}}\tag{8}

The log-likelihood is:

\begin{align*} L(\theta) &= \sum_{i=1}^{N}\left[y_{i}\log(\pi(x_{i}))+(1-y_{i})\log(1-\pi(x_{i}))\right]\\ &=\sum_{i=1}^{N}\left[y_{i}\log\frac{\pi(x_{i})}{1-\pi(x_{i})}+\log(1-\pi(x_{i}))\right]\\ &=\sum_{i=1}^{N}\left[y_{i}\hat{y}(x_{i}|\theta)-\log(1+\exp(\hat{y}(x_{i}|\theta)))\right] \end{align*}\tag{9}

FM parameter optimization thus becomes maximizing the log-likelihood of Eq. (9) over \theta. In practice, one usually minimizes a loss function instead, so negating Eq. (9) yields the logistic loss:

\begin{align*} l(\theta) =-L(\theta)= \sum_{i=1}^{N}\left[\log(1+\exp(\hat{y}(x_{i}|\theta)))-y_{i}\hat{y}(x_{i}|\theta)\right] \end{align*}\tag{10}
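
One implementation note: evaluating log(1 + exp(ŷ)) naively overflows for large ŷ; np.logaddexp(0, ŷ) computes the same quantity stably. A sketch of Eq. (10):

import numpy as np

def log_loss(y_hat, y):
    """Eq. (10): cross-entropy for labels y in {0, 1} and raw FM scores y_hat.
    np.logaddexp(0, y_hat) evaluates log(1 + exp(y_hat)) without overflow."""
    return np.sum(np.logaddexp(0.0, y_hat) - y * y_hat)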

Equation (10) is the familiar cross-entropy loss: it is continuous and differentiable to high order (note that, unlike plain logistic regression, the FM objective is not convex in the latent factors). It can be optimized with gradient descent, Newton's method, trust-region methods, and so on. The partial derivatives of \hat{y}(x|\theta) with respect to the model parameters are: \frac{\partial\hat{y}}{\partial\theta}=\begin{cases} 1 & \text{if $\theta$ is $w_{0}$}\\ x_{i} & \text{if $\theta$ is $w_{i}$}\\ x_{i}\sum_{j=1}^{n}v_{j,f}x_{j}-v_{i,f}x_{i}^{2} & \text{if $\theta$ is $v_{i,f}$} \end{cases}\tag{11} and by the chain rule the loss gradient for a sample (x,y) is \frac{\partial l}{\partial\theta}=(\pi(x)-y)\cdot\frac{\partial\hat{y}}{\partial\theta}.
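
Equation (11) transcribes directly into NumPy (a sketch; it returns the partials of ŷ, which the chain rule then scales by (π(x) − y)):

import numpy as np

def fm_grads(x, V):
    """Partials of y_hat w.r.t. w0, w_i and v_{i,f} per Eq. (11); x: (n,), V: (n, k)."""
    g_w0 = 1.0                                    # d y_hat / d w0
    g_w = x.copy()                                # d y_hat / d w_i = x_i
    s = V.T @ x                                   # s_f = sum_j v_{j,f} x_j, shape (k,)
    g_V = np.outer(x, s) - V * (x ** 2)[:, None]  # x_i * s_f - v_{i,f} * x_i^2
    return g_w0, g_w, g_V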

Here we use the more efficient FTRL optimization algorithm. A quick recap of the FTRL engineering pseudocode:

\begin{align*}
\hline
&\Large{\text{Algorithm 1: Per-Coordinate FTRL-Proximal with $L_{1}$ and}}\\
&\Large{\text{$L_{2}$ Regularization for Logistic Regression}}\\
\hline
&\text{# With per-coordinate learning rates of Eq. (2)}\\
&\textbf{Input: }\text{parameters $\alpha$, $\beta$, $\lambda_{1}$, $\lambda_{2}$}\\
&\forall i\in\{1,\dots,d\}\text{, initialize }z_{i}=0\text{ and }n_{i}=0\\
&\textbf{for }t=1\textbf{ to }T\textbf{ do}\\
&\quad\text{Receive feature vector $x_{t}$ and let }I=\{i\mid x_{i}\neq 0\}\\
&\quad\text{For }i\in I\text{ compute}\\
&\qquad w_{t,i}=\begin{cases}0&\text{if }|z_{i}|<\lambda_{1}\\-\left(\frac{\beta+\sqrt{n_{i}}}{\alpha}+\lambda_{2}\right)^{-1}(z_{i}-\operatorname{sgn}(z_{i})\lambda_{1})&\text{otherwise}\end{cases}\\
&\quad\text{Predict }p_{t}=\sigma(x_{t}\cdot w)\text{ using the }w_{t,i}\text{ computed above}\\
&\quad\text{Observe label }y_{t}\in\{0,1\}\\
&\quad\textbf{for all }i\in I\textbf{ do}\\
&\qquad g_{i}=(p_{t}-y_{t})x_{i}\quad\text{# gradient of loss w.r.t. }w_{i}\\
&\qquad\sigma_{i}=\frac{1}{\alpha}\left(\sqrt{n_{i}+g_{i}^{2}}-\sqrt{n_{i}}\right)\quad\text{# equals }\tfrac{1}{\eta_{t,i}}-\tfrac{1}{\eta_{t-1,i}}\\
&\qquad z_{i}\leftarrow z_{i}+g_{i}-\sigma_{i}w_{t,i}\\
&\qquad n_{i}\leftarrow n_{i}+g_{i}^{2}\\
&\quad\textbf{end for}\\
&\textbf{end for}\\
\hline
\end{align*}

FTRL is really an online-learning framework: it solves far more than LR and has become a general-purpose optimizer; TensorFlow's optimizers, for example, include FTRL. To adapt the pseudocode above to FM, replace the computation of p_{t} with \hat{y}(x|\theta); for each incoming feature vector x, update the parameters w_{0},w_{i},v_{i,f} associated with every non-zero dimension x_{i}. The update formulas are unchanged from the pseudocode: the gradients are the partial derivatives of the loss with respect to each parameter, given above, and the updates of \sigma,z,n stay the same. The pseudocode follows (the LaTeX source is long and Zhihu cannot display it all, so part of the pseudocode is omitted):

\begin{align*}
\hline
&\Large\text{Algorithm: FM with FTRL}\\
\hline
&\text{Input: parameters }\alpha^{w},\alpha^{v},\beta^{w},\beta^{v},\lambda_{1}^{w},\lambda_{2}^{w},\lambda_{1}^{v},\lambda_{2}^{v},\sigma\\
&\text{Init: }w_{0}=0;\ n_{0}^{w}=0;\ z_{0}^{w}=0\\
&\text{Init: }\forall i,\forall f:\ w_{i}=0;\ n_{i}^{w}=0;\ z_{i}^{w}=0;\ v_{i,f}\sim N(0,\sigma);\ n_{i,f}^{v}=0;\ z_{i,f}^{v}=0\\
&\textbf{for }t=1\textbf{ to }T\textbf{ do}\\
&\quad\text{Receive feature vector }x\text{ and let }I=\{i\mid x_{i}\neq 0\}\\
&\qquad w_{0}=\begin{cases}0&\text{if }|z_{0}^{w}|<\lambda_{1}^{w}\\-\left(\frac{\beta^{w}+\sqrt{n_{0}^{w}}}{\alpha^{w}}+\lambda_{2}^{w}\right)^{-1}(z_{0}^{w}-\operatorname{sgn}(z_{0}^{w})\lambda_{1}^{w})&\text{otherwise}\end{cases}\\
&\quad\textbf{for }i\in I\textbf{ compute:}\\
&\qquad w_{i}=\begin{cases}0&\text{if }|z_{i}^{w}|<\lambda_{1}^{w}\\-\left(\frac{\beta^{w}+\sqrt{n_{i}^{w}}}{\alpha^{w}}+\lambda_{2}^{w}\right)^{-1}(z_{i}^{w}-\operatorname{sgn}(z_{i}^{w})\lambda_{1}^{w})&\text{otherwise}\end{cases}\\
&\qquad\textbf{for }f=1\textbf{ to }k\textbf{ compute:}\\
&\qquad\quad v_{i,f}=\begin{cases}0&\text{if }|z_{i,f}^{v}|<\lambda_{1}^{v}\\-\left(\frac{\beta^{v}+\sqrt{n_{i,f}^{v}}}{\alpha^{v}}+\lambda_{2}^{v}\right)^{-1}(z_{i,f}^{v}-\operatorname{sgn}(z_{i,f}^{v})\lambda_{1}^{v})&\text{otherwise}\end{cases}\\
&\qquad\textbf{end for}\\
&\quad\textbf{end for}\\
&\quad\text{Compute }\hat{y}(x|\theta)\\
&\quad\text{Observe label }y\in\{-1,1\}\\
&\quad\text{Compute }g_{0}^{w}\quad\text{# gradient for }\theta=w_{0}\\
&\quad\sigma_{0}^{w}=\frac{1}{\alpha^{w}}\left(\sqrt{n_{0}^{w}+(g_{0}^{w})^{2}}-\sqrt{n_{0}^{w}}\right)\\
&\quad z_{0}^{w}\leftarrow z_{0}^{w}+g_{0}^{w}-\sigma_{0}^{w}w_{0}\\
&\quad n_{0}^{w}\leftarrow n_{0}^{w}+(g_{0}^{w})^{2}\\
&\quad\textbf{for }i\in I\textbf{ do}\\
&\qquad\text{compute }g_{i}^{w}\\
&\qquad\sigma_{i}^{w}=\frac{1}{\alpha^{w}}\left(\sqrt{n_{i}^{w}+(g_{i}^{w})^{2}}-\sqrt{n_{i}^{w}}\right)\\
&\qquad z_{i}^{w}\leftarrow z_{i}^{w}+g_{i}^{w}-\sigma_{i}^{w}w_{i}\\
&\qquad n_{i}^{w}\leftarrow n_{i}^{w}+(g_{i}^{w})^{2}\\
&\qquad\textbf{for }f=1\textbf{ to }k\textbf{ do}\\
&\qquad\quad\dots\\
&\qquad\textbf{end for}\\
&\quad\textbf{end for}\\
&\textbf{end for}\\
\hline
\end{align*}
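
To make the elided inner loop concrete, here is a minimal pure-Python sketch of one FM-FTRL update on a single sample. It follows the bookkeeping above under a few stated assumptions: labels are in {0, 1} (matching Eqs. (6)-(10) rather than the {-1, 1} convention in the pseudocode), inputs are sparse {index: value} dicts, and L1 regularization is set to 0 for the latent factors so their random initialization (folded into z here) is not immediately truncated to zero; all names are illustrative:

import numpy as np
from collections import defaultdict

class FMFTRL:
    """Sparse FM trained with per-coordinate FTRL-Proximal (a sketch)."""

    def __init__(self, k, alpha_w=0.1, beta_w=1.0, l1_w=1.0, l2_w=1.0,
                 alpha_v=0.05, beta_v=1.0, l2_v=1.0, sigma=0.01):
        self.k = k
        self.pw = (alpha_w, beta_w, l1_w, l2_w)   # hyperparameters for w0, w_i
        self.pv = (alpha_v, beta_v, 0.0, l2_v)    # hyperparameters for v_{i,f} (L1 = 0)
        self.sigma = sigma
        self.z = defaultdict(float)               # keys: 'w0', ('w', i), ('v', i, f)
        self.n = defaultdict(float)

    def _weight(self, key, params):
        """Closed-form FTRL-Proximal solution for one coordinate."""
        alpha, beta, l1, l2 = params
        z, n = self.z[key], self.n[key]
        if abs(z) <= l1:
            return 0.0
        return -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2)

    def _v(self, i, f):
        key = ('v', i, f)
        if key not in self.z:                     # lazy N(0, sigma) init via z
            self.z[key] = -np.random.randn() * self.sigma
        return self._weight(key, self.pv)

    def fit_one(self, x, y):
        I = list(x)
        w0 = self._weight('w0', self.pw)
        w = {i: self._weight(('w', i), self.pw) for i in I}
        v = {(i, f): self._v(i, f) for i in I for f in range(self.k)}

        # forward pass over non-zeros only, using the factorized form of Eq. (4)
        s = [sum(v[i, f] * x[i] for i in I) for f in range(self.k)]
        y_hat = w0 + sum(w[i] * x[i] for i in I) + 0.5 * sum(
            s[f] ** 2 - sum((v[i, f] * x[i]) ** 2 for i in I) for f in range(self.k))
        p = 1.0 / (1.0 + np.exp(-y_hat))
        d = p - y                                 # dl/dy_hat for the log loss

        def accumulate(key, g, alpha, cur_w):
            """Identical z/n bookkeeping for w0, w_i and v_{i,f}."""
            sigma_t = (np.sqrt(self.n[key] + g * g) - np.sqrt(self.n[key])) / alpha
            self.z[key] += g - sigma_t * cur_w
            self.n[key] += g * g

        accumulate('w0', d, self.pw[0], w0)
        for i in I:
            accumulate(('w', i), d * x[i], self.pw[0], w[i])
            for f in range(self.k):
                g = d * (x[i] * s[f] - v[i, f] * x[i] ** 2)   # chain rule on Eq. (11)
                accumulate(('v', i, f), g, self.pv[0], v[i, f])
        return p

# usage: model = FMFTRL(k=4); p = model.fit_one({3: 1.0, 17: 1.0}, y=1)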

Engineering Implementation

With the principles of FM+FTRL understood, an implementation can be written following the same workflow as the earlier LR+FTRL article. Below is a TensorFlow-based implementation of FM+FTRL, where the FTRL part uses TensorFlow's built-in optimizer:

# -*- coding: utf-8 -*-

"""
An FM+FTRL model written in a structured TensorFlow coding style, with support
for online feature encoding via hashing.
"""

import functools
import tensorflow as tf
import numpy as np
import os
import pandas as pd


def doublewrap(function):
    """
    A decorator for decorators, allowing the decorated decorator to be used
    without parentheses when no arguments are provided. All arguments must be
    optional.
    """

    @functools.wraps(function)
    def decorator(*args, **kwargs):
        if len(args) == 1 and len(kwargs) == 0 and callable(args[0]):
            return function(args[0])
        else:
            return lambda wrapee: function(wrapee, *args, **kwargs)

    return decorator


@doublewrap
def define_scope(function, scope=None, *args, **kwargs):
    """
    A decorator for functions that define TensorFlow operations. The wrapped
    function will only be executed once. Subsequent calls to it will directly
    return the result so that operations are added to the graph only once.
    The operations added by the function live within a tf.variable_scope(). If
    this decorator is used with arguments, they will be forwarded to the
    variable scope. The scope name defaults to the name of the wrapped
    function.
    """
    attribute = '_cache_' + function.__name__
    name = scope or function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            with tf.variable_scope(name, *args, **kwargs):
                setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class FM_FTRL:
    def __init__(self, x, y, p, k):
        """

        :param x: input x
        :param y: label
        :param p: number of columns
        :param k: dim of v for FM pair interaction vector
        """
        self.x = x
        self.y = y
        self.p = p
        self.k = k
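        # touch each property once so its graph ops are built exactly once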
        self.predict
        self.optimize
        self.w0
        self.W
        self.V
        self.norm
        self.error
        self.loss

    @define_scope
    def predict(self):
        """
        Build the FM prediction ops following Eq. (5).
        :return:
        """
        x = self.x
        self.w0 = tf.Variable(tf.zeros([1]))
        self.W = tf.Variable(tf.zeros([self.p]))
        self.V = tf.Variable(tf.random_normal([self.k, self.p], stddev=0.01))
        linear_terms = tf.add(self.w0,
                              tf.reduce_sum(tf.multiply(self.W, x), 1, keepdims=True)
                              )
        # pairwise term of Eq. (5); reduce over the k factors per sample so the
        # result has shape [batch, 1] and adds correctly to linear_terms
        pair_terms = tf.multiply(0.5,
                                 tf.reduce_sum(
                                     tf.subtract(
                                         tf.pow(tf.matmul(x, tf.transpose(self.V)), 2),
                                         tf.matmul(tf.pow(x, 2), tf.transpose(tf.pow(self.V, 2)))
                                     ),
                                     1, keepdims=True
                                 ))
        predict = tf.add(linear_terms, pair_terms)
        return predict

    @define_scope
    def norm(self):
        """

        :return:
        """
        lambda_w = tf.constant(0.001, name="lambda_w")
        lambda_v = tf.constant(0.001, name="lambda_v")
        l2_norm = tf.reduce_sum(
            tf.add(
                tf.multiply(lambda_w, tf.pow(self.W, 2)),
                tf.multiply(lambda_v, tf.pow(self.V, 2))
            )
        )
        return l2_norm

    @define_scope
    def error(self):
        y = self.y
        y_hat = self.predict
        error = tf.reduce_mean(
            tf.square(
                tf.subtract(y, y_hat)
            )
        )
        return error

    @define_scope
    def loss(self):
        loss = tf.add(self.error, self.norm)
        return loss

    @define_scope
    def optimize(self, mode="ftrl"):
        if mode == 'ftrl':
            opt = tf.train.FtrlOptimizer(learning_rate=0.1).minimize(self.loss)
        else:
            opt = tf.train.AdamOptimizer(learning_rate=0.001).minimize(self.loss)
        return opt


def hash_java(key):
    """
    String hash similar to Java's multiply-and-add string hash; the 0xFFFFFFFF
    mask keeps the value non-negative, which matters when the hash is used as
    a one-hot feature index.
    :param key:
    :return:
    """
    h = 0
    for c in key:
        h = ((h * 37) + ord(c)) & 0xFFFFFFFF
    return h


def main():
    """

    :return:
    """
    epochs = 20
    batch_size = 1000

    D = 3000
    p = 2
    k = 2

    cols = ['user', 'item', 'rating', 'timestamp']
    use_cols = ['user', 'item', 'rating']
    features = ['user', 'item']

    data_dir = os.path.abspath(f"{os.path.abspath(os.path.dirname(os.path.realpath(__file__)))}/../../Data/fm/ml-100k")

    x = tf.placeholder('float', shape=[None, D])
    y = tf.placeholder('float', shape=[None, 1])
    model = FM_FTRL(x=x, y=y, p=D, k=k)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    num_lines = sum(1 for l in open(f'{data_dir}/ua.base'))  # ua.base has no header row
    print(f"total train lines number is {num_lines}")
    for epoch in range(epochs):
        total_batch = 0
        avg_cost = 0.
        # create random data based on random index
        index_random = np.random.permutation(num_lines)

        for row_index in range(0, index_random.shape[0], batch_size):
            # keep only the current batch's rows: skip every row outside the batch
            skip_rows = np.concatenate([index_random[:row_index], index_random[row_index + batch_size:]])
            row = pd.read_csv(f'{data_dir}/ua.base', delimiter='\t', names=cols,
                              usecols=use_cols,
                              skiprows=skip_rows)
            total_batch += 1
            bY = row['rating'].values.reshape(-1, 1)
            # one-hot encode each sample in the batch by hashing "value_featurename"
            bX = np.zeros([len(row), D])
            for r, (_, sample) in enumerate(row.iterrows()):
                for f in features:
                    hash_index = hash_java(str(sample[f]) + "_" + f) % D
                    bX[r, hash_index] = 1
            mse, loss_val, w, v, _ = sess.run([model.error, model.loss, model.W, model.V, model.optimize],
                                              feed_dict={x: bX, y: bY})
            avg_cost += loss_val
            # Display logs per epoch step
        if (epoch + 1) % 1 == 0:
            print(f"total batch is {total_batch}")
            print(f"Epoch:{epoch + 1}, cost={avg_cost / total_batch}")
    print('MSE: ', mse)
    print('Learnt weights:', w, w.shape)
    print('Learnt factors:', v, v.shape)
    # print(f"auc value is {tf.summary.scalar('AUC', auc)}")

    errors = []
    test = pd.read_csv(f'{data_dir}/ua.test', delimiter='\t', names=cols, usecols=use_cols,
                       chunksize=batch_size)
    for row in test:
        bY = row['rating'].values.reshape(-1, 1)
        bX = np.zeros([len(row), D])
        for r, (_, sample) in enumerate(row.iterrows()):
            for f in features:
                # use the same hash key format as in training
                hash_index = hash_java(str(sample[f]) + "_" + f) % D
                bX[r, hash_index] = 1
        errors.append(sess.run(model.error, feed_dict={x: bX, y: bY}))

    RMSE = np.sqrt(np.array(errors).mean())
    print(RMSE)
    sess.close()


if __name__ == '__main__':
    main()

Dataset: MovieLens 100K Dataset

References:

  1. Steffen Rendle. Factorization Machines.
  2. Steffen Rendle. Factorization Machines with libFM. University of Konstanz.
  3. Meituan technology team blog
  4. Introductory Guide – Factorization Machines & their application on huge datasets (with codes in Python)
  5. When Recommender Systems Meet Deep Learning (Part 1): FM Theory and Practice (推荐系统遇上深度学习(一))
  6. FM, FTRL, Softmax
  7. LR+FTRL: Algorithm Principles and Engineering Implementation (the previous article)





