[Hands-on NLP] Sentiment Analysis of Douban Movie Reviews

As the title says: use an LSTM RNN to predict the sentiment of short movie reviews on Douban.

  1. Data preparation -- build a Douban crawler with bs4 and requests
  2. Word vector training -- train word vectors on the Chinese Wikipedia corpus with gensim
  3. Train the RNN
  4. Prediction results



Data Preparation

Write a Douban crawler that scrapes movie data and rated movie reviews.

Use requests.Session to maintain cookies and request the movie and review pages;
use json and BeautifulSoup to extract the data;
use MySQLdb to store it.
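
The storage helper add_movie (and its sibling add_comment) is called in the crawler code below but never shown. A minimal sketch of what it might look like, assuming a MySQL table named movie whose columns simply mirror the call's arguments (table and column names are my own, not the author's schema):

import json
import MySQLdb

# one shared connection, matching the MySQLdb usage described above
conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="", db="douban",
                       charset="utf8mb4")


def add_movie(movie_id, title, types, actors, rating, score,
              release_date, regions, url, cover_url):
    # INSERT IGNORE on the primary key is one easy way to drop the duplicate
    # movies mentioned below; list-valued fields are stored as JSON strings
    cur = conn.cursor()
    cur.execute(
        "INSERT IGNORE INTO movie (id, title, types, actors, rating, score, "
        "release_date, regions, url, cover_url) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
        (movie_id, title, json.dumps(types), json.dumps(actors), rating,
         score, release_date, json.dumps(regions), url, cover_url))
    conn.commit()
    cur.close()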


Douban rate-limits requests from the same user: if the same user requests too quickly it asks for a captcha, so crawling while logged in is not practical (unless you have a huge pile of Douban accounts). Without logging in you can only fetch the first 10 pages of reviews per movie, 200 reviews in total. Also keep rotating the cookie and User-Agent to avoid being blocked, and avoid multithreading unless you have plenty of proxies, otherwise your IP will be banned very quickly.


Through the page movie.douban.com/typera you can collect a large number of movies by category and ranking (the same movie can appear under several categories, so deduplicate). Note that the data is loaded dynamically; just work out its JSON URL and parse the returned data.
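
get_json_url, used in get_movie.py below, is not shown. A rough sketch of how that JSON URL could be built; the endpoint path and the exact query parameters (the interval_id band, the start/limit paging) are my assumptions about the typerank page, not confirmed by the post:

def get_json_url(m_type, interval_start, page, page_size=20):
    # interval_id selects a rating-percentile band such as "100:90"
    interval_id = "%d:%d" % (100 - interval_start, 90 - interval_start)
    return ("https://movie.douban.com/j/chart/top_list"
            "?type=%d&interval_id=%s&action=&start=%d&limit=%d"
            % (m_type, interval_id, page * page_size, page_size))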


With this approach I collected 15,748 movies.



Through the page movie.douban.com/subjec you can get the reviews; parsing the page structure is enough to extract the short comments on each page. After going through about 60% of the movies I had 1,297,427 comments with ratings; the counts for one to five stars are 43,806, 129,079, 446,463, 459,875 and 218,204 respectively. More than enough.
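
get_comment_url, referenced in get_comment.py further below, is not shown either. A sketch under the assumption that it targets the short-comment listing, 20 comments per page:

def get_comment_url(movie_id, page, page_size=20):
    # status=P restricts to "watched" comments; the sort order is an assumption
    return ("https://movie.douban.com/subject/%d/comments"
            "?start=%d&limit=%d&sort=new_score&status=P"
            % (movie_id, page * page_size, page_size))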


Some of the code follows.

get_movie.py: loop over the movie-data URLs, request and parse each one, and store the results in the database. The 0.3-second sleep keeps the request rate down so the IP doesn't get banned.

    # movie_type / interval / page come from the last saved checkpoint; only the
    # first pass through each loop resumes from them, later passes start at 0
    first_in_interval = True
    first_in_page = True
    for m_type in range(movie_type, 32):
        if m_type == 9 or m_type == 21:
            continue

        if not first_in_interval:
            interval = 0
        else:
            first_in_interval = False

        for i in range(interval, 100, 10):

            if not first_in_page:
                page = 0
            else:
                first_in_page = False

            while True:
                time.sleep(0.3)  # throttle to avoid getting the IP banned
                c_url = get_json_url(m_type, i, page)
                data = session.get(c_url)
                if data.status_code != 200:
                    print "Error: url: %s. code: %d" % (c_url, data.status_code)
                    refresh_cookie(session)  # rotate bid / User-Agent, then retry this page
                    continue
                else:
                    res = json.loads(data.text)

                    if not res:
                        break  # empty result: no more pages in this interval

                    for item in res:
                        movie_id = item['id']
                        title = item['title']
                        types = item['types']
                        actors = item['actors']
                        if len(item['rating']) == 2:
                            rating = int(item['rating'][1]) / 10
                        else:
                            rating = 0
                        score = float(item['score'])
                        if "release_date" in item:
                            release_date = item['release_date']
                        else:
                            release_date = '1970-01-01'
                        regions = item['regions']
                        url = item['url']
                        cover_url = item['cover_url']
                        add_movie(movie_id, title, types, actors, rating, score, release_date, regions, url, cover_url)

                    print "current at type[%d], interval[%d], page[%d]" % (m_type, i, page)
                    set_movie_ckpt(m_type, i, page)  # save progress so the crawl can resume here
                random_refresh_cookie(session)  # occasionally rotate the cookie / User-Agent
                page += 1

get_comment.py: request the comment-list pages and parse them.

def get_movie_comment():
    with requests.Session() as session:
        while True:
            try:
                m_p = queue.get(block=False)
                movie_id = m_p[0]
                page = m_p[1]
            except Queue.Empty as e:
                return
            n = get_next_movie_id(movie_id)

            print "start get movie: %d, page: %d, 已完成:百分之 %f" % (movie_id, page, n[1] * 100.0 / (n[2] + n[1]))

            for i in range(page, 11):
                random_refresh_cookie(session)
                time.sleep(0.3)
                # print "get movie : %d, page : %d" % (movie_id, i)
                url = get_comment_url(movie_id, i)

                res = get_res(session, url)

                if not res:
                    print "res get error:" + url
                    continue

                if res.status_code != 200:
                    print "Error: url: %s. code: %d" % (url, res.status_code)
                    refresh_cookie(session)
                    continue

                soup = BeautifulSoup(res.text, 'html.parser')

                comments = soup.find_all("div", class_="comment-item")

                if not comments:
                    # comment for this one is over
                    set_comment_ckpt(movie_id, i)
                    break

                for comment in comments:
                    rating = comment.find("span", class_="rating")
                    if not rating:
                        continue

                    rating_class = rating['class']
                    rating_num = 1
                    if "allstar10" in rating_class:
                        rating_num = 1
                    if "allstar20" in rating_class:
                        rating_num = 2
                    if "allstar30" in rating_class:
                        rating_num = 3
                    if "allstar40" in rating_class:
                        rating_num = 4
                    if "allstar50" in rating_class:
                        rating_num = 5

                    comment_p = comment.find('p')
                    comment_content = comment_p.get_text()

                    cid = comment['data-cid']

                    div_avatar = comment.find('div', class_='avatar')
                    avatar_a = div_avatar.find('a')
                    avatar_img = avatar_a.find('img')

                    user_avatar = avatar_img['src']
                    user_name = avatar_a['title']
                    user_location = avatar_a['href']
                    user_id = user_location[30:]
                    user_id = user_id[:-1]

                    span_comment_time = comment.find('span', class_='comment-time')
                    comment_time = span_comment_time['title']

                    span_votes = comment.find('span', class_='votes')
                    votes = span_votes.get_text()

                    lock.acquire()
                    add_comment(cid, movie_id,
                                user_id, user_avatar,
                                user_name, comment_content,
                                rating_num, comment_time, votes)
                    lock.release()

                set_comment_ckpt(movie_id, i)

refresh_cookie. I originally obtained the bid cookie by requesting the login page; later I came across an article pointing out that a random string works just as well.

def refresh_cookie(session):
    session.headers.clear()
    session.cookies.clear()
    session.headers = {
        "User-Agent": make_random_useragent("pc"),
        "Host": "movie.douban.com",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN, zh; q=0.8, en; q=0.6",
        "Cookie": "bid=%s" % "".join(random.sample(string.ascii_letters + string.digits, 11))
    }

    # The old approach: hit the login page once so Douban issues a bid cookie.
    # data = session.get("https://accounts.douban.com/login")
    #
    # if data.status_code != 200:
    #     print "failed to get a new cookie: " + str(data.status_code)
    # else:
    #     print "got a new cookie"

    # session.headers['Host'] = "movie.douban.com"

    return 200
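
random_refresh_cookie, called throughout both crawlers, is not shown in the post; presumably it only refreshes some of the time so the bid does not change on every single request. A sketch of that assumption:

import random


def random_refresh_cookie(session, probability=0.1):
    # rotate the cookie / User-Agent only occasionally
    if random.random() < probability:
        refresh_cookie(session)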

Word Vector Training


Use the Chinese Wikipedia corpus.



First download the Chinese dump from dumps.wikimedia.org/zhw , then use wikiextractor (attardi/wikiextractor) to extract the text from it. Convert Traditional Chinese to Simplified with OpenCC, strip the tags with a regular expression, and segment the text with jieba.
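
The extraction and conversion steps are command-line tools rather than Python code. A sketch of the two calls, assuming wikiextractor's WikiExtractor.py and the opencc CLI are installed; the opencc arguments mirror the commented-out call in the segmentation script below, while the WikiExtractor flags and file paths are my assumptions:

import subprocess

# 1. extract plain text from the downloaded dump with wikiextractor
subprocess.check_call(
    "python WikiExtractor.py -o extracted zhwiki-latest-pages-articles.xml.bz2",
    shell=True)

# 2. convert Traditional to Simplified Chinese with OpenCC, file by file
subprocess.check_call(
    "opencc -i extracted/AA/wiki_00 -o simplified/AA/wiki_00 -c t2s.json",
    shell=True)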


Segmentation and tag removal:

# coding=utf-8
import jieba
import os
import sys
import re
reload(sys)
sys.setdefaultencoding("utf-8")


def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext

if __name__ == '__main__':
    inp, outp = sys.argv[1:3]
    count = 0
    space = ' '
    for root, dirs, files in os.walk(inp):
        for filename in files:
            if filename.startswith("wiki"):
                if not os.path.exists(outp + root):
                    os.makedirs(outp + root)

                output = open(outp + root + "/" + filename, 'w')
                f = open(root + "/" + filename, 'r')

                for line in f.readlines():
                    seg_list = jieba.cut(cleanhtml(line))
                    output.write(space.join(seg_list) + '\n')

                f.close()
                output.close()
                # Traditional-to-Simplified conversion with opencc (run separately):
                # status, res = commands.getstatusoutput("opencc -i " + root + "/" + filename + " -o"
                #                          + " s_" + root + "/" + filename + " -c t2s.json")
                # if status == 0:

                count += 1
                print count

Finally, train on the segmented, tag-free text with gensim.

train.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    sentence = sline.split()
                    yield sentence


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/word2vec_gensim")
    model.wv.save_word2vec_format("data/word2vec_org",
                                  "data/vocabulary",
                                  binary=False)

    end = time()
    print "Total procesing time: %d seconds" % (end - begin)



Later I decided it would be better to train the word vectors on the review text together with the wiki corpus.


Parts of the code reference the following articles:

使用word2vec训练wiki中英文语料库 (training word2vec on the Chinese/English Wikipedia corpora) | 我爱自然语言处理

Training


Reviews with 1-2 stars count as negative, 3 stars as neutral, and 4-5 stars as positive. For each of the three classes, 15,000 comments are used for training and 5,000 for evaluation. The network is trained with tflearn's lstm + fully_connected layers (how the data itself is assembled is sketched after the code).

def main():
    train_x, train_y = get_train_data()
    test_x, test_y = get_test_data()

    print "正在处理训练数据的 padding"
    train_x = pad_sequences_array(train_x, maxlen=100, value=0.)
    print "正在处理测试数据的 padding"
    test_x = pad_sequences_array(test_x, maxlen=100, value=0.)

    train_y = to_categorical(train_y, 3)
    test_y = to_categorical(test_y, 3)

    net = tflearn.input_data([None, 100, 200])
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 3, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.0001, loss='categorical_crossentropy')

    model = tflearn.DNN(net, tensorboard_verbose=1,
                        tensorboard_dir='log/', checkpoint_path='model/',
                        best_checkpoint_path='best_model/', best_val_accuracy=0.8)
    model.load("model-19000")
    model.fit(train_x, train_y, validation_set=(test_x, test_y),
              show_metric=True, batch_size=32,
              validation_batch_size=32, snapshot_step=500)

def pad_sequences_array(sequences, maxlen=None, dtype='float64', padding='post',
                  truncating='post', value=0.):
    """
    Adapted from tflearn's pad_sequences, but each element is a 200-d word vector.
    """
    lengths = [len(s) for s in sequences]

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    x = (np.ones((nb_samples, maxlen, 200)) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError("Truncating type '%s' not understood" % padding)

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError("Padding type '%s' not understood" % padding)
    return x
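
get_train_data and get_test_data are not shown in the post. A hedged sketch of how the data could be assembled, assuming comments are segmented with jieba and looked up in the gensim model trained earlier (the helper names are mine; the label mapping follows the 1-2 / 3 / 4-5 star split above):

import jieba
import gensim

w2v = gensim.models.Word2Vec.load("data/word2vec_gensim")


def comment_to_vectors(text):
    # one 200-d vector per word; words missing from the vocabulary are skipped
    return [w2v.wv[w] for w in jieba.cut(text) if w in w2v.wv]


def rating_to_label(rating):
    # 1-2 stars -> 0 (negative), 3 stars -> 1 (neutral), 4-5 stars -> 2 (positive)
    if rating <= 2:
        return 0
    if rating == 3:
        return 1
    return 2

get_train_data would then pair comment_to_vectors(comment) with rating_to_label(rating) for 15,000 comments per class, and get_test_data for the 5,000 held-out comments per class.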


After about 15,000 steps the accuracy stopped improving (around 0.54), so I lowered the learning rate by a factor of 10 and kept training. The final accuracy was still only 0.54+.


Prediction

Here are some sample predictions. The leftmost column is the row number, followed by the comment text, the original rating, and the predicted rating.
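
The prediction step itself is not shown. A sketch of scoring a single comment with the trained model, reusing pad_sequences_array above and the hypothetical comment_to_vectors helper from the earlier sketch:

import numpy as np


def predict_comment(model, text):
    # model is the tflearn.DNN built in main(); returns 0 / 1 / 2
    vecs = pad_sequences_array([comment_to_vectors(text)], maxlen=100, value=0.)
    scores = model.predict(vecs)[0]
    return int(np.argmax(scores))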

Analysis


Since the accuracy beats 33%, the model is at least doing something. (A bit optimistic, I know.)



Some factors that hurt the accuracy:

  1. The review quality is not great; very short comments were not filtered out.
  2. The review text was not trained together with the wiki corpus, so some words in the reviews are missing from the wiki vocabulary.
  3. A small number of reviews genuinely have ratings that don't match the text =。=
  4. The amount of training data is not large enough.
  5. The network architecture itself. Next time I may try a CNN.


Postscript

Yesterday I exported all of the comments, trained a word vector model on them together with the wiki corpus, and then retrained the network on that vocabulary. The cross-validation accuracy rose to 0.57+ (90,000 training samples and 30,000 test samples; I wanted to use more data for training, but my machine ran out of memory).

So this does help. For future NLP projects it should pay off to train the word vectors on the task text together with the wiki corpus.
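
One simple way to do that with the MySentences class from train.py is a restartable iterator over both directories; the paths below are placeholders:

import multiprocessing

import gensim


class CombinedSentences(object):
    # restartable: gensim scans the corpus more than once (vocab pass + epochs)
    def __init__(self, dirnames):
        self.dirnames = dirnames

    def __iter__(self):
        for dirname in self.dirnames:
            for sentence in MySentences(dirname):
                yield sentence


sentences = CombinedSentences(["data/wiki_segmented", "data/comments_segmented"])
model = gensim.models.Word2Vec(sentences,
                               size=200,
                               window=10,
                               min_count=10,
                               workers=multiprocessing.cpu_count())
model.save("data/word2vec_gensim_joint")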


Some of the code:

xiang2017/douban_short_comment (github.com)

Edited 2018-05-30