TF-IDF in Python: Implementation and a Simple Example (Part 1)


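TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most widely used methods for extracting features from text, with applications across natural language processing (NLP) such as information retrieval and semantic extraction. This post briefly walks through a TF-IDF implementation for English text and then uses a word cloud, a currently popular visualization, to present the result intuitively.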

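Development environment: Python 3.6.0 and NLTK 3.2. (NLTK is a Python library widely used in natural language processing; its built-in methods can greatly speed up development. Official site: Natural Language Toolkit. It can also be installed with pip.)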

$ pip3 install nltk

Once nltk is installed, the NLTK packages can be imported in Python. Specific usage is shown in the code below and will also be covered in later posts.

Let's start with data preprocessing. First, import the packages we will need:

import nltk
import math
import string

from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer

If this is your first time using nltk, run the following command to install toolkits that are not bundled by default.

nltk.download()

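Here we need the punkt and stopwords packages (used for tokenization and stopword removal, respectively). If you skip the two steps below, the program may raise an error.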

nltk.download('punkt')
nltk.download('stopwords')

Next, we define three passages as our corpus (taken from Wikipedia's article on NLP):

text1 = "Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof."

text2 = "The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed."

text3 = "During the 1970s, many programmers began to write conceptual ontologies, which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky."

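That completes the basic setup; I am using a deliberately small corpus here. Next comes the basic preprocessing, which mainly involves tokenization, stemming, and removing high-frequency function words that carry little meaning (stopwords).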

def get_tokens(text):
    lower = text.lower()
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    no_punctuation = lower.translate(remove_punctuation_map)
    tokens = nltk.word_tokenize(no_punctuation)

    return tokens

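This step defines a tokenization function. It converts all letters to lowercase to simplify later analysis, strips punctuation, and turns a passage into a Python list of words. For example, the sentence "Nature language processing is cool !" becomes the list ["nature", "language", "processing", "is", "cool"]; the "!" is dropped because punctuation is removed before tokenizing.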
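As a rough illustration of what the tokenization step does, here is a sketch that avoids the NLTK dependency. It uses `str.split()` as a stand-in for `nltk.word_tokenize`, an assumption made only so the snippet runs without downloaded data; the two can differ on contractions and other edge cases. The name `simple_tokens` is hypothetical.

```python
import string


def simple_tokens(text):
    """Lowercase, strip ASCII punctuation, then split on whitespace.

    Hypothetical stand-in for get_tokens: str.split() replaces
    nltk.word_tokenize so the snippet needs no NLTK data.
    """
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


print(simple_tokens("Natural language processing is cool!"))
# ['natural', 'language', 'processing', 'is', 'cool']
```

Note that deleting punctuation also fuses hyphenated words, so "human-computer" becomes the single token "humancomputer".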

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))

    return stemmed
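
This step stems the tokens produced above. English words appear in many inflected forms, for example apple and apples (singular and plural) or process and processing (base verb and participle/gerund). Such forms usually carry the same meaning, so stemming the inflected words lets us group words that share a stem and a sense. That covers the routine NLP text preprocessing; next, a brief look at how TF-IDF works.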
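To make the idea of grouping inflected forms concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm used in this post: `naive_stem` is a hypothetical helper with only a handful of rules, whereas the real stemmer applies many more and often produces truncated stems such as `languag`.

```python
def naive_stem(word):
    """Strip a few common English suffixes (NOT the Porter algorithm)."""
    if word.endswith("sses"):
        return word[:-2]   # processes -> process
    if word.endswith("ss"):
        return word        # process stays as it is
    if word.endswith("s"):
        return word[:-1]   # apples -> apple
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]   # processing -> process
    return word


words = ["apple", "apples", "process", "processing", "processes"]
print([naive_stem(w) for w in words])
# ['apple', 'apple', 'process', 'process', 'process']
```

After stemming, the five surface forms collapse into two stems, which is what lets the later counting step treat them as the same term.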

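TF (Term Frequency): how frequently a term occurs in a piece of text. Because stopwords (high-frequency words such as to, is, are, and the that carry little real meaning) are removed during preprocessing, we can generally assume that the more often a remaining word occurs, the more it matters to the document. In the formula below, n_{i,j} is the number of times term t_i occurs in document d_j, and the denominator is the total number of term occurrences in d_j.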

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
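A small worked example of the term-frequency formula, using `collections.Counter`. The token list is an assumption: it stands for what might remain of a short sentence after stopword removal.

```python
from collections import Counter

# Assumed tokens left over after stopword removal.
tokens = ["thing", "made", "china", "thing", "big"]

counts = Counter(tokens)        # n_{i,j}: occurrences of each term
total = sum(counts.values())    # sum over k of n_{k,j}: all term occurrences

tf_scores = {word: n / total for word, n in counts.items()}

print(tf_scores["thing"])  # 2 / 5 = 0.4
print(tf_scores["china"])  # 1 / 5 = 0.2
```

By raw frequency alone, the uninformative word "thing" outranks "china" and "big".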

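IDF (Inverse Document Frequency): Recall stopwords: they occur very frequently in a document yet say little about what it means. Likewise, among the non-stopwords, some words reflect the real content of a text much better than others. Take "this thing made in china, and this thing is big". Here thing is just a placeholder: it tells you neither what the object is nor anything about its properties, whereas big and china describe what the sentence says quite well. Yet thing has a higher term frequency than either china or big, which is clearly a problem. To address this, TF needs a correction, and that correction is the inverse document frequency: it is inversely related to how common a word is across documents. In the formula below, |D| is the total number of documents and the denominator is the number of documents that contain term t_i.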

idf_{i} = \log\frac{\left| D \right|}{\left| \left\{ j : t_{i} \in d_{j} \right\} \right|}
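The IDF formula can be checked on a toy corpus. This is a minimal sketch: the three documents are invented for illustration, each is represented simply as a set of terms, and `plain_idf` is a hypothetical name for the unsmoothed formula, which would raise `ZeroDivisionError` for a term that appears in no document.

```python
import math

# Invented toy corpus: each document is the set of terms it contains.
docs = [
    {"thing", "made", "china"},
    {"thing", "big"},
    {"thing", "red"},
]


def plain_idf(term, docs):
    """log(|D| / number of documents containing term), natural log."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)


print(plain_idf("thing", docs))            # log(3 / 3) = 0.0
print(round(plain_idf("china", docs), 4))  # log(3 / 1) -> 1.0986
```

A term found in every document gets an IDF of zero, while a term found in only one document gets the largest weight.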

Multiplying TF by IDF gives the TF-IDF score:

\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)
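As a worked instance with assumed numbers: suppose a term accounts for 2 of a document's 10 tokens and appears in 1 of 3 documents. Using the natural logarithm:

```latex
\text{TF-IDF}(t) = \frac{2}{10} \times \log\frac{3}{1} \approx 0.2 \times 1.0986 \approx 0.2197
```

If the same term appeared in all 3 documents, the IDF factor would be \log(3/3) = 0 and the score would drop to zero regardless of its term frequency.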

The following code implements TF-IDF:

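def tf(word, count):
    # Term frequency: occurrences of word divided by total tokens in the document.
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    # Number of documents whose Counter contains word.
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    # log(|D| / (1 + df)); the +1 avoids division by zero for unseen words.
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)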

Here we call the helper functions written earlier to put the TF-IDF computation together:

def count_term(text):
    tokens = get_tokens(text)
    # Build the stopword set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop_words]
    stemmer = PorterStemmer()
    stemmed = stem_tokens(filtered, stemmer)
    count = Counter(stemmed)
    return count

def main():
    texts = [text1, text2, text3]
    countlist = []
    for text in texts:
        countlist.append(count_term(text))
    for i, count in enumerate(countlist):
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key = lambda x: x[1], reverse=True)
        for word, score in sorted_words[:5]:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

if __name__ == "__main__":
    main()

Output of the original run (exact scores depend on how the IDF term is computed):

[nltk_data] Downloading package punkt to ...
[nltk_data] \AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
Top words in document 1
        Word: languag, TF-IDF: 0.07121
        Word: natur, TF-IDF: 0.06103
        Word: comput, TF-IDF: 0.04069
        Word: process, TF-IDF: 0.03052
        Word: concern, TF-IDF: 0.02034
Top words in document 2
        Word: translat, TF-IDF: 0.05086
        Word: machin, TF-IDF: 0.02713
        Word: research, TF-IDF: 0.02034
        Word: sixti, TF-IDF: 0.01017
        Word: littl, TF-IDF: 0.01017
Top words in document 3
        Word: mani, TF-IDF: 0.02555
        Word: lehnert, TF-IDF: 0.02555
        Word: 1978, TF-IDF: 0.02555
        Word: began, TF-IDF: 0.01277
        Word: exampl, TF-IDF: 0.01277

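This gives us a weight for how important each word is to its document. In the second part, we will use these weights to build a word cloud, which makes the data more intuitive.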

2017.5.8 牛肉咖喱饭

Edited on 2017-05-08
