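Python Data Analysis and Visualization by Example: Extracting Text Topics (30)
Full series index: Python Data Analysis and Visualization by Example, Table of Contents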
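1. Project background:
This continues the previous installment: Python Data Analysis, Text Processing: Text Similarity
PS: strike while the iron is hot.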
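2. Analysis steps:
(1) Load the dictionary and corpus prepared in the earlier installments;
(2) Compute tf-idf and LSI;
(3) Transform the corpus into the LSI model; the resulting vectors can be used for clustering or classification, e.g. text classification with scikit-learn models;
(4) Fit the LDA model, which gives each document's topics as probabilities.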
Reference: "主题模型TopicModel:通过gensim实现LDA" (Topic models: implementing LDA with gensim), 皮皮blog, CSDN
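To make step (2) less of a black box, here is a minimal pure-Python sketch of the tf-idf re-weighting. It assumes the weighting tf * log2(N / df) followed by L2 normalisation, which is my understanding of gensim's TfidfModel defaults; the names tfidf_transform and toy_corpus are made up for illustration and are not part of gensim.

```python
import math

def tfidf_transform(corpus_bow):
    """Re-weight bag-of-words docs with tf * log2(N / df), then L2-normalise.

    Assumption: this mirrors gensim's default TfidfModel weighting.
    """
    num_docs = len(corpus_bow)

    # Document frequency: the number of documents each token id occurs in.
    doc_freq = {}
    for doc in corpus_bow:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    corpus_tfidf = []
    for doc in corpus_bow:
        weighted = [
            (token_id, count * math.log2(num_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        # A token that occurs in every document gets weight 0; drop it.
        weighted = [(t, w) for t, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        corpus_tfidf.append([(t, w / norm) for t, w in weighted] if norm else [])
    return corpus_tfidf

# Hypothetical toy corpus: three documents over token ids 0..2.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1)]]
toy_tfidf = tfidf_transform(toy_corpus)
```

Token 0 appears in all three toy documents, so it carries no information and drops out of every vector; the third document ends up empty.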
3. Source code:
# coding: utf-8
# In[1]:
import os
from gensim import corpora, models, similarities
from pprint import pprint
from matplotlib import pyplot as plt
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# In[2]:
def PrintDictionary(dictionary):
    token2id = dictionary.token2id
    dfs = dictionary.dfs
    token_info = {}
    for word in token2id:
        token_info[word] = dict(
            word = word,
            id = token2id[word],
            freq = dfs[token2id[word]]
        )
    token_items = token_info.values()
    token_items = sorted(token_items, key = lambda x:x['id'])
    print('The info of dictionary: ')
    pprint(token_items)
    print('--------------------------')
# In[3]:
def Show2dCorpora(corpus):
    nodes = list(corpus)
    ax0 = [x[0][1] for x in nodes] # plot the point representing each doc
    ax1 = [x[1][1] for x in nodes]
    # print(ax0)
    # print(ax1)
    plt.plot(ax0,ax1,'o')
    plt.show()
# In[4]:
if (os.path.exists("../../tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('../../tmp/deerwester.dict')
    corpus = corpora.MmCorpus('../../tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")
# In[5]:
PrintDictionary(dictionary)
# In[6]:
# Convert the corpus (bag-of-words form) to tf-idf form
tfidf_model = models.TfidfModel(corpus) # step 1 -- initialize a model: re-weights documents from raw term counts to tf-idf
doc_bow = [(0, 1), (1, 1), (4, 3)] # a single document as (token_id, count) tuples
doc_tfidf = tfidf_model[doc_bow]
doc_tfidf
# In[7]:
# Convert the whole corpus to tf-idf form
corpus_tfidf = tfidf_model[corpus]
pprint(list(corpus_tfidf))
pprint(list(corpus))
# In[8]:
## LSI model **************************************************
# Transform into the LSI model; the output can be used for clustering or classification
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]
nodes = list(corpus_lsi)
pprint(nodes)
# In[9]:
lsi_model.print_topics(2) # print what each topic consists of
# In[10]:
ax0 = [x[0][1] for x in nodes] # plot the point representing each doc
ax1 = [x[1][1] for x in nodes]
print(ax0)
print(ax1)
plt.plot(ax0,ax1,'o')
plt.show()
# In[11]:
lsi_model.save('../../tmp/model.lsi') # same for tfidf, lda, ...
lsi_model = models.LsiModel.load('../../tmp/model.lsi')
# In[12]:
## LDA model **************************************************
lda_model = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda_model[corpus_tfidf]
Show2dCorpora(corpus_lsi) # note: this plots the LSI projection, not corpus_lda
# In[13]:
nodes = list(corpus_lda)
pprint(list(corpus_lda))
# In[14]:
# Other models are also available, e.g. Random Projections and Hierarchical Dirichlet Process
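One thing to watch in the listing: Show2dCorpora reads x[0][1] and x[1][1], which presumes every document vector carries both topic entries. gensim returns sparse (topic_id, weight) lists, and LDA in particular may leave out topics whose probability is very low, so that indexing can fail on LDA output. A small sketch of a workaround, assuming a hypothetical helper to_dense that pads missing topics with zeros (the sample vectors are invented, not output from the models above):

```python
def to_dense(sparse_docs, num_topics):
    """Expand sparse [(topic_id, weight), ...] vectors into fixed-length rows."""
    dense = []
    for doc in sparse_docs:
        row = [0.0] * num_topics  # topics missing from the sparse vector stay 0
        for topic_id, weight in doc:
            row[topic_id] = weight
        dense.append(row)
    return dense

# Hypothetical topic vectors: the first document lists only one topic,
# the last one lists none.
sample = [[(0, 0.98)], [(0, 0.35), (1, 0.65)], []]
rows = to_dense(sample, num_topics=2)

# Coordinates that could be passed to plt.plot(xs, ys, 'o').
xs = [row[0] for row in rows]
ys = [row[1] for row in rows]
```

The same fixed-length rows are also the shape a scikit-learn estimator expects (one row per document, one column per topic), which fits the clustering and classification use mentioned in step (3). gensim ships its own converter, gensim.matutils.corpus2dense, which returns the transposed layout (topics by documents).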
Edited on 2020-12-30 16:17