Python Data Analysis and Visualization Examples: Extracting Text Topics (30)

civilpy

Registered civil engineer (certificate holder)

Series table of contents: Python Data Analysis and Visualization Examples Index

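1. Project background:

This continues the previous installment: Python Data Analysis: Text Processing and Text Similarity

PS: Strike while the iron is hot.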

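2. Analysis steps:

(1) Load the dictionary and corpus prepared in the previous installments;

(2) Compute tf-idf and LSI;

(3) Transform the corpus with the LSI model; the resulting vectors can be used for clustering or classification, for example text classification with scikit-learn models;

(4) Apply the LDA model, which expresses the topics of each document as probabilities.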
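
Step (3) mentions that LSI vectors can feed a classifier. The sketch below illustrates that idea under stated assumptions: each document is taken to be already reduced to a 2-D LSI vector, the coordinates and labels are made up for illustration, and a plain nearest-centroid rule stands in for a scikit-learn model so that only the standard library is needed.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def nearest_label(vec, centroids):
    """Return the label whose centroid is closest to vec (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Hypothetical 2-D LSI coordinates with made-up class labels
train = {
    "hci": [[0.07, 0.52], [0.08, 0.65], [0.09, 0.39]],
    "graphs": [[0.62, -0.17], [0.71, -0.24], [0.59, -0.12]],
}
centroids = {label: centroid(pts) for label, pts in train.items()}

print(nearest_label([0.10, 0.48], centroids))
print(nearest_label([0.66, -0.20], centroids))
```

With real data, the rows would come from the LSI-transformed corpus and the classifier could be swapped for any model that accepts a dense feature matrix.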

Reference: Topic Model: Implementing LDA with gensim - 皮皮blog - CSDN blog

3. Source code:

# coding: utf-8



# In[1]:

import os
from gensim import corpora, models, similarities
from pprint import pprint
from matplotlib import pyplot as plt
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


# In[2]:

def PrintDictionary(dictionary):
    token2id = dictionary.token2id
    dfs = dictionary.dfs
    token_info = {}
    for word in token2id:
        token_info[word] = dict(
            word = word,
            id = token2id[word],
            freq = dfs[token2id[word]]
        )
    token_items = token_info.values()
    token_items = sorted(token_items, key = lambda x:x['id'])
    print('The info of dictionary: ')
    pprint(token_items)
    print('--------------------------')


# In[3]:

def Show2dCorpora(corpus):
    nodes = list(corpus)
    ax0 = [x[0][1] for x in nodes]  # plot one point per document
    ax1 = [x[1][1] for x in nodes]
    # print(ax0)
    # print(ax1)
    plt.plot(ax0,ax1,'o')
    plt.show()


# In[4]:

if (os.path.exists("../../tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('../../tmp/deerwester.dict')
    corpus = corpora.MmCorpus('../../tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")


# In[5]:

PrintDictionary(dictionary)


# In[6]:

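# Convert the corpus from bag-of-words (BoW) form to tf-idf form
tfidf_model = models.TfidfModel(corpus)  # step 1 -- initialize a model that maps raw term counts to tf-idf weights
doc_bow = [(0, 1), (1, 1), (4, 3)]  # sample document as (token id, count) pairs
doc_tfidf = tfidf_model[doc_bow]  # step 2 -- apply the model to a single document
print(doc_tfidf)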


# In[7]:

# Convert the whole corpus to tf-idf form
corpus_tfidf = tfidf_model[corpus]
pprint(list(corpus_tfidf))
pprint(list(corpus))
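
To make the difference between the two printed corpora concrete, here is a small pure-Python sketch of a tf-idf weighting. It assumes the scheme that, to my understanding, matches gensim's defaults (term count times log2(N / df), followed by L2 normalisation). The function name `tfidf_transform` and the toy corpus are hypothetical, not part of gensim.

```python
import math


def tfidf_transform(corpus_bow):
    """Return L2-normalised tf-idf vectors for a list of bag-of-words documents."""
    num_docs = len(corpus_bow)

    # Document frequency: in how many documents each term id occurs.
    doc_freq = {}
    for doc in corpus_bow:
        for term_id, _count in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    result = []
    for doc in corpus_bow:
        weighted = [
            (term_id, count * math.log2(num_docs / doc_freq[term_id]))
            for term_id, count in doc
        ]
        # A term present in every document has idf 0 and is dropped.
        weighted = [(t, w) for t, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        result.append([(t, w / norm) for t, w in weighted] if norm else [])
    return result


# Made-up corpus: three documents as (term id, count) pairs
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(1, 1), (2, 1), (3, 1)],
]
for vec in tfidf_transform(toy_corpus):
    print([(t, round(w, 3)) for t, w in vec])
```

Each output vector has unit length, and term 3, which occurs in only one document, receives the largest weight in the third document.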


# In[8]:

## LSI model **************************************************
# Transform into LSI space; the result can be used for clustering or classification
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]
nodes = list(corpus_lsi)
pprint(nodes)


# In[9]:

lsi_model.print_topics(2)  # print the term weights that define each topic


# In[10]:

ax0 = [x[0][1] for x in nodes]  # plot one point per document
ax1 = [x[1][1] for x in nodes]
print(ax0)
print(ax1)
plt.plot(ax0,ax1,'o')
plt.show()


# In[11]:

lsi_model.save('../../tmp/model.lsi') # same for tfidf, lda, ...
lsi_model = models.LsiModel.load('../../tmp/model.lsi')


# In[12]:

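## LDA model **************************************************
lda_model = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda_model[corpus_tfidf]
Show2dCorpora(corpus_lsi)  # plots the LSI projection; LDA vectors may omit low-probability topics, so indexing them by position is unsafe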


# In[13]:

nodes = list(corpus_lda)  # each document as a list of (topic id, probability) pairs
pprint(nodes)
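
The LDA output is sparse: gensim leaves out topics whose probability falls below the model's `minimum_probability`, so a document may come back with fewer than two pairs, and positional indexing as in `Show2dCorpora` can fail. The sketch below shows one way to expand such vectors to a fixed length before plotting. The helper `to_dense` and the sample values are hypothetical; gensim's own `matutils.sparse2full` serves a similar purpose.

```python
def to_dense(sparse_vec, num_topics):
    """Expand (topic id, weight) pairs into a fixed-length list, filling gaps with 0.0."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = weight
    return dense


# Hypothetical LDA output for two documents
sample_docs = [
    [(0, 0.83), (1, 0.17)],
    [(1, 0.995)],  # topic 0 fell below the threshold and was omitted
]

points = [to_dense(doc, num_topics=2) for doc in sample_docs]
xs = [p[0] for p in points]  # probability of topic 0 per document
ys = [p[1] for p in points]  # probability of topic 1 per document
print(xs, ys)
```

The `xs` and `ys` lists can then be passed to `plt.plot(xs, ys, 'o')` in the same way as the LSI coordinates.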


# In[14]:

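# Other available models include Random Projections and the Hierarchical Dirichlet Process

Newcomers can browse the archive index:

Finally, don't just bookmark this without following!

Edited on 2020-12-30 16:17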

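This article is included in the following column:

Python Data Collection, Processing, Analysis, Mining, and Visualization Application Examples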