Using the Stanford CoreNLP API

1. Generating annotations

The backbone of the CoreNLP package is formed by two classes:

  1. Annotation
  2. Annotator

An Annotation is the data structure that holds the results of Annotators. Annotations are basically maps from keys to bits of annotation, such as the parse, the part-of-speech tags, or named entity tags.

Annotators are more like functions, except that they operate over Annotations rather than Objects: an Annotator can tokenize, parse, or NER-tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. StanfordCoreNLP inherits from the AnnotationPipeline class and is customized with NLP Annotators.
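To make that relationship concrete, here is a minimal sketch that assembles an AnnotationPipeline by hand rather than going through StanfordCoreNLP. It assumes the CoreNLP 3.x constructors TokenizerAnnotator(boolean), WordsToSentencesAnnotator(boolean), and POSTaggerAnnotator(boolean), mirroring the pattern used in the SUTime documentation:

import edu.stanford.nlp.pipeline.*;

public class ManualPipelineSketch {

    public static void main(String[] args) {
        // an AnnotationPipeline is just an ordered sequence of Annotators
        AnnotationPipeline pipeline = new AnnotationPipeline();
        pipeline.addAnnotator(new TokenizerAnnotator(false));         // tokenize (verbose = false)
        pipeline.addAnnotator(new WordsToSentencesAnnotator(false));  // ssplit
        pipeline.addAnnotator(new POSTaggerAnnotator(false));         // pos, with the default model

        // an Annotation starts out holding only the raw text...
        Annotation annotation = new Annotation("This is a simple text.");

        // ...and each Annotator fills in its results under its own keys
        pipeline.annotate(annotation);
    }
}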


The currently supported Annotators and the Annotations they generate are summarized on the Annotators page of the documentation. Some examples:

  1. tokenize (TokenizerAnnotator class): tokenizes the text, splitting it into roughly word-level tokens.
  2. pos (POSTaggerAnnotator class): labels each token with its part-of-speech tag.
  3. parse (ParserAnnotator class): provides full syntactic analysis, using both constituent and dependency representations. The constituent-based output is stored in TreeAnnotation.

A Stanford CoreNLP object is created with StanfordCoreNLP(Properties props); this builds a pipeline from the annotators listed in the "annotators" property:


import edu.stanford.nlp.pipeline.*;
import java.util.*;

public class BasicPipelineExample {

    public static void main(String[] args) {

        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "...";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

    }

}

We can pass more configuration to CoreNLP by building a richer Properties object. A few properties are overall properties, such as "annotators", but most apply to a single annotator and are written as annotator.property. Note that a property value must always be a String. In the documentation for individual annotators, we write types such as "boolean", "file, classpath, or URL" or "List<String>"; this means that the String value will be parsed into a value of that type. Since the values in a Properties object must be Strings, a convenient way to set several properties at once is PropertiesUtils.asProperties(String ...), as shown below:

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(
	PropertiesUtils.asProperties(
		"annotators", "tokenize,ssplit,pos,lemma,parse,natlog",
		"ssplit.isOneSentence", "true",
		"parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
		"tokenize.language", "en"));

// read some text in the text variable
String text = "..."; // Add your text here!
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

2. Interpreting the output

The output of the Annotators is accessed using the CoreMap and CoreLabel data structures.

// imports assumed by the snippets below (CoreNLP 3.x layout; the coref classes have moved between versions)
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);
  }

  // this is the parse tree of the current sentence
  Tree tree = sentence.get(TreeAnnotation.class);

  // this is the Stanford dependency graph of the current sentence
  SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = 
  document.get(CorefChainAnnotation.class);
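
The coreference chains are retrieved above but never printed; a small hypothetical addition that walks the chains and prints each one's most representative mention (CorefChain.getRepresentativeMention() and the public mentionSpan field are part of the dcoref API):

for (CorefChain chain : graph.values()) {
  // mentionSpan holds the surface text of the representative mention
  System.out.println(chain.getChainID() + ": "
      + chain.getRepresentativeMention().mentionSpan);
}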

Printing each token's word, POS tag, and NER label, followed by the parse tree and the dependency graph, gives results like the following:

this is a simple text

this	DT	O
is	VBZ	O
a	DT	O
simple	JJ	O
text	NN	O

Parse tree
(ROOT (S (NP (DT this)) (VP (VBZ is) (NP (DT a) (JJ simple) (NN text)))))

Dependency parse
-> text/NN (root)
  -> this/DT (nsubj)
  -> is/VBZ (cop)
  -> a/DT (det)
  -> simple/JJ (amod)

The difference between constituent-based and dependency-based parses will be covered in a separate article.


3. Using the Chinese models

Compared with English, processing Chinese text is slightly more involved: you need to give the pipeline a configuration file.

First you need stanford-corenlp-3.8.0-models-chinese.jar, which can be declared in Maven:

		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.8.0</version>
		</dependency>
		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.8.0</version>
			<classifier>models</classifier>
		</dependency>

		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.8.0</version>
			<classifier>models-chinese</classifier>
		</dependency>

The Chinese models package ships with a default configuration file, StanfordCoreNLP-chinese.properties, inside stanford-corenlp-3.8.0-models-chinese.jar, shown below.
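
(The original article showed the file contents as an image, which did not survive. The excerpt below is reconstructed from memory of the 3.8.0 models jar and may not match it line for line; treat the copy inside your jar as authoritative.)

# reconstructed excerpt of StanfordCoreNLP-chinese.properties (assumed, not verbatim)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segmenter
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence splitting
ssplit.boundaryTokenRegex = [.。]|[!?!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz

# parse
parse.model = edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz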


It mainly specifies which steps the pipeline runs and where the corresponding model files live. In practice you may not need every step, or you may want to use different models, so you can write a custom configuration file and load it from your code:

public void runAllAnnotators() {
	StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
	String text2 = "我爱北京天安门";
	Annotation document = new Annotation(text2);
	pipeline.annotate(document);
	parserOutput(document); // helper that prints the analysis; a sketch follows below
}
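
parserOutput is the author's own helper and is not shown in the article; a minimal sketch, reconstructed so that it would produce roughly the output below (tokens with POS and NER tags, then the parse tree and the SemanticGraph), could look like this, using the same imports as the section 2 snippets:

// assumed helper, reconstructed to match the printed output below
public void parserOutput(Annotation document) {
	for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
		// one line per token: word, POS tag, NER label
		for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
			System.out.println(token.get(TextAnnotation.class) + "\t"
					+ token.get(PartOfSpeechAnnotation.class) + "\t"
					+ token.get(NamedEntityTagAnnotation.class));
		}
		System.out.println("Parse tree");
		System.out.println(sentence.get(TreeAnnotation.class));
		System.out.println("Dependency parse");
		System.out.println(sentence.get(CollapsedCCProcessedDependenciesAnnotation.class));
	}
}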

Running this gives the following results:

我爱	VV	O
北京	NR	GPE
天安门	NR	FACILITY
Parse tree
(ROOT (IP (VP (VV 我爱) (NP (NP (NR 北京)) (NP (NR 天安门))))))
Dependency parse
-> 我爱/VV (root)
  -> 天安门/NR (dobj)
    -> 北京/NR (compound:nn)
