Part-of-speech tagging:
The most commonly used part-of-speech (POS) tag set is the Penn Treebank tag set. Penn Treebank was originally the name of an NLP project that annotated corpora with POS tags and syntactic parses; its corpus comes from the 1989 Wall Street Journal and contains 2,499 articles. The Penn Treebank tags are listed below:

No.  Tag    Description
1    CC     Coordinating conjunction
2    CD     Cardinal number
3    DT     Determiner
4    EX     Existential there
5    FW     Foreign word
6    IN     Preposition or subordinating conjunction
7    JJ     Adjective
8    JJR    Adjective, comparative
9    JJS    Adjective, superlative
10   LS     List item marker
11   MD     Modal
12   NN     Noun, singular or mass
13   NNS    Noun, plural
14   NNP    Proper noun, singular
15   NNPS   Proper noun, plural
16   PDT    Predeterminer
17   POS    Possessive ending
18   PRP    Personal pronoun
19   PRP$   Possessive pronoun
20   RB     Adverb
21   RBR    Adverb, comparative
22   RBS    Adverb, superlative
23   RP     Particle
24   SYM    Symbol (mathematical or scientific)
25   TO     to
26   UH     Interjection
27   VB     Verb, base form
28   VBD    Verb, past tense
29   VBG    Verb, gerund/present participle
30   VBN    Verb, past participle
31   VBP    Verb, non-3rd person singular present
32   VBZ    Verb, 3rd person singular present
33   WDT    wh-determiner
34   WP     wh-pronoun
35   WP$    Possessive wh-pronoun
36   WRB    wh-adverb
37   #      Pound sign
38   $      Dollar sign
39   .      Sentence-final punctuation
40   ,      Comma
41   :      Colon, semi-colon
42   (      Left bracket character
43   )      Right bracket character
44   "      Straight double quote
45   `      Left open single quote
46   ``     Left open double quote
47   '      Right close single quote
48   ''     Right close double quote

These categories are much like the ones taught in secondary-school English grammar. Here is a simple example of POS tagging with NLTK:

    import nltk
    from nltk import word_tokenize

    # Assumes the NLTK tokenizer and tagger data have already been fetched with nltk.download(...)
    s = "I was watching TV"
    print(nltk.pos_tag(word_tokenize(s)))
Output:
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
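The code first tokenizes the text, then calls NLTK's pos_tag function to get a list of (word, POS tag) pairs. As the output shows, every word in the sentence receives a sensible tag.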
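To make the tags easier to read, each one can be looked up in the table. The following is only a sketch: the tagged list is copied from the output above so that it runs without NLTK, and `TAG_DESCRIPTIONS` and `describe` are hypothetical names covering just the four tags that appear in this sentence.

```python
# Tagged output copied from the example above, so this runs without NLTK.
tagged = [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]

# Illustrative subset of the Penn Treebank table (not the full tag set).
TAG_DESCRIPTIONS = {
    'PRP': 'Personal pronoun',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund/present participle',
    'NN': 'Noun, singular or mass',
}

def describe(tagged_tokens, descriptions=TAG_DESCRIPTIONS):
    """Return (word, tag, description) triples; unknown tags map to 'unknown'."""
    return [(word, tag, descriptions.get(tag, 'unknown'))
            for word, tag in tagged_tokens]

described = describe(tagged)
for word, tag, meaning in described:
    # Pad the word and tag columns so the descriptions line up.
    print(f"{word:10}{tag:5}{meaning}")
```

A plain dictionary is used here because the tag set is small and fixed; extending it to all 48 entries is a matter of copying the remaining rows from the table.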
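POS tags make many flexible operations possible, such as extracting all the nouns from a text:

    import nltk
    from nltk import word_tokenize

    s = "I was watching TV"
    tagged = nltk.pos_tag(word_tokenize(s))
    # Keep words carrying any noun tag (singular, plural, proper singular, proper plural)
    allnoun = [word for word, pos in tagged if pos in ['NN', 'NNS', 'NNP', 'NNPS']]
    print(allnoun)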
Output:
['TV']
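To find verbs instead, only the tag list needs to change (the variable is renamed here for clarity):

    allverb = [word for word, pos in tagged if pos in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']]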
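Because every noun tag in the table starts with NN and every verb tag starts with VB, filtering by tag prefix is one possible alternative to listing each tag. This is a sketch under that assumption; `words_with_tag_prefix` is a hypothetical helper, and the tagged list is copied from the earlier output so that the snippet runs without NLTK.

```python
# Tagged output copied from the earlier example, so this runs without NLTK.
tagged = [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Collect the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, 'NN')  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, 'VB')  # matches VB, VBD, VBG, VBN, VBP, VBZ

print(nouns)  # ['TV']
print(verbs)  # ['was', 'watching']
```

The prefix approach is shorter, but an explicit tag list is still the better choice when only some forms are wanted, for example past-tense verbs only.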
Reposted from: http://pgxyl.baihongyu.com/