
Domain specific Chinese word segmentation

Computer Engineering and Applications 计算机工程与应用

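2018, 54(17)

1 Introduction

With the rapid development of information technology, text data in specialized domains is growing sharply. Using natural language processing to analyze such text, solve practical problems, and improve efficiency has become an active research topic. Chinese word segmentation is a foundational task in Chinese natural language processing, and its output directly affects the performance of downstream tasks such as information retrieval, text classification, and information extraction. Common segmentation methods fall into two categories: dictionary-based and statistics-based [1]. Because statistics-based methods handle ambiguous segmentation and out-of-vocabulary (OOV) word recognition considerably better than dictionary-based methods, they have become the mainstream approach in recent years [2]. Commonly used statistical models include the hidden Markov model [3], the conditional random field (CRF) model [4-5], the maximum entropy model [6], and neural network models [7]. However, when the test corpus and the training corpus come from different domains, segmentation accuracy and OOV recognition degrade substantially [8-9]. Applying statistics-based methods to specialized-domain text therefore requires annotated training corpora for the target domain. Annotating such corpora is costly in labor and resources, and so far only a few specialized domains have annotated corpora available.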
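To make the dictionary-based family concrete, the following is a minimal sketch of forward maximum matching, one widely used dictionary-based technique. It is an illustration only and is not taken from the paper: the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions. The second call shows how words missing from the dictionary fall apart into single characters, which is the OOV weakness described above.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary-based segmentation (illustrative sketch).

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary with a few construction-law terms.
toy_dictionary = {"建设", "工程", "合同", "建设工程", "纠纷"}

in_vocab = forward_max_match("建设工程合同纠纷", toy_dictionary)
out_of_vocab = forward_max_match("发包人违约", toy_dictionary)
print(in_vocab)      # every word is found in the dictionary
print(out_of_vocab)  # no dictionary hit: the text degrades to single characters
```

A statistical model avoids this hard failure on unseen words, but only when its training corpus resembles the target domain, which is the trade-off the paragraph above describes.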

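Cross-domain word segmentation has gradually attracted researchers' attention. Zhang Meishan et al. [9] added dictionary-related feature functions to a CRF Chinese word segmentation model and used a general dictionary and a domain dictionary during CRF decoding to …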

CHENG Yusi (成于思) 1, SHI Yuntao (施云涛) 2

1. School of Civil Engineering, Southeast University, Nanjing 210096, China

2. Nanjing Branch Network Department, China Mobile Communications Group, Nanjing 210019, China

CHENG Yusi, SHI Yuntao. Domain specific Chinese word segmentation. Computer Engineering and Applications, 2018, 54(17): 30-34.

Abstract: The performance of statistical methods for Chinese word segmentation is limited by the lack of domain-specific training corpora, and dictionary-based methods are affected by unknown words and segmentation ambiguities. To achieve domain adaptation, an approach combining statistical methods with a domain dictionary is developed. The approach first builds a high-quality domain dictionary and uses a statistical method to obtain preliminary results. Then, an algorithm for eliminating ambiguity is designed based on rules and Chinese character subsets with defined properties. Experimental results on a construction law domain corpus show that the precision, recall, and F-measure reach 92.08%, 94.26%, and 93.16%, respectively. The approach can be combined with new word detection to improve the handling of unknown words.

Key words: Chinese word segmentation; domain specific; ambiguity resolution; domain dictionary; construction law

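Abstract (Chinese version): In specialized-domain word segmentation, the performance of statistics-based methods is limited by the lack of annotated domain corpora, while dictionary-based methods still need improvement in handling new words and ambiguous words. To address the particular requirements of specialized-domain segmentation, a method combining statistics and a dictionary is proposed; the domain dictionary construction process is refined, and a second-pass ambiguity resolution method based on rules and character tables is designed. Segmentation experiments are conducted on a construction law corpus. The results show a precision of 92.08%, a recall of 94.26%, and an F-value of 93.16% in the construction law domain. The method can also be combined with techniques such as new word discovery to improve the handling of out-of-vocabulary words.

Keywords (Chinese version): Chinese word segmentation; specialized domain; ambiguity resolution; domain dictionary; construction law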
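The reported figures can be related to one another with the usual word-level definitions of precision, recall, and F-measure. The sketch below assumes those standard definitions (a predicted word counts as correct only if both of its boundaries match the gold segmentation); the text here does not spell out its exact scoring procedure, and the helper names `to_spans`, `prf`, and `f_measure` are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def prf(gold_words, pred_words):
    """Word-level precision, recall and F-measure for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return precision, recall, f_measure(precision, recall)


# Toy example: the prediction over-splits one gold word.
p, r, f = prf(["建设工程", "合同", "纠纷"], ["建设", "工程", "合同", "纠纷"])
print(p, r, f)

# Consistency check on the reported values (in percent).
reported_f = round(f_measure(92.08, 94.26), 2)
print(reported_f)  # 93.16
```

Under these definitions, the reported precision of 92.08% and recall of 94.26% give an F-measure of 93.16%, which agrees with the value stated in the abstract.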

Document code: A    CLC number: TP30    doi: 10.3778/j.issn.1002-8331.1806-0117

Funding: Young Scientists Fund of the National Natural Science Foundation of China (No.71601047); China Postdoctoral Science Foundation (No.2015M581706).

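About the authors: CHENG Yusi (born 1983), female, Ph.D., lecturer; research interests: text mining and construction law; E-mail: xchengyusi@…; SHI Yuntao (born 1985), male, senior engineer; research interest: natural language processing.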

Received: 2018-06-11    Revised: 2018-08-15    Article ID: 1002-8331(2018)17-0030-05
