Thesis Information

Chinese title:

 基于深度学习的典籍人称代词指代消解研究 (Research on Anaphora Resolution of Personal Pronouns in Classical Texts Based on Deep Learning)

Name:

 Chen Shi (陈诗)

Student ID:

 2018814048

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 125500

Discipline name:

 Management - Library and Information Science

Student type:

 Master's

Degree:

 Master of Library and Information Science

University:

 Nanjing Agricultural University

School:

 College of Information Science and Technology

Major:

 Library and Information Science (professional degree)

Research direction:

 Natural language processing

First supervisor:

 Huang Shuiqing (黄水清)

First supervisor's institution:

 Nanjing Agricultural University

Second supervisor:

 Wang Fei (王飞)

Completion date:

 2020-04-30

Defense date:

 2020-05-30

English title:

 Research on the Anaphora Resolution of Personal Pronouns in Classical Books Based on Deep Learning

Chinese keywords:

 指代消解 (anaphora resolution); 人称代词 (personal pronouns); 典籍 (classical texts); 古代汉语 (ancient Chinese)

English keywords:

 Anaphora resolution; Personal pronouns; Classics; Ancient Chinese

Chinese abstract:

The long history of Chinese culture has left behind a vast body of precious classical Chinese texts. These texts carry rich historical information and record the remarkable thought of earlier generations; they form the foundation of the national culture and are essential to the promotion and transmission of traditional culture. As the information age develops, applying classical-text information processing techniques to the deep mining and knowledge discovery of ancient Chinese classics, the principal carriers of the national culture, is of great significance: it supports both the transmission of traditional culture and the strengthening of national cultural soft power.

Personal pronouns are pronouns that refer to person entities in natural language. A complete referential relation consists of the anaphor, the pronoun that points back, and the content it refers to, the antecedent. Personal pronouns in classical Chinese serve the same function as in modern Chinese, but because the two differ considerably in grammar and vocabulary, classical personal pronouns also differ in inventory, in singular and plural usage, and in part-of-speech ambiguity. Correctly identifying personal pronouns in classical Chinese is therefore important for the deep mining of classical texts, and recognition performance directly affects resolution performance. This thesis studies intra-sentence anaphora resolution of personal pronouns in classical Chinese texts, comparing traditional machine learning and deep learning methods for both pronoun recognition and anaphora resolution. The main work comprises three parts:

  1. Constructing a personal-pronoun anaphora resolution corpus. Using a digitized Records of the Grand Historian (Shiji) annotated with the Nanjing Agricultural University part-of-speech tag set, the thesis analyzes the characteristics of classical Chinese personal pronouns and their referential relations, revises the tag set where it falls short, formulates an annotation specification for anaphora resolution, and builds the corpus needed for the experiments. The corpus is based on classical Chinese and is rich in person information and intra-sentence referential relations, meeting the needs of the experiments.
  2. Recognizing personal pronouns in classical texts with a CRF model (traditional machine learning) and a BERT model (deep learning); this experiment lays the groundwork for the deep-learning-based anaphora resolution in later chapters. The CRF framework is introduced first, then feature selection and feature templates, exploring different segmentation schemes and the addition of part-of-speech features. The BERT model is then introduced and trained on character-unit corpora without part-of-speech tags, and all results are compared. The experiments show that the CRF model with part-of-speech features on word-unit corpora performs best, with an average F-measure of 91.83%. On the same character-unit corpora without part-of-speech tags, BERT outperforms CRF, reaching an average F-measure of 90.85%, and is likewise suited to pronoun recognition on small-scale corpora.
  3. Resolving personal-pronoun anaphora with Bi-LSTM-CRF and BERT models. In the Bi-LSTM-CRF experiments, word embeddings are used to capture deep latent semantic features; four experiments form three comparisons: (1) word-unit vs. character-unit corpora without part-of-speech tags; (2) adding an attention mechanism to the word-unit corpus without part-of-speech tags; (3) adding part-of-speech features to the word-unit corpus and comparing against the earlier word-unit experiment, to probe the effect of part-of-speech features on resolution. The results show that, without part-of-speech tags, word-unit corpora outperform character-unit corpora; adding attention improves resolution; and adding part-of-speech features yields the largest improvement. The BERT model is then tuned to its best parameters for the training corpus and, under ten-fold cross-validation, reaches an average F-measure of 82.43%. Finally, the results of all experiments are analyzed visually; the best configuration is the Bi-LSTM-CRF model with part-of-speech features on word-unit corpora, with an average F-measure of 84.00%.
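The CRF feature templates described in part 2 (context words plus part-of-speech features) can be sketched roughly as follows; the window size, feature names, words, and tags here are illustrative assumptions, not the thesis's actual template.

```python
# A minimal sketch of CRF-style feature extraction for classical Chinese
# pronoun recognition, assuming a word-segmented, POS-tagged sentence.
# Window of 1, hypothetical tag "r" for pronouns: all illustrative.

def word_features(sent, i):
    """Features for position i of sent, a list of (word, pos) pairs."""
    word, pos = sent[i]
    feats = {
        "bias": 1.0,
        "word": word,
        "pos": pos,                     # part-of-speech feature
    }
    if i > 0:                           # left-context window of 1
        prev_word, prev_pos = sent[i - 1]
        feats["-1:word"] = prev_word
        feats["-1:pos"] = prev_pos
    else:
        feats["BOS"] = True             # beginning of sentence
    if i < len(sent) - 1:               # right-context window of 1
        next_word, next_pos = sent[i + 1]
        feats["+1:word"] = next_word
        feats["+1:pos"] = next_pos
    else:
        feats["EOS"] = True             # end of sentence
    return feats

# "吾" (I) tagged as a personal pronoun (illustrative segmentation and tags).
sent = [("吾", "r"), ("起兵", "v"), ("八岁", "t")]
feats = word_features(sent, 0)
print(feats["word"], feats["pos"], feats.get("BOS"))
```

Feature dictionaries of this shape are what a linear-chain CRF consumes per token; dropping the `pos` entries gives the "no part-of-speech" condition compared in the experiments.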
English abstract:

The long history of Chinese culture has produced a wealth of precious classical texts. These texts contain rich historical information and record the outstanding thought of earlier generations; they lay the foundation of the national culture and are crucial to the promotion and inheritance of traditional culture. With the development of the information age, applying information processing technology to the deep exploration and knowledge discovery of ancient Chinese classics, an important carrier of the national culture, is of great significance: it is conducive both to the development and inheritance of traditional culture and to the promotion of national cultural soft power.

A personal pronoun is a pronoun that refers to a person entity in natural language; a complete referential relation consists of the anaphor, the pronoun that points back, and the content it refers to, the antecedent. Although personal pronouns in ancient Chinese classics have the same function as in modern Chinese, the two languages differ considerably in grammar and vocabulary, and their personal pronouns differ in inventory, number, and part-of-speech ambiguity. Correct identification of personal pronouns in ancient Chinese therefore plays an important role in the in-depth study of ancient Chinese classics, and recognition performance directly affects resolution performance. This thesis addresses intra-sentence anaphora resolution of personal pronouns in ancient Chinese classics, comparing traditional machine learning and deep learning methods for both pronoun recognition and anaphora resolution. The work focuses on three points:

  • Constructing a personal-pronoun anaphora resolution corpus. Taking a digitized Records of the Grand Historian as the corpus, tagged with the Nanjing Agricultural University part-of-speech tag set, the thesis analyzes the characteristics of ancient Chinese personal pronouns and their referential relations, revises the tag set where it falls short, formulates an annotation specification for anaphora resolution, and builds the corpus required by the experiments. The corpus is based on ancient Chinese and is rich in person information and intra-sentence referential relations, meeting the needs of the experiments.
  • Recognizing personal pronouns in classical texts with a CRF model based on traditional machine learning and a BERT model based on deep learning; this experiment lays the foundation for the deep-learning-based anaphora resolution in the following chapters. The CRF framework is introduced first, then feature selection and feature templates. The BERT model is introduced next and trained on character-unit corpora without part-of-speech tags. Finally, the experimental results are compared and evaluated. The results show that the CRF model with part-of-speech features on word-unit corpora performs best, with an average F-measure of 91.83%. On the same character-unit corpora without part-of-speech tags, the BERT model outperforms the CRF model, with an average F-measure of 90.85%, and is also suitable for personal pronoun recognition on small-scale corpora.
  • Resolving personal-pronoun anaphora with Bi-LSTM-CRF and BERT models. In the Bi-LSTM-CRF experiments, word embeddings are combined to obtain deep latent semantic features; four experiments form three comparisons: (1) word-unit vs. character-unit corpora without part-of-speech tags; (2) adding an attention mechanism to the word-unit corpus without part-of-speech tags; (3) adding part-of-speech features to the word-unit corpus, compared against the earlier word-unit experiment, to explore the effect of part-of-speech features on resolution. The results show that, without part-of-speech tags, word-unit corpora outperform character-unit corpora; the attention mechanism improves resolution; and part-of-speech features bring the largest improvement. The BERT model is then tuned to its best parameters for the training corpus; under ten-fold cross-validation its average F-measure is 82.43%. Finally, visual analysis of all resolution results shows that the Bi-LSTM-CRF model with part-of-speech features on word-unit corpora performs best, with an average F-measure of 84.00%.
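The ten-fold cross-validation evaluation used above can be sketched as averaging per-fold F-measures; the fold counts below are made-up illustrations, not the thesis's reported results.

```python
# A minimal sketch of averaging the F-measure over ten cross-validation
# folds. The (tp, fp, fn) counts are hypothetical, chosen only to show
# the computation; they are not the thesis's actual figures.

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one fold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def ten_fold_average(fold_counts):
    """Average F over folds, each fold given as (tp, fp, fn)."""
    scores = [f_measure(*counts) for counts in fold_counts]
    return sum(scores) / len(scores)

# Ten hypothetical folds with identical counts, for clarity.
folds = [(80, 15, 20)] * 10
avg_f = ten_fold_average(folds)
print(round(avg_f * 100, 2))  # average F-measure as a percentage: 82.05
```

Averaging per-fold F-measures (macro averaging over folds) is one common convention; pooling the counts across folds first (micro averaging) is an alternative, and the two can differ when fold sizes vary.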

CLC number:

 G25

Open access date:

 2020-06-11
