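Chinese title: | 词性在汉语科技文献检索中的作用与影响 (The Role and Influence of Part of Speech in Chinese Scientific and Technical Literature Retrieval) |
Name: | |
Student ID: | 2005114008 |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 1205 |
Discipline name: | Library, Information and Archives Management |
Student type: | Master's |
Degree: | Master of Management |
Institution: | Nanjing Agricultural University |
Department: | |
Major: | |
Research direction: | Information retrieval technology |
First supervisor name: | |
First supervisor affiliation: | |
Completion date: | 2008-06-11 |
Defense date: | 2008-06-11 |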
Foreign-language title: | Part-of-speech effect and affect in search that in Chinese literature of science and technology |
Chinese keywords: | |
Foreign-language keywords: | part-of-speech tagging ; Chinese document retrieval ; natural language questions ; retrieval evaluation |
Chinese abstract: |
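Part-of-speech (POS) tagging is a relatively mature technique in the lexical-analysis stage of natural language processing, and natural language processing in turn plays a pivotal role in information retrieval. The use of POS in retrieving foreign-language literature has already been studied to some extent; that research indicates that POS has some effect on foreign-language retrieval, but not a large one. This study examines the role and influence of POS in retrieving Chinese scientific and technical literature, and attempts to use evaluation data to show the degree of influence and the size of the effect.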
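During the study, an animal husbandry and veterinary corpus and vocabulary list were built. POS tagging combined three components: ICTCLAS, the Chinese lexical analysis system based on a multi-layer hidden Markov model developed by the Institute of Computing Technology, Chinese Academy of Sciences; the unknown-word function of the CARMM system designed by Cheng Chong (程冲), a graduate student at Nanjing Agricultural University; and the self-built animal husbandry and veterinary vocabulary list. The tag set is the Chinese text POS tagging tag set (Peking University version). Two ways of extracting search terms and several retrieval models were used. The two extraction methods were extraction that retains 14 POS dimensions and extraction with manual assistance. The retrieval models were the traditional Boolean logic model, a "partial-match" Boolean logic model, and the vector space model. In the vector space model, because the choice of threshold has inherent shortcomings, the study used two thresholds, 2% and 5%, which yielded multiple sets of evaluation data. From these data, evaluation results were obtained for retrieval with POS and retrieval without POS. The results are presented in four ways: summary tables (giving R and P for each search query, plus the averages Rav and Pav), R and P line charts, histograms of R and P differences, and tables of mean R and P differences. The results show that, for recall, retrieval without POS is more effective than retrieval with POS. For precision, retrieval with POS is slightly better in every case except the "partial-match" Boolean model, where retrieval without POS has the higher precision. Overall, retrieval with POS showed no great advantage. The results also indicate that, when POS is used in retrieval, the choice of retrieval model is itself a factor constraining the final outcome.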
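The main contributions of this study fall into four areas. First, POS is applied to Chinese literature retrieval for the first time. Second, the POS tags of document words and query terms are reduced to 14 dimensions, which improves retrieval efficiency. Third, a "partial-match Boolean logic model" usable for POS-based retrieval is designed. Fourth, experimental evaluation data are used to establish how strongly POS-based retrieval affects Chinese literature retrieval.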
|
Foreign-language abstract: |
Part-of-speech (POS) tagging is a relatively mature technique in the lexical analysis stage of natural language processing, and natural language processing plays a decisive role in information retrieval. The use of POS in retrieving foreign-language literature has been studied to some extent; that research shows some effect, but little impact. The purpose of this thesis is to study the use of POS in searching Chinese scientific and technical literature, and to use evaluation data to show the degree of its impact and the size of its role.
Over the course of the study, an animal husbandry and veterinary corpus and vocabulary were established. POS tagging combined ICTCLAS, the Chinese lexical analysis system based on a multi-layer hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences; the unknown-word function of the CARMM system designed by Nanjing Agricultural University graduate student Chengchong; and the self-built animal husbandry and veterinary vocabulary table. The POS tag set is the Chinese text POS tagging tag set (Peking University edition). Two ways of extracting search terms and several retrieval models were used. The two extraction methods were extraction that retains 14 POS dimensions and extraction with manual participation. The retrieval models were the traditional Boolean logic retrieval model, a "partial-match" Boolean logic model, and the vector space model. In the vector space model, because the threshold value has its own shortcomings, this study used two thresholds, 2% and 5%, and obtained several sets of evaluation data. From these data, results were obtained for retrieval with POS and retrieval without POS. The results are evaluated in four ways: summary tables (giving R and P for each search query, plus the averages Rav and Pav), R and P line charts, histograms of R and P differences, and tables of mean R and P differences. The final results show that, for recall ratio, retrieval without POS is more effective than retrieval with POS. For precision ratio, retrieval with POS is slightly better than retrieval without POS in every case except the "partial-match" Boolean logic model, where retrieval without POS has the higher precision.
Overall, retrieval with POS did not show much superiority. Moreover, the results indicate that, when POS is used in retrieval, the choice of retrieval model is also a factor that constrains the final result.
The main contributions of this study fall into four areas. First, POS is used for the first time in Chinese literature retrieval. Prior research has examined the effect of POS on information retrieval performance, but on the basis of non-Chinese literature; this thesis extends that work to another language, making the study of POS in retrieval more comprehensive and filling a gap in POS-based Chinese literature retrieval. Second, the POS of document words and query terms was reduced to 14 dimensions, improving retrieval efficiency. Third, a "partial-match Boolean logic model" that can be used for POS-based retrieval was designed. Fourth, evaluation data were used to determine the degree to which POS-based retrieval affects Chinese document retrieval.
|
Chinese Library Classification number: | G252.7 |
Collection number: | 2005114008 |
Open-access date: | 2020-06-30 |
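
The comparison described in the abstract (retrieval with and without POS, scored by R and P per query and averaged as Rav and Pav) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy documents, the simplified tags `n` and `v`, and the relevance judgments are hypothetical, and the strict Boolean AND match stands in for the traditional Boolean model only. It is not the thesis's partial-match or vector space model, and it does not reproduce the thesis's data.

```python
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tags "n"/"v" are illustrative only


def retrieve(docs: Dict[str, List[Token]], query: List[Token], use_pos: bool) -> Set[str]:
    """Strict Boolean AND: return ids of documents that contain every query term.

    With use_pos=True a term matches only if both word and tag agree;
    with use_pos=False only the word has to agree.
    """
    hits = set()
    for doc_id, tokens in docs.items():
        if use_pos:
            index = set(tokens)
            wanted = set(query)
        else:
            index = {word for word, _ in tokens}
            wanted = {word for word, _ in query}
        if wanted <= index:
            hits.add(doc_id)
    return hits


def recall_precision(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """R = relevant retrieved / all relevant; P = relevant retrieved / all retrieved."""
    hit = len(retrieved & relevant)
    r = hit / len(relevant) if relevant else 0.0
    p = hit / len(retrieved) if retrieved else 0.0
    return r, p


def average(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Rav and Pav: arithmetic means of R and P over all queries."""
    pairs = list(pairs)
    n = len(pairs)
    return sum(r for r, _ in pairs) / n, sum(p for _, p in pairs) / n


# Hypothetical toy collection: "feed" appears as a noun in d1/d3 and as a verb in d2.
DOCS = {
    "d1": [("pig", "n"), ("feed", "n")],
    "d2": [("pig", "n"), ("feed", "v")],
    "d3": [("cattle", "n"), ("feed", "n")],
}
# Each query is paired with a hypothetical set of relevant document ids.
QUERIES = [
    ([("pig", "n"), ("feed", "n")], {"d1"}),
    ([("feed", "n")], {"d1", "d2", "d3"}),
]


def evaluate(use_pos: bool) -> Tuple[float, float]:
    """Run every toy query and return (Rav, Pav) for the chosen matching mode."""
    return average(
        recall_precision(retrieve(DOCS, query, use_pos), relevant)
        for query, relevant in QUERIES
    )


if __name__ == "__main__":
    print("with POS    (Rav, Pav):", evaluate(use_pos=True))
    print("without POS (Rav, Pav):", evaluate(use_pos=False))
```

In this toy setup, requiring the tag to match can only shrink the retrieved set, so POS matching cannot raise recall and may raise precision. The numbers say nothing about the size of either effect in the thesis's experiments.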