Thesis information

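Chinese title:	汉语文本抄袭识别系统研究 (Research on a Plagiarism Identification System for Chinese Text)
Name:	Cao Yan (曹艳)
Student ID:	2005114012
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	120502
Discipline:	Information Science
Student type:	Master's
Degree:	Master of Management
Institution:	Nanjing Agricultural University
School:	College of Information Science and Technology
Major:	Information Science
Research direction:	Text data mining
First supervisor:	Niu Youqi (牛又奇)
First supervisor's affiliation:	College of Information Science and Technology, Nanjing Agricultural University
Completion date:	2008-06-18
Defense date:	2008-06-18
Foreign-language title:	RESEARCH ON PLAGIARISM-IDENTIFICATION SYSTEM FOR CHINESE DOCUMENTS
Chinese keywords:	中文文档 (Chinese documents) ; 抄袭识别 (plagiarism identification) ; 相似度算法 (similarity algorithms) ; 相似例证 (similarity evidence)
Foreign-language keywords:	Chinese documents ; plagiarism identification ; similarity methods ; similar illustration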
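Chinese abstract:	︿Plagiarism identification is one application of document copy detection technology, and an important measure for improving the quality of academic papers and cleaning up the academic environment. It determines whether a given document has plagiarized content from one or more other documents, covering complete, majority, and minor plagiarism. This thesis first explains the significance of plagiarism identification for Chinese text and the basic principles of document copy detection, and briefly introduces the functions and characteristics of several typical copy detection prototype systems, plagiarism identification tools, and online services. Second, it summarizes Chinese word segmentation methods and several existing segmentation systems as a basis for the later work. Third, it introduces and analyzes existing text similarity algorithms and their strengths and weaknesses, and on that basis proposes a multi-level feature fusion similarity algorithm, which is used to compare documents and so identify, among existing documents, those similar to the document under test. The system first computes the relatedness between documents using keyword similarity, classification number comparison, and character-matching-based similarity of titles and abstracts, thereby finding the documents in the repository that are related to the document under test. The automatically segmented body text is then filtered for stop words and "restructured" (that is, synonyms are converted); the resulting set of meaningful content-word nodes is treated as the term set of the original document, and a set-model similarity method computes the similarity between the body of the document under test and the bodies of the related documents, thereby determining the similar documents. Next, based on the idea of common substrings, the thesis constructs a non-repeating longest common substring algorithm and a segmentation-based non-repeating longest common substring algorithm; these two text comparison algorithms are each used to extract the "common content" between the document under test and a similar document and to generate a similarity report, which provides a reasonable explanation, that is, evidence, for the plagiarism judgment. The thesis then describes how the synonym table, the classification table, and the other word lists are built. Drawing on a study of the functions and characteristics of existing plagiarism identification tools, it resolves difficult problems such as 1:n similarity measurement between Chinese documents and locating similar content, and designs and implements a prototype Chinese text plagiarism identification system aimed at academic journal papers. Finally, it explains the selection of experimental data and the setting of thresholds, uses test documents to evaluate the proposed multi-level feature fusion similarity algorithm and the two text comparison methods for generating similarity reports, and summarizes the author's main work, the innovations of the thesis, and ideas for further work.﹀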
Foreign-language abstract:	︿ Plagiarism identification is one type of copy detection technology and a powerful measure for improving the quality of academic papers and encouraging academic honesty. Plagiarism identification for documents judges whether a given document plagiarizes the contents of other documents in the database, whether by copying a document in full, duplicating most of its contents, or copying only part of it. First, this thesis introduces the significance of plagiarism identification for Chinese documents and the basic theory of the technology, and analyzes the functions and characteristics of current copy detection systems, tools, and websites. Second, it summarizes methods of automatic Chinese word segmentation and several current segmentation systems as the basis of plagiarism identification. Third, it introduces and analyzes a range of similarity methods and presents a new similarity method that integrates multiple features, which is used to find similar documents. The system uses keyword similarity, classification similarity, title similarity, and abstract similarity to identify related documents; it then restructures each document from its notional words, calculates text similarity based on a token model, and determines the similar documents. Fourth, building on the idea of common substrings, it presents a non-repeating longest common substring method and a segmentation-based non-repeating longest common substring method. These two methods are used to find the common contents of the compared documents and to create a similarity report. The thesis also describes how the dictionaries, such as the thesaurus, the classification table, and the stop-word list, are constructed. It addresses the difficult problems of measuring similarity and locating common contents for Chinese documents. 
Finally, based on this research, a prototype plagiarism identification system for Chinese documents is designed and implemented using an object-oriented method. The system can find overlaps among documents. The thesis then explains how the experimental data were selected and how the parameter values were determined, and evaluates the performance of the system. The last chapter summarizes the thesis and looks ahead, covering the author's work, the innovations of the thesis, the remaining deficiencies of the system, and possible improvements. ﹀
Chinese Library Classification number:	TP393
Holdings number:	2005114012
Release date:	2020-06-30
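
Illustrative sketch (not taken from the thesis). The abstract describes two steps that a short example may make more concrete: a set-model similarity computed after stop-word filtering and synonym conversion, and the extraction of non-repeating common substrings as evidence for a similarity report. The record does not give the thesis's actual formulas, so everything below is an assumption: the Jaccard coefficient stands in for the unspecified set-model measure, a greedy longest-match-then-recurse strategy stands in for the "non-repeating longest common substring" algorithm, and the names `STOP_WORDS`, `SYNONYMS`, `normalize`, `set_similarity`, and `common_segments` are hypothetical. English tokens replace the output of a Chinese word segmenter.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy word lists; the thesis builds its own stop-word and synonym tables.
STOP_WORDS: Set[str] = {"the", "of", "a", "is"}
SYNONYMS: Dict[str, str] = {"copying": "plagiarism"}


def normalize(tokens: List[str]) -> Set[str]:
    """Drop stop words and map each remaining token to its canonical synonym."""
    return {SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS}


def set_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard coefficient of the two normalized term sets (an assumed measure)."""
    sa, sb = normalize(a), normalize(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def longest_common_substring(x: Sequence, y: Sequence) -> Tuple[int, int, int]:
    """Return (start_in_x, start_in_y, length) of one longest common contiguous run."""
    best = (0, 0, 0)
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the run that ended at the previous element of both inputs.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def common_segments(x: Sequence, y: Sequence, min_len: int = 2) -> list:
    """Greedily collect non-overlapping common runs, taking the longest first.

    Recursing only on the parts to the left and to the right of each match
    means no run is reported twice, but reordered passages are missed.
    """
    i, j, n = longest_common_substring(x, y)
    if n < min_len:
        return []
    return (common_segments(x[:i], y[:j], min_len)
            + [x[i:i + n]]
            + common_segments(x[i + n:], y[j + n:], min_len))
```

Because `common_segments` only indexes and slices its inputs, it accepts either a string (character-level comparison) or a list of tokens (comparison after segmentation), loosely mirroring the two variants named in the abstract. For example, `common_segments("abcdefg", "xxabcyydefz")` yields the two shared runs `"abc"` and `"def"`.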
