Thesis Information

Chinese Title:

 基于大语言模型的古籍词典智能构建及应用研究    

Name:

 Wu Na (吴娜)

Student ID:

 2022114013    

Confidentiality:

 Public

Thesis Language:

 chi    

Discipline Code:

 120502    

Discipline:

 Management - Library, Information and Archives Management - Information Science

Student Type:

 Master's

Degree:

 Master of Management

University:

 Nanjing Agricultural University

School:

 College of Information Management

Major:

 Library, Information and Archives Management

Research Interests:

 Text Mining, Digital Humanities

First Supervisor:

 Huang Shuiqing (黄水清)

First Supervisor's Institution:

 Nanjing Agricultural University

Completion Date:

 2025-04-08    

Defense Date:

 2025-05-22    

English Title:

 Intelligent Construction and Application Research of Ancient Classics Dictionaries Based on Large Language Models

Chinese Keywords:

 Intelligent Information Processing of Ancient Classics ; Ancient Text Dictionaries ; Automated Compilation ; Large Language Models

English Keywords:

 Intelligent Information Processing of Ancient Classics ; Ancient Text Dictionaries ; Automated Compilation ; Large Language Models (LLMs)

Chinese Abstract:

Ancient classics, important carriers of a Chinese civilization spanning several thousand years, hold rich historical wisdom and cultural essence. As key tools for interpreting ancient texts, ancient-text dictionaries not only provide precise exegesis of the characters, words, terms, and institutional systems found in classical literature, but also offer researchers a systematic body of knowledge, bridging scholarship between past and present. In recent years, following the 18th National Congress of the Communist Party of China and the release of the 14th Five-Year Plan, research on ancient texts has gradually expanded from the development, preservation, and storage of digitized resources to their revitalized use. With advances in text mining, using natural language processing to explore creative ways of processing, organizing, and transforming ancient-text resources has also become a popular research paradigm in the digital humanities. As an important category of ancient-text resources, methods for the automated generation and compilation of ancient-text dictionaries are a key direction at the intersection of digital humanities and computational linguistics.
Driven by the national cultural digitization strategy, the traditional mode of dictionary compilation is undergoing an intelligent transformation: on the one hand, deep-learning-based analysis of ancient texts performs remarkably well at automatically recognizing rare characters and annotating semantic relations; on the other hand, breakthrough progress in large language models brings both new challenges and new opportunities to the intelligent processing of ancient texts. These advanced technologies open new possibilities for dictionary compilation and an unprecedented opportunity for its "intelligent, collaborative, and scenario-based" development.
Accordingly, this thesis centers on the application scenario of compiling dictionaries for individual ancient works. It first comprehensively explores the potential and limitations of large language models in this scenario, then validates an effective path for integrating the domain knowledge of such dictionaries into large language models, and finally designs and builds a collaborative dictionary compilation system as a practical tool. The research mainly includes the following:
(1) Through OCR recognition, data cleaning, and manual proofreading, we constructed corpora covering the various elements of dictionaries for individual ancient works, enriching the digital resources for ancient-text dictionaries and providing data support for their automated construction.
(2) Based on a literature survey, we selected series of large language models, both open-source and proprietary, that currently excel at major text generation tasks. For the dictionary compilation scenario, and drawing on the elements of such dictionaries, we designed comprehensive evaluation tasks and datasets covering zero-shot (0-shot), few-shot, and chain-of-thought (CoT) prompting modes. Using the efficient inference acceleration framework vLLM, we ran evaluation experiments to explore and verify the capabilities and limitations of current large language models in dictionary compilation and their prospects for intelligent ancient-text processing.
(3) Building on the evaluation results, we selected the overall best-performing Qwen2.5 series and constructed an instruction fine-tuning dataset from the dictionary corpus. We ran instruction fine-tuning experiments on Qwen2.5-7B-Instruct and Qwen2.5-7B to examine how effectively instruction tuning fuses dictionary knowledge into general-purpose large language models, analyzed the characteristics of the fine-tuned models' outputs, and compared the base and chat models in this scenario. The aim is to explore automated pathways for dictionary compilation and provide theoretical and practical support for optimizing the application of large language models in this field.
(4) Finally, to address the low efficiency, long cycles, and poor collaboration of dictionary compilation, we surveyed existing compilation-assistance systems and proposed a collaborative compilation system that integrates large language models, with online collaborative editing, LLM assistance, knowledge base retrieval, and expert knowledge entry. The front end is implemented with Vue.js, the back-end services with Django, and real-time collaborative document editing is provided by OnlyOffice. This offers an innovative solution and practical support for intelligent, automated ancient-text dictionary compilation.

English Abstract:

Ancient Classics, as vital carriers of Chinese civilization spanning thousands of years, contain a wealth of historical wisdom and cultural essence. Ancient lexicons, as key tools for interpreting these texts, not only provide precise explanations of words, terms, and institutional systems in classical literature but also offer researchers a systematic knowledge framework, bridging academic communication between ancient and modern times. In recent years, following the convening of the 18th National Congress of the Communist Party of China and the release of the 14th Five-Year Plan, research on ancient texts has gradually expanded from the development, preservation, and storage of digital resources to their dynamic utilization. With advancements in text mining technologies, the application of natural language processing (NLP) to explore creative methods for processing, organizing, and transforming ancient texts has become a prominent research paradigm in the digital humanities. As a major category of ancient textual resources, the automated generation and compilation of ancient lexicons represent a key interdisciplinary focus in digital humanities and computational linguistics.
Driven by the national cultural digitization strategy, traditional lexicon compilation is undergoing an intelligent transformation. On one hand, deep learning-based text analysis technologies have demonstrated remarkable performance in automatically recognizing rare characters and annotating semantic relationships. On the other hand, the breakthrough progress in large language models (LLMs) presents both new challenges and opportunities for the intelligent processing of ancient texts. These advanced technologies have opened new possibilities for lexicon compilation, offering unprecedented opportunities for its "intelligent, collaborative, and scenario-based" development.
Accordingly, this paper focuses on the application scenario of compiling specialized ancient lexicons. First, it comprehensively explores the potential and limitations of LLMs in this context. Next, it further validates effective methods for integrating domain-specific knowledge of ancient lexicons into LLMs. Finally, it designs and develops a collaborative lexicon compilation system as a practical tool. The research primarily includes the following components:
(1) Through OCR recognition, data cleaning, and manual verification, we built a multi-element corpus for specialized ancient lexicons, enriching the digital resources for ancient lexicons and providing data support for automated compilation.
(2) Based on a literature review, we selected state-of-the-art LLMs (both open-source and proprietary) that excel in text generation tasks. For the ancient lexicon compilation scenario, we designed comprehensive evaluation tasks and datasets, including zero-shot (0-shot), few-shot, and chain-of-thought (CoT) evaluation modes. Using the efficient inference framework vLLM, we conducted experiments to assess the capabilities and limitations of current LLMs in this domain and explore their potential in intelligent ancient text processing.
(3) Building on the evaluation results, we selected the best-performing Qwen2.5 series models and constructed an instruction fine-tuning dataset using the ancient lexicon corpus. We conducted fine-tuning experiments on Qwen2.5-7B-Instruct and Qwen2.5-7B to explore the effectiveness of instruction tuning in integrating ancient lexicon knowledge with general-purpose LLMs. We analyzed the output characteristics of the fine-tuned models and compared the differences between base and instruction-tuned models in this scenario. This aims to explore automated pathways for lexicon compilation and provide theoretical and practical insights for optimizing LLM applications in this field.
(4) To address inefficiency, long cycles, and poor collaboration in traditional lexicon compilation, we surveyed existing auxiliary systems and proposed an LLM-enhanced collaborative compilation system. This system enables online collaborative editing, AI-assisted suggestions, knowledge base retrieval, and expert input. The front-end is built with Vue.js, the back-end with Django, and real-time collaborative editing is powered by OnlyOffice. This provides an innovative solution and practical foundation for intelligent and automated ancient lexicon compilation.
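The data cleaning in step (1) can be illustrated with a minimal rule-based sketch. The thesis does not specify its concrete cleaning rules, so the noise glyphs and punctuation mappings below are illustrative assumptions:

```python
import re

def clean_ocr_line(line: str) -> str:
    """Normalize one OCR-recognized line from a scanned dictionary page.

    The rules here (noise glyphs, half-width punctuation, column-break
    whitespace) are assumed examples, not the thesis's actual rule set.
    """
    # Drop common OCR noise glyphs such as unrecognized-character boxes
    line = re.sub(r"[□■○●]+", "", line)
    # Map half-width punctuation back to the full-width forms expected
    # in Chinese dictionary text
    half_to_full = {",": ",", ";": ";", ":": ":", "?": "?", "!": "!"}
    line = "".join(half_to_full.get(ch, ch) for ch in line)
    # Remove whitespace introduced by column and line breaks in the scan
    return re.sub(r"\s+", "", line)

entry = clean_ocr_line("天 命 : 上 天 的 意 志□□")  # "天命:上天的意志"
```

Lines cleaned this way would still pass through the manual proofreading stage described above before entering the corpus.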
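The three prompting modes evaluated in step (2) can be sketched as prompt templates. The exact instructions used in the thesis are not given, so the template wording and the definition-writing task below are assumptions:

```python
# Hypothetical templates for a definition-writing task; wording is assumed.
ZERO_SHOT = "请为下列古籍词条撰写释义。\n词条:{query}\n释义:"
COT_PREFIX = "请先逐步分析该词在书证中的用法,再给出释义。\n"

def build_prompt(query: str, mode: str = "0-shot", examples=None) -> str:
    """Assemble an evaluation prompt in 0-shot, few-shot, or CoT mode."""
    if mode == "0-shot":
        return ZERO_SHOT.format(query=query)
    if mode == "few-shot":
        # Prepend worked (headword, gloss) demonstrations before the query
        shots = "\n".join(f"词条:{w}\n释义:{g}" for w, g in (examples or []))
        return shots + "\n" + ZERO_SHOT.format(query=query)
    if mode == "cot":
        # Ask the model to reason step by step before answering
        return COT_PREFIX + ZERO_SHOT.format(query=query)
    raise ValueError(f"unknown mode: {mode}")
```

In an evaluation run, batches of such prompts would be passed to vLLM's `LLM.generate` together with a `SamplingParams` object (e.g. temperature 0 for reproducible scoring).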
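For the fine-tuning data in step (3), dictionary entries can be pictured as recast into instruction/input/output records, a common convention for instruction tuning; the thesis does not state its exact schema, so the field layout and wording below are assumptions:

```python
import json

def to_sft_record(headword: str, citation: str, gloss: str) -> dict:
    """Recast one dictionary entry as an instruction fine-tuning record.

    Alpaca-style instruction/input/output fields are an assumed layout,
    not the schema actually used in the thesis.
    """
    return {
        "instruction": "根据书证为下列古籍词条撰写释义。",
        "input": f"词条:{headword}\n书证:{citation}",
        "output": gloss,
    }

record = to_sft_record("社稷", "社稷无陨,多矣。", "土神和谷神,借指国家。")
jsonl_line = json.dumps(record, ensure_ascii=False)  # one line of the dataset
```

A file of such JSONL records is the usual input format for instruction fine-tuning pipelines built on models like Qwen2.5-7B and Qwen2.5-7B-Instruct.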

CLC Number:

 G35    

Open Access Date:

 2025-06-17    
