  • Open Access
  • Article

Domain-Specific Fine-tuning of Large Language Models and Intelligent Question-Answering System for Industrial Catalysis

  • Shican Wu 1,2,†,   
  • Xin Chang 1,2,†,   
  • Xiao Ma 1,2,   
  • Xiaoyun Lin 1,2,   
  • Ran Zhao 1,2,   
  • Zhi-Jian Zhao 1,2,*

Received: 30 Dec 2025 | Revised: 23 Feb 2026 | Accepted: 05 Mar 2026 | Published: 16 Mar 2026

Abstract

Industrial catalysis, as a core field of chemical engineering, is characterized by intensive professional terminology and complex knowledge structures, making it challenging for general-purpose large language models to accurately understand and apply the relevant professional knowledge. This research presents a domain-specific fine-tuning approach and a retrieval-augmented generation system for the industrial catalysis field. Through a multi-model collaborative data processing pipeline, we construct high-quality training corpora, employ parameter-efficient fine-tuning techniques to train specialized domain models, and design a retrieval-augmented generation workflow based on consistency verification. We first establish a training corpus of 2.3 billion tokens, comprising 1.1 billion domain-specific tokens and 1.2 billion general tokens mixed under an approximately 1:1 ratio strategy. We then apply the rank-stabilized low-rank adaptation (rsLoRA) method to perform parameter-efficient fine-tuning of the Yi-1.5-6B model, producing the PeiYang Micro-Emergence model, which scores 76.81 on our industrial catalysis evaluation, significantly outperforming the general-purpose Qwen2.5-72B-Instruct (65.45), a model with 12 times as many parameters, while maintaining good general capabilities. We further construct a dataset of 3.37 million domain-specific retrieval pairs and optimize the embedding model with Matryoshka representation learning (MRL), improving domain retrieval recall@3 by an average of 2.87 percentage points while slightly enhancing general capabilities. Finally, we design a professional retrieval-augmented generation workflow that integrates bilingual hypothetical document generation, dual-path retrieval, and consistency verification to deliver high-quality professional knowledge services. The resulting system provides accurate and reliable professional knowledge for the industrial catalysis field, demonstrates the application value of domain-specific large language models in resource-constrained environments, and offers a replicable technical pathway for artificial intelligence applications in other specialized domains.
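The abstract names rsLoRA as the fine-tuning method applied to Yi-1.5-6B. Below is a minimal sketch of how such a setup might look with the Hugging Face PEFT library, which is not the authors' released code: the rank, alpha, dropout, and target modules are illustrative assumptions, since the paper's actual hyperparameters are not reproduced here.

```python
# Hedged sketch (not the authors' code): parameter-efficient fine-tuning of
# Yi-1.5-6B with rank-stabilized LoRA (rsLoRA) via the Hugging Face PEFT library.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "01-ai/Yi-1.5-6B"  # base model named in the abstract
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# rsLoRA replaces the standard LoRA scaling alpha/r with alpha/sqrt(r),
# which keeps the adapter update magnitude stable as the rank grows.
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,                # assumed rank; the paper's value may differ
    lora_alpha=128,      # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    use_rslora=True,     # switch on the alpha/sqrt(r) rsLoRA scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```

Training would then proceed with any standard causal-LM trainer over the mixed domain/general corpus described in the abstract.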


How to Cite
Wu, S.; Chang, X.; Ma, X.; Lin, X.; Zhao, R.; Zhao, Z.-J. Domain-Specific Fine-tuning of Large Language Models and Intelligent Question-Answering System for Industrial Catalysis. Smart Chemical Engineering 2026, 2 (1), 2. https://doi.org/10.53941/sce.2026.100002.
Copyright & License
Copyright (c) 2026 by the authors.