  • Open Access
  • Article

Domain-Specific Fine-tuning of Large Language Models and Intelligent Question-Answering System for Industrial Catalysis

  • Shican Wu 1,2,†,   
  • Xin Chang 1,2,†,   
  • Xiao Ma 1,2,   
  • Xiaoyun Lin 1,2,   
  • Ran Zhao 1,2,   
  • Zhi-Jian Zhao 1,2,*

Received: 30 Dec 2025 | Revised: 23 Feb 2026 | Accepted: 05 Mar 2026 | Published: 16 Mar 2026

Abstract

Industrial catalysis, as a core field of chemical engineering, is characterized by intensive professional terminology and complex knowledge structures, making it challenging for general-purpose large language models to accurately understand and apply the relevant professional knowledge. This research presents a domain-specific fine-tuning approach and a retrieval-augmented generation system for the industrial catalysis field. Through a multi-model collaborative data processing pipeline, we construct high-quality training corpora, employ parameter-efficient fine-tuning techniques to train specialized domain models, and design a retrieval-augmented generation workflow based on consistency verification. We first establish a training corpus of 2.3 billion tokens, comprising 1.1 billion domain-specific tokens and 1.2 billion general tokens mixed under an approximately 1:1 ratio strategy. We then apply the rank-stabilized low-rank adaptation (rsLoRA) method to perform parameter-efficient fine-tuning of the Yi-1.5-6B model, producing the PeiYang Micro-Emergence model, which scores 76.81 on our industrial catalysis evaluation, significantly outperforming the general-purpose Qwen2.5-72B-Instruct (65.45), a model with 12 times as many parameters, while maintaining good general capabilities. We further construct a dataset of 3.37 million domain-specific retrieval pairs and optimize the embedding model with Matryoshka representation learning (MRL), improving domain retrieval recall@3 by an average of 2.87 percentage points while slightly enhancing general capabilities. Finally, we design a professional retrieval-augmented generation workflow that integrates bilingual hypothetical document generation, dual-path retrieval, and consistency verification to deliver high-quality professional knowledge services. The resulting system provides accurate and reliable professional knowledge for the industrial catalysis field, demonstrates the application value of domain-specific large language models in resource-constrained environments, and offers a replicable technical pathway for artificial intelligence applications in other specialized domains.
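The abstract names rsLoRA as the fine-tuning method applied to Yi-1.5-6B. Below is a minimal sketch of how such a setup might look with the Hugging Face PEFT library, which is not the authors' released code: the rank, alpha, dropout, and target modules are illustrative assumptions, since the paper's actual hyperparameters are not reproduced here.

```python
# Hedged sketch (not the authors' code): parameter-efficient fine-tuning of
# Yi-1.5-6B with rank-stabilized LoRA (rsLoRA) via the Hugging Face PEFT library.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "01-ai/Yi-1.5-6B"  # base model named in the abstract
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# rsLoRA replaces the standard LoRA scaling alpha/r with alpha/sqrt(r),
# which keeps the adapter update magnitude stable as the rank grows.
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,                # assumed rank; the paper's value may differ
    lora_alpha=128,      # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    use_rslora=True,     # switch on the alpha/sqrt(r) rsLoRA scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```

Training would then proceed with any standard causal-LM trainer over the mixed domain/general corpus described in the abstract.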


How to Cite
Wu, S.; Chang, X.; Ma, X.; Lin, X.; Zhao, R.; Zhao, Z.-J. Domain-Specific Fine-tuning of Large Language Models and Intelligent Question-Answering System for Industrial Catalysis. Smart Chemical Engineering 2026, 2 (1), 2. https://doi.org/10.53941/sce.2026.100002.
Copyright & License
Copyright (c) 2026 by the authors.