  • Open Access
  • Article

RiverEcho-2.0: A Real-Time Interactive System for Yellow River Culture via Enhanced MultiModal Document RAG

  • Haofeng Wang 1, 2,   
  • Yilin Guo 3,   
  • Tiange Zhang 1,   
  • Zehao Li 4,   
  • Tong Yue 3,   
  • Yizong Wang 3,   
  • Rongqun Lin 5,   
  • Feng Gao 6, *,   
  • Shiqi Wang 7,   
  • Siwei Ma 3, *

Received: 28 Aug 2025 | Revised: 08 Sep 2025 | Accepted: 15 Sep 2025 | Published: 22 Sep 2025

Abstract

The Yellow River culture is a cornerstone of Chinese civilization, embodying rich historical, social, and ecological significance. To conserve and promote this invaluable cultural heritage, we propose RiverEcho-2.0, a real-time interactive digital system designed to facilitate user engagement with Yellow River culture. As the foundation of our system, we curated and digitized a comprehensive collection of books and documents related to Yellow River heritage, constructing a dedicated multimodal corpus. To effectively leverage this corpus, we introduce a novel multimodal document Retrieval-Augmented Generation (RAG) framework that enhances document retrieval through context-aware image-text alignment and joint embedding. Experimental results demonstrate that our method substantially outperforms existing state-of-the-art multimodal RAG baselines and yields significant gains on downstream tasks.
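The retrieval step the abstract describes, scoring documents against a query with a joint image-text embedding, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the fusion weight `alpha`, the simple weighted-sum fusion, and the random vectors standing in for encoder outputs are all assumptions, since the abstract does not specify the encoders or the alignment mechanism.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse_embeddings(text_emb, image_emb, alpha=0.6):
    # Hypothetical joint page embedding: a weighted combination of the
    # normalized text and image vectors, renormalized for retrieval.
    joint = alpha * l2_normalize(text_emb) + (1 - alpha) * l2_normalize(image_emb)
    return l2_normalize(joint)

def retrieve(query_emb, page_embs, top_k=3):
    # Rank corpus pages by cosine similarity to the query embedding.
    scores = page_embs @ l2_normalize(query_emb)
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()

# Stand-in data: random vectors in place of real encoder outputs.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(10, 64))   # per-page text embeddings
image_embs = rng.normal(size=(10, 64))  # per-page image embeddings
pages = fuse_embeddings(text_embs, image_embs)

query = rng.normal(size=64)
idx, scores = retrieve(query, pages)
print(idx)  # indices of the top-3 retrieved pages
```

In a real system the fusion would come from a learned alignment model rather than a fixed `alpha`, and the nearest-neighbor search would use an index rather than a dense matrix product, but the ranking logic is the same.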

How to Cite
Wang, H.; Guo, Y.; Zhang, T.; Li, Z.; Yue, T.; Wang, Y.; Lin, R.; Gao, F.; Wang, S.; Ma, S. RiverEcho-2.0: A Real-Time Interactive System for Yellow River Culture via Enhanced MultiModal Document RAG. Transactions on Artificial Intelligence 2025, 1 (1), 212–226. https://doi.org/10.53941/tai.2025.100014.
Copyright & License
Copyright (c) 2025 by the authors.