Open Access | Article

A Survey of Multimodal Models on Language and Vision: A Unified Modeling Perspective

  • Zhongfen Deng 1,*,†,   
  • Yibo Wang 1,   
  • Yueqing Liang 2,   
  • Jiangshu Du 1,†,   
  • Yuyao Yang 1,‡,   
  • Liancheng Fang 1,‡,   
  • Langzhou He 1,‡,   
  • Yuwei Han 1,‡,   
  • Yuanjie Zhu 1,‡,   
  • Chunyu Miao 1,‡,   
  • Weizhi Zhang 1,   
  • Jiahua Chen 1,   
  • Yinghui Li 3,   
  • Wenting Zhao 4,   
  • Philip S. Yu 1

Received: 26 Aug 2025 | Revised: 27 Oct 2025 | Accepted: 28 Oct 2025 | Published: 03 Dec 2025

Abstract

In recent years, significant progress has been made in developing AI systems capable of processing multimodal data, such as text, images, and videos, to perform complex tasks. With the advent of Large Language Models (LLMs), there has been a surge of interest in building multimodal models on top of LLMs. Most current approaches employ a heterogeneous architecture that processes text and images separately before bridging them together, leading to a critical bridge bottleneck. Modeling multimodal data such as text and images in a unified manner can help overcome this limitation. Therefore, in this survey, we investigate the current research landscape of multimodal modeling from three perspectives. The first group of multimodal models adopts a heterogeneous architecture to bridge data from different modalities. The second line of research leverages LLMs for multimodal modeling via a unified language modeling objective. The third group represents multimodal data entirely within a single visual representation. The latter two groups can offer a more unified treatment of modalities, helping to alleviate the bridge bottleneck and paving the way for more capable multimodal systems.
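
To make the bridge bottleneck concrete, the sketch below shows the bridged (heterogeneous) pattern in a minimal PyTorch-style form: a vision encoder produces patch features, and a small projection layer maps them into the LLM's token-embedding space before they are fused with text embeddings. This is an illustrative sketch only; the class name, module choices, and dimensions are hypothetical placeholders and are not taken from any specific model covered by the survey.

import torch
import torch.nn as nn

class BridgedMultimodalLM(nn.Module):
    """Illustrative 'bridge' pattern: separate vision encoder + projector + LLM backbone."""
    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-ins for pretrained components (hypothetical toy sizes).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # a ViT/CLIP-style encoder in practice
        self.projector = nn.Linear(vision_dim, llm_dim)          # the "bridge" into the LLM token space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_feats, text_ids):
        # patch_feats: (B, num_patches, vision_dim); text_ids: (B, seq_len)
        vis = self.projector(self.vision_encoder(patch_feats))   # map image features into LLM space
        txt = self.text_embed(text_ids)
        fused = torch.cat([vis, txt], dim=1)                     # prepend projected visual "tokens"
        return self.lm_head(self.backbone(fused))                # (B, num_patches + seq_len, vocab_size)

model = BridgedMultimodalLM()
logits = model(torch.randn(2, 16, 256), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 1000])

By contrast, the second and third families surveyed here avoid a separate projector of this kind: they map both modalities into a single token vocabulary trained under one language-modeling objective, or render everything into a single visual representation.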

How to Cite
Deng, Z.; Wang, Y.; Liang, Y.; Du, J.; Yang, Y.; Fang, L.; He, L.; Han, Y.; Zhu, Y.; Miao, C.; Zhang, W.; Chen, J.; Li, Y.; Zhao, W.; Yu, P. S. A Survey of Multimodal Models on Language and Vision: A Unified Modeling Perspective. Data Mining and Machine Learning 2025, 1 (1), 100001.

Copyright & License
Copyright (c) 2025 by the authors.