In recent years, significant progress has been made in developing AI systems capable of processing multimodal data, such as text, images, and videos, to perform complex tasks. With the advent of Large Language Models (LLMs), there has been a surge of interest in building multimodal models on top of LLMs. Most current approaches employ a heterogeneous architecture that processes text and images separately before bridging them together, leading to a critical bridge bottleneck. Modeling multimodal data such as text and images in a unified manner can help overcome this limitation. In this survey, we therefore investigate the current research landscape of multimodal modeling from three perspectives. The first group of multimodal models adopts a heterogeneous architecture to bridge data from different modalities. The second line of research leverages LLMs for multimodal modeling via a unified language modeling objective. The third group represents multimodal data entirely within a single visual representation. The latter two groups offer a more unified treatment of modalities, helping to alleviate the bridge bottleneck and paving the way for more capable multimodal systems.



