Positron emission tomography (PET) is an important medical imaging technique that reveals the molecular activity of tissues and organs by imaging the distribution of injected radioactive tracers. Low-dose (LD) PET is gradually being adopted to reduce radiation dose and scanning costs; however, it usually increases image noise and artifacts, which can affect clinical diagnosis. To preserve high image quality when only LD-PET data are available, this paper proposes a multi-modality Vision Transformer-based conditional generative adversarial network (ViT-cGAN) that reconstructs high-quality PET images directly from the corresponding LD-PET sinogram data and computed tomography (CT) images. Specifically, the network combines the advantages of the Vision Transformer with those of multi-modality inputs. In addition, an extensive objective function is designed to optimize the network and improve the detail and visual quality of the reconstructed images. Experimental results show that the proposed method effectively reconstructs high-quality PET images and outperforms current state-of-the-art methods.



