Federated Bimodal Graph Neural Networks for Text-Image Retrieval
Abstract
Text-image retrieval is a key challenge at the intersection of computer vision and natural language processing, aiming to retrieve the most semantically relevant image or text given a query in the opposite modality. However, growing privacy and security concerns make traditional centralized learning approaches increasingly unsuitable for handling sensitive multimodal data. In this paper, we propose FedBi-GNNs, a federated learning framework for bimodal graph neural networks that enables collaborative training across decentralized clients without sharing private data. Each client independently constructs heterogeneous graphs from its local text and image data and learns cross-modal correspondences via bimodal graph matching. The resulting local representations are then aggregated at a central server using a heterogeneous federated aggregation scheme. Empirical results on the MSCOCO benchmark demonstrate that FedBi-GNNs significantly outperforms state-of-the-art methods, offering improved retrieval accuracy, enhanced privacy preservation, and greater robustness to data heterogeneity across clients.
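To make the server-side step concrete, the following is a minimal sketch of a FedAvg-style weighted aggregation of client model parameters. It is illustrative only: the abstract does not specify the paper's heterogeneous aggregation scheme, and the function name, parameter layout, and size-weighted averaging here are assumptions.

```python
from typing import Dict, List

# Hypothetical server-side aggregation: each client uploads a dict of named
# parameter vectors; the server averages them weighted by local dataset size.
# This is a FedAvg-style sketch, not the paper's heterogeneous scheme.
def aggregate(client_params: List[Dict[str, List[float]]],
              client_sizes: List[int]) -> Dict[str, List[float]]:
    """Return the size-weighted average of the clients' parameters."""
    total = sum(client_sizes)
    global_params: Dict[str, List[float]] = {}
    for key in client_params[0]:
        dim = len(client_params[0][key])
        acc = [0.0] * dim
        for params, n in zip(client_params, client_sizes):
            weight = n / total  # client's share of the total data
            for i, value in enumerate(params[key]):
                acc[i] += weight * value
        global_params[key] = acc
    return global_params
```

In a real federated round, each client would also perform local training on its own graph before uploading, and only these parameters (never the raw text or images) would leave the client.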
This work is licensed under a Creative Commons Attribution 4.0 International License.