  • Open Access
  • Article

SUGAR: A Sequence Unfolding Based Transformer Model for Group Activity Recognition

Yash Gondkar 1, Chengjie Zheng 1, Yumeng Yang 2, Shiqian Shen 3, Wei Ding 1,*, and Ping Chen 1

Received: 10 Jun 2025 | Revised: 12 Aug 2025 | Accepted: 17 Sep 2025 | Published: 28 Sep 2025

Abstract

Deep learning models built upon Transformer architectures have led to substantial advances in sequential data analysis. Nevertheless, their direct application to video-based tasks such as Group Activity Recognition (GAR) remains constrained by the quadratic computational complexity and excessive memory requirements of global self-attention, especially on long video sequences. To overcome these limitations, we propose SUGAR: A Sequence Unfolding Based Transformer Model for Group Activity Recognition. Our approach introduces a novel sequence unfolding and folding mechanism that partitions long video sequences into overlapping local windows, allowing the model to concentrate attention within compact temporal regions. This local attention design dramatically reduces computational cost and memory footprint while maintaining high recognition accuracy. Used as a drop-in replacement for the conventional Transformer blocks in the Bi-Causal framework, SUGAR achieves state-of-the-art performance on the Volleyball dataset, consistently exceeding 93% accuracy with significantly improved efficiency. In addition, we investigate Lightning Attention-2 as an alternative linear-complexity attention module and identify practical challenges such as increased memory usage and unstable convergence; to ensure robustness and training stability, we incorporate a dedicated safety mechanism that mitigates these issues. In summary, SUGAR offers a scalable, resource-efficient solution for group activity analysis in videos and shows strong potential for broader applications involving long sequential data in computer vision and bioinformatics.
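To make the unfold-attend-fold idea concrete, the PyTorch (2.x) sketch below partitions a feature sequence into overlapping windows, runs plain scaled dot-product attention inside each window, and folds the results back by overlap-adding and averaging. It is a minimal illustration, not the authors' implementation: the window size, stride, shared query/key/value tensors, and the averaging-based fold are all assumptions made for brevity.

```python
# A minimal sketch of sequence unfolding -> local attention -> folding.
# Window size, stride, and the averaging fold are illustrative choices,
# not details taken from the paper.
import torch
import torch.nn.functional as F

def local_window_attention(x, window=8, stride=4):
    """x: (batch, seq_len, dim). Self-attention restricted to overlapping windows."""
    B, T, D = x.shape
    # Unfold: view the sequence as N overlapping windows of shape (B, N, window, D).
    w = x.unfold(1, window, stride).permute(0, 1, 3, 2).contiguous()
    N = w.shape[1]
    # Attention inside each window only (queries = keys = values here for
    # brevity; a real Transformer block would add learned projections).
    q = w.view(B * N, window, D)
    a = F.scaled_dot_product_attention(q, q, q).view(B, N, window, D)
    # Fold: overlap-add the window outputs back to (B, T, D), averaging
    # positions covered by more than one window. Tail positions not covered
    # by any window are simply left at zero in this sketch.
    out = x.new_zeros(B, T, D)
    count = x.new_zeros(B, T, 1)
    for i in range(N):
        s = i * stride
        out[:, s:s + window] += a[:, i]
        count[:, s:s + window] += 1
    return out / count.clamp(min=1)

# Example: a batch of 2 clips, 64 frame features of dimension 128.
feats = torch.randn(2, 64, 128)
print(local_window_attention(feats).shape)  # torch.Size([2, 64, 128])
```

Because each position attends to at most `window` neighbors, the attention cost grows linearly with sequence length at a fixed window size rather than quadratically, which is the source of the efficiency gain described above.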

How to Cite
Gondkar, Y.; Zheng, C.; Yang, Y.; Shen, S.; Ding, W.; Chen, P. SUGAR: A Sequence Unfolding Based Transformer Model for Group Activity Recognition. Transactions on Artificial Intelligence 2025, 1 (1), 227–245. https://doi.org/10.53941/tai.2025.100015.
Copyright & License
Copyright (c) 2025 by the authors.