DT-Pose: Towards Robust and Realistic Human Pose Estimation Using WiFi Signals

Yang Chen; Jingcai Guo

Abstract

Robust WiFi-based human pose estimation (HPE) is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. We revisit this problem and reveal two critical yet overlooked issues: (1) a cross-domain gap, caused by significant discrepancies in pose distributions between source and target domains; and (2) a structural fidelity gap, in which predicted skeletal poses exhibit distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper addresses both gaps by reformulating the task as a two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistency contrastive learning strategy with uniformity regularization, integrated into a self-supervised masked pretraining paradigm. This design facilitates robust learning of domain-consistent, sequence-level, motion-discriminative WiFi representations while mitigating the mode collapse that signal sparsity can cause. We then introduce a hybrid decoding architecture that incorporates explicit skeletal topology constraints. By compensating for the inherent absence of spatial priors in WiFi semantic vectors, the decoder enables structured modeling of both adjacent and overarching joint relationships, producing more realistic pose predictions. Extensive experiments on various benchmark datasets show that our method outperforms prior approaches on these challenges in both 2D and 3D WiFi-based HPE tasks.
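To make the first phase more concrete, the following is a minimal sketch of a contrastive objective with a uniformity regularizer. It is not the authors' implementation: the abstract does not give the loss definitions, so the sketch assumes a standard InfoNCE term, where row i of `positives` is taken to be the embedding of a temporally adjacent clip of the same sequence as row i of `anchors`, plus a Gaussian-potential uniformity term. The function names, `temperature`, `t`, and the weight `lam` are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """Assumed InfoNCE term: anchor i should match positive i, not the others."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def uniformity(embeddings, t=2.0):
    """Assumed uniformity term: log of the mean Gaussian potential over all
    distinct pairs. It is 0 when every embedding coincides (collapse) and
    becomes more negative as embeddings spread over the unit sphere."""
    z = [normalize(v) for v in embeddings]
    potentials = []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            sq_dist = sum((x - y) ** 2 for x, y in zip(z[i], z[j]))
            potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))

def total_loss(anchors, positives, lam=0.5):
    """Hypothetical combination; lam is an illustrative weight, not a paper value."""
    return info_nce(anchors, positives) + lam * uniformity(anchors + positives)

if __name__ == "__main__":
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    collapsed = [[1.0, 1.0]] * 4
    print("spread   :", total_loss(spread, spread))
    print("collapsed:", total_loss(collapsed, collapsed))
```

In this toy setting, collapsed embeddings are penalized by both terms: the contrastive term cannot tell positives from negatives, and the uniformity term sits at its maximum of zero. This is one plausible reading of how a uniformity regularizer can counteract collapse.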


Scilight Press
