Open Access | Article

Bridging the Affective Gap: A Pedagogical Framework for Critical Voice Artistry in the AI Era

  • Jialin Ye 1,   
  • Jingying Yang 2,*

Received: 26 Dec 2025 | Revised: 06 Feb 2026 | Accepted: 10 Mar 2026 | Published: 31 Mar 2026

Abstract

The rise of AI speech synthesis, while achieving impressive naturalness, has revealed a profound educational challenge: its failure to convey complex human emotion and contextual nuance—termed the “affective gap”—threatens to undermine the ecology of voice artistry and societal aesthetic discernment. This paper first diagnoses this gap by examining its key manifestations (compound-emotion flattening, contextual deafness, and the prosodic uncanny valley) and tracing its root cause to the epistemological divide between AI’s data-driven pattern recognition and human embodied experience. It then analyzes the consequent structural disruption to the voice-acting industry’s traditional “pyramid” training model and the broader risk of cultural aesthetic deskilling. In response, the paper’s central contribution is a novel pedagogical framework designed to bridge this gap. The framework advocates a decisive shift in voice education from skill transmission towards critical voice artistry, centered on cultivating students’ capacities for deep textual and contextual analysis, empathetic and embodied sense-making, and the critical evaluation and direction of AI-generated speech. The paper argues that by integrating this critical pedagogical approach with strategic technology use, educators can empower future artists to navigate and shape a hybrid human–AI creative landscape. Ultimately, this work provides a theoretically grounded and actionable roadmap for innovating performing arts education in the AI era, positioning educational technology as a vital steward of uniquely human expressive intelligence.


How to Cite
Ye, J.; Yang, J. Bridging the Affective Gap: A Pedagogical Framework for Critical Voice Artistry in the AI Era. Journal of Educational Technology and Innovation 2026, 8 (1), 12–24. https://doi.org/10.61414/wmee7j93.
Copyright & License
Copyright (c) 2026 by the authors.