Open Access | Article

Bridging the Affective Gap: A Pedagogical Framework for Critical Voice Artistry in the AI Era

  • Jialin Ye 1,   
  • Jingying Yang 2,*

Received: 26 Dec 2025 | Revised: 06 Feb 2026 | Accepted: 10 Mar 2026 | Published: 31 Mar 2026

Abstract

The rise of AI speech synthesis, while achieving impressive naturalness, has revealed a profound educational challenge: its failure to convey complex human emotion and contextual nuance—termed the “affective gap”—threatens to undermine the ecology of voice artistry and societal aesthetic discernment. This paper first diagnoses this gap by examining its key manifestations (compound-emotion flattening, contextual deafness, and the prosodic uncanny valley) and tracing its root cause to the epistemological divide between AI’s data-driven pattern recognition and human embodied experience. It then analyzes the consequent structural disruption to the voice-acting industry’s traditional “pyramid” training model and the broader risk of cultural aesthetic deskilling. In response, the paper’s central contribution is a novel pedagogical framework designed to bridge this gap. The framework advocates a decisive shift in voice education from skill transmission towards critical voice artistry, centered on cultivating students’ capacities for deep textual and contextual analysis, empathetic and embodied sense-making, and the critical evaluation and direction of AI-generated speech. The paper argues that by integrating this critical pedagogical approach with strategic technology use, educators can empower future artists to navigate and shape a hybrid human–AI creative landscape. Ultimately, this work provides a theoretically grounded and actionable roadmap for innovating performing arts education in the AI era, positioning educational technology as a vital steward of uniquely human expressive intelligence.


How to Cite
Ye, J.; Yang, J. Bridging the Affective Gap: A Pedagogical Framework for Critical Voice Artistry in the AI Era. Journal of Educational Technology and Innovation 2026, 8 (1), 12–24. https://doi.org/10.61414/wmee7j93.
Copyright & License
Copyright (c) 2026 by the authors.