  • Open Access
  • Article
CCE-Net: Causal Convolution Embedding Network for Streaming Automatic Speech Recognition
  • Feiteng Deng,   
  • Yue Ming *,   
  • Boyang Lyu

Received: 11 Mar 2024 | Accepted: 15 Aug 2024 | Published: 27 Sep 2024

Abstract

Streaming Automatic Speech Recognition (ASR) has gained significant attention across various application scenarios, including video conferencing, live sports events, and intelligent terminals. However, the chunk division used in current streaming speech recognition leaves each chunk with insufficient contextual information, which weakens attention modeling and reduces recognition accuracy. Mandarin speech recognition faces the additional risk that the phonemes of a Chinese character are split across different chunks, so characters at chunk boundaries may be recognized incorrectly from incomplete phonemes. To alleviate these problems, we propose a novel front-end network: the Causal Convolution Embedding Network (CCE-Net). The network introduces a causal convolution embedding module that provides richer historical context information while capturing the phoneme information of Chinese characters at chunk boundaries and feeding it to the current chunk. We conducted experiments on Aishell-1 and Aidatatang. The results show that our method achieves character error rates (CER) of 5.07% and 4.90%, respectively, without introducing any additional latency, demonstrating competitive performance.
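The causal-convolution idea underlying the embedding module can be illustrated in miniature: by padding only on the left, each output frame depends exclusively on current and past input frames, so a kernel applied at a chunk boundary still draws on the final frames of the previous chunk. The sketch below is a minimal NumPy illustration of this property, not the authors' CCE-Net implementation; the kernel values and chunk size are hypothetical.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: left (history) padding only, so
    y[t] depends on x[t - k + 1 .. t] and never on future frames."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad the past, not the future
    # y[t] = sum_i kernel[i] * x[t - i]  (kernel[0] weights the current frame)
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(x))])

# Toy feature stream of 8 frames, processed as two chunks of 4 frames.
x = np.arange(8, dtype=float)        # frames 0..7
k = np.array([0.5, 0.25, 0.25])      # hypothetical 3-tap causal kernel
y = causal_conv1d(x, k)

# At the boundary frame t = 4 (first frame of chunk 2), the output
# mixes in frames 2 and 3 from chunk 1 -- history crosses the boundary,
# but no future frame is ever consulted, so streaming latency is unchanged.
```

In a streaming encoder this is why a causal front end can enrich chunk boundaries "for free": it carries phoneme evidence forward from the previous chunk without waiting on any look-ahead frames.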

How to Cite
Deng, F.; Ming, Y.; Lyu, B. CCE-Net: Causal Convolution Embedding Network for Streaming Automatic Speech Recognition. International Journal of Network Dynamics and Intelligence 2024, 3 (3), 100019. https://doi.org/10.53941/ijndi.2024.100019.
Copyright & License
Copyright (c) 2024 by the authors.