Abstract
Streaming Automatic Speech Recognition (ASR) has gained significant attention across a range of application scenarios, including video conferencing, live sports events, and intelligent terminals. However, the chunk division used in current streaming speech recognition provides insufficient contextual information, weakening attention modeling and reducing recognition accuracy. For Mandarin speech recognition, there is the additional risk of splitting the phonemes of a Chinese character across different chunks, so that characters at chunk boundaries may be recognized incorrectly due to incomplete phoneme information. To alleviate these problems, we propose a novel front-end network, the Causal Convolution Embedding Network (CCE-Net). The network introduces a causal convolution embedding module that obtains richer historical context while capturing the phoneme information of Chinese characters at chunk boundaries and feeding it to the current chunk. We conducted experiments on Aishell-1 and Aidatatang. The results show that our method achieves character error rates (CER) of 5.07% and 4.90%, respectively, without introducing any additional latency, demonstrating competitive performance.
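The abstract's claim of "no additional latency" rests on the causal nature of the convolution: each output frame may only depend on current and past input frames, never future ones. The paper's actual CCE-Net architecture is not detailed here; the following is a minimal, generic sketch of a causal 1-D convolution (left-padded by `kernel_size - 1` zeros), the standard construction such a causal embedding module would build on. The function name `causal_conv1d` and the toy signal are illustrative, not taken from the paper.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[0..t].

    Left-padding the input with (kernel_size - 1) zeros means the
    filter never reads frames to the right of t, so no future context
    is consumed and no extra algorithmic latency is introduced.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # history-only padding
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]

# Toy example: a 2-tap averaging filter over a short frame sequence.
signal = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(signal, kernel=[0.5, 0.5])  # [0.5, 1.5, 2.5, 3.5]
```

Note that changing a future frame leaves all earlier outputs unchanged, which is exactly the property that lets such a front-end run in streaming mode.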

This work is licensed under a Creative Commons Attribution 4.0 International License.