Open Access
Article
CCE-Net: Causal Convolution Embedding Network for Streaming Automatic Speech Recognition
Feiteng Deng
Yue Ming*
Boyang Lyu
Submitted: 11 Mar 2024 | Accepted: 15 Aug 2024 | Published: 27 Sept 2024

Abstract

Streaming Automatic Speech Recognition (ASR) has gained significant attention across various application scenarios, including video conferencing, live sports events, and intelligent terminals. However, the chunk division used in current streaming speech recognition leaves each chunk with insufficient contextual information, which weakens attention modeling and reduces recognition accuracy. For Mandarin speech recognition, there is also a risk of splitting the phonemes of a Chinese character across chunks, so characters at chunk boundaries may be recognized incorrectly from incomplete phonemes. To alleviate these problems, we propose a novel front-end network: the Causal Convolution Embedding Network (CCE-Net). The network introduces a causal convolution embedding module that provides richer historical context while capturing the phoneme information of Chinese characters at chunk boundaries and feeding it to the current chunk. We conducted experiments on Aishell-1 and Aidatatang. The results show that our method achieves character error rates (CERs) of 5.07% and 4.90%, respectively, without introducing any additional latency, demonstrating competitive performance.
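The key primitive named in the abstract is causal convolution: each output depends only on the current and past inputs, so no future (right-context) frames are needed and no extra latency is introduced. The sketch below is purely illustrative of that primitive (it is not the paper's CCE-Net); the function name and the moving-sum kernel are our own choices, and causality is obtained by left-padding the sequence with zeros.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: the output at time t depends only on
    inputs at times <= t, achieved by left-padding with k-1 zeros."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # zero "history" before t=0
    # Cross-correlate with the flipped kernel so kernel[0] weights the
    # oldest sample in each window (standard convolution orientation).
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(x))])

# Moving-sum kernel: each output sums the current and previous two
# inputs -- no future samples are touched.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(causal_conv1d(x, np.array([1.0, 1.0, 1.0])))  # [1. 3. 6. 9.]
```

Because the receptive field extends only into the past, such a layer can carry phoneme information from the end of one chunk into the next without waiting for future frames, which is the property the abstract exploits.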

How to Cite
Deng, F., Ming, Y., & Lyu, B. (2024). CCE-Net: Causal Convolution Embedding Network for Streaming Automatic Speech Recognition. International Journal of Network Dynamics and Intelligence, 3(3), 100019. https://doi.org/10.53941/ijndi.2024.100019
Copyright & License
Copyright (c) 2024 by the authors.

This work is licensed under a Creative Commons Attribution 4.0 International License.