- 1.
Graves, A.; Fernández, S.; Gomez, F.; et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, USA, 2006; pp. 369–376. doi: 10.1145/1143844.1143891
- 2.
Cui, X.D.; Saon, G.; Kingsbury, B. Improving RNN transducer acoustic models for English conversational speech recognition. In Proceedings of the 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 20–24 August 2023; ISCA: Dublin, Ireland, 2023; pp. 1299–1303.
- 3.
Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, USA, 2017; pp. 6000–6010.
- 4.
Gulati, A.; Qin, J.; Chiu, C.C.; et al. Conformer: Convolution-augmented transformer for speech recognition. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; ISCA: Shanghai, China, 2020; pp. 5036–5040.
- 5.
Zeyer, A.; Schmitt, R.; Zhou, W.; et al. Monotonic segmental attention for automatic speech recognition. In Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; IEEE: New York, USA, 2023; pp. 229–236. doi: 10.1109/SLT54892.2023.10022818
- 6.
Chiu, C.C.; Raffel, C. Monotonic chunkwise attention. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; ICLR: Vancouver, BC, Canada, 2018.
- 7.
Tsunoo, E.; Kashiwagi, Y.; Kumakura, T.; et al. Towards online end-to-end transformer automatic speech recognition. arXiv: 1910.11871, 2019. doi: 10.48550/arXiv.1910.11871
- 8.
Inaguma, H.; Mimura, M.; Kawahara, T. Enhancing monotonic multihead attention for streaming ASR. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; ISCA: Shanghai, China, 2020; pp. 2137–2141.
- 9.
Miao, H.R.; Cheng, G.F.; Zhang, P.Y.; et al. Online hybrid CTC/attention end-to-end automatic speech recognition architecture. IEEE/ACM Trans. Audio Speech Lang. Process., 2020, 28: 1452–1465. doi: 10.1109/TASLP.2020.2987752
- 10.
Miao, H.R.; Cheng, G.F.; Gao, C.F.; et al. Transformer-based online CTC/attention end-to-end speech recognition architecture. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, USA, 2020; pp. 6084–6088. doi: 10.1109/ICASSP40776.2020.9053165
- 11.
Moritz, N.; Hori, T.; Le Roux, J. Streaming automatic speech recognition with the transformer model. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, USA, 2020; pp. 6074–6078. doi: 10.1109/ICASSP40776.2020.9054476
- 12.
Zhao, H.B.; Higuchi, Y.; Ogawa, T.; et al. An investigation of enhancing CTC model for triggered attention-based streaming ASR. In Proceedings of 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; IEEE: New York, USA, 2021; pp. 477–483.
- 13.
Wu, C.Y.; Wang, Y.Q.; Shi, Y.Y.; et al. Streaming transformer-based acoustic models using self-attention with augmented memory. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; ISCA: Shanghai, China, 2020; pp. 2132–2136.
- 14.
Shi, Y.Y.; Wang, Y.Q.; Wu, C.Y.; et al. Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, USA, 2021; pp. 6783–6787. doi: 10.1109/ICASSP39728.2021.9414560
- 15.
Wang, F.Y.; Xu, B. Shifted chunk encoder for transformer based streaming end-to-end ASR. In Proceedings of the 29th International Conference on Neural Information Processing, IIT Indore, India, 22–26 November 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 39–50. doi: 10.1007/978-981-99-1642-9_4
- 16.
Dai, Z.H.; Yang, Z.L.; Yang, Y.M.; et al. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; ACL: Stroudsburg, USA, 2019; pp. 2978–2988. doi: 10.18653/v1/P19-1285
- 17.
Zhang, B.B.; Wu, D.; Peng, Z.D.; et al. WeNet 2.0: More productive end-to-end speech recognition toolkit. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18–22 September 2022; ISCA: Incheon, Korea, 2022; pp. 1661–1665.
- 18.
Gulzar, H.; Busto, M.R.; Eda, T.; et al. miniStreamer: Enhancing small conformer with chunked-context masking for streaming ASR applications on the edge. In Proceedings of the 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 20–24 August 2023; ISCA: Dublin, Ireland, 2023; pp. 3277–3281.
- 19.
Zhang, Q.; Lu, H.; Sak, H.; et al. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, USA, 2020; pp. 7829–7833. doi: 10.1109/ICASSP40776.2020.9053896
- 20.
Shi, Y.Y.; Wu, C.Y.; Wang, D.L.; et al. Streaming transformer transducer based speech recognition using non-causal convolution. In Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 23–27 May 2022; IEEE: New York, USA, 2022; pp. 8277–8281. doi: 10.1109/ICASSP43922.2022.9747706
- 21.
Swietojanski, P.; Braun, S.; Can, D.; et al. Variable attention masking for configurable transformer transducer speech recognition. In Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, USA, 2023; pp. 1–5. doi: 10.1109/ICASSP49357.2023.10094588
- 22.
Hu, K.; Sainath, T.N.; Pang, R.M.; et al. Deliberation model based two-pass end-to-end speech recognition. In Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, USA, 2020; pp. 7799–7803. doi: 10.1109/ICASSP40776.2020.9053606
- 23.
Hu, K.; Pang, R.M.; Sainath, T.N.; et al. Transformer based deliberation for two-pass speech recognition. In Proceedings of 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; IEEE: New York, USA, 2021; pp. 68–74. doi: 10.1109/SLT48900.2021.9383497
- 24.
An, K.Y.; Zheng, H.H.; Ou, Z.J.; et al. CUSIDE: Chunking, simulating future context and decoding for streaming ASR. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18–22 September 2022; ISCA: Incheon, Korea, 2022; pp. 2103–2107.
- 25.
Zhao, H.B.; Fujie, S.; Ogawa, T.; et al. Conversation-oriented ASR with multi-look-ahead CBS architecture. In Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, USA, 2023; pp. 1–5. doi: 10.1109/ICASSP49357.2023.10094614
- 26.
Strimel, G.; Xie, Y.; King, B.J.; et al. Lookahead when it matters: Adaptive non-causal transformers for streaming neural transducers. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: New York, USA, 2023; pp. 32654–32676.
- 27.
Audhkhasi, K.; Farris, B.; Ramabhadran, B.; et al. Modular conformer training for flexible end-to-end ASR. In Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, USA, 2023; pp. 1–5. doi: 10.1109/ICASSP49357.2023.10095966
- 28.
Boyer, F.; Shinohara, Y.; Ishii, T.; et al. A study of transducer based end-to-end ASR with ESPnet: Architecture, auxiliary loss and decoding strategies. In Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; IEEE: New York, USA, 2021; pp. 16–23. doi: 10.1109/ASRU51503.2021.9688251
- 29.
Yao, Z.Y.; Wu, D.; Wang, X.; et al. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August–3 September 2021; ISCA: Brno, Czechia, 2021; pp. 4054–4058.
- 30.
Park, D.S.; Chan, W.; Zhang, Y.; et al. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; ISCA: Graz, Austria, 2019; pp. 2613–2617.
- 31.
Burchi, M.; Vielzeuf, V. Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition. In Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; IEEE: New York, USA, 2021; pp. 8–15. doi: 10.1109/ASRU51503.2021.9687874
- 32.
Guo, P.C.; Boyer, F.; Chang, X.K.; et al. Recent developments on ESPnet toolkit boosted by conformer. In Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, USA, 2021; pp. 5874–5878. doi: 10.1109/ICASSP39728.2021.9414858
- 33.
Tsunoo, E.; Kashiwagi, Y.; Watanabe, S. Streaming transformer ASR with blockwise synchronous beam search. In Proceedings of 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; IEEE: New York, USA, 2021; pp. 22–29. doi: 10.1109/SLT48900.2021.9383517
- 34.
Wang, Z.C.; Yang, W.W.; Zhou, P.; et al. WNARS: WFST based non-autoregressive streaming end-to-end speech recognition. arXiv: 2104.03587, 2021. doi: 10.48550/arXiv.2104.03587