JP7588679B2

JP7588679B2 - Audio signal processing device, program, and audio signal processing method

Info

Publication number: JP7588679B2
Application number: JP2023076242A
Authority: JP
Inventors: ゼンチャンシ; ズールイリン; ウールンディ; 裕子石若; シアンジンシュ; ズユヤン
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2022-05-05
Filing date: 2023-05-02
Publication date: 2024-11-22
Anticipated expiration: 2043-05-02
Also published as: JP2023165667A

Description

The present invention relates to real-time processing, noise reduction, sliding windows, neural networks, and data streams.

The following documents relate to the present invention:

[1] Santiago Pascual, Antonio Bonafonte, and Joan Serra, "Segan: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[2] Francois G. Germain, Qifeng Chen, and Vladlen Koltun, "Speech denoising with deep feature losses," arXiv preprint arXiv:1806.10522, 2018.
[3] Craig Macartney and Tillman Weyde, "Improved speech enhancement with the wave-u-net," arXiv preprint arXiv:1811.11307, 2018.
[4] Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin, "Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement," in International Conference on Machine Learning. PMLR, 2019, pp. 2031-2041.
[5] Ruilin Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, and Changxi Zheng, "Listening to sounds of silence for speech denoising," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., 2020, vol. 33, pp. 9633-9648, Curran Associates, Inc.
[6] Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi, "Real time speech enhancement in the waveform domain," arXiv preprint arXiv:2006.12847, 2020.
[7] Xiang Hao, Xiangdong Su, Radu Horaud, and Xiaofei Li, "Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6633-6637.
[8] Tyler Vuong, Yangyang Xia, and Richard M. Stern, "A modulation-domain loss for neural-network-based realtime speech enhancement," 2021.
[9] Qiquan Zhang, Aaron Nicolson, Mingjiang Wang, Kuldip K Paliwal, and Chenxu Wang, "Deepmmse: A deep learning approach to mmse-based noise power spectral density estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1404-1415, 2020.
[10] Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, and Kyogu Lee, "Real-time denoising and dereverberation with tiny recurrent u-net," arXiv preprint arXiv:2102.03207, 2021.
[11] Sepp Hochreiter and Jurgen Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, vol. 37, no. 4, pp. 1-11, July 2018.
[13] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.
[14] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[15] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The diverse environments multi-channel acoustic noise database (demand): A database of multichannel environmental noise recordings," in 21st International Congress on Acoustics, Montreal, Canada, June 2013, Acoustical Society of America. The dataset itself is archived on Zenodo, with DOI 10.5281/zenodo.1227120.
[16] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigating rnn-based speech enhancement methods for noise-robust text-to-speech," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 146-152.
[17] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[18] ITU-T Recommendation, "Perceptual evaluation of speech quality (pesq): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[19] Yi Hu and Philipos C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229-238, 2008.
[20] Schuyler R. Quackenbush, Thomas P. Barnwell, and Mark A. Clements, Objective Measures of Speech Quality, Prentice Hall, Englewood Cliffs, NJ, 1988.

These and other aspects of the invention will be apparent from, and elucidated with reference to, the embodiments described hereinafter.

FIG. 1 schematically shows dynamic versus fixed-size sliding windows.
FIG. 2 schematically illustrates data padding.
FIG. 3 schematically shows a comparison of real-time performance.
FIG. 4 schematically shows the network structure.
FIG. 5 schematically shows denoised audio.
FIG. 6 schematically shows an example of the hardware configuration of a computer 1200 functioning as the data processing device 100.

1. Introduction
Real-time speech denoising is an audio processing task in very high demand, perhaps higher than ever before, because our world remains overshadowed by the coronavirus disease (COVID) pandemic and online meetings have become the "new normal" of daily social life. Recently in particular, the prior-art real-time denoising techniques have all been based on neural networks [1, 2, 3, 4, 5, 6, 7, 8]. They seek novel network structures that achieve reasonable denoising quality while keeping the network simple in order to reduce processing time.

Unlike offline speech denoising, the audio signal in a real-time setting is provided in a streaming manner. The network must process signal samples as soon as they arrive and generate output samples with the shortest possible delay. As a result, nearly all real-time techniques share a common strategy: they use a fixed-length sliding window buffer.

This seems like a natural choice. The network waits until the sliding window buffer is filled with incoming audio samples and then denoises the data in that buffer. The resulting signal data is then fed to a real-time audio player. In this way, the network expects an input signal of fixed length L. As long as the network's processing time is always shorter than L, the total delay from the arrival of a signal sample to the playback of the corresponding denoised sample is bounded: it is longer than L and shorter than 2L (see the analysis in Section 2.1).

In practice, however, it is difficult, if not impossible, to choose a buffer length L that guarantees real-time performance. A large L causes a long delay; a small L may lead to a network processing time longer than L, which results in choppy audio playback. This is because a real computing environment always has other background processes running (e.g., 4K video playback and gaming). The shorter L is, the more susceptible the denoising network is to the CPU occupancy of other processes. In short, although the sliding window strategy is fundamental to real-time denoising, it has so far eluded careful study.

We propose a different sliding window strategy, namely the dynamic sliding window. In our approach, the length of the input buffer is not fixed. The network takes the currently buffered data, regardless of its length, and starts processing it without waiting. While the network is running, newly received data accumulates in the buffer and is ready to be processed when the network finishes the current round of denoising. This sliding window strategy is conceptually simple, yet it is more robust to the CPU occupancy of other processes and therefore yields shorter and more stable delays. To show its advantages, we analyze the audio playback delay caused by our approach and by the commonly used fixed-size sliding window. To the best of our knowledge, this is the first time that network delay under different sliding window strategies has been investigated.

Many existing real-time denoising networks (e.g., [6, 7, 8]) cannot easily incorporate a dynamic sliding window (see the discussion in Section 2.2). We therefore propose a lightweight denoising network adapted to the real-time setting: it accepts streaming signals and processes them using a dynamic sliding window. By carefully padding and reusing data across sliding windows, our network suffers no loss in denoising quality compared with the offline, non-streaming case.

We have performed extensive experiments comparing our proposed model with several prior-art real-time denoising methods. The results show that our proposed model achieves comparable denoising quality on all quality metrics, with the lowest playback delay among all compared methods. Most importantly, compared with a prior real-time denoising method [6], our model is more robust in maintaining real-time performance in real-world scenarios, where other background tasks such as Zoom meetings, 4K video editing, and video games may preempt CPU cycles.

2. Method
We start by analyzing the audio playback delay when a fixed-size sliding window is used, and compare it with the delay when our proposed dynamic sliding window is used (Section 2.1). Motivated by this analysis, we then propose a lightweight denoising network that exploits the dynamic sliding window for faster and more robust denoising (Section 2.2).

FIG. 1 schematically shows dynamic versus fixed-size sliding windows. Different windows are indicated by different hatching. Each window is processed individually by the network. The grey areas indicate periods in which CPU cycles are occupied by other processes, so that the network's processing time increases.

2.1 Sliding Windows for Data Streams
Fixed-size sliding window. In almost all existing network-based denoising models, the incoming audio signal is treated as a sequence of consecutive, non-overlapping windows [X_1, X_2, ...] in time. Each window holds audio samples of constant length L (i.e., X_i ∈ R^L) and is filled in a streaming manner. The network F takes the latest unprocessed window X_i, outputs the denoised result F(X_i), and then waits until the next window X_{i+1} is filled (see FIG. 1(a)). Let t_k denote the network's processing time for window X_k. Note that although the window size is fixed, in practice t_k varies over time because of the CPU occupancy of other background processes. Our analysis shows that the delay (or lag) d_i, from the moment X_i is received to the moment the network outputs F(X_i), is given by equation (1).

The derivation of equation (1) is non-trivial. Because space is limited here, we omit the details of the derivation, but we have made it publicly available online^1.

Our analysis (1) reveals the following. In the ideal case where t_i < L always holds, the denoising network runs smoothly in real time without accumulating delay, and the playback delay is bounded above by 2L. In reality, however, the execution of the network is often affected by other background computations, and its processing time t_i can become longer than L. Meanwhile, audio samples of length t_i arrive and accumulate in the buffer. To process this amount of data, the network must run $\lceil t_i / L \rceil$ times, which in turn can cause the audio playback delay to accumulate (hence the summation term in (1)).
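
The accumulation described above can be reproduced with a small simulation. The sketch below is an illustration rather than the derivation behind equation (1): it assumes that window i is complete at time i·L, that the network handles one window at a time, and that the delay of a window is measured from the arrival of its first sample. The function name and the example processing times are hypothetical.

```python
def fixed_window_delays(proc_times, L):
    """Delay of each window under a fixed-size sliding window.

    proc_times[i] is the processing time of the (i+1)-th window and L is the
    window length, both in the same time unit (e.g. milliseconds).
    """
    delays = []
    prev_finish = 0.0
    for i, t in enumerate(proc_times, start=1):
        window_full = i * L                    # window i is complete at time i * L
        start = max(window_full, prev_finish)  # wait for the data and for the network
        finish = start + t
        first_sample_arrival = (i - 1) * L
        delays.append(finish - first_sample_arrival)
        prev_finish = finish
    return delays


L = 16.0
# Every window is processed faster than L: the delay stays between L and 2L.
smooth = fixed_window_delays([10.0] * 6, L)
# Two windows take 40 ms (> L): later windows inherit the backlog.
stalled = fixed_window_delays([10.0, 10.0, 40.0, 40.0, 10.0, 10.0], L)
print(smooth)
print(stalled)
```

In the second run the delay keeps growing while the slow windows are processed and shrinks by only L minus the processing time per window afterwards, which is why short stalls can be audible for a long time.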

Dynamic sliding window. We propose to adjust the size of the sliding window dynamically. Immediately after the network finishes processing a data window X_i, the newly received data in the buffer has length t_i, which may or may not be greater than L. Regardless of the buffer length, we denoise the available buffered data without waiting. FIG. 1(b) illustrates the process. With this strategy, the delay d_i for playing window X_i is

$$d_i = t_{i-1} + t_i. \quad (2)$$

When i = 1, the delay of the first window is d_1 = L_0 + t_1, where L_0 is the initial window size used to start the denoising process. We again refer the reader to our online document for the derivation of equation (2).

This analysis shows that d_i depends only on the network processing times of two consecutive windows, X_{i-1} and X_i. In contrast to the fixed-size sliding window (see (1)), no delay accumulates. Our approach is therefore more robust to fluctuations in computing power. This is a notable advantage, because in a real computing environment the computing power available for denoising fluctuates constantly (see Section 3.3).
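
A minimal sketch of the dynamic strategy makes the absence of accumulation visible. It assumes that the processing times are given, whereas in practice they also depend on the window length, and that the delay of a window is measured from the arrival of its first sample. Under these assumptions each delay is the sum of two consecutive processing times, with L_0 taking the place of the missing predecessor for the first window. The function name and the example values are hypothetical.

```python
def dynamic_window_delays(proc_times, L0):
    """Delay of each window under a dynamic sliding window.

    proc_times[i] is the processing time of the (i+1)-th window and L0 is the
    initial window length, both in the same time unit.
    """
    delays = []
    window_start = 0.0   # arrival time of the first sample in the current window
    now = L0             # the first window is processed once L0 of audio has arrived
    for t in proc_times:
        finish = now + t
        delays.append(finish - window_start)
        window_start = now   # the next window holds everything that arrived since `now`
        now = finish         # and is processed as soon as the network is free
    return delays


times = [10.0, 10.0, 40.0, 40.0, 10.0, 10.0]
dynamic = dynamic_window_delays(times, L0=16.0)
print(dynamic)
```

With the same stall as in a fixed-window run (two windows taking 40 ms), the delay peaks while the slow windows are processed and returns to its initial level two windows later.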

2.2 Network Structure and Data Padding
Although the dynamic sliding window strategy is independent of any specific network structure, many existing real-time denoising networks [6, 7, 8, 9, 10] cannot easily be adapted to use it. Some of them require a predetermined sliding window size before the network runs [6]. Others focus on reducing the inference cost of the network, but it remains unclear how they handle the incoming data stream [7, 8, 9, 10].

Proposed network structure. We propose a lightweight denoising network that uses a dynamic sliding window. Our network builds on the denoising component of [5] (and is therefore considerably simpler than the denoising model therein). The input to our network is the spectrogram s_x obtained by applying the STFT to a data window X_i. The spectrogram s_x is first processed by a 2D convolutional layer with kernel size (5, 5) and dilation (1, 1) in the time-frequency domain. The resulting feature map is fed into a unidirectional LSTM [11] with hidden size 400. Finally, three fully connected layers with hidden sizes (400, 600, 512) are applied to each time bin. Like other speech enhancement models [5, 12, 13], our network outputs a complex-valued mask c of the same dimensions as s_x. The denoised audio signal is then obtained by applying the inverse STFT to $c \odot s_x$, where $\odot$ denotes the Hadamard product.
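
To make the masking step concrete, the following sketch applies a complex-valued mask to a toy spectrogram by element-wise (Hadamard) multiplication. The 2 x 2 values are made up for illustration, the function name is hypothetical, and the inverse STFT that would follow is omitted.

```python
def apply_complex_mask(mask, spec):
    """Element-wise (Hadamard) product of two equally shaped complex matrices."""
    same_rows = len(mask) == len(spec)
    if not same_rows or any(len(m) != len(s) for m, s in zip(mask, spec)):
        raise ValueError("mask and spectrogram must have the same shape")
    return [[m * s for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, spec)]


# Toy spectrogram s_x (rows: frequency bins, columns: time frames).
s_x = [[1 + 1j, 2 + 0j],
       [0 + 2j, -1 + 0j]]
# Toy mask c: a real entry only scales a bin, a complex entry also rotates its phase.
c = [[0.5 + 0j, 0 + 1j],
     [1 + 0j, 0.25 + 0j]]

masked = apply_complex_mask(c, s_x)
print(masked)
```

Because the mask is complex, it can correct the phase of each time-frequency bin as well as its magnitude, which a real-valued mask cannot do.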

During training, we optimize a loss function defined with respect to the ground-truth spectrogram of the clean audio. When computing the STFT, we set the number of FFT bins to 510, the Hann window size to 400, and the hop length to 128.
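
As a quick arithmetic check on these STFT settings, the sketch below computes the number of frequency bins and frames they produce. It assumes a one-sided STFT without centre padding whose frames are n_fft samples long; the text does not state these conventions, and the helper names are hypothetical.

```python
N_FFT = 510        # number of FFT bins, from the text
WIN_LENGTH = 400   # Hann window size, from the text
HOP_LENGTH = 128   # hop length, from the text


def onesided_bins(n_fft):
    """Frequency bins of a one-sided STFT of a real-valued signal."""
    return n_fft // 2 + 1


def num_frames(num_samples, n_fft=N_FFT, hop=HOP_LENGTH):
    """Frames produced without centre padding (assumed); 0 if the signal is too short."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop


bins = onesided_bins(N_FFT)
mask_values_per_frame = 2 * bins   # one real and one imaginary part per bin
print(bins, mask_values_per_frame, num_frames(16000))
```

Under these assumptions each frame has 256 frequency bins, so a complex mask needs 512 real values per frame. That figure coincides with the size of the last fully connected layer, although the text does not state the connection explicitly.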

FIG. 2 schematically illustrates data padding. Two consecutive windows of size 8 and 6 are processed by two 1D convolutions with kernel size 3. In this example, we pad each window with two future elements (22 and 23 for window X_i) and reuse two elements of the first convolution's result from the preceding window (15 and 16, obtained when processing window X_i).

Data padding. Our network (like many others) has convolutional layers, which require padding in order to process data at the boundaries. To handle streaming data with a sliding window, this means that enough "future" data must be buffered when a data window is processed (e.g., blocks 16 and 17 for window X_{i-1} in FIG. 2). Waiting for the padding data to arrive introduces an additional delay (48 ms in our implementation). However, the convolution results computed on the padding data can be reused for the next sliding window (see the illustration in FIG. 2). Apart from saving some computation, reusing the padding data is important for the network's denoising quality: it ensures that our network maintains the same denoising quality as if it had taken in the entire signal at once. Perhaps surprisingly, existing real-time denoising networks still lack such a guarantee (see the experiments in Section 3.2).
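
The effect of padding and reuse can be checked on a toy version of the setup in FIG. 2. The sketch below stacks two 1D convolutions with kernel size 3 and compares three ways of running them: on the whole signal at once, on isolated zero-padded windows, and on windows that read two future samples and reuse two cached first-layer values. It simulates streaming by indexing into a complete signal, assumes the window sizes sum to the signal length, and uses made-up kernels; it is not the network's actual implementation.

```python
def conv3(x, k):
    """'Same' 1D convolution, kernel size 3, zero padding at both ends."""
    p = [0.0] + list(x) + [0.0]
    return [k[0] * p[i] + k[1] * p[i + 1] + k[2] * p[i + 2] for i in range(len(x))]


def offline(x, k1, k2):
    """Two stacked convolutions applied to the whole signal at once."""
    return conv3(conv3(x, k1), k2)


def naive_stream(x, sizes, k1, k2):
    """Each window is zero-padded and processed in isolation."""
    out, a = [], 0
    for size in sizes:
        out += offline(x[a:a + size], k1, k2)
        a += size
    return out


def padded_stream(x, sizes, k1, k2):
    """Each window uses two future samples and two cached first-layer values."""
    n = len(x)

    def sample(i):
        return x[i] if 0 <= i < n else 0.0

    def layer1(i):
        # Outside the signal the second layer sees zero padding, as in conv3.
        if not 0 <= i < n:
            return 0.0
        return k1[0] * sample(i - 1) + k1[1] * sample(i) + k1[2] * sample(i + 1)

    out, a = [], 0
    carry = [0.0]    # first-layer values carried over; starts as the left zero pad
    next_idx = 0     # first first-layer index that has not been computed yet
    for size in sizes:
        b = a + size
        # First-layer values up to index b read samples up to b + 1 (two future samples).
        fresh = [layer1(i) for i in range(next_idx, b + 1)]
        y1 = carry + fresh   # first-layer values for indices a - 1 .. b
        out += [k2[0] * y1[j] + k2[1] * y1[j + 1] + k2[2] * y1[j + 2]
                for j in range(size)]
        carry, next_idx, a = y1[-2:], b + 1, b   # reuse y1[b - 1] and y1[b] next time
    return out


x = [float(v) for v in range(1, 15)]   # 14 samples, split into windows of 8 and 6
k1, k2 = [0.5, 1.0, -0.5], [1.0, -2.0, 1.0]
reference = offline(x, k1, k2)
print(padded_stream(x, [8, 6], k1, k2) == reference)
print(naive_stream(x, [8, 6], k1, k2) == reference)
```

The padded variant reproduces the offline result for any split of the signal, whereas isolated windows differ from it near every window boundary.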

3. Experiments
Our experiments are two-fold. First, we evaluate the denoising quality of our network by comparing it with prior-art real-time denoising models (Section 3.2). We then demonstrate the performance advantage of the dynamic sliding window over the conventional fixed-size sliding window approach (Section 3.3).

3.1. Experimental Setup
Datasets. We perform experiments on two publicly available datasets. The first, provided by Xu et al. [5], consists of clean audio selected from AVSPEECH [12] and noise from AudioSet [14] and DEMAND [15]. We call this dataset the ADD dataset. In addition, we evaluate on the Valentini benchmark [16], which contains audio clips from 28 speakers; each clip has a corresponding clean version and noisy version.

Evaluation metrics. To evaluate denoising quality, we use the following widely used objective metrics: (i) STOI: Short-Time Objective Intelligibility [17]; (ii) PESQ: Perceptual Evaluation of Speech Quality (we use the narrow-band version) [18]; (iii) CSIG: MOS predictor of signal distortion [19]; (iv) CBAK: MOS predictor of background-noise intrusiveness [19]; (v) COVL: MOS predictor of overall quality [19]; (vi) SSNR: Segmental Signal-to-Noise Ratio [20].

We use two metrics to evaluate the real-time performance of a network: the average network processing time (D_N) and the maximum audio playback delay (D_A). D_N is defined as $D_N = \frac{1}{M} \sum_{i=1}^{M} t_i$, where M is the total number of sliding windows and t_i is the processing time of window X_i. For the incoming audio samples, D_A measures the maximum delay between the arrival time of an audio sample and its playback time (after denoising). Unless otherwise stated, the denoising network runs on a CPU (3.60 GHz Intel 8-core i7-9700K), and the metrics are measured in milliseconds.

[Table 1]
Denoising quality on the AAD dataset in both the offline (top) and real-time (bottom) settings. All scores are averaged over input audio at SNRs of [-10, -7, -3, 0, 3, 7, 10]. The second-best results are underlined.

3.2. Real-Time Speech Denoising Quality
We compared our denoising model with several recently proposed real-time denoising networks, including Demucs48 [6], FullSub [7], and RNN-Mod [8]. We train these models on the same datasets described above and evaluate their denoising quality in both the real-time and offline settings. In the offline setting, the audio signal is provided all at once, so no sliding window is needed. By comparing the denoising quality of the same network in the two settings, we hope to understand to what extent the sliding window strategy affects denoising quality.

For Demucs48, we use the sliding window implementation that is provided with it. FullSub and RNN-Mod do not provide a streaming implementation, and we found that their denoising quality becomes unstable when a dynamic sliding window is added. For them we therefore adopt a fixed-size sliding window of 80 ms with 16 ms of padding at both the past and future ends. We chose this sliding window setting because our experiments showed that it gives the best possible real-time denoising quality while keeping the delay acceptable.

Table 1 summarizes the evaluation results on the ADD dataset. Our model achieves the best or comparable real-time denoising quality on all quality metrics. It is also worth noting that, thanks to the data padding strategy, our model is the only one whose real-time denoising quality is the same as that of its offline counterpart. For all other models, quality degrades when switching from the offline setting to the real-time setting. The real-time denoising quality results on the Valentini benchmark are reported in Table 2.

3.3. Real-Time Performance
Controlled experiments. First, we measured D_N and D_A for the different network models (see Table 3). Because these models take input data of different lengths, we also measure the network execution time for processing a 200 ms signal at once, without splitting it into multiple windows. All measurements were made without heavy background processes. The results show that, in a dedicated computing environment, our network is as fast as the prior-art models.

[Table 2]
Denoising quality on the Valentini benchmark.

[Table 3]
Timing comparison. In addition to D_N and D_A, we also report the cost of network inference on 200 ms of audio (S_N). The numbers give the mean and standard deviation of the timings.

Next, we conduct a controlled experiment to understand how the dynamic sliding window and the fixed-size sliding window perform in real time when CPU resources fluctuate. To that end, we create two denoising models that use the same denoising network trained on the AAD dataset (Section 2.2). The first uses a dynamic sliding window (the D-model), and the second uses a fixed-size sliding window (the F-model). For a fair comparison, we set the initial window size L_0 of the D-model equal to the fixed window size of the F-model (L_0 = 16 ms). We use the two models to denoise the same set of audio clips, each 5 seconds long. To simulate fluctuations in computational power, after 2 seconds we intentionally slow down the network processing by a factor s, where s is randomly selected from [1, 7]. This simulates CPU occupancy by background processes in a controlled way, because we ensure that the same delay is added to both the D-model and the F-model.

FIG. 3 shows D_N and D_A measured as the audio stream arrives over time. At the beginning, both models run smoothly in real time (with playback lag D_A < 100 ms). After 2 seconds, the computational power starts to fluctuate, so some sliding windows cannot be processed in time. As a result, the playback lag D_A of the F-model accumulates and the output audio playback becomes choppy. In contrast, D_A of the D-model remains stable, because the model dynamically increases the window size to catch up with the delay. This experiment confirms our theoretical analysis in equations (1) and (2).

FIG. 3 schematically compares the real-time performance of the dynamic and fixed sliding window approaches. After 2 seconds, we artificially increase the network processing time by a random factor between 1 and 7 to simulate fluctuations in CPU power. The curves are averaged over 100 trials with a fixed random seed.
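The qualitative behavior of this controlled experiment can be reproduced with a toy simulation. The sketch below is not the measurement code and does not implement equations (1) and (2); it rests on several assumptions introduced here: the processing cost of a window is a fixed overhead plus a per-millisecond term (`C0_MS`, `C1`), the slowdown factor s is redrawn every 100 ms of wall-clock time so that both models see the same schedule, and the lag of a window is taken as its finish time minus the arrival time of its first sample. The dynamic window follows the buffering rule stated in the claims: the next window holds whatever audio arrived while the previous window was being denoised.

```python
import random

L0_MS = 16.0          # initial / fixed window size given in the text
DURATION_MS = 5000.0  # each clip is 5 seconds long
ONSET_MS = 2000.0     # the slowdown starts after 2 seconds
SLOT_MS = 100.0       # assumption: s is redrawn every 100 ms
C0_MS, C1 = 6.0, 0.1  # assumption: cost = fixed overhead + per-ms cost


def slowdown(t_ms, seed):
    """Factor s in [1, 7] after the onset, as a function of wall-clock time."""
    if t_ms < ONSET_MS:
        return 1.0
    slot = int(t_ms // SLOT_MS)
    return random.Random(seed * 100003 + slot).uniform(1.0, 7.0)


def cost_ms(window_ms, start_ms, seed):
    """Hypothetical processing time of one window that starts at start_ms."""
    return (C0_MS + C1 * window_ms) * slowdown(start_ms, seed)


def run_fixed(seed):
    """F-model: every window is L0_MS long; late windows queue up."""
    lags, finish, a = [], 0.0, 0.0
    while a + L0_MS <= DURATION_MS:
        b = a + L0_MS
        start = max(b, finish)               # wait for data and for the network
        finish = start + cost_ms(L0_MS, start, seed)
        lags.append(finish - a)
        a = b
    return lags


def run_dynamic(seed):
    """D-model: the next window is the audio buffered during denoising."""
    lags, a, b = [], 0.0, L0_MS
    while b <= DURATION_MS:
        finish = b + cost_ms(b - a, b, seed)
        lags.append(finish - a)
        a, b = b, finish                     # buffering stops when denoising ends
    return lags


seeds = range(100)
fixed_final = sum(run_fixed(s)[-1] for s in seeds) / len(seeds)
dynamic_worst = max(max(run_dynamic(s)) for s in seeds)
print(f"F-model mean final lag: {fixed_final:.0f} ms")
print(f"D-model worst lag:      {dynamic_worst:.0f} ms")
```

Under this toy cost model the D-model lag stays bounded only because the fixed overhead is amortized over longer windows (here 7 * C1 < 1). If the cost grew in proportion to the window length with a factor of 1 or more, enlarging the window would not help.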

Real-world experiments. We next examine the real-time performance of our model in real-world scenarios, where background processes can preempt CPU cycles. Here, we denoise the same set of audio while different software runs in the background: 4K video playback, a Zoom meeting, video editing in iMovie, and the video game Apex. To run Apex, we use a Windows 10 PC with an 8-core Intel CPU (3.60 GHz i7-9700K) and a GPU (NVIDIA GeForce RTX 2070 SUPER); the other software is tested on a MacBook Pro (2.3 GHz Intel Quad-Core i5). We chose these applications because they demand many CPU cycles.

[Table 4]
Real-time denoising lag of Demucs48 and our model while other background software is running. Each cell shows D_A (D_N) as mean and standard deviation. The numbers were measured using 20 seconds of audio.

We measure D_N and D_A for our model and for Demucs48 while running each of these applications individually. We compare against Demucs48 because it has its own streaming implementation and provides denoising quality comparable to ours. The results are reported in Table 4. When the background process is computationally intensive (e.g., iMovie or Apex), the playback lag of Demucs48 increases dramatically, whereas the lag of our model remains small and stable. This is clear evidence of the performance advantage of our model and its dynamic sliding window strategy.

4. Conclusion
We have proposed a dynamic sliding window strategy for real-time speech denoising. Through careful analysis and experiments, we have demonstrated its advantages over the widely used fixed-size sliding window strategy. By using a dynamic sliding window, our denoising network achieves real-time denoising quality comparable to the state of the art (SOTA) while keeping the lag short. Notably, this enables our model to run robustly in real-world scenarios where other background tasks are present.

FIG. 6 schematically shows an example of a hardware configuration of a computer 1200 functioning as a data processing device 100, a CEP device 200, a batch processing device 300, or a selection device 400. A program installed on the computer 1200 can cause the computer 1200 to function as one or more "units" of the device according to the present embodiment, to perform operations associated with the device or with the one or more "units", and/or to perform a process according to the present embodiment or steps of that process. Such a program can be executed by the CPU 1212 to cause the computer 1200 to perform specific operations associated with some or all of the blocks of the flowcharts and block diagrams described herein.

本実施形態に係るコンピュータ１２００は、ＣＰＵ１２１２、ＲＡＭ１２１４、及びグラフィックコントローラ１２１６を含み、これらはホストコントローラ１２１０を介して互いに接続されている。コンピュータ１２００はまた、通信インタフェース１２２２、記憶装置１２２４、ＤＶＤドライブ］、及びＩＣカードドライブなどの入出力ユニットを含み、これらは入出力コントローラ１２２０を介してホストコントローラ１２１０に接続されている。ＤＶＤドライブは、ＤＶＤ－ＲＯＭドライブ、ＤＶＤ－ＲＡＭドライブなどであり得る。記憶装置１２２４は、ハードディスクドライブ、ソリッドステートドライブなどであり得る。コンピュータ１２００はまた、ＲＯＭ１２３０やキーボードなどのレガシー入出力ユニットを含み、これらは入出力チップ１２４０を介して入出力コントローラ１２２０に接続される。 The computer 1200 according to this embodiment includes a CPU 1212, a RAM 1214, and a graphics controller 1216, which are connected to each other via a host controller 1210. The computer 1200 also includes input/output units such as a communication interface 1222, a storage device 1224, a DVD drive, and an IC card drive, which are connected to the host controller 1210 via an input/output controller 1220. The DVD drive may be a DVD-ROM drive, a DVD-RAM drive, or the like. The storage device 1224 may be a hard disk drive, a solid state drive, or the like. The computer 1200 also includes legacy input/output units such as a ROM 1230 and a keyboard, which are connected to the input/output controller 1220 via an input/output chip 1240.

ＣＰＵ１２１２は、ＲＯＭ１２３０及びＲＡＭ１２１４内に格納されたプログラムに従って動作し、それにより各ユニットを制御する。グラフィックコントローラ１２１６は、ＲＡＭ１２１４又はそれ自体において提供されたフレームバッファなどにある、ＣＰＵ１２１２により発せられたイメージデータを取得し、イメージデータがディスプレイデバイス１２１８に表示されるようにする。 The CPU 1212 operates according to the programs stored in the ROM 1230 and the RAM 1214, thereby controlling each unit. The graphics controller 1216 acquires image data issued by the CPU 1212, which is in the RAM 1214 or a frame buffer provided in the graphics controller 1216 itself, and causes the image data to be displayed on the display device 1218.

通信インタフェース１２２２は、ネットワークを介して他の電子デバイスと通信する。記憶装置１２２４は、コンピュータ１２００においてＣＰＵ１２１２により使用されるプログラム及びデータを格納する。ＤＶＤドライブは、ＤＶＤ－ＲＯＭなどからプログラム又はデータを読み取って記憶装置１２２４にプログラム又はデータを提供する。ＩＣカードドライブは、ＩＣカードからプログラム及びデータを読み取り、及び／又はＩＣカードにプログラム及びデータを書き込む。 The communication interface 1222 communicates with other electronic devices via a network. The storage device 1224 stores programs and data used by the CPU 1212 in the computer 1200. The DVD drive reads programs or data from a DVD-ROM or the like and provides the programs or data to the storage device 1224. The IC card drive reads programs and data from an IC card and/or writes programs and data to the IC card.

The ROM 1230 stores a boot program and the like executed by the computer 1200 at startup, and/or programs that depend on the hardware of the computer 1200. The input/output chip 1240 can also connect various input/output units to the input/output controller 1220 via USB ports, parallel ports, serial ports, keyboard ports, mouse ports, and the like.

プログラムは、ＤＶＤ－ＲＯＭ又はＩＣカードなどのコンピュータ可読記憶媒体によって提供される。プログラムは、コンピュータ可読記憶媒体から読み取られ、同じくコンピュータ可読記憶媒体の例である記憶装置１２２４、ＲＡＭ１２１４、又はＲＯＭ１２３０にインストールされ、ＣＰＵ１２１２によって実行される。プログラムに書き込まれている情報処理は、コンピュータ１２００により読み取られ、その結果、プログラム及び上記の様々なタイプのハードウェアリソースの間で協働する。装置又は方法は、コンピュータ１２００の使用に応じて情報の演算又は処理を実装することによって構成され得る。 The program is provided by a computer-readable storage medium such as a DVD-ROM or an IC card. The program is read from the computer-readable storage medium, installed in storage device 1224, RAM 1214, or ROM 1230, which are also examples of computer-readable storage media, and executed by CPU 1212. Information processing written in the program is read by computer 1200, resulting in cooperation between the program and the various types of hardware resources described above. The apparatus or method can be configured by implementing calculations or processing of information in accordance with the use of computer 1200.

For example, when communication takes place between the computer 1200 and an external device, the CPU 1212 can execute a communication program loaded into the RAM 1214 and instruct the communication interface 1222 to perform communication processing based on the processing written in the communication program. Under the control of the CPU 1212, the communication interface 1222 reads transmission data stored in a transmission buffer area provided in a recording medium such as the RAM 1214, the storage device 1224, a DVD-ROM, or an IC card, and transmits the read data to the network, or writes data received from the network to a reception buffer area or the like provided in the recording medium.

また、ＣＰＵ１２１２は、記憶装置１２２４、ＤＶＤドライブ（ＤＶＤ－ＲＯＭ）、ＩＣカードなどのような外部記録媒体に格納されたファイル又はデータベースの全部又は必要な部分がＲＡＭ１２１４に読み取られるようにし、ＲＡＭ１２１４上のデータに対し様々なタイプの処理を実行してよい。次に、ＣＰＵ１２１２は、外部記録媒体に処理済みのデータを書き込んで戻すことができる。 The CPU 1212 may also cause all or a necessary portion of a file or database stored in an external recording medium such as the storage device 1224, a DVD drive (DVD-ROM), an IC card, etc. to be read into the RAM 1214, and perform various types of processing on the data on the RAM 1214. The CPU 1212 may then write the processed data back to the external recording medium.

様々なタイプのプログラム、データ、表、データベースなどの様々なタイプの情報が、記録媒体に格納されて情報処理される。ＣＰＵ１２１２は、ＲＡＭ１２１４に結果を戻して書き込むために、ＲＡＭ１２１４から読み取られたデータについて様々なタイプの処理を行うことができ、その処理は、本開示全体に記載され、プログラムの連続する命令により特定され、様々なタイプの動作、情報処理、条件の判断、条件的分岐、条件のない分岐、情報の検索／置換などを含む。また、ＣＰＵ１２１２は、記録媒体内のファイル、データベースなどにおける情報を検索してよい。例えば、各々が第２の属性の属性値に関連付けられる第１の属性の属性値を有する複数のエントリが、記録媒体に格納されるとき、ＣＰＵ１２１２は、複数のエントリから、第１の属性の属性値が指定された条件に適合するエントリを検索して、エントリに格納されている第２の属性の属性値を読み取り、それにより、所定の条件を満たした第１の属性と関連付けられる第２の属性の属性値を取得することができる。 Various types of information, such as various types of programs, data, tables, databases, etc., are stored in the recording medium and processed. The CPU 1212 can perform various types of processing on the data read from the RAM 1214 to write the results back to the RAM 1214, which processing is described throughout this disclosure and specified by the sequential instructions of the program, and includes various types of operations, information processing, conditional judgment, conditional branching, unconditional branching, information search/replacement, etc. The CPU 1212 may also search for information in a file, database, etc. in the recording medium. For example, when multiple entries each having an attribute value of a first attribute associated with an attribute value of a second attribute are stored in the recording medium, the CPU 1212 can search for an entry whose attribute value of the first attribute meets a specified condition from the multiple entries, read the attribute value of the second attribute stored in the entry, and thereby obtain the attribute value of the second attribute associated with the first attribute that meets the specified condition.

上で説明したプログラム又はソフトウエアモジュールは、コンピュータ１２００上又はコンピュータ１２００近傍のコンピュータ可読記憶媒体に格納されてよい。また、専用の通信ネットワーク又はインターネットに接続されるサーバシステムに設けられるハードディスク又はＲＡＭなどの記録媒体が、コンピュータ可読記憶媒体として使用でき、それにより、ネットワークを介してコンピュータ１２００にプログラムを提供する。 The above-described programs or software modules may be stored on a computer-readable storage medium on the computer 1200 or in the vicinity of the computer 1200. In addition, a recording medium such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet can be used as a computer-readable storage medium, thereby providing the programs to the computer 1200 via the network.

本実施形態のフローチャート及びブロック図のブロックは、動作が行われるか又は装置の「ユニット」が動作の実行を担うプロセスの段階を表し得る。特定の段階及び「ユニット」は、専用回路、コンピュータ可読記憶媒体に格納されたコンピュータ可読命令が供給されるプログラマブル回路、及び／又はコンピュータ可読記憶媒体に格納されたコンピュータ可読命令が供給されるプロセッサによって実装され得る。専用回路は、デジタル及び／又はアナログのハードウェア回路を含むことができ、集積回路（ＩＣ）及び／又はディスクリート回路を含むことができる。例えば、プログラマブル回路は、再構成可能なハードウェア回路を含み得、例えば論理のＡＮＤ、ＯＲ、ＸＯＲ、ＮＡＮＤ、ＮＯＲ、及び他の論理動作、フリップフロップ、レジスタ、及びメモリ要素、例えばフィールドプログラマブルゲートアレイ（ＦＰＧＡ）、プログラマブルロジックアレイ（ＰＬＡ）などが挙げられる。 The blocks of the flowcharts and block diagrams of the present embodiment may represent stages of a process in which an operation is performed or a "unit" of an apparatus is responsible for performing an operation. Particular stages and "units" may be implemented by dedicated circuitry, programmable circuitry provided with computer-readable instructions stored on a computer-readable storage medium, and/or a processor provided with computer-readable instructions stored on a computer-readable storage medium. Dedicated circuitry may include digital and/or analog hardware circuitry, and may include integrated circuits (ICs) and/or discrete circuits. For example, programmable circuitry may include reconfigurable hardware circuitry, such as logical AND, OR, XOR, NAND, NOR, and other logic operations, flip-flops, registers, and memory elements, such as field programmable gate arrays (FPGAs), programmable logic arrays (PLAs), and the like.

コンピュータ可読記憶媒体は、適切なデバイスによって実行される命令を格納できる任意の有形のデバイスを含むことができ、結果として、命令が格納されたコンピュータ可読記憶媒体は、フローチャート又はブロック図で指定された操作を実行するための手段を作成するために実行することができる命令を含む製品が含まれる。コンピュータ可読記憶媒体の例は、電子記憶媒体、磁気記憶媒体、光記憶媒体、電磁記憶媒体、半導体記憶媒体などを含むことができる。コンピュータ可読記憶媒体のより具体的な例は、フロッピー（登録商標）ディスク、ディスケット、ハードディスク、ランダムアクセスメモリ（ＲＡＭ）、リードオンリメモリ（ＲＯＭ）、消去可能プログラマブルリードオンリメモリ（ＥＰＲＯＭ又はフラッシュメモリ）、電気的に消去可能プログラマブルリードオンリメモリ（ＥＥＰＲＯＭ）、スタティックランダムアクセスメモリ（ＳＲＡＭ）、コンパクトディスクリードオンリメモリ（ＣＤ－ＲＯＭ）、デジタルバーサタイルディスク（ＤＶＤ）、ブルーレイ（登録商標）ディスク、メモリスティック、集積回路カード、などを含み得る。 A computer-readable storage medium may include any tangible device capable of storing instructions that are executed by a suitable device, and as a result, a computer-readable storage medium having instructions stored thereon includes a product that includes instructions that can be executed to create a means for performing the operations specified in the flowchart or block diagram. Examples of computer-readable storage media may include electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, and the like. More specific examples of computer-readable storage media may include floppy (registered trademark) disks, diskettes, hard disks, random access memories (RAMs), read-only memories (ROMs), erasable programmable read-only memories (EPROMs or flash memories), electrically erasable programmable read-only memories (EEPROMs), static random access memories (SRAMs), compact disk read-only memories (CD-ROMs), digital versatile disks (DVDs), Blu-ray (registered trademark) disks, memory sticks, integrated circuit cards, and the like.

コンピュータ可読命令は、アセンブラ命令、命令セットアーキテクチャ（ＩＳＡ）命令、マシン命令、マシン依存命令、マイクロコード、ファームウェア命令、状態設定データ、又はＳｍａｌｌｔａｌｋ（登録商標）、ＪＡＶＡ（登録商標）、Ｃ＋＋などのようなオブジェクト指向プログラミング言語、及び「Ｃ」プログラミング言語又は同様のプログラミング言語のような従来の手続き型プログラミング言語を含む、１又は複数のプログラミング言語の任意の組み合わせで記述されたソースコード又はオブジェクトコードのいずれかを含んでよい。 The computer readable instructions may include either assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk (registered trademark), JAVA (registered trademark), C++, etc., and conventional procedural programming languages such as the "C" programming language or similar programming languages.

コンピュータ可読命令は、汎用コンピュータ、特殊目的のコンピュータ、又は他のプログラム可能なデータ処理装置のプロセッサ、又はプログラマブル回路が、フローチャート又はブロック図で指定された演算を実行するための手段を生成するために当該コンピュータ可読命令を実行すべく、ローカルに又はローカルエリアネットワーク（ＬＡＮ）、インターネットなどのようなワイドエリアネットワーク（ＷＡＮ）を介して、汎用コンピュータ、特殊目的のコンピュータ、又は他のプログラム可能なデータ処理装置のプロセッサ、又はプログラマブル回路に提供されてよい。プロセッサの例としては、コンピュータプロセッサ、処理ユニット、マイクロプロセッサ、デジタル信号プロセッサ、コントローラ、マイクロコントローラなどを含む。 The computer-readable instructions may be provided to a processor or programmable circuit of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, either locally or over a local area network (LAN), a wide area network (WAN) such as the Internet, so that the processor or programmable circuit of the general-purpose computer, special-purpose computer, or other programmable data processing apparatus executes the computer-readable instructions to generate means for performing the operations specified in the flowcharts or block diagrams. Examples of processors include computer processors, processing units, microprocessors, digital signal processors, controllers, microcontrollers, etc.

実施形態により本発明を説明してきたが、本発明の技術的範囲は上記の実施形態に限定されない。上記実施形態に、多様な変更又は改良を加えることが可能であることが当業者に明らかである。そのような変更又は改良を加えた実施形態はまた、本発明の技術的範囲に含まれ得ることが、特許請求の範囲の記載から明らかである。 Although the present invention has been described with reference to the above embodiments, the technical scope of the present invention is not limited to the above embodiments. It is clear to those skilled in the art that various modifications and improvements can be made to the above embodiments. It is clear from the claims that embodiments with such modifications or improvements can also be included in the technical scope of the present invention.

It should be noted that the processes, such as operations, procedures, steps, and stages, in the devices, systems, programs, and methods shown in the claims, embodiments, and drawings may be executed in any order unless the order is explicitly indicated by terms such as "before" or "prior to", or unless the output of an earlier process is used in a later process. Even if the operational flow in the claims, embodiments, and drawings is described using "first" or "next" for convenience, this does not mean that the processes must be performed in that order.

Claims

An audio signal processing apparatus comprising: an acquisition unit configured to acquire an audio signal in a streaming manner; and a denoising unit configured to denoise the audio signal by using a dynamic sliding window whose input buffer length is not fixed, wherein the denoising unit is configured to buffer a portion of the audio signal in the dynamic sliding window having a first size, then start denoising the portion of the audio signal and buffering a next portion of the audio signal, and then stop the buffering when the denoising is completed.

A program for causing a computer to function as the audio signal processing apparatus according to claim 1.

An audio signal processing method comprising: acquiring an audio signal in a streaming manner; and denoising the audio signal acquired in the acquiring step by using a dynamic sliding window whose input buffer length is not fixed, wherein the denoising step comprises buffering a portion of the audio signal in the dynamic sliding window having a first size, then starting to denoise the portion of the audio signal and to buffer a next portion of the audio signal, and then stopping the buffering when the denoising is completed.