JP7404657B2

JP7404657B2 - Speech recognition device, speech recognition program, and speech recognition method

Info

Publication number: JP7404657B2
Application number: JP2019099690A
Authority: JP
Inventors: 隆矢頭
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2019-05-28
Filing date: 2019-05-28
Publication date: 2023-12-26
Anticipated expiration: 2039-05-28
Also published as: JP2020194093A

Description

The present invention relates to a speech recognition device, a speech recognition program, and a speech recognition method. It can be applied, for example, to a system that emphasizes sound in a specific area and suppresses sound from other areas, such as a sound collection system used in a noisy environment.

When a voice communication system or a speech recognition application is used in a noisy environment, ambient noise mixed in with the desired target speech hinders communication and lowers the speech recognition rate. In environments where multiple sound sources exist, a beamformer (hereinafter "BF") using a microphone array has conventionally been used to separate and collect only the sound arriving from a specific direction, which avoids picking up unwanted sound and yields the desired target sound.

A BF is a technique that forms directivity by using the time differences between the signals arriving at the individual microphones (see Non-Patent Document 1). BFs are broadly divided into two types, additive and subtractive. The subtractive BF in particular has the advantage that it can form directivity with fewer microphones than the additive BF.

FIG. 4 is a block diagram showing the configuration of a subtractive BF 400 with two microphones (M1, M2).

The subtractive BF 400 includes a delay unit 410 and a subtractor 420.

The subtractive BF 400 calculates the time difference with which sound from the target direction (hereinafter "target sound") arrives at the microphones, and the delay unit 410 adds a delay so that the phases of the target sound are aligned. This time difference is calculated by equation (1), where d is the distance between the microphones, c is the speed of sound, τ_i is the amount of delay, and θ_L is the angle from the direction perpendicular to the straight line connecting the microphones to the target direction.
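Equation (1) itself is not reproduced in this text. As a minimal sketch, assuming the usual far-field relation τ = d·sin(θ_L)/c, which is consistent with the symbol definitions above, the delay could be computed as follows. The function name, the 3 cm spacing, and the 340 m/s speed of sound are illustrative assumptions, not values taken from equation (1).

```python
import math

def target_delay(d_m, theta_l_rad, c_mps=340.0):
    """Return the inter-microphone delay in seconds for target angle theta_L.

    Assumed form of equation (1): tau = d * sin(theta_L) / c.
    """
    return d_m * math.sin(theta_l_rad) / c_mps

# Illustrative values: 3 cm spacing, target along the array axis and broadside.
tau_axis = target_delay(0.03, math.pi / 2)   # largest possible delay, d / c
tau_broadside = target_delay(0.03, 0.0)      # no delay needed
print(f"{tau_axis * 1e6:.1f} us, {tau_broadside * 1e6:.1f} us")
```

Under this assumption the delay is bounded by d/c, so a small spacing keeps the required delay well below one millisecond.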

When the null (blind spot) lies on the microphone mc1 side of the midpoint between microphones mc1 and mc2, the delay unit 410 applies the delay to the input signal x_1(t) of microphone mc1. The subtractor 420 then performs the subtraction according to equation (2).

The subtractor 420 can perform the same subtraction in the frequency domain, in which case equation (2) is replaced by equation (3).
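Equations (2) and (3) are not reproduced in this text. The sketch below assumes a common delay-and-subtract form for one frequency bin, M(ω) = X_2(ω) - exp(-jωτ)·X_1(ω), in which the mc1 signal is delayed and subtracted from the mc2 signal as described above. All names and numeric values are hypothetical. It checks that a plane wave arriving from the null direction cancels while one from the opposite direction does not.

```python
import cmath
import math

def subtractive_bf_bin(x1_bin, x2_bin, omega, tau):
    """Delay-and-subtract for one frequency bin (assumed form of equation (3))."""
    return x2_bin - cmath.exp(-1j * omega * tau) * x1_bin

omega = 2 * math.pi * 1000.0   # 1 kHz, illustrative
tau = 0.03 / 340.0             # delay placing the null along the array axis, mc1 side

x1 = 1.0 + 0.0j
# Plane wave from the null direction: reaches mc1 first and mc2 after tau.
x2_from_null = x1 * cmath.exp(-1j * omega * tau)
# Plane wave from the opposite direction: reaches mc2 first and mc1 after tau.
x2_from_front = x1 * cmath.exp(1j * omega * tau)

out_null = subtractive_bf_bin(x1, x2_from_null, omega, tau)
out_front = subtractive_bf_bin(x1, x2_from_front, omega, tau)
print(abs(out_null), abs(out_front))
```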

When θ_L = ±π/2, the resulting directivity is a cardioid unidirectional pattern, as shown in FIG. 5(a); when θ_L = 0 or π, it is a figure-eight bidirectional pattern, as shown in FIG. 5(b). In the following, a filter that forms a unidirectional pattern from the input signals is called a "unidirectional filter", and a filter that forms a bidirectional pattern is called a "bidirectional filter".
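To illustrate how the choice of θ_L shapes the pattern, the sketch below evaluates the output magnitude of a delay-and-subtract BF for a unit plane wave arriving from angle φ (measured from the perpendicular, like θ_L). It assumes the far-field delay d·sin(φ)/c and a frequency-domain subtraction of the form X_2 - exp(-jωτ)·X_1, which gives the gain 2·|sin(ω(τ_φ - τ_L)/2)|. The function name, spacing, and frequency are hypothetical.

```python
import math

def bf_gain(phi, theta_l, d=0.03, c=340.0, freq=1000.0):
    """Magnitude response of a two-microphone subtractive BF (assumed model)."""
    omega = 2 * math.pi * freq
    tau_phi = d * math.sin(phi) / c      # arrival-time difference of the wave
    tau_l = d * math.sin(theta_l) / c    # delay applied by the delay unit
    return 2 * abs(math.sin(omega * (tau_phi - tau_l) / 2))

angles = [-math.pi / 2, 0.0, math.pi / 2]
# theta_L = pi/2: a single null at phi = pi/2 (unidirectional pattern).
unidirectional = [bf_gain(phi, math.pi / 2) for phi in angles]
# theta_L = 0: null at broadside, equal lobes at +/- pi/2 (bidirectional pattern).
bidirectional = [bf_gain(phi, 0.0) for phi in angles]
print(unidirectional, bidirectional)
```

In this model the bidirectional setting places its null on the perpendicular direction, which is why the bidirectional filter output contains mostly sound from other directions.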

The subtractor 420 can also form a sharp directivity in the direction of the null of the bidirectional pattern by using spectral subtraction (hereinafter also "SS"). The directivity obtained by SS is formed according to equation (4), either over all frequencies or over a specified frequency band. Equation (4) uses the input signal X_1 of microphone mc1, but the same effect is obtained with the input signal X_2 of microphone mc2. Here β is a coefficient that adjusts the strength of the SS. If the result of the subtraction in equation (4) is negative, flooring is applied: the value is replaced with 0 or with a scaled-down version of the original value. In this method, the bidirectional filter extracts the sound arriving from directions other than the target direction (hereinafter "non-target sound"), and the amplitude spectrum of the extracted non-target sound is subtracted from the amplitude spectrum of the input signal, which emphasizes the target sound direction.
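Equation (4) is not reproduced in this text. The following sketch assumes the per-bin form |Y| = |X_1| - β·|M|, where |M| is the amplitude spectrum of the bidirectional filter output, and applies the flooring described above. The function name, the `floor_ratio` parameter, and the sample spectra are hypothetical.

```python
def spectral_subtract(input_amp, non_target_amp, beta=1.0, floor_ratio=0.0):
    """Subtract beta * non-target amplitude from the input amplitude per bin.

    Negative results are floored to floor_ratio * original value
    (floor_ratio = 0.0 replaces them with 0).
    """
    output = []
    for x, m in zip(input_amp, non_target_amp):
        y = x - beta * m
        if y < 0.0:
            y = floor_ratio * x   # flooring: 0 or a reduced copy of the input
        output.append(y)
    return output

input_amp = [1.0, 0.5, 0.2]         # |X_1| per frequency bin (made-up values)
non_target_amp = [0.25, 0.5, 0.5]   # |M| per frequency bin (made-up values)

floored_to_zero = spectral_subtract(input_amp, non_target_amp)
floored_to_half = spectral_subtract(input_amp, non_target_amp, floor_ratio=0.5)
print(floored_to_zero, floored_to_half)
```

A larger β removes more of the non-target sound but pushes more bins into the flooring branch, which is one source of the distortion discussed later in the text.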

When only the sound inside a specific area (hereinafter "target area sound") is to be collected, a subtractive BF alone may also pick up sound sources located around that area (hereinafter "non-target area sound"). The technique described in Patent Document 1 therefore proposes a method (area sound collection) that uses a plurality of microphone arrays, points their directivities at the target area from different directions, and collects the target area sound by making the directivities intersect in the target area.

FIG. 6 is an explanatory diagram of the process of collecting target area sound from a sound source in the target area using two microphone arrays MA1 and MA2.

FIG. 6(a) is an explanatory diagram showing a configuration example of the microphone arrays. FIGS. 6(b) and 6(c) are conceptual graphs showing, in the frequency domain, the BF outputs of the microphone arrays MA1 and MA2 of FIG. 6(a), respectively. In area sound collection, as shown in FIG. 6(a), the directivities of microphone arrays MA1 and MA2 are pointed from different directions so that they intersect in the area to be collected (the target area). In the state of FIG. 6(a), the directivity of each of MA1 and MA2 captures not only the sound inside the target area (target area sound) but also noise lying in the direction of the target area (non-target area sound). However, as FIGS. 6(b) and 6(c) show, when the BF outputs of MA1 and MA2 are compared in the frequency domain, the target area sound component is contained in both outputs, whereas the non-target area sound components differ between the arrays. Conventional area sound collection exploits this property: by suppressing every component other than those common to the BF outputs of the two microphone arrays MA1 and MA2, only the target area sound is extracted.

When conventional area sound collection is performed in an environment such as that of FIG. 6(a), here using two microphone arrays as an example, the ratio of the amplitude spectra of the target area sound contained in the BF outputs of the microphone arrays is estimated first and used as a correction coefficient. The correction coefficient for the target area sound amplitude spectrum is calculated by equations (5) and (6) or by equations (7) and (8). Here Y_1k(n) and Y_2k(n) are the amplitude spectra of the BF outputs of microphone arrays MA1 and MA2, N is the total number of frequency bins, k is the frequency index, and α_1(n) and α_2(n) are the amplitude spectrum correction coefficients for the respective BF outputs. "mode" denotes the most frequent value and "median" denotes the median value.
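Equations (5) to (8) are not reproduced in this text. As a hedged sketch, the code below assumes that each correction coefficient is the mode or the median, taken over the frequency bins, of the per-bin ratio between the two BF amplitude spectra. Which ratio corresponds to α_1(n) and which to α_2(n) is not recoverable from the text, so the function simply returns the factor that scales `y_other` to the level of `y_ref`. The rounding used to make the mode meaningful for real-valued ratios is also an assumption.

```python
import statistics

def correction_coefficient(y_ref, y_other, use_mode=False, digits=1):
    """Estimate the factor that scales y_other to the target level in y_ref."""
    ratios = [a / b for a, b in zip(y_ref, y_other) if b > 0.0]
    if use_mode:
        # Real-valued ratios rarely repeat exactly, so quantize before counting.
        return statistics.mode([round(r, digits) for r in ratios])
    return statistics.median(ratios)

# Made-up spectra: the target component in y1 is twice that in y2 in most
# bins; bin 3 is dominated by a non-target sound seen mainly by array MA2.
y1 = [2.0, 4.0, 6.0, 2.0, 9.0]
y2 = [1.0, 2.0, 3.0, 5.0, 4.5]

alpha_median = correction_coefficient(y1, y2)
alpha_mode = correction_coefficient(y1, y2, use_mode=True)
print(alpha_median, alpha_mode)
```

Both statistics ignore the outlying bin, which reflects the idea that the target component is common to both outputs while the non-target components are not.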

Each BF output is then corrected with the correction coefficient and SS is performed, which extracts the non-target area sound lying in the direction of the target area. Performing SS again to remove the extracted non-target area sound from each BF output then yields the target area sound. For example, to extract the non-target area sound N_1(n) lying in the target area direction as seen from microphone array MA1, the BF output Y_2(n) of microphone array MA2 multiplied by the amplitude spectrum correction coefficient is spectrally subtracted from the BF output Y_1(n) of microphone array MA1, as shown in equation (9). Similarly, the non-target area sound N_2(n) lying in the target area direction as seen from microphone array MA2 is extracted according to equation (10).
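Equations (9) and (10) are not reproduced in this text. A minimal sketch of the step described above, assuming the per-bin form N_1 = Y_1 - α·Y_2 with negative results floored to 0, might look like this. The function name, the flooring choice, and the sample spectra are hypothetical, and `alpha` stands for whichever correction coefficient the original equation applies to Y_2(n).

```python
def extract_non_target(y_own, y_other, alpha):
    """Remove the common (target) component: y_own - alpha * y_other per bin."""
    return [max(a - alpha * b, 0.0) for a, b in zip(y_own, y_other)]

# Made-up spectra: bins 0 to 2 contain only target area sound (y1 = 2 * y2);
# in bin 3 array MA1 also receives a non-target sound of amplitude 3.
y1 = [2.0, 4.0, 6.0, 5.0]
y2 = [1.0, 2.0, 3.0, 1.0]

n1 = extract_non_target(y1, y2, alpha=2.0)
print(n1)
```

Swapping the roles of the two outputs, with the reciprocal coefficient, would give the corresponding estimate for microphone array MA2.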

Then, according to equations (11) and (12), the non-target area sound is spectrally subtracted from each BF output to extract the target area sound. Equation (11) extracts the target area sound with microphone array MA1 as the reference, and equation (12) does so with microphone array MA2 as the reference. In equations (11) and (12), γ_1(n) and γ_2(n) are coefficients for changing the strength of the SS.
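Equations (11) and (12) are not reproduced in this text. The sketch below assumes the per-bin form Z_1 = Y_1 - γ_1·N_1 with flooring at 0, and shows how the strength coefficient changes the result. The function name and the sample values are hypothetical.

```python
def extract_target(y_bf, non_target, gamma=1.0):
    """Subtract gamma * non-target amplitude from the BF output per bin."""
    return [max(y - gamma * n, 0.0) for y, n in zip(y_bf, non_target)]

# Made-up values: MA1 BF output and the non-target estimate for MA1.
y1 = [2.0, 4.0, 6.0, 5.0]
n1 = [0.0, 0.0, 0.0, 3.0]

z1 = extract_target(y1, n1, gamma=1.0)          # removes the estimate once
z1_strong = extract_target(y1, n1, gamma=2.0)   # over-subtraction, floored at 0
print(z1, z1_strong)
```

With γ = 1 the bin that contained non-target sound keeps only its target part, whereas a larger γ suppresses that bin completely at the cost of removing target energy as well.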

When the volume level of the background noise or the non-target area sound is high, however, the SS performed in extracting the target area sound may distort the target area sound or generate a grating artifact known as musical noise.

The method of Patent Document 2 therefore adjusts the volume levels of the microphone input signal and of the estimated noise according to the levels of the background noise and the non-target area sound, and mixes them into the extracted target area sound. The musical noise produced by the target area sound extraction becomes stronger as the volume levels of the background noise and the non-target area sound increase, so the total volume level of the mixed input signal and estimated noise is also increased in proportion to those levels. The volume level of the background noise can be calculated from the estimated noise obtained in the course of suppressing the background noise. The volume level of the non-target area sound can be calculated from the combination of the non-target area sound in the target area direction, which is extracted in the course of emphasizing the target area sound, and the non-target area sound lying in other directions.

In the method of Patent Document 2, the ratio between the input signal and the estimated noise to be mixed is determined from the volume levels of the estimated noise and the non-target area sound. If a non-target area sound source is near the target area and the level of the mixed-in input signal is too high, non-target area sound leaks into the target area sound and it becomes impossible to tell which is the target area sound. The method of Patent Document 2 therefore lowers the level of the mixed-in input signal and raises the level of the estimated noise when the non-target area sound is loud. In other words, when there is no non-target area sound or its level is low, the proportion of the input signal is increased; conversely, when the level of the non-target area sound is high, the proportion of the estimated noise is increased.

Used in this way, the method of Patent Document 2 masks the musical noise by mixing the input signal and the estimated noise into the target area sound, so that it is heard as ordinary background noise without sounding unnatural. In addition, the target area sound component contained in the microphone input signal compensates for the distortion of the target area sound and improves the sound quality.

Patent Document 1: Japanese Patent Application Publication No. 2014-072708
Patent Document 2: Japanese Patent Application Publication No. 2017-183902

Non-Patent Document 1: Futoshi Asano, "Acoustic Technology Series 16: Array Signal Processing of Sound - Localization, Tracking and Separation of Sound Sources", edited by the Acoustical Society of Japan, Corona Publishing, February 25, 2011.

Thanks to many years of progress in speech recognition technology, fairly accurate recognition had already become possible in quiet environments. In real environments, however, various noises and interfering speech from the surroundings are mixed into the target speaker's voice and significantly reduce the recognition rate. How to remove background noise from the speech input to the speech recognition engine was therefore an important issue in realizing a voice interface. What was required was preprocessing with a strong noise suppression effect, especially for interfering speech, even at the cost of some deformation or distortion of the target speech.

In recent years, however, the introduction of machine learning, and deep learning in particular, has brought revolutionary advances to speech recognition, and robustness against background noise has improved beyond comparison with earlier systems. For such speech recognition engines, what matters is not giving top priority to noise suppression performance but balancing noise suppression performance against the speech quality after suppression.

Techniques such as area sound collection and BF, as in Patent Documents 1 and 2, suppress ambient noise and improve speech intelligibility, but this does not necessarily translate directly into a better speech recognition rate. How much noise suppression is optimal depends on various factors, such as the usage environment, the type of noise, and the characteristics of the speech recognition engine, and is difficult to determine in general.

There is therefore a need for a speech recognition device, a speech recognition program, and a speech recognition method that perform speech recognition based on sound collection processing that contributes to improving recognition accuracy.

A first aspect of the present invention is characterized by comprising: (1) area sound extraction means for acquiring a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals from the microphone arrays, and for extracting target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs; (2) signal mixing means for performing mixing processing that mixes a mixing signal into the target area sound extracted by the area sound extraction means at a plurality of mixing amounts, thereby generating a mixed sound for each mixing amount; (3) speech recognition means for acquiring the result of speech recognition processing on each mixed sound and performing reliability calculation processing that calculates a reliability for the speech recognition result of each mixed sound; and (4) recognition result selection means for selecting and outputting the speech recognition result with the highest reliability among the speech recognition results obtained by the speech recognition means; wherein (5) the signal mixing means applies to the mixing processing mixing amounts varied over a set number of steps with a set variation width around a set central mixing amount, and generates as many mixed sounds as the number of steps; and (6) the device further comprises mixing content determination means for determining the central mixing amount, the variation width, and the number of steps applied to the signal mixing means according to the selection result of the recognition result selection means.
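The flow of elements (2) to (5) can be sketched as follows. This is only an illustration under stated assumptions: the mixing is modeled as adding the mixing signal scaled by the mixing amount, the mixing amounts are spaced symmetrically around the central amount, and `recognize` is a hypothetical callable that returns a text and a reliability score. None of these details are specified in the text above, and `fake_recognize` is a stand-in, not a real speech recognition engine.

```python
def mixing_amounts(center, width, steps):
    """Return `steps` mixing amounts spaced by `width` around `center`."""
    offset = (steps - 1) / 2
    return [center + (i - offset) * width for i in range(steps)]

def recognize_with_best_mix(area_sound, mix_signal, recognize, center, width, steps):
    """Mix at each amount, recognize each mixed sound, keep the most reliable."""
    best = None
    for amount in mixing_amounts(center, width, steps):
        mixed = [a + amount * m for a, m in zip(area_sound, mix_signal)]
        text, reliability = recognize(mixed)
        if best is None or reliability > best[2]:
            best = (amount, text, reliability)
    return best

def fake_recognize(samples):
    """Stand-in recognizer: reliability peaks when the first sample is 0.2."""
    return "hello", 1.0 - abs(samples[0] - 0.2)

amounts = mixing_amounts(0.25, 0.125, 3)
best_amount, best_text, best_reliability = recognize_with_best_mix(
    [0.0, 0.0], [1.0, 1.0], fake_recognize, center=0.25, width=0.125, steps=3
)
print(amounts, best_amount, best_text)
```

The selected mixing amount is returned together with the text so that a component corresponding to element (6) could use it to choose the next central mixing amount, variation width, and number of steps; how that decision is made is not covered by this sketch.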

A speech recognition program according to a second aspect of the present invention is characterized by causing a computer to function as: (1) area sound extraction means for acquiring a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals from the microphone arrays, and for extracting target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs; (2) signal mixing means for performing mixing processing that mixes a mixing signal into the target area sound extracted by the area sound extraction means at a plurality of mixing amounts, thereby generating a mixed sound for each mixing amount; (3) speech recognition means for acquiring the result of speech recognition processing on each mixed sound and performing reliability calculation processing that calculates a reliability for the speech recognition result of each mixed sound; and (4) recognition result selection means for selecting and outputting the speech recognition result with the highest reliability among the speech recognition results obtained by the speech recognition means; wherein (5) the signal mixing means applies to the mixing processing mixing amounts varied over a set number of steps with a set variation width around a set central mixing amount, and generates as many mixed sounds as the number of steps; and (6) the program causes the computer to function also as mixing content determination means for determining the central mixing amount, the variation width, and the number of steps applied to the signal mixing means according to the selection result of the recognition result selection means.

A third aspect of the present invention is a speech recognition method characterized in that: (1) the method uses area sound extraction means, signal mixing means, speech recognition means, recognition result selection means, and mixing content determination means; (2) the area sound extraction means acquires a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals from the microphone arrays, and extracts target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs; (3) the signal mixing means performs mixing processing that mixes a mixing signal into the target area sound extracted by the area sound extraction means at a plurality of mixing amounts, thereby generating a mixed sound for each mixing amount; (4) the speech recognition means acquires the result of speech recognition processing on each mixed sound and performs reliability calculation processing that calculates a reliability for the speech recognition result of each mixed sound; (5) the recognition result selection means selects and outputs the speech recognition result with the highest reliability among the speech recognition results obtained by the speech recognition means; (6) the signal mixing means applies to the mixing processing mixing amounts varied over a set number of steps with a set variation width around a set central mixing amount, and generates as many mixed sounds as the number of steps; and (7) the mixing content determination means determines the central mixing amount, the variation width, and the number of steps applied to the signal mixing means according to the selection result of the recognition result selection means.

According to the present invention, speech recognition can be performed based on sound collection processing that contributes to improving recognition accuracy.

FIG. 1 is a block diagram showing the functional configuration of the speech recognition device according to the first embodiment.

FIG. 2 is a block diagram showing an example of the hardware configuration of the speech recognition device according to the first embodiment.

FIG. 3 is a block diagram showing the functional configuration of the speech recognition device according to the second embodiment.

FIG. 4 is a block diagram showing the configuration of a conventional subtractive BF (with two microphones).

FIG. 5 is an explanatory diagram showing examples of the directional filters formed by a conventional subtractive BF (with two microphones).

FIG. 6 is an explanatory diagram showing a configuration example of a conventional sound collection device in which the beamformer (BF) directivities of two microphone arrays are pointed at the target area from different directions.

(A) First Embodiment
A first embodiment of the speech recognition device, speech recognition program, and speech recognition method according to the present invention is described in detail below with reference to the drawings. This embodiment describes an example in which the present invention is applied to a sound collection system.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the overall configuration of a speech recognition device 200 according to the first embodiment.

Based on the input signals supplied from the microphone array unit 100, the speech recognition device 200 selectively collects the target area sound originating in the target area (the voice of a speaker in the target area), performs speech recognition processing on the collected target area sound (for example, converting the collected speech to text), and outputs the speech recognition result.

The microphone array unit 100 captures sound in a region that includes the sound collection area, using a plurality of microphone arrays.

In this embodiment, the microphone array unit 100 includes two microphone arrays MA1 and MA2.

Microphone arrays MA1 and MA2 are each placed at an arbitrary location in the space containing the target area. They may be positioned anywhere relative to the target area as long as their directivities overlap only in the target area. For example, MA1 and MA2 may be placed facing each other across the target area. The number of microphone arrays in the microphone array unit 100 is not limited to two; when there are multiple target areas, enough microphone arrays are placed to cover all of them.

Each of microphone arrays MA1 and MA2 can be configured with two or more microphones. In this embodiment, each is described as having two microphones: microphone array MA1 has microphones mc1 and mc2, and microphone array MA2 has microphones mc3 and mc4. In this example, the distance between the two microphones in each of MA1 and MA2 is 3 cm.

Next, the internal configuration of the speech recognition device 200 is described.

As shown in FIG. 1, the speech recognition device 200 includes a signal input unit 201, a time/frequency conversion unit 202, a directivity forming unit 203, an area sound extraction unit 204, a signal mixing unit 205, a frequency/time conversion unit 206, an amplitude spectrum ratio calculation unit 207, a speech section detection unit 208, a speech recognition unit 209, and a recognition result selection unit 210. Each of these elements is described in detail later.

The speech recognition device 200 may be implemented, for example, by having a computer equipped with a processor, memory, and the like execute a program (including the speech recognition program according to the embodiment); even in that case, it can be represented functionally as in FIG. 1. The processing of each component of the speech recognition device 200 is described in detail later.

FIG. 2 is a block diagram showing an example of the hardware configuration of the speech recognition device 200. The reference numerals in parentheses in FIG. 2 are those used in the second embodiment, described later.

FIG. 2 shows the configuration used when the speech recognition device 200 is implemented in software (on a computer).

As hardware components, the speech recognition device 200 shown in FIG. 2 includes at least the signal input unit 201 and a computer 500 on which a program (including the speech recognition program of the embodiment) is installed.

The signal input unit 201 can be configured using, for example, an A/D converter. If the computer 500 itself has a built-in A/D converter, the signal input unit 201 need not be provided separately.

The computer 500 applies area sound collection processing to the acoustic signals (digital acoustic signals) supplied from the signal input unit 201 and outputs the result. In this embodiment, a program (software) including the sound collection program of this embodiment is assumed to be installed on the computer 500.

The computer 500 may be dedicated to the sound collection program, or it may be shared with programs for other functions (for example, the recording device 300).

The computer 500 shown in FIG. 2 has a processor 501, a primary storage unit 502, and a secondary storage unit 503. The primary storage unit 502 serves as the working memory of the processor 501; a fast memory such as DRAM (Dynamic Random Access Memory) is used, for example. The secondary storage unit 503 stores various data such as the OS (Operating System) and program data (including the data of the sound collection program according to the embodiment); a nonvolatile memory such as flash memory or an HDD is used, for example. In the computer 500 of this embodiment, when the processor 501 starts up, it reads the OS and programs (including the sound collection program according to the embodiment) stored in the secondary storage unit 503, loads them into the primary storage unit 502, and executes them.

The specific configuration of the computer 500 is not limited to that of FIG. 2, and various configurations can be applied. For example, if the primary storage unit 502 is a nonvolatile memory (for example, flash memory), the secondary storage unit 503 may be omitted.

(A-2) Operation of the First Embodiment
Next, the operation of the speech recognition device 200 of the first embodiment, configured as described above, is explained.

The signal input unit 201 converts the acoustic signals picked up by the microphones mc1 to mc4 of the microphone arrays MA1 and MA2 from analog to digital and supplies them to the time/frequency conversion unit 202. Below, the digital acoustic signals picked up by the microphones mc1 to mc4 (hereinafter also called "input signals") are denoted x1 to x4, respectively.

The time/frequency conversion unit 202 converts the supplied microphone signals from the time domain to the frequency domain. Below, the frequency-domain versions of the input signals x1 to x4 are denoted X1 to X4, respectively.

The directivity forming unit 203 forms directivity by BF in accordance with equations (3) and (4) above, using the frequency-domain input signals (X1 to X4) of the microphones produced by the time/frequency conversion unit 202. Below, the BF output of the microphone array MA1 is denoted Y1 and that of the microphone array MA2 is denoted Y2.

The area sound extraction unit 204 performs SS on the BF outputs Y1 and Y2 generated by the directivity forming unit 203 in accordance with equation (9) or (10), and extracts the non-target area sound present in the direction of the target area. The area sound extraction unit 204 then extracts the target area sound Z by subtracting the extracted noise from each BF output by SS in accordance with equation (11) or (12).

The area sound collection processing up to this point extracts only the sound generated in the area where the target sound exists, so that even when the surroundings contain noise that is unwanted for speech recognition, only the target speech (the speech uttered by a speaker in the target area) is obtained. In noise suppression processing in general, the amount of noise suppression and the sound quality are in a trade-off relationship: increasing the suppression increases the distortion. Area sound collection is an excellent method that can emphasize only the sound generated in the target area, but, as with ordinary noise suppression, a stronger emphasis effect brings correspondingly more distortion. Consequently, if the target sound signal extracted by area sound collection is supplied as is to the recording device 300 (speech recognition unit 301), a high recognition rate may not be obtained.

As described above, mixing an input signal component into the area sound output Z as a mixing signal reduces the distortion of the target speech caused by area sound collection processing. The speech recognition device 200 of this embodiment therefore mixes a certain amount of the input signal, as the mixing signal, into the area sound output Z to reduce the distortion of the target sound component in Z.

Increasing the amount of input signal mixed into the area sound output Z (the level of the mixed input signal) reduces the distortion of the target sound component, but in exchange reduces the amount of noise suppression. The proportion in which the input signal should be mixed into the area sound output depends on many factors, such as the volume of the target sound, the amount and type of noise, the environment in which the device is used, and the characteristics of the speech recognition engine, so it is difficult to fix a single value. The signal mixing unit 205 of this embodiment therefore generates a plurality of mixed sounds M that differ in the mixing amount of the input signal (the level of the mixed input signal).

Specifically, the signal mixing unit 205 generates Q mixed sounds M (M1 to MQ) with different mixing amounts (Q is an integer of 2 or more) as the mixed sounds finally output as the area sound collection result for the target area sound. Each of the mixed sounds M1 to MQ is the area sound output Z mixed with the input signal attenuated by a different attenuation amount At (At1 to AtQ) relative to the original input signal. Below, Q is called the "number of stages".

In this embodiment, the number of stages Q is taken to be 7. The attenuation amounts At1 to At7 are set in -5 dB steps over the range -5 dB to -35 dB (seven stages, 5 dB apart). That is, At1 to At7 are set to -5 dB, -10 dB, -15 dB, -20 dB, -25 dB, -30 dB, and -35 dB, respectively. In this case At1 (= -5 dB) is the smallest attenuation (largest mixing amount) and At7 (= -35 dB) is the largest attenuation (smallest mixing amount).

The signal mixing unit 205 then generates the mixed sounds M1 to M7 by mixing into the area sound output Z the input signal, as the mixing signal, attenuated by each of the attenuation amounts At1 to At7.
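The generation of the seven mixed sounds can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the mix is assumed to be a simple per-bin addition of the attenuated input spectrum to the area sound output, the dB values are assumed to be amplitude levels (gain = 10^(At/20)), and the function names and sample spectra are hypothetical.

```python
def attenuation_levels(first_db=-5.0, step_db=-5.0, stages=7):
    """Return the attenuation amounts At1..AtQ in dB."""
    return [first_db + step_db * q for q in range(stages)]


def db_to_gain(db):
    """Convert a level in dB to a linear amplitude gain (assumed 20*log10 scale)."""
    return 10.0 ** (db / 20.0)


def mix(area_spectrum, input_spectrum, attenuation_db):
    """Add the attenuated input spectrum to the area sound output, bin by bin."""
    gain = db_to_gain(attenuation_db)
    return [z + gain * x for z, x in zip(area_spectrum, input_spectrum)]


levels = attenuation_levels()              # At1..At7
Z = [0.2, 0.5, 0.1]                        # hypothetical area sound output (3 bins)
X = [1.0, 1.0, 1.0]                        # hypothetical input signal (3 bins)
mixed = [mix(Z, X, at) for at in levels]   # M1..M7, one list per attenuation
```

With these assumptions, M1 contains the largest share of the input signal and M7 the smallest, matching the ordering described above.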

The frequency/time conversion unit 206 converts the mixed sounds M1 to M7, whose distortion has been reduced by mixing the input signal (input sound) into the area sound output Z, into time-domain signals (mixed sounds) m1 to m7 and supplies them to the speech recognition unit 209.

In the speech recognition device 200, the voice section detection unit 208 detects voice sections before the mixed sounds m1 to m7 are fed to the speech recognition unit 209.

The method by which the voice section detection unit 208 determines whether target area sound is present in the sound collection area is not limited, and various methods can be applied. For example, the voice section detection unit 208 can use a method based on the amplitude spectrum ratio between the area sound collection output and the input sound (see Reference 1 below), or a method based on the coherence between the BF outputs used in area sound collection. In this embodiment, the voice section detection unit 208 is described as determining the presence or absence of target area sound in the sound collection area by the method based on the amplitude spectrum ratio.
Reference 1: Japanese Unexamined Patent Application Publication No. 2016-127457

The amplitude spectrum ratio calculation unit 207 obtains the input signal from the time/frequency conversion unit 202 and the area sound output Z from the area sound extraction unit 204, and calculates the amplitude spectrum ratio R.

For example, the amplitude spectrum ratio calculation unit 207 uses equation (13) or (14) to calculate, for each frequency, the amplitude spectrum ratio (R_1 or R_2) between the area sound output (Z_1 or Z_2) and the input signal. In equations (13) and (14), Win_1 is the amplitude spectrum of the input signal of the microphone array MA1, and Win_2 is the amplitude spectrum of the input signal of the microphone array MA2. Any of the microphones making up the microphone arrays MA1 and MA2 may be used to calculate Win_1 and Win_2. Here, Win_1 is calculated from the input signal X1 of the microphone mc1, and Win_2 from the input signal X3 of the microphone mc3. Z_1 is the amplitude spectrum of the area sound output when area sound collection is performed with the microphone array MA1 as the main array (that is, using equation (11) above), and Z_2 is the amplitude spectrum of the area sound output when it is performed with the microphone array MA2 as the main array (that is, using equation (12) above).

The amplitude spectrum ratio calculation unit 207 then uses equation (15) or (16) to sum the amplitude spectrum ratios over all frequencies and obtain the amplitude spectrum ratio sum (U_1 or U_2).

Here, U_1 obtained by equation (15) is the sum of the per-frequency amplitude spectrum ratios R_1i over the band from the lower frequency limit j to the upper frequency limit k, and U_2 obtained by equation (16) is the sum of the per-frequency amplitude spectrum ratios R_2i over the same band. The amplitude spectrum ratio calculation unit 207 may restrict the frequency band used in this calculation. For example, the calculation may be limited to 100 Hz to 6 kHz, a band that contains sufficient speech information.
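Equations (13) to (16) themselves are not reproduced in this text. Based only on the verbal definitions above, they can plausibly be written as below. The direction of the ratio (area sound output divided by input) and the per-frequency index i on Z and Win are assumptions, not taken from the original drawings.

```latex
\begin{align}
R_{1i} &= \frac{Z_{1i}}{W_{\mathrm{in}\,1i}} \tag{13} \\
R_{2i} &= \frac{Z_{2i}}{W_{\mathrm{in}\,2i}} \tag{14} \\
U_{1} &= \sum_{i=j}^{k} R_{1i} \tag{15} \\
U_{2} &= \sum_{i=j}^{k} R_{2i} \tag{16}
\end{align}
```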

The amplitude spectrum ratio calculation unit 207 then supplies the calculated U_1 or U_2 to the voice section detection unit 208 as U.
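A band-limited computation of U might look like the following sketch. It assumes the ratio is the area sound output divided by the input amplitude, that bin i corresponds to frequency i × bin_hz, and that a small floor guards against division by zero; the function name, parameters, and sample values are illustrative only.

```python
def amplitude_ratio_sum(area_amp, input_amp, bin_hz,
                        low_hz=100.0, high_hz=6000.0, eps=1e-12):
    """Sum the per-bin ratio area/input over the bins inside [low_hz, high_hz]."""
    total = 0.0
    for i, (z, w) in enumerate(zip(area_amp, input_amp)):
        freq = i * bin_hz                  # centre frequency of bin i (assumed layout)
        if low_hz <= freq <= high_hz:
            total += z / max(w, eps)       # floor avoids division by zero
    return total


# Four hypothetical bins at 0, 100, 200 and 300 Hz; the 0 Hz bin is outside the band.
U = amplitude_ratio_sum([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0], bin_hz=100.0)
```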

When the area sound extraction unit 204 calculates the target area sound Z by area sound collection with the BF output Y_1(n) of the microphone array MA1 as the main output (that is, using equation (11) above), the amplitude spectrum ratio calculation unit 207 preferably outputs U_1, calculated by equation (15), as the amplitude spectrum ratio sum U. Likewise, when the area sound extraction unit 204 calculates the target area sound Z with the BF output Y_2(n) of the microphone array MA2 as the main output (that is, using equation (12) above), the amplitude spectrum ratio calculation unit 207 preferably outputs U_2, calculated by equation (16), as the amplitude spectrum ratio sum U.

The voice section detection unit 208 compares the amplitude spectrum ratio sum U supplied from the amplitude spectrum ratio calculation unit 207 with a preset threshold and determines whether target area sound (speech) is present in the target area. Common techniques may be used to decide the voice section. For example, to distinguish speech from sudden noise, the unit may regard a period as a voice section (a period during which a speaker in the target area is speaking) only when area sound has been present for at least a certain time. To distinguish the end of an utterance from a temporary silence such as a plosive or a breath, it may continue to treat a certain time after silence is detected as part of the voice section. The voice section detection unit 208 detects voice sections through this processing and supplies the voice section detection result S to the speech recognition unit 209.
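The threshold comparison with a minimum duration and a hold time after silence can be sketched per frame as follows. This is one common way to realize such rules, not the method fixed by the text; the frame counts `min_active` and `hangover`, the function name, and the sample values are assumptions for illustration.

```python
def detect_voice_sections(u_values, threshold, min_active=3, hangover=2):
    """Return one boolean S per frame: True while a voice section is active."""
    flags = []
    active_run = 0     # consecutive frames with U above the threshold
    silence_run = 0    # consecutive frames at or below the threshold inside a section
    in_voice = False
    for u in u_values:
        if u > threshold:
            active_run += 1
            silence_run = 0
            if active_run >= min_active:
                in_voice = True            # sound lasted long enough to count as speech
        else:
            active_run = 0
            if in_voice:
                silence_run += 1
                if silence_run > hangover:
                    in_voice = False       # silence outlasted the hold time
        flags.append(in_voice)
    return flags


# A single-frame burst at index 1 is ignored; the run starting at index 3 is accepted.
S = detect_voice_sections([0, 5, 0, 5, 5, 5, 0, 0, 0, 0], threshold=1)
```

In this sketch the section is flagged from the frame at which the minimum duration is reached; marking the earlier frames of the run retroactively would require buffering.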

When the voice section detection unit 208 detects that target area sound is present in the target area, it outputs "true" as the voice section detection result S; when it detects that no target area sound is present in the target area, it outputs "false" as S.

The speech recognition unit 209 attempts speech recognition separately on each of the mixed sounds m1 to m7, whose mixing levels differ in steps, and obtains the result of speech recognition for each (hereinafter called the "recognition result") A1 to A7, together with values Re1 to Re7 that quantify how reliable each of the recognition results A1 to A7 is (hereinafter called the "recognition reliability"). The processing that obtains the recognition reliability is hereinafter called the "recognition reliability calculation processing".

The specific technique by which the speech recognition unit 209 performs speech recognition (for example, converting speech to text, so-called "speech to text" processing) on each of the mixed sounds m1 to m7 to generate the recognition results A1 to A7 is not limited, and various techniques can be applied.

The technique used for the recognition reliability calculation processing, in which the speech recognition unit 209 analyzes each of the recognition results A1 to A7 to calculate the recognition reliabilities Re1 to Re7, is likewise not limited, and various techniques can be applied. For example, the speech recognition unit 209 may use the techniques of References 2 and 3 below. The interval at which the speech recognition unit 209 calculates the recognition reliability (hereinafter called the "reliability calculation interval") is also not limited. For example, the speech recognition unit 209 may calculate the recognition reliability at fixed time intervals.
Reference 2: Japanese Unexamined Patent Application Publication No. 2005-148342
Reference 3: Japanese Unexamined Patent Application Publication No. 2010-175807

For each voice section (a section in which S = true), the recognition result selection unit 210 selects the recognition result with the highest reliability and outputs it as the final recognition result As. The recognition result selection unit 210 may, for example, switch the output among the recognition results A1 to A7 to the one with the highest recognition reliability at every reliability calculation interval.
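Selecting the final result from the candidates amounts to an argmax over the recognition reliabilities. The following sketch uses hypothetical recognition strings and reliability values; ties are resolved in favour of the earliest candidate, which is a choice made here and not stated in the text.

```python
def select_result(results, reliabilities):
    """Return (final result As, index) for the candidate with the highest reliability."""
    best = max(range(len(results)), key=lambda q: reliabilities[q])
    return results[best], best


# Hypothetical recognition results A1..A3 and reliabilities Re1..Re3.
final, index = select_result(["helo", "hello", "hollow"], [0.2, 0.9, 0.5])
```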

(A-3) Effects of the First Embodiment
The first embodiment provides the following effects.

The speech recognition device 200 of the first embodiment generates a plurality of mixed sounds with different mixing amounts (attenuation amounts; mixing levels), calculates the reliability of the speech recognition result for each mixed sound, and outputs the speech recognition result with the highest reliability as the final recognition result. This allows the speech recognition device 200 of the first embodiment to use the mixing amount best suited to speech recognition in a variety of usage environments. As a result, the speech recognition device 200 improves the accuracy of speech recognition.

(B) Second Embodiment
A second embodiment of the speech recognition device, speech recognition program, and speech recognition method according to the present invention is described in detail below with reference to the drawings. This embodiment describes an example in which the present invention is applied to a sound collection system.

(B-1) Configuration of the Second Embodiment
FIG. 3 is a block diagram showing the overall configuration of the speech recognition device 200A of the second embodiment.

In FIG. 3, parts that are the same as or correspond to those in FIG. 1 above are given the same or corresponding reference numerals. The second embodiment is described below with a focus on its differences from the first embodiment.

The speech recognition device 200A of the second embodiment differs from the first embodiment in that the signal mixing unit 205 and the recognition result selection unit 210 are replaced by a signal mixing unit 205A and a recognition result selection unit 210A, and in that a mixing level determination unit 211 is added.

In the first embodiment, the signal mixing unit 205 used a fixed set of mixing amounts (attenuation amounts) for the input signal. The second embodiment differs in that it includes the mixing level determination unit 211, which adaptively determines the mixing amounts (attenuation amounts) applied by the signal mixing unit 205A on the basis of the recognition results of the speech recognition unit 209. The specific processing of the signal mixing unit 205A, the recognition result selection unit 210A, and the mixing level determination unit 211 (the differences from the first embodiment) is described later.

(B-2) Operation of the Second Embodiment
Next, the operation of the speech recognition device 200A of the second embodiment, configured as described above, is explained.

The operations of the microphone array unit 100, the signal input unit 201, the time/frequency conversion unit 202, the directivity forming unit 203, and the area sound extraction unit 204 are the same as in the first embodiment, so their description is omitted.

In the first embodiment, seven different mixing levels were set. A larger number of stages gives finer steps and allows a more suitable mixing amount to be selected, but it increases the amount of recognition processing per voice section and leads to problems such as a larger device and processing delay. Reducing the number of stages simplifies the processing, but because fewer mixing amounts can be set, the accuracy with which the appropriate amount is found decreases. In this embodiment, therefore, the mixing amount of the input signal relative to the area sound output is determined adaptively on the basis of the recognition results of the speech recognition processing.

As in the first embodiment, the signal mixing unit 205A mixes the input signal into the area sound output Z, but the mixing levels (the attenuation amounts of the mixed input signal) are determined by the mixing level determination unit 211.

As an example, assume that the mixing level determination unit 211 initially uses seven mixing levels (number of stages Q = 7), as in the first embodiment, with the attenuation amounts At1 to At7 for the input signal set in -5 dB steps from -5 dB to -35 dB (-5 dB, -10 dB, -15 dB, -20 dB, -25 dB, -30 dB, -35 dB). Below, the difference in attenuation between adjacent stages when setting the attenuation amounts At1 to AtQ is called the "change width". For example, when the attenuation amounts At1 to At7 are set in -5 dB steps over the range -5 dB to -35 dB as above, the change width is -5 dB. For simplicity, the change width is treated as constant in this description, but it need not be constant.

As in the first embodiment, the mixed sounds m1 to m7 converted to the time domain by the frequency/time conversion unit 206 are input to the speech recognition unit 209.

As in the first embodiment, the speech recognition unit 209 performs speech recognition separately on each of the mixed sounds m1 to m7, whose mixing levels differ in steps, and obtains the recognition results A1 to A7 and the recognition reliabilities Re1 to Re7 for each.

As in the first embodiment, the recognition result selection unit 210A selects, for each voice section (a section in which S = true), the recognition result with the highest reliability and outputs it as the final recognition result As.

Next, the method by which the mixing level determination unit 211 adapts the mixing amounts is described.

The following two methods, for example, can be used by the mixing level determination unit 211 to adapt the mixing amounts (hereinafter called "mixing amount adaptation methods").

The first mixing amount adaptation method keeps the number of stages Q unchanged but narrows the range, so that the mixing amounts are set more finely (the change width is reduced). The second mixing amount adaptation method keeps the change width per stage unchanged but reduces the number of stages Q to lighten the processing.

[First mixing amount adaptation method]
First, the case in which the first mixing amount adaptation method (refining the mixing amounts) is applied to the mixing level determination unit 211 is described in detail.

Suppose that, starting from the initial state, the speech recognition unit 209 performs speech recognition on the mixed sounds m1 to m7 and the recognition reliability Re4 of the mixed sound m4 turns out to be the highest. This result is supplied from the recognition result selection unit 210A to the mixing level determination unit 211. On receiving it, the mixing level determination unit 211 reduces the change width of the mixing amounts applied to subsequent area sound outputs Z, centering them on the attenuation amount that had the highest reliability last time (-20 dB). In this case, the mixing level determination unit 211 resets the attenuation amounts At1 to At7 with -20 dB at the center (At4 = -20 dB) and with the change width halved to -2.5 dB. The attenuation amounts At1 to At7 thus become -12.5 dB, -15 dB, -17.5 dB, -20 dB, -22.5 dB, -25 dB, and -27.5 dB, respectively. Thereafter, the mixing level determination unit 211 refines the change width further in the same way on the basis of each subsequent recognition result, so that the mixing level converges toward the optimum value.

The mixing level determination unit 211 may continue adapting the mixing amounts (refining the change width) only until a predetermined condition is met. For example, it may continue until the change width reaches a predetermined value (hereinafter called the "minimum change width") or until the recognition reliability obtained by the speech recognition unit 209 reaches or exceeds a predetermined value.

The mixing level determination unit 211 performs the processing of the first mixing amount adaptation method as described above.
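One refinement step of the first method can be sketched as follows: the levels are recentred on the best attenuation and the change width is halved, down to a minimum change width. The function names and the value chosen for the minimum change width are assumptions for illustration.

```python
def centered_levels(center_db, step_db, stages):
    """Return `stages` attenuation values spaced by step_db around center_db."""
    half = stages // 2
    return [center_db + step_db * (q - half) for q in range(stages)]


def refine(center_db, step_db, stages, min_step_db=0.625):
    """Halve the change width unless that would go below the minimum change width."""
    new_step = step_db / 2.0
    if abs(new_step) < min_step_db:
        new_step = step_db                 # stop refining at the minimum change width
    return centered_levels(center_db, new_step, stages), new_step


# Best result was m4 (At4 = -20 dB) with a change width of -5 dB and Q = 7.
levels, step = refine(center_db=-20.0, step_db=-5.0, stages=7)
# levels is [-12.5, -15.0, -17.5, -20.0, -22.5, -25.0, -27.5] and step is -2.5
```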

[Second mixing amount adaptation method]
Next, the case in which the second mixing amount adaptation method is applied to the mixing level determination unit 211 is described in detail.

Suppose again that, starting from the initial state, the speech recognition unit 209 performs speech recognition on the mixed sounds m1 to m7 and the recognition reliability Re4 of the mixed sound m4 turns out to be the highest. On receiving this result, the mixing level determination unit 211 may reduce the number of stages Q from 7 to 5 for the mixing amounts applied to subsequent area sound outputs Z, centering them on the attenuation amount of the mixed sound m4 that had the highest reliability last time (-20 dB). Because the central attenuation amount is easier to set when the number of stages Q is odd, the mixing level determination unit 211 preferably changes Q in steps of 2. The mixing level determination unit 211 may then reduce the number of stages Q further in the same way on the basis of each subsequent recognition result, reducing the amount of processing.

混合レベル決定部２１１は、所定の条件となるまでを限度として、混合量の適応化（段階数Ｑの低減）を行うようにしてもよい。この場合、混合レベル決定部２１１は、例えば、段階数Ｑが所定の段階数（以下、「最低段階数」と呼ぶ）となった時点で、混合量の適応化（段階数Ｑの低減）を終了するようにしてもよい。 The mixing level determination unit 211 may adapt the mixing amount (reduce the number of stages Q) only until a predetermined condition is met. In this case, the mixing level determination unit 211 may, for example, end the adaptation of the mixing amount (the reduction of the number of stages Q) when the number of stages Q reaches a predetermined number of stages (hereinafter, the "minimum number of stages").

以上のように、混合レベル決定部２１１は、第２の混合量適応化方法の処理を行う。 As described above, the mixture level determination unit 211 performs the process of the second mixture amount adaptation method.
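
The second mixing amount adaptation method can be illustrated with a short sketch that lays out Q mixing amounts symmetrically around a central attenuation and then shrinks Q by 2. The helper names `mixing_amounts` and `reduce_stages`, the 5 dB change width, the ascending ordering of the attenuations, and the minimum of 3 stages are assumptions made for this example only; the text above fixes just the -20 dB center and the reduction from 7 to 5 stages.

```python
def mixing_amounts(center_db, width_db, stages):
    """Return `stages` attenuations (dB) spaced `width_db` apart around center_db.

    An odd number of stages places center_db exactly in the middle of the list.
    """
    half = stages // 2
    return [center_db + (k - half) * width_db for k in range(stages)]


def reduce_stages(stages, min_stages=3):
    """Reduce the number of stages Q by 2, keeping it odd.

    Assumption: min_stages stands in for the "minimum number of stages".
    """
    return max(stages - 2, min_stages)


# Initial pass: 7 stages centered on -20 dB (5 dB change width is assumed).
first = mixing_amounts(-20.0, 5.0, 7)
# Next pass: same center (the previously most reliable mixed sound), 5 stages.
second = mixing_amounts(-20.0, 5.0, reduce_stages(7))
```

With these assumed values, the second pass evaluates five mixed sounds instead of seven, which is where the reduction in processing comes from.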

上記では、第１、第２の混合量適応化方法で、適応対象のパラメータとして、変化幅又は段階数Ｑを適応化する方法を示したが、混合レベル決定部２１１は、どちらか一方に限定することなく双方を適応化してもよい。 The first and second mixing amount adaptation methods above adapt either the change width or the number of stages Q as the parameter to be adapted, but the mixing level determination unit 211 is not limited to one or the other and may adapt both.

上記の例では、適応対象のパラメータ（変化幅、段階数Ｑ）を減らす方向のみについて説明したが、１方向（減らす方向）だけでは、認識信頼度が局所値に陥り値が動かなくなってしまう。したがって、適応には、パラメータ（変化幅、段階数Ｑ）を増やす方向も備える必要がある。増やす側の評価指標として、たとえば認識結果の信頼度Ｒを用いることができる。認識結果選択部２１０Ａにおいて、認識結果の中で最も信頼度が高く最終の認識結果として選択された信頼度の値が、一定の水準に達しない場合、混合レベル決定部２１１は、混合量の変化幅、あるいは段階数を増やす方向の調整を行なうようにしてもよい。このとき、混合レベル決定部２１１は、変化幅については一度に２ずつ変動させ、変化幅については２倍ずつ変動させるようにしてもよい。 The above examples describe only the direction in which the parameters to be adapted (change width, number of stages Q) are reduced, but with only one direction (reduction), the recognition reliability can become trapped at a local optimum and stop changing. The adaptation therefore also needs a direction in which the parameters (change width, number of stages Q) are increased. As an evaluation index for the increasing direction, the reliability R of the recognition result can be used, for example. When the reliability of the result that the recognition result selection unit 210A selected as the final recognition result, that is, the highest reliability among the recognition results, does not reach a certain level, the mixing level determination unit 211 may adjust the change width of the mixing amount or the number of stages in the increasing direction. In doing so, the mixing level determination unit 211 may change the number of stages by 2 at a time and the change width by a factor of 2 at a time.

なお、混合レベル決定部２１１は、上記のように適応対象のパラメータ（変化幅、段階数Ｑ）を１度に変動する量（以下、「適応速度」と呼ぶ）を一定としてもよいし、変動させるようにしてもよい。例えば、混合レベル決定部２１１は、認識信頼度Ｒを適応速度の調整に用いるようにしてもよい。すなわち、混合レベル決定部２１１は、認識信頼度Ｒの高さ（例えば、最も高かった認識信頼度Ｒの値）に応じて適応速度を変化させるようにしてもよい。例えば、混合レベル決定部２１１は、認識信頼度Ｒ（例えば、最も高かった認識信頼度Ｒの値）が非常に高い（低い）場合は変化幅や段階数を大きく減らし（増やし）、やや高い（低い）程度では、増減幅を小さくするなどが考えられる。 Note that the mixing level determination unit 211 may keep the amount by which the parameters to be adapted (change width, number of stages Q) are changed at one time (hereinafter, the "adaptation speed") constant, or may vary it. For example, the mixing level determination unit 211 may use the recognition reliability R to adjust the adaptation speed. That is, it may change the adaptation speed according to the level of the recognition reliability R (for example, the highest recognition reliability R). For example, when the recognition reliability R (for example, the highest recognition reliability R) is very high (very low), the mixing level determination unit 211 may greatly reduce (increase) the change width or the number of stages, and when it is only somewhat high (low), it may make the reduction (increase) smaller.
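
The two-directional adaptation with a reliability-dependent adaptation speed can be sketched as below. This is only one possible reading of the description: the function name `adapt`, all threshold values, the parameter limits, and the choice to adjust the number of stages and the change width together are assumptions for illustration. The text itself states only that the number of stages may move by 2 and the change width by a factor of 2, and that the step may be larger when the reliability is very high or very low.

```python
def adapt(stages, width_db, best_r,
          low=0.5, high=0.8, very_low=0.3, very_high=0.95,
          min_stages=3, max_stages=9, min_width_db=1.25, max_width_db=10.0):
    """Return the (stages, width_db) pair to use for the next recognition pass.

    best_r is the highest recognition reliability R of the current pass.
    All thresholds and limits are hypothetical values chosen for this sketch.
    """
    if best_r >= high:
        # Reliable result: narrow the search. Take a double step if very high.
        steps = 2 if best_r >= very_high else 1
        stages -= 2 * steps
        width_db /= 2 ** steps
    elif best_r < low:
        # Unreliable result: widen the search. Take a double step if very low.
        steps = 2 if best_r < very_low else 1
        stages += 2 * steps
        width_db *= 2 ** steps
    # Between low and high, the parameters are left unchanged.
    stages = min(max(stages, min_stages), max_stages)
    width_db = min(max(width_db, min_width_db), max_width_db)
    return stages, width_db
```

Because the number of stages always changes by a multiple of 2 and both limits are odd, Q stays odd, so a central mixing amount always exists.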

（Ｂ－３）第２の実施形態の効果 (B-3) Effects of the second embodiment
第２の実施形態によれば、以下のような効果を奏することができる。 According to the second embodiment, the following effects can be achieved.

第２の実施形態の音声認識装置２００Ａでは、認識結果に基づいて混合量を適応的に最適値に調整・決定しているため、非常に精度の高い混合量の決定、あるいは少ない処理量での混合量の決定が可能となる。 In the speech recognition device 200A of the second embodiment, the mixing amount is adaptively adjusted to and determined at an optimum value based on the recognition results, so the mixing amount can be determined with very high accuracy or with a small amount of processing.

（Ｃ）他の実施形態 (C) Other embodiments
本発明は、上記の各実施形態に限定されるものではなく、以下に例示するような変形実施形態も挙げることができる。 The present invention is not limited to the above embodiments, and modified embodiments such as those exemplified below are also possible.

（Ｃ－１）上記の各実施形態では、音声認識装置２００、２００Ａ自体が音声認識部２０９を有しており、自装置が有する音声認識部２０９を用いて音声認識処理の結果を取得しているが、音声認識装置２００、２００Ａ自体が音声認識部２０９を備えず、外部の音声認識手段を用いて音声認識処理の結果を取得するようにしてもよい。 (C-1) In each of the above embodiments, the speech recognition devices 200 and 200A themselves have the speech recognition section 209, and obtain the results of speech recognition processing using the speech recognition section 209 of the own device. However, the speech recognition devices 200 and 200A may not themselves include the speech recognition unit 209 and may use an external speech recognition means to obtain the results of speech recognition processing.

１０…音声認識部、１００…マイクアレイ部、２００、２００Ａ…音声認識装置、２０１…信号入力部、２０２…周波数変換部、２０３…指向性形成部、２０４…エリア音抽出部、２０５、２０５Ａ…信号混合部、２０６…時間変換部、２０７…振幅スペクトル比算出部、２０８…音声区間検出部、２０９…音声認識部、２１０、２１０Ａ…認識結果選択部、２１１…混合レベル決定部、３００…記録装置、３０１…音声認識部、４１０…遅延器、４２０…減算器、５００…コンピュータ、５０１…プロセッサ、５０２…一次記憶部、５０３…二次記憶部、ＭＡ１、ＭＡ２…マイクアレイ、ｍｃ１～ｍｃ４…マイクロホン。 DESCRIPTION OF SYMBOLS 10... Voice recognition part, 100... Microphone array part, 200, 200A... Voice recognition device, 201... Signal input part, 202... Frequency conversion part, 203... Directivity formation part, 204... Area sound extraction part, 205, 205A... Signal mixing section, 206... Time conversion section, 207... Amplitude spectrum ratio calculation section, 208... Speech section detection section, 209... Speech recognition section, 210, 210A... Recognition result selection section, 211... Mixing level determination section, 300... Recording Apparatus, 301... Speech recognition unit, 410... Delay unit, 420... Subtractor, 500... Computer, 501... Processor, 502... Primary storage unit, 503... Secondary storage unit, MA1, MA2... Microphone array, mc1 to mc4... Microphone.

Claims

A speech recognition device comprising:
area sound extraction means for acquiring a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals input from the microphone arrays, and for extracting a target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs;
signal mixing means for performing a mixing process of mixing a mixing signal, at a plurality of mixing amounts, into the target area sound extracted by the area sound extraction means to generate a mixed sound for each mixing amount;
speech recognition means for acquiring a result of speech recognition processing for each of the mixed sounds and performing a reliability calculation process of calculating the reliability of the speech recognition processing result of each of the mixed sounds; and
recognition result selection means for selecting and outputting the most reliable speech recognition processing result from among the speech recognition processing results obtained by the speech recognition means,
wherein the signal mixing means applies to the mixing process mixing amounts varied over a set number of stages at a set change width around a set central mixing amount, and generates mixed sounds for the number of stages, and
the speech recognition device further comprises mixture content determining means for determining the central mixing amount, the change width, and the number of stages to be applied to the signal mixing means according to the selection result of the recognition result selection means.

The speech recognition device according to claim 1, further comprising speech section detection means for detecting a speech section in which a voice uttered by a speaker is occurring in the target area,
wherein the speech recognition means performs the speech recognition processing and the reliability calculation process only while a speech section is detected by the speech section detection means.

2. The speech recognition device, wherein the mixture content determining means applies the mixing amount selected by the recognition result selection means as a new central mixing amount, and increases or decreases the change width and/or the number of stages.

A speech recognition program causing a computer to function as:
area sound extraction means for acquiring a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals input from the microphone arrays, and for extracting a target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs;
signal mixing means for performing a mixing process of mixing a mixing signal, at a plurality of mixing amounts, into the target area sound extracted by the area sound extraction means to generate a mixed sound for each mixing amount;
speech recognition means for acquiring a result of speech recognition processing for each of the mixed sounds and performing a reliability calculation process of calculating the reliability of the speech recognition processing result of each of the mixed sounds; and
recognition result selection means for selecting and outputting the most reliable speech recognition processing result from among the speech recognition processing results obtained by the speech recognition means,
wherein the signal mixing means applies to the mixing process mixing amounts varied over a set number of stages at a set change width around a set central mixing amount, and generates mixed sounds for the number of stages, and
the program further causes the computer to function as mixture content determining means for determining the central mixing amount, the change width, and the number of stages to be applied to the signal mixing means according to the selection result of the recognition result selection means.

A speech recognition method performed by a device having area sound extraction means, signal mixing means, speech recognition means, recognition result selection means, and mixture content determining means, wherein:
the area sound extraction means acquires a beamformer output of each of a plurality of microphone arrays capable of directing directivity toward a target area, based on input signals input from the microphone arrays, and extracts a target area sound by performing area sound collection processing for the target area using the acquired beamformer outputs;
the signal mixing means performs a mixing process of mixing a mixing signal, at a plurality of mixing amounts, into the target area sound extracted by the area sound extraction means, and generates a mixed sound for each mixing amount;
the speech recognition means acquires a result of speech recognition processing for each of the mixed sounds, and performs a reliability calculation process of calculating the reliability of the speech recognition processing result of each of the mixed sounds;
the recognition result selection means selects and outputs the most reliable speech recognition processing result from among the speech recognition processing results obtained by the speech recognition means;
the signal mixing means applies to the mixing process mixing amounts varied over a set number of stages at a set change width around a set central mixing amount, and generates mixed sounds for the number of stages; and
the mixture content determining means determines the central mixing amount, the change width, and the number of stages to be applied to the signal mixing means according to the selection result of the recognition result selection means.