JP3097764B2

JP3097764B2 - Voice input device with guidance voice

Info

Publication number: JP3097764B2
Application number: JP03063444A
Authority: JP
Inventors: 芳夫中▲台▼; 義武鈴木
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1991-03-27
Filing date: 1991-03-27
Publication date: 2000-10-10
Anticipated expiration: 2015-10-10
Also published as: JPH04299410A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a voice input device with guidance voice, which outputs a guidance voice to a speaker to prompt the speaker to speak and then receives the speaker's utterance.

[0002]

[Prior Art] In a speech recognition device or the like that is used hands-free, that is, with the voice input device placed away from the speaker, it is important to add a guidance signal generator that makes the timing of speech clear to the speaker. Doing so reduces the operating time of the recognition device and the load on its algorithm, and it also gives the speaker a sense of familiarity with the device. Because a speech recognition device is generally used as a substitute for a manually operated input device such as a keyboard, and because it is desirable for the speaker to receive feedback on the utterance result and understand it immediately, a guidance signal that appeals to hearing is preferable to one that appeals to sight or touch. In other words, guidance by voice is generally the desirable form for a guidance signal generator.

[0003] With voice guidance, however, the guidance voice itself is a signal in the same audio frequency band as the speaker's voice. If the speaker speaks toward the device while the guidance voice is still being produced, the voice signal input unit receives the guidance voice together with the speaker's voice, so a voice pattern with the guidance voice superimposed on it is input to the device. This causes recognition errors in the recognition device.

[0004] Four approaches to this problem are conceivable.

The first is to reject voice input while the guidance voice is being produced. This, however, imposes a timing constraint: the speaker must not speak until the guidance voice has completely ended. Moreover, if the speaker does speak during the guidance voice, the input voice during the guidance period cannot be analyzed, so the beginning of the word is cut off (so-called word-head truncation), which causes recognition errors in the recognition device.

The second is to keep the volume of the guidance voice low. This approach includes stopping the guidance voice when the algorithm of the voice input device detects the speaker's voice. However, for a recognition device used in a high-noise environment, lowering the guidance volume is impractical because the guidance must remain clearly audible. Stopping the guidance voice on detection of speech also causes the same word-head truncation as the first approach unless the speech detection algorithm is perfect, and when the device is used in a reverberant room it is technically difficult to eliminate the reverberant component of the guidance voice completely.

The third is to acoustically isolate the loudspeaker that emits the guidance voice from the voice input device. As in a telephone handset, the acoustic path through which the speaker hears the guidance is separated from the acoustic path through which the speaker's voice is input. To achieve this separation, however, the guidance generator must be held close to the speaker, who then has the inconvenience of wearing the device.

The fourth is to remove the superimposed guidance voice from the input voice signal after the fact. For example, with an adaptive prediction filter, the original guidance signal is stored in the device in advance, and the guidance voice signal and the acoustic signals that accompany it are estimated from the original signal and removed from the voice signal received by the voice input device. If the estimate of the guidance signal is imperfect, however, this distorts the waveform of the speaker's voice signal and degrades the recognition rate of the recognition device. In addition, learning the coefficients of the adaptive prediction filter requires a guidance signal for training and a training period, separate from the guidance that prompts the speaker to speak.

Thus, with any of these four approaches it is difficult to separate the guidance voice from the input voice without adversely affecting the recognition device and without burdening the speaker.

[0005]

[Problem to Be Solved by the Invention] Each of the approaches described above has its own drawback, and when the speaker's utterance is superimposed on the guidance voice, applying the voice input device to a recognition device results in misrecognition.

[0006] Of the approaches above, the fourth was noted to be problematic because it requires prior learning by the adaptive prediction filter and because it distorts the speaker's own voice when the filter's estimate is imperfect. This problem can be solved, however, by limiting the frequency band of the guidance voice to be removed and performing the removal only within that band, which leaves the voice waveform in the other frequency bands undistorted. Furthermore, if the removal is implemented with a band elimination filter alone, without an adaptive prediction filter, no prior learning is needed. The voice obtained in this way, with only the frequency band of the guidance voice removed, differs in sound quality from the voice the speaker actually utters. If the removed band is limited to one that causes no perceptual problem, however, the voice reproduced from the voice input device changes little to the ear. When this voice is used in a speech recognition device, the recognition rate is unlikely to degrade as long as the standard patterns used for recognition are also built from voice that has passed through the same band elimination filter. This approach does lower the quality of the guidance voice itself, but raising the guidance volume mitigates the difficulty of hearing it. By separating the frequency bands used for the guidance voice and for the input voice in this way, a voice input device with guidance voice can be realized in which the input voice is largely unaffected even when the guidance voice is superimposed on it. Improving the fourth approach in this way resolves the drawbacks described above.

[0007] The present invention has been made in view of the above, and its object is to provide a voice input device with guidance voice that can extract only a voice signal free of guidance voice components even when the speaker's utterance is superimposed on the guidance voice.

[0008]

[Means for Solving the Problem] To achieve the above object, the voice input device with guidance voice of the present invention comprises: signal generating means for generating a guidance signal that prompts a speaker to speak; a band-pass filter that extracts a signal in a specific frequency band from the output signal of the signal generating means; a loudspeaker that emits the output signal of the band-pass filter as an audio signal so that the speaker can hear it; signal input means that receives the voice uttered by the speaker in response to the audio signal and outputs it as an electric signal; and a band elimination filter that extracts, from the output signal of the signal input means, the signal outside the specific frequency band.

[0009]

[Operation] In the voice input device with guidance voice of the present invention, the guidance signal is passed through a band-pass filter and thereby limited to a specific frequency band, while a band elimination filter removes the same frequency band from the voice that the speaker utters in response to the guidance voice. As a result, even when the guidance voice and the speaker's voice signal overlap in time, a voice signal containing no guidance signal component can be obtained.

[0010]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0011] FIG. 1 is a block diagram showing the configuration of a voice input device with guidance voice according to one embodiment of the present invention. In the figure, guidance signal source 11 is a signal source that produces the guidance voice prompting speaker 19 to speak, for example voice recorded on a tape recorder or electrically synthesized voice. Band-pass filter 12 is a filter intended to pass only signals in a specific frequency band; its pass band is, for example, 1 kHz to 2 kHz, and it need not be limited to a single band. Amplifier 13 amplifies the signal that has passed through band-pass filter 12 without distortion. Loudspeaker 14 emits the output signal of amplifier 13 into the room so that speaker 19 can hear it. Microphone 15 picks up the voice of speaker 19. Band elimination filter 16 is a filter intended to remove signals in the frequency band passed by band-pass filter 12 and to pass signals in the other frequency bands; its pass bands are, for example, 100 Hz to 1 kHz and 2 kHz to 5 kHz. Voice output unit 17 outputs the voice signal that has passed through band elimination filter 16 and is, for example, an electrical terminal, a tape recorder, or a magnetic memory. Switch 18 is the switch with which the speaker requests operation of the device of this embodiment and is, for example, an electrical contact or an optical sensor.
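The example bands given for band-pass filter 12 and band elimination filter 16 are complementary: the guidance occupies 1 kHz to 2 kHz, and the input path keeps 100 Hz to 1 kHz and 2 kHz to 5 kHz. The following is a minimal sketch, not part of the patent, of how such a band plan could be checked for overlap. The names `GUIDANCE_PASS_BANDS_HZ`, `INPUT_PASS_BANDS_HZ`, `bands_overlap`, and `plans_are_disjoint` are hypothetical and introduced only for illustration.

```python
# Hypothetical band plan using the example figures from the embodiment.
GUIDANCE_PASS_BANDS_HZ = [(1000.0, 2000.0)]                 # band-pass filter 12
INPUT_PASS_BANDS_HZ = [(100.0, 1000.0), (2000.0, 5000.0)]   # band elimination filter 16


def bands_overlap(a, b):
    """Return True when two (low, high) bands share more than a band edge."""
    return max(a[0], b[0]) < min(a[1], b[1])


def plans_are_disjoint(guidance_bands, input_bands):
    """Return True when no guidance band overlaps any input pass band."""
    return not any(bands_overlap(g, i) for g in guidance_bands for i in input_bands)


print(plans_are_disjoint(GUIDANCE_PASS_BANDS_HZ, INPUT_PASS_BANDS_HZ))
```

Because the text notes that the guidance pass band need not be a single band, the check takes lists of bands on both sides.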

[0012] The operation of the embodiment of FIG. 1 is described below.

[0013] When speaker 19 wishes to input voice to the device, the speaker first operates switch 18 to issue an operation request. In response, so that speaker 19 can confirm that the request has been accepted, the device generates a voice signal such as "Please input your voice" from guidance signal source 11. Band-pass filter 12 limits this signal to the specific frequency band of 1 kHz to 2 kHz, and the signal is emitted from loudspeaker 14 via amplifier 13. In response to the operation request from switch 18, microphone 15, band elimination filter 16, and voice output unit 17 also start operating and wait for voice input from the speaker. The waiting time is finite, for example several seconds to several tens of seconds. If speaker 19 speaks while the guidance signal is being emitted from loudspeaker 14, or after it has been emitted, the voice is picked up by microphone 15 and passes through band elimination filter 16, which removes the specific frequency band of 1 kHz to 2 kHz. Voice output unit 17 then outputs a voice signal that does not contain the guidance signal band.
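The signal path described above can be illustrated numerically. The sketch below is not the patent's implementation: it assumes ideal (brick-wall) filters realized by zeroing DFT bins, an 8 kHz sampling rate, and pure tones standing in for the guidance and for the speaker's voice. All names (`FS`, `N`, `BAND`, `tone`, `mix`, `dft`, `idft`, `ideal_filter`) are hypothetical. It checks that the output of the band elimination stage is the same whether or not the band-limited guidance overlaps the speech.

```python
import cmath
import math

FS = 8000                 # sampling rate in Hz (assumed for this sketch)
N = 200                   # 25 ms of signal, so DFT bins fall every 40 Hz
BAND = (1000.0, 2000.0)   # guidance band taken from the embodiment


def tone(freq_hz):
    """Unit-amplitude sine sampled at FS; freq_hz should sit on a DFT bin."""
    return [math.sin(2 * math.pi * freq_hz * t / FS) for t in range(N)]


def mix(*signals):
    """Sample-by-sample sum of equally long signals."""
    return [sum(samples) for samples in zip(*signals)]


def dft(x):
    """Naive discrete Fourier transform (adequate for a short sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def ideal_filter(x, keep_inside):
    """Brick-wall filter: keep only the bins inside BAND, or only those outside."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        freq = min(k, n - k) * FS / n          # fold negative-frequency bins
        inside = BAND[0] <= freq <= BAND[1]
        if inside != keep_inside:
            spectrum[k] = 0
    return idft(spectrum)


# Guidance source with one component inside the band and one outside it.
guidance_source = mix(tone(1400), tone(3000))
emitted = ideal_filter(guidance_source, keep_inside=True)     # band-pass filter 12

# Stand-in for the speaker's voice: three tones, one of them inside the band.
speech = mix(tone(600), tone(1600), tone(2600))
microphone = mix(emitted, speech)                             # overlap in time

with_guidance = ideal_filter(microphone, keep_inside=False)   # band elimination filter 16
without_guidance = ideal_filter(speech, keep_inside=False)

residual = max(abs(a - b) for a, b in zip(with_guidance, without_guidance))
print(f"largest difference caused by the guidance: {residual:.2e}")
```

With ideal filters the printed difference is on the order of floating-point round-off. A real analog or digital filter has finite stop-band attenuation and transition bands, so some guidance energy would remain; the sketch only shows the principle. Note also that the 1600 Hz component of the stand-in speech is lost, which is the sound-quality cost the description acknowledges.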

[0014] FIG. 2 is a block diagram of another embodiment of the present invention, in which the voice input device is applied to a speech recognition device.

[0015] In FIG. 2, guidance signal source 21, like the guidance signal source of FIG. 1, is a signal source that produces the guidance voice prompting speaker 33 to speak. In addition, it has the function of, for example, outputting the label of the recognition result, based on the output of label output unit 31 described later, so that speaker 33 can confirm it. Band-pass filter 22, amplifier 23, loudspeaker 24, microphone 25, and band elimination filter 26 have the same functions as band-pass filter 12, amplifier 13, loudspeaker 14, microphone 15, and band elimination filter 16 of FIG. 1, respectively. Voice analysis unit 27 calculates parameters representing spectral-domain features from the voice signal output by band elimination filter 26, using, for example, short-time LPC cepstrum analysis. Feature storage unit 28 stores the time series of parameters calculated by voice analysis unit 27 in units of time sections, attaches labels to them, and thereby builds the standard patterns used for speech recognition; when this embodiment is used as a word speech recognition device, for example, each stored time section is one word. Label input unit 29 assigns a label to each time section accumulated in feature storage unit 28, for example by manual input from speaker 33 or in an automatically determined order, and has feature storage unit 28 store it. Similarity calculation unit 30 calculates the similarity between the parameter sequence input from voice analysis unit 27 as voice with an unknown label and the standard patterns stored in feature storage unit 28, using, for example, DP matching. Label output unit 31 outputs the label of the standard pattern that, according to the result from similarity calculation unit 30, is most similar to the input voice with the unknown label; this output is also supplied to guidance signal source 21. Switch 32 is the switch for operating the speech recognition device.
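The description names DP matching as one possible method for similarity calculation unit 30 but gives no details. As a rough illustration only, the sketch below uses a textbook dynamic time warping recurrence with Euclidean local distance, and treats the smallest accumulated distance as the highest similarity. The function names, the toy one-dimensional feature vectors, and the lowercase labels are assumptions, not the patent's parameters (which would be LPC cepstrum vectors).

```python
import math


def dp_matching_distance(a, b):
    """Accumulated DP-matching (DTW) distance between two feature-vector sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = math.dist(a[i - 1], b[j - 1])   # Euclidean distance between frames
            cost[i][j] = local + min(cost[i - 1][j],      # stretch sequence b
                                     cost[i][j - 1],      # stretch sequence a
                                     cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]


def recognize(unknown, templates):
    """Return the label whose standard pattern is closest to the unknown input."""
    return min(templates, key=lambda label: dp_matching_distance(unknown, templates[label]))


# Toy standard patterns (stand-ins for feature storage unit 28).
templates = {
    "suzuki": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "tanaka": [(2.0,), (2.0,), (0.0,), (0.0,)],
}
# A time-stretched, slightly noisy version of the "suzuki" pattern.
unknown = [(0.0,), (0.1,), (1.1,), (2.0,), (2.1,), (0.9,)]

print(recognize(unknown, templates))
```

The warping step lets an utterance spoken more slowly or quickly than the registered one still align with its standard pattern, which is why DP matching suits word-level templates of differing lengths.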

[0016] The operation of the block diagram of FIG. 2 is described below.

[0017] This embodiment has two modes of operation: a registration mode, in which the standard patterns used for speech recognition are stored, and a recognition mode, in which voice is input, its similarity to the standard patterns is calculated, and the label of the most similar standard pattern is output.

[0018] The operation in the registration mode is described first.

[0019] First, speaker 33 operates switch 32 to the recognition mode and issues an operation request to the recognition device. In response, so that speaker 33 can confirm that the request has been accepted, the recognition device produces a guidance voice from guidance signal source 21, for example "Please utter the word to be recognized." Band-pass filter 22 limits this signal to the specific frequency band of 1 kHz to 2 kHz, and the signal is emitted from loudspeaker 24 via amplifier 23. At the same time, microphone 25, band elimination filter 26, and voice analysis unit 27 start operating and wait for voice input from the speaker. The waiting time is, for example, within several seconds to several tens of seconds from the start of the operation of switch 32. If speaker 33 speaks while the guidance signal is being emitted from loudspeaker 24, or after it has been emitted, the voice is picked up by microphone 25 and passes through band elimination filter 26, which removes the specific frequency band of 1 kHz to 2 kHz, and voice analysis unit 27 analyzes it as a voice signal that does not contain the guidance signal. Voice analysis unit 27 selects only the voice sections and outputs the analysis result for them. Voice sections are detected, for example, by observing the short-time signal power at the output of voice analysis unit 27 and detecting the sections in which the signal power exceeds a certain threshold for at least a certain time. The output of voice analysis unit 27 is labeled by label input unit 29 and stored in feature storage unit 28 as a standard pattern to be used for recognition. The operation in the recognition mode is described next.
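The voice section detection rule stated above (short-time power above a threshold for at least a certain time) can be sketched as follows. This is an illustrative sketch under assumptions: non-overlapping frames, a threshold and minimum duration chosen arbitrarily, and hypothetical function names `short_time_power` and `detect_voice_sections`. The patent does not specify frame length, threshold, or duration.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame of frame_len samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_voice_sections(power, threshold, min_frames):
    """Return (start, end) frame pairs, end exclusive, where power stays above
    threshold for at least min_frames consecutive frames."""
    sections = []
    start = None
    # The -inf sentinel closes a run that lasts until the final frame.
    for i, p in enumerate(list(power) + [float("-inf")]):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            if i - start >= min_frames:
                sections.append((start, i))
            start = None
    return sections


# Toy power contour: one isolated spike, then two sustained runs.
power = [0.0, 0.9, 0.0, 0.8, 0.9, 0.7, 0.0, 0.6, 0.6]
print(detect_voice_sections(power, threshold=0.5, min_frames=2))
```

The minimum-duration condition is what rejects the single-frame spike at index 1, so brief clicks or noise bursts are not treated as speech.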

[0020] Speaker 33 operates switch 32 to the recognition mode and issues an operation request to the recognition device. In response, so that speaker 33 can confirm that the request has been accepted, the recognition device produces a guidance voice from guidance signal source 21, for example "Please input your voice." As in the registration mode, band-pass filter 22 limits this signal to the specific frequency band of 1 kHz to 2 kHz, and the signal is emitted from loudspeaker 24 via amplifier 23. At the same time, microphone 25, band elimination filter 26, and voice analysis unit 27 start operating and wait for voice input from the speaker. The waiting time is, for example, within several seconds to several tens of seconds from the start of the operation of switch 32. If speaker 33 speaks while the guidance signal is being emitted from loudspeaker 24, or after it has been emitted, the voice is, as in the registration mode, picked up by microphone 25 and passed through band elimination filter 26, which removes the specific frequency band of 1 kHz to 2 kHz, and voice analysis unit 27 selects and outputs only the voice sections. The output of voice analysis unit 27 is treated as an input voice pattern with an unknown label and is pattern-matched in similarity calculation unit 30 against the multiple standard patterns stored in feature storage unit 28. The label of the standard pattern judged most similar is then output from label output unit 31. The output label can also be fed back to guidance signal source 21, for example, and used to let speaker 33 confirm the recognition result. For example, suppose the voice labels registered as standard patterns are "Suzuki", "Tanaka", and "Sato", and the label of the recognized standard pattern is "Suzuki". Guidance signal source 21 can then produce synthesized voice such as "The recognized word is 'Suzuki'. Is that correct?" for speaker 33 to hear and confirm, and speaker 33 can be prompted to input a confirming utterance such as "Yes" or "No".

[0021] To confirm that the recognition performance of the speech recognition device of this embodiment is unlikely to degrade when the guidance voice signal is superimposed, an experiment simulating the embodiment was carried out. The task was to recognize 100 Japanese city names uttered by five adult males, with guidance voice superimposed. The guidance voice was the phrase "Please utter the command name" spoken by an adult female in her twenties. It was passed through a filter that passes only the 1 kHz to 2 kHz band, equivalent to band-pass filter 22 described above, and was superimposed on the speech at the "command name" portion, at a power level equal to the short-time power of the speaker's original voice. The change in recognition rate was then compared between the case where the guidance was removed from this guidance-superimposed speech by a band elimination filter and the case where it was not. The band elimination filter used removes frequencies from 1 kHz to 2 kHz, equivalent to band elimination filter 26 described above. The guidance-superimposed speech was passed through this filter when the guidance was to be removed and bypassed it when it was not; 16th-order LPC cepstrum analysis was then performed to create voice patterns for the recognition experiment. The standard patterns were prepared so that the standard patterns and the input voice patterns were analyzed under acoustically identical conditions, including whether or not they passed through the filter. In the results averaged over the five speakers, the recognition rate was 98.6% with no guidance voice superimposed; with the guidance voice superimposed and not removed by the band elimination filter, it fell to 84.8%, a drop of about 14 percentage points. With the guidance removed by the band elimination filter, the recognition rate was 93.6%, confirming that the drop can be held to 5 percentage points. Thus, in a speech recognition device with guidance, when band-limited guidance voice is used, degradation of the recognition rate can be kept small even if the guidance voice is superimposed, by removing it with a band elimination filter.
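The reported figures can be checked with simple arithmetic. The short sketch below uses only the three recognition rates stated in the text; the variable names are illustrative, and the differences are expressed in percentage points.

```python
# Recognition rates reported in the experiment (percent, averaged over five speakers).
clean = 98.6                     # no guidance superimposed
superimposed_unfiltered = 84.8   # guidance superimposed, band elimination filter bypassed
superimposed_filtered = 93.6     # guidance superimposed, band elimination filter applied

# Drops relative to the clean condition, in percentage points.
drop_unfiltered = round(clean - superimposed_unfiltered, 1)
drop_filtered = round(clean - superimposed_filtered, 1)

print(drop_unfiltered, drop_filtered)
```

The first difference, 13.8 points, is the figure the text rounds to "about 14", and the second is the 5-point drop attributed to the filtered case.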

[0022]

[Effects of the Invention] As described above, according to the present invention, the guidance signal is limited to a specific frequency band by a band-pass filter, and a band elimination filter removes the same band from the voice uttered by the speaker in response to the guidance voice, so that a voice signal containing no guidance signal component is obtained even when the guidance voice and the speaker's voice overlap in time. Consequently, the loudspeaker for the guidance and the microphone for voice input need not be acoustically isolated from each other, which makes it easy to build a hands-free voice input device. Utterances can be accepted at any time from the moment the guidance signal starts, so a user-friendly voice input device can be built. The filter learning algorithm, the guidance voice signal for training, and the training time required by conventional guidance removal methods using adaptive prediction filters and the like can be eliminated. Furthermore, when the voice input device is applied to a speech recognition device, the recognition rate does not decrease even if guidance voice is superimposed on the speech or the guidance voice is incompletely removed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a voice input device with guidance voice according to one embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration when a voice input device with guidance voice according to another embodiment of the present invention is applied to a speech recognition device.

[Explanation of symbols]

11: guidance signal source
12: band-pass filter
14: loudspeaker
15: microphone
16: band elimination filter
17: voice output unit

Continued from the front page: (58) Fields searched (Int. Cl.⁷, DB name): G06F 3/16, G10L 13/00, G10L 15/20, G10L 21/02

Claims

(57) [Claims]

1. A voice input device with guidance voice, comprising: signal generating means for generating a guidance signal that prompts a speaker to speak; a band-pass filter that extracts a signal in a specific frequency band from an output signal of the signal generating means; a loudspeaker that emits the output signal of the band-pass filter as an audio signal so that the speaker can hear it; signal input means that receives the voice uttered by the speaker in response to the audio signal and outputs it as an electric signal; and a band elimination filter that extracts, from an output signal of the signal input means, the signal outside the specific frequency band.