JPS6142280B2

Info

Publication number: JPS6142280B2
Application number: JP54098857A
Authority: JP
Inventors: Takehiko Asano
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1979-08-01
Filing date: 1979-08-01
Publication date: 1986-09-19
Also published as: JPS5622500A

Description

[Detailed description of the invention]

The present invention relates to preventing malfunction of a speech recognition device caused by external noise.

Various speech recognition systems have been proposed that extract feature parameters from a speech signal, use those parameters to recognize a speaker's utterance, and output signals such as those for controlling external equipment according to the recognized utterance. One example is shown in the block diagram of FIG. 1. Its main components are an input section 1 including an acoustic-to-electrical signal converter (for example, a microphone) that converts input speech into an electrical signal; an input band filter 2 that extracts from the input acoustic signal only the frequency band to be processed; a feature extraction section 3 that extracts the features of the speech signal; a standard pattern storage section 4 that stores standard patterns of speech features registered in advance; a recognition processing section 5 that compares the feature pattern extracted from the input speech with the standard patterns and identifies the input speech; and an output control section 6 that outputs the recognition result. Added to these are an input signal amplitude normalization circuit 7 for improving the recognition rate, a time axis adjustment section 8, and a registration control section 9 that handles the processing for registering standard patterns of speech features in advance.

Many parameters can be used to extract the features of speech, such as the frequency spectrum distribution, correlation functions, zero-crossing counts, formant frequencies, or linear prediction coefficients. Among these, the so-called filter bank method, which separates and extracts the frequency spectrum of the speech with a plurality of band-splitting filters and examines its correlation with standard patterns, is widely used because it achieves a high recognition rate with a relatively simple configuration.
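As a rough software illustration of the filter bank idea, the sketch below measures the energy of one short frame in a few frequency bands. It is not taken from the patent: the function name `band_levels`, the band edges, the 8 kHz sample rate, and the use of a naive discrete Fourier transform in place of analog band-splitting filters are all assumptions made for the example.

```python
import math


def band_levels(frame, sample_rate, band_edges):
    """Return the spectral energy of `frame` in each (low_hz, high_hz) band.

    A naive DFT stands in for the analog band-splitting filters; this is an
    illustrative substitute, not the circuit described in the text.
    """
    n = len(frame)
    levels = []
    for low, high in band_edges:
        energy = 0.0
        for k in range(1, n // 2):
            freq = k * sample_rate / n
            if low <= freq < high:
                re = sum(x * math.cos(2 * math.pi * k * i / n)
                         for i, x in enumerate(frame))
                im = sum(x * math.sin(2 * math.pi * k * i / n)
                         for i, x in enumerate(frame))
                energy += re * re + im * im
        levels.append(energy)
    return levels


# Hypothetical test signal: a 1 kHz tone, one 10 ms frame at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(80)]

# Hypothetical band edges in Hz.
levels = band_levels(frame, SAMPLE_RATE, [(200, 800), (800, 1600), (1600, 3200)])
```

Here the middle band carries essentially all of the energy, since the tone falls inside 800 to 1600 Hz. A sequence of such per-frame band levels is the kind of feature pattern that would be compared against the standard patterns.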

In such a speech processing device, the ability to separate and extract control commands from the input signal is extremely important; it is no exaggeration to say that it determines the recognition rate of the device. In particular, when the recognition device is used in a place with a lot of external noise, the noise enters through microphone 1 and becomes a serious problem in use. If the frequency spectrum distribution of the noise input from microphone 1, after passing through input band filter 2, closely resembles the frequency spectrum distribution of any of the utterances registered in the device in advance, the device may misrecognize the noise as a meaningful utterance.

External noises that occur frequently and cause problems include hand clapping and the impact sounds produced when objects collide. Today, noise of this kind, relatively high in level and rich in high-frequency components, readily arises in a variety of environments. Because such sounds are short in duration, the device tends to reject them on the basis of normal word length, but they may be accepted when they occur several times in quick succession.

A so-called close-talking microphone is sometimes used in the input section to remove external noise, but it tends to be less effective against high-level sounds rich in high-frequency components such as those described above, and it remains insufficient for preventing malfunction.

The present invention was made with the aim of making such a speech recognition device less prone to malfunction caused by noise.

That is, the invention focuses on the fact that noise of the kind described above has a wide frequency band and contains many high-frequency components not found in normal speech, and uses this fact to distinguish clearly between noise and speech.

An embodiment of the present invention is described in detail below with reference to FIG. 2. Input section 1 consists of a microphone 10 and an amplifier 11. The signal from input section 1 passes through input band filter 2 and then enters feature extraction section 3. Feature extraction section 3 consists of N band-pass filters (BPF) 12, 13 to 14 with center frequencies f₁, f₂, ..., fₙ respectively, a multiplexer 15 that switches among the outputs of these filters, and an analog-to-digital (A/D) converter 16 that converts the output level of each filter, after it passes through the multiplexer, into a digital signal. The total band covered by band-pass filters 12 to 14 is the same as or narrower than the band of input band filter 2. Each filter component of the speech signal from input section 1 is thus sampled at a suitable time interval (in most cases around 10 milliseconds), converted into digital code, and stored in storage memory 18 via microcomputer 17, which includes an I/O port. A separate standard pattern memory 4 is connected to microcomputer 17; it holds the control command utterances, stored in advance together with codes specifying their control content. In speech recognition mode, a control utterance is input as described above, and the signal sequence extracted by filters 12, 13 ... 14 of feature extraction circuit 3 and digitized is stored in storage memory 18. Microcomputer 17 then calculates the difference between this stored pattern and the standard pattern for every standard pattern, and identifies the input speech by determining the standard pattern with the smallest difference. Because the time course of human speech is not always the same even when the same words are uttered, it is well known that some kind of time axis adjustment circuit 8, as shown in FIG. 1, must be added; in FIG. 2, which shows the configuration of the present invention, this circuit is omitted for convenience of explanation.
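The matching step performed by the microcomputer can be sketched as a nearest-pattern search. The text says only that the "difference" is computed for every standard pattern and the smallest one is chosen; the sum of absolute differences used below, the names `pattern_distance` and `identify`, and the sample data are assumptions for illustration. As in FIG. 2, no time axis adjustment is applied, so the patterns are assumed to have the same number of frames.

```python
def pattern_distance(stored, standard):
    """Sum of absolute differences between two patterns of equal length.

    Each pattern is a list of frames; each frame is a list of band levels.
    The distance measure itself is an assumption, not specified in the text.
    """
    if len(stored) != len(standard):
        raise ValueError("patterns must have the same number of frames")
    return sum(
        abs(a - b)
        for frame_a, frame_b in zip(stored, standard)
        for a, b in zip(frame_a, frame_b)
    )


def identify(stored, standard_patterns):
    """Return (control_code, distance) for the closest standard pattern."""
    best_code, best_distance = None, float("inf")
    for code, standard in standard_patterns.items():
        distance = pattern_distance(stored, standard)
        if distance < best_distance:
            best_code, best_distance = code, distance
    return best_code, best_distance


# Hypothetical standard patterns: 3 frames of 3 band levels each,
# keyed by a made-up control code.
STANDARD_PATTERNS = {
    "ON": [[9, 2, 1], [8, 3, 1], [2, 7, 3]],
    "OFF": [[1, 8, 4], [2, 9, 3], [1, 2, 9]],
}

stored_pattern = [[8, 2, 2], [7, 3, 1], [3, 6, 3]]
result = identify(stored_pattern, STANDARD_PATTERNS)  # ("ON", 5)
```

With these sample values the stored pattern differs from "ON" by 5 and from "OFF" by 40, so "ON" is selected.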

In recognition mode, speech is captured continuously. When the input speech breaks off, that is, during a pause, the recognition calculation described above is executed and the preceding input speech is identified by pattern matching. If the input speech can be identified, that is, if it matches some standard pattern within an allowable error, microcomputer 17 controls output control section 6 to output the external device control signal corresponding to the control utterance. If the input speech cannot be identified, no control signal is output; instead, for example, display device 19 is driven to prompt the user to speak again.
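The pause-triggered decision described here can be sketched in two small functions. Everything concrete in this sketch is hypothetical: the names `utterance_ended` and `respond`, the silence level, the number of quiet frames that counts as a pause, and the tolerance value are not given in the text.

```python
def utterance_ended(frame_levels, silence_level, pause_frames):
    """Return True when an utterance has been followed by a pause.

    A pause is assumed to mean that the last `pause_frames` frames are all
    below `silence_level` while at least one earlier frame reached it.
    """
    if len(frame_levels) <= pause_frames:
        return False
    tail = frame_levels[-pause_frames:]
    head = frame_levels[:-pause_frames]
    return all(level < silence_level for level in tail) and any(
        level >= silence_level for level in head
    )


def respond(best_code, best_distance, tolerance):
    """Choose between a control signal and a prompt to speak again."""
    if best_distance <= tolerance:
        # Match within the allowable error: drive the output control section.
        return ("control_signal", best_code)
    # No acceptable match: no control signal, drive the display instead.
    return ("prompt_repeat", None)


# Hypothetical values for illustration only.
ended = utterance_ended([0, 5, 6, 0, 0, 0], silence_level=1, pause_frames=3)
accepted = respond("ON", best_distance=5, tolerance=10)
rejected = respond("ON", best_distance=40, tolerance=10)
```

In this example `ended` is true, `accepted` carries the control code, and `rejected` asks for a repeat because the best distance exceeds the tolerance.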

Next, how the embodiment discriminates between noise and speech signals is described. The signal from input section 1 enters input band filter 2 and, at the same time, is input to a high-pass filter (HPF) 20, which passes frequencies above the band of the input band filter. Because there is no need to detect frequencies higher than necessary, the upper range is attenuated to suit the characteristics of the microphone. The output of HPF 20 is integrated by high-frequency signal integrator 21. The output of input band filter 2 is input to BPFs 12, 13 to 14 and, at the same time, to voice-band signal integrator 22. The outputs of high-frequency signal integrator 21 and voice-band signal integrator 22 are input to a divider 23, which serves as the signal comparator. Divider 23 divides the high-frequency signal level (numerator) by the voice-band signal level (denominator), and its output goes to a voltage comparator 24. Noise of the kind described above contains many components above the voice band, so when it is input the high-frequency signal level rises, and its ratio to the voice-band signal level, that is, the output of divider 23, becomes distinctly larger than when only a voice-band signal is present. The reference voltage (V₀) of voltage comparator 24 is therefore set to a suitable value (determined experimentally), and the circuit is configured so that, when the divider output is at or above the reference voltage (V₀), the input signal is regarded as noise and a signal is sent to microcomputer 17. Microcomputer 17 then interrupts the recognition operation, which prevents misrecognition. In this case too, display device 19 is driven to prompt the user to speak again.
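The ratio test performed by divider 23 and voltage comparator 24 can be sketched in software as follows. This is an illustration under assumptions, not the analog circuit: the mean rectified level stands in for the integrators, the function names are invented, the sample values are made up, and the threshold `v0 = 0.5` is arbitrary, since the text states that V₀ is determined experimentally.

```python
def integrate(samples):
    """Mean rectified level, a stand-in for integrators 21 and 22."""
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def is_noise(high_band, voice_band, v0):
    """Return True when high-band level / voice-band level is at or above v0.

    `high_band` is the signal after the high-pass filter and `voice_band`
    the signal after the input band filter.
    """
    high_level = integrate(high_band)
    voice_level = integrate(voice_band)
    if voice_level == 0.0:
        # Assumed handling: with no voice-band energy, any high-band
        # energy is treated as noise to avoid dividing by zero.
        return high_level > 0.0
    return high_level / voice_level >= v0


# Hypothetical sample values.
voice_band = [0.5, -0.4, 0.6, -0.5]
speech_high = [0.01, -0.02, 0.01, -0.01]   # little energy above the voice band
clap_high = [0.8, -0.9, 0.7, -0.8]         # strong energy above the voice band

speech_flag = is_noise(speech_high, voice_band, v0=0.5)
clap_flag = is_noise(clap_high, voice_band, v0=0.5)
```

With these values the speech-like input gives a ratio of about 0.025 and is passed on to recognition, while the clap-like input gives a ratio of 1.6 and would cause recognition to be interrupted.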

As described above, the present invention distinguishes noise from normal speech by the characteristics of their spectral distributions. It therefore offers high discrimination, prevents misrecognition by the recognition device, improves the recognition rate, and is highly practical.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an existing speech recognition device, and FIG. 2 is a block diagram showing the configuration of the device of the present invention. Reference numeral 1 denotes the input section, 3 the feature extraction section, 4 the standard pattern storage section, 17 the microcomputer, 21 the high-frequency signal integrator, 22 the voice-band signal integrator, and 23 the divider.

Claims

[Claims]

1. A speech recognition device comprising: a voice-band filter and a high-band filter that divide the acoustic signal from an acoustic-to-electrical signal converter into a voice-band signal and a high-frequency signal above that frequency band; a voice-band level detector that detects the level of the signal passed through the voice-band filter; a high-band level detector that detects the level of the signal passed through the high-band filter; and a comparator that takes the ratio of the detection outputs of these two level detectors, characterized in that the speech recognition operation is interrupted when the comparison shows that the high-band level is higher than the voice-band level.