JPH077276B2 - Syllable recognizer

Info

Publication number: JPH077276B2
Application number: JP1056789A
Authority: JP
Inventors: 伸神谷; 文雄外川; 充宏斗谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1989-03-09
Filing date: 1989-03-09
Publication date: 1995-01-30
Anticipated expiration: 2010-01-30
Also published as: JPH02235141A

Description

[Detailed Description of the Invention] <Industrial Field of Application> The present invention relates to a syllable recognition device that uses time-delay neural networks.

<Prior Art> A known syllable spotting device using a time-delay neural network (hereinafter, TDNN) is described in Sawai, Alex Waibel, and Shikano, "A study of syllable spotting using time-delay neural networks," Proceedings of the Acoustical Society of Japan, 1988-10. This syllable spotting device consists of a TDNN made up of an input layer, two intermediate layers, and an output layer. As training samples for this TDNN, 53 words containing the syllable /BA/ were selected, and speech samples were made by cutting out the 15 frames (10 ms frame period) of each /BA/ portion. The input pattern is a 16th-order FFT-based mel spectrum of the speech signal. The output layer of this TDNN has two units, one for each of the recognition categories "BA" and "non-BA". Training uses error backpropagation.

Teacher data are given to this TDNN during training as follows. When the offset between the boundary of phoneme /B/ and phoneme /A/ in the input pattern and the center position of the TDNN is within a fixed time, "1" is given to the output-layer unit assigned to the recognition category "BA" and "0" is given to the unit assigned to the recognition category "non-BA".

An unknown speech signal is input by scanning the input pattern of the unknown syllable across the units of the TDNN input layer, shifting it three frames at a time. If the output value of the unit assigned to "BA" is larger than that of the unit assigned to "non-BA", the syllable of the input pattern is judged to be /BA/. In the opposite case, the syllable of the input pattern is judged to belong to the category /non-BA/.
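The scanning procedure of the prior-art device can be sketched as follows. This is only an illustration under stated assumptions: `tdnn` is a hypothetical callable standing in for the trained network and returning the pair (BA output, non-BA output), and the 15-frame window length is assumed from the length of the training samples; the source does not state the input span of the network.

```python
def spot_syllable(frames, tdnn, window=15, shift=3):
    """Return the start frames of windows judged to contain /BA/.

    frames: list of per-frame feature vectors (e.g. 16 mel-spectrum values).
    tdnn:   hypothetical callable, window -> (ba_output, non_ba_output).
    window: assumed input span in frames (not stated in the source).
    shift:  scan step; the source uses three frames.
    """
    hits = []
    for start in range(0, len(frames) - window + 1, shift):
        ba, non_ba = tdnn(frames[start:start + window])
        if ba > non_ba:  # decision rule described in the text
            hits.append(start)
    return hits
```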

<Problems to Be Solved by the Invention> As described above, in the conventional syllable spotting device the input pattern fed to the TDNN input layer is a 16th-order FFT-based mel spectrum extracted from the speech signal, and the categories represented by the output data of the output layer are "BA" and "non-BA". In other words, the feature pattern of the speech signal is input to the TDNN and the syllable is recognized directly. As a result, the intermediate progress of the syllable recognition process inside the TDNN (for example, the values of the weights between units of each layer) cannot be known. Even if those weight values were known, what they mean would be unclear.

Consequently, when training of the TDNN is slow to converge, the cause of the non-convergence is completely unknown. Because the cause is unknown, nothing can be done to steer training toward convergence, and the training time becomes longer than necessary.

An object of the present invention is therefore to provide a syllable recognition device in which the intermediate progress of the syllable recognition process can be observed and the training time of the TDNN can be shortened.

<Means for Solving the Problems> To achieve the above object, the syllable recognition device of the present invention is characterized by comprising a first time-delay neural network and a second time-delay neural network. The first time-delay neural network has time delay means; it combines the time series of signals representing sequentially input acoustic parameters with signals obtained by delaying that time series by predetermined times using the time delay means, converts the combined signal into a time series of signals representing phonemes or phones, and outputs it. The second time-delay neural network also has time delay means; it sequentially receives the time series of signals representing phonemes or phones output from the first time-delay neural network, combines that time series with signals obtained by delaying it by predetermined times using the time delay means, converts the combined signal into a time series of signals representing syllables, and outputs it.

<Operation> A time series of signals representing acoustic parameters extracted from an unknown word is sequentially input to the first time-delay neural network. The first time-delay neural network combines this time series with signals obtained by delaying it by predetermined times using the time delay means, converts the combined signal into a time series of signals representing phonemes or phones, and outputs it. The time series of signals representing phonemes or phones output from the first time-delay neural network is then input to the second time-delay neural network.

The second time-delay neural network combines the sequentially input time series of signals representing phonemes or phones with signals obtained by delaying that time series by predetermined times using the time delay means, converts the combined signal into a time series of signals representing syllables, and outputs it. Syllables can therefore be recognized from the time series of signals representing syllables output from the second time-delay neural network.

During this process, the intermediate progress of syllable recognition (that is, the recognition result for the phoneme sequence making up the syllable) can be observed by monitoring the signals representing phonemes or phones output from the first time-delay neural network.

<Embodiment> The present invention is described in detail below with reference to the illustrated embodiment.

Fig. 1 is a block diagram of the syllable recognition device of the present invention. The device consists broadly of two TDNNs connected in series. One TDNN is formed by a first multilayer-perceptron-type neural network (hereinafter, NN) 1, and the other TDNN is formed by a second NN 2. In Fig. 1 the first NN 1 and the second NN 2 are drawn in simplified form, omitting the input layer, intermediate layer, output layer, the units in each layer, and the connections between units.

A syllable is made up of a chain of phonemes. In the syllable recognition device of this embodiment, the phonemes that make up the syllable to be recognized are therefore used as the means of observing the intermediate progress of the recognition operation. The input data of the first NN 1 are acoustic parameters and its recognition categories are phonemes. The input data of the second NN 2 are data representing the phonemes that are the recognition categories of the first NN 1 (in this embodiment, the output data of the first NN 1), and its recognition categories are syllables. With this arrangement, monitoring the output data of the first NN 1 during syllable recognition shows the intermediate progress of the recognition operation for the acoustic parameters input to the first NN 1.

The training samples input to the input layer of the first NN 1 are feature patterns extracted from the speech signals of words whose spoken content is known, labeled with phonemes by visual inspection of power and similar cues. One frame is about 8 ms to 10 ms. The feature pattern may be, for example, the output values of an m-channel band-pass filter bank, m-th order autocorrelation coefficients, or m-th order cepstral coefficients. The input data therefore have order m. The teacher data for the first NN 1 are data representing the phonemes given by the labels of the training samples prepared in this way.

The input layer (not shown) of the first NN 1 has m × (A + 1) units, where A is the maximum number of delay frames described in detail later. Starting from the unit at one end, the units of the input layer are divided into m blocks of (A + 1) units each. The first unit of the i-th block (1 ≤ i ≤ m) receives the i-th order acoustic parameter. The next unit receives the i-th order acoustic parameter delayed by one frame by a delay element 3, which delays its input signal by a time corresponding to one frame. The unit after that receives the i-th order acoustic parameter delayed by two frames by two delay elements 3. Continuing in the same way, the last unit receives the i-th order acoustic parameter delayed by A frames by A delay elements 3. In this way an input pattern of n frames × m orders is delayed by 0 through A frames and input, one frame at a time, to the m × (A + 1) units of the input layer.
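The layout of the m × (A + 1) input units can be illustrated with a small tapped delay line. The sketch below is not taken from the patent: the function name is hypothetical, and filling the taps with 0.0 before the first frame has arrived is an assumption, since the source does not say what the delay elements hold initially.

```python
from collections import deque

def delay_line_inputs(frames, max_delay):
    """Yield, per incoming frame, the m blocks of (max_delay + 1) input values.

    frames:    iterable of length-m lists, one acoustic-parameter vector per frame.
    max_delay: A in the text (number of delay elements per parameter).
    Row i of each yielded block holds parameter i at delays 0, 1, ..., max_delay.
    """
    history = deque(maxlen=max_delay + 1)  # newest frame sits at index 0
    for frame in frames:
        history.appendleft(list(frame))
        block = []
        for i in range(len(frame)):
            # Taps older than the first frame are filled with 0.0 (assumption).
            row = [history[d][i] if d < len(history) else 0.0
                   for d in range(max_delay + 1)]
            block.append(row)
        yield block
```

With m = 2 and A = 2, the third frame produces two rows of three values each: the current value of each parameter followed by its values one and two frames earlier.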

The output layer (not shown) of the first NN 1 has p units, where p is the number of phonemes to be discriminated by the first NN 1. Japanese normally has about 20 kinds of phonemes. Each unit of the output layer is assigned to one of the phonemes to be discriminated (for phonemes that are easily affected by coarticulation, several units may be assigned on the basis of experience). In Fig. 1 the unit at one end (the first unit) is assigned to phoneme /a/, the j-th unit to phoneme /r/, and the p-th unit to phoneme /b/. Thus, when the first unit gives the largest output, the phoneme of the input acoustic parameters is recognized as /a/, and when the j-th unit gives the largest output, it is recognized as /r/.

The input layer (not shown) of the second NN 2 has p × (B + 1) units, where B is the maximum number of delay frames described in detail later. As in the first NN 1, the units of the input layer are divided, starting from the unit at one end, into p blocks of (B + 1) units each. The first unit of the j-th block (1 ≤ j ≤ p) receives the output signal from the j-th unit of the output layer of the first NN 1. The next unit receives that output signal delayed by one frame by a delay element 3. The unit after that receives it delayed by two frames by two delay elements 3. Continuing in the same way, the last unit receives the output signal from the j-th unit delayed by B frames by B delay elements 3. In this way an input pattern consisting of the p output signal sequences is delayed by 0 through B frames and input sequentially to the p × (B + 1) units of the input layer.

The output layer (not shown) of the second NN 2 has s units, where s is the number of syllables to be discriminated by the second NN 2. Japanese normally has about 100 kinds of syllables. Each unit of the output layer is assigned to one of the syllables to be discriminated. In Fig. 1 the first unit is assigned to syllable /a/, the second unit to syllable /i/, and the s-th unit to syllable /syo/. Thus, for example, when the first unit gives the largest output the syllable for the input pattern is recognized as /a/, when the sixth unit gives the largest output it is recognized as /ka/, and when the s-th unit gives the largest output it is recognized as /syo/.
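The two-stage structure described so far can be sketched as a cascade in which each stage owns a delay line and the phoneme outputs of the first stage remain visible for monitoring. This is a minimal sketch, not the patented implementation: `DelayedStage`, `recognize`, and the `net` callables are hypothetical names, each `net` stands in for a trained multilayer perceptron, and the delay taps are assumed to start at 0.0.

```python
from collections import deque

class DelayedStage:
    """One TDNN stage: a tapped delay line feeding a network callable."""

    def __init__(self, net, n_inputs, max_delay):
        self.net = net  # hypothetical stand-in for a trained network
        self.taps = deque([[0.0] * n_inputs for _ in range(max_delay + 1)],
                          maxlen=max_delay + 1)

    def step(self, vec):
        self.taps.appendleft(list(vec))  # oldest frame falls off the far end
        # Block i holds input i at delays 0..max_delay, as in the text.
        flat = [self.taps[d][i]
                for i in range(len(vec)) for d in range(len(self.taps))]
        return self.net(flat)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def recognize(frames, stage1, stage2, phonemes, syllables):
    """Return one (phoneme, syllable) decision per frame."""
    trace = []
    for frame in frames:
        ph_out = stage1.step(frame)   # intermediate result, can be monitored
        sy_out = stage2.step(ph_out)  # second stage sees phoneme outputs only
        trace.append((phonemes[argmax(ph_out)], syllables[argmax(sy_out)]))
    return trace
```

Keeping the phoneme decision in the returned trace mirrors the point of the design: the intermediate result is a quantity with a known meaning, so it can be inspected independently of the syllable output.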

The syllable recognition device configured as above is trained as follows.

Fig. 2 illustrates how the teacher data are given. The way the training data are given is described in detail below with reference to Fig. 2.

Fig. 2(a) is the power curve of the word /ふたりの/ (futarino), whose spoken content is known, used as a training sample. Fig. 2(b) shows the acoustic parameters corresponding to the power curve of Fig. 2(a) (the input data to the input layer of the first NN 1). Fig. 2(c) shows the teacher data for the first NN 1, and Fig. 2(d) shows the teacher data for the second NN 2. Fig. 2(b) represents the acoustic parameters as a matrix of order against frame number, with the actual data values omitted.

As described above, a phoneme label is attached to the acoustic parameters of each frame by visual inspection of the power curve of the training sample /ふたりの/. The attached phoneme labels are marked below the power curve in Fig. 2(a). The data representing these phoneme labels become the teacher data for the first NN 1.

The teacher data for the first NN 1 are created as follows. In the frame that comes a time corresponding to A frames after a frame of acoustic parameters carrying a given phoneme label is input, the teacher data for that phoneme give the signal "1" to the output-layer unit assigned to the phoneme and the signal "0" to all other units. For example, in frame f₂ of Fig. 2(c), which comes A frames after frame f₁ corresponding to phoneme /h/ in Fig. 2(b), the teacher data give "1" to the unit assigned to phoneme /h/ and "0" to the other units (these are called the teacher data for phoneme /h/).

The first NN 1 is trained on its own, as follows. For the first frame labeled /h/ in the training sample /ふたりの/ (the first frame), the first-order acoustic parameter is input to input terminal 4, the second-order acoustic parameter to input terminal 5, the i-th order acoustic parameter to input terminal 6, and the m-th order acoustic parameter to input terminal 7. In the same way, the acoustic parameters of the second frame, the third frame, and so on are then input sequentially to input terminals 4 to 7.

Meanwhile, as shown in Fig. 2(c), once a time of A frames has elapsed after the acoustic parameters of the first frame were input, the teacher data for phoneme /h/ (giving "1" to the unit assigned to /h/ and "0" to the other units) are input to the units of the output layer for a time corresponding to three frames. In the same way, "0" teacher data (that is, teacher data giving "0" to every unit of the output layer) are then input for one frame, the teacher data for phoneme /u/ for four frames, "0" teacher data for four frames, and so on.

Fig. 2(c) represents the teacher data input in this way as a matrix with phonemes as rows and frames as columns (here A = 3, and only the "1" entries are written; "0" entries are omitted). "0" teacher data are inserted between phonemes /h/ and /u/, between /a/ and /r/, between /i/ and /n/, and between /n/ and /o/ so as to remove the strong influence of the preceding or following phoneme.
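The construction of the teacher data for the first NN 1 can be sketched as shifting the per-frame labels by A frames and one-hot encoding them. The function below is an illustrative assumption rather than the patent's procedure: the name is hypothetical, and the "0" teacher frames between phonemes are modeled as frames whose label is `None`, because the source does not specify how the gap frames are chosen.

```python
def phoneme_targets(frame_labels, phonemes, delay):
    """Build NN 1 teacher rows, one per frame, delayed by `delay` frames.

    frame_labels: per-frame phoneme label, or None for an all-zero teacher frame.
    phonemes:     list of phonemes, one per output unit.
    delay:        A in the text (A = 3 in Fig. 2(c)).
    """
    index = {p: k for k, p in enumerate(phonemes)}
    targets = [[0] * len(phonemes) for _ in range(len(frame_labels) + delay)]
    for t, label in enumerate(frame_labels):
        if label is not None:
            targets[t + delay][index[label]] = 1  # "1" for the labeled phoneme
    return targets
```

For the opening of the sample in the text (three /h/ frames, one gap frame, four /u/ frames) and A = 3, the first three target rows are all zero, rows 3 to 5 mark /h/, row 6 is all zero, and rows 7 to 10 mark /u/.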

The acoustic parameters fed to the input layer are delayed by 0 through A frames, and the teacher data are input to the output layer A frames after the acoustic parameters are input to input terminals 4 to 7, for the following reason. Acoustic parameters that carry the same phoneme label still vary widely because of coarticulation and similar effects. To absorb these effects, the acoustic parameters carrying a given phoneme label are first delayed by 0 through A frames and input to the input layer, and the teacher data are input only after A frames have elapsed since the acoustic parameters of the first frame carrying that phoneme label were input. The teacher data for the label can then be input while acoustic parameters that share the label but have been changed in various ways by coarticulation are present at the first NN 1 simultaneously (that is, with a large amount of information), so the effects of coarticulation can be absorbed.

When the time series of m-th order acoustic parameters of an unknown word is input sequentially to input terminals 4 to 7 of the first NN 1 trained as above, the acoustic parameters delayed by times corresponding to 0 through A frames are input to the units of the input layer. Output data determined by the structure of the trained first NN 1 (that is, the connection weights between the units of each layer) are then output from the units of the output layer. The unit assigned to the phoneme corresponding to the input acoustic parameters outputs the largest signal.

The input layer of the second NN 2 is connected to the output layer of the first NN 1 whose training has been completed in this way, as shown in Fig. 1, and the second NN 2 is then trained. The output signals from the units of the output layer of the first NN 1 are shown on a display device (not shown) so that the decision of the first NN 1 (that is, the recognized phoneme) can be monitored.

The teacher data for the second NN 2 are created as follows. In the frame that comes a time corresponding to B frames after the data representing the first phoneme of the phoneme chain making up a syllable are input, the teacher data give the signal "1" to the output-layer unit assigned to the syllable formed by that phoneme and the phoneme that follows it, and the signal "0" to all other units. For example, in frame f₃ of Fig. 2(d), which comes B frames after the first frame f₂ of the teacher data for phoneme /h/ shown in Fig. 2(c), the teacher data for syllable /hu/ give "1" to the unit assigned to syllable /hu/ and "0" to the other units.

The second NN 2 is trained as follows. As in the training of the first NN 1, the acoustic parameter time series of the training sample /ふたりの/ and its delayed versions are input sequentially to the units of the input layer of the trained first NN 1. Because the first NN 1 has already learned the boundaries of the phonemes it discriminates, the units of its output layer output data approximately equal to the teacher data of Fig. 2(c). The data representing phoneme /h/ (values from 0 to 1) in the first frame of this output (the first frame) are input to the units of the input layer of the second NN 2. In the same way, the data of the second frame (representing phoneme /h/), the third frame (representing phoneme /h/), the fourth frame ("0" data), and so on are input sequentially. Meanwhile, once a time of B frames has elapsed after the data representing phoneme /h/ of the first frame were input, the teacher data for syllable /hu/ (giving "1" to the unit assigned to /hu/ and "0" to the other units) are input to the units of the output layer for a time corresponding to one frame. The teacher data for syllable /hu/ are then input for a further time corresponding to two frames.

In the same way, "0" teacher data are then input for eight frames, the teacher data for syllable /ta/ for two frames, "0" teacher data for four frames, and so on.

Fig. 2(d) represents the teacher data input in this way as a matrix with syllables as rows and frames as columns (here B = 5, and only the "1" entries are written; "0" entries are omitted).
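The teacher data for the second NN 2 can be sketched in the same spirit: each syllable target starts B frames after the first phoneme of the syllable enters the second NN 2 and lasts a few frames. The function and its argument layout are hypothetical, and the onset frames used in the usage note are chosen only so that the 3 / 8 / 2 frame pattern quoted in the text appears; they are not read from Fig. 2.

```python
def syllable_targets(segments, syllables, delay, n_frames):
    """Build NN 2 teacher rows for n_frames frames.

    segments:  (syllable, onset_frame, n_target_frames) triples, where
               onset_frame is when the syllable's first phoneme enters NN 2
               and n_target_frames is how long the "1" is held (assumed input).
    syllables: list of syllables, one per output unit.
    delay:     B in the text (B = 5 in Fig. 2(d)).
    """
    index = {s: k for k, s in enumerate(syllables)}
    targets = [[0] * len(syllables) for _ in range(n_frames)]
    for syllable, onset, length in segments:
        for t in range(onset + delay, onset + delay + length):
            if t < n_frames:  # ignore targets past the end of the sample
                targets[t][index[syllable]] = 1
    return targets
```

For example, `syllable_targets([("hu", 0, 3), ("ta", 11, 2)], ["hu", "ta"], 5, 20)` marks /hu/ in frames 5 to 7, leaves eight all-zero frames, and marks /ta/ in frames 16 and 17.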

If training of the second NN 2 is slow to converge, the output signals from the units of the output layer of the first NN 1 are checked on the display device. If those outputs represent the phonemes corresponding to the acoustic parameters input to the first NN 1, the cause of the non-convergence is taken to lie in the second NN 2, and some measure is applied to the second NN 2, such as changing the weights of its synaptic connections. If the outputs do not represent the phonemes corresponding to the input acoustic parameters, the cause is taken to lie in the first NN 1, and a measure such as retraining the first NN 1 is carried out. Knowing the operating state of the first NN 1 in this way allows the syllable recognition device to be trained efficiently and the training time to be shortened.
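The diagnostic step above is done by an operator watching a display, but the same check can be sketched programmatically. Everything here is an assumption added for illustration: the function name, the use of the largest output as the recognized phoneme for each frame, and in particular the `min_accuracy` threshold, which the source does not define.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def locate_fault(nn1_outputs, expected, phonemes, min_accuracy=0.9):
    """Guess which network to fix when NN 2 training stalls.

    nn1_outputs:  per-frame output vectors of the first NN 1.
    expected:     per-frame expected phoneme, or None for unlabeled frames.
    min_accuracy: hypothetical threshold for "NN 1 is working".
    Returns "NN2" if NN 1 mostly outputs the expected phonemes, else "NN1".
    """
    checked = correct = 0
    for output, label in zip(nn1_outputs, expected):
        if label is None:  # all-zero teacher frame; nothing to compare
            continue
        checked += 1
        if phonemes[argmax(output)] == label:
            correct += 1
    if checked == 0:
        raise ValueError("no labeled frames to check")
    return "NN2" if correct / checked >= min_accuracy else "NN1"
```

A result of "NN1" corresponds to retraining the first network, and "NN2" to adjusting the second network, as described in the paragraph above.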

The data representing phonemes fed to the input layer are delayed by 0 through B frames, and the teacher data are input to the output layer B frames after the data representing phonemes are input to the input layer, for the following reason. The syllable /hu/, for example, consists of the chain of phoneme /h/ and phoneme /u/. The teacher data for syllable /hu/ must therefore be input while both the data representing phoneme /h/ and the data representing phoneme /u/ are present at the second NN 2. To achieve this, the data representing phoneme /h/ are held by delaying them by 1 through B frames with the delay elements 3, and the data representing the next phoneme /u/ are input in this state, so that the data representing /h/ and the data representing /u/ are present simultaneously. The teacher data for syllable /hu/ are input in this state. The teacher data for syllable /hu/ therefore need to be input only during the few frames in which the data representing /h/ and the data representing /u/ are held simultaneously.

In addition, the position of the boundary between the chain of /h/ frames and the chain of /u/ frames within syllable /hu/ varies with the speaker, the speaking rate, and so on. By delaying the data representing phoneme /h/ and the data representing phoneme /u/ input to the second NN 2, the boundary between the /h/ signal chain and the /u/ signal chain seen by the second NN 2 is shifted during the three frames in which the teacher data for syllable /hu/ are input (see Fig. 2(d)); as time passes, the boundary moves toward the front of syllable /hu/. Variation in the boundary position between /h/ and /u/ caused by the speaker or speaking rate can thus be absorbed.

When a time series of data representing the phonemes of an unknown word is input to the units of the input layer of the second NN2 trained as described above, each unit of the output layer outputs data determined by the trained structure of the second NN2. At that time, the unit assigned to the syllable corresponding to the input data outputs the signal with the maximum value.
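Reading the recognized syllable from the output layer then amounts to picking the unit with the largest output. The following is a minimal sketch under that reading; the function name `best_syllable`, the syllable list, and the activation values are hypothetical and do not come from the patent.

```python
def best_syllable(outputs, syllables):
    """Return the syllable assigned to the output unit with the largest value."""
    if len(outputs) != len(syllables):
        raise ValueError("one output unit per syllable is expected")
    index = max(range(len(outputs)), key=outputs.__getitem__)
    return syllables[index]

# Hypothetical output-layer activations for one frame.
syllables = ["ha", "hu", "he"]
outputs = [0.12, 0.81, 0.30]
recognized = best_syllable(outputs, syllables)
```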

The syllable recognition device composed of the first NN1 and the second NN2 trained as described above recognizes syllables as follows.

In FIG. 1, when a time series of signals representing the m-th order acoustic parameters of an unknown word is input sequentially to the input terminals 4 to 7 of the first NN1, the acoustic parameters are delayed by the delay element 3 sequentially from 0 frames up to a time corresponding to A frames and are input to the units of the input layer. The first NN1 then converts, according to its trained structure, the time series of input acoustic parameters into a time series of data representing phonemes and outputs it from the units of the output layer. This output data represents the phonemes corresponding to the input acoustic parameters.

The time series of phoneme data output in this way from the units of the output layer of the first NN1 is delayed by the delay element 3 of the second NN2 sequentially from 0 frames up to a time corresponding to B frames and is input to the units of the input layer of the second NN2. The second NN2 then converts, according to its trained structure, the input time series of phoneme data into a time series of data representing syllables and outputs it from the units of the output layer. In this output data, the output-layer unit assigned to the syllable corresponding to the input phoneme data sequence produces the maximum output signal.

In other words, the output data from the units of the output layer of the second NN2 forms a syllable time series corresponding to the time series of m-th order acoustic parameters of the unknown word input to the input terminals 4 to 7 of the first NN1.
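The data flow through the two cascaded networks can be sketched as follows. This is a simplified illustration, not the embodiment itself: each network is reduced to a single sigmoid layer with random, untrained weights, and the sizes (m = 4, A = 2, B = 3, five phoneme units, three syllable units) and all function names are assumptions. It shows only how each frame is combined with its delayed copies and how the output sequence of the first network becomes the input sequence of the second.

```python
import math
import random
from collections import deque

def make_layer(n_in, n_out, rng):
    """Random (untrained) weights for one fully connected sigmoid layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def layer_forward(weights, x):
    """Sigmoid units applied to a flat input vector."""
    return [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x))))
            for row in weights]

def tdnn_stream(frames, weights, max_delay, frame_size):
    """Feed each frame, together with its copies delayed by 1..max_delay
    frames, into one layer and emit one output vector per input frame.
    Taps that are not yet filled are treated as zero vectors."""
    taps = deque([[0.0] * frame_size] * (max_delay + 1), maxlen=max_delay + 1)
    for frame in frames:
        taps.append(frame)
        flat = [v for tap in taps for v in tap]  # concatenate all taps
        yield layer_forward(weights, flat)

rng = random.Random(0)
m, A, B = 4, 2, 3                 # assumed parameter order and delay lengths
n_phonemes, n_syllables = 5, 3    # assumed numbers of output units

w1 = make_layer(m * (A + 1), n_phonemes, rng)            # stands in for the first NN
w2 = make_layer(n_phonemes * (B + 1), n_syllables, rng)  # stands in for the second NN

acoustic = [[rng.random() for _ in range(m)] for _ in range(10)]
phoneme_seq = list(tdnn_stream(acoustic, w1, A, m))                # monitorable intermediate output
syllable_seq = list(tdnn_stream(phoneme_seq, w2, B, n_phonemes))   # final output
```

Because `phoneme_seq` exists as a separate value between the two stages, it can be inspected on its own, which corresponds to the monitoring of the first network's output described in this document.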

As described above, both the first NN1 and the second NN2 have a TDNN structure. The effects of coarticulation and the like can therefore be absorbed during phoneme recognition by the first NN1, and the effects of the speaker and the speaking rate can be absorbed to some extent during syllable recognition by the second NN2. Accordingly, syllables can be recognized correctly regardless of the speaker or the speaking rate.

Furthermore, in the syllable recognition device, the output data from the units of the output layer of the first NN1 can be monitored on the display device described above, so that the intermediate progress of the syllable recognition operation can be observed. When a syllable recognition result is wrong, observing this intermediate progress (that is, the recognition result for the phonemes that make up the syllable) reveals whether the cause of the misrecognition lies in the first NN1 or in the second NN2. The misrecognition can thus be dealt with appropriately according to its cause, and more accurate syllable recognition results can be obtained.
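The diagnostic reasoning in this paragraph can be written down as a simple rule. The sketch below is one possible reading, not a procedure stated in the patent: it assumes that reference phoneme and syllable labels are available for the test utterance, and the function name `locate_error` and the label values are hypothetical.

```python
def locate_error(phonemes_out, phonemes_ref, syllable_out, syllable_ref):
    """Attribute a syllable misrecognition to one of the two networks.

    If the monitored phoneme output of the first network is already wrong,
    the first network is suspected; if the phonemes are right but the
    syllable is wrong, the second network is suspected."""
    if syllable_out == syllable_ref:
        return "no error"
    if phonemes_out != phonemes_ref:
        return "first NN"
    return "second NN"

# Hypothetical monitored results for the syllable /hu/.
case_a = locate_error(["h", "o"], ["h", "u"], "ho", "hu")  # phonemes already wrong
case_b = locate_error(["h", "u"], ["h", "u"], "ha", "hu")  # phonemes right, syllable wrong
```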

As described above, the syllable recognition device of the present invention is composed of two TDNNs connected in series. The first TDNN converts a time series of signals representing the acoustic parameters of an input unknown word into a time series of signals representing phonemes and outputs it, while the second TDNN receives the time series of phoneme signals output from the first TDNN, converts it into a time series of signals representing syllables, and outputs it. The intermediate progress of the syllable recognition process can therefore be observed by monitoring the output data of the first TDNN. Consequently, when, for example, training of the NNs is slow to converge or a misrecognition occurs, the cause of the non-convergence or misrecognition can be determined from the content of the output data of the first TDNN.

That is, according to the syllable recognition device of the present invention, non-convergence of training and misrecognition can be dealt with appropriately, so that training time is shortened and more accurate recognition results are obtained.

In the above embodiment, delay elements are used as the time delay means, but the invention is not limited to this.

In the above embodiment, the NNs are trained by first training the first NN1 and then connecting the trained first NN1 to the untrained second NN2 to train the second NN2. However, the present invention is not limited to this; the first NN1 and the second NN2 may each be trained independently and the trained first NN1 and second NN2 then connected, which shortens the training time further.

In the above embodiment, the classification categories of the first NN1 are phonemes. However, the present invention is not limited to this, and phones may be used as the classification categories. Doing so makes it possible to bring in existing knowledge about speech recognition and to make an allophone, among the phones that realize a given phoneme, one of the recognition categories of the first NN1, which enables more accurate syllable recognition.

In the above embodiment, the TDNNs are configured as multilayer perceptron neural networks. However, the present invention is not limited to this, and they may be configured as Kohonen neural networks.

<Effects of the Invention> As is clear from the above, the syllable recognition device of the present invention comprises a first time-delay neural network having time delay means and a second time-delay neural network having time delay means. The first time-delay neural network converts a signal, obtained by combining a time series of sequentially input signals representing acoustic parameters with a signal obtained by delaying that time series by a predetermined time, into a signal representing a phoneme or phone and outputs it. The second time-delay neural network converts a signal, obtained by combining the time series of signals representing phonemes or phones sequentially input from the first time-delay neural network with a signal obtained by delaying that time series by a predetermined time, into a signal representing a syllable and outputs it. By monitoring the signal representing the phoneme or phone output from the first time-delay neural network, the intermediate progress of the syllable recognition process (that is, the recognition result for the phoneme or phone) can therefore be observed.

Therefore, according to the syllable recognition device of the present invention, observing the intermediate progress of the syllable recognition process makes it possible to identify the cause of non-convergence of training and deal with it appropriately, so that the training time can be shortened. Likewise, observing the intermediate progress makes it possible to identify the cause of a misrecognition and deal with it appropriately, so that more accurate recognition results can be obtained.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the syllable recognition device of the present invention, and FIG. 2 is a diagram showing an example of the acoustic parameters input to the syllable recognition device of FIG. 1 and an example of the teacher data.
1 ... first NN, 2 ... second NN, 3 ... delay element, 4, 5, 6, 7 ... input terminals.

Claims

[Claims]

1. A syllable recognition device comprising: a first time-delay neural network that has time delay means and that converts a signal, obtained by combining a time series of sequentially input signals representing acoustic parameters with a signal obtained by delaying that time series by a predetermined time with the time delay means, into a time series of signals representing phonemes or phones and outputs it; and a second time-delay neural network that has time delay means, that sequentially receives the time series of signals representing phonemes or phones output from the first time-delay neural network, and that converts a signal, obtained by combining the sequentially input time series of signals representing phonemes or phones with a signal obtained by delaying that time series by a predetermined time with the time delay means, into a time series of signals representing syllables and outputs it.