JP3465334B2

JP3465334B2 - Voice interaction device and voice interaction method

Info

Publication number: JP3465334B2
Application number: JP00147294A
Authority: JP
Inventors: 謙二松井
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1994-01-12
Filing date: 1994-01-12
Publication date: 2003-11-10
Anticipated expiration: 2018-11-10
Also published as: JPH07210193A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice interaction device that controls equipment while conversing with it by voice.

[0002]

[Prior Art] Voice interaction devices, which control equipment while conversing with it by voice, are gradually finding wider use, from toys such as talking dolls to telephone-based bank balance inquiries.

[0003] FIG. 5 shows a configuration example of a conventional voice interaction device. In the figure, voice input unit 1 converts voice into an electric signal; voice analysis unit 2 converts the voice signal into voice feature information such as a frequency spectrum or cepstrum by FFT or LPC analysis; phoneme/syllable recognition unit 3 performs speech recognition based on the voice feature information from the voice analysis unit and outputs a phoneme or syllable string; dialogue control unit 12 manages the voice dialogue, recognizes control words from the output of the phoneme/syllable recognition unit, and decides the next operation; keyboard unit 13 is used to input information that cannot be input by voice; voice synthesis unit 7 synthesizes a voice signal from a phoneme/syllable lattice; voice output unit 8 converts the output of the voice synthesis unit into an acoustic signal; and word dictionary 10 stores the phonetic symbols and meanings of words.

[0004] As an example, consider a schedule management system built on the voice interaction device described above. First, dialogue control unit 12 sends the text "Your schedule, please" to voice synthesis unit 7, and voice synthesis unit 7 and voice output unit 8 ask "Your schedule, please" in synthesized speech. When the user replies "13:00 meeting", voice input unit 1 and voice analysis unit 2 convert the utterance into voice feature information, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. Dialogue control unit 12 recognizes the words "13:00" and "meeting" by referring to word dictionary 10 and, as its next operation, sends the text "Please enter the other party to the meeting on the keyboard" to voice synthesis unit 7. Voice synthesis unit 7 and voice output unit 8 ask this in synthesized speech. When the user types "Matsushita" on keyboard unit 13, dialogue control unit 12 generates the text "A meeting with Matsushita-san at 13:00, correct?", and when the user says "Yes", the dialogue control unit stores the content "13:00, meeting with Matsushita-san." in the database.

[0005] The problem with this conventional example is that a proper noun such as "Matsushita" is usually absent from word dictionary 10 and therefore cannot be recognized by voice, so its input depends on a keyboard or similar device. In other words, the number of words a person can currently use by voice is limited, and it is difficult for a machine to learn new words through dialogue as a human does. Voice interaction technology is therefore applicable only where control is possible with a limited vocabulary; in all other cases an input means such as a keyboard is required.

[0006]

[Problems to Be Solved by the Invention] As explained for the conventional example above, in current voice interaction devices the number of words a person can use by voice is limited, and it is difficult for a machine to learn new words through dialogue as a human does. An input means such as a keyboard is therefore required for entering proper nouns such as names.

[0007] In view of the above problem of conventional voice interaction devices, an object of the present invention is to provide a voice interaction device in which a machine can learn new words through dialogue as a human does.

[0008]

[Means for Solving the Problems] The voice interaction device according to the present invention comprises: a voice input unit that converts voice into an electric signal; a voice analysis unit that converts the voice signal from the voice input unit into voice feature information such as a frequency spectrum or cepstrum by FFT or LPC analysis; a phoneme/syllable recognition unit that performs speech recognition based on the output of the voice analysis unit and outputs a phoneme or syllable string; a phoneme/syllable lattice candidate storage unit that stores the output of the phoneme/syllable recognition unit; a voice synthesis unit that synthesizes a voice signal from a phoneme/syllable lattice; a voice output unit that converts the output of the voice synthesis unit into an acoustic signal; a word dictionary that stores the phonetic symbols and meanings of words; and a dialogue control unit that manages the voice dialogue. When a control word such as "yes" or "no" is expected, the dialogue control unit recognizes the control word from the output of the phoneme/syllable recognition unit and decides the next operation. When a new word that is not a control word is expected for the first time, it selects the phoneme/syllable lattice with the highest likelihood from the contents of the phoneme/syllable lattice candidate storage unit and outputs it to the voice synthesis unit. When a new word is expected to be restated, it compares the previous recognition result held in the phoneme/syllable lattice candidate storage unit with the latest recognition result output by the phoneme/syllable recognition unit, replaces the phoneme/syllable or phoneme group/syllable group that differs most with a candidate different from the previous one, and outputs the result to the voice synthesis unit. When an affirmative word such as "yes" is input as a control word, it judges the synthesized speech output immediately before to be correct and stores that phoneme/syllable lattice in the word dictionary.

[0009] The present invention further comprises: a power analysis unit that is connected to the voice input unit and analyzes the power of the input voice per unit time; a power analysis result storage unit that stores the analysis result; a power comparison unit that compares the previous analysis result held in the power analysis result storage unit with the latest analysis result output by the power analysis unit; and a dialogue control unit that, when a new word is expected to be restated, uses the output of the power comparison unit to replace the phoneme/syllable or phoneme group/syllable group whose power value differs most with a candidate different from the previous one, and outputs the result to the voice synthesis unit.
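The text does not state how the power per unit time is computed. As a minimal sketch only, assuming that "power" means the mean squared amplitude over fixed-length, non-overlapping frames, the analysis could look like the following. The function name `frame_power` and the parameter `frame_len` are illustrative and do not appear in the patent.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame.

    Assumption: 'power per unit time' is taken as the average of the
    squared sample values in each frame; trailing samples that do not
    fill a whole frame are ignored.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers
```

The resulting list is the kind of value a power analysis result storage unit would hold for one utterance, so that it can later be compared with the list obtained from the restated utterance.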

[0010] The present invention further comprises: a pitch analysis unit that is connected to the voice input unit and analyzes the pitch frequency of the input voice; a pitch analysis result storage unit that stores the analysis result; a pitch comparison unit that compares the previous analysis result held in the pitch analysis result storage unit with the latest analysis result output by the pitch analysis unit; and a dialogue control unit that, when a new word is expected to be restated, uses the output of the pitch comparison unit to replace the phoneme/syllable or phoneme group/syllable group whose pitch frequency differs most with a candidate different from the previous one, and outputs the result to the voice synthesis unit.
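The pitch analysis method itself is not specified here. One common approach, shown purely as an assumed illustration and not as the method of the invention, is to pick the lag that maximizes the autocorrelation of the signal. The function name `estimate_pitch_hz`, the search range of 50 to 500 Hz, and the test tone are all hypothetical.

```python
import math


def estimate_pitch_hz(samples, sample_rate, min_hz=50.0, max_hz=500.0):
    """Estimate pitch as sample_rate / (lag with the largest autocorrelation).

    Assumption: a plain, unnormalized autocorrelation search over the lags
    that correspond to min_hz..max_hz. Returns None if no positive
    correlation is found.
    """
    min_lag = max(int(sample_rate / max_hz), 1)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Hypothetical input: 0.1 s of a 100 Hz sine sampled at 8000 Hz.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
pitch = estimate_pitch_hz(tone, 8000)
```

Applied per syllable or per frame, such estimates would give the values that a pitch comparison unit compares between the previous and the restated utterance.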

[0011]

[Operation] By comprising a voice input unit, a voice analysis unit, a phoneme/syllable recognition unit, a phoneme/syllable lattice candidate storage unit, a voice synthesis unit, a voice output unit, a word dictionary, and a dialogue control unit, the present invention can, even when a new word is input, convert it into a phoneme or syllable lattice candidate string, resynthesize it for the user, and request confirmation. If the result is wrong, the user is asked to speak again and a recognition candidate different from the previous one can be presented. Once the user confirms, the phoneme or syllable string is newly written to the word dictionary, which makes acquisition of new words possible.

[0012] Furthermore, by comprising a power analysis unit that analyzes the power of the input voice per unit time, a power analysis result storage unit, and a power comparison unit that compares the previous analysis result with the latest analysis result, the device can, when a new word is expected to be restated, use the output of the power comparison unit to replace the phoneme/syllable or phoneme group/syllable group whose power value differs most with a candidate different from the previous one, which makes more accurate correction of the candidate possible.

[0013] Furthermore, by comprising a pitch analysis unit that analyzes the pitch frequency of the input voice, a pitch analysis result storage unit that stores the analysis result, and a pitch comparison unit that compares the previous analysis result with the latest analysis result, the device can, when a new word is expected to be restated, use the output of the pitch comparison unit to replace the phoneme/syllable or phoneme group/syllable group whose pitch frequency differs most with a candidate different from the previous one, which makes more accurate correction of the candidate possible.

[0014]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0015] FIG. 1 shows the configuration of a voice interaction device according to one embodiment of the present invention. In the figure, voice input unit 1 converts voice into an electric signal. Voice analysis unit 2 converts the voice signal from voice input unit 1 into voice feature information such as a frequency spectrum or cepstrum by FFT or LPC analysis. Phoneme/syllable recognition unit 3 performs speech recognition based on the output of voice analysis unit 2 and outputs a phoneme or syllable string. Phoneme/syllable lattice candidate storage unit 4 stores the output of phoneme/syllable recognition unit 3. Voice synthesis unit 5 synthesizes a voice signal from a phoneme/syllable lattice. Voice output unit 6 converts the output of voice synthesis unit 5 into an acoustic signal. Word dictionary 8 stores the phonetic symbols and meanings of words. Dialogue control unit 7 manages the voice dialogue.

[0016] The operation of the voice interaction device of this embodiment, configured as described above, is explained below.

[0017] As in the conventional example, consider a schedule management system built on the voice interaction device described above. First, dialogue control unit 7 according to the present invention sends the text "Your schedule, please" to voice synthesis unit 5, and voice synthesis unit 5 and voice output unit 6 ask "Your schedule, please" in synthesized speech. When the user replies "13:00 meeting", voice input unit 1 and voice analysis unit 2 convert the utterance into voice feature information, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. Dialogue control unit 7 recognizes the words "13:00" and "meeting" by referring to word dictionary 8 and, as its next operation, sends the text "The other party to the meeting, please." to voice synthesis unit 5. Voice synthesis unit 5 and voice output unit 6 ask "The other party to the meeting, please" in synthesized speech. When the user says "Matsushita", voice input unit 1 and voice analysis unit 2 convert the utterance "Matsushita" into voice feature information, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. At this point dialogue control unit 7 knows from the progress of the dialogue that the word "Matsushita" is a person's name and is not in word dictionary 8. Dialogue control unit 7 confirms that phoneme/syllable lattice candidate storage unit 4 is empty, stores the result of phoneme/syllable recognition unit 3 in phoneme/syllable lattice candidate storage unit 4, and at the same time uses the candidate with the highest likelihood in the recognition candidate lattice to generate a text such as "A meeting with Matsushika-san at 13:00, correct?". That is, the top candidate in this case was "Matsushika". When the user says "Matsushita" again, voice input unit 1 and voice analysis unit 2 convert the utterance into voice feature information as before, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. Dialogue control unit 7 searches the result of phoneme/syllable recognition unit 3 for the candidate with the highest likelihood other than "Matsushika". Assume here that this candidate is "Matsushita". Dialogue control unit 7 again generates a text, this time "A meeting with Matsushita-san at 13:00, correct?". When the user says "Yes", dialogue control unit 7 stores the content "13:00, meeting with Matsushita-san." in the database and at the same time stores the word "Matsushita" in word dictionary 8. The system also understands that the word "Matsushita" is a person's name and writes that fact in word dictionary 8 as well.
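The confirm-and-retry exchange described in this paragraph can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: each utterance is reduced to a flat list of `(word, likelihood)` candidates instead of a phoneme/syllable lattice, the likelihood values are invented, and the names `acquire_new_word`, `confirm`, and `category` are hypothetical.

```python
def acquire_new_word(utterances, confirm, dictionary, category="person name"):
    """Present the best not-yet-rejected candidate until the user says yes.

    utterances: iterable of candidate lists [(word, likelihood), ...],
                one list per utterance of the unknown word (assumed format).
    confirm:    callable(word) -> bool standing in for the spoken yes/no.
    dictionary: dict acting as the word dictionary; updated on success.
    """
    rejected = set()
    for candidates in utterances:
        remaining = [c for c in candidates if c[0] not in rejected]
        if not remaining:
            continue
        word, _ = max(remaining, key=lambda c: c[1])
        if confirm(word):
            # Store the word together with what the dialogue context implies.
            dictionary[word] = category
            return word
        rejected.add(word)
    return None


# Hypothetical likelihoods for the two utterances of "Matsushita".
first = [("matsushika", 0.60), ("matsushita", 0.50)]
second = [("matsushika", 0.55), ("matsushita", 0.50)]
word_dictionary = {}
learned = acquire_new_word([first, second],
                           lambda w: w == "matsushita",
                           word_dictionary)
```

In this sketch the first prompt offers "matsushika", which is rejected, and the second offers the best remaining candidate, mirroring the search for the most likely candidate other than "Matsushika".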

[0018] In this way, the new word "Matsushita" has been acquired through voice dialogue. Next, a second embodiment of the present invention is described with reference to FIG. 2. In FIG. 2, power analysis unit 9 analyzes the power of the input voice per unit time. Power analysis result storage unit 10 stores the analysis result of power analysis unit 9. Power comparison unit 11 compares the previous analysis result with the latest analysis result.

[0019] The operation of the voice interaction device of the second embodiment of the present invention, configured as described above, is explained below.

[0020] As in the first embodiment, consider a schedule management system built on the voice interaction device described above. First, dialogue control unit 7 according to the present invention sends the text "Your schedule, please" to voice synthesis unit 5, and voice synthesis unit 5 and voice output unit 6 ask "Your schedule, please" in synthesized speech. When the user replies "13:00 meeting", voice input unit 1 and voice analysis unit 2 convert the utterance into voice feature information, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. Dialogue control unit 7 recognizes the words "13:00" and "meeting" by referring to word dictionary 8 and, as its next operation, sends the text "The other party to the meeting, please." to voice synthesis unit 5. Voice synthesis unit 5 and voice output unit 6 ask "The other party to the meeting, please" in synthesized speech. When the user says "Matsushita", voice input unit 1 and voice analysis unit 2 convert the utterance "Matsushita" into voice feature information, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. At the same time, power analysis unit 9 analyzes the power values and stores them in power analysis result storage unit 10. At this point dialogue control unit 7 knows from the progress of the dialogue that the word "Matsushita" is a person's name and is not in word dictionary 8. Dialogue control unit 7 confirms that phoneme/syllable lattice candidate storage unit 4 is empty, stores the result of phoneme/syllable recognition unit 3 in phoneme/syllable lattice candidate storage unit 4, and at the same time uses the candidate with the highest likelihood in the recognition candidate lattice to generate a text such as "A meeting with Matsushika-san at 13:00, correct?". That is, the top candidate in this case was "Matsushika". When the user says "Matsushita" again, voice input unit 1 and voice analysis unit 2 convert the utterance into voice feature information as before, and phoneme/syllable recognition unit 3 converts it into a phoneme/syllable lattice. Power analysis unit 9 also analyzes the power values again, and power comparison unit 11 compares the output of the power analysis unit with the previous values held in power analysis result storage unit 10 and sends the result to dialogue control unit 7. Using the result of phoneme/syllable recognition unit 3 and the comparison result of power comparison unit 11, dialogue control unit 7 replaces the syllable whose power value differs most and searches for the candidate with the highest likelihood other than "Matsushika". FIG. 4 is an example showing this difference in power. Here the difference is most pronounced at the "ta" portion, so dialogue control unit 7 replaces the "ka" portion of "Matsushika" with the next candidate. Assume here that the result is "Matsushita". Dialogue control unit 7 again generates a text, this time "A meeting with Matsushita-san at 13:00, correct?". When the user says "Yes", dialogue control unit 7 stores the content "13:00, meeting with Matsushita-san." in the database and at the same time stores the word "Matsushita" in word dictionary 8. The system also understands that the word "Matsushita" is a person's name and writes that fact in word dictionary 8 as well.
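The power-guided replacement step can be illustrated with a short sketch. It assumes, beyond what the text states, that the two utterances are already aligned syllable by syllable, that one power value is available per syllable, and that each position has its own list of alternative syllables in descending likelihood. The function name and all numeric values are hypothetical.

```python
def replace_by_power_difference(previous, prev_power, curr_power, alternatives):
    """Replace the syllable whose power changed most between two utterances.

    previous:     syllables presented to the user last time.
    prev_power:   per-syllable power of the first utterance (assumed aligned).
    curr_power:   per-syllable power of the restated utterance.
    alternatives: per position, candidate syllables in descending likelihood.
    """
    diffs = [abs(c - p) for p, c in zip(prev_power, curr_power)]
    pos = max(range(len(diffs)), key=diffs.__getitem__)
    for syllable in alternatives[pos]:
        if syllable != previous[pos]:
            revised = list(previous)
            revised[pos] = syllable
            return revised
    return list(previous)  # no alternative available at that position


# Illustrative values only: the user stresses the last syllable when restating.
revised = replace_by_power_difference(
    ["ma", "tsu", "shi", "ka"],
    [1.0, 0.8, 0.9, 0.5],
    [1.0, 0.9, 0.9, 1.4],
    [["ma"], ["tsu"], ["shi", "chi"], ["ka", "ta"]],
)
```

The idea being illustrated is that a speaker tends to emphasize the part that was misrecognized, so the position with the largest power change is the one most worth replacing.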

[0021] By using power information in this way, the new word "Matsushita" can be determined more accurately.

[0022] The same can be achieved using pitch information instead of power, which gives the third embodiment of the present invention shown in FIG. 3. In FIG. 3, pitch analysis unit 12 analyzes the pitch of the input voice. Pitch analysis result storage unit 13 stores the analysis result of pitch analysis unit 12. Pitch comparison unit 14 compares the previous analysis result with the latest analysis result.

[0023] The voice interaction device of the third embodiment of the present invention, configured as described above, operates in the same way as the second embodiment.

[0024] Combining both pitch and power makes still more accurate estimation possible.

[0025]

[Effects of the Invention] As described above, according to the present invention, providing phoneme/syllable recognition means and dialogue control means that repeatedly synthesizes the estimated phoneme/syllable lattice as speech and asks the user to confirm it enables a machine to acquire new words through voice dialogue. In addition, comparing the power values of the restated word and the previously uttered word and replacing the appropriate syllable candidate makes more accurate candidate estimation possible. Furthermore, using pitch information, or both pitch and power information, allows still more accurate estimation, so that unknown words can be acquired more smoothly.

[Brief description of drawings]

FIG. 1 is a block diagram of a voice interaction device according to a first embodiment of the present invention.

FIG. 2 is a block diagram of a voice interaction device according to a second embodiment of the present invention.

FIG. 3 is a block diagram of a voice interaction device according to a third embodiment of the present invention.

FIG. 4 is a diagram showing the difference in power between two utterances of the word "Matsushita".

FIG. 5 is a block diagram of a conventional voice interaction device.

[Explanation of symbols]

1 Voice input unit (means)
2 Voice analysis unit (means)
3 Phoneme/syllable recognition unit (means)
4 Phoneme/syllable lattice candidate storage unit (means)
5 Voice synthesis unit (means)
6 Voice output unit (means)
7 Dialogue control unit (means)
8 Word dictionary
9 Power analysis unit (means)
10 Power analysis result storage unit (means)
11 Power comparison unit (means)
12 Pitch analysis unit (means)
13 Pitch analysis result storage unit (means)
14 Pitch comparison unit (means)

Continuation of front page

(56) References: JP-A-63-316899 (JP, A); JP-A-4-46398 (JP, A); JP-A-59-192292 (JP, A); JP-A-2-265000 (JP, A); JP-A-2-124599 (JP, A); JP-A-5-27792 (JP, A); JP-A-2-255942 (JP, A); JP-B-6-79234 (JP, B2)

(58) Fields investigated (Int. Cl.7, DB name): G10L 13/08; G10L 13/00

Claims

(57) [Claims]

1. A voice interaction device comprising: a voice input unit that converts voice into an electric signal; a voice analysis unit that converts the electric signal from the voice input unit into voice feature amounts; a voice recognition unit that outputs a phoneme sequence or syllable sequence using the voice feature amounts converted by the voice analysis unit; a word dictionary unit that stores phoneme sequences or syllable sequences of recognition target words; a dialogue control unit that manages the progress of a dialogue and outputs responses; a voice synthesis unit that synthesizes the response output by the dialogue control unit into a voice signal; and a voice output unit that converts the voice signal synthesized by the voice synthesis unit into an acoustic signal and outputs it, wherein the dialogue control unit: outputs a response including the phoneme sequence or syllable sequence when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit does not exist in the word dictionary unit; when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a restatement, compares the previous phoneme sequence or syllable sequence with the current phoneme sequence or syllable sequence and outputs a response including a phoneme sequence or syllable sequence in which the differing phonemes or syllables have been replaced; and, when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a word meaning affirmation, stores the phoneme sequence or syllable sequence output immediately before in the word dictionary unit.

2. A voice interaction device comprising: a voice input unit that converts voice into an electric signal; a voice analysis unit that converts the electric signal from the voice input unit into voice feature amounts; a voice recognition unit that outputs a phoneme sequence or syllable sequence using the voice feature amounts converted by the voice analysis unit; a word dictionary unit that stores phoneme sequences or syllable sequences of recognition target words; a dialogue control unit that manages the progress of a dialogue and outputs responses; a voice synthesis unit that synthesizes the response output by the dialogue control unit into a voice signal; and a voice output unit that converts the voice signal synthesized by the voice synthesis unit into an acoustic signal and outputs it, wherein the dialogue control unit, while recognizing the progress of the dialogue: outputs a response including the phoneme sequence or syllable sequence when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit does not exist in the word dictionary unit; when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a restatement, compares the previous phoneme sequence or syllable sequence with the current phoneme sequence or syllable sequence and outputs a response including a phoneme sequence or syllable sequence in which the differing phonemes or syllables have been replaced; and, when it determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a word meaning affirmation, stores the phoneme sequence or syllable sequence output immediately before in the word dictionary unit.

3. The voice interaction device according to claim 1, further comprising, in addition to the above configuration, a power analysis unit that analyzes power per unit time from the voice signal from the voice input unit, wherein, when the dialogue control unit determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a restatement, it compares the previous power analysis result with the current power analysis result and outputs a response including a phoneme sequence or syllable sequence in which the phonemes or syllables whose power values differ have been replaced.

4. The voice interaction device according to claim 1, further comprising, in addition to the above configuration, a pitch analysis unit that analyzes pitch frequency from the voice signal from the voice input unit, wherein, when the dialogue control unit determines that the phoneme sequence or syllable sequence output by the voice recognition unit is a restatement, it compares the previous pitch frequency analysis result with the current pitch frequency analysis result and outputs a response including a phoneme sequence or syllable sequence in which the phonemes or syllables whose pitch frequencies differ have been replaced.

5. A voice interaction method in a voice interaction device comprising: a voice input unit that converts voice into an electric signal; a voice analysis unit that converts the electric signal from the voice input unit into voice feature amounts; a voice recognition unit that outputs a phoneme sequence or syllable sequence using the voice feature amounts converted by the voice analysis unit; a word dictionary unit that stores phoneme sequences or syllable sequences of recognition target words; a dialogue control unit that manages the progress of a dialogue and outputs responses; a voice synthesis unit that synthesizes the response output by the dialogue control unit into a voice signal; and a voice output unit that converts the voice signal synthesized by the voice synthesis unit into an acoustic signal and outputs it, the method comprising, while recognizing the progress of the dialogue: outputting a response including the phoneme sequence or syllable sequence when it is determined that the phoneme sequence or syllable sequence output by the voice recognition unit does not exist in the word dictionary unit; when it is determined that the phoneme sequence or syllable sequence output by the voice recognition unit is a restatement, comparing the previous phoneme sequence or syllable sequence with the current phoneme sequence or syllable sequence and outputting a response including a phoneme sequence or syllable sequence in which the differing phonemes or syllables have been replaced; and, when it is determined that the phoneme sequence or syllable sequence output by the voice recognition unit is a word meaning affirmation, storing the phoneme sequence or syllable sequence output immediately before in the word dictionary unit.