JP6801587B2

JP6801587B2 - Voice dialogue device

Info

Publication number: JP6801587B2
Application number: JP2017104766A
Authority: JP
Inventors: 達朗堀; 生聖渡部
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-05-26
Filing date: 2017-05-26
Publication date: 2020-12-16
Anticipated expiration: 2037-05-26
Also published as: JP2018200386A

Description

The present invention relates to a voice dialogue device.

Patent Document 1 describes extracting prosodic features from the voice waveform of a user's utterance and determining, based on those prosodic features, whether or not the voice dialogue system has the right to speak.

Japanese Unexamined Patent Publication No. 2016-038501

The voice dialogue system described in Patent Document 1 may output a filler (a back-channel response, or aizuchi) to assert that it has the right to speak even when it is unclear whether or not it actually has that right. However, if the voice dialogue system does not in fact have the right to speak, outputting the filler may interrupt the user's utterance.

The present invention has been made to solve this problem, and its object is to provide a voice dialogue device capable of reducing the possibility of interrupting a user's utterance.

The voice dialogue device according to the present invention includes an input unit that inputs a user's voice and an output unit that outputs voice to the user. The voice dialogue device further includes a user voice analysis unit that extracts a feature amount from at least the voice waveform of the voice input by the input unit, and a sound pressure level determination unit that, when the output unit outputs a filler to the user, calculates a value indicating the reliability of outputting the filler based on the feature amount extracted by the user voice analysis unit and determines the sound pressure level of the filler based on that value.

According to the voice dialogue device of the present invention, the sound pressure level determination unit calculates a value indicating the reliability of outputting the filler based on the feature amount extracted by the user voice analysis unit, and determines the sound pressure level of the filler based on that value. The output unit can therefore output the filler at a sound pressure level corresponding to the value indicating the reliability of outputting the filler. The reliability of outputting the filler is positively correlated with the likelihood that the voice dialogue device has the right to speak. That is, even when it is unclear whether or not the voice dialogue device has the right to speak, the output unit can output the filler at a sound pressure level corresponding to the likelihood that the device has that right. Consequently, when the voice dialogue device is unlikely to have the right to speak, the output unit outputs the filler at a low sound pressure level, which makes the filler less likely to interrupt the user's utterance. This makes it possible to provide a voice dialogue device capable of reducing the possibility of interrupting the user's utterance.

FIG. 1 is a block diagram showing the schematic configuration of the robot according to Embodiment 1 of the present invention.
FIG. 2 is a graph showing the end portion of a user's utterance section according to Embodiment 1 of the present invention.
FIG. 3 is a table showing an example of the elements of the feature vector according to Embodiment 1 of the present invention.
FIG. 4 is a diagram explaining the creation of subsets when creating the determination model by offline learning according to Embodiment 1 of the present invention.
FIG. 5 is a diagram explaining the generation of branch function candidates when creating the determination model by offline learning according to Embodiment 1 of the present invention.
FIG. 6 is a diagram explaining the determination of branch function candidates when creating the determination model by offline learning according to Embodiment 1 of the present invention.
FIG. 7 is a diagram explaining how the reliability of outputting a filler is determined using the determination model according to Embodiment 1 of the present invention.
FIG. 8 is an example of a graph showing the relationship between the user utterance length, which is one element of the feature vector according to Embodiment 1 of the present invention, and the frequency of silence or filler.
FIG. 9 is an example of a table showing the relationship between the user utterance length, which is one element of the feature vector according to Embodiment 1 of the present invention, and the ratio at which a filler is output.
FIG. 10 is a diagram showing an example of a conversation between a user and the robot according to Example 1 of the present invention.
FIG. 11 is a table showing an example of the elements of the feature vector according to Example 1 of the present invention.
FIG. 12 is a graph showing the voice waveform of the filler output by the robot in the conversation example shown in FIG. 10.

Embodiment 1
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of a robot 100 serving as the voice dialogue device according to Embodiment 1 of the present invention. As shown in FIG. 1, the robot 100 according to Embodiment 1 includes an input unit 110, an output unit 120, and a control unit 130. The control unit 130 includes a user voice analysis unit 131, a sound pressure level determination unit 132, and the like. The sound pressure level determination unit 132 includes a method selection unit 132A, a sound pressure level setting parameter calculation unit 132B, a learning database (learning DB) 132C, an offline learning unit 132D, a determination model and conditional branching expression database (determination model and conditional branching expression DB) 132E, and a voice synthesis unit 132F. In response to the user's utterances, the robot 100 outputs either a voice response or a filler. Here, an utterance is a voice that carries meaning as dialogue content. A filler is a back-channel response (aizuchi), that is, a connecting sound uttered between one utterance of the user and the next.

The input unit 110 includes a microphone or the like, collects the user's voice, and inputs it to the user voice analysis unit 131.

The output unit 120 includes a speaker or the like, and outputs from the robot 100 to the user either a voice that constitutes an utterance or a filler. Specifically, it outputs the voice synthesized by the voice synthesis unit 132F described later. In the present embodiment, the voice synthesis unit 132F synthesizes a filler at a sound pressure level based on the sound pressure level setting parameter Iv input from the sound pressure level determination unit 132, and outputs the filler to the output unit 120. The details of the voice synthesis unit 132F will be described later.

The control unit 130 includes a CPU (not shown), a storage unit (not shown), and the like. All processing in the control unit 130 is realized by the CPU executing a program stored in the storage unit.
The program stored in each storage unit of the control unit 130 includes code that, when executed by the CPU, realizes the processing in each part of the control unit 130. The storage unit is configured to include, for example, any storage device capable of storing this program and the various information used for processing in the control unit 130. The storage device is, for example, a memory.

Specifically, when the CPU executes the program stored in the storage unit, the control unit 130 functions as the user voice analysis unit 131 and the sound pressure level determination unit 132. The storage unit also stores the learning database 132C, the determination model and conditional branching expression database 132E, and the like.

The user voice analysis unit 131 extracts feature amounts from the voice waveform of the voice input by the input unit 110. The user voice analysis unit 131 also extracts feature amounts from the history information of the robot 100's responses to the user (the past history of device responses). The user voice analysis unit 131 then generates a feature vector from the feature amounts extracted from the user's voice waveform and from the response history information, and outputs the feature vector to the sound pressure level determination unit 132.

Specifically, the user voice analysis unit 131 divides the voice waveform of the voice input by the input unit 110 into one or more "utterance sections". Here, an "utterance section" means the section from the beginning to the end of a user's utterance, and the user voice analysis unit 131 determines where each utterance section begins and ends based on the sound pressure of the user's utterance. FIG. 2 is a graph showing the end portion of a user's utterance section, where the vertical axis represents sound pressure (dB) and the horizontal axis represents time. For example, as shown in FIG. 2, when the number of times the sound pressure again exceeds the sound pressure threshold and becomes zero within a fixed time T from the point at which the sound pressure of the user's voice falls below the predetermined sound pressure threshold is N or fewer, the user voice analysis unit 131 detects that point (the point at which the sound pressure of the user's voice fell below the predetermined sound pressure threshold) as the end of the utterance section. In FIG. 2, T is 400 msec (milliseconds) and N is 0, but the values of T and N are set as appropriate for the experimental subject and environment.
Similarly, when the number of times the sound pressure again falls below the sound pressure threshold and becomes zero within a fixed time T_2 from the point at which the sound pressure of the user's voice rises above the predetermined sound pressure threshold is N_2 or more, the user voice analysis unit 131 detects that point (the point at which the sound pressure of the user's voice rose above the predetermined sound pressure threshold) as the start of the utterance section. The values of T_2 and N_2 are likewise set as appropriate for the experimental subject and environment.
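The end-of-utterance rule described above can be illustrated with a minimal sketch. It assumes the sound pressure has already been reduced to one level in dB per fixed-length frame, and it simplifies the counting condition to the number of times the level rises back to or above the threshold within the window T. The function name, the frame length, and that simplification are illustrative assumptions, not details taken from the patent.

```python
def find_utterance_end(levels_db, threshold_db, frame_ms=10,
                       window_ms=400, max_recrossings=0):
    """Return the index of the frame judged to end an utterance section.

    levels_db       -- per-frame sound pressure levels in dB (assumed input format)
    threshold_db    -- the predetermined sound pressure threshold
    window_ms       -- the fixed time T (400 msec in FIG. 2)
    max_recrossings -- the count N (0 in FIG. 2)
    Returns None when no end point can be confirmed.
    """
    window = window_ms // frame_ms
    # Only frames with a complete window of following frames can be confirmed.
    for i in range(1, len(levels_db) - window):
        dropped = levels_db[i - 1] >= threshold_db and levels_db[i] < threshold_db
        if not dropped:
            continue
        segment = levels_db[i:i + window + 1]
        # Count upward crossings of the threshold inside the window.
        recrossings = sum(
            1 for prev, cur in zip(segment, segment[1:])
            if prev < threshold_db <= cur
        )
        if recrossings <= max_recrossings:
            return i
    return None
```

With T = 400 msec and N = 0, a short dip followed by renewed speech is skipped, and only a drop followed by a full quiet window is reported as the end of the section.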

The user voice analysis unit 131 then extracts feature amounts from the voice waveform of the i-th utterance section (i is an integer of 1 or more). The user voice analysis unit 131 generates a feature vector vi from the feature amounts extracted from the voice waveform of the i-th utterance section and from the history information of responses to the user, and outputs the feature vector vi to the sound pressure level determination unit 132.

FIG. 3 shows an example of the feature vector vi generated by the user voice analysis unit 131. Specifically, FIG. 3 shows each element of the feature vector vi and the value of that element. As shown in FIG. 3, the elements of the feature vector vi fall broadly into those belonging to the "i-th user utterance information" and those belonging to the "past history of device responses". In the example shown in FIG. 3, the elements belonging to the "i-th user utterance information" include "phrase end ○ msec", "whole utterance section", and "user utterance length". The elements belonging to the "past history of device responses" include "duration of the same topic", "previous system utterance type", and "previous system question type". In FIG. 3, the value of each element is shown in the column enclosed by the thick frame. Also, in FIG. 3, "system" refers to the robot 100.

In FIG. 3, "phrase end ○ msec" means the portion of the utterance from ○ msec (○ milliseconds) before the end of the user's utterance section up to that end. In FIG. 3, the fundamental frequency f0 and the volume of the voice waveform of "phrase end ○ msec" are listed as elements of the feature vector vi. The mean, variance, slope of increase or decrease, maximum, and the like of the fundamental frequency f0 and of the volume of the "phrase end ○ msec" voice waveform are used as the values of these elements. The volume is the loudness (dB) of the user's utterance input from the input unit 110. The user voice analysis unit 131 normalizes the mean, variance, slope, and maximum of the fundamental frequency f0 and of the volume for each user.

Also in FIG. 3, the fundamental frequency f0 and the volume of the voice waveform of the "whole utterance section" are listed as elements of the feature vector vi. The mean, variance, slope of increase or decrease, maximum, and the like of the fundamental frequency f0 and of the volume of the "whole utterance section" voice waveform are used as the values of these elements. The user voice analysis unit 131 normalizes the mean, variance, slope, and maximum of the fundamental frequency f0 and of the volume for each user.
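As a rough illustration of the statistics named above, the following sketch computes the mean, variance, slope, and maximum of a contour such as per-frame f0 or volume. The slope is taken as the least-squares slope against the frame index, and per-user normalization is shown as a z-score. Both choices, along with the function names, are assumptions made for illustration, since the text does not specify how the slope or the normalization is computed.

```python
from statistics import fmean, pvariance


def contour_stats(values):
    """Summarize a contour (e.g. per-frame f0 or volume) as four statistics."""
    n = len(values)
    mean = fmean(values)
    variance = pvariance(values, mu=mean)
    # Least-squares slope of the values against their frame index (assumed definition).
    x_mean = (n - 1) / 2
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = (
        sum((x - x_mean) * (y - mean) for x, y in enumerate(values)) / denom
        if denom else 0.0
    )
    return {"mean": mean, "variance": variance, "slope": slope, "max": max(values)}


def normalize_per_user(value, user_mean, user_std):
    """Z-score a statistic against one user's own history (assumed scheme)."""
    return (value - user_mean) / user_std if user_std else 0.0
```

The same helper would apply to both the "phrase end ○ msec" portion and the whole utterance section by passing the corresponding slice of the contour.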

Also in FIG. 3, "user utterance length" is listed as an element of the feature vector vi. Its value is the numerical length of the user's utterance in seconds. The "user utterance length" is the length (time in seconds) of the utterance section determined by the user voice analysis unit 131 using the method described above.

Also in FIG. 3, "duration of the same topic" is listed as an element of the feature vector vi. Its value is the numerical duration of the same topic in seconds. The "duration of the same topic" is, for example, the time from when the robot 100 last output a "next topic guidance" voice until the robot 100 outputs a "next topic guidance" voice this time. The robot 100 outputs a "next topic guidance" voice when, for example, the user has been silent for a predetermined time or longer, or when a predetermined time or longer has passed since it last output a "next topic guidance" voice.

Also in FIG. 3, "previous system utterance type" is listed as an element of the feature vector vi. Here, "system" refers to the robot 100. The value of this element indicates whether the previous system utterance was a "back-channel response" (aizuchi), a "listening response", a "question", or the like. These types are represented by discrete values such as "0", "1", and "2", respectively. In other words, the user voice analysis unit 131 quantifies feature amounts that are not inherently numerical, such as "back-channel response", "listening response", and "question", by representing them as discrete values.

Also in FIG. 3, "previous system question type" is listed as an element of the feature vector vi. The value of this element indicates whether the previous system question was a "deep-dive question", "next topic guidance", or the like. As above, these types are represented by discrete values such as "0" and "1", respectively. In other words, the user voice analysis unit 131 quantifies feature amounts that are not inherently numerical, such as "deep-dive question" and "next topic guidance", by representing them as discrete values.

The user voice analysis unit 131 then generates the feature vector vi from the elements and element values shown in FIG. 3. The feature vector vi generated from these elements and values is expressed as, for example, vi = (..., 2.4, ..., 20, 1, 1, ...).
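A minimal sketch of how such a vector might be assembled, with the non-numerical elements mapped to discrete values as described above. The dictionary keys, the function name, and the element order are hypothetical. The actual vector in FIG. 3 contains further elements in positions that the text elides.

```python
# Discrete codes as described in the text: 0, 1, 2 for utterance types
# and 0, 1 for question types. The key spellings are illustrative labels.
SYSTEM_UTTERANCE_TYPE = {"aizuchi": 0, "listening_response": 1, "question": 2}
SYSTEM_QUESTION_TYPE = {"deep_dive_question": 0, "next_topic_guidance": 1}


def build_feature_vector(prosodic, utterance_length_s, topic_duration_s,
                         last_utterance_type, last_question_type):
    """Concatenate numerical features with discretely coded categorical ones."""
    return [
        *prosodic,                                   # f0 and volume statistics
        utterance_length_s,                          # user utterance length (sec)
        topic_duration_s,                            # duration of the same topic (sec)
        SYSTEM_UTTERANCE_TYPE[last_utterance_type],  # previous system utterance type
        SYSTEM_QUESTION_TYPE[last_question_type],    # previous system question type
    ]
```

For instance, an utterance of 2.4 seconds, a topic duration of 20 seconds, a previous "listening response", and a previous "next topic guidance" question produce the trailing values 2.4, 20, 1, 1.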

When the output unit 120 outputs a filler to the user, the sound pressure level determination unit 132 calculates a value indicating the reliability of outputting the filler based on the feature amounts extracted by the user voice analysis unit 131, and determines the sound pressure level of the filler based on that value.
Specifically, the sound pressure level determination unit 132 calculates a value Iv indicating the reliability of outputting the filler based on the feature vector vi input from the user voice analysis unit 131, and uses that value Iv as the "sound pressure level setting parameter". In the present embodiment, the sound pressure level setting parameter Iv determined by the sound pressure level determination unit 132 satisfies 0.5 ≤ Iv ≤ 1. That is, when the value of Iv obtained by calculation is less than 0.5, the sound pressure level setting parameter calculation unit 132B described later sets the value of Iv to 0.5.
The sound pressure level determination unit 132 then outputs the determined sound pressure level setting parameter Iv to the voice synthesis unit 132F.
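The range constraint on Iv can be written as a small clamp. The lower bound of 0.5 follows the text. Capping at 1 is an assumption added here for safety, since the text states only that Iv satisfies Iv ≤ 1 without saying how larger values would be handled. The function name is hypothetical.

```python
def sound_pressure_level_parameter(reliability):
    """Map a raw reliability value to Iv in the range 0.5 <= Iv <= 1."""
    # Values below 0.5 are raised to 0.5, as stated in the text.
    # The upper cap at 1.0 is an assumption for robustness.
    return min(1.0, max(0.5, reliability))
```

A raw reliability of 0.3 therefore yields Iv = 0.5, while 0.8 passes through unchanged.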

More specifically, the sound pressure level determination unit 132 includes the method selection unit 132A, the sound pressure level setting parameter calculation unit 132B, the learning database (learning DB) 132C, the offline learning unit 132D, and the determination model and conditional branching expression database (determination model and conditional branching expression DB) 132E.

The method selection unit 132A selects the method by which the sound pressure level setting parameter calculation unit 132B calculates the sound pressure level setting parameter Iv. In the present embodiment, the sound pressure level setting parameter calculation unit 132B calculates Iv using one of two methods: a method that uses a determination model (hereinafter, the "first method"), or a method that uses a conditional branching expression created from some of the feature amounts of the feature vector vi (hereinafter, the "second method"). The method selection unit 132A therefore selects which of the first and second methods the sound pressure level setting parameter calculation unit 132B uses. Specifically, the method selection unit 132A makes this selection based on the amount of data used to calculate the sound pressure level setting parameter Iv, the specifications of the robot 100, and the like. The method selection unit 132A then outputs to the sound pressure level setting parameter calculation unit 132B an instruction indicating which of the first and second methods is to be used to calculate Iv.

In accordance with the instruction input from the method selection unit 132A, the sound pressure level setting parameter calculation unit 132B reads either the determination model or the conditional branching expression from the determination model and conditional branching expression database 132E. Using that determination model or conditional branching expression, the sound pressure level setting parameter calculation unit 132B calculates the sound pressure level setting parameter Iv based on the feature vector vi input from the user voice analysis unit 131. The details of the determination model and the conditional branching expression will be described later.

The learning database 132C stores the data needed to create the determination model and the conditional branching expression. Specifically, the learning database 132C stores voice data of simulated dialogues collected in advance. A feature vector and a teacher label are attached to each utterance contained in the voice data. More specifically, the voice waveform of the simulated dialogue is divided into one or more utterance sections by the method described above or the like, and a feature vector and a teacher label are attached to the voice waveform of each utterance section. Here, the feature vector attached to the i-th utterance section (i is an integer of 1 or more) is denoted vi, and the teacher label is denoted ci. That is, the learning database 132C stores the voice waveform of the i-th utterance section of the simulated dialogue collected in advance in association with the feature vector vi and the teacher label ci of that utterance section.

The offline learning unit 132D reads the data needed to create the determination model from the learning database 132C, creates the determination model, and outputs it to the determination model and conditional branching expression database 132E. The offline learning unit 132D likewise reads the data needed to create the conditional branching expression from the learning database 132C, creates the conditional branching expression, and outputs it to the determination model and conditional branching expression database 132E. The offline learning unit 132D creates the determination model and the conditional branching expression before the robot 100 actually begins a dialogue with the user, and the determination model and conditional branching expression database 132E stores them.

First, the creation of the determination model by the offline learning unit 132D will be described. Here, a method in which the offline learning unit 132D creates a random forest as the determination model is described. As shown in FIG. 4, the offline learning unit 132D reads the data needed to create the determination model from the learning database 132C and prepares a sample set S containing a plurality of samples. Specifically, the i-th sample is data that includes the voice waveform of the i-th utterance section, the feature vector vi attached to that utterance section, and the teacher label ci attached to that utterance section. In FIG. 4, hatched samples carry the teacher label ci (filler), and unhatched samples carry the teacher label ci (silence). The teacher label ci (filler) means that a filler is to be output, and the teacher label ci (silence) means that the device is to remain silent. As shown in FIG. 4, the offline learning unit 132D then randomly divides the sample set S into T subsets S_j (T is an integer of 1 or more, and j is an integer from 1 to T). Here, T corresponds to the number of decision trees in the random forest.
When the offline learning unit 132D divides the sample set S into the subsets S_1, S_2, ..., S_T, a single sample may be distributed to more than one subset, and some samples may not be distributed to any subset.
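The subset creation step can be sketched as independent random draws, one per decision tree, so that a sample may land in several subsets or in none. The subset size, the seed parameter, and the choice to draw without repetition inside a single subset are assumptions for illustration. The text does not state how large each subset is or whether a sample may repeat within one subset.

```python
import random


def make_subsets(samples, num_trees, subset_size, seed=None):
    """Draw one random subset S_j of the sample set S for each of T trees."""
    rng = random.Random(seed)
    # Each subset is drawn independently from the full sample set, so the
    # subsets may overlap and some samples may never be selected.
    return [rng.sample(samples, subset_size) for _ in range(num_trees)]
```

Calling `make_subsets(samples, num_trees=T, subset_size=n)` returns a list of T lists, one per decision tree.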

Next, the offline learning unit 132D generates a branch function for each node of the decision trees of the random forest. For example, as shown in FIG. 5, the offline learning unit 132D randomly selects, from the feature amount vectors v_i of the samples included in the subset S_j before branching, k combinations of a feature amount type and a threshold for that feature amount (k is an integer of 1 or more), which become the elements of the branch function f_k, and generates the branch function f_k. When the number of feature amounts included in the feature amount vector v_i is m (m is an integer of 1 or more), the number of candidates k desirably satisfies the following equation (1).
In the example shown in FIG. 5, the feature amount vector v_i includes 17 types of feature amounts (m = 17), so k is about 4. Accordingly, in FIG. 5, the offline learning unit 132D selects, for example, four combinations, namely the first feature amount type with threshold 0.4, the third feature amount type with threshold 0.3, the fifth feature amount type with threshold 0.6, and the 17th feature amount type with threshold 0.4, and generates the branch function f_k. In the example shown in FIG. 5, the generated branch function is represented by the following equation (2). In equation (2), x_m denotes the value of the m-th feature amount.
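The candidate selection can be sketched as below. Equation (1) is not reproduced in this text, so the sketch assumes the common random forest heuristic that k is roughly the square root of m, which agrees with the stated case of m = 17 giving k of about 4. The function names, the 1-based feature indices, and the threshold range of 0 to 1 are all illustrative assumptions.

```python
import math
import random


def num_candidates(m):
    """Assumed form of equation (1): k is roughly sqrt(m), at least 1."""
    return max(1, round(math.sqrt(m)))


def sample_candidates(m, k, seed=0):
    """Randomly pick k (feature index, threshold) pairs.

    Feature indices are 1-based to match the text. Drawing thresholds
    uniformly from [0, 1] is an assumption for illustration only.
    """
    rng = random.Random(seed)
    feature_ids = sorted(rng.sample(range(1, m + 1), k))
    return [(f, round(rng.uniform(0.0, 1.0), 1)) for f in feature_ids]


k = num_candidates(17)
candidates = sample_candidates(17, k)
```

With m = 17, `num_candidates` returns 4, matching the four combinations chosen in FIG. 5.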

Next, the offline learning unit 132D calculates the entropy of each node of the T decision trees of the random forest and calculates the information gain I_j. For example, in the example shown in FIG. 6, the entropy H(S_j) of the parent node holding the subset S_j is represented by the following equation (3),
the entropy H(S^L_1) of the left branch and the entropy H(S^R_1) of the right branch of the candidate 1 node are represented by the following equations (4) and (5),
and the entropy H(S^L_2) of the left branch and the entropy H(S^R_2) of the right branch of the candidate 2 node are represented by the following equations (6) and (7).
The information gain I_j of each terminal node is represented by the following equation (8). In equation (8), H(S_j) denotes the entropy before branching (that is, the entropy of the parent node), H(S_l) denotes the entropy of the left branch, and H(S_r) denotes the entropy of the right branch.
Therefore, in the example shown in FIG. 6, the information gain I_1 of the candidate 1 node is represented by the following equation (9),
and the information gain I_2 of the candidate 2 node is represented by the following equation (10),
so the information gain I_2 of candidate 2 is larger than the information gain I_1 of candidate 1. The offline learning unit 132D then determines the shape of each decision tree so that its information gain is maximized. That is, in the example shown in FIG. 6, the branch from the parent node to candidate 2 is selected. In other words, the offline learning unit 132D classifies the subset S_j of the parent node (the subset before classification) so that the information gain is maximized. The offline learning unit 132D then outputs the T decision trees of the random forest created in this way, as the determination model, to the determination model and conditional branching expression database 132E.
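The comparison of split candidates can be illustrated with the textbook definitions of Shannon entropy and information gain. Equations (3) to (10) are not reproduced in this text, so the size-weighted form used here is an assumption, and the label lists are made-up data rather than the counts of FIG. 6.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    return (
        entropy(parent)
        - len(left) / n * entropy(left)
        - len(right) / n * entropy(right)
    )


# Hypothetical parent subset: four filler samples and four silence samples.
parent = ["filler"] * 4 + ["silence"] * 4

# Candidate 1 leaves both children as mixed as the parent.
gain_1 = information_gain(
    parent,
    ["filler", "filler", "silence", "silence"],
    ["filler", "filler", "silence", "silence"],
)

# Candidate 2 separates the two labels completely.
gain_2 = information_gain(parent, ["filler"] * 4, ["silence"] * 4)

# The candidate with the larger gain is kept as the node's split.
best = "candidate 2" if gain_2 > gain_1 else "candidate 1"
```

In this toy data, candidate 2 yields the larger gain and would be selected, mirroring the choice made in FIG. 6.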

Next, a method of determining the sound pressure level setting parameter Iv using the determination model created as described above will be described. First, the sound pressure level setting parameter calculation unit 132B reads the random forest (determination model) stored in the determination model and conditional branching expression database 132E. Next, as shown in FIG. 7, the sound pressure level setting parameter calculation unit 132B traverses the T decision trees (tree t_1, ..., tree t_T) based on the feature amount vector v_i of the i-th utterance section input from the user voice analysis unit 131. Then, for the terminal node reached in each decision tree, the sound pressure level setting parameter calculation unit 132B acquires, as the conditional probability P_T(c|v), a value indicating the proportion in which the subset of the parent node was distributed to that terminal node during learning by the offline learning unit 132D. Here, c in P_T(c|v) is the "label for outputting a filler". Then, as shown in FIG. 7, the sound pressure level setting parameter calculation unit 132B takes the average value P(c|v) of the conditional probabilities P_T(c|v) obtained from the T decision trees as the sound pressure level setting parameter Iv. The average value P(c|v) of the conditional probabilities P_T(c|v) is represented by the following equation (11). The average value P(c|v) integrates the results of each of the T decision trees of the random forest classifying the feature amount vector v_i.
Then, the sound pressure level setting parameter calculation unit 132B outputs the determined sound pressure level setting parameter Iv to the voice synthesis unit 132F.
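The averaging step can be sketched as follows. The nested-dictionary tree layout, the rule that values below the threshold go left, and all numbers are assumptions made for illustration. Only the idea of averaging the per-tree filler probabilities (written P_T(c|v) in the text) comes from the description.

```python
def leaf_probability(tree, v):
    """Walk one tree to a leaf and return its stored filler probability."""
    node = tree
    while "leaf" not in node:
        # Assumed branching rule: values below the threshold go left.
        side = "left" if v[node["feature"]] < node["threshold"] else "right"
        node = node[side]
    return node["leaf"]


def sound_pressure_parameter(forest, v):
    """Average the per-tree filler probabilities to obtain Iv."""
    return sum(leaf_probability(tree, v) for tree in forest) / len(forest)


# Hypothetical two-tree forest. "feature" is a 0-based index into v.
forest = [
    {"feature": 0, "threshold": 2.0,
     "left": {"leaf": 0.5}, "right": {"leaf": 0.8}},
    {"feature": 1, "threshold": 0.5,
     "left": {"leaf": 0.6}, "right": {"leaf": 0.9}},
]

v = [2.4, 0.3]  # hypothetical feature amount vector
iv = sound_pressure_parameter(forest, v)  # mean of 0.8 and 0.6
```

Here the first tree reaches the leaf holding 0.8 and the second the leaf holding 0.6, so Iv is their mean, about 0.7.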

Next, the creation of the conditional branching expression by the offline learning unit 132D will be described. The offline learning unit 132D reads the data necessary for creating the conditional branching expression from the learning database 132C and creates the conditional branching expression. Specifically, the offline learning unit 132D first creates the decision trees of a random forest in the same manner as described above.

Next, the offline learning unit 132D reads, from the learning database 132C, the feature amount vector v_i of the i-th utterance section of simulated dialogues collected in advance, and traverses the decision trees of the random forest based on an element of the feature amount vector v_i and the feature amount value of that element. For example, by traversing the decision trees of the random forest based on the element "user utterance length" of the feature amount vector v_i and the feature amount value of that element, the offline learning unit 132D creates the graph shown in FIG. 8, which shows the relationship between the element "user utterance length" and the frequency of silence or filler. In FIG. 8, the vertical axis indicates the frequency with which silence or a filler is output, and the horizontal axis indicates the user utterance length (sec). The unit "sec" means seconds.

Further, from the graph shown in FIG. 8, the offline learning unit 132D creates the table shown in FIG. 9, which shows the relationship between the element "user utterance length" and the ratio at which a filler is output. For example, when the user utterance length is 0 seconds or more and less than 1 second, the ratio of outputting a filler is, from FIG. 8, 2 ÷ (20 + 2) = 0.11. However, because 0.5 ≤ Iv ≤ 1, the ratio of outputting a filler when the user utterance length is 0 seconds or more and less than 1 second is set to 0.5. When the user utterance length is 3 seconds or more and less than 4 seconds, the ratio of outputting a filler is, from FIG. 8, 12 ÷ (12 + 4) = 0.75. The offline learning unit 132D then uses this ratio of outputting a filler as the sound pressure level setting parameter Iv and creates the conditional branching expression from the graph shown in FIG. 8 and the table shown in FIG. 9. For the example shown in FIGS. 8 and 9, the conditional branching expression is represented by the following equation (12). In equation (12), t is the user utterance length (sec).
Then, the offline learning unit 132D outputs the conditional branching expression created in this way to the determination model and conditional branching expression database 132E.
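The per-bin calculation described above, a filler ratio limited to the range 0.5 ≤ Iv ≤ 1, can be sketched as follows. The function name and the handling of an empty bin are assumptions. The counts in the usage lines are the ones quoted from FIG. 8.

```python
def filler_ratio(filler_count, silence_count, lower=0.5, upper=1.0):
    """Ratio of filler outputs in one utterance-length bin, kept in [lower, upper]."""
    total = filler_count + silence_count
    if total == 0:
        # Assumption: an empty bin falls back to the lower bound.
        return lower
    ratio = filler_count / total
    return min(upper, max(lower, ratio))


# 3 s to 4 s bin: 12 fillers and 4 silences.
iv_3_to_4 = filler_ratio(12, 4)

# 0 s to 1 s bin: 2 fillers and 20 silences; the raw ratio is below 0.5,
# so it is raised to the lower bound.
iv_0_to_1 = filler_ratio(2, 20)
```

The first call gives 0.75 and the second is raised to 0.5, matching the two cases in the text.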

Next, a method of determining the sound pressure level setting parameter Iv using the conditional branching expression created as described above will be described. First, the sound pressure level setting parameter calculation unit 132B reads the conditional branching expression from the determination model and conditional branching expression database 132E, and calculates the sound pressure level setting parameter Iv based on the feature amount vector v_i of the i-th utterance section input from the user voice analysis unit 131 and the conditional branching expression. For example, when the feature amount of the element "user utterance length" included in the feature amount vector v_i of the i-th utterance section input from the user voice analysis unit 131 is 2.4 (sec), the sound pressure level setting parameter calculation unit 132B determines the value of the sound pressure level setting parameter Iv to be 0.69 based on the conditional branching expression represented by equation (12).
Then, the sound pressure level setting parameter calculation unit 132B outputs the determined sound pressure level setting parameter Iv to the voice synthesis unit 132F.
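A conditional branching expression of this kind can be sketched as a piecewise lookup. Equation (12) is not reproduced in this text, so the table below is an assumption built from values the description does state: 0.5 for 0 s to 1 s, 0.69 for t = 2.4 s, 0.75 for 3 s to 4 s, and 0.5 for t = 1.7 s in Example 1. Extending the single-point values to whole one-second bins, and rejecting lengths outside 0 s to 4 s, are illustrative choices.

```python
# (lower bound, upper bound, Iv) per utterance-length bin, in seconds.
# The 1-2 s and 2-3 s rows extend single reported values to whole bins,
# which is an assumption.
CONDITIONAL_BRANCH = [
    (0.0, 1.0, 0.5),
    (1.0, 2.0, 0.5),
    (2.0, 3.0, 0.69),
    (3.0, 4.0, 0.75),
]


def iv_from_utterance_length(t):
    """Return Iv for a user utterance length t (sec)."""
    for low, high, iv in CONDITIONAL_BRANCH:
        if low <= t < high:
            return iv
    # Lengths outside the tabulated range are not covered by this sketch.
    raise ValueError(f"no bin covers t = {t}")


iv = iv_from_utterance_length(2.4)
```

For t = 2.4 the lookup returns 0.69, the value given in the text.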

Note that the value of Iv determined by the sound pressure level determination unit 132 differs, even when the same feature amount vector v_i is input from the user voice analysis unit 131, depending on the content of the dialogue between the user and the robot 100, the questions prepared by the robot 100, the language used by the user and the robot 100, and so on. In other words, the offline learning unit 132D creates a determination model and a conditional branching expression in advance for each dialogue content between the user and the robot 100, each set of questions prepared by the robot 100, each language used by the user and the robot 100, and so on.

Next, a method in which the voice synthesis unit 132F determines the sound pressure level according to the sound pressure level setting parameter Iv input from the sound pressure level determination unit 132 will be described.
The voice synthesis unit 132F stores a sound pressure level calculation formula with which it can synthesize voice such that, for example, in an environment where the distance between the user and the robot 100 is 50 cm and the location is a private room in a hospital, the loudness of the voice output from the robot 100 is 50 dB at the user's ear when the sound pressure level setting parameter Iv = 0.5, and 60 dB at the user's ear in the same environment when the sound pressure level setting parameter Iv = 1. For example, the sound pressure level calculation formula is represented by the following equation (13). In equation (13), P is an adjustment variable in the voice synthesis unit 132F.
That is, when Iv is close to 0.5, the reliability of outputting a filler is low, so the voice synthesis unit 132F sets the sound pressure level of the filler to a relatively low level (for example, 50 dB). On the other hand, when Iv is close to 1, the reliability of outputting a filler is high, so the voice synthesis unit 132F sets the sound pressure level of the filler to a relatively high level (for example, 60 dB).
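One way to picture a mapping from Iv to a sound pressure level is a straight line through the two calibration points stated above (Iv = 0.5 at 50 dB and Iv = 1 at 60 dB). This is only a sketch: equation (13) and its adjustment variable P are not reproduced in this text, so the linear form is an assumption and may differ from the patent's formula. The function name and argument layout are hypothetical.

```python
def filler_level_db(iv, low=(0.5, 50.0), high=(1.0, 60.0)):
    """Linearly interpolate a target level (dB at the user's ear) from Iv.

    The linear form is an assumption; only the two endpoints come from
    the description.
    """
    (iv0, db0), (iv1, db1) = low, high
    if not iv0 <= iv <= iv1:
        raise ValueError(f"Iv must lie in [{iv0}, {iv1}], got {iv}")
    return db0 + (db1 - db0) * (iv - iv0) / (iv1 - iv0)


quiet = filler_level_db(0.5)   # low reliability of outputting a filler
loud = filler_level_db(1.0)    # high reliability of outputting a filler
middle = filler_level_db(0.75)
```

Under this assumed mapping the endpoints reproduce 50 dB and 60 dB, and intermediate values of Iv fall proportionally between them.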

Alternatively, the voice synthesis unit 132F may store in advance a table that associates the sound pressure level setting parameter Iv with a sound pressure level (dB), and may determine the sound pressure level at which the filler is output based on that table and the sound pressure level setting parameter Iv input from the sound pressure level determination unit 132.

Example 1
Next, an example of dialogue between the robot 100 according to the first embodiment and a user will be described as Example 1. FIG. 10 shows an example of a conversation between the user and the robot 100.
For example, when the content of the user's i-th utterance is the question "Robo-kun, have you ever been there?" in FIG. 10 (that is, when the user is asking the robot 100 a question), the elements of the feature amount vector v_i of the i-th utterance section and their feature amount values are as shown in FIG. 3, and the feature amount vector v_i of the i-th utterance section is expressed as v_i = (..., 2.4, ..., 20, 1, 1, ...). The value of Iv that the sound pressure level setting parameter calculation unit 132B calculates using this feature amount vector v_i and the determination model (random forest) described above is 0.7 (see FIG. 7). The value of Iv that the sound pressure level setting parameter calculation unit 132B calculates using this feature amount vector v_i and the conditional branching expression described above is 0.69 (see FIG. 9).
Then, based on the sound pressure level setting parameter Iv and either the sound pressure level calculation formula described above or the table associating the sound pressure level setting parameter Iv with a sound pressure level (dB), the voice synthesis unit 132F synthesizes the filler "um" at a sound pressure level close to 60 dB and outputs it to the output unit 120.

On the other hand, when the content of the user's i-th utterance is the utterance "Yesterday I went to the park, and" in FIG. 10 (that is, when the user is not asking the robot 100 a question), the elements of the feature amount vector v_i of the i-th utterance section and their feature amount values are as shown in FIG. 11, and the feature amount vector v_i of the i-th utterance section is expressed as v_i = (..., 1.7, ..., 20, 1, 1, ...). The value of Iv that the sound pressure level setting parameter calculation unit 132B calculates using this feature amount vector v_i and the determination model (random forest) described above is 0.55. The value of Iv that the sound pressure level setting parameter calculation unit 132B calculates using this feature amount vector v_i and the conditional branching expression described above is 0.5 (see FIG. 9).
Then, based on the sound pressure level setting parameter Iv and either the sound pressure level calculation formula described above or the table associating the sound pressure level setting parameter Iv with a sound pressure level (dB), the voice synthesis unit 132F synthesizes the filler "um" at a sound pressure level of 50 dB and outputs it to the output unit 120.

FIG. 12 shows the voice waveforms of the filler output by the robot 100 in Example 1. The upper part of FIG. 12 shows the voice waveform of the filler "um" output by the robot 100 when the content of the user's i-th utterance is the question "Robo-kun, have you ever been there?" in FIG. 10. The lower part of FIG. 12 shows the voice waveform of the filler "um" output by the robot 100 when the content of the user's i-th utterance is the utterance "Yesterday I went to the park, and" in FIG. 10. In FIG. 12, the vertical axis indicates the amplitude of the voice waveform, and the horizontal axis indicates time (sec).

As shown in FIG. 12, when the content of the user's i-th utterance is the question "Robo-kun, have you ever been there?", the robot 100 needs to answer the user, so the reliability of outputting a filler can be said to be high. In Example 1, the sound pressure level of the filler is raised in such a case.
On the other hand, when the content of the user's i-th utterance is the utterance "Yesterday I went to the park, and", the reliability of the robot 100 outputting a filler can be said to be low. In Example 1, the sound pressure level of the filler is lowered in such a case.

According to the robot 100 of the first embodiment described above, the sound pressure level determination unit 132 calculates the value Iv indicating the reliability of outputting a filler based on the feature amounts extracted by the user voice analysis unit 131, and determines the sound pressure level of the filler based on that value Iv. The output unit 120 can therefore output the filler at a sound pressure level that corresponds to the value Iv indicating the reliability of outputting the filler. The reliability of outputting a filler is positively correlated with the likelihood that the robot 100 holds the right to speak. That is, even when it is unclear whether or not the robot 100 holds the right to speak, the output unit 120 can output the filler at a sound pressure level that corresponds to the likelihood that the robot 100 holds it. Consequently, when the robot 100 is unlikely to hold the right to speak, the output unit 120 outputs the filler at a low sound pressure level, which reduces the chance that the filler output by the output unit 120 interrupts the user's utterance. This makes it possible to provide a robot 100 that reduces the possibility of interrupting the user's utterance.

The robot 100 according to the present embodiment determines the sound pressure level of the filler using information such as the fundamental frequency f0 of the voice waveform of the user's voice and the past history of the robot 100's utterances. Compared with processing that uses the linguistic information of the user's voice, the sound pressure level of the filler can therefore be determined with relatively light computation.

The present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit. For example, the determination model is not limited to the random forest described above, and other machine learning methods such as a support vector machine (SVM) may be used. In the present embodiment, the user voice analysis unit 131 also extracts feature amounts of the feature amount vector v_i from information on the past history of the robot 100's utterances, but it may extract the feature amounts of the feature amount vector v_i from the voice waveform of the user's voice alone. Needless to say, the robot 100 does not necessarily emit a filler between one utterance section of the user and the next. Between the user's utterance sections, the robot 100 may output a filler, may remain silent, or may output an utterance (voice that is meaningful as dialogue content). Although the robot 100 has been described as the voice dialogue device in the present embodiment, the voice dialogue device according to the present invention may be any device capable of interacting with a user, for example, a smartphone in which an application that interacts with the user is installed.

100 Robot (voice dialogue device)
110 Input unit
120 Output unit
130 Control unit
131 User voice analysis unit
132 Sound pressure level determination unit
132A Method selection unit
132B Sound pressure level setting parameter calculation unit
132C Learning database
132D Offline learning unit
132E Determination model and conditional branching expression database
132F Voice synthesis unit

Claims

A voice dialogue device including an input unit that inputs a user's voice and an output unit that outputs voice to the user, the voice dialogue device comprising:
a user voice analysis unit that extracts, from at least the voice waveform of the voice input by the input unit, a feature amount including at least the fundamental frequency of the voice waveform as an element; and
a sound pressure level determination unit that, when the output unit outputs a filler to the user, calculates a value indicating the reliability of outputting the filler based on the feature amount extracted by the user voice analysis unit, and determines the sound pressure level of the filler based on the value indicating the reliability of outputting the filler.