JP7616290B2 - Robot, response method and program

Info

Publication number: JP7616290B2
Application number: JP2023135703A
Authority: JP
Inventors: 克典石井
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2019-05-20
Filing date: 2023-08-23
Publication date: 2025-01-17
Anticipated expiration: 2039-05-20
Also published as: JP7342419B2; JP2020190587A; JP2023169166A

Description

The present invention relates to a robot, a response method, and a program.

Robots that operate autonomously and converse with people have been proposed. For example, Patent Document 1 describes a natural language processing device intended to enable appropriate dialogue with humans. The natural language processing device of Patent Document 1 includes a sequential analysis processing unit, in which each analysis processing unit executes analysis sequentially and in parallel each time a portion of a natural language sentence in an analyzable unit is input, and an output unit that obtains an output such as a dialogue response sentence based on the analysis results of the analysis processing units. Each processing unit in the sequential analysis processing unit acquires its own immediately preceding or earlier analysis results and the immediately preceding or earlier analysis results of the other processing units, and obtains its analysis result while looking ahead with reference to the acquired results.

Japanese Unexamined Patent Application Publication No. 2017-102771 (JP 2017-102771 A)

A robot that converses with people generally generates and speaks a response sentence only after it has finished listening to what the person says. The robot therefore gives no response while the person is still speaking, and the speaker feels as if the robot is not listening at all. The natural language processing device of Patent Document 1 tries to return a response quickly by having each analysis processing unit perform analysis sequentially and in parallel each time a portion of a sentence is input.

However, when this natural language processing device is applied to a robot configured to converse with a user, the analysis uses results predicted by a look-ahead process that anticipates the next input data, so the analysis result may be wrong. If the analysis result is wrong, an inappropriate response sentence is generated for the sentence input by the user, and the robot consequently cannot respond appropriately to the user's utterance.

The present invention has been made in view of the above circumstances, and its object is to enable a response that keeps the speaker from feeling that the robot is not listening at all, even when responses by response sentences are suppressed.

To achieve the above object, a robot according to the present invention is a robot capable of executing a voice response and a gesture response, and includes: first specifying means for starting, upon the start of an utterance by a speaker, a predetermined specifying process to specify a category of the content that the speaker intends to convey through the utterance; and control means for controlling the robot to execute a gesture preset in association with the specified category as soon as the category is specified by the specifying process, regardless of whether the utterance by the speaker has ended. Even when the control means controls the robot to execute the gesture response before the utterance by the speaker ends, the control means controls the robot to execute the voice response only after waiting for the utterance to end, so that the robot remains silent until the utterance by the speaker ends. The control means is also configured to be able to control the robot to execute a different gesture for each category.
A response method according to the present invention is a response method executed by a robot capable of executing a voice response and a gesture response, and includes: a specifying step of starting, upon the start of an utterance by a speaker, a predetermined specifying process to specify a category of the content that the speaker intends to convey through the utterance; and a control step of controlling the robot to execute a gesture preset in association with the specified category as soon as the category is specified by the specifying process, regardless of whether the utterance by the speaker has ended. Even when the control step controls the robot to execute the gesture response before the utterance by the speaker ends, the control step controls the robot to execute the voice response only after waiting for the utterance to end, so that the robot remains silent until the utterance by the speaker ends. The control step is also configured to be able to control the robot to execute a different gesture for each category.
A program according to the present invention causes a computer of a robot capable of executing a voice response and a gesture response to function as: specifying means for starting, upon the start of an utterance by a speaker, a predetermined specifying process to specify a category of the content that the speaker intends to convey through the utterance; and control means for controlling the robot to execute a gesture preset in association with the specified category as soon as the category is specified by the specifying process, regardless of whether the utterance by the speaker has ended. Even when the control means controls the robot to execute the gesture response before the utterance by the speaker ends, the control means controls the robot to execute the voice response only after waiting for the utterance to end, so that the robot remains silent until the utterance by the speaker ends. The control means is also configured to be able to control the robot to execute a different gesture for each category.

According to the present invention, even when responses by response sentences are suppressed, the robot can respond in a way that keeps the speaker from feeling that the robot is not listening at all.
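The timing relationship described above (gesture as soon as the category is specified, voice only after the utterance ends) can be sketched as a small event handler. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name `ResponseController`, its method names, and the category labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical mapping from a specified category to a preset gesture number.
GESTURE_FOR_CATEGORY = {"morning_greeting": 2, "farewell": 4}


@dataclass
class ResponseController:
    """Gestures may fire mid-utterance; speech is held until the utterance ends."""
    actions: List[str] = field(default_factory=list)
    utterance_ended: bool = False
    pending_reply: Optional[str] = None

    def on_category_specified(self, category: str) -> None:
        # Gesture response: executed immediately, even while the speaker is talking.
        gesture = GESTURE_FOR_CATEGORY.get(category)
        if gesture is not None:
            self.actions.append(f"gesture:{gesture}")

    def on_reply_ready(self, reply: str) -> None:
        # Voice response: spoken only if the speaker has already finished.
        if self.utterance_ended:
            self.actions.append(f"speak:{reply}")
        else:
            self.pending_reply = reply

    def on_utterance_end(self) -> None:
        # Release any reply that was held back to keep the robot silent.
        self.utterance_ended = True
        if self.pending_reply is not None:
            self.actions.append(f"speak:{self.pending_reply}")
            self.pending_reply = None
```

In this sketch the robot stays silent during the utterance simply because the only path to a `speak:` action passes through the `utterance_ended` flag.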

FIG. 1 is a diagram showing a schematic configuration of a robot to which a control device according to an embodiment of the present invention is applied.
FIG. 2 is a block diagram showing a functional configuration of the control device of the robot according to the embodiment.
FIG. 3A is a front view showing the gesture action of gesture number 2 that the control device according to the embodiment causes the robot to perform.
FIG. 3B is a front view showing the gesture action of gesture number 3 that the control device according to the embodiment causes the robot to perform.
FIG. 3C is a front view showing the gesture action of gesture number 4 that the control device according to the embodiment causes the robot to perform.
FIG. 3D is a front view showing the gesture action of gesture number 5 that the control device according to the embodiment causes the robot to perform.
FIG. 3E is a front view showing the gesture action of gesture number 6 that the control device according to the embodiment causes the robot to perform.
FIG. 4 is a flowchart showing a conversation recording process of the control device according to the embodiment.
FIG. 5 is a diagram showing an example of a conversation record recorded in the conversation recording process shown in FIG. 4.
FIG. 6 is a flowchart showing an analysis learning process of the control device according to the embodiment.
FIG. 7 is a diagram showing an example in which all utterance sentences corresponding to a predetermined target are read out from the conversation record.
FIG. 8 is a diagram showing an example of a unique phoneme sequence table.
FIG. 9 is a diagram showing an example of correspondence between sentences and gestures.
FIG. 10 is a flowchart showing a response gesture database registration process of the control device according to the embodiment.
FIG. 11 is a timing chart showing an example of the time at which an utterance sentence is uttered.
FIG. 12 is a diagram showing an example of a response gesture database used in the response gesture database registration process shown in FIG. 10.
FIG. 13 is a flowchart showing a predictive response control process of the control device according to the embodiment.
FIG. 14 is a diagram showing an example of a response time list.
FIG. 15 is a flowchart showing a language response control process of the control device according to the embodiment.
FIG. 16 is a timing chart showing an example of a response of the robot according to the embodiment.

An embodiment of the present invention is described below with reference to the drawings. The same or corresponding parts in the drawings are given the same reference numerals.

Embodiment
FIG. 1 is a diagram showing a schematic configuration of a robot 1 to which a control device 2 according to an embodiment of the present invention is applied. In appearance, the robot 1 has a three-dimensional shape modeled on a human (a child). The robot 1 includes a head 101, a body 102, and arms 103. The head 101 and the arms 103 are parts that can be moved by a gesture actuator 7, which is a drive device built into the robot 1. The head 101 is attached to the body 102 by a neck joint 5 so as to be capable of flexion and extension, rotation, and lateral bending. The arms 103 are attached to the body 102 by shoulder joints 6 so as to be capable of flexion and extension as well as adduction and abduction.

The robot 1 includes a microphone 3 for picking up sound, a speaker 4 for outputting sound, the gesture actuator 7 for moving the head 101 and the arms 103, and the control device 2. The robot 1 can converse with a person by performing speech recognition on an utterance of a predetermined target, generating a response sentence to the utterance, and speaking the response sentence through speech synthesis. When conversing with a predetermined target, the robot 1 can also respond with non-verbal behavior, that is, with movements of the head 101 and the arms 103. A plurality of mutually different gesture actions are set in the robot 1 as such non-verbal behavior, and these gesture actions include nodding and raising and lowering the arms.

The robot 1 acts in various ways in response to external stimuli, such as a call or a touch from a predetermined target that exists outside the robot itself. In this way the robot 1 can communicate and interact with the predetermined target. A predetermined target is an entity that exists outside the robot 1 and is a partner with which the robot 1 communicates and interacts. Examples of predetermined targets include the user who owns the robot 1, people around the user (such as the user's family members or friends), and other robots capable of speech. A predetermined target can also be called a communication target, a communication partner, an interaction target, an interaction partner, or the like.

FIG. 2 is a block diagram showing the functional configuration of the control device 2. The control device 2 is electrically connected to the microphone 3 and the speaker 4, acquires a voice signal from the microphone 3, and speaks a response sentence from the speaker 4. The control device 2 also controls the gesture actuator 7 to cause the robot 1 to execute the gesture actions described above. The gesture actuator 7 includes actuators and drives, for example, the head 101 and the arms 103 of the robot 1 shown in FIG. 1.

The robot 1 has, at the joint 5, sensors (potentiometers) that detect the rotation angles of flexion/extension, rotation, and lateral bending of the head 101, and the gesture actuator 7 makes the head 101 perform a predetermined movement by feedback control using the detection values of the sensors at the joint 5. Similarly, the robot 1 has, at the joints 6, sensors (potentiometers) that detect the rotation angles of flexion/extension and adduction/abduction of the arms 103, and the gesture actuator 7 makes the arms 103 perform a predetermined movement by feedback control using the detection values of the sensors at the joints 6.
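The text does not specify the control law used for this feedback control. As one possible illustration, the following sketch applies simple proportional control to a simulated joint; the function `drive_joint`, the class `SimulatedJoint`, the gain, and the tolerance are assumptions made for this example only.

```python
def drive_joint(read_angle, apply_command, target_deg,
                gain=0.5, tolerance_deg=0.5, max_steps=200):
    """Move a joint toward target_deg using proportional feedback.

    read_angle stands in for the potentiometer reading and apply_command
    for the actuator input. Returns the number of commands issued.
    """
    for step in range(max_steps):
        error = target_deg - read_angle()
        if abs(error) <= tolerance_deg:
            return step
        # Command a correction proportional to the remaining error.
        apply_command(gain * error)
    return max_steps


class SimulatedJoint:
    """Idealized joint whose angle changes by exactly the commanded amount."""

    def __init__(self, angle_deg=0.0):
        self.angle_deg = angle_deg

    def read(self):
        return self.angle_deg

    def command(self, delta_deg):
        self.angle_deg += delta_deg
```

For example, `drive_joint(joint.read, joint.command, 30.0)` on a fresh `SimulatedJoint` halves the remaining error on each step until it falls within the tolerance.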

FIGS. 3A to 3E are front views showing examples of gesture actions that the robot 1 is caused to execute. FIG. 3A shows the gesture action of gesture number 2, which responds to the predetermined target's utterance "ohayou" ("Good morning"). In the gesture action of gesture number 2, the head 101 is turned to face forward (or slightly above forward) by the joint 5 and the gesture actuator 7, and the left and right arms 103 are raised above the shoulders by the joints 6 and the gesture actuator 7.

FIG. 3B shows the gesture action of gesture number 3, which responds to the predetermined target's utterance "konnichiwa" ("Hello"). In the gesture action of gesture number 3, the head 101 is turned to face forward (or slightly above forward) by the joint 5 and the gesture actuator 7, and the left and right arms 103 are raised in front of the head 101 by the joints 6 and the gesture actuator 7.

FIG. 3C shows the gesture action of gesture number 4, which responds to the predetermined target's utterance "baibai" ("Bye-bye"), "sayounara" ("Goodbye"), or "sayonara" (a shorter form of "Goodbye"). In the gesture action of gesture number 4, the head 101 is tilted slightly to the left by the joint 5 and the gesture actuator 7, and, by the joints 6 and the gesture actuator 7, the right arm 103 is raised close to the head 101 while the left arm 103 is kept lowered. In the gesture action of gesture number 4, the right arm 103 may also be waved from side to side while raised.

FIG. 3D shows the gesture action of gesture number 5, which responds to the predetermined target's utterance "tadaima" ("I'm home"). In the gesture action of gesture number 5, the head 101 is turned to face forward (or slightly above forward) by the joint 5 and the gesture actuator 7, and the left and right arms 103 are raised to shoulder height by the joints 6 and the gesture actuator 7.

FIG. 3E shows the gesture action of gesture number 6, which responds to the predetermined target's utterance "oyasumi" ("Good night"). In the gesture action of gesture number 6, the head 101 is turned downward by the joint 5 and the gesture actuator 7 while the left and right arms 103 are kept lowered by the joints 6 and the gesture actuator 7.
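The correspondence between utterances and gesture numbers given for FIGS. 3A to 3E can be written as a lookup table. The sketch below only restates that correspondence; the dictionary and function names are hypothetical, and the keys are the original Japanese utterances.

```python
# Utterance -> gesture number, as described for FIGS. 3A to 3E.
GESTURE_NUMBER_BY_UTTERANCE = {
    "おはよう": 2,    # "Good morning": both arms raised above the shoulders
    "こんにちは": 3,  # "Hello": both arms raised in front of the head
    "ばいばい": 4,    # "Bye-bye": right arm raised near the head
    "さようなら": 4,  # "Goodbye"
    "さよなら": 4,    # shorter form of "Goodbye"
    "ただいま": 5,    # "I'm home": both arms raised to shoulder height
    "おやすみ": 6,    # "Good night": head turned downward, arms lowered
}


def gesture_for(utterance):
    """Return the gesture number for an utterance, or None if none is defined."""
    return GESTURE_NUMBER_BY_UTTERANCE.get(utterance)
```

Several utterances can share one gesture number, as the three farewell forms do here.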

As shown in FIG. 2, the control device 2 includes a control unit 20, a storage unit 30, and an I/O interface that inputs and outputs signals to and from the microphone 3, the speaker 4, and the gesture actuator 7. The control unit 20 is composed of a CPU (Central Processing Unit) and the like, and, by executing a program stored in the storage unit 30, realizes the functions of the units described later (a voice acquisition unit 21, an identification unit 22, a partial analysis unit 23, a gesture response control unit 24, a speech analysis unit 25, a response sentence generation unit 26, a language response control unit 27, a learning unit 28, and a specifying unit 29) and controls the operation of the robot 1. The storage unit 30 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like, and part or all of the ROM is electrically rewritable memory (such as flash memory). The robot 1 may also include, for example, an imaging device for recognizing the face of a predetermined target, and the control unit 20 may communicate with the imaging device via the I/O interface to acquire image data and the like.

The voice acquisition unit 21, which functions as acquisition means, performs A/D conversion on the voice signal that a predetermined target inputs through the microphone 3 by sampling the signal at a predetermined frequency, and generates digital data in, for example, linear PCM (Pulse Code Modulation). The voice acquisition unit 21 then transforms the digital data by a short-time Fourier transform (STFT) to obtain a spectrogram, and sends the spectrogram to the identification unit 22.
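To illustrate what the short-time Fourier transform step produces, the following standard-library sketch computes a magnitude spectrogram from a list of PCM samples. It is a didactic version under assumed parameters (a Hann window, a tiny frame length, a direct DFT); the function name `stft_magnitudes` is hypothetical, and a real system would use an FFT library and much longer frames.

```python
import cmath
import math


def stft_magnitudes(samples, frame_len=8, hop=4):
    """Return one list of magnitudes (bins 0..frame_len//2) per analysis frame."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames
```

Each inner list is one time slice of the spectrogram; stacking the slices over time gives the time-frequency picture that later stages analyze.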

The voice acquisition unit 21 also analyzes the spectrogram sequentially to obtain a sequence of phonemes representing the spoken content. The phonemes are obtained by segmental sound labeling, which decomposes the voice signal into the components considered to make it up, such as consonants and vowels, and assigns a label representing each component. The voice acquisition unit 21 decomposes the voice signal into components based on the formant transitions obtained from the spectrogram and on the spectrogram patterns and their changes, and selects and assigns the label that matches the pattern of each component. The process of obtaining the phoneme sequence (labeling voice data in phoneme units) can be performed using, for example, the open-source Julius phoneme segmentation kit. The voice acquisition unit 21 converts the segmental sound labels into phonemes and sequentially sends the resulting phonemes, together with the time between phonemes or the time at which each phoneme was uttered, to the learning unit 28 and the speech analysis unit 25.

The identification unit 22 extracts voice feature data, such as an i-vector, from the spectrogram sent from the voice acquisition unit 21. It identifies the predetermined target by determining which of the voice feature data of the plurality of predetermined targets stored in the storage unit 30 the extracted voice feature data matches, and acquires the target ID of the identified predetermined target. The identification unit 22 sends the acquired target ID to the learning unit 28 and the speech analysis unit 25.
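The text does not state how the extracted feature data is matched against the stored data. One common choice for fixed-length speaker vectors is cosine similarity with a threshold, sketched below as an assumption; the function names, the threshold value, and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_target(features, enrolled, threshold=0.8):
    """Return the target ID whose stored vector is most similar to features.

    enrolled maps target ID -> stored feature vector. Returns None when no
    stored vector reaches the threshold (an unregistered speaker).
    """
    best_id, best_score = None, threshold
    for target_id, reference in enrolled.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_id, best_score = target_id, score
    return best_id
```

Returning `None` for a low score lets the caller treat an unknown voice differently from a registered target.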

The learning unit 28, which functions as learning means, learns, for each predetermined target, the shortest phoneme sequence that can identify the content of an utterance input from the microphone 3, using the phoneme sequence sent from the voice acquisition unit 21 and the text of the speech recognition result from the speech analysis unit 25. The partial analysis unit 23, which functions as first analysis means, partially analyzes the utterance input from the predetermined target and, when part of the phoneme sequence in the utterance matches a phoneme sequence learned by the learning unit 28 (the shortest phoneme sequence described later), notifies the gesture response control unit 24 of the match. The specifying unit 29, which functions as specifying means, specifies the gesture action corresponding to the input utterance according to the speech recognition result from the speech analysis unit 25. When part of the phoneme sequence in the input utterance matches a shortest phoneme sequence, the gesture response control unit 24 selects the gesture action corresponding to the matched shortest phoneme sequence and causes the gesture actuator 7 to perform that response.
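The partial analysis can be pictured as an incremental matcher that receives phonemes one at a time and reports a gesture the moment the phonemes heard so far equal a learned shortest phoneme sequence. The sketch below assumes matching from the start of the utterance; the class name `PartialMatcher` and the example sequences are hypothetical and are not taken from the learned data.

```python
class PartialMatcher:
    """Match the phonemes heard so far against learned shortest sequences."""

    def __init__(self, shortest_sequences):
        # Maps a tuple of phonemes to the gesture number it triggers.
        self.shortest_sequences = shortest_sequences
        self.heard = []

    def feed(self, phoneme):
        """Add one phoneme; return a gesture number on a match, else None."""
        self.heard.append(phoneme)
        return self.shortest_sequences.get(tuple(self.heard))

    def reset(self):
        """Clear the buffer at the start of a new utterance."""
        self.heard = []
```

With the hypothetical table `{("o", "h", "a"): 2}`, feeding the phonemes of "ohayou" one by one returns 2 at the third phoneme, before the utterance is complete.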

The gesture response control unit 24 sends to the gesture actuator 7 an action sequence corresponding to, for example, one of the gesture actions of FIGS. 3A to 3E. The action sequence describes the start angle, angular velocity, angular acceleration, stop angle, stop time, and so on for each of the flexion/extension, rotation, and lateral bending of the head 101 and the flexion/extension and adduction/abduction of the arms 103. In accordance with the action sequence sent from the gesture response control unit 24, the gesture actuator 7 generates control signals for driving the head 101 and the arms 103 according to the detection values of the sensors at the joint 5 and the joints 6, and drives the head 101 and the arms 103 by inputting the generated signals to the actuators.
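One way to represent such an action sequence is a list of per-axis records holding the parameters named above. The following sketch is an assumed data layout, not the format actually exchanged between the units; the field names, units, and numeric values are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JointMotion:
    """Motion of one degree of freedom within an action sequence."""
    axis: str                         # e.g. "head_flexion" (hypothetical label)
    start_angle_deg: float
    angular_velocity_dps: float       # degrees per second
    angular_acceleration_dps2: float  # degrees per second squared
    stop_angle_deg: float
    stop_time_s: float                # how long to hold the stop angle


def is_consistent(motion: JointMotion) -> bool:
    """True if the velocity sign points from the start angle to the stop angle."""
    travel = motion.stop_angle_deg - motion.start_angle_deg
    return travel == 0 or travel * motion.angular_velocity_dps > 0


# Illustrative values only, loosely modeled on gesture number 2 (arms raised).
GESTURE_2: List[JointMotion] = [
    JointMotion("head_flexion", 0.0, 20.0, 40.0, 10.0, 1.0),
    JointMotion("left_arm_flexion", 0.0, 90.0, 180.0, 120.0, 1.0),
    JointMotion("right_arm_flexion", 0.0, 90.0, 180.0, 120.0, 1.0),
]
```

A simple check such as `is_consistent` can reject a malformed sequence before it is sent to the actuator.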

The speech analysis unit 25, which functions as second analysis means, uses the phoneme sequence sent from the voice acquisition unit 21 to analyze the input utterance over a section longer than the section analyzed by the partial analysis unit 23 (the first analysis means), and performs speech recognition. In doing so, the speech analysis unit 25 refers to a dictionary database stored in the storage unit 30 and performs morphological analysis and the like to recognize the voice data sent from the voice acquisition unit 21. The speech analysis unit 25 sends the analysis result to the response sentence generation unit 26. The response sentence generation unit 26 generates, for each speaker, a response sentence suited to the content of the utterance and sends it to the language response control unit 27, which causes the response sentence to be spoken from the speaker 4 by speech synthesis.

To return a non-verbal response by a gesture action to the user's utterance as quickly as possible, the control unit 20 executes a conversation recording process, an analysis learning process, and a predictive response control process. These processes are described below in that order.

FIG. 4 is a flowchart showing the conversation recording process according to the embodiment. In the conversation recording process, the control device 2 records, for each predetermined target registered in the robot 1, the content of the conversation between the robot 1 and the predetermined target and the recognized short sentences. The conversation recording process is repeated each time a series of conversations takes place, accumulating conversation record data.

The control unit 20 starts the conversation recording process at the same time as a conversation between the robot 1 and a predetermined target starts. For example, the control unit 20 starts speech recognition when the voice acquisition unit 21 detects sound whose level exceeds a predetermined threshold, and determines that a conversation has started once speech recognition succeeds. When the sound contains a lot of noise, the speech recognition result may be segmented into words, and the conversation may be determined to have started once one word is recognized.

First, the identification unit 22 identifies the predetermined target as described above (step S401).

Next, the speech analysis unit 25 determines whether the conversation has ended (step S402). The end of the conversation can be determined, for example, by determining that the length of a silent period has exceeded a predetermined length, by determining that the context has concluded, by determining that the speaker's face can no longer be recognized (when face recognition with a camera is used), or by a combination of these. The speech analysis unit 25 can determine that the context has concluded, for example, when no further utterance comes from the predetermined target for a predetermined time after an answer to a question has been spoken, or when it detects an utterance in which the predetermined target declares the end of the conversation, such as "baibai" ("Bye-bye") or "matane" ("See you later").
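The end-of-conversation criteria listed above can be combined in a single predicate. The sketch below is one possible combination (a logical OR of the criteria); the function name, the 10-second silence limit, and the parameter names are assumptions, while the two farewell phrases are the ones mentioned in the text.

```python
# Utterances that declare the end of a conversation ("Bye-bye", "See you later").
FAREWELLS = {"ばいばい", "またね"}


def conversation_ended(silence_s, last_utterance, face_visible=None,
                       max_silence_s=10.0):
    """Return True if any end-of-conversation criterion is met.

    face_visible is None when no camera is used, so that criterion is skipped.
    """
    if silence_s > max_silence_s:      # silence exceeded the predetermined length
        return True
    if last_utterance in FAREWELLS:    # speaker declared the end of the conversation
        return True
    if face_visible is False:          # speaker's face can no longer be recognized
        return True
    return False
```

Distinguishing `None` from `False` for `face_visible` keeps a robot without a camera from ending every conversation immediately.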

When the conversation has not ended (step S402; N), the speech analysis unit 25 performs speech recognition on the voice data that the voice acquisition unit 21 acquired from the predetermined target's utterance, referring to the dictionary database stored in the storage unit 30 and performing morphological analysis and the like (step S403). The speech analysis unit 25 then records a conversation record, in which the recognition result is associated with the voice data, in the storage unit 30 as a conversation record database (step S404), and returns to step S402. As shown in FIG. 5, a conversation record includes the time, the conversation partner (target ID), the utterance sentence of the predetermined target during the conversation, the phoneme sequence contained in that utterance, and the voice data. Here, the time is the time at which the conversation started or ended. The utterance sentence is a character string representing the content of the spoken sentence. The phoneme sequence is the sequence of phonemes of the utterance sentence. The voice data field holds either the voice data itself or information specifying the file in which the voice data is recorded. The voice data is used later to analyze how long it took to utter the phoneme sequence.
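The fields of a conversation record described above map naturally onto a small record type. The sketch below is an assumed in-memory layout; the class and function names, the file names, and the phoneme strings are hypothetical, and the helper shows how records might later be filtered by target ID.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class ConversationRecord:
    time: datetime               # time the conversation started or ended
    target_id: int               # conversation partner
    utterance: str               # text of the spoken sentence
    phonemes: Tuple[str, ...]    # phoneme sequence of the utterance
    audio_ref: str               # voice data itself, or a reference to its file


def utterances_for_target(db: List[ConversationRecord],
                          target_id: int) -> List[ConversationRecord]:
    """Return every record whose conversation partner is target_id."""
    return [record for record in db if record.target_id == target_id]


# Hypothetical sample data.
conversation_db = [
    ConversationRecord(datetime(2018, 9, 9, 9, 1), 1, "おはよう",
                       ("o", "h", "a", "y", "o", "u"), "rec_0001.wav"),
    ConversationRecord(datetime(2018, 9, 9, 18, 30), 2, "ただいま",
                       ("t", "a", "d", "a", "i", "m", "a"), "rec_0002.wav"),
]
```

Keeping a reference to the audio alongside the text allows the spoken duration of a phoneme sequence to be measured afterward.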

一方、発話解析部２５は、会話が終了しているとき（ステップＳ４０２；Ｙ）には、会話記録処理を終了する。以上のようにして、発話解析部２５は、所定の対象の発話の認識結果と音声データとを対応づけた会話記録を会話記録データベースとして、記憶部３０に記憶させる。 On the other hand, when the conversation has ended (step S402; Y), the utterance analysis unit 25 ends the conversation recording process. In this manner, the utterance analysis unit 25 stores the conversation record in which the recognition result of the utterance of the specified target is associated with the voice data as a conversation record database in the storage unit 30.

次に、図６を参照しながら、分析学習処理について説明する。この分析学習処理は、ジェスチャ動作を制御する際に用いられる応答ジェスチャデータベースを生成して記憶部３０に記録するための処理である。分析学習処理は、例えば、所定の数の会話が新たに会話記録データベースに記憶されたとき、又は、所定の期間を経過するごとに、実行される。 Next, the analysis learning process will be described with reference to FIG. 6. This analysis learning process is a process for generating a response gesture database used when controlling gesture actions and recording it in the storage unit 30. The analysis learning process is executed, for example, when a predetermined number of conversations are newly stored in the conversation record database, or each time a predetermined period of time has passed.

学習部２８は、まず、前回の分析学習処理で生成された応答ジェスチャデータベースをクリアする（Ｓ６００）。分析学習は、ロボット１に登録されているすべての所定の対象について、所定の対象ごとに行う。次いで、図５に示される会話記録データベースのうち、図７に示すように、登録されている最初の所定の対象の対象ＩＤ（ＩＤ＝１）に対応する複数の発話文をすべて読み出し、読み出した複数の発話文を、ＲＡＭの所定の記憶領域に記憶させる（ステップＳ６０１）。 The learning unit 28 first clears the response gesture database generated in the previous analysis and learning process (S600). The analysis and learning is performed for each of the predetermined targets registered in the robot 1. Next, from the conversation record database shown in FIG. 5, as shown in FIG. 7, all of the multiple spoken sentences corresponding to the target ID (ID=1) of the first registered predetermined target are read out, and the multiple spoken sentences that have been read out are stored in a predetermined storage area of the RAM (step S601).

次いで学習部２８は、すべての所定の対象について、後述するジェスチャ動作の制御のための分析学習が終了しているか否かを判別する（ステップＳ６０２）。すべての所定の対象について分析学習が終了していないとき（ステップＳ６０２；Ｎ）には、ステップＳ６０１で記憶された（図７に示すような）複数の発話文から、重複する発話文（例えば「おはよう」）のうちの最初の１つ（例えば、２０１８／９／９ ９：０１の「おはよう」）を残して他の当該発話文（例えば、２０１８／９／１２ ８：００、２０１８／９／１４ ８：００及び２０１８／９／１４ ９：００の「おはよう」）を削除したテーブル（ユニーク音素列テーブル）を作成して、ＲＡＭの所定の記憶領域に記憶させる（ステップＳ６０３）。このユニーク音素列テーブルは、１つの対象ＩＤについて分析学習するための一時的なものであり、時刻及び音声データは不要で、所定の対象ごとに対象ＩＤが番号付けされているから、図８に示すように、発話文と音素列との対応があればよい。そして、学習部２８は、ユニーク音素列テーブルのうちの最初の発話文を読み出す（ステップＳ６０４）。 Next, the learning unit 28 determines whether the analysis learning for gesture action control, described later, has been completed for all the predetermined targets (step S602). If it has not been completed for all the predetermined targets (step S602; N), the learning unit 28 creates a table (the unique phoneme sequence table) from the multiple spoken sentences stored in step S601 (as shown in FIG. 7) by keeping only the first occurrence of each duplicated spoken sentence (e.g., the "Good morning" of 9/9/2018 9:01) and deleting its other occurrences (e.g., the "Good morning" of 9/12/2018 8:00, 9/14/2018 8:00, and 9/14/2018 9:00), and stores the table in a predetermined storage area of the RAM (step S603). This unique phoneme sequence table is a temporary table used for the analysis learning of one target ID. It needs neither the time nor the voice data, and because each predetermined target is numbered with its own target ID, the table only needs the correspondence between spoken sentence and phoneme sequence, as shown in FIG. 8. The learning unit 28 then reads out the first spoken sentence from the unique phoneme sequence table (step S604).
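The duplicate removal of step S603 can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function name `build_unique_table`, and the 21:30 timestamp are hypothetical, and phoneme strings are assumed to be stored as tuples of mora-sized units (which is consistent with the "oha" example but is not stated in the description).

```python
def build_unique_table(records):
    """Keep the first occurrence of each spoken sentence (cf. step S603).

    `records` is assumed to be in chronological order for one target ID.
    Only the sentence-to-phoneme correspondence is kept; time and voice
    data are dropped, as in the unique phoneme sequence table.
    """
    table = {}
    for rec in records:
        # setdefault leaves the entry untouched if the sentence was seen before
        table.setdefault(rec["sentence"], rec["phonemes"])
    return list(table.items())


# Hypothetical conversation records for one target ID
records = [
    {"time": "2018/9/9 9:01", "sentence": "おはよう", "phonemes": ("o", "ha", "yo", "u")},
    {"time": "2018/9/9 21:30", "sentence": "おやすみ", "phonemes": ("o", "ya", "su", "mi")},
    {"time": "2018/9/12 8:00", "sentence": "おはよう", "phonemes": ("o", "ha", "yo", "u")},
]
unique_table = build_unique_table(records)
```

A dictionary is used here because Python dictionaries preserve insertion order, so the first occurrence of each sentence also fixes its position in the table.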

次に、上記のユニーク音素列テーブルから発話文がすべて読み出されたか否かを判別する（ステップＳ６０５）。ユニーク音素列テーブルから発話文がすべて読み出されていないときには（ステップＳ６０５；Ｎ）、特定部２９は、ステップＳ６０４で読み出された発話文に対応するジェスチャ動作を、記憶部３０に記憶されている図９に示すジェスチャ動作データベースを用いて特定する（ステップＳ６０６）。図９に示すように、このジェスチャ動作データベースは、ジェスチャ対応文と、ジェスチャ動作の番号とを対応付けて記憶するものであり、ステップＳ６０６では、ステップＳ６０４で読み出された発話文と一致するジェスチャ対応文に対応するジェスチャ動作の番号が、上記の対応するジェスチャ動作を表す番号として特定される。例えば、ジェスチャ対応文「おはよう」に対して、図３Ａに示すジェスチャ動作を表すジェスチャ番号“２”が特定される。 Next, it is determined whether or not all the spoken sentences have been read from the unique phoneme sequence table (step S605). When all the spoken sentences have not been read from the unique phoneme sequence table (step S605; N), the identification unit 29 identifies the gesture action corresponding to the spoken sentence read in step S604 using the gesture action database shown in FIG. 9 stored in the storage unit 30 (step S606). As shown in FIG. 9, this gesture action database stores gesture corresponding sentences and gesture action numbers in association with each other, and in step S606, the number of the gesture action corresponding to the gesture corresponding sentence that matches the spoken sentence read in step S604 is identified as the number representing the corresponding gesture action. For example, for the gesture corresponding sentence "Good morning", the gesture number "2" representing the gesture action shown in FIG. 3A is identified.

次いで、学習部２８は、上記のステップＳ６０６で対応するジェスチャ動作を特定できたか否かを判別する（ステップＳ６０７）。ステップＳ６０６でジェスチャ動作を特定できたとき（ステップＳ６０７；Ｙ）には、後述する（図１０に示す）応答ジェスチャデータベース登録処理を実行する（ステップＳ６０８）。そして、ユニーク音素列テーブルから、ステップＳ６０４で読み出した発話文の次に続く発話文を読み出し（ステップＳ６０９）、上記のステップＳ６０５～Ｓ６０８を再度、実行する。一方、上記のステップＳ６０６で対応するジェスチャ動作を特定できないとき（ステップＳ６０７；Ｎ）には、上記のステップＳ６０８をスキップし、応答ジェスチャデータベース登録処理を実行せずに、ステップＳ６０９以降を実行する。 Then, the learning unit 28 determines whether or not a corresponding gesture action has been identified in step S606 (step S607). If a gesture action has been identified in step S606 (step S607; Y), a response gesture database registration process (described later, shown in FIG. 10) is executed (step S608). Then, the speech sentence following the speech sentence read in step S604 is read from the unique phoneme sequence table (step S609), and steps S605 to S608 are executed again. On the other hand, if a corresponding gesture action cannot be identified in step S606 (step S607; N), step S608 is skipped, and step S609 and subsequent steps are executed without executing the response gesture database registration process.

そして、ステップＳ６０５～Ｓ６０９を繰り返し実行した結果、上記のユニーク音素列テーブルから発話文がすべて読み出されたとき（ステップＳ６０５；Ｙ）には、図５に示される会話記録データベースに記憶された発話文のうち、ステップＳ６０１で読み出す対象になった最初の対象ＩＤの次の対象ＩＤに対応する複数の発話文をすべて読み出し、読み出した複数の発話文を、ＲＡＭの所定の記憶領域に記憶させる（ステップＳ６１０）。次いで、前記ステップＳ６０２以降を再度、実行する。以上により、すべての所定の対象について、上述したステップＳ６０３～Ｓ６０９による分析学習が終了すると（ステップＳ６０２；Ｙ）、分析学習処理が終了される。 When all the spoken sentences have been read from the unique phoneme sequence table as a result of repeatedly executing steps S605 to S609 (step S605; Y), all the spoken sentences corresponding to the target ID next to the first target ID that was the target to be read in step S601 are read from the spoken sentences stored in the conversation record database shown in FIG. 5, and the read spoken sentences are stored in a specified memory area of the RAM (step S610). Next, steps S602 and onwards are executed again. When the analytical learning by the above-mentioned steps S603 to S609 is completed for all the specified targets (step S602; Y), the analytical learning process is terminated.

次に、図１０を参照しながら、図６のステップＳ６０８の応答ジェスチャデータベース登録処理について説明する。学習部２８は、この応答ジェスチャデータベース登録処理により、図６のステップＳ６０４又はＳ６０９で読み出された発話文を特定できる最低限の（最も短い）音素列として、最短音素列を特定する。例えば、「おはよう」の文に対して、ｏｈａの音素列を特定する。 Next, the response gesture database registration process in step S608 in FIG. 6 will be described with reference to FIG. 10. Through this response gesture database registration process, the learning unit 28 identifies the shortest phoneme string as the minimum (shortest) phoneme string that can identify the spoken sentence read out in step S604 or S609 in FIG. 6. For example, for the sentence "good morning," the phoneme string "oha" is identified.

まず、学習部２８は、ローカル変数としてのカウンタＮに１をセットして（ステップＳ１０００）、図６のステップＳ６０４又はＳ６０９で読み出された発話文の音素列の、先頭からＮ番目までを読み出す（ステップＳ１００１）。そして、着目している発話文の音素列の長さが、読み出した音素列の長さＮに等しいか否かを判別する（ステップＳ１００２）。発話文の音素列の長さが、読み出した音素列の長さＮに等しいとき（ステップＳ１００２；Ｙ）には、発話文を特定できる最短音素列がなかったとして、応答ジェスチャデータベースには何も記憶せずに図６のフローチャート（ステップＳ６０９）に戻る。 First, the learning unit 28 sets a counter N, a local variable, to 1 (step S1000), and reads out the first N phonemes of the phoneme string of the spoken sentence read out in step S604 or S609 of FIG. 6 (step S1001). It then determines whether the length of the phoneme string of the spoken sentence of interest is equal to the length N of the portion read out (step S1002). If the two lengths are equal (step S1002; Y), it is judged that no shortest phoneme string capable of identifying the spoken sentence exists, and the process returns to the flowchart of FIG. 6 (step S609) without storing anything in the response gesture database.

発話文の音素列の長さが、読み出した音素列の長さＮに等しくないとき（ステップＳ１００２；Ｎ）には、学習部２８は、ユニーク音素列テーブルに、着目する発話文以外の発話文で、先頭からの音素列が、読み出した音素列に一致するものがあるか検索する（ステップＳ１００３）。 When the length of the phoneme sequence of the spoken sentence is not equal to the length N of the read phoneme sequence (step S1002; N), the learning unit 28 searches the unique phoneme sequence table for a spoken sentence other than the focused sentence whose phoneme sequence from the beginning matches the read phoneme sequence (step S1003).

そして、学習部２８は、ステップＳ１００３で一致する音素列があったか否かを判別する（ステップＳ１００４）。一致する音素列があったとき（ステップＳ１００４；Ｙ）には、カウンタＮに１を加算して（ステップＳ１００５）、着目している発話文の音素列の、先頭からＮ番目までを読み出す（ステップＳ１００６）。そして、ステップＳ１００２に戻り、ステップＳ１００２からの処理を再度、実行する。 Then, the learning unit 28 determines whether a matching phoneme string was found in step S1003 (step S1004). If a matching phoneme string was found (step S1004; Y), the counter N is incremented by 1 (step S1005), and the first N phonemes of the phoneme string of the spoken sentence of interest are read out (step S1006). The process then returns to step S1002, and the processing from step S1002 is executed again.

ステップＳ１００２～Ｓ１００６を繰り返し実行した結果、上述のユニーク音素列テーブルに、着目する発話文以外の発話文で、先頭からの音素列が、読み出した音素列に一致するものがなかったとき（ステップＳ１００４；Ｎ）には、読み出したＮ番目までの音素列を、着目している発話文の内容を特定可能な最短の音素列（以下「最短音素列」という）として記憶部３０のＲＡＭに記録する（ステップＳ１００７）。 If, as a result of repeatedly executing steps S1002 to S1006, the unique phoneme sequence table contains no spoken sentence other than the spoken sentence of interest whose leading phonemes match the phoneme string read out (step S1004; N), the phoneme string read out up to the Nth phoneme is recorded in the RAM of the storage unit 30 as the shortest phoneme string capable of identifying the content of the spoken sentence of interest (hereinafter referred to as the "shortest phoneme sequence") (step S1007).
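Steps S1000 to S1007 amount to a search for the shortest prefix that no other entry in the unique phoneme sequence table shares. A minimal sketch, assuming phoneme strings are tuples of mora-sized units (an assumption chosen to reproduce the "oha" example) and using the hypothetical function name `shortest_identifying_prefix`:

```python
def shortest_identifying_prefix(target, others):
    """Return the shortest prefix of `target` that no sequence in `others`
    starts with, or None when only the whole sequence would identify it
    (the step S1002; Y branch, where nothing is registered).
    """
    n = 1                                   # step S1000
    while n < len(target):                  # step S1002
        prefix = target[:n]                 # steps S1001 / S1006
        # step S1003: does any other sentence begin with the same units?
        if not any(other[:n] == prefix for other in others):
            return prefix                   # step S1007
        n += 1                              # step S1005
    return None


ohayou = ("o", "ha", "yo", "u")
oyasumi = ("o", "ya", "su", "mi")
prefix = shortest_identifying_prefix(ohayou, [oyasumi])
```

With only these two sentences in the table, the prefix found for "おはよう" is the two units "o" and "ha", that is, "oha".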

次に、学習部２８は、図６のステップＳ６０１又はステップＳ６１０で読み出された、所定の対象の対象ＩＤに対応する（図７に示すような）複数の発話文が記憶されている所定の領域を参照し、着目している発話文と同じ発話文すべての、Ｎ番目までの該当する音素列（最短音素列）の発話された平均的な長さを計測する（ステップＳ１００８）。このとき、同じ発話文の出現回数をカウントする。 Next, the learning unit 28 refers to the predetermined area storing the multiple spoken sentences (as shown in FIG. 7) that correspond to the target ID of the predetermined target and were read out in step S601 or step S610 of FIG. 6, and measures, over all spoken sentences identical to the spoken sentence of interest, the average time taken to speak the phoneme string up to the Nth phoneme (the shortest phoneme sequence) (step S1008). At this time, the number of occurrences of the same spoken sentence is also counted.

ステップＳ１００８では、着目している発話文と同じ発話文の音声データをすべて取り出し、ステップＳ１００７で特定した最短音素列の音素の区間の長さ（最初から最短音素列の終了までの時間）をそれぞれ取り出して、その平均時間を計算する。例えば、発話文「おはよう」の最短音素列が“ｏｈａ”になったとする。図１１は、異なる時刻に発話された同じ所定の対象の同じ発話文「おはよう」の音声データを、開始タイミングを一致させて、上下に並べて示す。学習部２８は、図７に示す所定の対象の対象ＩＤに対応する複数の発話文の音声データから、図１１に示すように、発話文「おはよう」の音声データを取り出し、音声データの開始から“ａ”の音素の終了までの時間、例えば図１１のｔ１及びｔ２を平均して、最短音素列“ｏｈａ”の発話された平均の長さを計測する。 In step S1008, all the voice data of the same utterance sentence as the utterance sentence of interest is extracted, and the lengths of the phoneme sections of the shortest phoneme sequence identified in step S1007 (the time from the beginning to the end of the shortest phoneme sequence) are extracted, and the average time is calculated. For example, assume that the shortest phoneme sequence of the utterance sentence "Good morning" is "oha". FIG. 11 shows voice data of the same utterance sentence "Good morning" of the same specified target uttered at different times, arranged vertically with the same start timing. The learning unit 28 extracts the voice data of the utterance sentence "Good morning" from the voice data of multiple utterance sentences corresponding to the target ID of the specified target shown in FIG. 7, as shown in FIG. 11, and measures the average length of the utterance of the shortest phoneme sequence "oha" by averaging the time from the start of the voice data to the end of the phoneme "a", for example, t1 and t2 in FIG. 11.
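The averaging in step S1008 can be illustrated as below. The description does not say how phoneme boundaries are obtained from the voice data, so this sketch simply assumes that each recorded utterance already comes with the end time of every phoneme unit. The function name `prefix_stats`, the `detection_ms` parameter, and all timing values are hypothetical.

```python
from statistics import mean


def prefix_stats(alignments, n, detection_ms=0):
    """Average spoken duration of the first `n` units, plus occurrence count.

    `alignments` holds one list per recorded utterance of the same sentence;
    each list gives the end time in ms, measured from the start of the
    utterance, of each phoneme unit (an assumed input format).
    """
    end_times = [alignment[n - 1] for alignment in alignments]
    return mean(end_times) + detection_ms, len(end_times)


# Two hypothetical recordings of "おはよう" (units: o, ha, yo, u)
alignments = [[40, 95, 180, 260], [50, 105, 200, 290]]
avg_ms, count = prefix_stats(alignments, n=2)
```

Here the two end times of "ha" (95 ms and 105 ms) play the role of t1 and t2 in FIG. 11; passing `detection_ms=20` would add the optional detection time mentioned for step S1009.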

そして、学習部２８は、発話文、ステップＳ１００７で特定した最短音素列、ステップＳ１００８で計測した最短音素列の発話された時間（最短音素列の発話された時間に検出時間（例えば２０ｍｓ）を加算した時間でもよい）、図６のステップＳ６０６で特定したジェスチャ動作の番号、及び、当該発話文の出現回数を、図１２に示す応答ジェスチャデータベースに記憶し（ステップＳ１００９）、図６のフローチャート（ステップＳ６０９）に戻る。 Then, the learning unit 28 stores the spoken sentence, the shortest phoneme sequence identified in step S1007, the time when the shortest phoneme sequence was spoken measured in step S1008 (which may be the time when the shortest phoneme sequence was spoken plus the detection time (e.g., 20 ms)), the gesture action number identified in step S606 of FIG. 6, and the number of occurrences of the spoken sentence in the response gesture database shown in FIG. 12 (step S1009), and returns to the flowchart of FIG. 6 (step S609).

図５、図９及び図１２では、「おはよう」などの挨拶のことばを例に記載しているが、発話文及びジェスチャ対応文にはそれぞれ「あのー」、「えーと」、「おや」、「まあ」などの感動詞、間投詞もしくは感嘆詞を含めてもよい。 FIGS. 5, 9, and 12 use greetings such as "Good morning" as examples, but the spoken sentences and the gesture corresponding sentences may each also include interjections or exclamations such as "Um," "Er," "Oh," and "My."

次に、図１３を参照しながら、予測応答制御処理について説明する。予測応答制御処理は、例えば、音声取得部２１で、所定の閾値を超える音声レベルの音声を検出したときに開始される。制御部２０は、予測応答制御を開始したら、まず、識別部２２で所定の対象を識別する（ステップＳ１３００）。次に、部分解析部２３は、識別された所定の対象について、応答ジェスチャデータベースから最短音素列長さを読み出し、読み出した最短音素列長さを用いて、図１４に示すような応答時間リストを生成する（ステップＳ１３０１）。図１４に示す応答時間リストでは、応答時間は、短いものから順にリストされている。 Next, the predictive response control process will be described with reference to FIG. 13. The predictive response control process is started, for example, when the voice acquisition unit 21 detects a voice whose level exceeds a predetermined threshold. When the control unit 20 starts the predictive response control, it first identifies the predetermined target with the identification unit 22 (step S1300). Next, the partial analysis unit 23 reads out the shortest phoneme sequence lengths for the identified predetermined target from the response gesture database and uses them to generate a response time list as shown in FIG. 14 (step S1301). In the response time list shown in FIG. 14, the response times are listed in order from shortest to longest.
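Building the response time list in step S1301 can be sketched as a filter and a sort. The row layout of the response gesture database, the field names, and the second database row are hypothetical. Removing duplicate times is an assumption made here for brevity, since the description only says that the list is ordered from shortest to longest.

```python
def build_response_time_list(gesture_db, target_id):
    """Collect the shortest-phoneme-sequence durations registered for one
    target and return them in ascending order (cf. step S1301)."""
    times = {row["duration_ms"] for row in gesture_db
             if row["target_id"] == target_id}
    return sorted(times)


# Hypothetical response gesture database rows
gesture_db = [
    {"target_id": 1, "sentence": "おやすみ", "prefix": ("o", "ya"), "duration_ms": 120, "gesture": 5},
    {"target_id": 1, "sentence": "おはよう", "prefix": ("o", "ha"), "duration_ms": 100, "gesture": 2},
    {"target_id": 2, "sentence": "おはよう", "prefix": ("o", "ha"), "duration_ms": 90, "gesture": 2},
]
response_times = build_response_time_list(gesture_db, target_id=1)
```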

そして、制御部２０は、所定の対象とロボット１との会話が終了したか否かを判別する（ステップＳ１３０２）。会話の終了は会話記録処理（図４）のステップＳ４０２と同様に判別できる。会話が終了していないとき（ステップＳ１３０２；Ｎ）には、所定の対象が発話し始めたか否かを判別し（ステップＳ１３０３）、発話し始めるまで待機する（ステップＳ１３０３；Ｎ）。発話し始めは、例えば、音声レベルが閾値以上になったこと、あるいはカメラを用いて顔認識を行う場合には所定の対象の顔認識で検出する。所定の対象が発話し始めたとき（ステップＳ１３０３；Ｙ）には、ステップＳ１３０４以降の処理を実行する。 Then, the control unit 20 determines whether the conversation between the predetermined target and the robot 1 has ended (step S1302). The end of the conversation can be determined in the same manner as in step S402 of the conversation recording process (FIG. 4). If the conversation has not ended (step S1302; N), the control unit 20 determines whether the predetermined target has started speaking (step S1303), and waits until the target starts speaking (step S1303; N). The start of speaking is detected, for example, when the sound level reaches or exceeds a threshold value, or by facial recognition of the predetermined target in the case of facial recognition using a camera. When the predetermined target has started speaking (step S1303; Y), the processing from step S1304 onward is executed.

一方、会話が終了したとき（ステップＳ１３０２；Ｙ）には、予測応答制御処理を終了する。以上により、ステップＳ１３０４以降の処理は、所定の対象による１回の発話が開始されるごとに実行される。 On the other hand, when the conversation has ended (step S1302; Y), the predictive response control process ends. As a result, the processing from step S1304 onward is executed each time the predetermined target starts one utterance.

ステップＳ１３０４以降で、制御部２０は、予測応答時間リストに記録された予測応答時間の数だけ、以下のような処理を行う。 From step S1304 onward, the control unit 20 performs the following processing as many times as there are predicted response times recorded in the predicted response time list.

部分解析部２３は、ステップＳ１３０１で生成された応答時間リストから、最初の応答時間を読み出し（ステップＳ１３０４）、当該応答時間が、ステップＳ１３０３で所定の対象の発話が開始されたと判別されてから経過したか否かを判別し（ステップＳ１３０５）、当該応答時間が経過するまで待機する（ステップＳ１３０５；Ｎ）。当該応答時間が経過したとき（ステップＳ１３０５；Ｙ）には、部分解析部２３は、所定の対象の発話が開始されてから当該応答時間が経過するまでにマイクロフォン３から入力された所定の対象の音声を切り出す（ステップＳ１３０６）。そして、部分解析部２３は、切り出した音声に無音声が検出されるか否かを判別する（ステップＳ１３０７）。切り出した所定の対象の音声に一定時間（例えば１００ｍｓ）以上連続して、例えばレベルが閾値以下の無音が含まれていたら（ステップＳ１３０７；Ｙ）、ステップＳ１３０２に戻る。 The partial analysis unit 23 reads the first response time from the response time list generated in step S1301 (step S1304), determines whether that response time has elapsed since the predetermined target was determined in step S1303 to have started speaking (step S1305), and waits until it has elapsed (step S1305; N). When the response time has elapsed (step S1305; Y), the partial analysis unit 23 cuts out the voice of the predetermined target that was input from the microphone 3 between the start of the utterance and the elapse of the response time (step S1306). The partial analysis unit 23 then determines whether silence is detected in the cut-out voice (step S1307). If the cut-out voice contains silence, for example a level at or below a threshold, lasting continuously for a certain time (for example, 100 ms) or longer (step S1307; Y), the process returns to step S1302.

一方、切り出した所定の対象の音声に無音声が含まれていないとき（ステップＳ１３０７；Ｎ）には、部分解析部２３は、ステップＳ１３０６で切り出した音声を音素列に変換する（ステップＳ１３０８）。そして、部分解析部２３は、ステップＳ１３００で識別された所定の対象に対応する応答ジェスチャデータベースに記憶された複数の最短音素列の中に、ステップＳ１３０８で変換した音素列と一致する音素列が存在するか否かを判別する（ステップＳ１３０９）。変換した音素列と一致する最短音素列が存在するとき（ステップＳ１３０９；Ｙ）には、この一致する最短音素列に対応するジェスチャ動作をロボット１に実行させ（ステップＳ１３１２）、ステップＳ１３０２に戻る。 On the other hand, if the extracted voice of the specified target does not include silent voice (step S1307; N), the partial analysis unit 23 converts the voice extracted in step S1306 into a phoneme string (step S1308). Then, the partial analysis unit 23 determines whether or not a phoneme string matching the phoneme string converted in step S1308 is present among the multiple shortest phoneme strings stored in the response gesture database corresponding to the specified target identified in step S1300 (step S1309). If a shortest phoneme string matching the converted phoneme string is present (step S1309; Y), the robot 1 is caused to execute a gesture movement corresponding to this matching shortest phoneme string (step S1312), and the process returns to step S1302.

例えば、ステップＳ１３０８で変換した音素列が“ｏｈａ”であったする。部分解析部２３は、音素列“ｏｈａ”を、図１２に示す応答ジェスチャデータベースの最短音素列の中から検索すると、最短音素列“ｏｈａ”が一致するので、それに対応するジェスチャ番号“２”を取得して、ジェスチャ応答制御部２４に送る。そして、ジェスチャ応答制御部２４は、図３Ａに示すジェスチャ番号２に対応するジェスチャ動作を、ロボット１に実行させる。 For example, suppose the phoneme string converted in step S1308 is "oha." The partial analysis unit 23 searches for the phoneme string "oha" among the shortest phoneme strings in the response gesture database shown in FIG. 12, and since the shortest phoneme string "oha" matches, it obtains the corresponding gesture number "2" and sends it to the gesture response control unit 24. The gesture response control unit 24 then causes the robot 1 to perform the gesture action corresponding to gesture number 2 shown in FIG. 3A.

一方、変換した音素列と一致する最短音素列が存在しないとき（ステップＳ１３０９；Ｎ）には、応答時間リストから応答時間をすべて読み出したか否かを判別する（ステップＳ１３１０）。応答時間リストから応答時間をすべて読み出していないとき（ステップＳ１３１０；Ｎ）には、部分解析部２３は、応答時間リストから次の応答時間を読み出し（ステップＳ１３１１）、ステップＳ１３０５以降を再度、実行する。そして、応答時間リストから応答時間がすべて読み出されたとき（ステップＳ１３１０；Ｙ）には、ステップＳ１３０２に戻る。 On the other hand, if there is no shortest phoneme sequence that matches the converted phoneme sequence (step S1309; N), it is determined whether or not all response times have been read from the response time list (step S1310). If not all response times have been read from the response time list (step S1310; N), the partial analysis unit 23 reads the next response time from the response time list (step S1311) and executes steps S1305 and after again. Then, if all response times have been read from the response time list (step S1310; Y), the process returns to step S1302.
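The loop over response times in steps S1304 to S1312 can be sketched as follows for a single utterance. Real-time waiting, audio capture, and phoneme recognition are replaced by a caller-supplied function, so `heard_until`, `predict_gesture`, and the timing values are hypothetical stand-ins rather than parts of the described device.

```python
def predict_gesture(response_times_ms, heard_until, shortest_to_gesture):
    """Return a gesture number for one utterance, or None if nothing matches.

    `heard_until(t)` is assumed to return the phoneme units recognised in the
    first t ms of the utterance, or None if that cut-out audio contains a
    silent stretch. `shortest_to_gesture` maps shortest phoneme sequences
    (tuples of units) to gesture numbers.
    """
    for t in response_times_ms:                 # steps S1304 / S1311
        units = heard_until(t)                  # steps S1305 to S1308
        if units is None:                       # step S1307; Y
            return None
        gesture = shortest_to_gesture.get(tuple(units))   # step S1309
        if gesture is not None:
            return gesture                      # step S1312
    return None                                 # step S1310; Y


# Hypothetical utterance "おはよう": (end time in ms, unit)
timeline = [(40, "o"), (100, "ha"), (180, "yo"), (260, "u")]


def heard(t):
    return [unit for end, unit in timeline if end <= t]


gesture = predict_gesture([100, 150], heard, {("o", "ha"): 2})
```

In this example the audio cut out at the first response time (100 ms) already yields "o" and "ha", so gesture number 2 is selected without waiting for the rest of the utterance.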

以上、予測応答制御処理について説明した。制御部２０は、この予測応答制御処理でジェスチャ動作を行うのと並行して、次に説明する言語応答制御処理を行う。この言語応答制御処理について、図１５を参照して説明する。言語応答制御処理は、予測応答制御処理と同様、例えば、制御部２０の音声取得部２１で、所定の閾値を超える音声レベルの音声を検出したときに開始される。制御部２０は、言語応答制御処理を開始したら、まず、識別部２２で所定の対象を識別する（ステップＳ１５０１）。次に、制御部２０は、所定の対象とロボット１との会話が終了したか否かを、会話記録処理（図４）のステップＳ４０２と同様に判別する（ステップＳ１５０２）。 The above describes the predictive response control process. In parallel with performing the gesture action in this predictive response control process, the control unit 20 performs the language response control process, which will be described next. This language response control process will be described with reference to FIG. 15. As with the predictive response control process, the language response control process is started, for example, when the voice acquisition unit 21 of the control unit 20 detects a voice with a voice level exceeding a predetermined threshold. When the control unit 20 starts the language response control process, first, the control unit 20 identifies a predetermined target with the identification unit 22 (step S1501). Next, the control unit 20 determines whether the conversation between the predetermined target and the robot 1 has ended, in the same manner as step S402 of the conversation recording process (FIG. 4) (step S1502).

会話が終了していないとき（ステップＳ１５０２；Ｎ）には、予測応答制御処理（図１３）のステップＳ１３０３と同様に、所定の対象が発話し始めたか否かを判別し（ステップＳ１５０３）、発話し始めるまで待機する（ステップＳ１５０３；Ｎ）。一方、会話が終了したとき（ステップＳ１５０２；Ｙ）には、言語応答制御処理を終了する。 If the conversation has not ended (step S1502; N), similar to step S1303 of the prediction response control process (FIG. 13), it is determined whether or not the predetermined target has started speaking (step S1503), and the process waits until the target starts speaking (step S1503; N). On the other hand, if the conversation has ended (step S1502; Y), the language response control process is terminated.

所定の対象が発話を開始したとき（ステップＳ１５０３；Ｙ）には、発話解析部２５は、対象の発話音素を音声取得部２１から取得して（ステップＳ１５０４）、所定の対象の発話が終了したか否かを判別する（ステップＳ１５０５）。発話が終了したか否かは、例えば、音声取得部２１で取得する音声データの音声レベルが所定の閾値以下である状態が所定の時間（例えば６００ｍｓ）継続したか否かにより判別できる。発話が終了していない間は（ステップＳ１５０５；Ｎ）、発話音素の取得（ステップＳ１５０４）を繰り返す。 When the target starts speaking (step S1503; Y), the speech analysis unit 25 acquires the target's speech phonemes from the speech acquisition unit 21 (step S1504) and determines whether the target's speech has ended (step S1505). Whether the speech has ended can be determined, for example, by whether the voice level of the voice data acquired by the speech acquisition unit 21 remains below a predetermined threshold for a predetermined period of time (e.g., 600 ms). As long as the speech has not ended (step S1505; N), acquisition of the speech phonemes (step S1504) is repeated.
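The end-of-utterance test in step S1505 can be sketched as a check on trailing audio frames. The description gives only the condition (level at or below a threshold for, e.g., 600 ms). The frame length `frame_ms`, the level scale, and the function name `utterance_ended` are assumptions for illustration.

```python
def utterance_ended(levels, threshold, frame_ms=10, silence_ms=600):
    """True once the most recent frames at or below `threshold` cover at
    least `silence_ms` of audio (cf. step S1505).

    `levels` is the sequence of per-frame voice levels acquired so far,
    one value per `frame_ms` milliseconds (an assumed representation).
    """
    needed = silence_ms // frame_ms
    if len(levels) < needed:
        return False
    return all(level <= threshold for level in levels[-needed:])


# 300 ms of speech followed by 600 ms of silence, in 10 ms frames
levels = [5] * 30 + [0] * 60
ended = utterance_ended(levels, threshold=1)
```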

所定の対象の発話が終了したとき（ステップＳ１５０５；Ｙ）には、発話解析部２５は、取得した音素の列を発話文に変換し（ステップＳ１５０６）、変換した発話文から、記憶部３０に記憶されている辞書データベースを参照し、構文解析して、発話文に含まれている単語と構文を取得する（ステップＳ１５０７）。 When the target has finished speaking (step S1505; Y), the speech analysis unit 25 converts the acquired sequence of phonemes into a spoken sentence (step S1506), and from the converted spoken sentence, references the dictionary database stored in the storage unit 30, performs syntactic analysis, and obtains the words and syntax contained in the spoken sentence (step S1507).

次に、応答文生成部２６は、発話文の単語と構文に基づいて、記憶部３０に記憶されている応答文データベースを参照して、所定の対象の発話に対する応答文を生成する（ステップＳ１５０８）。そして、言語応答制御部２７は、音声合成によって応答文をスピーカ４から発声させ（ステップＳ１５０９）、ステップＳ１５０２に戻る。 Next, the response sentence generation unit 26 generates a response sentence to the utterance of the predetermined target based on the words and syntax of the utterance sentence by referring to the response sentence database stored in the storage unit 30 (step S1508). Then, the language response control unit 27 vocalizes the response sentence from the speaker 4 by voice synthesis (step S1509), and the process returns to step S1502.

以上、言語応答制御処理について説明した。次に、予測応答制御処理の動作例を図１６を参照して説明する。図１６は、図１４の応答時間リストから読み出した応答時間に従って切り出した音声の音素列が、図１２の応答ジェスチャデータベースの最短音素列に一致し、一致した最短音素列に対応するジェスチャ動作を実行した場合の動作例を示す。この例では、所定の対象により「おはよう」という発話が入力されている。 The above describes the linguistic response control process. Next, an example of the operation of the predictive response control process will be described with reference to FIG. 16. FIG. 16 shows an example of the operation when a phoneme string of a voice extracted according to a response time read from the response time list in FIG. 14 matches the shortest phoneme string in the response gesture database in FIG. 12, and a gesture action corresponding to the matching shortest phoneme string is executed. In this example, the utterance "Good morning" is input by a specified target.

そして、所定の対象が発話を開始してから（時点：Ｔ０～）、応答時間リストの最初の応答時間（１００ｍｓ）が経過するまで（時点：Ｔ１）に入力された所定の対象の音声を切り出し（図１３のステップＳ１３０６）、切り出した音声を音素列に変換すると（図１３のステップＳ１３０８）、“ｏｈａ”であった場合を想定している。この変換した“ｏｈａ”の音素列は、図１２の応答ジェスチャデータベースの発話文「おはよう」の最短音素列“ｏｈａ”に一致する（ステップＳ１３０９：Ｙ）。そこで、この最短音素列“ｏｈａ”に対応するジェスチャ番号“２”のジェスチャ動作をロボット１に実行させる（ステップＳ１３１２）。 Then, it is assumed that the speech of the specified target input from when the specified target starts speaking (time: T0~) until the first response time (100 ms) in the response time list has elapsed (time: T1) is extracted (step S1306 in FIG. 13) and the extracted speech is converted into a phoneme string (step S1308 in FIG. 13), resulting in "oha." This converted phoneme string of "oha" matches the shortest phoneme string "oha" of the spoken sentence "Good morning" in the response gesture database in FIG. 12 (step S1309: Y). Therefore, the robot 1 is caused to execute a gesture action with gesture number "2" corresponding to this shortest phoneme string "oha" (step S1312).

その後、所定の対象からの音声がない状態（音声取得部２１で取得する音声データの音声レベルが所定の閾値以下である状態）が一定時間（例えば６００ｍｓ）経過すると（図１５のステップＳ１５０５；Ｙ）、言語応答制御部２７は、ロボット１を制御して、言語を用いた発話応答をロボット１に実行させる（ステップＳ１５０６～Ｓ１５０９）。上記の一定時間（例えば６００ｍｓ）は、所定の対象が発話し終えたのを確認して応答文を生成するための時間である。このように、ロボット１に発話応答を実行させる前に、所定の対象の発話文、すなわち発話の内容を最短音素列を用いて予測し、それに応じてジェスチャ動作をロボット１に実行させるので、ロボット１の発話応答が実行される前に、所定の対象はロボット１が自分の発話を聞いているという実感を持つことができる。 After that, when a certain time (e.g., 600 ms) has elapsed without any voice from the predetermined target (a state in which the voice level of the voice data acquired by the voice acquisition unit 21 is below a predetermined threshold) (step S1505 in FIG. 15; Y), the language response control unit 27 controls the robot 1 to make the robot 1 execute a speech response using language (steps S1506 to S1509). The certain time (e.g., 600 ms) is a time for confirming that the predetermined target has finished speaking and generating a response sentence. In this way, before making the robot 1 execute a speech response, the speech sentence of the predetermined target, i.e., the content of the speech, is predicted using the shortest phoneme sequence, and the robot 1 is made to execute a gesture movement accordingly, so that the predetermined target can have the feeling that the robot 1 is listening to his/her speech before the speech response of the robot 1 is executed.

制御装置２がロボット１に行わせる非言語的な応答は、頭部１０１及び腕部１０３の動きに限らない。非言語的な応答として、ジェスチャ動作には、頭部１０１及び腕部１０３の動きだけではなく、顔の表情、例えば、瞼の開閉、眉の上げ下げ、目、鼻もしくは口の動きなどの動作、あるいは、手を振る、又は手の形を変えて示す、例えば、手を握るもしくは手を開いて上に挙げる、などを含む。その他、非言語的な応答としては、ロボット１に備えられるディスプレイ式の目の表示態様を変えるものでもよい。 The non-verbal responses that the control device 2 causes the robot 1 to perform are not limited to movements of the head 101 and the arms 103. As non-verbal responses, gesture actions include not only movements of the head 101 and the arms 103 but also facial expressions, such as opening and closing the eyelids, raising and lowering the eyebrows, and moving the eyes, nose, or mouth, as well as waving a hand or changing the shape of the hand, for example clenching a fist or raising an open hand. A non-verbal response may also be a change in the display mode of display-type eyes provided on the robot 1.

分析学習の対象の会話記録は、少なくとも直近に記録された発話文を含むが、この会話記録が記録された期間は一定の期間である必要はない。例えば、分析学習を行う時の直近の所定の期間として、直近の１日、直近の１週間、直近の１ヶ月等、任意の期間の会話記録でもよい。分析学習ごとに対象とする会話記録の期間を変化させる場合、前回の分析学習の対象の会話記録と、新たな分析学習の対象の会話記録とは、対象とする期間の一部が重複していてもよいし、全く重複しなくてもよい。 The conversation records that are the subject of analytical learning include at least the most recently recorded speech sentences, but the period for which these conversation records were recorded does not need to be a fixed period. For example, the most recent specified period when analytical learning is performed may be any period of conversation records, such as the most recent day, the most recent week, the most recent month, etc. When the period of the conversation records that are the subject of analytical learning is changed for each analytical learning, the conversation records that were the subject of the previous analytical learning and the conversation records that are the subject of the new analytical learning may overlap in part in their target periods, or they may not overlap at all.

以上説明したとおり、本実施の形態によれば、所定の対象の発話の部分的な一致によって、当該発話に対応する非言語的な挙動をロボットに行わせることができるので、少なくとも、発話終了検出、音声認識及び応答文生成の時間をかけずに応答することができ、所定の対象の発話に対する応答を迅速かつ適切に行うことができる。 As described above, according to this embodiment, a partial match of an utterance by a specific target can cause the robot to perform non-verbal behavior corresponding to that utterance, so that a response can be made without the time required for at least detecting the end of the utterance, recognizing the voice, and generating a response sentence, and a response to the utterance of the specific target can be made quickly and appropriately.

また、所定の対象に対して非言語的な挙動を用いた所定の応答を返すので、所定の対象の会話を邪魔しない（会話自体は通常に進行する）。そのため、発話に対して応答文で早く反応を返す場合に比べて、誤った反応を行う可能性が小さい。また仮に、ロボット１が行う非言語的な挙動（ジェスチャ動作等）が、所定の対象の発話文に対する応答として適切でなかったとしても、会話には大きな影響を与えない。 In addition, since the robot 1 returns a predetermined response to a predetermined target using non-verbal behavior, it does not interfere with the conversation of the target (the conversation itself proceeds normally). Therefore, there is less chance of an incorrect response compared to a case where the robot 1 quickly responds to an utterance with a response sentence. In addition, even if the non-verbal behavior (gestures, etc.) of the robot 1 is not an appropriate response to a spoken sentence by a predetermined target, it does not have a significant impact on the conversation.

実施の形態に係る分析学習処理では、ジェスチャ動作に対応づけられている発話文について、会話記録の重複を除去したユニーク音素列テーブルで先頭からの音素列が一致しない最短の音素列について、その音素列の長さが当該発話文の長さよりも短い場合に、当該発話文を特定する最短音素列として記録するので、当該発話文全体を解析してから応答するのに比べて、短時間で応答することができる。 In the analysis learning process according to the embodiment, for each spoken sentence associated with a gesture action, the shortest leading phoneme string that matches the leading phonemes of no other spoken sentence in the unique phoneme sequence table (the conversation record with duplicates removed) is found, and, if it is shorter than the phoneme string of the whole spoken sentence, it is recorded as the shortest phoneme sequence identifying that spoken sentence. The robot can therefore respond in a shorter time than if it analyzed the entire spoken sentence before responding.

制御装置２は、所定の対象を識別する識別部２２を備え、発話記録の発話文に、識別した所定の対象の対象ＩＤを対応づけ、分析学習処理で所定の対象ごとに、ジェスチャ動作に対応づけられた発話文を特定する最短音素列を特定して記録し、所定の対象ごとに応答ジェスチャデータベースを作成する。そして、所定の対象との会話において、所定の対象を識別して、その所定の対象の応答ジェスチャデータベースを用いて、発話文を最短音素列で特定するので、所定の対象の発話に合わせたジェスチャ応答が可能で、素早くジェスチャ応答を返すことができる。 The control device 2 includes an identification unit 22 that identifies a specific target, associates the target ID of the identified specific target with a spoken sentence in the speech record, identifies and records the shortest phoneme sequence that identifies the spoken sentence associated with a gesture action for each specific target in an analysis and learning process, and creates a response gesture database for each specific target. Then, in a conversation with a specific target, the specific target is identified and the response gesture database for the specific target is used to identify the spoken sentence with the shortest phoneme sequence, making it possible to make a gesture response that matches the speech of the specific target, and to quickly return a gesture response.

例えば実施の形態では、図８に示すように、発話文に「おはよう」と「おやすみ」が存在する場合は、「おはよう」を特定する最短音素列は”ｏｈａ”となるため、最短音素列の例として主に”ｏｈａ”を用いて説明した。しかし、識別した所定の対象が、標準語の「おはよう」の代わりに「はやえなっす」という方言を話す人の場合、「はやえなっす」を特定する最短音素列は”ｈａ”、”ｈａｙａ”、”ｈａｙａｅ”等になり得る。もし最短音素列が”ｈａ”となる場合は、最短音素列が”ｏｈａ”になる人と比べてさらに速い応答が可能になる。このように、制御装置２は、識別対象毎に会話記録処理や分析学習処理を行うことにより、当該識別対象にとって最適な予測応答制御処理を行うことができるようになる。 For example, in the embodiment, as shown in FIG. 8, if the spoken sentence contains "good morning" and "good night," the shortest phoneme sequence that identifies "good morning" is "oha," and therefore "oha" has been mainly used as an example of the shortest phoneme sequence. However, if the identified target is a person who speaks "hayaenassu" in a dialect instead of the standard "good morning," the shortest phoneme sequence that identifies "hayaenassu" could be "ha," "haya," "hayae," etc. If the shortest phoneme sequence is "ha," a faster response is possible compared to a person whose shortest phoneme sequence is "oha." In this way, the control device 2 performs a conversation recording process and an analysis and learning process for each identification target, thereby enabling it to perform a predictive response control process that is optimal for the identification target.
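
The effect of learning per identified target can be illustrated with a short sketch. It assumes a hypothetical conversation record keyed by target ID and mora-like units (both are illustrative assumptions, including the segmentation of the dialect greeting); `shortest_prefixes` is a hypothetical helper, not a name used in the embodiment.

```python
from typing import Dict, List, Tuple

Units = Tuple[str, ...]  # assumed representation of one utterance


def shortest_prefixes(sentences: List[Units]) -> Dict[Units, Units]:
    """Map each distinct utterance to its shortest proper prefix unique within `sentences`."""
    unique = list(dict.fromkeys(sentences))  # remove duplicates, keep order
    result: Dict[Units, Units] = {}
    for s in unique:
        for n in range(1, len(s)):
            if not any(o != s and o[:n] == s[:n] for o in unique):
                result[s] = s[:n]
                break
    return result


# Hypothetical per-target conversation records.
records: Dict[str, List[Units]] = {
    "speaker_A": [("o", "ha", "yo", "u"), ("o", "ya", "su", "mi")],
    "speaker_B": [("ha", "ya", "e", "na", "ssu"), ("o", "ya", "su", "mi")],
}

# One table per identified target, built from that target's record only.
per_target_db = {target_id: shortest_prefixes(s) for target_id, s in records.items()}

for target_id, table in per_target_db.items():
    for sentence, prefix in table.items():
        print(target_id, "".join(sentence), "->", "".join(prefix))
```

In this toy data the morning greeting of speaker_B is identified after "ha" alone, one unit earlier than "oha" for speaker_A, which is the kind of gain the paragraph above describes.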

実施の形態では、最短音素列の長さの時間で音声を切り出して音素列を比較したが、それに限らず、さまざまな変形が可能である。例えば、音声の切り出しを逐次行い、話し始めからの音素列が最短音素列と一致するかどうかで、予測応答制御処理における一致する音素列があるか否かの判定を行ってもよい。また、音素列に変換せず、直接、話し始めからの音声と、ジェスチャ動作に対応させて記憶した参照音声との比較を行い、類似する音声なら参照音声に対応するジェスチャを行う構成としてもよい。 In the embodiment, audio is cut out for a duration equal to the length of the shortest phoneme sequence and the phoneme sequences are compared, but this is not limiting and various modifications are possible. For example, the audio may be cut out sequentially, and the predictive response control process may decide whether a matching phoneme sequence exists by checking whether the phoneme sequence from the start of speech matches a shortest phoneme sequence. Alternatively, without converting to a phoneme sequence, the audio from the start of speech may be compared directly with reference audio stored in association with a gesture action, and if the two are similar, the gesture corresponding to the reference audio may be performed.
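
The sequential variant mentioned above might look like the following sketch, in which units are consumed one at a time from the start of speech and checked against the registered shortest sequences. The function name, the gesture labels "bow" and "wave", and the dictionary layout are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Units = Tuple[str, ...]  # assumed representation of a phoneme sequence


def first_matching_gesture(
    stream: Iterable[str], prefix_to_gesture: Dict[Units, str]
) -> Optional[Tuple[str, int]]:
    """Consume units one by one; return (gesture, units_heard) at the first match, else None."""
    max_len = max((len(p) for p in prefix_to_gesture), default=0)
    heard: List[str] = []
    for unit in stream:
        heard.append(unit)
        gesture = prefix_to_gesture.get(tuple(heard))
        if gesture is not None:
            return gesture, len(heard)
        if len(heard) >= max_len:
            break  # no registered sequence is longer than what was heard
    return None


# Hypothetical response gesture table keyed by shortest sequences.
db = {("o", "ha"): "bow", ("o", "ya"): "wave"}

print(first_matching_gesture(["o", "ha", "yo", "u"], db))  # ('bow', 2)
print(first_matching_gesture(["ko", "n", "ni", "chi", "wa"], db))  # None
```

Because the registered sequences are shortest identifying prefixes, an exact dictionary lookup on the units heard so far is enough in this sketch; a real recognizer would also need to tolerate recognition errors, which is outside its scope.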

実施の形態では、応答ジェスチャデータベースの登録の際に、音素列の長さの平均時間を計算して登録を行ったが、集計した時間のゆれが大きいものは、登録しないとしてもよい。実施の形態では、出現回数、音素列の長さのゆれを考慮せず登録を行う構成としているが、出現回数、音素列の長さのゆれが統計的に意味のある頻度になったら登録するとしてもよい。 In the embodiment, when an entry is registered in the response gesture database, the average duration of the phoneme sequence is calculated and registered, but entries whose measured durations vary widely may be left unregistered. The embodiment registers entries without considering the number of occurrences or the variation in phoneme sequence length, but an entry may instead be registered only once its number of occurrences and length variation reach statistically meaningful levels.
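
One way to gate registration on occurrence count and duration spread is sketched below. The thresholds `min_count` and `max_rel_spread` are arbitrary illustrative values, and a simple relative standard deviation stands in for whatever statistical criterion an actual implementation would use; none of these names or values come from the embodiment.

```python
from statistics import mean, pstdev
from typing import List, Optional


def registered_duration(
    durations_s: List[float], min_count: int = 5, max_rel_spread: float = 0.2
) -> Optional[float]:
    """Return the average duration to register, or None if the entry should be skipped.

    An entry is skipped when it has too few observations or when the
    durations vary too much relative to their mean (assumed criteria).
    """
    if len(durations_s) < min_count:
        return None
    avg = mean(durations_s)
    if avg <= 0 or pstdev(durations_s) / avg > max_rel_spread:
        return None
    return avg


# Hypothetical measured durations (seconds) of the same shortest sequence.
stable = [0.30, 0.32, 0.31, 0.29, 0.30]
unstable = [0.1, 0.9, 0.2, 0.8, 0.5]

print(registered_duration(stable))       # average of the five values
print(registered_duration(unstable))     # None: spread too large
print(registered_duration([0.3, 0.3]))   # None: too few occurrences
```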

実施の形態では、学習部２８で、所定の対象毎に発話文を特定する最短音素列とジェスチャ動作との対応を学習したが、所定の対象毎に応答ジェスチャデータベースを学習せず、例えば工場出荷前に予め作成した応答ジェスチャデータベースをロボット１のＲＯＭ（又は不揮発性のＲＡＭ）に記憶させてもよい。予め応答ジェスチャデータベースを作成するには、様々な発話文を集めた音声会話データベースもしくはロボットとの音声会話を集めたデータベース（音声会話を沢山集めたもの）を用意し、この用意したデータベースを用いて、図６の分析学習を行えばよい。その場合、当該応答ジェスチャデータベースに登録される最短音素列の長さは、想定される一般の対象の平均又は標準偏差を含む時間とすることができる。また、当該応答ジェスチャデータベースには、出現回数に代えて発話文の一般的な発生確率を含めてもよい。あるいは、当該応答ジェスチャデータベースを書き換え可能なＲＯＭ又は不揮発性のＲＡＭに記憶させておき、ロボット１が作動している間に、所定の対象の発話から出現回数をカウントして、当該応答ジェスチャデータベース内の出現回数の項目を更新していくようにしてもよい。 In the embodiment, the learning unit 28 learns the correspondence between the shortest phoneme sequence that identifies a speech sentence for each predetermined target and a gesture action. However, instead of learning a response gesture database for each predetermined target, for example, a response gesture database created before shipping from the factory may be stored in the ROM (or non-volatile RAM) of the robot 1. To create a response gesture database in advance, a voice conversation database that collects various speech sentences or a database that collects voice conversations with the robot (a database that collects a lot of voice conversations) is prepared, and the analysis learning of FIG. 6 is performed using this prepared database. In that case, the length of the shortest phoneme sequence registered in the response gesture database can be a time that includes the average or standard deviation of the expected general target. In addition, the response gesture database may include the general occurrence probability of the speech sentence instead of the number of occurrences. Alternatively, the response gesture database may be stored in a rewritable ROM or non-volatile RAM, and the number of occurrences may be counted from the utterance of the predetermined target while the robot 1 is operating, and the number of occurrences item in the response gesture database may be updated.

実施の形態では、発話の内容を「発話文」として規定したが、発話の内容は文に限定されない。例えば、「挨拶」（「おはよう」、「こんにちは」等）、「お礼」（「ありがとう」、「感謝しているよ」等）、「質問」（「ちょっと教えて」、「ひとつ聞いてもいい」等）、「評価」（「うまいね」、「よくわかったね」等）等の「発話のカテゴリ」（ここでは「発話の目的」）を発話の内容として規定してもよい。この場合、制御装置２は、それらの「発話のカテゴリ」それぞれに対して、ロボット１の非言語的な挙動の応答を定めておいて、その「発話のカテゴリ」を特定する最短音素列と非言語的な挙動の応答との対応を学習することができる。非言語的な挙動は、例えば、「挨拶」に対してはおじぎのジェスチャ、「お礼」又は「評価」に対しては手を横に振るジェスチャ、「質問」に対しては頭部１０１を傾げるジェスチャ等とすることができる。 In the embodiment, the content of the utterance is defined as an "utterance sentence", but the content of the utterance is not limited to a sentence. For example, "utterance categories" (here, "purposes of utterance") such as "greetings" ("good morning", "hello", etc.), "thanks" ("thank you", "I'm grateful", etc.), "questions" ("tell me a bit", "can I ask you something", etc.), and "evaluations" ("good", "you understand well", etc.) may be defined as the content of the utterance. In this case, the control device 2 can determine the non-verbal behavioral responses of the robot 1 for each of these "utterance categories" and learn the correspondence between the shortest phoneme sequence that identifies the "utterance category" and the non-verbal behavioral responses. Non-verbal behaviors can be, for example, a bowing gesture for a "greeting", a shaking hand gesture for a "thank you" or "evaluation", and a tilting of the head 101 for a "question".
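
When the unit to be identified is an utterance category rather than a single sentence, the shortest sequence only has to separate one category from the others. The sketch below illustrates this under assumptions: the category labels, gesture names, mora-like units, and the function `shortest_category_prefix` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Units = Tuple[str, ...]  # assumed representation of one utterance


def shortest_category_prefix(target: Units, category_of: Dict[Units, str]) -> Optional[Units]:
    """Shortest proper prefix of `target` shared only by utterances of the same category."""
    wanted = category_of[target]
    for n in range(1, len(target)):
        prefix = target[:n]
        if all(cat == wanted for utt, cat in category_of.items() if utt[:n] == prefix):
            return prefix
    return None


# Hypothetical utterances with their categories.
ohayou = ("o", "ha", "yo", "u")
oyasumi = ("o", "ya", "su", "mi")
arigatou = ("a", "ri", "ga", "to", "u")
category_of = {ohayou: "greeting", oyasumi: "greeting", arigatou: "thanks"}

# Gesture per category, following the examples given in the text above.
CATEGORY_GESTURE = {
    "greeting": "bow",
    "thanks": "wave_hand",
    "evaluation": "wave_hand",
    "question": "tilt_head",
}

prefix = shortest_category_prefix(ohayou, category_of)
print("".join(prefix), "->", CATEGORY_GESTURE[category_of[ohayou]])  # o -> bow
```

In this toy table both greetings start with "o" and nothing else does, so "o" alone already identifies the category, one unit earlier than the sentence-level sequence "oha".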

ジェスチャに対応づけられる発話文（ジェスチャ対応文）は、日本語に限らず、外国語でもよい。制御装置２は、言語ごとの音素セットを用いて、分析学習処理及び予測応答処理を行うことができる。例えば、英語の発話文とジェスチャ動作を対応づけておいて、英語の音素セットを用いて、上述の分析学習処理及び予測応答制御処理を行うことができる。 The spoken sentences associated with the gestures (gesture-corresponding sentences) are not limited to Japanese, and may be in a foreign language. The control device 2 can perform analysis learning processing and predictive response processing using a phoneme set for each language. For example, English spoken sentences can be associated with gesture movements, and the above-mentioned analysis learning processing and predictive response control processing can be performed using an English phoneme set.

その他、１つの発話文に対応する非言語的な挙動は、１つには限らない。例えば、１つの発話文（ジェスチャ対応文）に、複数のジェスチャ動作を含むジェスチャ動作群を対応づけておいて、その発話文の最短音素列を検出したときに、対応するジェスチャ動作群から１つのジェスチャ動作を選択して、ロボット１に実行させてもよい。その場合、ジェスチャ動作群からのジェスチャ動作の選択は、決まった確率又はランダムでもよいし、あるいは、最短音素列が発話されたときの音の高さ、発話の声の大きさ、音素列のうちのアクセントの位置、音素列の抑揚の違いなどの発話の変化に応じて、ジェスチャ動作群からジェスチャ動作を選択してもよい。さらに、発話の変化によって、ジェスチャ動作群からジェスチャ動作を選択する確率を変化させて、変化させた確率でジェスチャ動作を選択してもよい。 In addition, the number of non-verbal behaviors corresponding to one spoken sentence is not limited to one. For example, a gesture action group including multiple gesture actions may be associated with one spoken sentence (gesture-corresponding sentence), and when the shortest phoneme string of the spoken sentence is detected, one gesture action may be selected from the corresponding gesture action group and executed by the robot 1. In this case, the selection of a gesture action from the gesture action group may be made with a fixed probability or randomly, or a gesture action may be selected from the gesture action group according to changes in speech, such as the pitch of the sound when the shortest phoneme string is spoken, the volume of the speech, the position of the accent in the phoneme string, and differences in intonation of the phoneme string. Furthermore, the probability of selecting a gesture action from the gesture action group may be changed according to changes in speech, and a gesture action may be selected with the changed probability.
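
Selecting one gesture from a gesture action group with probabilities that shift according to the speech can be sketched as follows. The weighting rule (each base weight scaled by `1 + boost * loudness`, with loudness normalized to the range 0 to 1) and the gesture names are assumptions made for illustration only.

```python
import random
from typing import Dict


def adjusted_weights(
    base: Dict[str, float], boosts: Dict[str, float], loudness: float
) -> Dict[str, float]:
    """Scale each base weight by (1 + boost * loudness); loudness is assumed to lie in [0, 1]."""
    return {g: w * (1.0 + boosts.get(g, 0.0) * loudness) for g, w in base.items()}


def choose_gesture(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick one gesture from the group with probability proportional to its weight."""
    gestures = list(weights)
    return rng.choices(gestures, weights=[weights[g] for g in gestures], k=1)[0]


# Hypothetical gesture action group for one gesture-corresponding sentence.
base = {"small_nod": 1.0, "big_nod": 1.0}
boosts = {"big_nod": 2.0}  # louder speech favors the larger gesture (assumed rule)

quiet = adjusted_weights(base, boosts, loudness=0.0)  # equal weights
loud = adjusted_weights(base, boosts, loudness=1.0)   # big_nod weighted 3:1

rng = random.Random(0)  # seeded only to make the sketch repeatable
print(choose_gesture(quiet, rng), choose_gesture(loud, rng))
```

Other speech features named above, such as pitch, accent position, or intonation, could feed the same weighting step in place of loudness.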

以上の構成の変化及び変形例のほか、さまざまな変形と派生が可能である。例えば、ロボット１の形状は、図１に示した形状に限らない。例えば、犬又は猫をはじめとして、ペットを模した形状とすることができる。ロボット１は、また、ぬいぐるみやアニメなどのキャラクタの形状であってもよい。 In addition to the above-mentioned changes and modifications, various other modifications and variations are possible. For example, the shape of the robot 1 is not limited to the shape shown in FIG. 1. For example, it may be shaped to resemble a pet, such as a dog or a cat. The robot 1 may also be shaped like a stuffed toy or an anime character.

あるいはさらに、ロボット１は、スマートフォン又はタブレットなどの画面に表示されるアバターであってもよい。ロボット１がアバターである場合、制御装置２は、スマートフォン又はタブレットにインストールされるアプリケーションプログラムで実現することができる。制御装置２は、アバターが画面に表示されているスマートフォン又はタブレットが備えるマイクロフォン３から音声信号を取得し、画面に表示されているアバターに非言語的な応答を行わせ、そして、スマートフォン又はタブレットが備えるスピーカ４から、応答文を発話させる。 Alternatively, the robot 1 may be an avatar displayed on the screen of a smartphone or tablet. When the robot 1 is an avatar, the control device 2 can be realized by an application program installed on the smartphone or tablet. The control device 2 acquires an audio signal from a microphone 3 provided on the smartphone or tablet on whose screen the avatar is displayed, causes the avatar displayed on the screen to respond non-verbally, and causes the response sentence to be spoken from a speaker 4 provided on the smartphone or tablet.

制御装置２は、制御部２０として、ＣＰＵの代わりに、例えばＡＳＩＣ（Application Specific Integrated Circuit）、ＦＰＧＡ（Field-Programmable Gate Array）、又は、各種制御回路等の専用のハードウェアを備え、専用のハードウェアが、図２に示した各部として機能してもよい。この場合、各部の機能それぞれを個別のハードウェアで実現してもよいし、各部の機能をまとめて単一のハードウェアで実現することもできる。また、各部の機能のうちの、一部を専用のハードウェアによって実現し、他の一部をソフトウェア又はファームウェアによって実現してもよい。 Instead of a CPU, the control device 2 may have dedicated hardware such as an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or various control circuits as the control unit 20, and the dedicated hardware may function as each unit shown in FIG. 2. In this case, the functions of each unit may be realized by individual hardware, or the functions of each unit may be realized together by a single piece of hardware. Also, some of the functions of each unit may be realized by dedicated hardware, and the other parts may be realized by software or firmware.

制御装置２の各機能を実現するプログラムは、例えば、フレキシブルディスク、ＣＤ（Compact Disc）－ＲＯＭ、ＤＶＤ（Digital Versatile Disc）－ＲＯＭ、メモリカード等のコンピュータ読み取り可能な記憶媒体に格納して適用できる。さらに、プログラムを搬送波に重畳し、インターネットなどの通信媒体を介して適用することもできる。例えば、通信ネットワーク上の掲示板（ＢＢＳ：Bulletin Board System）にプログラムを掲示して配信してもよい。そして、このプログラムを起動し、ＯＳ（Operating System）の制御下で、他のアプリケーションプログラムと同様に実行することにより、上記の処理を実行できるように構成してもよい。 The programs that realize the functions of the control device 2 can be stored and applied on a computer-readable storage medium, such as a flexible disk, a CD (Compact Disc)-ROM, a DVD (Digital Versatile Disc)-ROM, or a memory card. Furthermore, the programs can be superimposed on carrier waves and applied via a communication medium such as the Internet. For example, the programs can be posted and distributed on a bulletin board system (BBS) on a communication network. The above-mentioned processes can then be performed by starting up the programs and executing them under the control of an OS (Operating System) in the same way as other application programs.

以上、本発明の好ましい実施の形態について説明したが、本発明はかかる特定の実施の形態に限定されるものではなく、本発明には、特許請求の範囲に記載された発明とその均等の範囲とが含まれる。以下に、本願出願の当初の特許請求の範囲に記載された発明を付記する。 Although the preferred embodiment of the present invention has been described above, the present invention is not limited to such a specific embodiment, and the present invention includes the inventions described in the claims and their equivalents. The inventions described in the original claims of this application are described below.

（付記１）
対象に対して応答可能なロボットの制御装置であって、
前記対象の発話を取得する取得手段と、
前記対象が発話しているときに前記取得手段により取得された前記発話を部分的に解析する第１解析手段と、
前記ロボットによる応答であって、前記対象に対する非言語的な挙動を用いた第１応答を、前記第１解析手段による解析結果に応じて制御する第１制御手段と、
前記取得手段により取得された前記対象の発話を、前記第１解析手段により解析される発話の区間よりも長い区間で解析する第２解析手段と、
前記第２解析手段による解析結果に応じて、前記対象の前記発話に対する応答文を生成する生成手段と、
前記生成手段により生成された前記応答文に基づいて、前記ロボットによる言語を用いた第２応答を制御する第２制御手段と、
を備えることを特徴とするロボットの制御装置。
(Appendix 1)
A control device for a robot capable of responding to a target, comprising:
an acquisition means for acquiring an utterance of the target;
a first analysis means for partially analyzing the utterance acquired by the acquisition means while the target is speaking;
a first control means for controlling, in accordance with an analysis result by the first analysis means, a first response by the robot that uses non-verbal behavior toward the target;
a second analysis means for analyzing the utterance of the target acquired by the acquisition means over a section longer than the section of the utterance analyzed by the first analysis means;
a generation means for generating a response sentence to the utterance of the target in accordance with an analysis result by the second analysis means; and
a second control means for controlling, based on the response sentence generated by the generation means, a second response by the robot that uses language.

（付記２）
前記第１解析手段が解析する前記発話の音素列の長さは、前記第２解析手段が解析する前記発話の音素列の長さよりも短く設定されていることを特徴とする、付記１に記載のロボットの制御装置。
(Appendix 2)
The robot control device according to Appendix 1, wherein the length of the phoneme sequence of the utterance analyzed by the first analysis means is set shorter than the length of the phoneme sequence of the utterance analyzed by the second analysis means.

（付記３）
前記第１応答は複数の第１応答から成り、
前記複数の第１応答に対応する複数の参照音素列を記憶する記憶手段を更に備え、
前記第１解析手段は、前記対象が発話しているときに前記取得手段により取得された前記発話の一部の音素列が前記複数の参照音素列の何れかに一致するか否かを判別し、
前記第１制御手段は、前記第１解析手段により前記一部の音素列が前記複数の参照音素列の前記何れかに一致すると判別されたときには、前記複数の第１応答のうち、当該一致すると判別された前記複数の参照音素列の前記何れかに対応する第１応答を前記ロボットに実行させることを特徴とする、付記１又は２に記載のロボットの制御装置。
(Appendix 3)
The robot control device according to Appendix 1 or 2, wherein
the first response comprises a plurality of first responses,
the control device further comprises a storage means for storing a plurality of reference phoneme sequences corresponding to the plurality of first responses,
the first analysis means determines whether a phoneme sequence of a part of the utterance acquired by the acquisition means while the target is speaking matches any of the plurality of reference phoneme sequences, and
when the first analysis means determines that the phoneme sequence of the part matches one of the plurality of reference phoneme sequences, the first control means causes the robot to execute, among the plurality of first responses, the first response corresponding to the matched reference phoneme sequence.

（付記４）
前記複数の参照音素列の各々は、前記対象の発話の内容を特定可能な最短の音素列に設定されていることを特徴とする、付記３に記載のロボットの制御装置。
(Appendix 4)
The robot control device according to Appendix 3, wherein each of the plurality of reference phoneme sequences is set to the shortest phoneme sequence capable of identifying the content of the target's utterance.

（付記５）
前記ロボットは、前記対象として互いに異なる複数の対象に対して、前記第１応答及び前記第２応答を実行可能であり、
前記複数の対象の各々を識別する識別手段を更に備え、
前記取得手段は、前記識別された対象ごとに、当該対象の発話を取得し、
前記取得された対象の発話の内容を解析する解析手段と、
前記複数の第１応答から、前記解析手段による解析結果に応じた第１応答を特定する特定手段と、
前記解析手段による解析結果に基づいて、当該解析結果に対応する前記対象の発話の内容を特定可能な最短の音素列を前記参照音素列として、前記特定された第１応答に対応付けて、前記識別された対象ごとに学習する学習手段と、を更に備えることを特徴とする、付記３に記載のロボットの制御装置。
(Appendix 5)
The robot control device according to Appendix 3, wherein
the robot is capable of executing the first response and the second response for a plurality of mutually different targets as the target,
the control device further comprises an identification means for identifying each of the plurality of targets,
the acquisition means acquires, for each identified target, an utterance of that target, and
the control device further comprises:
an analysis means for analyzing the content of the acquired utterance of the target;
a specifying means for specifying, from the plurality of first responses, a first response corresponding to an analysis result by the analysis means; and
a learning means for learning, for each identified target and based on the analysis result by the analysis means, the shortest phoneme sequence capable of identifying the content of the target's utterance corresponding to that analysis result, as the reference phoneme sequence associated with the specified first response.

（付記６）
前記ロボットは、駆動可能な可動部を有し、
前記第１応答は、前記ロボットの前記可動部を駆動することによって実現されるジェスチャ動作による応答であり、前記第２応答は、前記対象に対して前記応答文を発話する応答であることを特徴とする、付記１から５の何れか１つに記載のロボットの制御装置。
(Appendix 6)
The robot control device according to any one of Appendices 1 to 5, wherein the robot has a drivable movable part, the first response is a response by a gesture action realized by driving the movable part of the robot, and the second response is a response in which the response sentence is spoken to the target.

（付記７）
対象に対して、前記第１応答と、前記第２応答とを実行可能に構成され、付記１から６の何れか１つに記載のロボットの制御装置を備えたロボット。
(Appendix 7)
A robot configured to be able to execute the first response and the second response to a target, comprising the robot control device according to any one of Appendices 1 to 6.

（付記８）
対象に対して応答可能なロボットの制御装置が実行するロボットの制御方法であって、
前記対象の発話を取得する取得ステップと、
前記対象が発話しているときに前記取得ステップにより取得された前記発話を部分的に解析する第１解析ステップと、
前記ロボットによる応答であって、前記対象に対する非言語的な挙動を用いた第１応答を、前記第１解析ステップによる解析結果に応じて制御する第１制御ステップと、
前記取得ステップにより取得された前記対象の発話を、前記第１解析ステップにより解析される発話の区間よりも長い区間で解析する第２解析ステップと、
前記第２解析ステップでの解析結果に応じて、前記対象の前記発話に対する応答文を生成する生成ステップと、
前記生成ステップで生成された前記応答文に基づいて、前記ロボットによる言語を用いた第２応答を制御する第２制御ステップと、
を備えることを特徴とするロボットの制御方法。
(Appendix 8)
A robot control method executed by a control device for a robot capable of responding to a target, the method comprising:
an acquisition step of acquiring an utterance of the target;
a first analysis step of partially analyzing the utterance acquired in the acquisition step while the target is speaking;
a first control step of controlling, in accordance with an analysis result of the first analysis step, a first response by the robot that uses non-verbal behavior toward the target;
a second analysis step of analyzing the utterance of the target acquired in the acquisition step over a section longer than the section of the utterance analyzed in the first analysis step;
a generation step of generating a response sentence to the utterance of the target in accordance with an analysis result of the second analysis step; and
a second control step of controlling, based on the response sentence generated in the generation step, a second response by the robot that uses language.

（付記９）
対象に対して応答可能なロボットを制御するコンピュータを、
前記対象の発話を取得する取得手段、
前記対象が発話しているときに前記取得手段により取得された前記発話を部分的に解析する第１解析手段、
前記ロボットによる応答であって、前記対象に対する非言語的な挙動を用いた第１応答を、前記第１解析手段による解析結果に応じて制御する第１制御手段、
前記取得手段により取得された前記対象の発話を、前記第１解析手段により解析される発話の区間よりも長い区間で解析する第２解析手段、
前記第２解析手段による解析結果に応じて、前記対象の前記発話に対する応答文を生成する生成手段、及び
前記生成手段により生成された前記応答文に基づいて、前記ロボットによる言語を用いた第２応答を制御する第２制御手段、
として機能させるためのプログラム。
(Appendix 9)
A program for causing a computer that controls a robot capable of responding to a target to function as:
an acquisition means for acquiring an utterance of the target;
a first analysis means for partially analyzing the utterance acquired by the acquisition means while the target is speaking;
a first control means for controlling, in accordance with an analysis result by the first analysis means, a first response by the robot that uses non-verbal behavior toward the target;
a second analysis means for analyzing the utterance of the target acquired by the acquisition means over a section longer than the section of the utterance analyzed by the first analysis means;
a generation means for generating a response sentence to the utterance of the target in accordance with an analysis result by the second analysis means; and
a second control means for controlling, based on the response sentence generated by the generation means, a second response by the robot that uses language.

１…ロボット、２…制御装置、３…マイクロフォン、４…スピーカ、５，６…関節、７…ジェスチャ作動部、２０…制御部、２１…音声取得部、２２…識別部、２３…部分解析部、２４…ジェスチャ応答制御部、２５…発話解析部、２６…応答文生成部、２７…言語応答制御部、２８…学習部、２９…特定部、３０…記憶部、１０１…頭部、１０２…胴体部、１０３…腕部 1...Robot, 2...Control device, 3...Microphone, 4...Speaker, 5, 6...Joints, 7...Gesture actuation unit, 20...Control unit, 21...Voice acquisition unit, 22...Identification unit, 23...Partial analysis unit, 24...Gesture response control unit, 25...Speech analysis unit, 26...Response sentence generation unit, 27...Language response control unit, 28...Learning unit, 29...Specifying unit, 30...Storage unit, 101...Head, 102...Torso, 103...Arms

Claims

1. A robot capable of responding by voice and by gesture, comprising:
a first identification means for starting a predetermined identification process in response to a start of an utterance by a speaker, in order to identify a category of the content that the speaker is trying to convey through the utterance; and
a control means for controlling the robot to execute a gesture that is preset in association with the identified category as soon as the category is identified by the identification process, regardless of whether the speaker has finished speaking,
wherein the control means, even when controlling the robot to execute the gesture response before the speaker finishes speaking, controls the robot to execute the voice response after waiting for the speaker to finish speaking, thereby keeping the robot silent until the speaker finishes speaking, and
the robot is configured to be controllable to perform a different gesture for each category.

2. The robot according to claim 1, wherein the identification process is a process of identifying the category by performing a forward-match search of a predetermined database with a phoneme string, which is a time series of the phonemes sequentially pronounced as the utterance, each time the phoneme string is extended by the pronunciation over a predetermined elapsed time.

3. The robot according to claim 2, further comprising a second identification means for identifying the speaker, wherein the identification process performs the forward-match search with the phoneme string on the database associated with the speaker identified by the second identification means.

4. A response method executed by a robot capable of responding by voice and by gesture, comprising:
a step of starting a predetermined identification process in response to a start of an utterance by a speaker, in order to identify a category of the content that the speaker is trying to convey through the utterance; and
a control step of controlling the robot to execute a gesture that is preset in association with the identified category as soon as the category is identified by the identification process, regardless of whether the speaker has finished speaking,
wherein the control step, even when controlling the robot to execute the gesture response before the speaker finishes speaking, controls the robot to execute the voice response after waiting for the speaker to finish speaking, thereby keeping the robot silent until the speaker finishes speaking, and
the robot is configured to be controllable to perform a different gesture for each category.

5. A program for causing a computer of a robot capable of responding by voice and by gesture to function as:
an identification means for starting a predetermined identification process in response to a start of an utterance by a speaker, in order to identify a category of the content that the speaker is trying to convey through the utterance; and
a control means for controlling the robot to execute a gesture that is preset in association with the identified category as soon as the category is identified by the identification process, regardless of whether the speaker has finished speaking,
wherein the control means, even when controlling the robot to execute the gesture response before the speaker finishes speaking, controls the robot to execute the voice response after waiting for the speaker to finish speaking, thereby keeping the robot silent until the speaker finishes speaking, and
the robot is configured to be controllable to perform a different gesture for each category.