JP7446178B2

JP7446178B2 - Behavior control device, behavior control method, and program

Info

Publication number: JP7446178B2
Application number: JP2020132962A
Authority: JP
Inventors: ランディゴメス
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2020-08-05
Filing date: 2020-08-05
Publication date: 2024-03-08
Anticipated expiration: 2040-08-05
Also published as: JP2022029599A

Description

The present invention relates to a behavior control device, a behavior control method, and a program.

Smart speakers and communication robots are under active development today. Such systems focus on functions carried out in response to instructions, such as turning lights on or off, accessing a calendar, reading email, and scheduling appointments. Input to these systems is limited to, for example, selections on a touch panel or predetermined voice commands, which makes it difficult for them to build a relationship with people.

For this reason, systems that can form a relationship with people are desired. For example, Patent Document 1 proposes a system that engages a person in dialogue and interaction with a companion device. In the technology described in Patent Document 1, the companion device detects the user's utterances and actions and expresses itself through movement, graphics, sound, light, and fragrance, thereby providing a companionable presence.

Because robots operate in environments where people live, a robot becomes more useful if it can learn quickly from ordinary people through natural interaction. Reinforcement learning from human evaluative feedback can make it easier for non-engineers to teach a robot a task. Learning from demonstration often leads to faster learning than learning from evaluative feedback.

Patent Document 1: Published Japanese Translation of PCT Application No. 2019-521449

However, in conventional learning from human evaluative feedback, the robot learns by trial and error, so the learning may be dangerous or costly. Moreover, conventional learning from demonstration is limited by the trainer's performance, whereas learning from human reward can generally exceed the trainer's performance. For these reasons, it has been difficult with conventional technology for a robot to learn through interaction with a human.

The present invention has been made in view of the above problems, and an object thereof is to provide a behavior control device, a behavior control method, and a program that can learn autonomously through interaction between the device and a human.

(1) To achieve the above object, a behavior control device according to one aspect of the present invention includes: a learning unit that generates a reward function by inverse reinforcement learning based on demonstrated results; and an agent that selects an action based on the reward function and on information fed back from a person and the environment.

(2) In the behavior control device according to one aspect of the present invention, the agent may correct its behavior using the reward function learned by the learning unit, and may learn a predictive reward model based on the information fed back from the person and the environment.

(3) In the behavior control device according to one aspect of the present invention, the agent may include a reward learning management unit, an allocation evaluation unit, and an action selection unit. The allocation evaluation unit calculates the probability of the previously selected action based on the feedback from the person and the feedback from the environment, and uses the state, the action, and the probability of the previously selected action as a supervised learning sample. The reward learning management unit acquires the reward function generated by the learning unit and the supervised learning sample output by the allocation evaluation unit, learns the predictive reward model, and updates the reward function using the learned predictive reward model. The action selection unit selects the action based on the information fed back from the person and the environment and on the reward learning management unit.

(4) In the behavior control device according to one aspect of the present invention, the agent may estimate, for the current orientation of the device itself, the state of the environment represented by the direction of a person's voice, the orientation of the person's face, the orientation of the person's body, and the orientation of the device, and may select the action whose reward function gives the largest predicted reward value, thereby selecting an action that turns the device's face toward the person of interest.

(5) In the behavior control device according to one aspect of the present invention, the reward learning management unit may use the calculated probability $\hat{h}$ and the state-action pair as a supervised learning sample and, by updating the parameters with the following equation based on the least-squares gradient, learn a function $\hat{R}_H(s, a)$ that approximates the expected value of the human reward received in the interaction experience.

Here, $h$ is the human reward label received by the agent at an arbitrary time step $t$, $\alpha$ is the learning rate, $s$ is the state representation, $a$ is the selected action, $\vec{\omega} = (\omega_0, \ldots, \omega_{m-1})^T$ is a column parameter vector, $\{(s_0, a_0), \ldots, (s_n, a_n)\}$ are the state-action pairs, $\phi(\vec{x}) = (\phi_0(\vec{x}), \ldots, \phi_{m-1}(\vec{x}))^T$ with $\phi_i(\vec{x})$ as basis functions, $m$ is the total number of parameters ($i = 0, \ldots, m-1$), and $\delta_t$ is the temporal-difference error given by the following equation.
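The equations referred to in (5) are not reproduced in this text. As a hedged sketch only, and not the published formula, a least-squares gradient step for a model that is linear in the basis functions is conventionally written as follows, using the symbols defined above. The linear form of $\hat{R}_H$ is an assumption made for illustration.

```latex
% Assumed linear reward model over the basis functions
\hat{R}_H(s, a) = \vec{\omega}^{T} \phi(\vec{x}) = \sum_{i=0}^{m-1} \omega_i \, \phi_i(\vec{x})

% Error between the human reward label and the current prediction
\delta_t = h - \hat{R}_H(s, a)

% Gradient step on the squared error (1/2) \delta_t^2 with learning rate \alpha
\vec{\omega}_{t+1} = \vec{\omega}_t + \alpha \, \delta_t \, \frac{\partial \hat{R}_H(s, a)}{\partial \vec{\omega}_t}
                   = \vec{\omega}_t + \alpha \, \delta_t \, \phi(\vec{x})
```

Under this assumption each update moves the prediction for the sampled state-action pair toward the label $h$ by a fraction that depends on $\alpha$ and on the magnitude of $\phi(\vec{x})$.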

(6) To achieve the above object, in a behavior control method according to one aspect of the present invention, a learning unit generates a reward function by inverse reinforcement learning based on demonstrated results, and an agent selects an action based on the reward function and on information fed back from a person and the environment.

(7) To achieve the above object, a program according to one aspect of the present invention causes a computer to generate a reward function by inverse reinforcement learning based on demonstrated results, and to select an action based on the generated reward function and on information fed back from a person and the environment.

According to (1) to (7), the device can learn autonomously through interaction with a human. Also according to (1) to (7), a robot can learn from both the demonstrations and the evaluative feedback provided by a human, which reduces the number of human evaluations, and in particular the number of mistakes (unexpected actions), needed to obtain optimal behavior.

FIG. 1 is a diagram schematically showing an autonomous learning method performed by a robot according to an embodiment.
FIG. 2 is a block diagram showing a configuration example of the robot according to the embodiment.
FIG. 3 is a diagram showing an example of the external shape of the robot according to the embodiment.
FIG. 4 is a diagram for explaining the definition of the face angles of the robot and a user.
FIG. 5 is a diagram for explaining the action set of the robot according to the embodiment.
FIG. 6 is a diagram for explaining a framework for interactive reinforcement learning.
FIG. 7 is a diagram schematically showing the IRL-TAMER framework according to the embodiment.
FIG. 8 is a diagram showing an example of the processing algorithm performed by the agent according to the embodiment.
FIG. 9 is a diagram showing a framework for interactive RL from human social feedback according to the embodiment.
FIG. 10 is a diagram showing a CNN model for real-time emotion classification.
FIG. 11 is a diagram showing an example of the visual display of real-time gesture recognition.
FIG. 12 is a schematic diagram of the feedback signals that the TAMER agent receives from the real-time emotion recognition module and real-time gesture recognition.
FIG. 13 is a diagram showing a screenshot of the LoopMaze task.
FIG. 14 is a diagram showing a screenshot of the Tetris task.
FIG. 15 is a diagram showing experimental results for the LoopMaze task: the total number of time steps and the number of times feedback was received in each episode for the keyboard agent and the emotion agent.
FIG. 16 is a diagram showing experimental results for the Tetris task: the total number of time steps, the number of times feedback was received, and the number of lines cleared for the keyboard agent and the emotion agent.
FIG. 17 is a diagram showing experimental results for the LoopMaze task: the number of time steps and the number of times feedback was received in each episode for the keyboard agent and the gesture agent.
FIG. 18 is a diagram showing experimental results for the Tetris task: the number of time steps, the number of times feedback was received, and the number of lines cleared for the keyboard agent and the gesture agent.

Embodiments of the present invention will be described below with reference to the drawings. In the drawings used in the following description, the scale of each member is changed as appropriate so that each member is large enough to be recognized.

<Overview>
FIG. 1 is a diagram schematically showing an autonomous learning method performed by the robot 1 according to the present embodiment. The following embodiment describes an example in which the robot 1 autonomously learns the orientation of a person's face and acts based on what it has learned. As described later, the robot 1 includes, for example, an imaging unit and a sound collection unit, which is a microphone array, on its front surface. The robot 1 treats the information acquired by the imaging unit and the sound collection unit as the state, and detects the person's face, voice, and body orientation. The robot 1 acts based on that state using a learning model. The user observes the orientation of the robot 1's face and teaches the desired behavior by providing rewards (evaluative feedback) and demonstrations (face orientation, utterances, facial expressions, and the like).

In the example of FIG. 1, the robot 1 says to the user, "Where should I look?" In response to the utterance, the human Hu gives a demonstration, for example by turning their face toward the direction to look in. The robot 1 observes the behavior of the human Hu. The learning model detects the face, voice, and body orientation of the human Hu and learns the detected result as the behavior in response to the robot 1's utterance. The learning model also takes the detected behavior as input and outputs a motion. In accordance with the motion instruction output by the learning model, the robot 1 changes, for example, the orientation of its face. The learning model then learns from this motion as well. Although the following example applies the behavior control device to the robot 1, the application target is not limited to the robot 1.

<Configuration example of robot 1>
Next, a configuration example of the robot 1 will be described.
FIG. 2 is a block diagram showing a configuration example of the robot 1 according to the present embodiment. As shown in FIG. 2, the robot 1 includes an operation unit 101, an imaging unit 102, a sensor 103, a sound collection unit 104, a behavior control device 100, a storage unit 106, a database 107, a display unit 111, a speaker 112, an actuator 113, and a robot sensor 115.

The behavior control device 100 includes a recognition unit 105 (recognition means) and an agent 300.
The agent 300 includes a learning unit 301 (inverse reinforcement learning unit), a reward learning management unit 302, an allocation evaluation unit 303, and an action selection unit 304.
The action selection unit 304 includes an image generation unit 3041, an audio generation unit 3042, a drive unit 3043, and an output unit 3044.

<Functions and operation of robot 1>
Next, the functions and operation of each functional unit of the robot 1 will be described with reference to FIG. 2.

The operation unit 101 is, for example, a keyboard. The operation unit 101 detects the result of an operation performed by the user and outputs the detected operation result to the recognition unit 105.

The imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. The imaging unit 102 outputs captured images (still images, sequences of still images, or video) to the recognition unit 105. The robot 1 may include a plurality of imaging units 102. In that case, the imaging units 102 may be attached, for example, to the front and the rear of the housing of the robot 1.

The sensor 103 is, for example, a motion sensor that detects movements of the user such as gestures, as described later. The sensor 103 outputs the detected values to the recognition unit 105.

The sound collection unit 104 is, for example, a microphone array composed of a plurality of microphones. The sound collection unit 104 outputs the acoustic signals picked up by the microphones to the recognition unit 105. The sound collection unit 104 may sample each of the acoustic signals picked up by the microphones using the same sampling signal, convert them from analog to digital signals, and then output them to the recognition unit 105.

The robot sensor 115 is, for example, a gyro sensor that detects the tilt of the head or housing of the robot 1, or an acceleration sensor that detects the movement of the head or housing of the robot 1. The robot sensor 115 outputs the detected values to the recognition unit 105.

The storage unit 106 stores, for example, the items that the recognition unit 105 should recognize, various values used during recognition (thresholds and constants), and the algorithms used for recognition.

The database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. The individual data and features are described later. The database 107 may be located in the cloud and may be connected via a network.

The recognition unit 105 extracts an image of the user's face from the captured image and recognizes the user's facial expression using a well-known method, thereby recognizing the user's emotion. Based on the values detected by the sensor 103, the recognition unit 105 tracks the user's movements using a well-known method, thereby recognizing the user's gestures. The recognition unit 105 recognizes the content of an operation based on the operation result. The recognition unit 105 recognizes the direction of the user's voice by performing speech recognition processing on the user's voice signal contained in the collected acoustic signals. Based on the values detected by the robot sensor 115, the recognition unit 105 detects the orientation, state, and the like of each part of the robot 1. Based on the acquired information, the recognition unit 105 detects the direction of the user's voice, the orientation of the user's face, the orientation of the user's body, and the orientation of the robot 1's face, as described later. The recognition unit 105 outputs information on the results of detection and recognition to the agent 300.

The agent 300 acquires information on the results of detection and recognition from the recognition unit 105. The agent 300 generates an agent using the generated reward function and the acquired information (information fed back from the person and the environment), and uses the generated agent to generate actions (speech, gestures, image output, and other output). The agent 300 also corrects its behavior using the reward function learned by the learning unit 301, and learns a predictive reward model based on the information fed back from the person and the environment. For the current orientation of the device itself, the agent 300 estimates the state of the environment represented by the direction of the person's voice, the orientation of the person's face, the orientation of the person's body, and the orientation of the device, and selects the action whose reward function gives the largest predicted reward value, thereby selecting an action that turns the device's face (head) toward the person of interest. In the following description, the action selected by the agent is one that turns the robot's face in the direction the user expects, but the selected action is not limited to this.
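The greedy selection described here, choosing the action with the largest predicted reward for the estimated state, can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: `select_action` and `toy_predicted_reward` are hypothetical names, and the toy reward simply penalizes the angle that would remain between the robot's face and the person after turning.

```python
from typing import Callable, Sequence, Tuple

# State as four angles in radians: (voice, face, body, robot-face).
State = Tuple[float, float, float, float]


def toy_predicted_reward(state: State, action: float) -> float:
    """Hypothetical stand-in for a learned reward model.

    Assumes state[3] is the angle still separating the robot's face from
    the person, and that turning by `action` reduces it by that amount.
    """
    return -abs(state[3] - action)


def select_action(
    state: State,
    actions: Sequence[float],
    predicted_reward: Callable[[State, float], float],
) -> float:
    """Return the action with the highest predicted reward in this state."""
    return max(actions, key=lambda a: predicted_reward(state, a))


if __name__ == "__main__":
    step = 0.1  # assumed step size in radians
    print(select_action((0.0, 0.0, 0.0, 0.3), (-step, 0.0, step), toy_predicted_reward))
```

A purely greedy rule never explores; whether and how the embodiment explores is not stated in this passage.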

The learning unit 301 learns using the information on the results of detection and recognition output by the recognition unit 105, and generates an agent. The learning unit 301 also generates a reward function by inverse reinforcement learning based on demonstrated results. The learning method is described later.

The reward learning management unit 302 acquires the reward function generated by the learning unit 301 and the supervised learning sample output by the allocation evaluation unit 303, learns the predictive reward model, and updates the reward function using the learned predictive reward model.
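One way to picture learning a predictive reward model from supervised samples is an incremental least-squares update of a linear model. The sketch below is an assumption made for illustration: the class `LinearRewardModel`, the feature map `features`, and the learning rate are hypothetical and are not taken from the embodiment.

```python
from typing import List, Sequence


def features(state: Sequence[float], action: float) -> List[float]:
    """Hypothetical basis functions: the raw state, the action, and a bias term."""
    return [*state, action, 1.0]


class LinearRewardModel:
    """Predicts human reward as a weighted sum of features (assumed form)."""

    def __init__(self, n_features: int, learning_rate: float = 0.05) -> None:
        self.w = [0.0] * n_features
        self.alpha = learning_rate

    def predict(self, state: Sequence[float], action: float) -> float:
        phi = features(state, action)
        return sum(w_i * p_i for w_i, p_i in zip(self.w, phi))

    def update(self, state: Sequence[float], action: float, h: float) -> float:
        """Take one gradient step on the squared error; return the error before the step."""
        phi = features(state, action)
        delta = h - self.predict(state, action)
        self.w = [w_i + self.alpha * delta * p_i for w_i, p_i in zip(self.w, phi)]
        return delta


if __name__ == "__main__":
    model = LinearRewardModel(n_features=6)
    sample_state, sample_action = (0.2, 0.1, 0.0, 0.3), 0.1
    for _ in range(200):
        model.update(sample_state, sample_action, h=1.0)
    print(round(model.predict(sample_state, sample_action), 3))
```

Repeating the update on a single sample drives the prediction for that sample toward its label; with many samples the model instead settles on a least-squares compromise.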

The allocation evaluation unit 303 calculates the probability of the previously selected action based on the feedback from the person and the feedback from the environment, and uses the state, the action, and the probability of the previously selected action as a supervised learning sample.

The action selection unit 304 selects an action based on the information fed back from the person and the environment and on the reward learning management unit 302. The selected action is at least one of image output, audio output, driving of the head or housing, and the like. The embodiment is described using, as an example, an action that changes the orientation of the face (head) of the robot 1.

The image generation unit 3041 generates an output image (a still image, a sequence of still images, or video) to be displayed on the display unit 111 based on the learned result and the acquired information, and causes the display unit 111 to display the generated output image.

The audio generation unit 3042 generates an output audio signal to be output from the speaker 112 based on the learned result and the acquired information, and causes the speaker 112 to output the generated output audio signal.

The drive unit 3043 generates a drive signal for driving the actuator 113 based on the learned result and the acquired information, and drives the actuator 113 with the generated drive signal.

The output unit 3044 generates, for example, an instruction based on the learned result and the acquired information, and outputs the generated instruction to an external device 2. The external device 2 is, for example, a personal computer, a game device, or a tablet terminal.

The display unit 111 is, for example, a liquid crystal display device or an organic EL (electroluminescence) display device. The display unit 111 displays the output image output by the image generation unit 3041.

The speaker 112 outputs the output audio signal output by the audio generation unit 3042.

The actuator 113 drives the moving parts in accordance with the drive signal output by the drive unit 3043.

<Example of external shape of robot 1>
Next, an example of the external shape of the robot 1 will be described.
FIG. 3 is a diagram showing an example of the external shape of the robot 1 according to the present embodiment. In the example of the front view g101 and the side view g102 in FIG. 3, the robot 1 includes three display units 111 (111a, 111b, and 111c). In the example of FIG. 3, the imaging unit 102a is attached above the display unit 111a, and the imaging unit 102b is attached above the display unit 111b. The display units 111a and 111b correspond to human eyes and also present image information. The speaker 112 is attached to the housing 120 near the display unit 111c, which displays an image corresponding to a human mouth. The sound collection unit 104 is attached to the housing 120.

The robot 1 also includes a boom 121. The boom 121 is movably attached to the housing 120 via a movable part 131. A horizontal bar 122 is rotatably attached to the boom 121 via a movable part 132.
The display unit 111a is rotatably attached to the horizontal bar 122 via a movable part 133, and the display unit 111b is rotatably attached to the horizontal bar 122 via a movable part 134.
The external shape of the robot 1 shown in FIG. 3 is only an example, and the shape is not limited to this. For example, the robot 1 may be a bipedal walking robot.

<Face angles of the robot and the person>
Here, the face angles of the robot and the user will be described.
FIG. 4 is a diagram for explaining the definition of the face angles of the robot 1 and the user.
The state representation used in the embodiment consists of four features: the direction of the user Hu's voice, the orientation of the user Hu's face, the orientation of the user Hu's body, and the orientation of the robot 1's face.

The diagram in the area indicated by reference sign g151 explains the direction of the user Hu's voice. The first feature is the voice direction α_a of the user Hu (angular range [-π, π]). The voice direction α_a is the angle of the user Hu's voice direction a with respect to the front direction α of the robot 1. The voice direction is detected by, for example, speech recognition processing.

The diagram in the area indicated by reference sign g161 explains the orientation of the user Hu's face. The second feature is the face orientation α_f of the user Hu (angular range [-π, π]). The face orientation α_f is the angle of the user Hu's face orientation f with respect to the front direction α of the robot 1. The orientation of the user Hu's face is detected by, for example, image processing. This feature is used to reinforce the voice direction feature of the user Hu.

The diagram in the area indicated by reference sign g171 explains the orientation of the user Hu's body. The third feature is the body orientation α_b of the user Hu (angular range [-π, π]). The body orientation α_b is the angle of the user Hu's body orientation b with respect to the front direction α of the robot 1. The orientation of the user Hu's body is detected by, for example, image processing. This feature is used to complement the voice direction and face orientation features of the user Hu.

The diagram in the area indicated by reference sign g181 explains the orientation of the robot 1's face. The fourth feature is the face orientation θ_c of the robot 1 (angular range [-π, π]). The face orientation θ_c is the angle from the direction the robot 1's face is pointing to the direction of the person's position. The behavior control device 100 acquires the face orientation θ_c of the robot 1 based on the motion instructions.
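The four-feature state described above can be held in a small value type that also enforces the stated angular range. This is an illustrative sketch only; the class name `State` and its field names are hypothetical and do not appear in the source.

```python
import math
from dataclasses import astuple, dataclass, fields
from typing import Tuple


@dataclass(frozen=True)
class State:
    """The four state features, each an angle in radians within [-pi, pi]."""

    voice_direction: float         # alpha_a: user's voice direction
    face_orientation: float        # alpha_f: user's face orientation
    body_orientation: float        # alpha_b: user's body orientation
    robot_face_orientation: float  # theta_c: robot's face orientation

    def __post_init__(self) -> None:
        # Reject values outside the angular range given in the description.
        for f in fields(self):
            value = getattr(self, f.name)
            if not -math.pi <= value <= math.pi:
                raise ValueError(f"{f.name}={value} is outside [-pi, pi]")

    def as_vector(self) -> Tuple[float, float, float, float]:
        """Return the features in a fixed order for use by a model."""
        return astuple(self)


if __name__ == "__main__":
    print(State(0.1, 0.2, 0.3, -0.4).as_vector())
```

Keeping the features in a fixed order matters because any learned weights are tied to feature positions.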

ロボット１の行動は、ロボット１が実行しうる角度コマンドのリストとすることができる。実施形態では、図５のように、ロボット１の顔の向きの行動が傾くためのアクションセット［－φ_ａ,０,φ_ａ］を使用する。なお、φ_ａは、次数の小さな正の角度である。図５は、本実施形態に係るロボット１のアクションセットを説明するための図である。 The actions of the robot 1 can be a list of angle commands that the robot 1 can execute. In the embodiment, as shown in FIG. 5, the action set [-φ_a, 0, φ_a] is used for tilting the face direction of the robot 1. Note that φ_a is a small positive angle. FIG. 5 is a diagram for explaining the action set of the robot 1 according to this embodiment.

ロボット１がφ_ａコマンドを選択した場合は、ロボット１が現在の顔の向きから角度φ_ａ分を左側に移動させることを意味する。ロボット１が－φ_ａコマンドを選択した場合は、ロボット１が現在の顔の向きから角度φ_ａ分を右側に移動させることを意味する。ロボット１が０コマンドを選択した場合は、ロボット１が現在の顔の向きにとどまることを意味する。 When the robot 1 selects the φ_a command, the robot 1 moves its face to the left by the angle φ_a from the current face direction. When the robot 1 selects the -φ_a command, the robot 1 moves its face to the right by the angle φ_a from the current face direction. When the robot 1 selects the 0 command, the robot 1 stays in the current face direction.
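The three angle commands can be illustrated with a minimal Python sketch. The step size `PHI_A`, the sign convention (positive means left), and the helper names are assumptions made for illustration; the text only states that φ_a is a small positive angle and that angles lie in [-π, π].

```python
import math

PHI_A = math.radians(5.0)          # assumed step size; the text only says "small positive angle"
ACTIONS = (-PHI_A, 0.0, PHI_A)     # turn right, stay, turn left


def wrap_angle(angle):
    """Wrap an angle in radians into the range [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def apply_action(face_angle, action):
    """Return the new face direction after executing one angle command."""
    if action not in ACTIONS:
        raise ValueError("unknown angle command")
    return wrap_angle(face_angle + action)
```

Wrapping keeps the face direction inside the stated angular range even when a command crosses ±π.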

このロボット１の顔の動きに対する利用者の評価フィードバックについて説明する。本実施形態では、ロボット１が利用者Ｈｕの評価フィードバックから学習している間に、利用者Ｈｕはロボット１の行動の評価を音声で返信として伝え、それを数値の報酬値にマッピングする。 The user's evaluation feedback on the face movement of the robot 1 will now be explained. In this embodiment, while the robot 1 is learning from the user Hu's evaluation feedback, the user Hu gives an evaluation of the robot 1's action as a spoken reply, and the reply is mapped to a numerical reward value.

ここで、実施形態で用いるフィードバックのセットを定義する。実施形態では、“かなり良い”、“良い”、“悪い”、“かなり悪い”というフィードバックを定義し、それぞれ＋２、＋１、－１、－２にマッピングする。例えば、利用者Ｈｕがロボット１の選択した行動が正しいと思った場合、利用者Ｈｕは「良い」と答え、これは＋１にマッピングされる。また、ロボット１が選択した動作がより高品質であると利用者Ｈｕが考えている場合、利用者Ｈｕは「かなり良い」と答え、これは＋２にマッピングされる。 Here, the set of feedback used in the embodiment is defined. In the embodiment, the feedback labels "quite good", "good", "bad", and "quite bad" are defined and mapped to +2, +1, -1, and -2, respectively. For example, if the user Hu thinks that the action selected by the robot 1 is correct, the user Hu answers "good", which is mapped to +1. If the user Hu thinks that the action selected by the robot 1 is of even higher quality, the user Hu answers "quite good", which is mapped to +2.
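The mapping from spoken feedback to reward values can be sketched as a simple lookup. The English label strings and the function name are illustrative assumptions; only the four labels and their values +2, +1, -1, -2 come from the text.

```python
# Reward values as defined in the embodiment; label strings are English renderings.
FEEDBACK_REWARD = {
    "quite good": 2,
    "good": 1,
    "bad": -1,
    "quite bad": -2,
}


def feedback_to_reward(utterance):
    """Return the numeric reward for a recognized reply, or None if it is not a known label."""
    return FEEDBACK_REWARD.get(utterance.strip().lower())
```

Returning `None` for unknown replies lets the caller treat them as "no feedback given" rather than as a zero reward.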

＜対話型強化学習＞
まず、対話型強化学習の概略を説明する。
標準的な強化学習では、エージェントが環境と相互作用して、逐次的な意思決定タスクの実行方法を学習する。この逐次決定タスクは、マルコフ決定プロセスとしてモデル化され、｛Ｓ，Ａ，Ｔ，Ｒ，γ｝と呼ばれる。ＳとＡは、それぞれ可能な状態と行動の集合である。Ｔは遷移関数Ｔ：Ｓ×Ａ×Ｓ→Ｒ（実数全体の集合）であり、状態ｓ_ｔと行動のもとで状態ｓ_ｔ＋１に遷移する確率を与える。γは、将来受け取る報酬の現在価値を決定するもので、割引率と呼ばれる。Ｒは、報酬関数であり、Ｒ：Ｓ×Ａ×Ｓ→Ｒ（実数全体の集合）である。報酬は、ｓ_ｔ，ａ_ｔとｓ_ｔ＋１の関数、またはｓ_ｔ，ａ_ｔのみの関数である。エージェントの学習には、通常２つの関連する値関数がある。 <Interactive reinforcement learning>
First, an overview of interactive reinforcement learning will be explained.
In standard reinforcement learning, an agent interacts with its environment to learn how to perform a sequential decision-making task. This task is modeled as a Markov decision process (MDP), denoted {S, A, T, R, γ}. S and A are the sets of possible states and actions, respectively. T is the transition function T: S×A×S→ℝ (the set of all real numbers), which gives the probability of transitioning to state s_{t+1} given state s_t and action a_t. γ is the discount rate, which determines the present value of rewards received in the future. R is the reward function R: S×A×S→ℝ. The reward is a function of s_t, a_t, and s_{t+1}, or of s_t and a_t only. An agent's learning typically involves two related value functions.

第１の値関数は、状態値関数Ｖ^π（ｓ）であり、次式（１）のように政策πの下でのエージェントの初期状態ｓにのみ関連している。 The first value function is the state value function V^π(s), which depends only on the agent's initial state s under the policy π, as in equation (1) below.

第２の値関数は、状態・行動ペアの値Ｑ^π（ｓ，ａ）と呼ばれる行動・値関数であり、次式（２）の状態ｓで行動ａをとった後の期待リターンである。 The second value function is the action-value function Q^π(s, a), called the value of the state-action pair, which is the expected return after taking action a in state s, as in equation (2) below.
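The images for equations (1) and (2) are not reproduced in this text. Under the MDP notation {S, A, T, R, γ} used here, the standard textbook definitions of the two value functions are as follows. This is a reconstruction under that assumption, and the reward symbol r_{t+k+1} is introduced only for illustration.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right] \tag{1}

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right] \tag{2}
```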

対話型強化学習は、標準的な強化学習の変形である。対話型強化学習では、報酬信号が世界の状態とエージェントの行動に基づくだけでなく、人間のトレーナー（以下、単にトレーナーという）とのリアルタイムの相互作用にも依存する。この場合、トレーナーは、明確な目標状態を提供することによって報酬信号の値を変化させるか、または連続的なプロセスで連続的に対話することができる。対話型強化学習では、図６のように、エージェントがある状態で行動を起こすたびに、トレーナーは、トレーナーの経験に基づいて、選択された行動の質をエージェントに伝える評価フィードバックを提供する。図６は、対話型強化学習のフレームワークを説明するための図である。 Interactive reinforcement learning is a variation of standard reinforcement learning. In interactive reinforcement learning, the reward signal is not only based on the state of the world and the agent's actions, but also on real-time interaction with a human trainer. In this case, the trainer can change the value of the reward signal by providing a clear goal state, or interact continuously in a continuous process. In interactive reinforcement learning, as shown in Figure 6, each time an agent performs an action in a state, the trainer provides evaluation feedback that tells the agent the quality of the selected action based on the trainer's experience. FIG. 6 is a diagram for explaining the framework of interactive reinforcement learning.

＜学習フレームワーク＞
これに対して、本実施形態では、ＴＡＭＥＲ（ＴｒａｉｎｉｎｇａｎＡｇｅｎｔＭａｎｕａｌｌｙｖｉａＥｖａｌｕａｔｉｖｅＲｅｉｎｆｏｒｃｅｍｅｎｔ）フレームワークを元にしたエージェントを使用する。ＴＡＭＥＲフレームワークは、人間の報酬を直接モデル化することで近視眼的（ｍｙｏｐｉｃａｌｌｙ）に学習するアプローチである。ここで、「近視眼的」とは、エージェントが即時報酬のみを考慮に入れること、すなわち、割引係数γを０に設定することを意味する。 <Learning framework>
In contrast, this embodiment uses an agent based on the TAMER (Training an Agent Manually via Evaluative Reinforcement) framework. The TAMER framework is an approach that learns myopically by directly modeling the human reward. Here, "myopic" means that the agent takes only the immediate reward into account, i.e., the discount factor γ is set to 0.

図７のように、学習メカニズムは、人間、環境、ＴＡＭＥＲエージェントの相互作用である。そして、ＴＡＭＥＲフレームワークでは、人間の教師がエージェントの行動を観察し、その質の評価に基づいて報酬を与える。
まず、図７を用いて、学習フレームワークの概略を説明する。図７は、本実施形態に係るＩＲＬ－ＴＡＭＥＲフレームワークの概略を示す図である。本実施形態で用いるＩＲＬ－ＴＡＭＥＲは、人間の実演から逆強化学習（ＩＲＬ；ＩｎｖｅｒｓｅＲｅｉｎｆｏｒｃｅｍｅｎｔＬｅａｒｎｉｎｇ）によって学習し、人間の報酬からＴＡＭＥＲで学習する。ＩＲＬ－ＴＡＭＥＲは、図７のように２つのアルゴリズムから構成されており、以下の順に実行される。 As shown in FIG. 7, the learning mechanism is an interaction among the human, the environment, and the TAMER agent. In the TAMER framework, a human teacher observes the agent's behavior and gives a reward based on an evaluation of its quality.
First, an outline of the learning framework will be explained using FIG. 7. FIG. 7 is a diagram schematically showing the IRL-TAMER framework according to this embodiment. The IRL-TAMER used in this embodiment learns from human demonstrations by inverse reinforcement learning (IRL) and from human rewards by TAMER. IRL-TAMER consists of two algorithms, as shown in FIG. 7, which are executed in the following order.

・手順１：ＩＲＬは、訓練者によって提供されたデモンストレーションから報酬関数を学習する（左側のブロック２０１）。手順１では、デモンストレーションからの逆強化学習を行う。なお、このブロックの処理は、学習部３０１が行う。 - Step 1: IRL learns a reward function from the demonstration provided by the trainer (block 201 on the left). In step 1, inverse reinforcement learning from the demonstration is performed. Note that the processing of this block is performed by the learning unit 301.

・手順２：ＴＡＭＥＲは人間の評価フィードバックから予測報酬モデルを学習する（右側のブロック３１１）。手順２では、ＴＡＭＥＲエージェントが評価フィードバックから学習する。なお、このブロック３１１の処理は、エージェント３００が行う。 - Step 2: TAMER learns a predictive reward model from human rating feedback (block 311 on the right). In step 2, the TAMER agent learns from the evaluation feedback. Note that the process of this block 311 is performed by the agent 300.

＜手順１；ＩＲＬアルゴリズム＞
次に、手順１のアルゴリズムについて説明する。
人間の教師は、状態と行動のペアのシーケンスのデモンストレーションを行う。状態と行動のペアは、｛（ｓ_０，ａ_０），…，（ｓ_ｎ，ａ_ｎ）｝を含む。ここで、（ｓ_０，ａ_０）は、開始時の状態と行動を示す。（ｓ_ｎ，ａ_ｎ）は、終了時の状態と行動を示す。なお、ロボット１に対して動作を行うのは、トレーナーとは異なる他の人である。そして、ロボット１は、実演を記録する。ここで，状態ｓは、上述した４つの特徴量（人の音声方向α_ａ、顔の向きα_ｆ、体の向きα_ｂ、ロボットの顔の向きθ_ｃ）を特徴変数で表す。また、アクションは、アクションセット［－φ_ａ，０，φ_ａ］の中の１つである。 <Step 1; IRL algorithm>
Next, the algorithm of step 1 will be explained.
A human teacher demonstrates a sequence of state-action pairs {(s_0, a_0), ..., (s_n, a_n)}, where (s_0, a_0) is the state and action at the start and (s_n, a_n) is the state and action at the end. Note that the person who performs the actions toward the robot 1 is a person other than the trainer. The robot 1 records the demonstration. Here, the state s is represented by the four feature amounts described above (the person's voice direction α_a, face direction α_f, body direction α_b, and the robot's face direction θ_c) as feature variables. The action is one of the actions in the action set [-φ_a, 0, φ_a].

ＩＲＬアルゴリズム２１１によって、記録されたデモンストレーションは、逆強化学習モジュールに報酬関数Ｒ＝ω・φ（ｓ）（２１２）として与えられる。ここでωは、パラメータ重みベクトルであり、φ（ｓ）はＩＲＬの状態上の基底特徴のベクトルである。 Through the IRL algorithm 211, the recorded demonstration is given to the inverse reinforcement learning module in the form of a reward function R = ω·φ(s) (212). Here, ω is the parameter weight vector, and φ(s) is the vector of basis features over the IRL states.

デモンストレーションからＩＲＬアルゴリズム２１１を介して学習した報酬関数Ｒは、ＴＡＭＥＲの報酬関数Ｒ_Ｈの重みベクトルｗの初期化に使用される。これにより、人間の評価フィードバックｈを用いて、訓練者がロボット１の動作を微調整することができる。 The reward function R learned from the demonstration via the IRL algorithm 211 is used to initialize the weight vector w of TAMER's reward function R_H. The trainer can then fine-tune the behavior of the robot 1 using the human evaluation feedback h.
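The linear reward R = ω·φ(s) and the seeding of TAMER's weight vector can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that the IRL weights and the TAMER weights share the same feature layout, which the text does not specify, and the function names are hypothetical.

```python
def linear_reward(weights, features):
    """R = w . phi(s): dot product of the weight vector and the basis features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def seed_tamer_weights(irl_weights):
    """Start TAMER's weight vector w from a copy of the weights learned by IRL."""
    return list(irl_weights)
```

Copying the weights, rather than sharing the list, keeps later feedback updates from altering the stored IRL result.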

＜手順２；ＴＡＭＥＲエージェントの学習アルゴリズム＞
次に、手順２のＴＡＭＥＲエージェントの学習アルゴリズムについて説明する。図７のようにエージェント３００は、報酬学習管理部３０２、割当評価部３０３、および行動選択部３０４を含む。 <Step 2; TAMER agent learning algorithm>
Next, the learning algorithm of the TAMER agent in step 2 will be explained. As shown in FIG. 7, the agent 300 includes a reward learning management section 302, an allocation evaluation section 303, and a behavior selection section 304.

まず、ＴＡＭＥＲを用いたエージェント３００（以下、ＴＡＭＥＲエージェント３００ともいう）の処理の概略を説明する。
ＴＡＭＥＲエージェント３００は、ロボット１の現在の顔の向きθ_ｃにおいて、人（人間Ｈｕ）の音声方向α_ａ、人の顔の向きα_ｆ、人の体の向きα_ｂ、ロボットの顔の向きθ_ｃで表される環境の状態ｓを推定する。そして、ＴＡＭＥＲエージェント３００は、行動選択部３０４を用いて報酬関数Ｒ_Ｈ（ｓ，ａ）を持つ行動ａ（角度指令）を選択する。
なお、行動選択部３０４は、人間の報酬予測値が最も大きいアクションを選択することで、ロボット１の即時行動による人間Ｈｕの報酬を最大化する。 First, an outline of the processing of the agent 300 (hereinafter also referred to as TAMER agent 300) using TAMER will be explained.
The TAMER agent 300 estimates, at the current face orientation θ_c of the robot 1, the state s of the environment, which is represented by the voice direction α_a of the person (human Hu), the person's face orientation α_f, the person's body orientation α_b, and the robot's face orientation θ_c. The TAMER agent 300 then uses the action selection unit 304 to select an action a (angle command) according to the reward function R_H(s, a).
Note that the action selection unit 304 maximizes the human Hu's reward for the immediate action of the robot 1 by selecting the action with the largest predicted human reward.

トレーナー（人間Ｈｕ）は、ロボット１の状態ｓと選択された動作φ_ａを観察し（環境３１２）、その品質を評価してフィードバックする。割当評価部３０３は、このようにフィードバックされた評価ｈを取得する。 The trainer (human Hu) observes the state s of the robot 1 and the selected motion φ _a (environment 312), evaluates its quality, and provides feedback. The allocation evaluation unit 303 acquires the evaluation h fed back in this way.

割当評価部３０３は、トレーナーから与えられた評価フィードバックｈを受け取り、前回選択した行動の確率（クレジット）ｈを計算する。割当評価部３０３は、ロボット１の行動を評価して報酬を与えることに起因する人間の報酬の時間的な遅れに対処するために使用される。割当評価部３０３は、人の報酬の予測モデルＲ＾_Ｈを学習され、｛Ｓ，Ａ，Ｔ，Ｒ＾_Ｈ，γ｝として指定されたＭＤＰ（マルコフ決定過程）内でエージェントに報酬を提供する。 The allocation evaluation unit 303 receives the evaluation feedback h given by the trainer and calculates the probability (credit) h for the previously selected action. The allocation evaluation unit 303 is used to deal with the time delay of the human reward, which arises because the human must evaluate the action of the robot 1 before giving a reward. The allocation evaluation unit 303 learns a predictive model R^_H of the human reward and provides rewards to the agent within an MDP (Markov decision process) specified as {S, A, T, R^_H, γ}.

具体的には、割当評価部３０３は、インタラクション体験で受け取る人の報酬の期待値を近似した関数Ｒ＾_Ｈ（ｓ，ａ）を、次式（３）を用いて学習する。なお、Ｓは環境中の状態の集合であり、Ａはエージェント３００が実行できる行動の集合である。 Specifically, the allocation evaluation unit 303 uses equation (3) below to learn a function R^_H(s, a) that approximates the expected value of the human reward received during the interaction experience. Here, S is the set of states in the environment, and A is the set of actions that the agent 300 can execute.

式（３）において、ここで、ω^→＝（ω_０，…，ω_ｍ－１）^Ｔは列パラメータベクトルであり、φ_ｉ（ｘ^→）を基底関数とするφ（ｘ^→）＝（φ_０（ｘ^→），…，φ_ｍ－１（ｘ^→））^Ｔであり、ｍはｉ＝０，…，ｍ－１でありパラメータの総数である。 In equation (3), ω^→ = (ω_0, ..., ω_{m-1})^T is the column parameter vector, φ(x^→) = (φ_0(x^→), ..., φ_{m-1}(x^→))^T is the vector whose components φ_i(x^→), i = 0, ..., m-1, are the basis functions, and m is the total number of parameters.
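The image for equation (3) is not reproduced in this text. Given the symbols defined here (a column parameter vector and a vector of m basis functions), a linear model of the following form is consistent with the description. It is offered as a reconstruction under that assumption, with x^→ taken to be the feature vector of the state-action pair (s, a).

```latex
\hat{R}_H(s, a) = \vec{\omega}^{\,T} \phi(\vec{x}) = \sum_{i=0}^{m-1} \omega_i \, \phi_i(\vec{x}) \tag{3}
```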

ＴＡＭＥＲエージェント３００は、人間ユーザの報酬関数を学習し、ａｒｇｍａｘａＲ＾_Ｈ（ｓ，ａ）により人間の報酬を最大化しようとする。ＴＡＭＥＲエージェントでは、最適な政策は人間ユーザによって定義される。
人間の報酬から学ぶ他の手法と比較して、ＴＡＭＥＲには、以下のような３つの工夫がある。 The TAMER agent 300 learns the human user's reward function and tries to maximize the human reward by argmax_a R^_H(s, a). For a TAMER agent, the optimal policy is defined by the human user.
Compared to other methods of learning from human rewards, TAMER has the following three features.

Ｉ．ＴＡＭＥＲでは、人間の評価の遅れに対応するために、単位の割り当てを行っている。
ＩＩ．ＴＡＭＥＲエージェントは、人間の報酬モデル（Ｒ＾Ｈ）を学習する。
ＩＩＩ．各時間ステップにおいて、ＴＡＭＥＲエージェントは、将来の状態への影響を考慮せずに、最大の報酬を直接引き出すと予測される行動（ａｒｇｍａｘａＲ＾_Ｈ（ｓ，ａ））を選択する。 I. TAMER performs credit assignment to cope with the delay in human evaluation.
II. The TAMER agent learns a model of the human reward (R^_H).
III. At each time step, the TAMER agent selects the action predicted to directly elicit the maximum reward (argmax_a R^_H(s, a)), without considering its effect on future states.

具体的には、人間がエージェントの行動を観察するとき、人間の脳は対応するフィードバック信号を出すために一定の反応時間を必要とする。しかし、この間、エージェントは、すでに新しい探索を開始している可能性があり、人間のフィードバックに多少の遅れを生じさせる。この問題を解決するために、エージェントは、サンプル毎の状態行動ペアのラベルに貢献する複数の最近の状態行動ペアに、それぞれの人間の報酬信号を分配する必要がある。ＴＡＭＥＲでは、確立された回帰アルゴリズムＲ_Ｈ：Ｓ×Ａ→Ｒを用いて、仮想的な人間の報酬関数をシミュレートしている。なお、ＴＡＭＥＲフレームワークには、人間の報酬関数を近似するための特定のモデルや教師付き学習アルゴリズムは含まれておらず、すべての決定が例えば設計者によって行われる。また、ＴＡＭＥＲフレームワークでは、状態－行動サンプルのラベルは、すべて人間の報酬で構成されている。さらに、状態ｓの下で行動選択が行われるとき、ＴＡＭＥＲエージェントは戦略ａ＝ａｒｇｍａｘａＲ＾_Ｈ（ｓ，ａ）を直接採用する。このような近視眼的学習は、割引係数γ＝０で強化学習を行うことに相当する。 Specifically, when a human observes the agent's action, the human brain needs a certain reaction time to produce the corresponding feedback signal. During this time, however, the agent may already have started a new exploration step, so the human feedback arrives with some delay. To solve this problem, the agent needs to distribute each human reward signal over several recent state-action pairs, each of which contributes to the label of the corresponding sample. TAMER uses an established regression algorithm to model a hypothetical human reward function R_H: S×A→ℝ. Note that the TAMER framework does not prescribe a specific model or supervised learning algorithm for approximating the human reward function; all such decisions are made, for example, by the designer. Also, in the TAMER framework, the labels of the state-action samples consist entirely of human rewards. Furthermore, when an action is selected in state s, the TAMER agent directly adopts the strategy a = argmax_a R^_H(s, a). Such myopic learning is equivalent to reinforcement learning with a discount factor γ = 0.
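The myopic strategy a = argmax_a R^_H(s, a) amounts to scoring every available action with the learned reward model and taking the best one, with no look-ahead. A minimal sketch follows; `reward_model` stands for any callable that approximates R^_H and is a hypothetical name, not part of the described device.

```python
def select_action(reward_model, state, actions):
    """Return the action with the largest predicted human reward in this state.

    reward_model(state, action) is assumed to approximate R^_H(s, a).
    Ties are resolved in favor of the action listed first.
    """
    return max(actions, key=lambda action: reward_model(state, action))
```

For example, with a toy model that rewards reducing the angle to a target, `select_action(lambda s, a: -abs(s + a), 0.3, (-0.1, 0.0, 0.1))` picks `-0.1`.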

ＴＡＭＥＲのフレームワークでは、エージェントの役割が本質的にやや不特定である。それは、測定することができないタスクのパフォーマンスを最大化するように、人間の報酬から学習しなければならないからである。
このため、実施形態では、エージェントがトレーナーの報酬のモデルを学習し、そのモデルが予測する行動が最も多くの報酬に直結することを選択することで、エージェントがその役割を最もよく果たすという仮説を立てた。 In the TAMER framework, the agent's role is inherently somewhat underspecified, because the agent must learn from human reward so as to maximize performance on a task that it cannot measure.
For this reason, the embodiment hypothesizes that the agent best fulfills its role by learning a model of the trainer's reward and selecting the action that the model predicts will directly elicit the most reward.

報酬学習管理部３０２は、計算された確率ｈ＾、状態表現ｓ、選択された行動ａを１つの教師付き学習サンプル（ｓ，ａ，ｈ＾）として取得し、教師付き学習アルゴリズムで報酬関数Ｒ_Ｈ（ｓ，ａ）を学習する。報酬学習管理部３０２は、Ｒ_Ｈ（ｓ，ａ）を（ｓ，ａ，ｈ＾）で更新する。
報酬学習管理部３０２は、エージェント３００の行動を評価してそれを提供することに起因する人の報酬の時間遅れに対処するための信用付与器である。ＴＡＭＥＲでは、教師のフィードバック遅延の確率を推定するために確率密度関数ｆ（ｔ）を定義している。確率密度関数ｆ（ｔ）は、フィードバックが任意の特定の時間間隔内に発生する確率を提供し、単一の報酬信号が単一のタイムステップを対象としている確率（クレジット）を計算するために使用される。現在の時間ステップｔでは、各前の時間ステップｔ－ｋの確率は次式（４）のように計算される。 The reward learning management unit 302 acquires the calculated probability h^, the state representation s, and the selected action a as one supervised learning sample (s, a, h^), and learns the reward function R_H(s, a) with a supervised learning algorithm. The reward learning management unit 302 updates R_H(s, a) with (s, a, h^).
The reward learning management unit 302 is a credit assigner for dealing with the time delay of the human reward, which arises because the human must evaluate the action of the agent 300 before providing a reward. TAMER defines a probability density function f(t) to estimate the probability of the teacher's feedback delay. The probability density function f(t) gives the probability that feedback occurs within any particular time interval, and it is used to calculate the probability (credit) that a single reward signal targets a single time step. At the current time step t, the probability for each previous time step t-k is calculated as in equation (4) below.
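The image for equation (4) is not reproduced in this text. One common way to realize the description is to integrate the delay density f over the range of delays that would place the reward inside a given time step. The sketch below does this for a uniform delay density. The uniform shape and its bounds (0.2 s to 0.8 s) are assumptions chosen for illustration, as are the function names.

```python
def uniform_cdf(x, low, high):
    """Cumulative distribution of a uniform delay density on [low, high]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def credit(reward_time, step_start, step_end, low=0.2, high=0.8):
    """Probability that a reward received at reward_time targets the step [step_start, step_end).

    The step is targeted when the feedback delay lies between
    reward_time - step_end and reward_time - step_start.
    """
    shortest_delay = reward_time - step_end
    longest_delay = reward_time - step_start
    return uniform_cdf(longest_delay, low, high) - uniform_cdf(shortest_delay, low, high)
```

With 0.5 s time steps and a reward arriving at 1.5 s, the two most recent steps each receive a credit of about 0.5 and the step before them receives 0, so the credits sum to 1.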

なお、人間が複数の報酬を与える場合、各前回のタイムステップ（状態－行動ペア）のラベルｈ＾は、式（４）を用いて各人の報酬で計算された全ての確率の総和である。
報酬学習管理部３０２は、ｈ＾と状態－行動ペアを教師付き学習サンプルとして使用し、最小二乗の勾配に基づいて、式（５）、式（６）のようにパラメータを更新することでＲ＾_Ｈ（ｓ，ａ）を学習する。なお、式（５）、式（６）において、ここで、αは学習率であり、δ_ｔは時間差誤差である。 Note that when the human gives multiple rewards, the label h^ of each previous time step (state-action pair) is the sum of all the probabilities calculated with equation (4) for each human reward.
The reward learning management unit 302 uses h^ and the state-action pair as a supervised learning sample and learns R^_H(s, a) by updating the parameters according to the least-squares gradient, as in equations (5) and (6). In equations (5) and (6), α is the learning rate and δ_t is the temporal difference error.

なお、式（５）、式（６）において、ｈは、任意の時間ステップｔでエージェントが受け取った人間の報酬ラベルである。 Note that in equations (5) and (6), h is the human reward label received by the agent at any time step t.
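The images for equations (5) and (6) are not reproduced in this text. For a linear model and a least-squares objective, the usual gradient step computes the error between the label and the current prediction and then moves the weights along the feature vector. The sketch below shows that standard update; it assumes that, with γ = 0, the error δ_t reduces to the label minus the prediction. The function names are hypothetical.

```python
def predict(weights, phi):
    """Linear prediction of the human reward for one state-action feature vector."""
    return sum(w * f for w, f in zip(weights, phi))


def update_weights(weights, phi, h, alpha):
    """One least-squares gradient step.

    delta = h - prediction, then w <- w + alpha * delta * phi.
    Returns the new weight list and the error delta.
    """
    delta = h - predict(weights, phi)
    new_weights = [w + alpha * delta * f for w, f in zip(weights, phi)]
    return new_weights, delta
```

A small learning rate α makes each human reward nudge the model rather than overwrite it, which matters when feedback is sparse or noisy.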

行動選択部３０４は、更新された報酬関数Ｒ_Ｈ（ｓ，ａ）を用いて、別の行動（提案角度指令）φ_ａを選択する。人間によるデモンストレーションと計画を介して生成された軌跡は、状態・動作ペア｛（ｓ_０，ａ_０），…，（ｓ_ｎ，ａ_ｎ）｝のシーケンスで構成されており、これらは逆ＲＬアルゴリズムに供給される。エージェント３００は，人の報酬関数を学習し、ａｒｇｍａｘ_ａＲ＾_Ｈ（ｓ,ａ）によって人間の報酬を最大化するロボット１の行動を、例えば次式（７）または次式（８）を用いて選択する。 The action selection unit 304 uses the updated reward function R_H(s, a) to select another action (proposed angle command) φ_a. The trajectories generated through human demonstration and planning consist of sequences of state-action pairs {(s_0, a_0), ..., (s_n, a_n)}, which are supplied to the inverse RL algorithm. The agent 300 learns the human reward function and selects the action of the robot 1 that maximizes the human reward by argmax_a R^_H(s, a), using, for example, equation (7) or equation (8) below.

エージェント３００は、ロボット１が最適な行動を学習するまで，行動をとり、報酬を検出し、予測報酬関数モデルを更新して新しいサイクルを開始する。なお、選択される行動は、例えば期待されるロボット１の顔の向きである。これによって、ロボット１が注目する人Ｈｕに顔を向けることができる。 The agent 300 repeats the cycle of taking an action, detecting the reward, and updating the predictive reward function model until the robot 1 has learned the optimal behavior. The selected action is, for example, the expected face direction of the robot 1. This allows the robot 1 to turn its face toward the person Hu on whom it is focusing.

＜処理アルゴリズム＞
次に、エージェントが行う処理アルゴリズム例を説明する。図８は、本実施形態に係るエージェントが行う処理アルゴリズム例を示す図である。 <Processing algorithm>
Next, an example of a processing algorithm performed by the agent will be explained. FIG. 8 is a diagram illustrating an example of a processing algorithm performed by an agent according to this embodiment.

・手順１：エージェント３００は、報酬関数Ｒ、人の報酬関数Ｒ＾_Ｈ、行動価値関数Ｑ（ｓ，ａ）または状態価値関数Ｖ（ｓ）を初期化する。 - Step 1: The agent 300 initializes the reward function R, the human reward function R^_H, and the action value function Q(s, a) or the state value function V(s).

・手順２：人間がデモンストレーションを行う。エージェント３００は、行われたデモンストレーションを取得、記録する。 - Step 2: A human performs a demonstration. The agent 300 acquires and records the demonstration.

・手順４、手順５：エージェント３００は、デモを記録し、計画を介して軌道を生成し、逆ＲＬを介して報酬関数Ｒを最適化する。人間によるデモンストレーションと計画を介して生成された軌跡は、状態・動作ペア｛（ｓ_０，ａ_０），…，（ｓ_ｎ，ａ_ｎ）｝のシーケンスで構成されており、これらは逆ＲＬアルゴリズムに供給される。 - Steps 4 and 5: The agent 300 records the demonstration, generates trajectories through planning, and optimizes the reward function R through inverse RL. The trajectories generated through human demonstration and planning consist of sequences of state-action pairs {(s_0, a_0), ..., (s_n, a_n)}, which are supplied to the inverse RL algorithm.

・手順６：エージェント３００は、デモンストレーションから逆ＲＬ（強化学習）を介して学習された報酬関数Ｒを、ＴＡＭＥＲにおける人間の報酬関数Ｒ^_Ｈの種付けに使用する。 - Step 6: The agent 300 uses the reward function R learned from the demonstration through inverse RL (reinforcement learning) to seed the human reward function R^ _H in TAMER.

・手順８：訓練者は、人間の評価フィードバックを提供することで、エージェント３００の方針を修正することができる。エージェント３００は、人の報酬ｈを受け取ったか否か判別する。 - Step 8: The trainer can correct the policy of the agent 300 by providing human evaluation feedback. The agent 300 determines whether a human reward h has been received.

・手順９、手順１０：エージェント３００は、例えば、人間の報酬を受け取った場合、受け取った報酬を、人の報酬関数Ｒ＾_Ｈを更新するために使用する。 - Steps 9 and 10: For example, when the agent 300 receives a human reward, the agent 300 uses the received reward to update the human reward function R^ _H .

・手順１１、手順１２、手順１３：エージェント３００は、報酬関数Ｒを持つ１つの動作を選択して実行する。 - Step 11, Step 12, Step 13: The agent 300 selects and executes one action with the reward function R.

・手順１４：エージェント３００は、訓練者がエージェント３００の行動に満足するまで繰り返す。 - Step 14: The agent 300 repeats until the trainer is satisfied with the behavior of the agent 300.
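The feedback part of the procedure (steps 8 to 14) can be sketched as a loop that selects an action, executes it, asks the trainer for feedback, and updates the reward model until the trainer is satisfied. Everything in the toy setup is an assumption made for illustration: the one-dimensional face angle, the target direction, the three indicator features, and the simulated trainer. The IRL seeding of steps 1 to 6 is represented only by the initial weights.

```python
def predict(weights, phi):
    return sum(w * f for w, f in zip(weights, phi))


def run_feedback_loop(state, actions, features, transition, trainer, weights,
                      alpha=0.1, max_steps=100):
    """Repeat select -> act -> feedback -> update until the trainer is satisfied."""
    weights = list(weights)  # seeded values, e.g. from the IRL step
    for _ in range(max_steps):
        # Steps 11-13: select and execute the action with the largest predicted reward.
        action = max(actions, key=lambda a: predict(weights, features(state, a)))
        phi = features(state, action)
        state = transition(state, action)
        # Step 8: ask the trainer; h is None when no feedback is given.
        h, satisfied = trainer(state, action)
        if h is not None:
            # Steps 9-10: least-squares update of the human reward model.
            delta = h - predict(weights, phi)
            weights = [w + alpha * delta * f for w, f in zip(weights, phi)]
        if satisfied:  # Step 14
            break
    return state, weights


TARGET = 0.3  # hypothetical direction of the person, in radians


def toy_features(face, a):
    """Indicator features: action moves toward the target, stays, or moves away."""
    toward = a * (TARGET - face) > 0
    return [1.0 if toward else 0.0,
            1.0 if a == 0 else 0.0,
            1.0 if (a != 0 and not toward) else 0.0]


def toy_trainer(face, a):
    """Simulated trainer: +1 if the last action reduced the error, otherwise -1."""
    old_error = abs(TARGET - (face - a))
    new_error = abs(TARGET - face)
    return (1 if new_error < old_error else -1), new_error < 1e-6


final_face, final_weights = run_feedback_loop(
    0.0, (-0.1, 0.0, 0.1), toy_features, lambda s, a: s + a, toy_trainer, [0.0, 0.0, 0.0])
```

In this toy run the agent first tries turning away and staying, receives negative feedback for both, and then turns toward the target until the simulated trainer is satisfied.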

＜人間のソーシャルフィードバックからの対話型ＲＬのためのフレームワーク＞
以下の例では、利用者が操作部を操作して明示的なフィードバック信号を用いてエージェントを訓練するのではなく、顔の表情やジェスチャーのような人間の社会的な信号を、エージェントと人間のユーザとの間の相互作用のプロセスに提供する。
これにより、本実施形態によれば、エージェントの訓練経験のない利用者が、複雑な訓練ルールを学習することなく、エージェントの行動の好みに基づいてエージェントを訓練することができる。また、本実施形態によれば、利用者の期待に応じたフィードバックを行うことができるようにするためのアプローチを提供できる。 <Framework for interactive RL from human social feedback>
In the following example, instead of the user operating an operation unit to train the agent with explicit feedback signals, human social signals such as facial expressions and gestures are fed into the interaction process between the agent and the human user.
Thus, according to the present embodiment, a user with no experience in training agents can train the agent based on the user's preferences about the agent's behavior, without having to learn complicated training rules. Furthermore, the present embodiment can provide an approach that allows feedback to be given in line with the user's expectations.

図９は、本実施形態に係る人間のソーシャルフィードバックからの対話型ＲＬのためのフレームワークを示す図である。図９のように、人間のソーシャルフィードバックは、２つの方法で訓練する。
第１の方法は、顔のフィードバックを用いてエージェントを訓練する方法である。もう第２の方法は、キーボードによるフィードバックの代わりにジェスチャーフィードバックを直接導入する方法である。 FIG. 9 is a diagram illustrating a framework for interactive RL from human social feedback according to this embodiment. As shown in Figure 9, human social feedback is trained in two ways.
The first method is to train the agent using facial feedback. The second method is to directly introduce gesture feedback instead of keyboard feedback.

エージェント３００は、顔やジェスチャーの信号から人間Ｈｕの報酬を学習し、最適なポリシーを得る。本実施形態では、撮影部１０２でリアルタイムの顔認識を実現するモジュール４０１、モーションセンサでオンラインジェスチャー認識を実現するモジュール４０３、および異なるプロセス間のソケット通信で人間のフィードバックを報酬信号にマッピングするモジュール（４０１～４０３、エージェント３００）を含み、機能の異なる複数のモジュールを含んでいる。本実施形態では、感情エージェントを構成するＴＡＭＥＲエージェントに、リアルタイム顔認識を導入した。さらに、本実施形態では、オンラインジェスチャー認識モジュールをＴＡＭＥＲＡｇｅｎｔと組み合わせることで、ジェスチャーエージェントを実現する。なお、モーションセンサは、例えば手と指をトラッキングするセンサである。 The agent 300 learns the human Hu's reward from facial and gesture signals and obtains an optimal policy. This embodiment comprises several modules with different functions: a module 401 that performs real-time face recognition with the imaging unit 102, a module 403 that performs online gesture recognition with a motion sensor, and a module that maps human feedback to reward signals through socket communication between the different processes (401 to 403, agent 300). In this embodiment, real-time face recognition is introduced into the TAMER agent, which constitutes the emotional agent. Furthermore, a gesture agent is realized by combining the online gesture recognition module with the TAMER agent. Note that the motion sensor is, for example, a sensor that tracks hands and fingers.

まず、リアルタイム感情認識について説明する。
顔の表情は、表情筋の１つ以上の動きや状態の結果である。これらの動きは、観察者の個人の感情を表現している。そして、表情は、非言語コミュニケーションの一形態である。表情は、人間同士の社会的な情報を表現する主な手段であり、通常は感情を伝えるために使用される。本実施形態では、リアルタイム感情認識を設計するために、例えば図１０のような畳み込みニューラルネットワーク（ＣＮＮ）フレームワークを用いる。図１０は、リアルタイム感情分類のためのＣＮＮモデルを示す図である。図１０のフレームワークは、取得された画像５０１に対して、４つの残差深度分離可能な畳み込みを持つ、完全畳み込みニューラルネットワークである。各畳み込みには、バッチ正規化演算（５０２、５０３、５０４、５０５、５０６、５０９）とＲｅＬＵ活性化関数が接続される。最後の層（５０７、５１０、５１１）では、グローバル平均プーリングとソフトマックス活性化関数を適用して予測値を生成する。 First, real-time emotion recognition will be explained.
A facial expression is the result of one or more movements or states of the facial muscles. These movements convey an individual's emotions to observers. Facial expressions are a form of nonverbal communication. They are the main means of expressing social information between humans and are usually used to convey emotions. In this embodiment, a convolutional neural network (CNN) framework such as that shown in FIG. 10 is used to build the real-time emotion recognition. FIG. 10 is a diagram showing a CNN model for real-time emotion classification. The framework of FIG. 10 is a fully convolutional neural network that applies four residual depthwise separable convolutions to the acquired image 501. Each convolution is followed by a batch normalization operation (502, 503, 504, 505, 506, 509) and a ReLU activation function. The last layers (507, 510, 511) apply global average pooling and a softmax activation function to generate the prediction.

本実施形態の顔認識モジュールでは、例えば、「嬉しい」、「悲しい」、「怒っている」、「怖い」、「驚いている」、「中性」および「嫌悪感」の７つの感情を認識できる。また、実験では、「幸せ（ポジティブ）」感情と「不幸（ネガティブ）」感情とに分類した。実施形態では、「怒り」「悲しみ」「恐れ」を、「不幸」感情としてラベル付けした。そして、実験では、利用者が、エージェント３００の行動を直接観察し、「幸せ」と「不幸」の表現で好みを伝えるようにした。なお、「幸せ（ポジティブ）」な表情は、例えば微笑んでいる表情であり、「不幸（ネガティブ）」な表情は、例えば怒っている表情である。 The face recognition module of this embodiment can recognize, for example, seven emotions: "happy", "sad", "angry", "scared", "surprised", "neutral", and "disgusted". In the experiment, these were grouped into "happy (positive)" and "unhappy (negative)" emotions. In the embodiment, "anger", "sadness", and "fear" were labeled as "unhappy" emotions. In the experiment, the user directly observed the behavior of the agent 300 and conveyed preferences through "happy" and "unhappy" expressions. A "happy (positive)" expression is, for example, a smiling expression, and an "unhappy (negative)" expression is, for example, an angry expression.

なお、本実施形態では、エージェントの状態や行動ごとに報酬信号を連続的に与えるのではなく、トレーナーが必要と感じたときに、トレーナーの判断でフィードバックを与えることができるようにした。
なお、実験では、表情認識は連続的であるため、エージェントと同じ速度でモジュールが動作しやすいように、表情をつかむ間隔を２秒に設定して、顔のフィードバックを抽出した。また、実験では、ユーザがフィードバックを提供したくない場合は、ユーザは「中立」または撮影部１０２が撮影できないところにいて、エージェントがフィードバックを受け取らないようにした。 In this embodiment, instead of continuously giving reward signals for each agent's state or action, feedback can be given at the trainer's discretion when the trainer feels it is necessary.
In the experiment, because facial expression recognition runs continuously, facial feedback was sampled at intervals of 2 seconds so that the module could easily operate at the same pace as the agent. When the user did not want to give feedback, the user kept a "neutral" expression or stayed where the imaging unit 102 could not capture them, so that the agent received no feedback.
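The grouping of recognized emotions into positive, negative, and "no feedback" can be sketched as follows. The reward magnitudes of +1 and -1 are assumptions, since the text does not state numeric values for facial feedback. Labels that the text does not assign to either group ("surprised", "disgusted") are treated here as no feedback, which is also an assumption.

```python
# Grouping follows the text: "happy" is positive; "angry", "sad", "scared" are negative.
POSITIVE = {"happy"}
NEGATIVE = {"angry", "sad", "scared"}


def emotion_to_reward(label):
    """Map a recognized emotion label to a reward, or None when no feedback is given."""
    if label in POSITIVE:
        return 1     # assumed magnitude
    if label in NEGATIVE:
        return -1    # assumed magnitude
    return None      # "neutral", no face in view, or a label the text does not group
```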

次に、オンラインジェスチャー認識について説明する。
ジェスチャーは、日常生活の中での自然なコミュニケーションの方法であり、例えば聴覚障害者や言語障害者の間でよく使われます。人間とコンピュータの相互作用の観点から、コミュニケーション言語としてのジェスチャーは、サービスロボットを利用した言語障害のあるユーザ、専用の入力デバイスを使用することが不便な水中作業、言語コミュニケーションに大きな干渉を与えるノイズの多い環境など、非常に幅広い応用が可能である。
実施形態では、モーションセンサを用いて３種類のジェスチャーを認識する。 Next, online gesture recognition will be explained.
Gestures are a natural means of communication in everyday life and are commonly used, for example, among people with hearing or speech impairments. From the perspective of human-computer interaction, gestures as a communication language have a very wide range of applications, such as users with speech impairments using service robots, underwater work where dedicated input devices are inconvenient, and noisy environments that severely interfere with spoken communication.
In the embodiment, three types of gestures are recognized using a motion sensor.

第１のモデルは「簡易ジェスチャー検出入力（ＥａｓｙＧｅｓｔｕｒｅＰｌａｙＩｎｐｕｔ）」で、１つのニュートラル状態と５つの基本ジェスチャーを検出し、簡単な信号処理でラベル付けを行う。ニュートラル状態は、「Ｙｅｓ」であり、例えば静止した親指を上げることで表した。実施形態において基本的なジェスチャーは、「Ｎｏ」（例えば親指を下に倒す）、「Ｇｒｅａｔ」（例えば親指を上に跳ね上げる）、「Ｓｔｏｐ」（手を振る）、「Ｌｅｆｔｓｗｉｐｅ」（例えば左に大きく振る）、「Ｒｉｇｈｔｓｗｉｐｅ」（例えば右に大きく振る）の５つである。 The first model is "Easy Gesture Play Input", which detects one neutral state and five basic gestures and labels them with simple signal processing. The neutral state is "Yes", expressed, for example, by holding a thumb up without moving. In the embodiment, the five basic gestures are "No" (e.g., thumb turned down), "Great" (e.g., thumb flicked upward), "Stop" (waving the hand), "Left swipe" (e.g., a large swing to the left), and "Right swipe" (e.g., a large swing to the right).

第２のモデルは、機械学習アルゴリズムを使用してジェスチャー活動を認識して分類して、ロボット１上での反応を誘発する（例えば、テレプレゼンスやソーシャルロボット）。 The second model uses machine learning algorithms to recognize and classify gestural activity to trigger a response on the robot 1 (eg, telepresence or social robot).

第３のモデルは、超音波の原理に基づいてジェスチャー動作をロボットにマッピングするキネティック特徴マッピング制御入力（ＫｉｎｅｍａｔｉｃＦｅａｔｕｒｅＭａｐｐｉｎｇＣｏｎｔｒｏｌＩｎｐｕｔ）である。 The third model is a Kinematic Feature Mapping Control Input that maps gesture movements to the robot based on ultrasound principles.

実験では、第１のモデルを用いて予め行ったジェスチャー認識の実験に基づいて、ポジティブなフィードバックを提供するために“Ｇｒｅａｔ”を選択し、エージェントへのネガティブなフィードバックを表現するために“Ｓｔｏｐ”を選択するようにした。図１１は、リアルタイムジェスチャー認識のビジュアル表示例を示す図である。図１１の符号ｇ２０１は“Ｇｒｅａｔ”のジェスチャー例を示し、図１１の符号ｇ２０２は“Ｓｔｏｐ”のジェスチャー例を示している。 In the experiment, based on gesture recognition experiments conducted beforehand with the first model, "Great" was chosen to give positive feedback and "Stop" was chosen to express negative feedback to the agent. FIG. 11 is a diagram illustrating a visual display example of real-time gesture recognition. Reference numeral g201 in FIG. 11 indicates an example of the "Great" gesture, and reference numeral g202 in FIG. 11 indicates an example of the "Stop" gesture.

次に、実施形態で用いたソケット通信について説明する。
実験では、ＴＡＭＥＲエージェントをＪａｖａ（登録商標）スクリプトで動作させ、リアルタイム表情認識モジュールとオンラインジェスチャー認識モジュールはＰｙｔｈｏｎ（登録商標）で実装した。なお、実装は、他のプロミング言語やスクリプトを用いてもよい。 Next, socket communication used in the embodiment will be explained.
In the experiment, the TAMER agent was operated using Java (registered trademark) script, and the real-time facial expression recognition module and online gesture recognition module were implemented using Python (registered trademark). Note that the implementation may use other programming languages or scripts.

実験では、２つのプロセス間で安全かつ信頼性の高いデータ転送を実現するために、図１２に示すＴＣＰ通信機構に対するソケット方式を採用した。図１２は、リアルタイム感情認識モジュールとリアルタイムジェスチャー認識からＴＡＭＥＲエージェントが受信するフィードバック信号の模式図である。図１２を参照して、実験で用いたソケット通信の概略を説明する。
データ送信の過程では、エージェントはサーバ側であり、リアルタイム認識モジュールは通信中のクライアントである。クライアントは、認識した結果を全てサーバに遅滞なく渡すが、サーバは選択的にデータを受信する。実験では、２秒ごとにクライアントの出力ストリームからデータを受信して読み出すようにサーバを設定した。 In the experiment, to achieve secure and reliable data transfer between the two processes, a socket scheme over the TCP communication mechanism shown in FIG. 12 was adopted. FIG. 12 is a schematic diagram of the feedback signals that the TAMER agent receives from the real-time emotion recognition module and the real-time gesture recognition module. An outline of the socket communication used in the experiment will be explained with reference to FIG. 12.
In the process of data transmission, the agent is the server side, and the real-time recognition module is the communicating client. The client passes all recognized results to the server without delay, but the server selectively receives data. In the experiment, the server was configured to receive and read data from the client's output stream every two seconds.
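The client-server arrangement can be sketched with Python's standard `socket` module. This is an illustration under assumptions, not the experimental code: the newline-delimited label format, the loopback address, and the function names are hypothetical, and the 2-second polling timer is omitted so that the example terminates deterministically. Here the server simply drains what the client sent and keeps the newest label.

```python
import socket
import threading


def send_labels(port, labels):
    """Client side: push every recognition result to the server without delay."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        for label in labels:
            conn.sendall((label + "\n").encode("utf-8"))


def latest_label(conn):
    """Server side: read until the client closes, then keep only the newest label."""
    data = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:          # empty read: the client closed the connection
            break
        data += chunk
    labels = data.decode("utf-8").split()
    return labels[-1] if labels else None


def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0 lets the OS choose a free port
    server.listen(1)
    port = server.getsockname()[1]
    client = threading.Thread(
        target=send_labels, args=(port, ["neutral", "happy", "happy", "angry"]))
    client.start()
    conn, _ = server.accept()
    with conn:
        label = latest_label(conn)
    client.join()
    server.close()
    return label
```

In a live system the server would instead read on a timer, for example every 2 seconds as in the experiment, and discard everything but the most recent result.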

＜実験＞
次に、本実施形態のエージェントを用いて実験を行った結果例を説明する。
実験では、人間（例えばユーザ）が、現在の環境でロボット１が行った行動を観察する。そして、ユーザは、ロボット１のエージェントが選択した動作にユーザが同意した場合、ポジティブなフィードバックを与え、選択された行動が期待にそぐわないとユーザが考えた場合、ネガティブなフィードバックを与える。ユーザは、予め定義されたキーパッドフィードバックと比較して、異なる表情またはジェスチャーを報酬信号として提供することにより、エージェントを訓練した。 <Experiment>
Next, an example of the results of an experiment using the agent of this embodiment will be described.
In the experiment, a human (for example, a user) observes the actions performed by the robot 1 in the current environment. The user gives positive feedback when they agree with the action selected by the agent of the robot 1, and negative feedback when they think the selected action does not meet expectations. For comparison with predefined keypad feedback, the user trained the agent by providing different facial expressions or gestures as reward signals.

実験では、３つのフィードバックをＴＡＭＥＲエージェントに学習させた。
１つ目は、比較例であり、キーボードからの明示的なフィードバックを用いて学習するキーボードエージェントである。２つ目は、表情を用いて学習する感情エージェントである。３つ目は、ジェスチャーを使ってフィードバックを提供するジェスチャーエージェントである。
実験では、３つのエージェントに対して、２つの強化学習ベンチマークタスク（ＬｏｏｐＭａｚｅとＴｅｔｒｉｓ（登録商標））を用いてテストを行った。また、実験では、各エージェントに対して各タスクで１０回の訓練を行った。なお、実験結果は、１０回の試行で収集したデータの平均値である。 In the experiment, the TAMER agent was trained with three types of feedback.
The first, a comparative example, is a keyboard agent that learns using explicit feedback from the keyboard. The second is an emotional agent that learns using facial expressions. The third is a gesture agent that learns from feedback provided through gestures.
In the experiment, three agents were tested using two reinforcement learning benchmark tasks (LoopMaze and Tetris (registered trademark)). In addition, in the experiment, each agent was trained 10 times for each task. Note that the experimental results are the average value of data collected in 10 trials.

キーボードエージェントについては、人間がエージェントの動作を観察し、指定されたキーボードキーを押すことでフィードバックを行った。実験では、ｖキーを押すと報酬が＋１となり、ｎキーを押すと報酬が－１となり、キーを何度もクリックすることで報酬の値を重ね合わせることができるようにした。 For the keyboard agent, humans observed the agent's actions and provided feedback by pressing designated keyboard keys. In the experiment, pressing the v key increased the reward by +1, pressing the n key decreased the reward by -1, and the reward values could be stacked by clicking the key many times.
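
The stacking rule for keyboard feedback can be illustrated with a short Python sketch. The key bindings (`v` for +1, `n` for -1) come from the text; the function name `keyboard_reward` and the decision to ignore other keys are assumptions made for illustration.

```python
KEY_REWARD = {"v": +1, "n": -1}  # key bindings stated in the text


def keyboard_reward(keys_pressed: str) -> int:
    """Sum the reward of every key press seen during one feedback window.

    Keys other than 'v' and 'n' are ignored (an assumption of this sketch).
    """
    return sum(KEY_REWARD.get(key, 0) for key in keys_pressed)
```

For example, pressing `v` three times in a row yields a stacked reward of 3, while `v`, `n`, `n` yields -1.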

感情エージェントでは、最初の２つのエピソードで人間がエージェントの性能を観察し、キーボードからのフィードバックで第一のポリシーを学習するように訓練した。実験では、３つ目のエピソードから、表情のフィードバックを学習モデルに導入した。エージェントによって選択された行動が期待通りのものであれば、人間は笑顔の表情を見せることになり、エージェントにポジティブなフィードバック信号を出力する。エージェントが選択した行動が不満足な場合、人間は、怒りや恐怖、悲しみなどの感情を表現することで、エージェントにネガティブなフィードバック信号を出力する。なお、感情エージェントは、表現から学習して、初期のポリシーをさらに調整する。 For the emotional agent, humans observed the agent's performance during the first two episodes and trained it to learn the first policy using keyboard feedback. In the experiment, facial feedback was introduced into the learning model starting from the third episode. If the action selected by the agent is as expected, the human will show a smiling face and output a positive feedback signal to the agent. If the agent's chosen action is unsatisfactory, humans output negative feedback signals to the agent by expressing emotions such as anger, fear, and sadness. Note that the emotional agent learns from the expressions to further adjust the initial policy.

ジェスチャーエージェントは、人間がエージェントの行動を観察し、２種類のジェスチャー（“Ｇｒｅａｔ”と“Ｓｔｏｐ”）でフィードバックを行うようにした。エージェントが選択した行動が期待通りであれば、人間は、親指を立ててバウンスさせ、“Ｇｒｅａｔ”と表現するようにした。エージェントが選択した行動が不適切だと思ったとき、人間、は手を振り、エージェントへの否定的なフィードバック信号として“Ｓｔｏｐ”と表現するようにした。 In the gesture agent, a human observes the agent's actions and provides feedback using two types of gestures (“Great” and “Stop”). If the action selected by the agent was as expected, the human would give a thumbs up and bounce to express "Great". When humans thought that the agent's chosen action was inappropriate, they would wave their hands and say "Stop" as a negative feedback signal to the agent.
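
The feedback rules in the two paragraphs above can be summarized as a small lookup. This is a sketch under stated assumptions: the label strings, the reward magnitudes of +1 and -1, and the function names are hypothetical. The text specifies only which signals count as positive or negative and that the emotional agent uses keyboard feedback for its first two episodes.

```python
from typing import Optional

# Label strings are hypothetical; the recognizers' actual outputs are not given.
POSITIVE_SIGNALS = {"happy", "Great"}
NEGATIVE_SIGNALS = {"angry", "fear", "sad", "Stop"}


def signal_to_reward(label: str) -> Optional[int]:
    """Map a recognized expression or gesture to a reward signal.

    Returns None for labels that carry no feedback (assumption: such
    labels, e.g. a neutral face, are simply skipped).
    """
    if label in POSITIVE_SIGNALS:
        return +1
    if label in NEGATIVE_SIGNALS:
        return -1
    return None


def emotion_agent_source(episode: int) -> str:
    """Feedback channel used by the emotional agent in a 1-based episode."""
    return "keyboard" if episode <= 2 else "expression"
```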

［ＬｏｏｐＭａｚｅ］
迷路ゲームであるＬｏｏｐＭａｚｅを用いて実験では、エージェントがゴールに２５回（つまり２５エピソード）到達した時点でトレーニングセッションを停止し、各エピソードの最大トレーニング時間ステップを２０００に設定した。図１３は、ＬｏｏｐＭａｚｅタスクのスクリーンショットを示す図である。図１３において、符号ｇ３１１とｇ３１２は壁を表し、符号ｇ３２１はエージェントを表し、符号ｇ３２２はエージェントの移動方向を示す。 [LoopMaze]
In an experiment using the maze game LoopMaze, the training session was stopped when the agent reached the goal 25 times (i.e., 25 episodes), and the maximum training time step for each episode was set to 2000. FIG. 13 is a diagram showing a screenshot of the LoopMaze task. In FIG. 13, symbols g311 and g312 represent walls, g321 represents an agent, and g322 represents the moving direction of the agent.

ＬｏｏｐＭａｚｅのタスクは、３０の状態を含んでいる。タスクにおいて、各状態のエージェントは、上下左右に移動することができ、ある状態で選択されたアクションがエージェントに壁にぶつかった場合、移動は発生しない。エージェントの目標は、開始状態ｇ３０１からゴール状態ｇ３０２へと導く最適な政策をできるだけ早く学習することである。開始状態からゴール状態への最短経路は１９のアクションを必要とする。ＬｏｏｐＭａｚｅでエージェントが利用できる行動は、目標状態と最後に選択されたアクションからの相対的な位置に依存する。 The LoopMaze task includes 30 states. In a task, an agent in each state can move up, down, left, or right, and if an action selected in a certain state causes the agent to hit a wall, no movement will occur. The agent's goal is to learn as quickly as possible the optimal policy that leads from the starting state g301 to the goal state g302. The shortest path from the starting state to the goal state requires 19 actions. The actions available to an agent in LoopMaze depend on the target state and its position relative to the last selected action.
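
The movement rule (four directions, no movement when the action would run into a wall) can be sketched as a transition function. The layout below is a hypothetical toy grid, not the 30-state LoopMaze map of FIG. 13, and modeling walls as blocked cells is an assumption of this sketch.

```python
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]  # (row, column)

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state: Cell, action: str, free_cells: FrozenSet[Cell]) -> Cell:
    """Return the next cell; the agent stays put if the move would hit a wall."""
    d_row, d_col = MOVES[action]
    candidate = (state[0] + d_row, state[1] + d_col)
    return candidate if candidate in free_cells else state


# Hypothetical 2x3 layout, NOT the LoopMaze map from FIG. 13.
# Free cells:  (0,0) (0,1) (0,2)
#              (1,0)  wall (1,2)
FREE = frozenset({(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)})
```

With this layout, moving right from `(0, 0)` reaches `(0, 1)`, while moving down from `(0, 1)` leaves the agent where it is because the cell below is a wall.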

［Ｔｅｔｒｉｓ］
落ち物パズルであるＴｅｔｒｉｓでは、２^２００の状態があり、隣接する４つのブロックを一度に選択してテトリスのピースを形成する。アクションセットには、下、左、右、回転の４つの選択肢がある。テトリスのピースが一列に並ぶと、その一列を埋めているブロックは自動的に排除される。Ｔｅｔｒｉｓのタスクでは、落ちてくるブロックを一列に並べて、常に行数を消していき、最終的にゲームを無期限に走らせるのがベストな方針である。このため、制限内は、エージェントが１つのエピソードでクリアできる行数が多いほど、このポリシーの性能が高いということになる。 [Tetris]
In Tetris, a falling-block puzzle, there are 2^200 states, and four adjacent blocks are selected at once to form a Tetris piece. The action set has four options: down, left, right, and rotate. When Tetris pieces fill a complete row, the blocks filling that row are automatically eliminated. In the Tetris task, the best policy is to arrange the falling blocks into complete rows, keep clearing rows, and ultimately let the game run indefinitely. Therefore, within the limits, the more rows an agent can clear in one episode, the better the performance of its policy.

Ｔｅｔｒｉｓを用いて実験では、２０エピソードの訓練を行った。１つのエピソードでは、時間ステップの上限数を１００００とし、１回の訓練が終了するまで新たなエピソードが開始されるようにした。実験では、キーボードエージェントと感情エージェントの性能を比較するために、各エピソードにおける実行時間ステップ数と総フィードバック量の平均と分散を分析した。図１４は、Ｔｅｔｒｉｓタスクのスクリーンショットを示す図である。 In the experiment, 20 episodes of training were performed using Tetris. In one episode, the upper limit number of time steps was set to 10,000, and a new episode was started until one training session was completed. In the experiment, in order to compare the performance of the keyboard agent and the emotional agent, we analyzed the mean and variance of the number of execution time steps and the total amount of feedback in each episode. FIG. 14 is a diagram showing a screenshot of a Tetris task.

Ｔｅｔｒｉｓタスクでは、任意のテトリスピースが落ちている時間、人間が前回ステップのＴｅｔｒｉｓピースの配置についてフィードバック（ポジティブまたはネガティブ）を与えるようにした。また、落ちてくるブロックが完全に落ちると四角い部分が黒くなりますが、この時に与えられたフィードバックが最も効果的に働くようにした。 In the Tetris task, the human provided feedback (positive or negative) about the placement of Tetris pieces in the previous step while any Tetris pieces were falling. Also, when a falling block completely falls, the square part turns black, but we have made sure that the feedback given at this time works most effectively.

［実験結果］
まず、ＬｏｏｐＭａｚｅタスクとＴｅｔｒｉｓタスクの両方において、顔のフィードバックとジェスチャーフィードバックからの学習の実験結果を説明する。
感情フィードバックのないエージェントでは、利用者の好みは、キーボードのキーを押すという報酬のプロセスを通してしか実現できない。これに対し、本実施形態によれば、正しい行動を選択したときには単純に嬉しそうな表情を、予想外の行動を選択したときには悲しそうな表情や怒っているような表情を見せることで、利用者は自分の感情を伝えることができる。顔の表情で好みを表現するだけで満足のいくエージェントを得ることができ、利用者の認知的負担（例えば発話等）を大幅に軽減することができる。なお、実験では、キーボードフィードバックで初期のポリシーを取得し、その後、表情フィードバックを導入して形成されたポリシーを改善し、最終的に最適なポリシーを取得するようにエージェントを訓練した。このように実験を行った理由は、実験に用いた表情のエージェントの認識率が６６％であり、安定したポリシーを学習できなかったためである。
なお、ＬｏｏｐＭａｚｅでは、２つのエージェント（キーボードエージェントと感情エージェント）の学習性能を、エージェントが実行する時間ステップ数と人間のトレーナーが提供するフィードバックの総数の２つの指標で評価する。 [Experimental result]
First, we will explain the experimental results of learning from facial feedback and gesture feedback in both the LoopMaze task and the Tetris task.
In an agent without emotional feedback, a user's preferences can only be realized through the reward process of pressing keyboard keys. In contrast, according to the present embodiment, the user can convey their feelings simply by looking happy when the agent selects the correct action and by looking sad or angry when it selects an unexpected action. A satisfactory agent can be obtained simply by expressing preferences through facial expressions, and the cognitive burden on the user (for example, speaking) can be significantly reduced. In the experiment, the agent was trained to obtain an initial policy using keyboard feedback, then to improve that policy once facial feedback was introduced, and finally to obtain the optimal policy. The experiment was conducted in this way because the facial expression recognition rate of the agent used in the experiment was 66%, and a stable policy could not be learned.
Note that in LoopMaze, the learning performance of the two agents (keyboard agent and emotion agent) is evaluated using two indicators: the number of time steps performed by the agent and the total number of feedback signals provided by the human trainer.

図１５は、ＬｏｏｐＭａｚｅタスクに対する実験結果であり、キーボードエージェントと感情エージェントによる各エピソードにおける総時間ステップ数とフィードバックを受けた回数を示す図である。
なお、各プロットは、１０回の独立した実行から得られたデータを平均化して作成され、各実行は２５エピソードで構成されている。この実験では、２つのエージェントの学習性能を、エージェントが実行する時間ステップ数と、人間のトレーナーが提供するフィードバックの総数の二つの指標で評価した。得られる結果は、学習が進むにつれて，初期状態から目標状態に至るまでの時間ステップ数を徐々に減らし、フィードバックを受ける回数を少なくすることが望ましい。なお、エージェントが完璧な場合は、人間が生成した報酬を与えずに、１エピソードあたり２０回の時間ステップで目標状態に到達する。 FIG. 15 is an experimental result for the LoopMaze task, and is a diagram showing the total number of time steps and the number of times feedback was received in each episode by the keyboard agent and emotional agent.
Note that each plot was created by averaging data obtained from 10 independent runs, with each run consisting of 25 episodes. In this experiment, the learning performance of the two agents was evaluated using two metrics: the number of time steps performed by the agent and the total number of feedback signals provided by the human trainer. Ideally, as learning progresses, both the number of time steps needed to get from the initial state to the goal state and the number of feedback signals received gradually decrease. Note that a perfect agent reaches the goal state in 20 time steps per episode without being given any human-generated reward.
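
The per-episode averaging over independent runs, together with the standard error of the mean shown in the plots, could be computed as in the following sketch. The function name `mean_and_sem` and the use of the sample standard deviation are assumptions; the source does not state how the standard error was calculated.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def mean_and_sem(runs: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Per-episode mean and standard error of the mean across runs.

    runs[r][e] is the metric (time steps or feedback count) for run r,
    episode e. At least two runs are required for the standard deviation.
    """
    results = []
    for values in zip(*runs):  # regroup the data by episode
        mean = statistics.mean(values)
        # Sample standard deviation divided by sqrt(number of runs).
        sem = statistics.stdev(values) / math.sqrt(len(values))
        results.append((mean, sem))
    return results
```

In the setting described above, `runs` would hold 10 sequences of 25 values each, giving one (mean, standard error) pair per episode.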

グラフｇ４０１は、ＬｏｏｐＭａｚｅタスクの１エピソードを完了するために、キーボードエージェントと感情エージェントによる各エピソードにおける総時間ステップ数を示す図である。グラフｇ４１１は、開始状態からゴール状態へのナビゲートに成功したときに、キーボードエージェントと感情エージェントによる各エピソードにおけるフィードバックを受けた回数を示す図である。グラフｇ４０１とグラフｇ４１１において横軸はエピソード数であり、符号ｇ４０１のグラフにおいて縦軸は総時間ステップ数（回）であり、グラフｇ４１１おいて縦軸はフィードバックを受けた回数（回）である。また、符号ｇ４０２はキーボードエージェントの実験結果を示し、符号ｇ４０３は感情エージェントの実験結果を示し、符号ｇ４０４は平均値の標準誤差である。 Graph g401 is a diagram showing the total number of time steps in each episode by the keyboard agent and emotional agent to complete one episode of the LoopMaze task. Graph g411 is a diagram showing the number of times feedback is received in each episode from the keyboard agent and emotional agent when the user successfully navigates from the start state to the goal state. In graph g401 and graph g411, the horizontal axis is the number of episodes, in the graph g401, the vertical axis is the total number of time steps (times), and in graph g411, the vertical axis is the number of times (times) feedback was received. Furthermore, symbol g402 indicates the experimental results for the keyboard agent, symbol g403 indicates the experimental results for the emotion agent, and symbol g404 indicates the standard error of the mean value.

図１５のように、表情フィードバックが導入された初期段階では、目標状態に到達するまでの時間ステップ数とフィードバック量がわずかに増加していた。しかし、３～４回の学習を経て、感情エージェントの時間ステップ数とフィードバック量は、キーボードエージェントのそれと基本的には一致し、両エージェントともにほぼ最適なポリシーを学習している。図１５のように、学習期間の後、感情エージェントは、キーボードエージェントと似たような、またはより良いポリシーを学習することができた。 As shown in FIG. 15, at the initial stage when facial expression feedback was introduced, the number of time steps to reach the target state and the amount of feedback slightly increased. However, after three or four training sessions, the number of time steps and amount of feedback for the emotion agent basically match those for the keyboard agent, and both agents have learned almost optimal policies. As shown in Figure 15, after the learning period, the emotional agent was able to learn similar or better policies than the keyboard agent.

図１６は、Ｔｅｔｒｉｓタスクに対する実験結果であり、キーボードエージェントと感情エージェントによる総時間ステップ数、フィードバックを受けた数、クリアした行数を示す図である。
Ｔｅｔｒｉｓタスクでは、エージェントが４つの小さな正方形のランダムな組み合わせで生成され、それぞれの組み合わせの時間や確率も不確定である。このタスクでは、エージェントが優秀であればあるほど、実行時間が長くなり、より多くの状態やアクションを経験することができる。人間の訓練の下で、エージェントは最終的にＴｅｔｒｉｓのピースの全行を排除することを学習し、ゲームをプレイし続けることができる。このため、このタスクに対して期待されることは、人間のフィードバックの減少を必要としながら、時間のステップ数と排除されたテトリスのラインの数が徐々に増加していることである。また、実験では、最初の２つのエピソードでキーボードフィードバックによるエージェントの訓練を行い、３つ目のエピソードで顔のフィードバックを導入し、合計２０エピソードの訓練を行った。実験では、２つのエージェント（キーボードエージェントと感情エージェント）の学習性能を３つの指標で評価する。３つの評価は、エージェントの実行時間のステップ数、受け取ったフィードバックの数、エピソード内で排除されたテトリスの行の数である。 FIG. 16 shows the experimental results for the Tetris task, showing the total number of time steps, the number of feedback received, and the number of cleared lines by the keyboard agent and emotional agent.
In the Tetris task, agents are generated using random combinations of four small squares, and the time and probability of each combination are uncertain. The better the agent is at this task, the longer it runs and the more states and actions it can experience. Under human training, the agent can eventually learn to eliminate entire rows of Tetris pieces and continue playing the game. Therefore, the expectation for this task is that the number of time steps and the number of Tetris lines eliminated gradually increase while the amount of human feedback required decreases. In the experiment, the agent was trained using keyboard feedback in the first two episodes, and facial feedback was introduced in the third episode, for a total of 20 episodes of training. The learning performance of the two agents (keyboard agent and emotion agent) was evaluated using three indicators: the number of time steps the agent ran, the number of feedback signals received, and the number of Tetris lines eliminated within the episode.

グラフｇ４５１は、実験の各エピソードにおいて、キーボードエージェントと感情エージェントによる各エピソードにおける時間ステップ数を示す図である。グラフｇ４６１は、実験の各エピソードにおいて、キーボードエージェントと感情エージェントによる各エピソードにおけるフィードバックを受けた回数を示す図である。グラフｇ４７１は、実験の各エピソードにおいて、キーボードエージェントと感情エージェントによる各エピソードにおけるリアした行数を示す図である。
符号ｇ４５１、ｇ４６１およびｇ４７１のグラフにおいて横軸はエピソード数であり、符号ｇ４５１のグラフにおいて縦軸は時間ステップ数（回）であり、符号ｇ４６１のグラフにおいて縦軸はフィードバックを受けた回数（回）であり、符号ｇ４７１のグラフにおいて縦軸はクリアした行数（行）である。 Graph g451 is a diagram showing the number of time steps taken by the keyboard agent and the emotional agent in each episode of the experiment. Graph g461 is a diagram showing the number of times the keyboard agent and the emotional agent received feedback in each episode of the experiment. Graph g471 is a diagram showing the number of lines cleared by the keyboard agent and the emotional agent in each episode of the experiment.
In the graphs with symbols g451, g461, and g471, the horizontal axis is the number of episodes, in the graph with symbol g451, the vertical axis is the number of time steps (times), and in the graph with symbol g461, the vertical axis is the number of times feedback was received (times). In the graph with symbol g471, the vertical axis is the number of cleared lines (rows).

図１６のように、最初の２つのエピソードでは、時間ステップ数、フィードバックの受信数、ラインクリア数は、感情エージェントとキーボードエージェントの間でほぼ同じで、ゲーム内で数十行をクリアできる妥当なポリシーを獲得していた。４エピソードの訓練後の７エピソード以降、感情エージェントの時間ステップ数とクリアしたライン数は、フィードバックの数が若干増えたが、キーボードエージェントと同等であった。 As shown in FIG. 16, in the first two episodes, the number of time steps, the number of feedback signals received, and the number of lines cleared were almost the same for the emotional agent and the keyboard agent, and both had acquired a reasonable policy that could clear dozens of lines in the game. From the seventh episode onward, after four episodes of training, the number of time steps and the number of lines cleared by the emotional agent were comparable to those of the keyboard agent, although the number of feedback signals increased slightly.

次に、ＬｏｏｐＭａｚｅ課題とＴｅｔｒｉｓ課題において、キーボードエージェントとジェスチャーエージェントの学習性能を比較した。
図１７は、ＬｏｏｐＭａｚｅタスクに対する実験結果であり、キーボードエージェントとジェスチャーエージェントによる各エピソードにおける時間ステップ数とフィードバックを受けた回数を示す図である。
グラフｇ５０１は、ＬｏｏｐＭａｚｅタスクの１つのエピソードを終了するために、キーボードエージェントとジェスチャーエージェントによる各エピソードにおける時間ステップ数を示す図である。グラフｇ５１１は、各エピソードでキーボードエージェントとジェスチャーエージェントによる各エピソードにおけるフィードバックを受けた回数を示す図である。グラフｇ５０１とグラフｇ５１１において横軸はエピソード数であり、グラフｇ５０１において縦軸は時間ステップ数（回）であり、グラフｇ５１１において縦軸はフィードバックを受けた回数（回）である。また、符号ｇ５０２はキーボードエージェントの実験結果を示し、符号ｇ５０３はジェスチャーエージェントの実験結果を示し、符号ｇ５０４は平均値の標準誤差である。 Next, we compared the learning performance of the keyboard agent and gesture agent in the LoopMaze task and Tetris task.
FIG. 17 is an experimental result for the LoopMaze task, and is a diagram showing the number of time steps and the number of times feedback was received in each episode by the keyboard agent and gesture agent.
Graph g501 is a diagram showing the number of time steps in each episode by the keyboard agent and gesture agent to finish one episode of the LoopMaze task. A graph g511 is a diagram showing the number of times feedback is received in each episode by the keyboard agent and the gesture agent. In the graph g501 and the graph g511, the horizontal axis is the number of episodes, the vertical axis in the graph g501 is the number of time steps (times), and the vertical axis in the graph g511 is the number of times (times) feedback is received. Further, symbol g502 indicates the experimental result of the keyboard agent, symbol g503 indicates the experimental result of the gesture agent, and symbol g504 indicates the standard error of the mean value.

この実験では、独立した実験を１０ラウンド実施し、各ラウンドで２０エピソードを実行した。顔のフィードバックからの学習と同様に、ＬｏｏｐＭａｚｅタスクでは、時間ステップ数と人間のユーザから提供されたフィードバックの総数がエージェントのパフォーマンスの指標となる。ジェスチャーフィードバックにより、訓練時にエージェントが受け取るフィードバックの数を減らすことができると期待される。 In this experiment, 10 independent rounds of experiments were conducted, with 20 episodes running in each round. Similar to learning from facial feedback, in the LoopMaze task, the number of time steps and the total number of feedback provided by the human user are indicators of the agent's performance. Gesture feedback is expected to reduce the amount of feedback an agent receives during training.

図１７のように、最初の３つのエピソードでは、ジェスチャーエージェントの方がキーボードエージェントよりも多くの時間ステップを経験する必要がある。しかし、ジェスチャーエージェントは、フィードバックを受ける量がキーボードエージェントに比べて著しく少ない。３回のエピソードトレーニングの後、両方のエージェントは最高のポリシーを取得した。 As shown in Figure 17, in the first three episodes, the gesture agent needs to experience more time steps than the keyboard agent. However, gesture agents receive significantly less feedback than keyboard agents. After three episodes of training, both agents obtained the best policy.

次に、テトリスタスクにおいて、ジェスチャーエージェントとキーボードエージェントのテトリスタスクにおける学習性能を比較した。ここでは、顔のフィードバックからの学習と同じ３つの指標、すなわち、エージェントの実行時間のステップ数、受け取ったフィードバックの数、エピソード内で排除されたテトリスの行の数で、２つのエージェントの学習性能を評価した。 Next, we compared the learning performance of the gesture agent and the keyboard agent in the Tetris task. Here, the learning performance of the two agents was evaluated using the same three metrics as for learning from facial feedback: the number of time steps the agent ran, the number of feedback signals received, and the number of Tetris rows eliminated within an episode.

図１８は、Ｔｅｔｒｉｓタスクに対する実験結果であり、キーボードエージェントとジェスチャーエージェントによる時間ステップ数、フィードバックを受けた数、クリアした行数を示す図である。なお、実験では、１０個の独立したラウンドを繰り返し、各ラウンドで２０個のエピソードを実行し、時間ステップ数、受け取った総フィードバックの数、排除されたテトリスの行の数で測定された各エピソードのパフォーマンスを平均化した。 FIG. 18 is an experimental result for the Tetris task, and is a diagram showing the number of time steps, the number of feedback signals received, and the number of lines cleared by the keyboard agent and the gesture agent. Note that in the experiment, 10 independent rounds were repeated with 20 episodes executed in each round, and the performance of each episode, measured by the number of time steps, the total number of feedback signals received, and the number of Tetris lines eliminated, was averaged.

グラフｇ５５１は、テトリスタスクの各エピソードにおいて、キーボードエージェントとジェスチャーエージェントによる各エピソードにおける時間ステップ数を示す図である。グラフｇ５６１は、各エピソードにおいて、キーボードエージェントとジェスチャーエージェントによる各エピソードにおけるフィードバックを受けた回数を示す図である。グラフｇ５７１は、各エピソードにおいて、キーボードエージェントとジェスチャーエージェントによる各エピソードにおけるリアした行数を示す図である。
符号ｇ５５１、ｇ５６１およびｇ５７１のグラフにおいて横軸はエピソード数であり、符号ｇ５５１のグラフにおいて縦軸は時間ステップ数（回）であり、符号ｇ５６１のグラフにおいて縦軸はフィードバックを受けた回数（回）であり、符号ｇ５７１のグラフにおいて縦軸はクリアした行数（行）である。 Graph g551 is a diagram showing the number of time steps taken by the keyboard agent and the gesture agent in each episode of the Tetris task. Graph g561 is a diagram showing the number of times the keyboard agent and the gesture agent received feedback in each episode. Graph g571 is a diagram showing the number of lines cleared by the keyboard agent and the gesture agent in each episode.
In the graphs with symbols g551, g561, and g571, the horizontal axis is the number of episodes, in the graph with symbol g551, the vertical axis is the number of time steps (times), and in the graph with symbol g561, the vertical axis is the number of times feedback was received (times). In the graph with symbol g571, the vertical axis is the number of cleared lines (rows).

図１８のように、最初の４つエピソードで、ジェスチャーエージェントは、キーボードエージェントよりもジェスチャーフィードバックの受け取る数がはるかに少ない状態で、時間ステップ数がわずかに少なく、各エピソードで行数がわずかに少なくクリアされている。その後もジェスチャーエージェントは、キーボードエージェントよりもフィードバックを受ける量が少なくなった。９エピソード以後、両エージェントの学習性能はかなり近く、学習したポリシーを調整するために必要なフィードバックの回数も同等であった。 As shown in FIG. 18, in the first four episodes, the gesture agent received far less feedback than the keyboard agent, while running for slightly fewer time steps and clearing slightly fewer lines in each episode. After that, the gesture agent continued to receive less feedback than the keyboard agent. From the ninth episode onward, the learning performance of the two agents was quite close, and the number of feedback signals needed to adjust the learned policy was also comparable.

このことから、エージェントは、事前に定義されたキーストロークがなくても、ジェスチャーから人間の意図をよく理解することができ、限られた、あるいはそれよりも少ない数のフィードバックから学習して、情報を正確に取り込むことができる。すなわち、本実施形態を用いた実験によれば、人間のソーシャルシグナルを用いることで、人間のフィードバックが少ない（十分な認識精度がある）エージェントの学習効率を効果的に向上させることができることが確認できた。そして、実験結果のようにジェスチャーフィードバックから学習するエージェントは、キーボードのフィードバックから学習するエージェントと同じような性能を、受け取るフィードバックの量をはるかに少なくして得ることができる。また、実験結果のように顔のフィードバックから学習した場合、キーボードによるフィードバックから学習した場合と同様の性能を得ることができることを示している。 This shows that the agent can understand human intent well from gestures, even without predefined keystrokes, and can learn from a limited, or even smaller, number of feedback signals while capturing the information accurately. In other words, experiments using this embodiment confirmed that using human social signals can effectively improve the learning efficiency of an agent that receives little human feedback (given sufficient recognition accuracy). As the experimental results show, an agent that learns from gesture feedback can achieve performance similar to that of an agent that learns from keyboard feedback while receiving far less feedback. The experimental results also show that learning from facial feedback achieves performance similar to learning from keyboard feedback.

また、以上の実験結果より、人間のソーシャルシグナル（例えば顔の表情、ジェスチャー）を用いることで、人間のフィードバックが少ない（十分な認識精度がある）エージェントの学習効率を効果的に向上させる。これにより、利用者が事前に学習方法を学習する必要がなくなり、利用者の認知的負担や作業負荷を軽減することができる。これにより、本実施形態によれば、人間が事前に訓練方法を学習する必要がないので、一般の人が自分の好みに応じたタスクの実行方法をエージェントに訓練させる自然な方法を提供することができる。 In addition, the above experimental results show that using human social signals (e.g., facial expressions, gestures) effectively improves the learning efficiency of an agent that receives little human feedback (given sufficient recognition accuracy). This eliminates the need for the user to learn the training method in advance, and can reduce the user's cognitive burden and workload. As a result, according to the present embodiment, since humans do not need to learn training methods in advance, it is possible to provide a natural way for ordinary people to train agents to perform tasks according to their own preferences.

なお、上述した実験結果では、エージェントの一例として感情エージェントとジェスチャーエージェントを用いる例を説明したが、これに限らない。エージェントは、他の人間のソーシャルシグナルを用いてもよい。エージェントは、例えば音声による感情を用いた音声感情エージェントであってもよい。 Note that in the above-mentioned experimental results, an example in which an emotional agent and a gesture agent are used as an example of the agent is explained, but the present invention is not limited to this. Agents may also use other humans' social signals. The agent may be, for example, a voice emotion agent that uses voice emotions.

以上のように、本実施形態では、ロボットは、ＩＲＬを介して人間のデモンストレーションから学習し、次にＴＡＭＥＲを介して人間の報酬から学習する。このＩＲＬ－ＴＡＭＥＲでは、以下のＩ、ＩＩの順番に実行される２つのアルゴリズムで構成される。
Ｉ．ＩＲＬは、人間のトレーナーが提供するデモンストレーションから報酬関数を学習し、
ＩＩ．ＴＡＭＥＲは、人間の評価フィードバックから予測報酬モデルを学習する。
なお、人間によるデモンストレーションによるフィードバックは、画像による人間の表情認知、人間のジェスチャーの認知等である。 As described above, in this embodiment, the robot learns from human demonstrations via IRL and then from human rewards via TAMER. This IRL-TAMER consists of two algorithms that are executed in the following order: I and II.
I. IRL learns the reward function from demonstrations provided by human trainers,
II. TAMER learns predictive reward models from human rating feedback.
Note that the feedback provided by human demonstrations includes recognition of human facial expressions through images, recognition of human gestures, and the like.
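
The second stage (TAMER refining a reward model whose weights were initialized by IRL) can be sketched as below. This is a hedged illustration, not the patented formula: the equations referenced in the claims are not reproduced in this text, so the linear model, the error term `delta = h - w^T phi`, and all function names are assumptions based on the standard least-squares gradient form and on the symbol definitions given in the claims (learning rate α, parameter vector ω, basis functions φ).

```python
from typing import Callable, List, Sequence

Vector = List[float]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Predicted human reward for one state-action pair: w^T phi(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def tamer_update(weights: Sequence[float], features: Sequence[float],
                 human_reward: float, learning_rate: float) -> Vector:
    """One least-squares gradient step: w <- w + alpha * delta * phi."""
    # Assumed error term: human label minus the current prediction.
    delta = human_reward - predict(weights, features)
    return [w + learning_rate * delta * f for w, f in zip(weights, features)]


def select_action(weights: Sequence[float], state: object,
                  actions: Sequence[str],
                  phi: Callable[[object, str], Sequence[float]]) -> str:
    """Greedy choice: the action with the largest predicted human reward."""
    return max(actions, key=lambda a: predict(weights, phi(state, a)))
```

In this sketch the initial `weights` would come from the reward function learned by IRL, and each piece of human feedback then nudges the prediction for the evaluated state-action pair toward the received label.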

これにより、本実施形態によれば、人間（例えば利用者、訓練者）が提供するデモンストレーションと評価フィードバックの両方からロボットの自律的な行動学習を可能にすることができる。
この結果、本実施形態によれば、ロボットが人間によって提供されるデモンストレーションと評価フィードバックから学ぶことを可能にし、最適な動作を得るために必要な人間の評価の数、特に間違いの数（期待されていない行動）を減らすことができる。 As a result, according to the present embodiment, it is possible to enable the robot to autonomously learn behavior from both demonstrations and evaluation feedback provided by humans (for example, users, trainers).
As a result, the present embodiment allows the robot to learn from the demonstrations and evaluation feedback provided by a human, and can reduce the number of human evaluations required to obtain optimal behavior, in particular the number of mistakes (unexpected behaviors).

なお、実施形態ではロボット１を例に説明したが、エージェント等は、他の装置、例えば車載のナビゲーション装置、スマートフォン、タブレット端末等にも適用可能である。例えばスマートフォンに適用する場合は、スマートフォンの表示部上に、例えば図３のようなロボット１の静止画を表示させるようにしてもよい。または、スマートフォンの表示部上に、ロボット１の仕草をアニメーションで表示させるようにしてもよい。 Note that although the embodiment has been described using the robot 1 as an example, the agent and the like can also be applied to other devices, such as an in-vehicle navigation device, a smartphone, a tablet terminal, and the like. For example, when applied to a smartphone, a still image of the robot 1 as shown in FIG. 3 may be displayed on the display section of the smartphone. Alternatively, the gestures of the robot 1 may be displayed as an animation on the display section of the smartphone.

なお、本発明における行動制御装置１００の機能の全てまたは一部を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより行動制御装置１００が行う処理の全てまたは一部を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータシステム」は、ホームページ提供環境（あるいは表示環境）を備えたＷＷＷシステムも含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（ＲＡＭ）のように、一定時間プログラムを保持しているものも含むものとする。 Note that a program for realizing all or part of the functions of the behavior control device 100 according to the present invention is recorded on a computer-readable recording medium, and the program recorded on the recording medium is read into a computer system and executed. By doing so, all or part of the processing performed by the behavior control device 100 may be performed. Note that the "computer system" herein includes hardware such as an OS and peripheral devices. Furthermore, the term "computer system" includes a WWW system equipped with a home page providing environment (or display environment). Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. Furthermore, "computer-readable recording medium" refers to volatile memory (RAM) inside a computer system that serves as a server or client when a program is transmitted via a network such as the Internet or a communication line such as a telephone line. This also includes programs that are retained for a certain period of time.

また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであってもよい。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であってもよい。 Further, the program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line. Moreover, the above-mentioned program may be for realizing a part of the above-mentioned functions. Furthermore, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何等限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形および置換を加えることができる。 Although the mode for implementing the present invention has been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the gist of the present invention.

１…ロボット、１０１…操作部、１０２…撮影部、１０３…センサ、１０４…収音部、１００…行動制御装置、１０６…記憶部、１０７…データベース、１１１…表示部、１１２…スピーカー、１１３…アクチュエータ、１１５…ロボットセンサ、１０５…認知部、３００…エージェント、３０１…学習部、３０２…報酬学習管理部、３０３…割当評価部、３０４…行動選択部、３０４１…画像生成部、３０４２…音声生成部、３０４３…駆動部、３０４４…出力部、２１１…ＩＲＬアルゴリズム DESCRIPTION OF SYMBOLS 1...Robot, 101...Operation unit, 102...Photography unit, 103...Sensor, 104...Sound collection unit, 100...Behavior control device, 106...Storage unit, 107...Database, 111...Display unit, 112...Speaker, 113... Actuator, 115... Robot sensor, 105... Recognition unit, 300... Agent, 301... Learning unit, 302... Reward learning management unit, 303... Allocation evaluation unit, 304... Behavior selection unit, 3041... Image generation unit, 3042... Sound generation section, 3043...drive section, 3044...output section, 211...IRL algorithm

Claims

The demonstrated results of the sequence of pairs of states and actions of the trainer are given as a reward function to an inverse reinforcement learning module by an inverse reinforcement learning algorithm; a learning section that learns a reward function;
The agent selects an action that has a reward function using the weight vector as an initial value, and the trainer evaluates the state of the self-apparatus and the selected action. Calculate the probability, learn the approximate expected value of the reward value of the person receiving the interaction experience, learn to maximize the reward function of the person, and based on the learned result, calculate the predicted reward value of the person An agent that selects the action with the largest
A behavior control device comprising:

The agent is
modifying the behavior using a reward function learned by the learning unit;
learning a predictive reward model based on feedback information from the trainer;
The behavior control device according to claim 1.

The agent includes a reward learning management section, an allocation evaluation section, and an action selection section,
The assignment evaluation unit calculates the probability of the previously selected action based on the feedback from the trainer, and sets the state, the action, and the probability of the previously selected action as a supervised learning sample;
The reward learning management unit acquires the reward function generated by the learning unit, acquires the supervised learning sample output by the allocation evaluation unit, learns the predictive reward model, and calculates the learned prediction. updating the reward function using a reward model;
The behavior selection unit selects the behavior based on information fed back from the trainer and the reward learning management unit.
The behavior control device according to claim 2.

The behavior control device according to any one of claims 1 to 3, wherein the agent
estimates the state of the environment, expressed by the direction of the person's voice, the direction of the person's face, the orientation of the person's body, and the current orientation of the device itself, and, by selecting the action having the reward function with the largest predicted reward value, selects an action that turns the device's face toward the person of interest.
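
The selection step in this claim can be illustrated with a minimal sketch that picks the action whose predicted reward is largest under a linear reward model. The action set, the angle encoding of the state, the basis functions in `phi`, and the weight values are all hypothetical and do not come from the patent.

```python
# Hypothetical action set: change in the device's heading, in degrees.
ACTIONS = {"turn_left": -30.0, "stay": 0.0, "turn_right": 30.0}


def wrap(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def phi(state, action):
    """Assumed basis functions for one state-action pair.

    state = (voice_dir, face_dir, body_dir, device_dir), all in degrees.
    Each feature is the negated, normalized offset between one
    person-related angle and the heading the device would have after
    taking the action. This encoding is illustrative only.
    """
    voice_dir, face_dir, body_dir, device_dir = state
    new_heading = device_dir + ACTIONS[action]
    return [-abs(wrap(d - new_heading)) / 180.0
            for d in (voice_dir, face_dir, body_dir)]


def predicted_reward(weights, state, action):
    """Linear predicted reward: dot product of the weights and phi."""
    return sum(w * f for w, f in zip(weights, phi(state, action)))


def select_action(weights, state):
    """Return the action whose predicted reward is largest."""
    return max(ACTIONS, key=lambda a: predicted_reward(weights, state, a))


# Usage: the person is at 40 degrees while the device faces 0 degrees.
state = (40.0, 40.0, 40.0, 0.0)
weights = [1.0, 1.0, 1.0]  # hypothetical learned weight vector
print(select_action(weights, state))
```

With these toy values the selected action is the one that moves the device's heading closest to the person's direction.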

The behavior control device according to claim 3, wherein
the reward learning management unit uses the calculated probability $\hat{h}$ and the state-action pair as a supervised learning sample and, by updating the parameters with the following equation based on the least-squares gradient, learns a function $R^H(s, a)$ that approximates the expected value of the reward of the person receiving the interaction experience,

[equation not reproduced in this text]

where $h$ is the human reward label received by the agent at any time step $t$, $\alpha$ is the learning rate, $s$ is the state representation, $a$ is the selected action, $\vec{\omega} = (\omega_0, \ldots, \omega_{m-1})^T$ is a column parameter vector, $\{(s_0, a_0), \ldots, (s_n, a_n)\}$ are the state-action pairs, $\phi_i(\vec{x})$ is a basis function with $\phi(\vec{x}) = (\phi_0(\vec{x}), \ldots, \phi_{m-1}(\vec{x}))^T$, and $m$ is the total number of parameters, with $i = 0, \ldots, m-1$, and
$\delta_t$ is the temporal difference error, given by the following equation:

[equation not reproduced in this text]
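
The equations of this claim are not reproduced in this text. As a rough illustration only, the sketch below implements one common least-squares gradient update that is consistent with the symbols listed above, under the assumption that $R^H(s, a)$ is linear in the basis functions: the error is taken as $\delta_t = h - \vec{\omega}^T \phi(\vec{x})$ and the parameters move by $\alpha \delta_t \phi(\vec{x})$. This assumed form is not the claimed formula, and the function names are hypothetical.

```python
import numpy as np


def predict_reward(w, phi_x):
    """Assumed linear model: R^H(s, a) = w^T phi(x)."""
    return float(w @ phi_x)


def update(w, phi_x, h, alpha):
    """One least-squares gradient step toward the human reward label h.

    Returns the updated parameter vector and the error delta
    (assumed here to be h minus the current prediction).
    """
    delta = h - predict_reward(w, phi_x)
    return w + alpha * delta * phi_x, delta


# Usage with toy values: two parameters, one active basis function.
w = np.zeros(2)
phi_x = np.array([1.0, 0.0])
for _ in range(3):
    w, delta = update(w, phi_x, h=1.0, alpha=0.5)
    print(w, delta)
```

Each step halves the remaining error in this toy case, so the prediction for the sampled state-action pair approaches the label $h$.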

A behavior control method, wherein
a learning unit, to whose inverse reinforcement learning module demonstration results, each a sequence of state-action pairs of a trainer, are given, learns a reward function including a weight vector by an inverse reinforcement learning algorithm, and
an agent selects an action having a reward function that uses the weight vector as an initial value, calculates the probability of the previously selected action using, as feedback, the trainer's evaluation of the state of the device itself and the selected action, learns an approximation of the expected value of the reward of the person receiving the interaction experience, learns to maximize that person's reward function, and, based on the learned result, selects the action with the largest predicted reward value.

A program that causes a computer to:
give demonstration results, each a sequence of state-action pairs of a trainer, to an inverse reinforcement learning module and learn a reward function from the demonstration results by an inverse reinforcement learning algorithm; and
cause an agent to select an action having a reward function that uses the weight vector as an initial value, calculate a probability based on the trainer's evaluation of the state of the device itself and the selected action, learn an approximation of the expected value of the reward of the person receiving the interaction experience, learn to maximize that person's reward function, and, based on the learned result, select the action with the largest predicted reward value for the person.