JP7628037B2

JP7628037B2 - Reinforcement learning device, reinforcement learning method, and program

Info

Publication number: JP7628037B2
Application number: JP2021040746A
Authority: JP
Inventors: 研一中里
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2025-02-07
Anticipated expiration: 2041-03-12
Also published as: JP2022140092A; DE102022201574A1

Description

The present invention relates to a reinforcement learning device, a reinforcement learning method, and a program.

Conventionally, reinforcement learning has been used to accomplish a given task. Reinforcement learning is a method of learning an agent's behavior so as to maximize the cumulative value of a series of the agent's actions in an environment in which a task is given. For example, reinforcement learning has been applied to games, motor control, and automated driving control of vehicles (see Patent Documents 1 and 2).

Patent Document 1: JP 2018-63602 A
Patent Document 2: JP 2020-144483 A

In general reinforcement learning, the agent's actions are selected randomly or probabilistically, and trial and error is repeated. Because it tends to take a long time for the cumulative value to converge near its maximum, more efficient learning is needed.

In particular, learning time tends to be long in environments where no reward is granted for the agent's actions until the task is accomplished. The value of an agent's action is calculated from the reward granted by the environment for that action, and the calculated value serves as a guideline for selecting the agent's subsequent actions. If no reward is granted in the course of accomplishing the task, however, the value of the actions taken during that time does not change, no guideline for selecting actions is obtained, and trial and error increases.

An object of the present invention is to improve the learning efficiency of reinforcement learning.

One aspect of the present invention is a reinforcement learning device (10) that repeatedly performs episodes in which actions of an agent are selected until a task is accomplished in a given environment, and that learns the agent's actions so as to maximize the cumulative value of the series of actions in one episode. The reinforcement learning device (10) includes: an action selection unit (111) that repeatedly selects an action of the agent; a calculation processing unit (112) that, each time an action is selected, calculates the value (Q) of the selected action using a reward (R) granted for the selected action; and a task control unit (113) that defines one or more subtasks from the task and defines a second reward (M) to be granted according to the degree of achievement of the subtasks. The calculation processing unit (112) obtains the second reward (M) defined for the selected action and a first reward (r) granted from the environment, and calculates the reward (R) by adding the second reward (M) to the first reward (r).

Another aspect of the present invention is a reinforcement learning method that repeatedly performs episodes in which actions of an agent are selected until a task is accomplished in a given environment, and that learns the agent's actions so as to maximize the cumulative value of the series of actions in one episode. The reinforcement learning method includes: a step of repeatedly selecting an action of the agent; a step of calculating, each time an action is selected, the value (Q) of the selected action using a reward (R) granted for the selected action; and a step of defining one or more subtasks from the task and defining a second reward (M) to be granted according to the degree of achievement of the subtasks. The step of calculating the value (Q) includes a step of obtaining the second reward (M) defined for the selected action and a first reward (r) granted from the environment, and a step of calculating the reward (R) by adding the second reward (M) to the first reward (r).

Another aspect of the present invention is a program for causing a computer to execute a reinforcement learning method that repeatedly performs episodes in which actions of an agent are selected until a task is accomplished in a given environment, and that learns the agent's actions so as to maximize the cumulative value of the series of actions in one episode. The reinforcement learning method includes: a step of repeatedly selecting an action of the agent; a step of calculating, each time an action is selected, the value (Q) of the selected action using a reward (R) granted for the selected action; and a step of defining one or more subtasks from the task and defining a second reward (M) to be granted according to the degree of achievement of the subtasks. The step of calculating the value (Q) includes a step of obtaining the second reward (M) defined for the selected action and a first reward (r) granted from the environment, and a step of calculating the reward (R) by adding the second reward (M) to the first reward (r).

According to the present invention, the learning efficiency of reinforcement learning can be improved.

FIG. 1 is a block diagram showing the configuration of the reinforcement learning device of the present embodiment.
FIG. 2 is a flowchart showing the reinforcement learning process executed in the reinforcement learning device.
FIG. 3 is a diagram showing a game area as an example of an environment.
FIG. 4 is a diagram showing an example of subtasks.
FIG. 5 is a diagram showing an example of a second reward table.
FIG. 6 is a diagram showing another example of a second reward table.
FIG. 7 is a diagram showing an example of a value table.
FIG. 8 is a diagram showing an example of a table of values calculated for each subtask.

An embodiment of the reinforcement learning device, reinforcement learning method, and program of the present invention is described below with reference to the drawings. The following description is one example (a representative example) of the present invention, and the present invention is not limited to it.

FIG. 1 shows the configuration of a reinforcement learning device 10 according to an embodiment of the present invention.
The reinforcement learning device 10 includes a CPU (Central Processing Unit) 11 and a storage unit 12. The reinforcement learning device 10 may further include an operation unit 13, a display unit 14, and a communication unit 15.

The CPU 11 executes the reinforcement learning process described below by reading a program from the storage unit 12 and executing it. In the reinforcement learning process, the CPU 11 functions as an action selection unit 111, a calculation processing unit 112, and a task control unit 113.

The action selection unit 111 selects an action of the agent in an environment in which a task is given. The calculation processing unit 112 calculates the value of the action selected by the action selection unit 111 using the reward granted for that action. The task control unit 113 defines subtasks from the given task.

The storage unit 12 stores programs readable by the CPU 11, tables used in executing the programs, and the like. A recording medium such as a hard disk can be used as the storage unit 12.

The operation unit 13 is a keyboard, a mouse, or the like. The operation unit 13 accepts user operations and outputs their contents to the CPU 11.

The display unit 14 is a display or the like. The display unit 14 displays an operation screen, processing results of the CPU 11, and the like in accordance with display instructions from the CPU 11.

The communication unit 15 is an interface that communicates with external computers via a network.

In the reinforcement learning device 10, the CPU 11 can determine a policy for accomplishing a given task through reinforcement learning. This embodiment describes an example using Q-learning, which is one type of reinforcement learning.

(General reinforcement learning method)
In Q-learning, an agent is placed in an environment in which a task is given. An agent is the entity that acts. The agent selects one action (a) from the multiple actions (a) that can be taken in a given state (s_t) of the environment. The selected action (a) causes the state (s_t) of the environment to transition to the state (s_{t+1}).

Each action (a) of the agent is associated with a value (Q) that evaluates that action (a). The value (Q) is calculated by the action value function Q expressed by the following formula (10).
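The formula image itself is not reproduced in this text. Assuming formula (10) is the standard Q-learning update, which is what the symbol definitions in this section suggest, it could be written as follows. This is a reconstruction under that assumption, not a copy of the published formula.

```latex
Q(s_t, a) \leftarrow Q(s_t, a)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a) \right]
\tag{10}
```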

In formula (10), s_t represents the state (s) of the environment at time t. s_{t+1} represents the state (s) one step later, reached through the action (a) taken in state (s_t). r_{t+1} represents the reward (r) granted by the environment in response to the action (a) in state (s_t). α represents the learning rate and satisfies 0 < α ≤ 1. γ represents the discount rate and satisfies 0 < γ ≤ 1. max Q(s_{t+1}, a) represents a function that outputs the maximum of the values Q(s_{t+1}, a) of the actions (a) that can be taken in state (s_{t+1}).

The agent continues to take actions (a) until the environment transitions from the initial state (s_0) to the final state (s_e). This series of actions (a) of the agent from the initial state (s_0) to the final state (s_e) is called an episode. By repeatedly performing episodes, the value (Q) of the action (a) in each state (s) is calculated and updated in sequence. The series of actions with the greatest cumulative value after a certain number of episodes can be learned as the optimal behavior for the given environment.

In general Q-learning, episodes must be repeated by trial and error until the cumulative value converges near its maximum, so learning time tends to be long. This is particularly true in environments where no reward (r) is granted until the task is accomplished. The value (Q) is weighted by the reward (r) and serves as a guideline for the agent's subsequent selection of actions (a). If the reward (r) granted before the task is accomplished is 0, however, the value (Q) of the actions (a) does not change until the task is accomplished, and trial and error increases.

In contrast, the reinforcement learning device 10 of this embodiment defines subtasks, each of which is a part of the given task, and grants a reward according to the actions (a) that accomplish those subtasks. In other words, the reinforcement learning device 10 calculates the value (Q) using not only the reward (r) given by the environment but also a reward that depends on the degree of achievement of the subtasks. Because the value (Q), which guides the selection of actions (a), is weighted even before the task is accomplished, trial and error is reduced and learning becomes more efficient. Hereinafter, the conventional reward (r) given by the environment is referred to as the first reward (r), and the reward added to this first reward (r) is referred to as the second reward (M).

(Reinforcement learning method of this embodiment)
FIG. 2 shows the flow of the reinforcement learning process in the reinforcement learning device 10. This reinforcement learning process is executed by the CPU 11 reading a program from the storage unit 12.
A game is described below as an example of a task given to the reinforcement learning process. FIG. 3 shows a game area 30 of 6×6 blocks given as the environment of the game.

In FIG. 3, each block is identified by a block number (ij). i is a number from 0 to 5 representing the row of the game area 30. j is a number from 0 to 5 representing the column of the game area 30. For example, the block in the second row and first column is denoted block (10).

A brush 20 is placed in the game area 30. The brush 20 can move one block at a time up, down, left, or right from the current block (ij) and clean the block (ij) it moves to. The color of a cleaned block (ij) changes from black to white.

The task of this game is to change the color of all blocks (ij) from black to white. The shorter the total distance traveled by the brush 20, the more efficiently the task is accomplished. The first reward (r) granted from the game area 30 when all blocks (ij) have turned white is 100 points. While any black block (ij) remains, the first reward (r) granted for any action (a) is 0 points. Information on the first reward (r) is given together with the game area 30 and is stored in the storage unit 12.

In this task, the game area 30 is the given environment. In the initial state (s_0) of the environment, the number of white blocks is 0 and the number of remaining black blocks is 36. The action (a) of moving the brush 20 causes the state (s) of the environment, that is, the color state (s) of the blocks (ij), to transition. Since the game ends when the task is accomplished, the final state (s_e) of the environment is the state in which all blocks are white, that is, the number of white blocks is 36 and the number of remaining black blocks is 0.
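As an illustration of the environment described above, the game area could be modeled roughly as follows. This is a minimal sketch, not the implementation of the embodiment: the class name `GameArea`, its method names, and the representation of blocks as `(i, j)` tuples are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

Block = Tuple[int, int]
# One-block moves of the brush: (row offset, column offset).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class GameArea:
    """Hypothetical model of game area 30: a visited block turns from black to white."""
    size: int = 6                                        # 6x6 blocks in the description
    white: Set[Block] = field(default_factory=set)       # blocks already cleaned
    brush: Optional[Block] = None                        # None until the brush is placed

    def remaining_black(self) -> int:
        return self.size * self.size - len(self.white)

    def legal_moves(self):
        i, j = self.brush
        return [name for name, (di, dj) in MOVES.items()
                if 0 <= i + di < self.size and 0 <= j + dj < self.size]

    def place(self, block: Block) -> None:
        """Initial action: put the brush on a block, which cleans that block."""
        self.brush = block
        self.white.add(block)

    def step(self, move: str) -> int:
        """Move the brush one block and return the first reward r."""
        if move not in self.legal_moves():
            raise ValueError(f"illegal move: {move}")
        di, dj = MOVES[move]
        self.brush = (self.brush[0] + di, self.brush[1] + dj)
        self.white.add(self.brush)
        # 100 points only when every block is white, otherwise 0 points.
        return 100 if self.remaining_black() == 0 else 0
```

With this sketch, placing the brush on block (00) leaves 35 black blocks, and every subsequent move returns a first reward of 0 until the last black block is cleaned.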

The reinforcement learning device 10, acting as the player of the game, repeatedly selects an action (a) of the agent that moves the brush 20 and calculates the value (Q) of that action. In this way, the reinforcement learning device 10 searches for the actions (a) that maximize the cumulative value (Q) calculated over the series of actions (a) taken until all blocks (ij) have changed from black to white.

First, the task control unit 113 of the CPU 11 defines one or more subtasks from the given task (hereinafter referred to as the main task) (step S1). A subtask is a part of the main task. In other words, accomplishing a subtask accomplishes part of the main task.

FIG. 4 shows an example of subtasks.
In this example, the task control unit 113 defines three subtasks from the main task of changing the color of the game area 30 of 6×6 blocks, and assigns them the IDs 01, 02, and 03. The subtask with ID=01 is to change the color of an area 31 of 4×2 blocks. The subtask with ID=02 is to change the color of an area 32 of 2×4 blocks. The subtask with ID=03 is to change the color of an area 33 of 3×4 blocks.

Part of each subtask may overlap with part of another subtask. In the example of FIG. 4, area 32 of the subtask with ID=02 partially overlaps area 33 of the subtask with ID=03. Alternatively, multiple non-overlapping subtasks may be defined by dividing the main task into several parts.

With the subtasks defined, the environment has a state (s) in the main task and a state (m) in each subtask. The state (s) has a plurality of elements representing the state of the main task and is expressed as s = {x_1, x_2, ..., x_n}. The state (m) has a plurality of elements representing the state of the subtask and is expressed as m = {y_1, y_2, ..., y_n}. Each element can be determined arbitrarily. For example, the elements x_n of the state (s) of the main task include the block numbers (ij) in the game area 30 that have turned white through movement of the brush 20, the number of remaining black blocks, and the like. Likewise, the elements y_n of the state (m) of a subtask include the block numbers (ij) within that subtask's area that have turned white, the number of remaining black blocks, and the like.

For example, in the game area 30 shown in FIG. 4, the main task is in a state (s) in which the block numbers that have turned white are 00 and 01 and the number of remaining black blocks is 34. The subtask with ID=01 is in a state (m) in which no block in area 31 has turned white yet and the number of remaining black blocks is 8. The subtask with ID=03 is likewise in a state (m) in which no block in area 33 has turned white and the number of remaining black blocks is 12. The subtask with ID=02 is in a state (m) in which the block numbers that have turned white within area 32 are 00 and 01 and the number of remaining black blocks is 6.
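The subtask states in this example can be reproduced with a small sketch. FIG. 4 is not available in this text, so the positions of areas 31 to 33 below are assumptions chosen only to match the stated sizes, the overlap between areas 32 and 33, and the fact that blocks (00) and (01) lie in area 32. The names `area`, `SUBTASK_AREAS`, and `subtask_state` are hypothetical.

```python
def area(rows, cols):
    """Set of block numbers (i, j) covered by a rectangular area."""
    return {(i, j) for i in rows for j in cols}


# Assumed layout; the actual positions in FIG. 4 may differ.
SUBTASK_AREAS = {
    "01": area(range(2, 6), range(0, 2)),  # area 31: 4x2 blocks
    "02": area(range(0, 2), range(0, 4)),  # area 32: 2x4 blocks, contains (0,0) and (0,1)
    "03": area(range(1, 4), range(2, 6)),  # area 33: 3x4 blocks, overlaps area 32
}


def subtask_state(white):
    """Number of remaining black blocks in each subtask area, keyed by subtask ID."""
    return {sid: len(blocks - white) for sid, blocks in SUBTASK_AREAS.items()}


print(subtask_state({(0, 0), (0, 1)}))  # {'01': 8, '02': 6, '03': 12}
```

Only the remaining-black count is tracked here; the description also allows other elements, such as the list of whitened block numbers, to be part of the state (m).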

Next, the task control unit 113 defines a second reward (M) to be granted for the agent's actions (a) according to the degree of achievement of the subtasks (step S2). The task control unit 113 stores the defined second reward (M) in the storage unit 12.

The task control unit 113 can define the second reward (M) independently for each subtask. The second rewards (M) of the subtasks may have the same definition or different definitions.

FIG. 5 shows an example of a table T31 that holds the second reward (M) defined for the subtask with ID=01. In FIG. 5, the items for the states (s_t, m) and (s_{t+1}, m) show only the number of remaining black blocks in area 31 among the elements representing the state.

In table T31, a second reward (M) of 3 points is associated with any action (a) that transitions from a state (s_t, m) to a state (s_{t+1}, m) and changes the color of a block. For example, the action (a) that transitions from the state (s_t, m) in which 8 black blocks remain in area 31 to the state (s_{t+1}, m) in which 7 remain is associated with a second reward (M) of 3 points.

On the other hand, an action (a) that does not change the number of remaining blocks is associated with a second reward (M) of 0 points. Under this definition, a second reward (M) of 3 points is granted each time an action (a) of moving the brush 20 reduces the number of remaining black blocks in area 31 by 1.

In this embodiment, the definition of the second reward (M) for the subtask with ID=02 is the same as that for the subtask with ID=01. Accordingly, the table T32 of the second reward (M) for the subtask with ID=02 has the same structure as table T31.

FIG. 6 shows an example of a table T33 that holds the second reward (M) defined for the subtask with ID=03. In FIG. 6, the items for the states (s_t, m) and (s_{t+1}, m) show only the number of remaining black blocks in area 33 among the elements representing the state of the subtask.
In table T33, a second reward (M) of 36 points is associated with the action (a) that transitions from the state (s_t, m) in which 1 black block remains in area 33 to the state (s_{t+1}, m) in which 0 remain. A second reward (M) of 0 points is associated with all other actions (a).

Under this definition, the second reward (M) granted for an action (a) that turns only part of area 33 white is 0 points. A second reward (M) of 36 points is granted for the action (a) that turns the last black block white. In other words, the second reward (M) is not granted each time a block changes color but is granted all at once when the last block in the area changes color.

Note that FIGS. 5 and 6 show examples of definitions of the second reward (M), and the method of defining the second reward (M) is not limited to these. Other definitions can be adopted depending on the content of the task.
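The two definitions of the second reward (M) described for tables T31/T32 and T33 can be expressed as small functions of the remaining-black count before and after an action. This is a sketch only; the function names and the `points` parameters are hypothetical, and a table lookup as in FIGS. 5 and 6 would serve equally well.

```python
def reward_per_block(remaining_before, remaining_after, points=3):
    """T31/T32 style: `points` for each black block removed from the area by the action."""
    return points * max(remaining_before - remaining_after, 0)


def reward_on_completion(remaining_before, remaining_after, points=36):
    """T33 style: `points` only for the action that whitens the last black block."""
    return points if remaining_before == 1 and remaining_after == 0 else 0


print(reward_per_block(8, 7))      # 3
print(reward_per_block(7, 7))      # 0
print(reward_on_completion(5, 4))  # 0
print(reward_on_completion(1, 0))  # 36
```

Note that 36 points equals 3 points for each of the 12 blocks of area 33, so for that area the two definitions differ in when the reward is granted rather than in its total.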

When the definition of the subtasks and the second reward (M) is complete, the action selection unit 111 initializes the game and starts an episode (step S3). Initialization resets the environment to the initial state (s_0). That is, the color of all blocks changes from white to black, and the number of remaining black blocks is reset to 36.

The action selection unit 111 selects one action (a) from among the actions (a) that the agent can take in the current state (s_t, m) (step S4). The selected action (a) changes the game area 30 from the state (s_t, m) to the state (s_{t+1}, m).

For example, as shown in FIG. 3, the action (a) of placing the brush 20 on block (00) is selected in the initial state (s_0). Because the color of block (00) changes to white, the environment changes from the state (s_0, m) in which 36 black blocks remain to the state (s_1, m) in which 35 black blocks remain.

In this embodiment, the action (a) is selected probabilistically by the ε-greedy method. Specifically, with a fixed probability ε, the action selection unit 111 randomly selects one of the actions (a) that can be taken in the state (s_t, m). With probability (1-ε), the action selection unit 111 selects, from among the actions (a) that can be taken in the state (s_t, m), the action (a) for which the value (Q) of the action in the next state (s_{t+1}, m) is largest. In other words, the action (a) that yields max Q(s_{t+1}, m, a) is selected.

If the action (a) that yields max Q(s_{t+1}, m, a) is always selected, learning may stagnate. This is because excluding actions (a) with a low value (Q) can cause the agent to miss later actions (a) with a higher value (Q). The ε-greedy method deliberately selects, with a fixed probability, a random action (a), which may be an action (a) with a low value (Q). This widens the range of actions that can be selected and makes learning more efficient.
The method of selecting the action (a) is not limited to the ε-greedy method; other methods such as the softmax method can be adopted depending on the purpose.
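The ε-greedy selection described above can be sketched as follows. The function name `epsilon_greedy` and its parameters are hypothetical; `q_values` is assumed to map each action available in the current state to the value the description associates with it, and how that value is looked up is left to the caller.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore: any available action
    return max(actions, key=q_values.get)   # exploit: action with the largest value Q


q = {"up": 0.0, "down": 2.5, "left": 1.0}
print(epsilon_greedy(q, epsilon=0.0))  # down
```

With ε = 0 the selection is purely greedy, and with ε = 1 it is purely random; values in between trade off exploration against exploitation.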

When the action (a) has been selected, the calculation processing unit 112 calculates the value (Q) of the action (a) selected in the state (s_t, m) (step S5).
The following formula (1) shows the action value function Q used to calculate the value (Q) in this embodiment. R in formula (1) represents the reward granted for the action (a) selected in the state (s_t, m). The following formula (2) shows the reward function used to calculate the reward (R).
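Formulas (1) and (2) are not reproduced in this text. A reconstruction consistent with the surrounding description might read as follows, assuming that formula (1) keeps the form of the standard Q-learning update with the state extended to (s_t, m), and that the coefficient τ scales the second reward. The placement of τ in particular is an assumption; the text states only that 0 ≤ τ ≤ 1 and that R is obtained by adding M to r.

```latex
Q(s_t, m, a) \leftarrow Q(s_t, m, a)
  + \alpha \left[ R + \gamma \max_{a} Q(s_{t+1}, m, a) - Q(s_t, m, a) \right]
\tag{1}

R = r_{t+1} + \tau \, M(s_{t+1}, m)
\tag{2}
```

Under this reading, τ = 1 gives R = r + M, and τ = 0 reduces formula (1) to ordinary Q-learning on the first reward alone.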

In formula (2), r_{t+1} represents the first reward (r) associated with the action (a) in state (s_t). M(s_{t+1}, m) represents the second reward (M) associated with the action (a) in state (s_t). τ represents a coefficient satisfying 0 ≤ τ ≤ 1. The definitions of γ, α, and max are the same as in formula (10).

First, the calculation processing unit 112 obtains the first reward (r) and the second reward (M) to be used in calculating the value (Q). The calculation processing unit 112 calculates the reward (R) by adding the second reward (M) to the obtained first reward (r). When second rewards (M) are granted from multiple subtasks, the calculation processing unit 112 sums them and uses the total in calculating the reward (R).

For example, the action (a) of moving the brush 20 from block (00) to block (01) yields a first reward (r) of 0 points from the environment. This action (a) changes the number of remaining black blocks in area 32 from 7 to 6, so a second reward (M) of 3 points is obtained from table T32. The second reward (M) obtained from the subtasks of areas 31 and 33, where no block changes color, is 0 points. The reward (R) is therefore calculated as 3 points.
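The arithmetic of this example can be written out as a short sketch. The function name `total_reward` is hypothetical, and applying the coefficient τ to the summed second rewards is an assumption; with τ = 1 the result matches the 3 points stated in the description.

```python
def total_reward(first_reward, second_rewards, tau=1.0):
    """R = r + tau * (sum of second rewards M over all subtasks); tau placement is assumed."""
    return first_reward + tau * sum(second_rewards.values())


# Move from block (00) to block (01): r = 0, and only area 32 loses a black block.
R = total_reward(0, {"area 31": 0, "area 32": 3, "area 33": 0})
print(R)  # 3.0
```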

Next, the calculation processing unit 112 uses the calculated reward (R) to calculate the value (Q) as shown in formula (1). The calculation processing unit 112 stores the calculated value (Q) in the storage unit 12.

FIG. 7 shows an example of a table Tq that holds the values (Q) in the storage unit 12.
In table Tq, the calculated value (Q) is associated with each action (a). Each action (a) is also associated with the block numbers (ij) that have turned white and the number n of remaining black blocks in the game area 30, as the state (s_{t+1}) reached through that action (a). These entries are written each time an action (a) is selected and its value (Q) is calculated.

If the selected action (a) has not brought the environment to the final state (s_e) (step S6: NO), the action selection unit 111 sets the state (s_{t+1}, m) reached by the action (a) as the current state (s_t, m) (step S7). The processes of steps S4 and S5 are then repeated. That is, the selection of an action (a) and the calculation of its value (Q) are repeated until the final state (s_e) is reached. As a result, the calculated values (Q) are stored in table Tq one after another.

When the selected action (a) brings the environment to the final state (s_e) (step S6: YES), one episode ends. When a certain number of episodes have been performed (step S8: YES), the reinforcement learning process ends. This number can be set arbitrarily.

On the other hand, if the certain number of episodes has not yet been performed (step S8: NO), the process returns to step S3 and a new episode is started. In other words, steps S3 to S7 are repeated until the certain number of episodes has been performed, and the value (Q) is updated by the series of actions taken during those episodes.
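The loop over steps S3 to S8 can be sketched on a toy problem. The one-dimensional four-block game area, the epsilon-greedy selection, the reward values, and all names below are assumptions made for illustration; they are not the grid of the embodiment. The reward is assumed to take the form R = r + τ·M.

```python
import random
from collections import defaultdict

N_BLOCKS = 4        # toy 1-D game area (hypothetical)
ACTIONS = (-1, +1)  # move the brush left or right


def step(pos, painted, action):
    """Apply an action; return next position, painted set, r, M, and done flag."""
    new_pos = min(max(pos + action, 0), N_BLOCKS - 1)
    newly_painted = new_pos not in painted
    new_painted = painted | {new_pos}
    done = len(new_painted) == N_BLOCKS
    first_reward = 10.0 if done else 0.0           # granted only on task completion
    second_reward = 1.0 if newly_painted else 0.0  # subtask progress
    return new_pos, new_painted, first_reward, second_reward, done


def train(episodes=50, alpha=0.5, gamma=0.9, epsilon=0.2, tau=1.0,
          max_steps=200, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):             # S8: repeat for a fixed number of episodes
        pos, painted = 0, frozenset({0})  # S3: start a new episode
        for _ in range(max_steps):
            state = (pos, painted)
            if rng.random() < epsilon:    # S4: select an action
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            pos, painted, r, m, done = step(pos, painted, action)
            reward = r + tau * m          # assumed form of the reward R
            nxt = (pos, painted)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # S5: calculate and store the value Q of the selected action
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            if done:                      # S6: final state reached, episode ends
                break
            # S7: pos and painted already hold the transitioned state
    return q


q_values = train()
```

In this toy setting, the second reward gives the agent a non-zero learning signal on the first newly painted block, well before the first reward is granted at the end of the episode.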

In this way, the second reward (M), granted according to the degree of achievement of the subtasks, raises the reward (R), and the calculated value (Q) rises accordingly. Even while the first reward (r) leaves the value (Q) unchanged, the second reward (M) changes the value (Q), so the action (a) can still be evaluated. This reduces trial and error and makes it possible to efficiently learn actions (a) that accomplish the task.

When calculating the reward (R), the calculation processing unit 112 can adjust the rate (τ) at which the second reward (M) is added by changing the value of the coefficient τ. The larger the rate (τ), the larger the share of the second reward (M) added to the first reward (r). The calculation processing unit 112 can increase the rate (τ) when prioritizing more efficient learning through the second reward (M), and decrease it when prioritizing learning by trial and error.

The calculation processing unit 112 preferably reduces the rate (τ) as the number of performed episodes increases. Granting the second reward (M) guides the agent's actions (a) toward accomplishing the subtasks, but it also reduces the selection of new actions (a), so the agent's actions (a) tend to be biased toward those that earn the second reward (M). While the number of episodes is still small, learning can therefore be made more efficient by increasing the rate (τ) of the second reward (M). After a certain number of episodes have been performed, reducing the rate (τ) and deliberately letting the agent act randomly makes it possible to learn actions (a) with a higher value (Q), which improves learning efficiency.

The calculation processing unit 112 can eventually reduce the rate (τ) to 0. This allows the learning to converge to a result similar to that of normal Q-learning. The calculation processing unit 112 may reduce the rate (τ) monotonically, or may temporarily increase it in the course of the reduction.
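One possible schedule for the rate (τ) is sketched below. The linear decay and the function name `tau_schedule` are assumptions; the description only requires that τ stay within 0 ≤ τ ≤ 1, that it preferably decrease with the episode count, and that it may end at 0.

```python
def tau_schedule(episode, total_episodes, tau_start=1.0):
    """Linearly decay tau from tau_start at episode 0 to 0 at the last episode."""
    if total_episodes <= 1:
        return 0.0
    fraction = episode / (total_episodes - 1)
    return max(0.0, tau_start * (1.0 - fraction))


# tau over a run of 101 episodes: full weight at the start, none at the end.
taus = [tau_schedule(e, 101) for e in (0, 50, 100)]
print(taus)  # [1.0, 0.5, 0.0]
```

A non-monotonic schedule, which the description also permits, could be obtained by replacing the linear expression with any other function that stays within the same bounds.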

As described above, according to this embodiment, episodes are performed repeatedly, and the agent's actions (a) are learned so that the cumulative value of the agent's series of actions in one episode is maximized. In this reinforcement learning, the action selection unit 111 repeatedly selects an action (a) of the agent. Each time an action (a) is selected, the calculation processing unit 112 calculates the value (Q) of the action (a) using the reward (R) granted for the selected action (a).

In normal Q-learning, the reward (R) used to calculate the value (Q) consists only of the first reward (r), as shown in formula (10). The first reward (r) is a constant predefined in the environment. In an environment where a high first reward (r) is granted only when all blocks have changed color, no first reward (r) that could serve as an indicator for the action (a) is obtained until that point. Because trial and error must be repeated until all blocks have changed color, more episodes are needed before the value (Q) is maximized, and learning takes time.

In contrast, the task control unit 113 in this embodiment defines, from the main task, subtasks and a second reward (M) that is granted according to the degree of achievement of each subtask. The calculation processing unit 112 obtains the reward (R) by adding the second reward (M) to the first reward (r). The reward (R) weighted by this addition of the second reward (M) raises the value (Q) of a series of actions that accomplish the subtasks. The agent can thus be guided toward actions (a) that accomplish the subtasks and ultimately the main task.

Guiding the actions (a) reduces trial and error. It also accelerates the maximization of the value (Q). This shortens the learning time and thus improves the learning efficiency of the reinforcement learning.

The reinforcement learning device 10 can be used to determine various policies, and its technical field is not particularly limited. For example, the reinforcement learning device 10 can be used for automated driving control that determines a vehicle's travel route while avoiding hazards, for motor drive control, for game character control, and the like.

Although preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments. Various modifications are possible within the scope of the present invention, and some of them are listed below. The modifications may be combined with one another.

(Variation 1)
The task control unit 113 can select whether to enable or disable each subtask. The task control unit 113 can switch subtasks between enabled and disabled for each episode, or within one episode depending on the state (s) of the environment.

For example, the task control unit 113 can choose to disable the subtask with ID 02 and enable the subtasks with IDs 01 and 03 until half of the game area 30 has changed color in one episode. In this case, during the first half of the game, actions (a) that change the color of areas 31 and 33 are more likely to be selected than actions that change the color of blocks in area 32. This is because changing the color of area 32 yields no second reward (M), whereas actions (a) that change the color of areas 31 and 33 do yield the second reward (M) and therefore have a higher value (Q). The actions (a) can thus be scheduled so that the color of areas 31 and 33 is changed first and the color of area 32 is changed afterwards.

When subtasks are selectively enabled or disabled, the calculation processing unit 112 may calculate the reward (R) by summing the second rewards (M) of all enabled subtasks, and use that reward (R) to calculate the value (Q).
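A minimal sketch of this summation follows. The dictionary layout keyed by subtask ID, the function name, and the example point values are assumptions for illustration, and the reward is assumed to take the form R = r + τ·ΣM over the enabled subtasks.

```python
def reward_with_enabled_subtasks(first_reward, second_rewards, enabled, tau=1.0):
    """Sum the second rewards M of enabled subtasks only, then add them to r."""
    total_m = sum(m for subtask_id, m in second_rewards.items()
                  if enabled.get(subtask_id, False))
    return first_reward + tau * total_m


# Subtask 02 is disabled, so its 2 points are ignored.
second_rewards = {"01": 3.0, "02": 2.0, "03": 0.0}
enabled = {"01": True, "02": False, "03": True}
print(reward_with_enabled_subtasks(0.0, second_rewards, enabled))  # 3.0
```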

Alternatively, the calculation processing unit 112 may first calculate a value (Q) for each subtask, and then calculate the average of the values (Q) of the enabled subtasks as the value (Q) of the action (a). Specifically, the calculation processing unit 112 calculates the value (Q) for the case where only the subtask with ID 01 is enabled, the value (Q) for the case where only the subtask with ID 02 is enabled, and the value (Q) for the case where only the subtask with ID 03 is enabled. In other words, three values (Q) are calculated, each using only the second reward (M) granted from one subtask.

FIG. 8 shows examples of tables Tq1 to Tq3 holding the values (Q) calculated for each subtask.
Table Tq1 holds the values (Q) calculated using only the second reward (M) of the subtask with ID 01. Similarly, table Tq2 holds the values (Q) calculated using only the second reward (M) of the subtask with ID 02, and table Tq3 holds the values (Q) calculated using only the second reward (M) of the subtask with ID 03.

Of the three values (Q), the calculation processing unit 112 averages those of the enabled subtasks. For example, if the subtasks with IDs 01 and 02 are enabled and the subtask with ID 03 is disabled, the calculation processing unit 112 calculates the average of the values (Q) held in tables Tq1 and Tq2. This average is stored in table Tq in association with the selected action (a).
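The averaging step might look like the following sketch. The nested-dictionary layout standing in for tables Tq1 to Tq3, the action label, and the numeric values are hypothetical.

```python
def averaged_value(per_subtask_q, enabled_ids, action):
    """Average the per-subtask values Q of one action over the enabled subtasks."""
    values = [per_subtask_q[subtask_id][action] for subtask_id in enabled_ids]
    if not values:
        raise ValueError("at least one subtask must be enabled")
    return sum(values) / len(values)


# Hypothetical contents of Tq1, Tq2, Tq3 for a single action.
per_subtask_q = {
    "01": {"move_to_01": 4.0},
    "02": {"move_to_01": 2.0},
    "03": {"move_to_01": 9.0},
}
# Subtasks 01 and 02 are enabled; 03 is disabled and does not contribute.
print(averaged_value(per_subtask_q, ["01", "02"], "move_to_01"))  # 3.0
```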

(Variation 2)
The action selection unit 111 may select the enabling or disabling of a subtask as one of the agent's actions (a). This makes it possible to also learn the schedule of actions (a) for accomplishing the subtasks.

The task control unit 113 can define a second reward (M) for the action (a) of selecting whether to enable or disable each subtask. This second reward (M) allows the schedule of actions (a) to be learned more efficiently.

(Variation 3)
The task control unit 113 can define the second reward (M) based on information about the environment given in advance.
For example, if a state (s) of the environment that should be avoided or a state (s) that should be passed through is known in advance, the calculation processing unit 112 can define the second reward (M) based on that information.

An example will be described in which traps are placed in some blocks (ij) of the game area 30. If the brush 20 reaches a trap, the game is over and the task fails. When the positions of the traps are given in advance, the task control unit 113 can define, for the action (a) of moving the brush 20 to a block (ij) containing a trap, a second reward (M) that is lower than for other actions (a), such as -100 points.

When such a low second reward (M) is defined, the reward (R) calculated by formula (2) becomes small whenever an action (a) leading to that state (s) is selected, and as a result the value (Q) also becomes small. This makes it possible to guide the agent's actions (a) so as to avoid the traps.
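The effect of such a penalty on the reward (R) can be sketched as follows. The trap coordinates, the function names, and the additive form R = r + τ·M are assumptions; only the value of -100 points comes from the example above.

```python
TRAP_PENALTY = -100.0  # example value from the description


def trap_second_reward(next_block, trap_blocks):
    """Second reward M for moving the brush onto next_block."""
    return TRAP_PENALTY if next_block in trap_blocks else 0.0


def reward_for_move(first_reward, next_block, trap_blocks, tau=1.0):
    """Assumed form R = r + tau * M with the trap-based second reward."""
    return first_reward + tau * trap_second_reward(next_block, trap_blocks)


trap_blocks = {(1, 2), (3, 0)}  # hypothetical trap positions (i, j)
print(reward_for_move(0.0, (1, 2), trap_blocks))  # -100.0
print(reward_for_move(0.0, (0, 1), trap_blocks))  # 0.0
```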

Depending on what is controlled using reinforcement learning, the actions (a) are required not only to accomplish the task efficiently but also to carry low risk. For example, even an action (a) that accomplishes the task via the shortest route should be avoided if it approaches a high-risk location. In such cases, defining the second reward (M) as described above reduces the risk of the learned actions.

(Variation 4)
The second reward (M) in the above embodiment is a constant defined together with the subtask. However, the second reward (M) may instead be a variable that is updated according to the expected value of the reward (R) to be granted for future actions (a).

The following formula (3) shows an example of a reward function M(s, m) used to calculate the updated second reward (M). In formula (3), R_e represents the expected value of the reward (R) granted when a subtask is accomplished. The expected value (R_e) is calculated by formula (4). Each time one or more actions (a) are selected, the calculation processing unit 112 can calculate the updated second reward (M) and update tables T31 to T33. The updated second reward (M) is then used to calculate the reward (R).

In formula (3), λ is a coefficient satisfying 0 < λ ≤ 1. In formula (4), E[ ] denotes a function that outputs the expected value of the expression in [ ].

For example, suppose the current environment is in a state (s) in which all blocks in area 31 have changed color but the blocks in areas 32 and 33 have not. Whatever action (a) is selected in this state (s), the first reward (r) obtained is 0 points. In addition, since the subtask with ID 01 has already been accomplished, the expected value (R_e) of the reward (R) still to be obtained by accomplishing it is 0 points.

On the other hand, accomplishing the subtasks with IDs 02 and 03 is expected to yield, as the reward (R), the cumulative value of the second rewards (M) defined for each of those subtasks. According to formulas (3) and (4), the second reward (M) granted for actions (a) that accomplish the subtasks with IDs 02 and 03 is higher than that for the subtask with ID 01. As a result, the value (Q) of those actions (a) increases, and the agent's actions (a) can be guided toward accomplishing the subtasks with IDs 02 and 03.

When calculating the updated second reward (M), the calculation processing unit 112 can adjust the ratio (λ) of the expected value (R_e) of the reward (R) in the updated second reward (M) by changing the value of the coefficient λ. The larger the ratio (λ), the faster the shift from the original second reward (M) to the expected value (R_e) of the reward (R). By increasing the ratio (λ), the calculation processing unit 112 can therefore reflect the results of actual actions (a) in subsequent actions (a) more quickly.

(Variation 5)
Instead of formula (3), the second reward (M) may be calculated by the reward function M_γ(s, m) shown in the following formula (5). According to formula (5), the updated second reward (M) is the expected value (E) of the rewards (R) obtained up to each action (a) selected in one episode, and is calculated by adding up those rewards (R). Here, the reward (R) granted for each action (a) is discounted by the coefficient γ_E.

In formula (5), E[ ] represents a function that outputs the expected value of the expression in [ ].

γ_E represents the discount rate of the reward (R) granted for each action (a) and satisfies 0 < γ_E ≤ 1. Because γ_E is raised to the power (e − t), the factor γ_E^(e−t) applied to the reward (R) becomes smaller the closer the action is to the initial state (s_0) and the farther it is from the final state (s_e). With this reward function M_γ(s, m), the agent is guided more strongly toward past actions close to the final state (s_e) than toward those close to the initial state (s_0). The function is therefore suited to environments in which the first reward (r) is granted for the last action (a) that accomplishes the main task. In addition, when γ_E^(e−t) is smaller than 1, actions (a) that reach a high expected value (E) in fewer steps are more likely to be selected. γ_E can therefore be set in consideration of the average time until the cumulative reward (R_e) is obtained.

For example, when the environment transitions from state (s_t, m) to state (s_{t+3}, m), the expected value (E) of the rewards (R) up to each action (a) is calculated as follows.
\[
M_\gamma(s, m) = \gamma_E^{3} R(s_t, m) + \gamma_E^{2} R(s_{t+1}, m) + \gamma_E R(s_{t+2}, m) + R(s_{t+3}, m)
\]
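The discounted sum in this example can be reproduced with a short sketch. It evaluates a single observed sequence of rewards rather than an expectation over many episodes, which is a simplification; the function name and the sample reward values are hypothetical.

```python
def discounted_second_reward(rewards, gamma_e):
    """Sum rewards R(s_t), ..., R(s_e), weighting each by gamma_e ** (e - t)."""
    if not 0.0 < gamma_e <= 1.0:
        raise ValueError("gamma_e must satisfy 0 < gamma_e <= 1")
    last = len(rewards) - 1
    # The most recent reward has exponent 0; earlier rewards are discounted more.
    return sum(gamma_e ** (last - i) * r for i, r in enumerate(rewards))


# Four rewards for states s_t .. s_{t+3}, as in the worked example above.
print(discounted_second_reward([1.0, 1.0, 1.0, 1.0], gamma_e=0.5))  # 1.875
```

With γ_E = 1 the result is the plain sum of the rewards, and smaller values of γ_E shift the weight toward the rewards closest to the final state.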

(Variation 6)
The task control unit 113 may set different second rewards (M) for different subtasks. This makes it easier to control the order in which the subtasks are accomplished. For example, the task control unit 113 can define the second reward (M) of each subtask so that the second reward (M) of the subtask with ID 01 is larger than those of the subtasks with IDs 02 and 03. In this case, the actions (a) are guided so that the subtask with ID 01 is accomplished before the subtasks with IDs 02 and 03.

(Variation 7)
If part of a subtask should be accomplished first, the task control unit 113 can define that part so that it overlaps with part of another subtask.
For example, when a block where the areas of multiple subtasks overlap changes color, such as blocks (12) and (13) in FIG. 4, the second reward (M) of each of those subtasks is used to calculate the value (Q). As a result, the value (Q) of an action (a) that moves the brush 20 to a block in an overlapping area tends to be higher than that of an action (a) that moves it to a non-overlapping block. The actions (a) can therefore be guided so that the blocks in overlapping areas change color first.
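The effect of overlapping areas can be illustrated with a small sketch. The block labels assigned to each subtask area and the per-subtask point values are hypothetical and do not reproduce the layout of FIG. 4; the sketch only shows that a block belonging to two subtask areas collects the second reward of both.

```python
# Hypothetical subtask areas: blocks "12" and "13" belong to both subtasks.
SUBTASK_AREAS = {
    "01": {"11", "12", "13"},
    "02": {"12", "13", "14"},
}
SUBTASK_REWARD = {"01": 3.0, "02": 3.0}  # hypothetical second reward per subtask


def second_reward_for_block(block):
    """Sum the second rewards M of every subtask whose area contains the block."""
    return sum(SUBTASK_REWARD[subtask_id]
               for subtask_id, area in SUBTASK_AREAS.items()
               if block in area)


print(second_reward_for_block("12"))  # 6.0, overlapping block
print(second_reward_for_block("11"))  # 3.0, non-overlapping block
```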

In the above embodiment, the storage unit 12 stores the tables T31 to T33, Tq, and Tq1 to Tq3, but these may instead be stored in an external device such as a server. By communicating with the external device via the communication unit 15, the tables T31 to T33, Tq, and Tq1 to Tq3 can be downloaded or uploaded.

Although Q-learning has been described as an example, the present invention can be applied to any reinforcement learning that uses rewards to calculate the value of an agent's actions. For example, the present invention is also applicable to SARSA, Markov decision processes (MDP), Deep Q-Networks (DQN), and the like.

A recording medium may also be provided on which a program for causing a computer to execute the reinforcement learning method of the present invention is recorded. The recording medium is not particularly limited as long as it can be read by a computer such as a CPU; semiconductor memory, magnetic disks, optical disks, and the like can be used.

10: Reinforcement learning device, 11: CPU, 111: Action selection unit, 112: Calculation processing unit, 113: Task control unit, 12: Storage unit

Claims

A reinforcement learning device (10) that repeatedly performs an episode of selecting an action of an agent until the agent accomplishes a task in a given environment, and learns the action of the agent so as to maximize a cumulative value of a series of actions in one episode, the device comprising:
an action selection unit (111) that repeatedly selects an action of the agent;
a calculation processing unit (112) that calculates, each time the action is selected, a value (Q) of the selected action using a reward (R) given to the selected action; and
a task control unit (113) that defines one or more subtasks from the task and defines a second reward (M) to be granted according to the degree of achievement of the subtask,
wherein the calculation processing unit (112) obtains the second reward (M) defined for the selected action and a first reward (r) provided from the environment, and calculates the reward (R) by adding the second reward (M) to the first reward (r).

The reinforcement learning device (10) according to claim 1, wherein the calculation processing unit (112) adjusts a rate (τ) of adding the second reward (M).

The reinforcement learning device (10) according to claim 2, wherein the calculation processing unit (112) decreases a rate (τ) of adding the second reward (M) as the number of times the episode is performed increases.

The reinforcement learning device (10) according to any one of claims 1 to 3, wherein the task control unit (113) defines the second reward (M) independently for each of the subtasks.

The reinforcement learning device (10) according to any one of claims 1 to 4, wherein the task control unit (113) selects whether to enable or disable the subtask.

The reinforcement learning device (10) according to any one of claims 1 to 4, wherein the action selection unit (111) selects enabling or disabling the subtask as one of the actions of the agent.

A reinforcement learning method for repeatedly selecting an action for an agent in a given environment until the agent accomplishes a task, the method comprising:
repeatedly selecting an action for the agent;
calculating, each time the action is selected, a value (Q) of the selected action using a reward (R) given to the selected action; and
defining one or more subtasks from the task and defining a second reward (M) to be granted according to the degree of achievement of the subtask,
wherein the step of calculating the value (Q) comprises:
obtaining the second reward (M) defined for the selected action and a first reward (r) provided by the environment; and
calculating the reward (R) by adding the second reward (M) to the first reward (r).

A program for causing a computer to execute a reinforcement learning method for repeatedly performing an episode of selecting an action of an agent until the agent accomplishes a task in a given environment, and learning the action of the agent so as to maximize a cumulative value of a series of actions in one episode, the reinforcement learning method comprising:
repeatedly selecting an action for the agent;
calculating, each time the action is selected, a value (Q) of the selected action using a reward (R) given to the selected action; and
defining one or more subtasks from the task and defining a second reward (M) to be granted according to the degree of achievement of the subtask,
wherein the step of calculating the value (Q) comprises:
obtaining the second reward (M) defined for the selected action and a first reward (r) provided by the environment; and
calculating the reward (R) by adding the second reward (M) to the first reward (r).