JP7635849B2

JP7635849B2 - Reinforcement learning system, reinforcement learning device, reinforcement learning method and program

Info

Publication number: JP7635849B2
Application number: JP2023546676A
Authority: JP
Inventors: 裕志吉田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2025-02-26
Anticipated expiration: 2041-09-10
Also published as: WO2023037504A1; JPWO2023037504A1

Description

The present invention relates to a reinforcement learning system, a reinforcement learning device, and a reinforcement learning method.

Research is under way on reinforcement learning, which learns the action that maximizes the reward obtained when that action is taken in a given state. Patent Document 1 describes a technique that, for assembly work performed by a robot arm, uses reinforcement learning to learn images of the concave and convex parts together with the control amount used when the parts are fitted together. Patent Document 2 describes a technique that uses reinforcement learning to learn the accelerator operation amount and to select, according to the state, an action consisting of a throttle opening command value and a retard amount. Patent Document 2 also states that a function approximator may be used for the action value function Q.

Patent Document 1: International Publication No. WO 2018/146770
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2021-67193

However, the techniques described in Patent Documents 1 and 2 leave room for improvement in selecting more suitable actions. In reinforcement learning, an appropriate action can be selected if the action value function is estimated accurately, but the action value function estimated by the techniques of Patent Documents 1 and 2 contains errors. Accurate estimation of the action value function is particularly difficult when the state-action space is very large.

One aspect of the present invention has been made in view of the above problem, and one example of its object is to provide a technique that enables a more suitable action to be selected.

A reinforcement learning system according to one aspect of the present invention comprises: acquisition means for acquiring a first state in an environment that is the target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action value function according to the second state; and selection means for selecting an action according to the first action value function.

A reinforcement learning device according to one aspect of the present invention comprises: acquisition means for acquiring a first state in an environment that is the target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action value function according to the second state; and selection means for selecting an action according to the first action value function.

A reinforcement learning method according to one aspect of the present invention includes: generating a second state by adding noise to a first state in an environment that is the target of reinforcement learning; calculating a first action value function according to the second state; and selecting an action according to the first action value function.

A program according to one aspect of the present invention is a program for causing a computer to function as a reinforcement learning device, the program causing the computer to function as: acquisition means for acquiring a first state in an environment that is the target of reinforcement learning; generation means for generating a second state by adding noise to the first state; calculation means for calculating a first action value function according to the second state; and selection means for selecting an action according to the first action value function.

According to one aspect of the present invention, a more suitable action can be selected.

FIG. 1 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 1 of the present invention.
FIG. 2 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 1 of the present invention.
FIG. 3 is a block diagram illustrating a configuration of the reinforcement learning system according to exemplary embodiment 1 of the present invention.
FIG. 4 is a block diagram showing an example of a device configuration that realizes exemplary embodiment 1 of the present invention.
FIG. 5 is a block diagram showing the configuration of a reinforcement learning system according to exemplary embodiment 2 of the present invention.
FIG. 6 is a flow diagram showing the flow of a reinforcement learning method according to exemplary embodiment 2 of the present invention.
FIG. 7 is a diagram showing an example of a game screen according to exemplary embodiment 3 of the present invention.
FIG. 8 is a diagram illustrating a first state according to exemplary embodiment 3 of the present invention.
FIG. 9 is a diagram showing an example of an evaluation result for an application example of exemplary embodiment 3 of the present invention.
FIG. 10 is a diagram showing an example of an evaluation result for an application example of exemplary embodiment 3 of the present invention.
FIG. 11 is a diagram showing an example of an evaluation result for an application example of exemplary embodiment 3 of the present invention.
FIG. 12 is a diagram showing an example of an evaluation result for an application example of exemplary embodiment 3 of the present invention.
FIG. 13 is a block diagram showing the configuration of a computer that functions as the reinforcement learning device, the terminal 20, or the server 30 according to exemplary embodiments 1 to 7 of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[Exemplary Embodiment 1]
A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is the basic form on which the exemplary embodiments described later are built.

<Configuration of the Reinforcement Learning System>
The configuration of a reinforcement learning system 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the reinforcement learning system 1. The reinforcement learning system 1 is a system that selects an action by reinforcement learning. Examples of the reinforcement learning system 1 include a system that controls the construction operation of a construction machine such as an excavator, a system that controls transport by a transport device, and a system for autonomous play of a computer game. The reinforcement learning performed by the reinforcement learning system 1 is not limited to these examples and can be applied to various systems. An action is an action of the agent in reinforcement learning; examples include control of the excavation operation of an excavator, control of the transport operation of a transport device, and control of autonomous play of a computer game. Actions are not limited to these examples either.

As shown in FIG. 1, the reinforcement learning system 1 includes an acquisition unit 11, a generation unit 12, a calculation unit 13, and a selection unit 14. In this exemplary embodiment, the acquisition unit 11, the generation unit 12, the calculation unit 13, and the selection unit 14 implement the acquisition means, the generation means, the calculation means, and the selection means, respectively.

The acquisition unit 11 acquires a first state. The first state is a state of the environment that is the target of reinforcement learning. For example, when the reinforcement learning system 1 is a system for selecting the excavation operation of an excavator, the first state includes some or all of the following: the posture and position of the excavator, the shape of the soil to be excavated, and the amount of soil in the excavator's bucket. When the reinforcement learning system 1 is a system for selecting the transport operation of a transport device, the first state includes, for example, some or all of the following: the position, direction of travel, speed, and angles of the transport device; the position of the passage; and the positions and speeds of static or dynamic obstacles. When the reinforcement learning system 1 is a system for autonomous play of a computer game, the first state includes, for example, the states of objects that affect the progress of the game. The first state is not limited to these examples and may include, for example, environmental conditions such as temperature or weather.

The generation unit 12 generates a second state by adding noise to the first state. The noise is, for example, a random number such as a normally distributed or uniformly distributed random number, but other kinds of noise may also be used. The generation unit 12 may add noise to all of the elements included in the first state or to only some of them.
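As an illustration of the generation step, the sketch below adds normally distributed noise to all or only some elements of a first state held as a list of numbers. The function name `make_second_state`, the parameters `sigma` and `noisy_indices`, and the list representation are assumptions made for this example, not details taken from the patent.

```python
import random
from typing import List, Optional, Sequence


def make_second_state(
    first_state: Sequence[float],
    sigma: float = 0.1,
    noisy_indices: Optional[Sequence[int]] = None,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return a noisy copy of first_state (hypothetical helper).

    noisy_indices=None perturbs every element; otherwise only the
    listed positions are perturbed and the rest are copied unchanged.
    """
    rng = rng or random.Random()
    if noisy_indices is None:
        targets = set(range(len(first_state)))
    else:
        targets = set(noisy_indices)
    return [
        x + rng.gauss(0.0, sigma) if i in targets else x
        for i, x in enumerate(first_state)
    ]


s1 = [1.0, 2.0, 3.0]
# Perturb only elements 0 and 2; element 1 is treated as noise-free.
s2 = make_second_state(s1, sigma=0.05, noisy_indices=[0, 2], rng=random.Random(0))
```

Uniform noise could be substituted by replacing `rng.gauss(0.0, sigma)` with `rng.uniform(-sigma, sigma)`.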

The calculation unit 13 calculates a first action value function according to the second state. As one example, the calculation unit 13 calculates the first action value function using a state sequence that includes a plurality of second states. The calculation unit 13 may also calculate the first action value function using a state sequence that includes the first state and one or more second states. In other words, the state sequence used to calculate the first action value function includes one or more second states, and every state in the sequence is either the first state or a second state. In the following description, the first state and the second state are simply called "states" when there is no need to distinguish between them.

The first action value function is a function for evaluating an action in a state. As one example, the first action value function is the action value function used in Q-learning and is updated by formula (1). However, the first action value function is not limited to the one given by formula (1) and may be another function.

In formula (1), s_t^(i) (1 ≤ i ≤ n; i and n are natural numbers) is a state included in the state sequence (that is, the first state or a second state), a is an action, and Q(s_t^(i), a) is the first action value function. α is the learning rate, s_{t+1}^(i) is the state after the transition, r_{t+1} is the reward the agent obtains on transitioning to state s_{t+1}^(i), and γ (0 ≤ γ ≤ 1) is the discount rate. In addition, a′ ∈ A, where the set A is the set of actions available to the agent in state s_t^(i).
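The image of formula (1) is not reproduced in this text. Because the description names Q-learning and defines α, γ, r_{t+1}, and a′ ∈ A, formula (1) is presumably the standard Q-learning update applied to each state of the state sequence. The following is a reconstruction under that assumption, not a copy of the published formula:

```latex
Q\bigl(s_t^{(i)}, a\bigr) \leftarrow Q\bigl(s_t^{(i)}, a\bigr)
  + \alpha \Bigl[ r_{t+1}
  + \gamma \max_{a' \in A} Q\bigl(s_{t+1}^{(i)}, a'\bigr)
  - Q\bigl(s_t^{(i)}, a\bigr) \Bigr] \tag{1}
```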

The reward is what the agent obtains from the environment as a result of acting. As one example, the reward is a value that is increased or decreased according to the amount excavated by the excavator, the time required for excavation, the time required for transport, whether an obstacle was contacted during transport, the outcome of the game, or the game score. However, the reward is not limited to these examples.

When formula (1) is used, the calculation unit 13 calculates a first action value function for each of the states included in the state sequence. In other words, the calculation unit 13 calculates as many first action value functions as there are states in the state sequence.
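The per-state calculation can be sketched with a tabular Q-function. The sketch assumes the standard Q-learning update, hashable (for example, discretized) states, and a pairing of each state in the sequence with its own next state; the names `update_q_for_sequence` and `QTable` are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple

# (state, action) -> estimated action value; an assumed tabular form.
QTable = Dict[Tuple[Hashable, Hashable], float]


def update_q_for_sequence(
    q: QTable,
    states: Sequence[Hashable],
    next_states: Sequence[Hashable],
    action: Hashable,
    reward: float,
    actions: Sequence[Hashable],
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> None:
    """Apply one Q-learning update for every state in the state sequence."""
    for s, s_next in zip(states, next_states):
        # Best value reachable from the state after the transition.
        best_next = max(q[(s_next, a2)] for a2 in actions)
        td_error = reward + gamma * best_next - q[(s, action)]
        q[(s, action)] += alpha * td_error


q: QTable = defaultdict(float)
update_q_for_sequence(
    q,
    states=["s", "s~1", "s~2"],          # first state plus two noisy copies
    next_states=["n", "n~1", "n~2"],
    action="dig",
    reward=1.0,
    actions=["dig", "wait"],
)
```

One update is applied per state, so the number of updated entries equals the length of the state sequence.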

The selection unit 14 selects an action according to the first action value function. As one example, the selection unit 14 selects the action that maximizes the first action value function. The selection unit 14 may instead select an action by the ε-greedy method, by roulette selection as used in genetic algorithms, by a softmax method based on the Boltzmann distribution, or the like.
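Two of the selection rules named above can be sketched as follows, assuming the first action values for the current state are available as a dictionary from action to value. The function names and the `temperature` parameter are illustrative assumptions.

```python
import math
import random
from typing import Dict, Hashable, Optional


def greedy_select(q_values: Dict[Hashable, float]) -> Hashable:
    """Pick the action with the largest action value."""
    return max(q_values, key=q_values.get)


def softmax_select(
    q_values: Dict[Hashable, float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = rng or random.Random()
    top = max(q_values.values())  # subtracted for numerical stability
    weights = [math.exp((v - top) / temperature) for v in q_values.values()]
    return rng.choices(list(q_values.keys()), weights=weights, k=1)[0]


values = {"dig": 0.2, "wait": 0.7}
best = greedy_select(values)
sampled = softmax_select(values, temperature=0.5, rng=random.Random(0))
```

A low temperature makes the softmax rule behave almost greedily, while a high temperature approaches uniform random selection.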

When a plurality of first action value functions are used, the selection unit 14 may select an action using any one of them. Alternatively, it may calculate a second action value function from the plurality of first action value functions calculated by the calculation unit 13 and select an action using that second action value function. The second action value function is a function for evaluating an action in a state. As one example, it may be the expected value of the plurality of first action value functions; as another example, it may be a function whose value falls further below that expected value as the variation among the first action value functions increases. The second action value function is given, for example, by formula (2) or formula (3), but it is not limited to these and may be another function.
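One function with the second property described above is the mean of the first action values minus a variance penalty. The sketch below is only an illustration of that idea; the name `second_action_value` and the use of `theta` as the penalty weight are assumptions, and the patent's own formulas may differ in form.

```python
from statistics import fmean, pvariance
from typing import Sequence


def second_action_value(first_values: Sequence[float], theta: float = 0.0) -> float:
    """Mean of the first action values, lowered by theta/2 times their variance.

    theta = 0 gives the plain expected value; larger theta penalizes
    actions whose value varies strongly across the state sequence.
    """
    return fmean(first_values) - 0.5 * theta * pvariance(first_values)


stable = second_action_value([1.0, 1.0, 1.0], theta=2.0)    # no spread, no penalty
unstable = second_action_value([0.0, 2.0], theta=1.0)       # same mean, penalized
```

Both inputs have a mean of 1.0, but the second is scored lower because its values disagree.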

In formulas (2) and (3), J(s_t, a) is the second action value function, s_t is the first state, a is an action, θ is a hyperparameter, Q(s_t^(i), a) is the first action value function, s_t^(i) is a state included in the state sequence, and E denotes the expected value. Formula (3) is obtained by expanding formula (2) as a Taylor series, keeping terms up to second order and discarding terms of third order and higher.
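The images of formulas (2) and (3) are not reproduced in this text. One pair of expressions consistent with the description (a hyperparameter θ, an expectation over the state sequence, a value that drops below the expectation as the spread grows, and a second-order Taylor expansion linking the two) is the entropic risk form and its mean-variance approximation. The following is offered as an assumption, not as the published formulas:

```latex
J(s_t, a) = -\frac{1}{\theta}
  \log \mathrm{E}\Bigl[ \exp\bigl( -\theta\, Q(s_t^{(i)}, a) \bigr) \Bigr] \tag{2}

J(s_t, a) \approx \mathrm{E}\bigl[ Q(s_t^{(i)}, a) \bigr]
  - \frac{\theta}{2} \Bigl( \mathrm{E}\bigl[ Q(s_t^{(i)}, a)^2 \bigr]
  - \mathrm{E}\bigl[ Q(s_t^{(i)}, a) \bigr]^2 \Bigr) \tag{3}
```

Under this reading, the bracketed term in (3) is the variance of the first action values over the state sequence, so a larger θ penalizes actions whose value is sensitive to noise in the state.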

When the selection unit 14 calculates the second action value function, it selects, as one example, the action that maximizes the second action value function using the policy given by formula (4). The policy for selecting an action is not limited to that of formula (4) and may be another policy. For example, the selection unit 14 may select an action by the ε-greedy method, by roulette selection as used in genetic algorithms, or by a softmax method based on the Boltzmann distribution. When the ε-greedy method is used, the policy is given, for example, by formula (5).
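The images of formulas (4) and (5) are not reproduced in this text. Based on the description, formula (4) is presumably a greedy policy over the second action value function and formula (5) its ε-greedy variant. The reconstruction below rests on that assumption; in particular, which side of the comparison between v and ε triggers the random choice is a guess:

```latex
\pi(s_t) = \operatorname*{arg\,max}_{a'} J(s_t, a') \tag{4}

\pi(s_t) =
\begin{cases}
  \operatorname*{arg\,max}_{a'} J(s_t, a') & \text{if } v > \varepsilon \\
  \text{an action } a' \text{ chosen at random} & \text{otherwise}
\end{cases} \tag{5}
```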

In formulas (4) and (5), π is the next action to be selected, and a′ is an action available to the agent in the first state s_t. In formula (5), ε (0 < ε < 1) is a constant, and v is a random number satisfying 0 ≤ v ≤ 1.
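An ε-greedy policy over the second action value function can be sketched as follows. The function name, the callable `j` that returns J(s_t, a′) for a given action, and the convention that exploration happens when v < ε are assumptions made for this example.

```python
import random
from typing import Callable, Hashable, Optional, Sequence


def epsilon_greedy(
    actions: Sequence[Hashable],
    j: Callable[[Hashable], float],
    epsilon: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Explore with probability epsilon, otherwise maximize j."""
    rng = rng or random.Random()
    v = rng.random()  # uniform random number in [0, 1)
    if v < epsilon:
        return rng.choice(list(actions))  # exploratory random action
    return max(actions, key=j)            # greedy action


j_values = {"left": 0.1, "stay": 0.9, "right": 0.4}
action = epsilon_greedy(list(j_values), j_values.get, epsilon=0.0)
```

Setting `epsilon=0.0` reduces the policy to the purely greedy rule.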

<Effects of the Reinforcement Learning System>
According to the reinforcement learning system 1 of this exemplary embodiment, calculating the action value function using a second state obtained by adding noise to the first state yields a first action value function that takes variation in the state into account. By selecting an action using this first action value function, the reinforcement learning system 1 can select a more suitable action.

<Flow of the Reinforcement Learning Method>
FIG. 2 is a flow diagram showing the flow of a reinforcement learning method S1 executed by the reinforcement learning system 1. The reinforcement learning system 1 selects actions repeatedly by repeating the reinforcement learning method S1. Matters already described are not described again.

The reinforcement learning method S1 includes steps S11 to S14. In step S11, the acquisition unit 11 acquires a first state. In step S12, the generation unit 12 generates a second state by adding noise to the first state.

In step S13, the calculation unit 13 calculates a first action value function according to the second state. In the n-th repetition (n is a natural number), the data that the calculation unit 13 refers to in calculating the first action value function are, for example, the states, actions, and rewards accumulated up to the (n-1)-th repetition. In step S14, the selection unit 14 selects an action according to the first action value function.
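Steps S11 to S14 can be sketched as a single function. This is a minimal illustration under several assumptions: the state is a tuple of numbers, the first action value is supplied as a callable, and the values over the noisy states are averaged before the greedy choice. The names `run_step` and `toy_q` are hypothetical.

```python
import random
from statistics import fmean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[float, ...]


def run_step(
    acquire_state: Callable[[], State],
    first_action_value: Callable[[State, str], float],
    actions: Sequence[str],
    n_noisy: int = 16,
    sigma: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    rng = rng or random.Random()
    first_state = acquire_state()                          # S11: acquire
    second_states: List[State] = [                         # S12: add noise
        tuple(x + rng.gauss(0.0, sigma) for x in first_state)
        for _ in range(n_noisy)
    ]
    scores: Dict[str, float] = {                           # S13: evaluate
        a: fmean([first_action_value(s, a) for s in second_states])
        for a in actions
    }
    return max(scores, key=scores.get)                     # S14: select


def toy_q(state: State, action: str) -> float:
    """Toy value: higher when the action's target is near the position."""
    targets = {"left": -1.0, "stay": 0.0, "right": 1.0}
    return -abs(state[0] - targets[action])


chosen = run_step(lambda: (0.0,), toy_q, ["left", "stay", "right"], rng=random.Random(0))
```

In a full loop, the chosen action would be executed, the resulting reward recorded, and the stored states, actions, and rewards used in the next repetition.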

<Effects of the Reinforcement Learning Method>
According to the reinforcement learning method S1 of this exemplary embodiment, calculating the action value function using a second state obtained by adding noise to the first state yields an action value function that takes variation in the state into account. Selecting an action using this action value function makes it possible to select a more suitable action.

<Example Device Configurations of the Reinforcement Learning System>
Next, example device configurations of the reinforcement learning system 1 according to this exemplary embodiment will be described with reference to the drawings. FIG. 3 is a block diagram showing one example of the configuration of the reinforcement learning system 1. In the example of FIG. 3, the reinforcement learning system 1 includes a reinforcement learning device 10. The reinforcement learning device 10 includes the acquisition unit 11, the generation unit 12, the calculation unit 13, and the selection unit 14. The reinforcement learning device 10 is, for example, a server device, a personal computer, or a game device, but it is not limited to these. As one example, the reinforcement learning device 10 may acquire the first state by receiving it via a communication interface.

FIG. 4 is a block diagram showing another example of the configuration of the reinforcement learning system 1. In the example of FIG. 4, the reinforcement learning system 1 includes a terminal 20 and a server 30. The terminal 20 is, for example, a personal computer or a game device, but it is not limited to these. The terminal 20 includes the acquisition unit 11. The server 30 includes the generation unit 12, the calculation unit 13, and the selection unit 14. The terminal 20 acquires the first state and supplies it to the server 30.

Although FIGS. 3 and 4 are given as configuration examples in this exemplary embodiment, the configuration of the reinforcement learning system 1 is not limited to these examples, and various other configurations are applicable.

[Exemplary Embodiment 2]
A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiment 1 are given the same reference numerals, and their description is not repeated.

<Configuration of the Reinforcement Learning System>
FIG. 5 is a block diagram showing the configuration of a reinforcement learning system 2. As shown in FIG. 5, the reinforcement learning system 2 includes a terminal 40 and a reinforcement learning device 50. The terminal 40 and the reinforcement learning device 50 are configured to communicate with each other via a communication line N. The specific configuration of the communication line N does not limit this exemplary embodiment; examples include a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks.

The terminal 40 is, for example, a general-purpose computer. More specifically, it is, for example, a control device that controls a construction machine such as an excavator, a management device that manages transport by a transport device, or a game device for playing a computer game. The terminal 40 is not limited to these and may be another device. The reinforcement learning device 50 is, for example, a server device.

<Configuration of the Terminal>
The terminal 40 includes a communication unit 41, a control unit 42, and an input reception unit 43. Under the control of the control unit 42, the communication unit 41 transmits and receives information to and from the reinforcement learning device 50 via the communication line N. Hereinafter, the control unit 42 transmitting and receiving information to and from the reinforcement learning device 50 via the communication unit 41 is also described simply as the control unit 42 transmitting and receiving information to and from the reinforcement learning device 50.

The control unit 42 includes a state providing unit 421, an action execution unit 422, and a reward providing unit 423. The state providing unit 421 acquires a first state and provides it to the reinforcement learning device 50. In this exemplary embodiment, the first state acquired by the state providing unit 421 includes a plurality of elements, each accompanied by an attribute. An attribute is information indicating the characteristics and/or type of an element and includes, for example, information indicating whether the element is a dynamic element that moves within the environment or a static element that does not. An attribute may also be information indicating the type of the element, such as a person, an automobile, a bicycle, or a building. Attributes are not limited to these examples.
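A first state made up of attributed elements could be represented as follows. The class names `Element` and `FirstState`, the field names, and the two-dimensional position are assumptions chosen for illustration; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    kind: str                       # type attribute, e.g. "person", "car", "building"
    dynamic: bool                   # True if the element moves within the environment
    position: Tuple[float, float]   # assumed 2-D position of the element


@dataclass(frozen=True)
class FirstState:
    elements: Tuple[Element, ...]

    def dynamic_elements(self) -> List[Element]:
        """Elements whose attribute marks them as moving."""
        return [e for e in self.elements if e.dynamic]

    def static_elements(self) -> List[Element]:
        """Elements whose attribute marks them as not moving."""
        return [e for e in self.elements if not e.dynamic]


state = FirstState(
    elements=(
        Element(kind="car", dynamic=True, position=(1.0, 2.0)),
        Element(kind="building", dynamic=False, position=(5.0, 5.0)),
    )
)
moving_kinds = [e.kind for e in state.dynamic_elements()]
```

Keeping the attribute next to each element lets later processing treat dynamic and static elements differently if needed.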

As one example, the state providing unit 421 may acquire, as the first state, sensor information output by a sensor that detects the operation of a construction machine, a transport device, or the like. As another example, the state providing unit 421 may acquire, as the first state, the state of an object that affects the progress of a computer game. The first state acquired by the state providing unit 421 is not limited to these examples.

As one example, the state providing unit 421 receives input of the first state via the input reception unit 43 and provides the received first state to the reinforcement learning device 50. The state providing unit 421 may also receive the first state from another device connected via the communication unit 41 and provide it to the reinforcement learning device 50.
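The exchange between the terminal and the reinforcement learning device could be carried in simple messages. The JSON layout below, including the `type`, `state`, `reward`, and `action` fields and both function names, is entirely hypothetical; the patent does not specify a message format.

```python
import json
from typing import List


def encode_observation(state: List[float], reward: float) -> str:
    """Terminal side: package a first state and a reward for sending."""
    return json.dumps({"type": "observation", "state": state, "reward": reward})


def decode_action(payload: str) -> str:
    """Terminal side: extract the action chosen by the learning device."""
    message = json.loads(payload)
    if message.get("type") != "action":
        raise ValueError("unexpected message type")
    return message["action"]


outgoing = encode_observation([1.0, 2.0], reward=0.5)
incoming = '{"type": "action", "action": "dig"}'
next_action = decode_action(incoming)
```

Any transport that delivers these payloads over the communication line would do; the sketch only fixes what each side needs to know.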

The action execution unit 422 executes the action determined by the reinforcement learning device 50. As one example, it outputs control information that causes a construction machine, a transport device, or the like to perform that action. As another example, it controls the behavior of the object that is the target of user operation in a computer game. The actions to be executed are not limited to these examples.

The reward providing unit 423 provides the reinforcement learning device 50 with the reward obtained when the agent executes the action determined by the reinforcement learning device 50. As one example, the reward providing unit 423 provides, as the reward, information indicating the amount excavated by the excavator, the time required for excavation, the time the transport device required for transport, whether an obstacle was contacted during transport, the outcome of the game, or the game score. The reward provided by the reward providing unit 423 is not limited to these examples.
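A reward for the transport case could combine the measured transport time with a collision flag. The function name, the weights, and the linear form below are assumptions for illustration only.

```python
def transport_reward(
    elapsed_seconds: float,
    hit_obstacle: bool,
    time_weight: float = 0.01,
    collision_penalty: float = 1.0,
) -> float:
    """Smaller (more negative) for slow transport and for obstacle contact."""
    reward = -time_weight * elapsed_seconds
    if hit_obstacle:
        reward -= collision_penalty
    return reward


clean_run = transport_reward(100.0, hit_obstacle=False)
bumpy_run = transport_reward(100.0, hit_obstacle=True)
```

With these weights, a collision costs as much as 100 extra seconds of transport time; other trade-offs follow from changing the two parameters.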

As one example, the reward providing unit 423 provides the reinforcement learning device 50 with a reward acquired via the input reception unit 43. The reward providing unit 423 may also receive a reward from another device connected via the communication unit 41 and provide it to the reinforcement learning device 50.

入力受付部４３は、端末４０に対する各種の入力を受け付ける。入力受付部４３の具体的構成は本例示的実施形態を限定するものではないが、一例として、入力受付部４３は、キーボード及びタッチパッド等の入力デバイスを備える構成とすることができる。また、入力受付部４３は、赤外線や電波等の電磁波を介してデータの読み取りを行うデータスキャナ、及び、環境の状態をセンシングするセンサ等を備える構成としてもよい。報酬提供部４２３は一例として、入力受付部４３が取得したセンシング結果に基づいて、搬送装置が搬送に要した時間等を測定し、測定結果を示す報酬を強化学習装置５０に提供する。The input reception unit 43 receives various inputs to the terminal 40. Although the specific configuration of the input reception unit 43 does not limit this exemplary embodiment, as an example, the input reception unit 43 may include input devices such as a keyboard and a touchpad. The input reception unit 43 may also include a data scanner that reads data via electromagnetic waves such as infrared rays or radio waves, and a sensor that senses the state of the environment. As an example, the reward providing unit 423 measures quantities such as the time the transportation device required for transportation, based on the sensing result acquired by the input reception unit 43, and provides a reward indicating the measurement result to the reinforcement learning device 50.

入力受付部４３は、上述した入力デバイス、データスキャナ、及びセンサ等を介して、入力を受け付けた情報を制御部４２に供給する。入力受付部４３は、一例として、上述した状態、及び上述した報酬を取得し、取得した状態及び報酬を制御部４２に供給する。The input reception unit 43 supplies the received input information to the control unit 42 via the above-mentioned input devices, data scanners, sensors, etc. The input reception unit 43, as an example, acquires the above-mentioned state and the above-mentioned reward, and supplies the acquired state and reward to the control unit 42.

＜強化学習装置の構成＞ <Configuration of Reinforcement Learning Device>
強化学習装置５０は、通信部５１、制御部５２及び記憶部５３を備える。通信部５１は、制御部５２の制御の下に、通信回線Ｎを介して端末４０との間で情報を送受信する。以降、制御部５２が通信部５１を介して端末４０との間で情報を送受信することを、単に、制御部５２が端末４０との間で情報を送受信する、とも記載する。 The reinforcement learning device 50 includes a communication unit 51, a control unit 52, and a storage unit 53. Under the control of the control unit 52, the communication unit 51 transmits and receives information to and from the terminal 40 via a communication line N. Hereinafter, the transmission and reception of information between the control unit 52 and the terminal 40 via the communication unit 51 is also described simply as the control unit 52 transmitting and receiving information to and from the terminal 40.

制御部５２は、報酬取得部５２１、状態観測部５２２、状態ランダム化部５２３、学習部５２４、推定部５２５、及び選択部５２６を備える。状態観測部５２２は、本例示的実施形態において取得手段を実現する構成である。状態ランダム化部５２３は、本例示的実施形態において生成手段を実現する構成である。推定部５２５は、本例示的実施形態において算出手段を実現する構成である。選択部５２６は、本例示的実施形態において選択手段を実現する構成である。 The control unit 52 includes a reward acquisition unit 521, a state observation unit 522, a state randomization unit 523, a learning unit 524, an estimation unit 525, and a selection unit 526. The state observation unit 522 is a configuration that realizes an acquisition means in this exemplary embodiment. The state randomization unit 523 is a configuration that realizes a generation means in this exemplary embodiment. The estimation unit 525 is a configuration that realizes a calculation means in this exemplary embodiment. The selection unit 526 is a configuration that realizes a selection means in this exemplary embodiment.

報酬取得部５２１は、通信部５１を介して端末４０が提供する報酬を取得する。状態観測部５２２は、通信部５１を介して端末４０が提供する第１の状態を取得する。状態ランダム化部５２３は、状態観測部５２２が取得した第１の状態にノイズを付加することによって１又は複数の第２の状態を生成する。学習部５２４は、第１の行動価値関数を更新するための行動価値関数モデル５３１を学習させる。行動価値関数モデル５３１は第１の行動価値関数の推定に用いられる。The reward acquisition unit 521 acquires the reward provided by the terminal 40 via the communication unit 51. The state observation unit 522 acquires the first state provided by the terminal 40 via the communication unit 51. The state randomization unit 523 generates one or more second states by adding noise to the first state acquired by the state observation unit 522. The learning unit 524 trains an action value function model 531 for updating the first action value function. The action value function model 531 is used to estimate the first action value function.

推定部５２５は、第１の状態と１又は複数の第２の状態とを含む状態列、又は、複数の第２の状態を含む状態列、に応じて、第１の行動価値関数を算出する。また、推定部５２５は、第１の行動価値関数を用いて第２の行動価値関数を算出する。The estimation unit 525 calculates a first action value function according to a state sequence including a first state and one or more second states, or a state sequence including multiple second states. The estimation unit 525 also calculates a second action value function using the first action value function.

選択部５２６は、第２の行動価値関数を用いて行動を選択し、選択した行動を示す情報を記憶部５３に記憶するとともに、選択した行動を示す情報を端末４０に送信する。The selection unit 526 selects an action using the second action value function, stores information indicating the selected action in the storage unit 53, and transmits the information indicating the selected action to the terminal 40.

記憶部５３は、制御部５２が参照する各種のデータを記憶する。一例として、記憶部５３は、行動価値関数モデル５３１、及び学習データ５３２を記憶する。行動価値関数モデル５３１は、第１の行動価値関数を更新するための学習モデルである。学習データ５３２は、強化学習装置５０が行う強化学習で用いるデータである。学習データ５３２は、一例として、第１の状態、第２の状態、行動、及び報酬を含む。The storage unit 53 stores various data referenced by the control unit 52. As an example, the storage unit 53 stores an action value function model 531 and learning data 532. The action value function model 531 is a learning model for updating the first action value function. The learning data 532 is data used in the reinforcement learning performed by the reinforcement learning device 50. As an example, the learning data 532 includes a first state, a second state, an action, and a reward.

＜強化学習方法の流れ＞ <Flow of Reinforcement Learning Method>
図６は、強化学習システム２が実行する強化学習方法Ｓ２の流れを示すフロー図である。強化学習システム２は、ステップＳ２１～ステップＳ２９を繰り返すことにより、行動の選択を繰り返し行う。なお、一部のステップは並行して、又は順序を変えて実行されてもよい。 FIG. 6 is a flow diagram showing the flow of the reinforcement learning method S2 executed by the reinforcement learning system 2. The reinforcement learning system 2 repeatedly selects an action by repeating steps S21 to S29. Note that some steps may be executed in parallel or in a different order.

ステップＳ２１において、状態提供部４２１は、第１の状態ｓ_ｔを取得し、取得した第１の状態ｓ_ｔを強化学習装置５０に提供する。ステップＳ２２において、状態観測部５２２は、端末４０から第１の状態ｓ_ｔを取得する。 In step S21, the state providing unit 421 acquires a first state s_t and provides the acquired first state s_t to the reinforcement learning device 50. In step S22, the state observation unit 522 acquires the first state s_t from the terminal 40.

ステップＳ２３において、状態ランダム化部５２３は、第１の状態ｓ_ｔにノイズを付加することによって、１又は複数の第２の状態を生成する。状態ランダム化部５２３が第１の状態ｓ_ｔに付加するノイズは、一例として、正規乱数、又は一様乱数である。ただし、状態ランダム化部５２３が第１の状態ｓ_ｔに付加するノイズはこれらに限られず、上記以外のノイズであってもよい。ノイズが付加された第２の状態は、第１の状態ｓ_ｔに若干のブレが生じた状態を表す。 In step S23, the state randomization unit 523 generates one or more second states by adding noise to the first state s_t. The noise that the state randomization unit 523 adds to the first state s_t is, for example, a normal random number or a uniform random number. However, the noise added to the first state s_t is not limited to these and may be noise of another kind. A second state to which noise has been added represents the first state s_t with a slight perturbation.

本動作例において、状態ランダム化部５２３は、属性に応じ、第１の状態ｓ_ｔに含まれる複数の要素に、選択的にノイズを付加することによって第２の状態を生成する。状態ランダム化部５２３は、一例として、所定の条件を満たす属性に付随した要素にノイズを付加する。所定の条件は例えば、動的要素を示す属性である、又は、静的要素を示す属性である、といった条件である。ただし、所定の条件は上述した例に限られず、他の条件であってもよい。 In this operation example, the state randomization unit 523 generates a second state by selectively adding noise to the elements included in the first state s_t according to their attributes. As an example, the state randomization unit 523 adds noise to the elements whose attribute satisfies a predetermined condition. The predetermined condition is, for example, that the attribute indicates a dynamic element, or that the attribute indicates a static element. However, the predetermined condition is not limited to these examples and may be another condition.

また、状態ランダム化部５２３は、生成した第２の状態を含む状態列｛ｓ_ｔ ^（ｉ）｝（１≦ｉ≦ｎ；ｉは自然数、ｎは２以上の自然数）を生成する。状態列｛ｓ_ｔ ^（ｉ）｝は、第１の状態ｓ_ｔと１又は複数の第２の状態とを含む状態列、又は、複数の第２の状態を含む状態列である。換言すると、状態列｛ｓ_ｔ ^（ｉ）｝は、少なくとも第２の状態を含み、また、第１の状態ｓ_ｔを含んでいても含んでいなくてもよい。 Furthermore, the state randomization unit 523 generates a state sequence {s_t^(i)} (1 ≤ i ≤ n; i is a natural number, n is a natural number equal to or greater than 2) that includes the generated second states. The state sequence {s_t^(i)} is either a sequence including the first state s_t and one or more second states, or a sequence including multiple second states. In other words, the state sequence {s_t^(i)} includes at least a second state, and may or may not include the first state s_t.
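The attribute-selective noise of step S23 and the resulting state sequence {s_t^(i)} can be sketched as follows. This is a minimal illustration, not the implementation of the state randomization unit 523: the flat list representation of a state, the attribute labels, the function name `randomize_state`, and the choice of normal noise with standard deviation `sigma` are all assumptions made for the example.

```python
import random

def randomize_state(state, attributes, noisy_attrs, sigma, n, rng,
                    include_original=True):
    """Build a state sequence of length n from a first state.

    state       : list of floats (hypothetical flat encoding of s_t)
    attributes  : one attribute label per element, e.g. "dynamic" / "static"
    noisy_attrs : set of attribute labels whose elements receive noise
    """
    # The sequence may or may not contain the unperturbed first state.
    sequence = [list(state)] if include_original else []
    while len(sequence) < n:
        # Add normal noise only to elements whose attribute qualifies.
        noisy = [
            x + rng.gauss(0.0, sigma) if attr in noisy_attrs else x
            for x, attr in zip(state, attributes)
        ]
        sequence.append(noisy)
    return sequence

rng = random.Random(0)
s_t = [1.0, 2.0, 5.0, 5.0]
attrs = ["dynamic", "dynamic", "static", "static"]
seq = randomize_state(s_t, attrs, {"dynamic"}, sigma=0.1, n=4, rng=rng)
```

Here `seq[0]` is the first state itself and `seq[1:]` are second states in which only the elements labeled "dynamic" have been perturbed.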

ステップＳ２４において、推定部５２５は、状態列｛ｓ_ｔ ^（ｉ）｝に応じて、第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）を算出する。推定部５２５は一例として、状態列｛ｓ_ｔ ^（ｉ）｝に含まれる複数の状態ｓ_ｔ ^（ｉ）のそれぞれについて、第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）を算出する。より具体的には、推定部５２５は一例として、上記式（１）により状態ｓ_ｔ ^（ｉ）についての第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）を更新する。本動作例において、第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）はｍ次元（ｍは２以上の整数）のベクトルであり、ｍは集合Ａの要素数（すなわち行動ａの種類数）である。 In step S24, the estimation unit 525 calculates a first action value function Q(s_t^(i), a) according to the state sequence {s_t^(i)}. As an example, the estimation unit 525 calculates the first action value function Q(s_t^(i), a) for each of the multiple states s_t^(i) included in the state sequence {s_t^(i)}. More specifically, as an example, the estimation unit 525 updates the first action value function Q(s_t^(i), a) for the state s_t^(i) using the above formula (1). In this operation example, the first action value function Q(s_t^(i), a) is an m-dimensional vector (m is an integer equal to or greater than 2), where m is the number of elements in the set A (i.e., the number of types of action a).

ステップＳ２５において、推定部５２５は、算出した複数の第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）に基づいて第２の行動価値関数Ｊ（ｓ_ｔ，ａ）を算出する。第２の行動価値関数Ｊ（ｓ_ｔ，ａ）は一例として、上記式（２）又は式（３）により与えられる。換言すると、推定部５２５は、上記式（２）又は式（３）により与えられる第２の行動価値関数を算出する。上記式（２）又は上記式（３）により与えられる第２の行動価値関数は、複数の第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）のばらつきが大きいほど第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）の期待値より低い値となる関数である。 In step S25, the estimation unit 525 calculates a second action value function J(s_t, a) based on the calculated multiple first action value functions Q(s_t^(i), a). As an example, the second action value function J(s_t, a) is given by the above formula (2) or formula (3). In other words, the estimation unit 525 calculates the second action value function given by the above formula (2) or formula (3). The second action value function given by formula (2) or formula (3) takes a value that falls further below the expected value of the first action value functions Q(s_t^(i), a) as the variation among them increases.
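The property stated for step S25, namely that J drops below the mean of the Q values as their spread grows, can be illustrated with two stand-in functions. Formulas (2) and (3) are defined earlier in the document and are not reproduced here; the exponential (entropic) form and the mean-minus-variance form below are assumptions chosen only because they share that property, and they may differ from the actual formulas. All function names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def j_mean_variance(q_values, theta):
    """Assumed stand-in without exponentials: mean penalized by variance."""
    return mean(q_values) - theta * variance(q_values)

def j_exponential(q_values, theta):
    """Assumed stand-in with exponentials: -(1/theta) * log(mean(exp(-theta*Q)))."""
    if theta == 0:
        return mean(q_values)  # limit of the expression as theta -> 0
    z = [-theta * q for q in q_values]
    m = max(z)  # shift by the maximum before exponentiating (log-sum-exp)
    log_mean_exp = m + math.log(sum(math.exp(v - m) for v in z) / len(z))
    return -log_mean_exp / theta

def second_action_values(q_table, theta, j=j_mean_variance):
    """q_table[i][a] holds Q(s_t^(i), a); returns one J value per action a."""
    n_actions = len(q_table[0])
    return [j([row[a] for row in q_table], theta) for a in range(n_actions)]

# Three states in the sequence, two actions. Action 0 is stable across the
# perturbed states; action 1 has the same mean but a large spread.
q_table = [[1.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
j_mv = second_action_values(q_table, theta=0.1)
j_exp = second_action_values(q_table, theta=0.1, j=j_exponential)
```

In this example both stand-ins leave the stable action at its mean value and score the high-variance action lower, even though the two actions have the same mean Q value.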

ステップＳ２６において、選択部５２６は、第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）に基づいて算出される第２の行動価値関数Ｊ（ｓ_ｔ，ａ）に応じて、行動ａを選択する。選択部５２６は一例として、上記式（４）により与えられる方策により行動ａを選択する。なお、行動ａを選択する方策は上記式（４）により与えられる方策に限られず、εグリーディ方策、ソフトマックス手法等の他の方策が用いられてもよい。選択部５２６は、選択した行動ａを端末４０に通知する。 In step S26, the selection unit 526 selects an action a according to the second action value function J(s_t, a) calculated based on the first action value functions Q(s_t^(i), a). As an example, the selection unit 526 selects the action a using the policy given by the above formula (4). Note that the policy for selecting the action a is not limited to the one given by formula (4), and other policies such as the ε-greedy policy or the softmax method may be used. The selection unit 526 notifies the terminal 40 of the selected action a.
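The selection policies mentioned for step S26 can be sketched as follows. Formula (4) is not reproduced in this passage; the sketch assumes it is a greedy choice that maximizes J(s_t, a), which is an assumption rather than a statement of the actual formula. The ε-greedy and softmax variants are the standard textbook forms, and the function names and the `temperature` parameter are hypothetical.

```python
import math
import random

def greedy(j_values):
    """Index of the action with the largest J value (assumed form of formula (4))."""
    return max(range(len(j_values)), key=lambda a: j_values[a])

def epsilon_greedy(j_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(j_values))
    return greedy(j_values)

def softmax_probabilities(j_values, temperature=1.0):
    """Selection probabilities proportional to exp(J / temperature)."""
    m = max(j_values)  # subtract the maximum for numerical stability
    weights = [math.exp((j - m) / temperature) for j in j_values]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
j = [0.2, 0.9, 0.5]
a_greedy = greedy(j)
a_eps = epsilon_greedy(j, epsilon=0.0, rng=rng)
probs = softmax_probabilities(j)
```

With `epsilon=0.0` the ε-greedy policy coincides with the greedy one; raising `epsilon` or `temperature` increases exploration.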

ステップＳ２７において、行動実行部４２２は、強化学習装置５０から通知された行動ａを実行する。ステップＳ２８において、報酬提供部４２３は、強化学習装置５０が選択した行動を実行して得られた報酬ｒ_ｔを、強化学習装置５０に提供する。ステップＳ２９において、報酬取得部５２１は、状態列｛ｓ_ｔ ^（ｉ）｝、及び報酬ｒ_ｔを含む学習データを蓄積する。 In step S27, the action execution unit 422 executes the action a notified by the reinforcement learning device 50. In step S28, the reward providing unit 423 provides the reinforcement learning device 50 with the reward r_t obtained by executing the action selected by the reinforcement learning device 50. In step S29, the reward acquisition unit 521 accumulates learning data including the state sequence {s_t^(i)} and the reward r_t.

＜強化学習システムの効果＞ <Effects of Reinforcement Learning System>
強化学習においては、状態が若干異なっているだけで行動価値関数の値が大きく異なる場合がある。換言すると、状態の若干の差分が行動価値関数の値に大きな影響を及ぼす場合がある。本例示的実施形態では、第１の状態ｓ_ｔにあえて若干のノイズを加えた第２の状態を用いて第１の行動価値関数Ｑを算出することにより、状態のばらつきを考慮した第１の行動価値関数Ｑを算出することができる。この第１の行動価値関数Ｑを用いて行動ａを選択することにより、本例示的実施形態によれば、行動ａをより適切に選択することができる。 In reinforcement learning, the value of the action value function may differ greatly even when the state differs only slightly. In other words, a slight difference in the state may have a large effect on the value of the action value function. In this exemplary embodiment, the first action value function Q is calculated using a second state in which slight noise is intentionally added to the first state s_t, so that the first action value function Q can be calculated taking the variation in the state into account. By selecting the action a using this first action value function Q, this exemplary embodiment makes it possible to select the action a more appropriately.

また、本例示的実施形態に係る強化学習システム２においては、ノイズを付加した第２の状態を含む複数の状態ｓ_ｔ ^（ｉ）に応じて第１の行動価値関数Ｑを算出する構成が採用されている。このため、本例示的実施形態に係る強化学習システム２によれば、例示的実施形態１に係る強化学習システム１の奏する効果に加えて、より適切な行動ａを選択できるという効果が得られる。 Furthermore, the reinforcement learning system 2 according to this exemplary embodiment employs a configuration in which the first action value function Q is calculated according to a plurality of states s_t^(i) including the second states to which noise is added. Therefore, in addition to the effects of the reinforcement learning system 1 according to exemplary embodiment 1, the reinforcement learning system 2 according to this exemplary embodiment has the effect of being able to select a more appropriate action a.

また、本例示的実施形態において、強化学習システム２が上述の式（２）を用いて第２の行動価値関数Ｊを算出する場合、第２の行動価値関数Ｊは、高次の影響を含めたリスク（ばらつき）に敏感な指標となる。第２の行動価値関数Ｊを用いて強化学習システム２が行動ａを選択することで、よりリスクに敏感な行動ａの選択を行うことができる。In addition, in this exemplary embodiment, when the reinforcement learning system 2 calculates the second action value function J using the above-mentioned formula (2), the second action value function J becomes an index that is sensitive to risk (variability) including higher-order influences. By the reinforcement learning system 2 selecting an action a using the second action value function J, it is possible to select an action a that is more sensitive to risk.

また、本例示的実施形態において、強化学習システム２が上述の式（３）を用いて第２の行動価値関数Ｊを算出する場合、式（３）には指数演算が含まれないため計算処理において桁あふれが発生することがない。第２の行動価値関数Ｊを用いて強化学習システム２が行動ａを選択することにより、行動ａをより好適に選択できるとともに、行動ａの選択に係る処理負荷が軽減される。In addition, in this exemplary embodiment, when the reinforcement learning system 2 calculates the second action value function J using the above-mentioned formula (3), no exponential operation is included in formula (3), so that no overflow occurs in the calculation process. By the reinforcement learning system 2 selecting action a using the second action value function J, action a can be selected more preferably, and the processing load related to the selection of action a is reduced.

〔例示的実施形態３〕 [Exemplary Embodiment 3]
本発明の例示的実施形態３について、図面を参照して説明する。なお、例示的実施形態１～２にて説明した構成要素と同じ機能を有する構成要素については、同じ符号を付記し、その説明を繰り返さない。 A third exemplary embodiment of the present invention will be described with reference to the drawings. Note that components having the same functions as those described in the first and second exemplary embodiments are denoted by the same reference numerals, and their description will not be repeated.

＜強化学習システムの構成＞ <Configuration of Reinforcement Learning System>
本例示的実施形態に係る強化学習システム（以下「強化学習システム３」という）は、上記例示的実施形態２に係る強化学習システム２を、コンピュータゲームの自律プレイに適用したものである。強化学習システム３は、上述の例示的実施形態２において図５に示した強化学習システム２と同様の構成を有する。強化学習システム３の構成要素については、強化学習システム２の構成要素と同様であり、ここではその説明を繰り返さない。 The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 3") is obtained by applying the reinforcement learning system 2 according to exemplary embodiment 2 to autonomous play of a computer game. The reinforcement learning system 3 has the same configuration as the reinforcement learning system 2 shown in FIG. 5 for exemplary embodiment 2. The components of the reinforcement learning system 3 are the same as those of the reinforcement learning system 2, and their description will not be repeated here.

本例示的実施形態において、第１の状態ｓ_ｔは、一例として、コンピュータゲームにおいてゲームの進行に影響を与えるオブジェクトの状態を含む。行動ａは、一例として、コンピュータゲームのプレイヤにより操作されるオブジェクトの動作を含む。報酬ｒ_ｔは、一例として、ゲームの勝敗、又はゲームのスコアに関する報酬を含む。 In this exemplary embodiment, the first state s_t includes, for example, the state of an object that affects the progress of the computer game. The action a includes, for example, a motion of an object operated by a player of the computer game. The reward r_t includes, for example, a reward related to winning or losing the game or to the score of the game.

図７は、強化学習システム３に係るコンピュータゲームのゲーム画面の一例である画面ＳＣ１を示す図である。画面ＳＣ１は、第１動的オブジェクトＣ１１、第２動的オブジェクトＣ２１～Ｃ２３、第１静的オブジェクトＣ３１～Ｃ３４、及び第２静的オブジェクトＣ４を含む。第１動的オブジェクトＣ１１、第２動的オブジェクトＣ２１～Ｃ２３、第１静的オブジェクトＣ３１～Ｃ３４、及び第２静的オブジェクトＣ４は、ゲームの進行に影響を与えるオブジェクトの例である。 Figure 7 is a diagram showing a screen SC1, which is an example of a game screen of a computer game related to the reinforcement learning system 3. The screen SC1 includes a first dynamic object C11, second dynamic objects C21-C23, first static objects C31-C34, and a second static object C4. The first dynamic object C11, the second dynamic objects C21-C23, the first static objects C31-C34, and the second static object C4 are examples of objects that affect the progress of the game.

図７に係るコンピュータゲームは、迷路内を移動する第１動的オブジェクトＣ１１の移動方向をゲームのプレイヤが指定し、第２動的オブジェクトＣ２１～Ｃ２３の追跡をかわしながら迷路内に配置された第１静的オブジェクトＣ３１～Ｃ３４を回収するとラウンドクリアとなるゲームである。 The computer game of Figure 7 is a game in which the player of the game specifies the direction of movement of a first dynamic object C11 moving within a maze, and the round is cleared when the player collects first static objects C31 to C34 placed within the maze while evading pursuit by second dynamic objects C21 to C23.

第１動的オブジェクトＣ１１、及び第２動的オブジェクトＣ２１～Ｃ２３は、ゲームの進行中において画面上を移動するオブジェクトであり、環境内を移動する動的要素の一例である。一方、第１静的オブジェクトＣ３１～Ｃ３４及び第２静的オブジェクトＣ４は、ゲームの進行中において画面上を移動しないオブジェクトであり、環境内を移動しない静的要素の一例である。第１動的オブジェクトＣ１１は、プレイヤの操作対象のオブジェクトである。第１動的オブジェクトＣ１１はゲームの進行中において迷路内を一定の速度で移動し、プレイヤの操作に応じて移動方向を変更する。第２動的オブジェクトＣ２１～Ｃ２３は、ゲームの進行中において第１動的オブジェクトＣ１１を追従して移動するオブジェクトである。図７では３つの第２動的オブジェクトＣ２１～Ｃ２３を図示しているが、第２動的オブジェクトの数は３に限られず、これより多くても少なくてもよい。The first dynamic object C11 and the second dynamic objects C21 to C23 are objects that move on the screen during the game, and are an example of a dynamic element that moves within the environment. On the other hand, the first static objects C31 to C34 and the second static object C4 are objects that do not move on the screen during the game, and are an example of a static element that does not move within the environment. The first dynamic object C11 is an object that is operated by the player. The first dynamic object C11 moves at a constant speed within the maze during the game, and changes the direction of movement according to the player's operation. The second dynamic objects C21 to C23 are objects that move following the first dynamic object C11 during the game. Although three second dynamic objects C21 to C23 are illustrated in FIG. 7, the number of second dynamic objects is not limited to three, and may be more or less than this.

第１静的オブジェクトＣ３１～Ｃ３４は、迷路内に配置され、第１動的オブジェクトＣ１１により回収されるオブジェクトである。第１動的オブジェクトＣ１１が第１静的オブジェクトＣ３１～Ｃ３４に衝突することにより第１静的オブジェクトＣ３１～Ｃ３４が第１動的オブジェクトＣ１１により回収される。図７では４つの第１静的オブジェクトＣ３１～Ｃ３４を図示しているが、第１静的オブジェクトの数は４に限られず、これより多くても少なくてもよい。第２静的オブジェクトＣ４は、迷路を構成する壁である。The first static objects C31-C34 are objects that are placed in the maze and are collected by the first dynamic object C11. When the first dynamic object C11 collides with the first static objects C31-C34, the first static objects C31-C34 are collected by the first dynamic object C11. Although four first static objects C31-C34 are illustrated in FIG. 7, the number of first static objects is not limited to four and may be more or less than this. The second static object C4 is a wall that constitutes the maze.

図７の例において、第１の状態ｓ_ｔは、第１動的オブジェクトＣ１１、第２動的オブジェクトＣ２１～Ｃ２３、第１静的オブジェクトＣ３１～Ｃ３４、及び第２静的オブジェクトＣ４に関する状態を含む。換言すると、第１の状態は、環境内を移動する動的要素に関する状態、及び、環境内を移動しない静的要素に関する状態を含む。より具体的には、第１の状態ｓ_ｔは、第１動的オブジェクトＣ１１の位置、第２動的オブジェクトＣ２１～Ｃ２３の位置、第１静的オブジェクトＣ３１～Ｃ３４の位置、及び第２静的オブジェクトＣ４の位置、を含む。 In the example of FIG. 7, the first state s_t includes states related to the first dynamic object C11, the second dynamic objects C21 to C23, the first static objects C31 to C34, and the second static object C4. In other words, the first state includes states related to dynamic elements that move within the environment and states related to static elements that do not move within the environment. More specifically, the first state s_t includes the position of the first dynamic object C11, the positions of the second dynamic objects C21 to C23, the positions of the first static objects C31 to C34, and the position of the second static object C4.

本例示的実施形態において、第１の状態ｓ_ｔは、ゲームのプレイ画面を表す画像である。 In this exemplary embodiment, the first state s_t is an image representing the game play screen.
図８は、第１の状態ｓ_ｔの一例である画像Ｉｍｇ１１を示す図である。画像Ｉｍｇ１１は、ゲーム画面に含まれる要素を０～２５５の画素値により表現したグレースケール画像である。画像Ｉｍｇ１１は所定数のマスに分割されており、各マスに位置する要素の属性に応じた画素値で各マスが表現される。一例として、第１動的オブジェクトＣ１１の位置は画素値が２５５、第２動的オブジェクトＣ２１～Ｃ２３の位置は画素値が１６０、第１静的オブジェクトＣ３１～Ｃ３４の位置は画素値が１２８、第２静的オブジェクトＣ４により形成される通路の位置は画素値が６４、移動不可の場所は画素値が０、で表される。 FIG. 8 is a diagram showing an image Img11, which is an example of the first state s_t. The image Img11 is a grayscale image in which the elements included in the game screen are represented by pixel values from 0 to 255. The image Img11 is divided into a predetermined number of squares, and each square is represented by a pixel value corresponding to the attribute of the element located in it. As an example, the position of the first dynamic object C11 is represented by a pixel value of 255, the positions of the second dynamic objects C21 to C23 by 160, the positions of the first static objects C31 to C34 by 128, the positions of the passages formed by the second static object C4 by 64, and locations where movement is not possible by 0.
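The grid-to-grayscale encoding described for the image Img11 can be sketched as a simple lookup. The pixel values are those given in the example above; the label names ("player", "chaser", "item", "passage", "blocked"), the function name `encode_state`, and the tiny 3 × 3 grid are assumptions introduced only for this illustration.

```python
# Pixel value per element attribute, following the example values in the text.
PIXEL = {
    "player": 255,   # first dynamic object C11
    "chaser": 160,   # second dynamic objects C21 to C23
    "item": 128,     # first static objects C31 to C34
    "passage": 64,   # passage formed by the second static object C4
    "blocked": 0,    # location where movement is not possible
}

def encode_state(grid):
    """Convert a grid of attribute labels into a grid of grayscale pixel values."""
    return [[PIXEL[cell] for cell in row] for row in grid]

grid = [
    ["blocked", "blocked", "blocked"],
    ["player",  "passage", "item"],
    ["blocked", "chaser",  "blocked"],
]
image = encode_state(grid)
```

Each row of `image` is then a row of the grayscale first state, with one value per square.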

本例示的実施形態において、行動ａは、第１動的オブジェクトＣ１１の移動であり、上に移動、下に移動、右に移動、左に移動、の４種類である。報酬ｒ_ｔは、一例として、スコアがアップした場合に得られる所定の加算値（例えば、＋１）、及び第２動的オブジェクトＣ２１～Ｃ２３に捕獲された場合に得られる所定の減算値（例えば、－１０）である。１回の行動においてアップしたスコアの程度に関わらず、行動によりスコアがアップした場合に所定の加算値（例えば、＋１）が報酬ｒ_ｔとして得られてもよい。 In this exemplary embodiment, the action a is a movement of the first dynamic object C11 and is one of four types: move up, move down, move right, and move left. The reward r_t is, for example, a predetermined positive value (e.g., +1) obtained when the score increases, and a predetermined negative value (e.g., -10) obtained when the first dynamic object C11 is captured by one of the second dynamic objects C21 to C23. The predetermined positive value (e.g., +1) may be given as the reward r_t whenever an action increases the score, regardless of how much the score increased in that action.
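The example reward scheme can be written as a small function. The values +1 and -10 come from the text; the function name `reward`, the zero reward when nothing happens, and the rule that capture takes precedence over a score increase in the same step are assumptions, since the text does not specify those cases.

```python
def reward(score_before, score_after, captured, bonus=1, penalty=-10):
    """Reward r_t for one action.

    Assumptions (not stated in the text): capture takes precedence over a
    score increase, and the reward is 0 when neither event occurs.
    """
    if captured:
        return penalty
    if score_after > score_before:
        # Fixed bonus regardless of how much the score increased.
        return bonus
    return 0

r_small_gain = reward(100, 110, captured=False)
r_large_gain = reward(100, 600, captured=False)
r_captured = reward(100, 100, captured=True)
```

Both score increases yield the same fixed bonus, which reflects the statement that the size of the increase does not matter.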

＜強化学習方法の流れ＞ <Flow of Reinforcement Learning Method>
強化学習システム３は、上述の例示的実施形態２に係る図６の強化学習方法Ｓ２を実行する。以下では、本例示的実施形態において特徴的な動作について主に説明し、上述の例示的実施形態２で説明した内容についてはその説明を繰り返さない。 The reinforcement learning system 3 executes the reinforcement learning method S2 of FIG. 6 according to exemplary embodiment 2. The following mainly describes the operations characteristic of this exemplary embodiment, and the content already described for exemplary embodiment 2 will not be repeated.

本例示的実施形態では、ステップＳ２３において、状態ランダム化部５２３は、第１の状態ｓ_ｔに含まれる動的要素の状態にノイズを付加することによって第２の状態を生成する。状態ランダム化部５２３は、一例として、第１動的オブジェクトＣ１１の位置及び第２動的オブジェクトＣ２１～Ｃ２３の位置をランダムウォークによりランダム化した第２の状態を生成する。 In this exemplary embodiment, in step S23, the state randomization unit 523 generates the second state by adding noise to the states of the dynamic elements included in the first state s_t. As an example, the state randomization unit 523 generates a second state in which the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23 are randomized by a random walk.

より具体的には、状態ランダム化部５２３は、一例として、ゲーム画面を所定数のマスに分割（例えば、３３×３３マスに分割）し、前後左右の進行できる方向（道のある方向）に１マス進む／進まない確率を、等確率に選択する。状態ランダム化部５２３は、第１動的オブジェクトＣ１１の位置及び第２動的オブジェクトＣ２１～Ｃ２３についてσ^２回（σは１以上の整数）のランダムウォークを実施する。σ^２回のランダムウォークの実施により、動的要素は平均でσマスだけ移動する。 More specifically, as an example, the state randomization unit 523 divides the game screen into a predetermined number of squares (e.g., 33 × 33 squares) and selects, with equal probability, among advancing one square in each passable direction (forward, backward, left, or right; that is, a direction in which there is a path) and not advancing. The state randomization unit 523 performs a random walk of σ^2 steps (σ is an integer equal to or greater than 1) on the position of the first dynamic object C11 and the positions of the second dynamic objects C21 to C23. As a result of the σ^2-step random walk, each dynamic element moves by σ squares on average.
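The σ^2-step random walk on the grid can be sketched as follows. This is one possible reading, stated as an assumption: at each step the walker chooses uniformly among staying in place and moving one square in each passable direction. The representation of the maze as a set of passable (x, y) squares and the names `random_walk` and `passable` are hypothetical.

```python
import random

# Unit moves: up, down, left, right on the grid of squares.
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]

def random_walk(pos, passable, sigma, rng):
    """Perturb a position with sigma**2 random-walk steps inside the maze.

    pos      : (x, y) square of a dynamic object
    passable : set of (x, y) squares that can be entered (the paths)
    """
    x, y = pos
    for _ in range(sigma * sigma):
        options = [(x, y)]  # not advancing is one of the equally likely options
        for dx, dy in MOVES:
            if (x + dx, y + dy) in passable:
                options.append((x + dx, y + dy))
        x, y = rng.choice(options)
    return (x, y)

rng = random.Random(0)
corridor = {(i, 0) for i in range(9)}  # a straight horizontal passage
start = (4, 0)
end = random_walk(start, corridor, sigma=2, rng=rng)
```

Applying `random_walk` independently to each dynamic object's position yields one second state; repeating it n-1 times yields the perturbed members of the state sequence. The displacement of a random walk grows roughly with the square root of the number of steps, which is consistent with σ^2 steps producing a displacement on the order of σ squares.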

ステップＳ２３において状態ランダム化部５２３が生成する状態列｛ｓ_ｔ ^（ｉ）｝は、第１の状態ｓ_ｔ、及び、第１の状態ｓ_ｔをランダム化した（ｎ－１）個の第２の状態、の計ｎ個の状態を含む。また、行動ａが上に移動、下に移動、右に移動、左に移動の４種類であるため、ステップＳ２４で推定部５２５が算出する第１の行動価値関数Ｑ（ｓ_ｔ ^（ｉ），ａ）は、４次元のベクトルである。 The state sequence {s_t^(i)} generated by the state randomization unit 523 in step S23 includes a total of n states: the first state s_t and (n-1) second states obtained by randomizing the first state s_t. In addition, since there are four types of action a (move up, move down, move right, and move left), the first action value function Q(s_t^(i), a) calculated by the estimation unit 525 in step S24 is a four-dimensional vector.

ステップＳ２６において、選択部５２６は、第１動的オブジェクトＣ１１の移動方向を、交差点又は角（すなわち、移動方向を変更できる地点）において、行動ａとして上下左右の４種類からいずれかを選択する。ただし、選択部５２６は、第１動的オブジェクトＣ１１が移動できない方向は除外する。In step S26, the selection unit 526 selects one of four types of movement direction of the first dynamic object C11, up, down, left, or right, as action a at an intersection or corner (i.e., a point where the movement direction can be changed). However, the selection unit 526 excludes directions in which the first dynamic object C11 cannot move.
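Excluding the directions in which the first dynamic object cannot move can be sketched as a masked greedy choice. The greedy rule, the dictionary keyed by direction names, and the function name `select_direction` are assumptions for illustration; the text only states that immovable directions are excluded from the four candidates.

```python
def select_direction(j_values, movable):
    """Pick the movable direction with the largest J value.

    j_values : dict mapping direction name to J(s_t, a)
    movable  : set of directions in which the object can actually move
    """
    candidates = [a for a in j_values if a in movable]
    if not candidates:
        raise ValueError("no movable direction at this point")
    return max(candidates, key=lambda a: j_values[a])

j_values = {"up": 0.9, "down": 0.1, "left": 0.4, "right": 0.6}
# At this hypothetical corner, a wall blocks the upward direction.
chosen = select_direction(j_values, movable={"down", "left", "right"})
```

Although "up" has the largest J value, it is excluded because the object cannot move in that direction, so the best remaining direction is selected.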

＜本例示的実施形態の評価＞ <Evaluation of This Exemplary Embodiment>
図９～図１２はそれぞれ、強化学習システム３に係るコンピュータゲームの自律プレイの評価結果の一例を示す図である。本例示的実施形態に係るコンピュータゲームにおいて、第１動的オブジェクトのライフは１機とし、第１動的オブジェクトが第２動的オブジェクトに捕獲されるとゲームオーバーとした。また、ステージは１ステージとし、ゲームをクリアすれば、すなわち全ての第１静的オブジェクトを全て回収すれば終了とした。 FIGS. 9 to 12 are diagrams each showing an example of evaluation results of autonomous play of a computer game by the reinforcement learning system 3. In the computer game according to this exemplary embodiment, the first dynamic object has one life, and the game is over when the first dynamic object is captured by a second dynamic object. In addition, there is one stage, and the game ends when it is cleared, that is, when all the first static objects have been collected.

図９～図１２の例では、強化学習システム３の強化学習におけるσ及びθの値を変更した複数の条件において強化学習システム３がコンピュータゲームの自律プレイを行った結果を評価した。また、強化学習システム３ではない、従来の強化学習の手法による自律プレイの結果も比較対象とした。従来の強化学習の手法としては、ＤＱＮ（deep Q-network）の手法において行動選択の方策を改良したものを用いた。 In the examples of FIGS. 9 to 12, the results of autonomous play of the computer game by the reinforcement learning system 3 were evaluated under multiple conditions with different values of σ and θ in the reinforcement learning. The results of autonomous play using a conventional reinforcement learning method, rather than the reinforcement learning system 3, were also used for comparison. The conventional method used was a DQN (deep Q-network) method with an improved action selection policy.

図９は、σ＝２の場合の自律プレイによるスコアを表すグラフである。σは、上述したようにランダムウォークにおける平均移動回数である。図９において、縦軸はスコアを示す。グラフｇ９１は、従来の強化学習による自律プレイのスコアの平均値を示す。グラフｇ１１～ｇ１４は、強化学習システム３の自律プレイによるスコアの平均値を表す。グラフｇ１１～ｇ１４は、第２の行動価値関数Ｊを表す式（上記式（２）又は式（３））のハイパーパラメータθの値がそれぞれ異なっている。グラフｇ１１～ｇ１４はそれぞれ、ハイパーパラメータθを「０」、「０．００１」、「０．０１」、「０．１」とした場合のスコアの平均値を表すグラフである。 Figure 9 is a graph showing scores from autonomous play when σ = 2. As mentioned above, σ is the average number of moves in a random walk. In Figure 9, the vertical axis shows the score. Graph g91 shows the average score from autonomous play using conventional reinforcement learning. Graphs g11 to g14 show the average score from autonomous play using the reinforcement learning system 3. Graphs g11 to g14 each have a different value for the hyperparameter θ in the equation (equation (2) or equation (3) above) representing the second action value function J. Graphs g11 to g14 show the average score when the hyperparameter θ is set to "0", "0.001", "0.01", and "0.1", respectively.

グラフｇ９１と、グラフｇ１１～ｇ１４とを比較すると、従来の強化学習によるスコアよりも、本例示的実施形態に係る強化学習システム３のスコアのほうが高く、特にハイパーパラメータθの値を「０．０１」とした場合のスコアが高くなっている。 Comparing graph g91 with graphs g11 to g14, the score of the reinforcement learning system 3 according to this exemplary embodiment is higher than the score obtained by conventional reinforcement learning, and the score is particularly high when the value of the hyperparameter θ is set to "0.01."

図１０は、σ＝２の場合の自律プレイによる第１静的オブジェクトの回収率を表すグラフである。図１０において、縦軸は回収率を示す。グラフｇ９２は、従来の強化学習による自律プレイの回収率の平均値を示す。グラフｇ２１～ｇ２４は、強化学習システム３の自律プレイによる回収率の平均値を表す。グラフｇ２１～ｇ２４は、第２の行動価値関数Ｊを表す式（上記式（２）又は式（３））のハイパーパラメータθの値がそれぞれ異なっている。グラフｇ２１～ｇ２４はそれぞれ、ハイパーパラメータθを「０」、「０．００１」、「０．０１」、「０．１」とした場合の回収率の平均値を表すグラフである。 Figure 10 is a graph showing the recovery rate of the first static object through autonomous play when σ = 2. In Figure 10, the vertical axis shows the recovery rate. Graph g92 shows the average recovery rate of autonomous play through conventional reinforcement learning. Graphs g21 to g24 show the average recovery rate of autonomous play through reinforcement learning system 3. Graphs g21 to g24 each have a different value for the hyperparameter θ of the equation (equation (2) or equation (3) above) representing the second action value function J. Graphs g21 to g24 are graphs showing the average recovery rate when the hyperparameter θ is set to "0", "0.001", "0.01", and "0.1", respectively.

グラフｇ９２と、グラフｇ２１～ｇ２４とを比較すると、従来の強化学習による回収率よりも、本例示的実施形態に係る強化学習システム３の回収率のほうが高い傾向があり、特にハイパーパラメータθの値を「０．０１」とした場合の回収率が高くなっている。 Comparing graph g92 with graphs g21 to g24, the recovery rate of the reinforcement learning system 3 according to this exemplary embodiment tends to be higher than the recovery rate of conventional reinforcement learning, and the recovery rate is particularly high when the value of the hyperparameter θ is set to "0.01."

図１１は、自律プレイによるスコアとσとの関係を表すグラフである。図１１において、横軸はσを示し、縦軸はスコアを示す。グラフｇ３１～ｇ３４はそれぞれ、ハイパーパラメータθが「０」、「０．００１」、「０．０１」、「０．１」である場合における、σが１～５の場合のスコアの平均値を表す。なお、従来の強化学習による自律プレイのスコアの平均値は「２００９」である。 Figure 11 is a graph showing the relationship between score and σ in autonomous play. In Figure 11, the horizontal axis shows σ, and the vertical axis shows score. Graphs g31 to g34 show the average score when σ is 1 to 5, and the hyperparameter θ is "0", "0.001", "0.01", and "0.1", respectively. Note that the average score in autonomous play using conventional reinforcement learning is "2009".

図１１の例では、σの値が１～３の場合のスコア値が、従来の強化学習によるスコアよりも高くなっていることが多い。特に、θ＝０．０１、σ＝２の場合のスコアが他と比較して高くなっている。In the example of Figure 11, the scores for σ values between 1 and 3 are often higher than those obtained with conventional reinforcement learning. In particular, the score for θ = 0.01 and σ = 2 is higher than the others.

図１２は、自律プレイによる回収率とσとの関係を表すグラフである。図１２において、横軸はσを示し、縦軸は回収率を示す。グラフｇ４１～ｇ４４はそれぞれ、ハイパーパラメータθが「０」、「０．００１」、「０．０１」、「０．１」である場合のσの値毎の回収率の平均値を表す。なお、従来の強化学習による自律プレイの回収率の平均値は６７．５％である。 Figure 12 is a graph showing the relationship between recovery rate and σ due to autonomous play. In Figure 12, the horizontal axis shows σ, and the vertical axis shows recovery rate. Graphs g41 to g44 show the average recovery rate for each value of σ when the hyperparameter θ is "0", "0.001", "0.01", and "0.1", respectively. The average recovery rate for autonomous play using conventional reinforcement learning is 67.5%.

図１２の例では、σの値が１～３の場合の回収率が、従来の強化学習による回収率よりも高くなっているものが多い。特に、θ＝０．０１、σ＝２の場合の回収率が他と比較して高くなっている。In the example of Figure 12, the recovery rate for σ values between 1 and 3 is often higher than that achieved by conventional reinforcement learning. In particular, the recovery rate for θ = 0.01 and σ = 2 is higher than the others.

以上説明したように本例示的実施形態によれば、強化学習システム３は、第１の状態にノイズを付加した第２の状態を用いて第１の行動価値関数を算出することにより、コンピュータゲームの自律プレイにおける行動の選択をより好適に行うことができる。As described above, according to this exemplary embodiment, the reinforcement learning system 3 is able to more appropriately select actions in autonomous play of a computer game by calculating a first action value function using a second state in which noise is added to the first state.

[Exemplary Embodiment 4]
A fourth exemplary embodiment of the present invention will be described. Components having the same functions as those described in the first to third exemplary embodiments are designated by the same reference numerals, and their description is not repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 4") applies the reinforcement learning system 2 according to exemplary embodiment 2 to the control of construction machinery such as an excavator that excavates soil. The reinforcement learning system 4 has the same configuration as the reinforcement learning system 2 shown in Figure 5 for exemplary embodiment 2. Its components are the same as those of the reinforcement learning system 2, and their description is not repeated here.

The reinforcement learning system 4 uses reinforcement learning to select operations of the construction machine, such as the excavation operation of a hydraulic excavator excavating soil. As an example, the objective of the action in reinforcement learning is to excavate a full bucket of soil without the vehicle body tilting or being dragged during excavation.

In this exemplary embodiment, the first state s_t includes, as an example, some or all of the following: the attitude and position of a construction machine such as a hydraulic excavator, the shape of the soil to be excavated (3D data, etc.), and the amount of soil in the bucket of the excavator. The attitude of the construction machine includes, as an example, the angles of the bucket, arm, boom, and rotating body of the construction machine. The position of the construction machine includes, as an example, the position and direction of the crawler of the construction machine.

The action a includes, as an example, attitude control of the construction machine (angle control of the bucket, arm, boom, and rotating body, etc.). The reward r_t includes, as an example, some or all of the following: a positive reward whose absolute value increases with the amount excavated, and a negative reward whose absolute value increases with the degree of tilt of the vehicle body, the degree of dragging, or the time taken for excavation.
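The reward structure described above can be illustrated with a minimal sketch. The function name, the argument names, and all weight values are hypothetical and are not taken from the embodiment; the sketch only assumes that the positive and negative terms are combined linearly.

```python
def excavation_reward(excavated_volume, body_tilt, drag_distance, elapsed_time,
                      w_volume=1.0, w_tilt=1.0, w_drag=1.0, w_time=0.1):
    """Illustrative reward r_t for one excavation (all weights are assumed values)."""
    # Positive reward: larger when more soil is excavated.
    positive = w_volume * excavated_volume
    # Negative reward: larger in magnitude for more tilt, more dragging, or more time.
    negative = (w_tilt * abs(body_tilt)
                + w_drag * abs(drag_distance)
                + w_time * elapsed_time)
    return positive - negative
```

In practice the weights would have to be tuned so that no single term dominates the reward.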

The state randomization unit 523 may add noise to all of the elements included in the first state s_t, or to only some of them. When noise is added to only some of the elements, those elements may include, for example, the attitude of the hydraulic excavator and the 3D data of the observed soil.
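Adding noise to only some elements of the first state can be sketched as follows. The dictionary layout, the key names, and the use of zero-mean Gaussian noise with standard deviation `sigma` are assumptions made for illustration.

```python
import numpy as np

def randomize_state(state, noisy_keys, sigma, rng):
    """Return a second state: Gaussian noise is added only to elements in noisy_keys."""
    second_state = {}
    for key, value in state.items():
        value = np.asarray(value, dtype=float)
        if key in noisy_keys:
            value = value + rng.normal(0.0, sigma, size=value.shape)
        second_state[key] = value
    return second_state

rng = np.random.default_rng(0)
first_state = {
    "joint_angles": np.array([0.3, -0.8, 1.2, 0.0]),  # bucket, arm, boom, rotating body
    "soil_heightmap": np.zeros((4, 4)),                # stand-in for observed 3D soil data
    "bucket_fill": np.array(0.25),                     # left untouched in this example
}
second_state = randomize_state(
    first_state, {"joint_angles", "soil_heightmap"}, sigma=0.01, rng=rng
)
```

Passing every key in `noisy_keys` corresponds to the case where noise is added to all elements.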

According to this exemplary embodiment, the reinforcement learning system 4 calculates the first action-value function using a second state obtained by adding noise to the first state s_t, and can thereby select the operation of the construction machine more suitably.

[Exemplary Embodiment 5]
A fifth exemplary embodiment of the present invention will be described. Components having the same functions as those described in the first to fourth exemplary embodiments are designated by the same reference numerals, and their description is not repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 5") applies the reinforcement learning system 2 according to exemplary embodiment 2 to the control of a transport device that transports loads. As an example, the transport device is a self-driving automated guided vehicle (AGV). The reinforcement learning system 5 has the same configuration as the reinforcement learning system 2 shown in Figure 5 for exemplary embodiment 2. Its components are the same as those of the reinforcement learning system 2, and their description is not repeated here.

When transporting a load from one position to another, the reinforcement learning system 5 selects actions so that the transport time is as short as possible (the transport speed is as high as possible) and so that there is no contact along the way with static obstacles (shelves, loads, etc.) or dynamic obstacles (people, other robots, etc.).

In this exemplary embodiment, the first state s_t includes, as an example, some or all of the following: the position, moving direction, speed, and angular velocity of the transport device that transports the load; the position of the passage; the position of a static obstacle; and the position and moving speed of a dynamic obstacle. The action a includes, as an example, speed control and angular velocity control of the transport device. The reward r_t includes, as an example, some or all of the following: a positive reward obtained when transport is completed, a negative reward obtained on contact with an obstacle, and a negative reward whose absolute value increases with the transport time.

The state randomization unit 523 may add noise to all of the elements included in the first state s_t, or to only some of them. When noise is added to only some of the elements, those elements may include, for example, the position, direction, speed, and angular velocity of the transport device, and may also include the position of a static obstacle or the position and speed of a dynamic obstacle. The state randomization unit 523 may also, for example, add noise to obstacles located in the traveling direction or on the travel path of the transport device, while adding no noise to obstacles located outside the traveling direction or off the travel path.
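The last option, adding noise only to obstacles ahead of the transport device, can be sketched as below. The test for "in the traveling direction" (obstacle ahead of the vehicle and within a corridor of half-width `half_width` around the heading line), the 2D coordinates, and the Gaussian noise are all assumptions chosen for illustration.

```python
import numpy as np

def noise_obstacles_on_path(obstacles, position, heading, half_width, sigma, rng):
    """Add Gaussian noise only to obstacles lying ahead of the vehicle within a corridor."""
    obstacles = np.asarray(obstacles, dtype=float)          # shape (N, 2)
    direction = np.array([np.cos(heading), np.sin(heading)])
    rel = obstacles - np.asarray(position, dtype=float)
    ahead = rel @ direction                                  # distance along the heading
    lateral = rel[:, 0] * direction[1] - rel[:, 1] * direction[0]  # offset from heading line
    on_path = (ahead > 0.0) & (np.abs(lateral) <= half_width)
    noise = rng.normal(0.0, sigma, size=obstacles.shape)
    # Obstacles off the path receive zero noise and are returned unchanged.
    return obstacles + noise * on_path[:, None], on_path

obstacles = [[2.0, 0.0], [-2.0, 0.0], [2.0, 5.0]]  # ahead, behind, far to the side
noisy, on_path = noise_obstacles_on_path(
    obstacles, position=[0.0, 0.0], heading=0.0, half_width=1.0,
    sigma=0.1, rng=np.random.default_rng(0),
)
```

A planned route, if available, could replace the straight corridor used here.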

According to this exemplary embodiment, the reinforcement learning system 5 calculates the first action-value function using a second state obtained by adding noise to the first state s_t, and can thereby control transport by the transport device more suitably.

[Exemplary Embodiment 6]
A sixth exemplary embodiment of the present invention will be described. Components having the same functions as those described in the first to fifth exemplary embodiments are designated by the same reference numerals, and their description is not repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 6") applies the reinforcement learning system 2 according to exemplary embodiment 2 to the control of a forklift. The reinforcement learning system 6 has the same configuration as the reinforcement learning system 2 shown in Figure 5 for exemplary embodiment 2. Its components are the same as those of the reinforcement learning system 2, and their description is not repeated here.

When transporting a pallet from one position to another, the reinforcement learning system 6 selects actions so that the transport time is as short as possible (the transport speed is as high as possible) and so that there is no contact along the way with static obstacles (shelves, loads, etc.) or dynamic obstacles (people, other robots, etc.).

In this exemplary embodiment, the first state s_t includes, as an example, some or all of the following: the position, moving direction, speed, and angular velocity of the forklift; the position of the passage; the position of a static obstacle; and the position and speed of a dynamic obstacle. The action a includes, as an example, speed control and angular velocity control of the forklift. The reward r_t includes, as an example, some or all of the following: a positive reward obtained when transport is completed, a negative reward obtained on contact with an obstacle, and a negative reward whose absolute value increases with the transport time.

The state randomization unit 523 may add noise to all of the elements included in the first state s_t, or to only some of them. When noise is added to only some of the elements, those elements may include, for example, the position, direction, speed, and angular velocity of the forklift, and may also include the position of a static obstacle or the position and speed of a dynamic obstacle. The state randomization unit 523 may also, for example, add noise to obstacles located in the traveling direction or on the travel path of the forklift, while adding no noise to obstacles located outside the traveling direction or off the travel path.

According to this exemplary embodiment, the reinforcement learning system 6 calculates the first action-value function using a second state obtained by adding noise to the first state s_t, and can thereby control the forklift more suitably.

[Exemplary Embodiment 7]
A seventh exemplary embodiment of the present invention will be described. Components having the same functions as those described in the first to sixth exemplary embodiments are designated by the same reference numerals, and their description is not repeated.

The reinforcement learning system according to this exemplary embodiment (hereinafter referred to as "reinforcement learning system 7") has the same configuration as the reinforcement learning system 2 shown in Figure 5 for exemplary embodiment 2. The components of the reinforcement learning system 7 are the same as those of the reinforcement learning system 2, and their description is not repeated here.

In this exemplary embodiment, the first state s_t includes a plurality of elements, each with an associated attribute. When adding noise to the first state s_t, the state randomization unit 523 varies the weighting of the noise according to the attribute. As an example, the state randomization unit 523 may use a larger weight for dynamic elements that move within the environment and a smaller weight for static elements that do not. As another example, among the dynamic elements that move within the environment, the state randomization unit 523 may weight the position of a person more heavily than the other dynamic elements.

According to this exemplary embodiment, the first action-value function is calculated using a second state to which noise has been added with a weight that depends on the attribute of each element, so that the first action-value function reflects the attribute-dependent variability of the data. Selecting an action using this first action-value function therefore takes that attribute-dependent variability into account.

In this exemplary embodiment, the state randomization unit 523 may also change the weighting of the noise while reinforcement learning is running. As an example, the state randomization unit 523 may use a larger weight for a dynamic element while it is moving through the environment and a smaller weight for a dynamic element that is not currently moving.
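Attribute-dependent weighting, including a weight that changes at run time, can be sketched as follows. Interpreting the weight as a multiplier on the noise standard deviation is an assumption, as are the attribute names, the weight values, the speed threshold, and the choice to give a stationary dynamic element the same weight as a static one.

```python
import numpy as np

# Hypothetical weights: a person is weighted more than other dynamic elements,
# and static elements receive the smallest weight.
ATTRIBUTE_WEIGHTS = {"person": 2.0, "dynamic": 1.0, "static": 0.1}

def noise_weight(attribute, speed, moving_threshold=1e-3):
    """Weight for one element; a dynamic element that is not moving is down-weighted."""
    if attribute != "static" and speed < moving_threshold:
        return ATTRIBUTE_WEIGHTS["static"]
    return ATTRIBUTE_WEIGHTS[attribute]

def weighted_randomize(elements, sigma, rng):
    """Add Gaussian noise to each element's position, scaled by its current weight."""
    second_state = []
    for element in elements:
        scale = sigma * noise_weight(element["attribute"], element["speed"])
        position = np.asarray(element["position"], dtype=float)
        noisy = position + rng.normal(0.0, scale, size=position.shape)
        second_state.append({**element, "position": noisy})
    return second_state
```

Because `noise_weight` is evaluated on every call, the weight follows the element's current speed during learning.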

[Software Implementation Example]
Some or all of the functions of the reinforcement learning device 10, the terminal 20, the server 30, the terminal 40, and the reinforcement learning device 50 (hereinafter referred to as "the reinforcement learning device 10, etc.") may be realized by hardware such as an integrated circuit (IC chip), or by software.

In the latter case, the reinforcement learning device 10, etc. is realized, for example, by a computer that executes the instructions of a program, which is software that realizes each function. An example of such a computer (hereinafter referred to as computer C) is shown in Figure 13. Computer C includes at least one processor C1 and at least one memory C2. The memory C2 stores a program P for causing computer C to operate as the reinforcement learning device 10, etc. In computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the reinforcement learning device 10, etc.

The processor C1 may be, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating-point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination of these. The memory C2 may be, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these.

Computer C may further include a RAM (Random Access Memory) into which the program P is loaded at execution time and in which various data are temporarily stored. Computer C may further include a communication interface for transmitting and receiving data to and from other devices. Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.

The program P can be recorded on a non-transitory, tangible recording medium M that is readable by computer C. Such a recording medium M may be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit. Computer C can acquire the program P via such a recording medium M. The program P can also be transmitted via a transmission medium, such as a communication network or broadcast waves. Computer C can also acquire the program P via such a transmission medium.

[Additional Note 1]
The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.

[Additional Note 2]
Some or all of the embodiments described above can also be described as follows. However, the present invention is not limited to the aspects described below.

(Appendix 1)
A reinforcement learning system comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function.

According to the above configuration, a more suitable action can be selected by calculating the first action-value function using a second state obtained by adding noise to the first state.

(Appendix 2)
The reinforcement learning system according to Appendix 1, wherein
the calculation means calculates the first action-value function according to the first state and the second state.

According to the above configuration, a more suitable action can be selected by calculating the first action-value function using a plurality of states including the second state to which noise has been added.

(Appendix 3)
The reinforcement learning system according to Appendix 2, wherein
the calculation means calculates the first action-value function for each of the first state and the second state, and
the selection means selects the action according to a second action-value function calculated based on the plurality of first action-value functions.

According to the above configuration, a more suitable action can be selected by using a second action-value function calculated from a plurality of first action-value functions.
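The flow of Appendices 1 to 3 (acquire a first state, generate noisy second states, calculate a first action-value function for each, combine them into a second action-value function, and select an action) can be sketched as follows. The text above only says that the second action-value function is calculated based on the first ones; taking their mean is an assumption made here, as are the Gaussian noise, the function names, and the toy `toy_q` function, which stands in for a learned action-value function.

```python
import numpy as np

def select_action(first_state, q_function, num_noisy, sigma, rng):
    """Select an action from Q-values combined over the first state and noisy copies."""
    first_state = np.asarray(first_state, dtype=float)
    # Generation: second states = first state plus Gaussian noise (assumed noise model).
    noise = rng.normal(0.0, sigma, size=(num_noisy,) + first_state.shape)
    states = np.concatenate([first_state[None], first_state + noise], axis=0)
    # Calculation: one first action-value vector per state, shape (num_noisy + 1, actions).
    first_q = np.stack([q_function(s) for s in states])
    # Second action-value function: mean over states (an assumed aggregation rule).
    second_q = first_q.mean(axis=0)
    # Selection: greedy action with respect to the second action-value function.
    return int(np.argmax(second_q)), second_q

def toy_q(state):
    """Hypothetical Q-function with three actions preferring targets 1, 0, and -1."""
    x = state[0]
    return np.array([-abs(x - 1.0), -abs(x), -abs(x + 1.0)])

action, second_q = select_action(
    np.array([0.9]), toy_q, num_noisy=8, sigma=0.05, rng=np.random.default_rng(0)
)
```

Other aggregation rules, for example ones that also penalize the spread of the first action-value functions, would fit the same structure.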

(Appendix 4)
The reinforcement learning system according to any one of Appendices 1 to 3, wherein
the first state includes at least one of: a position, a moving direction, a speed, and an angular velocity of a transport device that transports a load; a position of a passage; and a position and a speed of a static or dynamic obstacle.

According to the above configuration, the transport operation of the transport device can be selected more suitably by reinforcement learning.

(Appendix 5)
The reinforcement learning system according to any one of Appendices 1 to 3, wherein
the first state includes at least one of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator.

According to the above configuration, the construction operation of the construction machine can be selected more suitably by reinforcement learning.

(Appendix 6)
The reinforcement learning system according to any one of Appendices 1 to 5, wherein
the first state includes a plurality of elements, each with an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

According to the above configuration, a first action-value function can be calculated that takes into account the variability of data for elements whose attributes satisfy a predetermined condition.

(Appendix 7)
The reinforcement learning system according to Appendix 6, wherein
the first state includes a state relating to a dynamic element that moves within the environment, and
the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.

According to the above configuration, a first action-value function can be calculated that takes into account the variability of data for dynamic elements.

(Appendix 8)
A reinforcement learning device comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function.

(Appendix 9)
The reinforcement learning device according to Appendix 8, wherein
the calculation means calculates the first action-value function according to the first state and the second state.

(Appendix 10)
The reinforcement learning device according to Appendix 9, wherein
the calculation means calculates the first action-value function for each of a plurality of states included in the state sequence, and
the selection means selects the action according to a second action-value function calculated based on the plurality of first action-value functions.

(Appendix 11)
The reinforcement learning device according to any one of Appendices 8 to 10, wherein
the first state includes at least one of: a position, a moving direction, a speed, and an angular velocity of a transport device that transports a load; a position of a passage; and a position and a speed of a static or dynamic obstacle.

(Appendix 12)
The reinforcement learning device according to any one of Appendices 8 to 10, wherein
the first state includes at least one of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator.

(Appendix 13)
The reinforcement learning device according to any one of Appendices 8 to 12, wherein
the first state includes a plurality of elements, each with an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

(Appendix 14)
The reinforcement learning device according to Appendix 13, wherein
the first state includes a state relating to a dynamic element that moves within the environment, and
the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.

(Appendix 15)
A reinforcement learning method comprising:
acquiring a first state in an environment that is a target of reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state; and
selecting an action according to the first action-value function.

(Appendix 16)
The reinforcement learning method according to Appendix 15, wherein,
in calculating the first action-value function,
the first action-value function is calculated according to the first state and the second state.

(Appendix 17)
The reinforcement learning method according to Appendix 16, wherein,
in calculating the first action-value function, the first action-value function is calculated for each of a plurality of states included in the state sequence, and,
in selecting the action, the action is selected according to a second action-value function calculated based on the plurality of first action-value functions.

(Appendix 18)
The reinforcement learning method according to any one of Appendices 15 to 17, wherein
the first state includes at least one of: a position, a moving direction, a speed, and an angular velocity of a transport device that transports a load; a position of a passage; and a position and a speed of a static or dynamic obstacle.

(Appendix 19)
The reinforcement learning method according to any one of Appendices 15 to 17, wherein
the first state includes at least one of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator.

(Appendix 20)
The reinforcement learning method according to any one of Appendices 15 to 19, wherein
the first state includes a plurality of elements, each with an associated attribute, and,
in generating the second state, the second state is generated by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

(Appendix 21)
The reinforcement learning method according to Appendix 20, wherein
the first state includes a state relating to a dynamic element that moves within the environment, and,
in generating the second state, the second state is generated by adding noise to the state of the dynamic element included in the first state.

(Appendix 22)
The reinforcement learning system according to any one of Appendices 1 to 5, wherein
the first state includes a plurality of elements, each with an associated attribute, and
the generation means varies the weighting of the noise according to the attribute.

According to the above configuration, by using a second state to which noise has been added with a weight that depends on the attribute of each element, a first action-value function can be calculated that takes into account the attribute-dependent variability of the data.

(Appendix 23)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein
the first state includes some or all of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator, and
the action includes attitude control of the construction machine.

According to the above configuration, by calculating the first action-value function using a second state obtained by adding noise to the first state, the excavation operation of the excavator can be selected more suitably by reinforcement learning.

(Appendix 24)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein
the first state includes some or all of: a position, a moving direction, a speed, and an angular velocity of a transport device that transports a load; a position of a passage; and a position and a speed of a static or dynamic obstacle, and
the action includes speed control and angular velocity control of the transport device.

According to the above configuration, by calculating the first action-value function using a second state obtained by adding noise to the first state, the transport operation of the transport device can be selected more suitably by reinforcement learning.

（付記２５）
前記第１の状態は、コンピュータゲームにおいてゲームの進行に影響を与えるオブジェクトの状態を含み、
前記行動は、前記コンピュータゲームのプレイヤにより操作されるオブジェクトの動作を含む、
付記１から６、及び付記１９の何れか１つに記載の強化学習システム。 (Appendix 25)
The reinforcement learning system according to any one of Appendices 1 to 6 and Appendix 19, wherein
the first state includes a state of an object that affects the progress of a computer game; and
the action includes a movement of an object operated by a player of the computer game.

上記の構成によれば、第１の状態にノイズを付加した第２の状態を用いて第１の行動価値関数を算出することにより、コンピュータゲームの自律プレイにおけるオブジェクトの動作の選択をより好適に行うことができる。 According to the above configuration, by calculating the first action value function using a second state in which noise is added to the first state, it is possible to more appropriately select the action of an object in autonomous play of a computer game.

（付記２６）
コンピュータを強化学習装置として機能させるプログラムであって、
前記プログラムは、前記コンピュータを、
強化学習の対象である環境における第１の状態を取得する取得手段と、
前記第１の状態にノイズを付加することによって第２の状態を生成する生成手段と、
前記第２の状態に応じて、第１の行動価値関数を算出する算出手段と、
前記第１の行動価値関数に応じて、行動を選択する選択手段と、
として機能させることを特徴とするプログラム。 (Appendix 26)
A program for causing a computer to function as a reinforcement learning device, the program causing the computer to function as:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function.

（付記２７）
前記算出手段は、前記第１の状態と前記第２の状態とに応じて、前記第１の行動価値関数を算出する、
ことを特徴とする付記２６に記載のプログラム。 (Appendix 27)
The program according to Appendix 26, wherein
the calculation means calculates the first action-value function according to the first state and the second state.

（付記２８）
前記算出手段は、前記第１の状態及び前記第２の状態のそれぞれについて、前記第１の行動価値関数を算出し、
前記選択手段は、複数の前記第１の行動価値関数に基づいて算出される第２の行動価値関数に応じて、前記行動を選択する、
ことを特徴とする付記２７に記載のプログラム。 (Appendix 28)
The program according to Appendix 27, wherein
the calculation means calculates the first action-value function for each of the first state and the second state; and
the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.
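The arrangement in Appendix 28 can be sketched as follows. The text says only that the second action-value function is calculated based on a plurality of first action-value functions; taking their arithmetic mean is an assumption made here for illustration. The toy `q_values` function, the action names, and the parameters `num_noisy` and `sigma` are likewise hypothetical.

```python
import random

ACTIONS = ["left", "stay", "right"]  # hypothetical action set


def q_values(state):
    """Toy stand-in for a learned first action-value function Q(s, a)."""
    x = state[0]
    return {"left": -x, "stay": 1.0 - abs(x), "right": x}


def add_noise(state, sigma, rng):
    """Generate a second state by adding Gaussian noise to every element."""
    return [v + rng.gauss(0.0, sigma) for v in state]


def select_action(first_state, num_noisy=8, sigma=0.1, rng=None):
    rng = rng or random.Random()
    # The first state plus several noisy second states.
    states = [first_state] + [
        add_noise(first_state, sigma, rng) for _ in range(num_noisy)
    ]
    first_qs = [q_values(s) for s in states]
    # Second action-value function: mean of the first ones (an assumption).
    second_q = {a: sum(q[a] for q in first_qs) / len(first_qs) for a in ACTIONS}
    return max(second_q, key=second_q.get), second_q


action, second_q = select_action([2.0], rng=random.Random(0))
```

Averaging is one way to keep an estimation error at a single state from dominating the choice; other aggregations such as a minimum or a median would fit the same wording.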

（付記２９）
前記第１の状態は、搬送物を搬送する搬送装置の位置、移動方向、速度、及び角速度、通路の位置、並びに静的又は動的な障害物の位置及び速度、のうちの少なくとも何れか１つを含む、
付記２６から２８の何れか１つに記載のプログラム。 (Appendix 29)
The program according to any one of Appendices 26 to 28, wherein
the first state includes at least one of: a position, a moving direction, a speed, and an angular velocity of a conveying device that conveys an object; a position of a passage; and a position and a speed of a static or dynamic obstacle.

（付記３０）
前記第１の状態は、建設機械の姿勢及び位置、掘削対象である土砂の形状、並びに掘削機のバケット内の土砂量、のうちの少なくとも何れか１つを含む、
付記２６から２８のいずれか１つに記載のプログラム。 (Appendix 30)
The program according to any one of Appendices 26 to 28, wherein
the first state includes at least one of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator.

（付記３１）
前記第１の状態は、属性が付随する複数の要素を含み、
前記生成手段は、前記属性に応じ、前記第１の状態に含まれる複数の要素に、選択的にノイズを付加することによって前記第２の状態を生成する、
ことを特徴とする付記２６から３０の何れか１つに記載のプログラム。 (Appendix 31)
The program according to any one of Appendices 26 to 30, wherein
the first state includes a plurality of elements, each having an associated attribute; and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

（付記３２）
前記第１の状態は、環境内を移動する動的要素に関する状態を含み、
前記生成手段は、前記第１の状態に含まれる前記動的要素の状態にノイズを付加することによって前記第２の状態を生成する、
ことを特徴とする付記２７に記載のプログラム。 (Appendix 32)
The program according to Appendix 27, wherein
the first state includes a state of a dynamic element that moves within the environment; and
the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.
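Adding noise only to dynamic elements, as in Appendices 31 and 32, can be sketched as follows. The `Element` type, its `dynamic` flag, the element names, and the Gaussian noise with standard deviation `sigma` are assumptions for illustration and do not appear in the source.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    value: float
    dynamic: bool  # hypothetical attribute: True if the element moves in the environment


def randomize_dynamic(state, sigma=0.2, rng=None):
    """Return a second state in which only dynamic elements are perturbed."""
    rng = rng or random.Random()
    return [
        Element(e.name, e.value + rng.gauss(0.0, sigma), e.dynamic) if e.dynamic else e
        for e in state
    ]


first_state = [
    Element("passage_x", 5.0, dynamic=False),   # static: left untouched
    Element("obstacle_x", 2.0, dynamic=True),   # dynamic: receives noise
]
second_state = randomize_dynamic(first_state, rng=random.Random(1))
```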

〔付記事項３〕
上述した実施形態の一部又は全部は、更に、以下のように表現することもできる。 [Additional Note 3]
Some or all of the above-described embodiments can also be expressed as follows.

少なくとも１つのプロセッサを備え、前記プロセッサは、
強化学習の対象である環境における第１の状態を取得する取得処理と、
前記第１の状態にノイズを付加することによって第２の状態を生成する生成処理と、
前記第２の状態に応じて、第１の行動価値関数を算出する算出処理と、
前記第１の行動価値関数に応じて、行動を選択する選択処理と、
を実行する強化学習装置。 A reinforcement learning device comprising at least one processor, wherein the processor executes:
an acquisition process of acquiring a first state in an environment that is a target of reinforcement learning;
a generation process of generating a second state by adding noise to the first state;
a calculation process of calculating a first action-value function according to the second state; and
a selection process of selecting an action according to the first action-value function.

なお、この強化学習装置は、更にメモリを備えていてもよく、このメモリには、前記取得処理と、前記生成処理と、前記算出処理と、前記選択処理とを前記プロセッサに実行させるためのプログラムが記憶されていてもよい。また、このプログラムは、コンピュータ読み取り可能な一時的でない有形の記録媒体に記録されていてもよい。The reinforcement learning device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process, the generation process, the calculation process, and the selection process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
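The four processes executed by the processor can be wired together as in the sketch below. The class name, the `environment` callable, the toy action-value function `toy_q`, and the Gaussian noise model are all hypothetical; the source does not specify how the action-value function is represented or learned.

```python
import random


class ReinforcementLearningDevice:
    """Sketch of the acquisition, generation, calculation, and selection processes."""

    def __init__(self, q_function, actions, sigma=0.1, seed=None):
        self.q_function = q_function  # stand-in for a learned Q(s, a)
        self.actions = list(actions)
        self.sigma = sigma
        self.rng = random.Random(seed)

    def acquire(self, environment):
        # Acquisition process: observe the first state.
        return list(environment())

    def generate(self, first_state):
        # Generation process: second state = first state + noise.
        return [v + self.rng.gauss(0.0, self.sigma) for v in first_state]

    def calculate(self, second_state):
        # Calculation process: first action-value function at the second state.
        return {a: self.q_function(second_state, a) for a in self.actions}

    def select(self, action_values):
        # Selection process: greedy choice over the action values.
        return max(action_values, key=action_values.get)

    def step(self, environment):
        first_state = self.acquire(environment)
        second_state = self.generate(first_state)
        return self.select(self.calculate(second_state))


def toy_q(state, action):
    """Toy Q: accelerating is better while the goal at x = 10 is still ahead."""
    distance = 10.0 - state[0]
    return distance if action == "accelerate" else -distance


device = ReinforcementLearningDevice(toy_q, ["accelerate", "brake"], seed=0)
chosen = device.step(lambda: [0.0])
```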

１、２、３、４、５、６、７強化学習システム
１０、５０強化学習装置
１１取得部
１２生成部
１３算出部
１４、５２６選択部
２０、４０端末
３０サーバ
４１、５１通信部
４２、５２制御部
４３入力受付部
５３記憶部
４２１状態提供部
４２２行動実行部
４２３報酬提供部
５２１報酬取得部
５２２状態観測部
５２３状態ランダム化部
５２４学習部
５２５推定部

1, 2, 3, 4, 5, 6, 7 Reinforcement learning system
10, 50 Reinforcement learning device
11 Acquisition unit
12 Generation unit
13 Calculation unit
14, 526 Selection unit
20, 40 Terminal
30 Server
41, 51 Communication unit
42, 52 Control unit
43 Input reception unit
53 Memory unit
421 State provision unit
422 Action execution unit
423 Reward provision unit
521 Reward acquisition unit
522 State observation unit
523 State randomization unit
524 Learning unit
525 Estimation unit

Claims

A reinforcement learning system comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the calculation means calculates the first action-value function for each of the first state and the second state, and
the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.

The reinforcement learning system according to claim 1, wherein
the first state includes at least one of: a position, a moving direction, a speed, and an angular velocity of a conveying device that conveys an object; a position of a passage; and a position and a speed of a static or dynamic obstacle.

The reinforcement learning system according to claim 1 or 2, wherein
the first state includes at least one of: an attitude and a position of a construction machine; a shape of soil to be excavated; and an amount of soil in a bucket of an excavator.

The reinforcement learning system according to any one of claims 1 to 3, wherein
the first state includes a plurality of elements, each having an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

The reinforcement learning system according to claim 4, wherein
the first state includes a state of a dynamic element that moves within the environment, and
the generation means generates the second state by adding noise to the state of the dynamic element included in the first state.

A reinforcement learning device comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the calculation means calculates the first action-value function for each of the first state and the second state, and
the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.

A reinforcement learning method comprising:
acquiring a first state in an environment that is a target of reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state; and
selecting an action according to the first action-value function,
wherein, in the calculating, the first action-value function is calculated for each of the first state and the second state, and
in the selecting, the action is selected according to a second action-value function calculated based on a plurality of the first action-value functions.

A program for causing a computer to function as a reinforcement learning device, the program causing the computer to function as:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the calculation means calculates the first action-value function for each of the first state and the second state, and
the selection means selects the action according to a second action-value function calculated based on a plurality of the first action-value functions.

A reinforcement learning system comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the first state includes a plurality of elements, each having an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

A reinforcement learning device comprising:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the first state includes a plurality of elements, each having an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

A reinforcement learning method comprising:
acquiring a first state in an environment that is a target of reinforcement learning;
generating a second state by adding noise to the first state;
calculating a first action-value function according to the second state; and
selecting an action according to the first action-value function,
wherein the first state includes a plurality of elements, each having an associated attribute, and
in the generating, the second state is generated by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.

A program for causing a computer to function as a reinforcement learning device, the program causing the computer to function as:
an acquisition means for acquiring a first state in an environment that is a target of reinforcement learning;
a generation means for generating a second state by adding noise to the first state;
a calculation means for calculating a first action-value function according to the second state; and
a selection means for selecting an action according to the first action-value function,
wherein the first state includes a plurality of elements, each having an associated attribute, and
the generation means generates the second state by selectively adding noise, according to the attribute, to the plurality of elements included in the first state.