JP7664121B2 - Logistics transport speed control agent learning device, logistics transport speed control agent learning method, and logistics transport speed control agent learning program

Info

Publication number: JP7664121B2
Application number: JP2021136852A
Authority: JP
Inventors: 宏之鹿山; 健吾高橋
Original assignee: IHI Logistics and Machinery Corp
Current assignee: IHI Logistics and Machinery Corp
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2025-04-17
Anticipated expiration: 2041-08-25
Also published as: JP2023031401A

Description

The present invention relates to a logistics transport speed control agent learning device, a logistics transport speed control agent learning method, and a logistics transport speed control agent learning program.

For example, as shown in Patent Document 1, a logistics facility has multiple transport paths for transporting workpieces. These transport paths are formed by various kinds of transport units, such as conveyors that transport workpieces horizontally and transfer robots that transfer workpieces.

Patent Document 1: JP 2020-11817 A

When the transport of workpieces stalls on a transport path, congestion occurs. For example, while the logistics facility as a whole is in operation, some of the transport units that form a transport path may be stopped for maintenance. In that case, workpieces back up on the transport path that includes the transport unit under maintenance, and congestion occurs. If the congestion is not resolved, the transport path eventually runs out of space in which to hold workpieces temporarily and can no longer accept new ones. This inability of the transport path to accept workpieces supplied from upstream is referred to as a drop. In other words, when more workpieces gather on a transport path than its transport capacity allows, the transport path can no longer accept workpieces and a drop occurs.

One way to suppress such drops is to keep the transport speed of the transport units at its maximum at all times. Doing so minimizes the number of workpieces on the transport path. As a result, fewer workpieces are on the transport path at the moment workpieces become concentrated or a transport unit goes into maintenance, which maximizes the number of workpieces the transport path can temporarily accept while it is congested. Drops can therefore be suppressed.

However, if the transport speed is always set to the maximum, the transport units run at maximum speed unnecessarily even when no congestion would occur. Energy consumption is therefore expected to be high.

The present invention was made in view of the above problems, and aims to reduce energy consumption while suppressing situations in which a transport path in a logistics facility cannot accept workpieces because of congestion.

The present invention adopts the following configurations as means for solving the above problems.

A first aspect of the present invention is a logistics transport speed control agent learning device that performs reinforcement learning on a learning agent that controls the transport speeds of multiple transport units forming a logistics transport route. The device has a transport route simulator that uses a model of the logistics transport route to calculate the state of the logistics transport route and a reward based on that state, and the learning agent, which learns about the transport speed so that an evaluation based on the reward increases. The transport route simulator calculates the reward based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the logistics transport route cannot accept a workpiece, and an energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

In a second aspect of the present invention, based on the first aspect, the modeled logistics transport route has multiple control target parts, each of which includes one or more transport units, and the learning agent controls the transport speed for each control target part.

In a third aspect of the present invention, based on the second aspect, the transport route simulator calculates the energy consumption increase penalty based on the speed command value for each transport unit.

In a fourth aspect of the present invention, based on the third aspect, the transport route simulator calculates the energy consumption increase penalty based on the sum of the squares of the speed command values for the individual transport units.

In a fifth aspect of the present invention, based on the second aspect, the transport route simulator calculates the energy consumption increase penalty based on the acceleration of each transport unit.

In a sixth aspect of the present invention, based on any one of the second to fifth aspects, the multiple transport units forming the logistics transport route include a maintenance occurrence transport unit, that is, a transport unit that may be stopped for maintenance for some period while workpieces are being transported on the logistics transport route, and a single control target part includes at most one maintenance occurrence transport unit.

A seventh aspect of the present invention is a logistics transport speed control agent learning method for performing reinforcement learning on a learning agent that controls the transport speeds of multiple transport units forming a logistics transport route. In this method, a transport route simulator uses a model of the logistics transport route to calculate the state of the logistics transport route and a reward based on that state, and the learning agent learns about the transport speed so that an evaluation based on the reward increases. The transport route simulator calculates the reward based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the logistics transport route cannot accept a workpiece, and an energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

An eighth aspect of the present invention is a logistics transport speed control agent learning program that causes a computer to function as a logistics transport speed control agent learning device that performs reinforcement learning on a learning agent that controls the transport speeds of multiple transport units forming a logistics transport route. The program causes the computer to function as a transport route simulator that uses a model of the logistics transport route to calculate the state of the logistics transport route and a reward based on that state, and as the learning agent, which learns about the transport speed so that an evaluation based on the reward increases. When the computer functions as the transport route simulator, the program causes it to calculate the reward based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the logistics transport route cannot accept a workpiece, and an energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

According to the present invention, the reward calculated by the transport route simulator is input to the learning agent, and the learning agent learns based on that reward. The reward is calculated based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the logistics transport route cannot accept a workpiece, and an energy consumption increase penalty that increases as the energy consumption related to the transport speed increases. In other words, the reward input to the learning agent decreases as the energy consumption related to the transport speed increases, so the learning agent learns to reduce that energy consumption. The reward also decreases when the logistics transport route cannot accept a workpiece, so the learning agent learns to avoid such situations. The present invention therefore makes it possible to reduce energy consumption while suppressing situations in which a transport route in a logistics facility cannot accept workpieces because of congestion.

FIG. 1 is a block diagram showing an outline of the hardware configuration of a logistics transport speed control agent learning device in one embodiment of the present invention.
FIG. 2 is a block diagram showing an outline of the functional configuration of the logistics transport speed control agent learning device in one embodiment of the present invention.
FIG. 3 is a conceptual diagram of a modeled logistics transport route.
FIG. 4 is a schematic diagram illustrating a drop of a workpiece.
FIG. 5 is a flowchart for explaining the operation of the logistics transport speed control agent learning device in one embodiment of the present invention.

An embodiment of a logistics transport speed control agent learning device, a logistics transport speed control agent learning method, and a logistics transport speed control agent learning program according to the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing an outline of the hardware configuration of the logistics transport speed control agent learning device 1 of this embodiment. FIG. 2 is a block diagram showing an outline of the functional configuration of the logistics transport speed control agent learning device 1 of this embodiment.

The logistics transport speed control agent learning device 1 of this embodiment is a device that performs reinforcement learning on the learning agent 3 (see FIG. 2). After reinforcement learning, the learning agent 3 is installed in the control device of a logistics facility, where it controls the transport speed of the logistics transport route provided in that facility. The logistics transport route is formed by various transport units such as conveyors and transfer devices. The learning agent 3 controls, for example, the transport speed of each transport unit.

As shown in FIG. 1, the logistics transport speed control agent learning device 1 includes a storage unit 10, an operation unit 11, a communication unit 12, a calculation unit 13, and a display unit 14, and is implemented as a computer.

The storage unit 10 consists of memory such as ROM (Read Only Memory) and RAM (Random Access Memory), and storage such as an HDD (Hard Disk Drive) or SSD (Solid State Drive). The storage unit 10 stores a learning program P (the logistics transport speed control agent learning program). It also stores various data D, which include initial data used by the calculation unit 13 and the calculation results of the calculation unit 13.

The operation unit 11 is an input device that accepts operation instructions from an operator of the logistics transport speed control agent learning device 1; more specifically, it is a keyboard or a pointing device such as a mouse. The operation unit 11 outputs operation signals corresponding to the operator's instructions to the calculation unit 13. The communication unit 12 is a communication device that exchanges data with external devices over a predetermined communication line, using, for example, a communication protocol for a LAN (Local Area Network) or the Internet.

The calculation unit 13 is a computing device that performs the calculations for reinforcement learning of the learning agent 3 based on the learning program P, the data D, the operation signals, and so on. The calculation unit 13 consists of hardware such as an interface circuit and a CPU (Central Processing Unit). The interface circuit is an electronic circuit that exchanges various signals with the storage unit 10, the operation unit 11, the communication unit 12, and the display unit 14. The CPU is a central processing unit that executes the learning program P.

The display unit 14 is a display device that, based on image data generated by the calculation unit 13 under the learning program P, displays information such as the learning state of the learning agent 3. The logistics transport speed control agent learning device 1 does not necessarily need to include the display unit 14; an external device may instead perform the display based on data output from the logistics transport speed control agent learning device 1 via the communication unit 12.

The logistics transport speed control agent learning device 1, which has the hardware configuration shown in FIG. 1, has the multiple functional units shown in FIG. 2. These functional units are realized through the cooperation of the hardware shown in FIG. 1 with the learning program P and other items stored in the storage unit 10.

As shown in FIG. 2, the logistics transport speed control agent learning device 1 has, as these functional units, a transport route simulator 2, a learning agent 3, a performance evaluation unit 4, and a hyperparameter setting unit 5.

Of the "state," "action," and "reward" used in the reinforcement learning of the learning agent 3, the transport route simulator 2 calculates and outputs the "state" and the "reward." Using a modeled logistics transport route, it calculates the "state" and "reward" based on the "action" output by the learning agent 3. The modeled logistics transport route is a set of data that models, for simulation purposes, the logistics transport route installed in the logistics facility in which the learning program P is installed after learning. The learning agent 3 is the agent on which the logistics transport speed control agent learning device 1 performs reinforcement learning, and it controls the transport speed of workpieces in the transport units that form the logistics transport route. Of the "state," "action," and "reward," the learning agent 3 calculates and outputs the "action."

In other words, in the logistics transport speed control agent learning device 1 of this embodiment, the "state" and "reward" calculated by the transport route simulator 2 are input to the learning agent 3, and the learning agent 3 learns about the transport speed. The learning agent 3 then selects (calculates) and outputs an "action," and the transport route simulator 2 calculates the "state" and "reward" again based on that "action." By repeating these operations until the learning of the learning agent 3 has progressed, the logistics transport speed control agent learning device 1 performs reinforcement learning on the learning agent 3.
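The exchange of "state," "reward," and "action" described above can be sketched as a simple loop. This is only an illustrative sketch: `ToySimulator`, `RandomAgent`, and `run_episode` are hypothetical names, the state update and reward are placeholders, and the agent merely records transitions rather than implementing any particular learning algorithm.

```python
import random


class ToySimulator:
    """Hypothetical stand-in for the transport route simulator 2."""

    def __init__(self, n_cells=3):
        self.n_cells = n_cells
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * self.n_cells  # placeholder "state"

    def step(self, action):
        # "action" is one speed command per cell.
        self.t += 1
        state = list(action)                  # placeholder state update
        reward = -sum(v * v for v in action)  # placeholder reward
        return state, reward


class RandomAgent:
    """Hypothetical stand-in for the learning agent 3 (no real learning)."""

    def __init__(self, n_cells, seed=0):
        self.n_cells = n_cells
        self.rng = random.Random(seed)
        self.history = []

    def act(self, state):
        return [self.rng.choice([0.0, 0.5, 1.0]) for _ in range(self.n_cells)]

    def learn(self, state, action, reward, next_state):
        # A real agent would update its policy here; this one only records.
        self.history.append((state, action, reward, next_state))


def run_episode(sim, agent, n_steps):
    """Repeat: agent outputs an action, simulator returns state and reward."""
    state = sim.reset()
    total = 0.0
    for _ in range(n_steps):
        action = agent.act(state)
        next_state, reward = sim.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total += reward
    return total
```

Calling `run_episode(ToySimulator(3), RandomAgent(3), 10)` runs ten such iterations; in practice the loop would continue until learning has progressed far enough.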

As shown in FIG. 2, the transport route simulator 2 has a cell definition unit 2a, a transport route control unit 2b, and a reward calculation unit 2c. The cell definition unit 2a divides the logistics transport route into multiple parts (cells) and defines the behavior of each cell, which is the unit element of the route.

FIG. 3 is a conceptual diagram of a modeled logistics transport route H. As shown in this figure, in this embodiment the logistics transport route H is defined as being divided into multiple cells S. Each cell S corresponds, for example, to one of the transport units (conveyors, transfer robots, and so on) that form the logistics transport route H. On such a logistics transport route H, a workpiece W is handed over between adjacent cells S and is thereby transported from one end of the logistics transport route H toward the other.

The cell definition unit 2a defines operating conditions for each cell S, such as the settable range of the transport speed and the acceleration. In this embodiment, the cell definition unit 2a also groups the multiple cells S (that is, the transport units) into multiple control target parts (a first control target part T1, a second control target part T2, and a third control target part T3).

Among the multiple transport units that form the logistics transport route H, some require maintenance work while the logistics facility is in operation. When maintenance work takes place on such a transport unit, workpieces W cannot pass through that transport unit for the duration of the maintenance. A cell S corresponding to a transport unit on which maintenance work may take place is referred to as a maintenance occurrence cell Sa (maintenance occurrence transport unit). In this embodiment, a single control target part includes at most one maintenance occurrence cell Sa. This prevents the transport speed of a single control target part from being affected by multiple maintenance occurrence cells Sa.

In this embodiment, as shown for example in FIG. 3, 14 cells S are arranged in series to form the logistics transport route H. These 14 cells S are grouped into three control target parts (the first control target part T1, the second control target part T2, and the third control target part T3). The first control target part T1 includes three cells S, the second control target part T2 includes eight cells S, and the third control target part T3 includes two cells S. The last cell S only receives workpieces W and does not transport them, so it is not included in any control target part.

In this embodiment, the most upstream cell S in the transport direction among the cells S of the second control target part T2 is a maintenance occurrence cell Sa on which maintenance is performed periodically. Likewise, the most upstream cell S in the transport direction among the cells S of the third control target part T3 is a maintenance occurrence cell Sa on which maintenance is performed periodically.

When a control target part includes a maintenance occurrence cell Sa in this way, the maintenance occurrence cell Sa can be placed at the most upstream position in that control target part. With this arrangement, when maintenance occurs in the maintenance occurrence cell Sa, workpieces W can still be transported as far as the cell S immediately adjacent to the upstream end of the maintenance occurrence cell Sa.
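The grouping of this embodiment (3 + 8 + 2 cells, plus one final receive-only cell) can be written down as a small data structure. The sketch below is illustrative only: the zero-based cell indices, the names `ControlTargetPart`, `check_parts`, and `expand_speed_commands`, and the choice to give the final cell a speed command of 0.0 are all assumptions, not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ControlTargetPart:
    name: str
    cells: Tuple[int, ...]                  # cell indices, upstream first
    maintenance_cells: Tuple[int, ...] = ()  # maintenance occurrence cells Sa


N_CELLS = 14  # index 13 is the final cell, which belongs to no part

PARTS = (
    ControlTargetPart("T1", (0, 1, 2)),
    ControlTargetPart("T2", tuple(range(3, 11)), maintenance_cells=(3,)),
    ControlTargetPart("T3", (11, 12), maintenance_cells=(11,)),
)


def check_parts(parts) -> None:
    """Enforce the constraint of at most one maintenance cell per part."""
    for p in parts:
        if len(p.maintenance_cells) > 1:
            raise ValueError(f"{p.name}: more than one maintenance occurrence cell")
        if any(c not in p.cells for c in p.maintenance_cells):
            raise ValueError(f"{p.name}: maintenance cell is not in this part")


def expand_speed_commands(part_speeds: Dict[str, float],
                          parts=PARTS, n_cells: int = N_CELLS) -> List[float]:
    """Turn one speed per control target part into one command per cell."""
    commands = [0.0] * n_cells  # assumption: uncontrolled final cell gets 0.0
    for p in parts:
        for c in p.cells:
            commands[c] = part_speeds[p.name]
    return commands
```

For example, `expand_speed_commands({"T1": 1.0, "T2": 0.5, "T3": 0.8})` yields a 14-element list in which every cell in a part shares that part's speed.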

Returning to FIG. 2, the transport route control unit 2b supervises all of the cells S and calculates the overall "state" of the logistics transport route H based on the input "action." Specifically, the learning agent 3 inputs a speed command value for each cell S to the transport route control unit 2b as the "action." The transport route control unit 2b sets the transport speed of each cell S according to these speed command values and, under the conditions defined by the cell definition unit 2a, uses the modeled logistics transport route H to calculate the "state" based on the speed command values.

The reward calculation unit 2c calculates the "reward" based on the "state" calculated by the transport route control unit 2b. In this embodiment, the reward calculation unit 2c calculates the "reward" based on, for example, formula (1).

In formula (1), r_t is the reward at time t. x_{t,catch} is a variable that takes the value 1 or 0 according to whether a workpiece W has been carried to the most downstream position of the logistics transport route H (the 14th cell S from the upstream side) at time t. It takes the value 1 if a workpiece W has been carried to the most downstream position and 0 otherwise.

Also in formula (1), x_{t,drop} is a variable that takes the value 1 or 0 according to whether a drop of a workpiece W has occurred at time t. It takes the value 1 if a drop has occurred and 0 otherwise.

A drop means that the logistics transport route H has become unable to accept a workpiece W that is about to be supplied to it from upstream. FIG. 4 is a schematic diagram illustrating a drop of a workpiece W.

For example, as shown in FIG. 4(a), when maintenance occurs in the maintenance occurrence cell Sa, workpieces W can no longer pass through the maintenance occurrence cell Sa. However, as FIG. 4(a) also shows, if there is a cell S upstream of the maintenance occurrence cell Sa in which a workpiece W can be held, the logistics transport route H can still receive the workpiece W, so no drop occurs.

On the other hand, as shown in FIG. 4(b), if there is no cell S upstream of the maintenance occurrence cell Sa in which a workpiece W can be held, the logistics transport route H cannot receive the workpiece W. In that case, the logistics transport route H cannot accept a workpiece W that is about to be supplied to it from upstream, and a drop of the workpiece W occurs.
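The two cases of FIG. 4 can be illustrated with a deliberately simplified model in which the cells upstream of a stopped maintenance occurrence cell act as a fixed number of holding slots. This is a sketch under that assumption, not the simulator's actual logic; `offer_workpiece` and `simulate_stop` are hypothetical names, and the capacity of 3 is just an example value.

```python
def offer_workpiece(held: int, capacity: int):
    """Offer one workpiece to the route while the maintenance cell is stopped.

    held:     number of workpieces already held upstream of the stopped cell
    capacity: number of upstream cells able to hold a workpiece
    Returns (new_held, x_drop), where x_drop is 1 if the workpiece is refused.
    """
    if held >= capacity:
        return held, 1      # FIG. 4(b): no free cell upstream, so a drop occurs
    return held + 1, 0      # FIG. 4(a): a free cell exists, workpiece accepted


def simulate_stop(n_arrivals: int, capacity: int):
    """Return the x_drop value for each arrival during one stop period."""
    held = 0
    drops = []
    for _ in range(n_arrivals):
        held, x_drop = offer_workpiece(held, capacity)
        drops.append(x_drop)
    return drops


# Five workpieces arrive while three upstream cells are available.
print(simulate_stop(5, 3))  # [0, 0, 0, 1, 1]
```

The first three arrivals are held upstream; once every upstream cell is occupied, each further arrival is a drop.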

Also in formula (1), v_{(t,i)} is the speed command value for the i-th cell S at time t, and N is the total number of cells S. A, B, and C are hyperparameters.

The first term on the right-hand side of formula (1) is the transport completion reward obtained when the transport of a workpiece W is completed on the logistics transport route H. The second term on the right-hand side is the acceptance refusal penalty given when the logistics transport route H cannot accept a workpiece W. The third term on the right-hand side is the energy consumption increase penalty, which increases as the energy consumption related to the transport speed increases.
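Formula (1) itself is not reproduced in this text. One form that would be consistent with the symbol definitions and the term-by-term description, assuming that A, B, and C are positive weights, that the two penalty terms are subtracted, and that the energy term is the sum of squared speed command values described for this embodiment, is the following. The exact form in the original publication may differ (for example, in normalization or sign conventions).

```latex
r_t = A\,x_{t,\mathrm{catch}} \;-\; B\,x_{t,\mathrm{drop}} \;-\; C \sum_{i=1}^{N} v_{(t,i)}^{2}
```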

In other words, in this embodiment the reward calculation unit 2c calculates the "reward" based on the transport completion reward obtained when the transport of a workpiece W is completed on the logistics transport route H, the acceptance refusal penalty given when the logistics transport route H cannot accept a workpiece W, and the energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

このような本実施形態の物流搬送速度制御エージェント学習装置１では、式（１）の右辺第１項によって、物流搬送路ＨでワークＷの搬送が完了するように学習エージェント３を学習させることができる。また、本実施形態の物流搬送速度制御エージェント学習装置１では、式（１）の右辺第２項によって、ワークＷのドロップを発生させないように学習エージェント３を学習させることができる。また、本実施形態の物流搬送速度制御エージェント学習装置１では、式（１）の右辺第３項によって、搬送速度に関するエネルギ消費量が大きくなることを抑制するように学習エージェント３を学習させることができる。 In the logistics transport speed control agent learning device 1 of this embodiment, the first term on the right side of equation (1) can train the learning agent 3 to complete the transport of the workpiece W on the logistics transport route H. In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the second term on the right side of equation (1) can train the learning agent 3 to prevent the workpiece W from being dropped. In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the third term on the right side of equation (1) can train the learning agent 3 to suppress an increase in energy consumption related to the transport speed.

したがって、本実施形態の物流搬送速度制御エージェント学習装置１では、物流搬送路ＨでワークＷの搬送が完了すること、ワークＷのドロップを発生させないことを優先として、搬送速度に関するエネルギ消費量が大きくなることを抑制するように学習エージェント３を学習させることができる。 Therefore, in the logistics transport speed control agent learning device 1 of this embodiment, the learning agent 3 can be trained to prioritize completing the transport of the work W on the logistics transport route H and not causing the work W to be dropped, and to suppress an increase in energy consumption related to the transport speed.

また、式（１）の右辺第３項に示されているように、本実施形態において報酬計算部２ｃは、セルＳ（搬送ユニット）ごとの速度指令値を各々二乗した値の総和に基づいて消費エネルギ増加罰を算出している。つまり、報酬計算部２ｃは、セルＳ（搬送ユニット）ごとの速度指令値に基づいて消費エネルギ増加罰を算出している。速度指令値が大きくなるほど、セルＳ（搬送ユニット）の消費エネルギが増加する。したがって、セルＳ（搬送ユニット）ごとの速度指令値に基づいて消費エネルギ増加罰を算出することによって、搬送ユニットの速度に応じて罰則を変化させることができる。 Also, as shown in the third term on the right side of equation (1), in this embodiment, the reward calculation unit 2c calculates the energy consumption increase penalty based on the sum of the squared values of the speed command values for each cell S (transport unit). In other words, the reward calculation unit 2c calculates the energy consumption increase penalty based on the speed command value for each cell S (transport unit). The larger the speed command value, the more energy is consumed by the cell S (transport unit). Therefore, by calculating the energy consumption increase penalty based on the speed command value for each cell S (transport unit), the penalty can be changed according to the speed of the transport unit.

なお、本実施形態では、セルＳ（搬送ユニット）ごとの速度指令値を各々二乗した値の総和に基づいて消費エネルギ増加罰を算出している。このような構成を採用することによって、搬送ユニットの搬送速度が大きい状態でさらに搬送速度を高めようとした場合の罰則が、搬送ユニットの搬送速度がゼロあるいは小さい状態で搬送速度を高めようとした場合の罰則よりも大きくなる。このため、例えば、搬送ユニットにおいて、ワークＷが停止した状態やワークＷの搬送速度が小さい状態から一定の速度まで搬送速度を素早く速めることが可能となる一方で、ワークＷの搬送速度が既に速い状態からさらに速めようとする動作を防ぐように学習エージェント３を学習させることができる。 In this embodiment, the penalty for increased energy consumption is calculated based on the sum of the squared values of the speed command values for each cell S (transport unit). By adopting such a configuration, the penalty for attempting to further increase the transport speed when the transport unit's transport speed is high is greater than the penalty for attempting to increase the transport speed when the transport unit's transport speed is zero or low. For this reason, for example, in the transport unit, it is possible to quickly increase the transport speed from a state in which the workpiece W is stopped or the transport speed of the workpiece W is low to a certain speed, while the learning agent 3 can be trained to prevent an action that attempts to further increase the transport speed of the workpiece W when it is already high.
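The effect of squaring can be made explicit with a short calculation. Assume, for illustration only, that a single cell contributes $C v^2$ to the penalty (the placement of the coefficient $C$ is an assumption; the text states only that the penalty is based on the sum of the squared speed command values). Raising that cell's command from $v$ to $v + \Delta$ with $\Delta > 0$ then increases the penalty by:

```latex
\begin{aligned}
C\,(v+\Delta)^2 - C\,v^2 &= C\,\Delta\,(2v + \Delta)
\end{aligned}
```

The increase grows linearly with the current speed $v$. Starting from rest ($v = 0$) the same increment costs $C\Delta^2$, whereas starting from $v = 3\Delta$ it costs $7C\Delta^2$, which is why accelerating an already fast unit is discouraged more strongly than starting a stopped one.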

本実施形態において学習エージェント３は、ＰＰＯ（Proximal Policy Optimization）アルゴリズムに基づいて強化学習を行う。なお、学習エージェント３における強化学習の方法は、ＰＰＯアルゴリズムに限られるものではない。例えば、学習エージェント３は、Q-learning法、Soft Actor-Critic法、Temporal Difference learning法等に基づいて強化学習を行うようにしても良い。 In this embodiment, the learning agent 3 performs reinforcement learning based on the PPO (Proximal Policy Optimization) algorithm. Note that the method of reinforcement learning in the learning agent 3 is not limited to the PPO algorithm. For example, the learning agent 3 may perform reinforcement learning based on the Q-learning method, the Soft Actor-Critic method, the Temporal Difference Learning method, etc.

本実施形態において、学習エージェント３は、価値推定ネットワーク３ａと、方策ネットワーク３ｂと、ネットワーク更新部３ｃとを備えている。価値推定ネットワーク３ａは、搬送路シミュレータ２で算出された「状態」を用いて状態価値を推定する。方策ネットワーク３ｂは、価値推定ネットワーク３ａで推定された状態価値と、搬送路シミュレータ２で算出された「状態」を用いて、次に取る「行動」（各々のセルＳに対する速度指令値）を決定する。 In this embodiment, the learning agent 3 includes a value estimation network 3a, a policy network 3b, and a network update unit 3c. The value estimation network 3a estimates the state value using the "state" calculated by the transport path simulator 2. The policy network 3b uses the state value estimated by the value estimation network 3a and the "state" calculated by the transport path simulator 2 to determine the next "action" (the speed command value for each cell S).

ネットワーク更新部３ｃは、搬送路シミュレータ２で算出された「報酬」に基づいて、価値推定ネットワーク３ａと、方策ネットワーク３ｂとを更新する。例えば、ネットワーク更新部３ｃは、ＰＰＯアルゴリズムの更新式に従って、得られる「報酬」が最大化されるように価値推定ネットワーク３ａのパラメータと、方策ネットワーク３ｂのパラメータとを更新する。例えば、ネットワーク更新部３ｃは、搬送路シミュレータで算出された「報酬」の系列から累積報酬の系列を得て、状態価値の系列を用いてアドバンテージの系列を得ることで、価値推定ネットワーク３ａ及び方策ネットワーク３ｂの更新を行う。 The network update unit 3c updates the value estimation network 3a and the policy network 3b based on the "reward" calculated by the transport path simulator 2. For example, the network update unit 3c updates the parameters of the value estimation network 3a and the parameters of the policy network 3b in accordance with the update formula of the PPO algorithm so that the obtained "reward" is maximized. For example, the network update unit 3c obtains a series of accumulated rewards from the series of "rewards" calculated by the transport path simulator 2, obtains a series of advantages using the series of state values, and uses these to update the value estimation network 3a and the policy network 3b.
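The step from a reward series to accumulated rewards and advantages can be sketched as follows. This is a simplified illustration under stated assumptions: the discount factor `gamma` does not appear in the text, and the advantage is taken as the accumulated reward minus the estimated state value, which is the simplest estimator (PPO implementations often use generalized advantage estimation instead). The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def returns_and_advantages(rewards: Sequence[float],
                           state_values: Sequence[float],
                           gamma: float = 0.99) -> Tuple[List[float], List[float]]:
    """Compute discounted accumulated rewards and simple advantages.

    gamma is an assumed discount factor; it is not specified in the source text.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the already computed future return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Advantage: how much better the observed return was than the value estimate.
    advantages = [g - v for g, v in zip(returns, state_values)]
    return returns, advantages
```

For rewards (1, 2, 4), state values (1, 1, 1), and gamma = 0.5, the accumulated rewards are (3, 4, 4) and the advantages are (2, 3, 3).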

性能評価部４は、学習エージェント３が予め定められた性能を満たしているか否かの判定を行う。例えば、性能評価部４は、物流搬送路ＨでワークＷの搬送が完了すること、ワークＷのドロップを発生させないことを前提として、搬送速度に関するエネルギ消費量が予め定められた閾値を下回った場合に、学習エージェント３が予め定められた性能を満たしていると判定する。 The performance evaluation unit 4 judges whether the learning agent 3 satisfies a predetermined performance. For example, the performance evaluation unit 4 judges that the learning agent 3 satisfies a predetermined performance when the energy consumption related to the transport speed falls below a predetermined threshold, assuming that the transport of the workpiece W on the logistics transport route H is completed and that the workpiece W is not dropped.
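The example criterion above can be written as a small predicate. This is a sketch only: the `EvaluationResult` fields and the function name are hypothetical, and how the performance evaluation unit 4 actually measures these quantities is not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    """Hypothetical summary of one evaluation run of the trained agent."""
    all_transported: bool      # transport of every workpiece W was completed
    drop_count: int            # number of workpieces refused at the route entrance
    energy_consumption: float  # energy consumption related to the transport speed


def meets_performance(result: EvaluationResult, energy_threshold: float) -> bool:
    """Return True only if the preconditions hold and energy is below the threshold."""
    # Preconditions: transport completed and no drops occurred.
    if not result.all_transported or result.drop_count > 0:
        return False
    # The text says the consumption must fall below the threshold (strictly less).
    return result.energy_consumption < energy_threshold
```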

性能評価部４にて、学習エージェント３が予め定められた性能を満たしていると判定された場合には、学習エージェント３が学習済みとなり、学習エージェント３の強化学習が終了される。一方、性能評価部４にて、学習エージェント３が予め定められた性能を満たしていないと判定された場合には、学習エージェント３が学習済みとはならずに、学習エージェント３の強化学習が引き続き行われる。 If the performance evaluation unit 4 determines that the learning agent 3 meets the predetermined performance, the learning agent 3 is marked as having learned, and the reinforcement learning of the learning agent 3 is terminated. On the other hand, if the performance evaluation unit 4 determines that the learning agent 3 does not meet the predetermined performance, the learning agent 3 is not marked as having learned, and the reinforcement learning of the learning agent 3 continues.

ハイパーパラメータ設定部５は、報酬計算部２ｃで用いる報酬計算式（例えば上述の式（１））におけるハイパーパラメータを設定する。例えば、学習エージェント３の学習率と、報酬計算式で用いる係数とがハイパーパラメータとして、ハイパーパラメータ設定部５によって設定される。 The hyperparameter setting unit 5 sets hyperparameters in the reward calculation formula (e.g., the above-mentioned formula (1)) used by the reward calculation unit 2c. For example, the learning rate of the learning agent 3 and the coefficients used in the reward calculation formula are set as hyperparameters by the hyperparameter setting unit 5.

なお、学習エージェント３の学習率とは、ネットワーク更新部３ｃにおける更新度合いの程度を決めるパラメータである。また、報酬計算式として上述の式（１）を用いる場合には、式（１）におけるＡ、Ｂ及びＣの係数がハイパーパラメータである。 The learning rate of the learning agent 3 is a parameter that determines the degree of updating in the network update unit 3c. In addition, when the above formula (1) is used as the reward calculation formula, the coefficients A, B, and C in formula (1) are hyperparameters.

また、ハイパーパラメータ設定部５は、性能評価部４によって学習エージェント３が予め定められた性能を満たしていないと判定された場合に、報酬計算式におけるハイパーパラメータを更新する。つまり、ハイパーパラメータ設定部５は、報酬計算式として上述の式（１）を用いる場合には、性能評価部４によって学習エージェント３が予め定められた性能を満たしていないと判定された場合に、式（１）における係数Ａ、係数Ｂ及び係数Ｃを更新する。 Furthermore, the hyperparameter setting unit 5 updates the hyperparameters in the reward calculation formula when the performance evaluation unit 4 determines that the learning agent 3 does not satisfy the predetermined performance. In other words, when the above-mentioned formula (1) is used as the reward calculation formula, the hyperparameter setting unit 5 updates the coefficients A, B, and C in formula (1) when the performance evaluation unit 4 determines that the learning agent 3 does not satisfy the predetermined performance.

ハイパーパラメータ設定部５は、予め記憶されたアルゴリズムに基づいてハイパーパラメータを決定する。このアルゴリズムとしては、例えばベイズ最適化法のアルゴリズムを用いることができる。 The hyperparameter setting unit 5 determines the hyperparameters based on a pre-stored algorithm. For example, a Bayesian optimization algorithm can be used as this algorithm.

次に、このような構成の本実施形態の物流搬送速度制御エージェント学習装置１の動作（物流搬送速度制御エージェント学習方法）について、図５のフローチャートを参照して説明する。 Next, the operation of the logistics transport speed control agent learning device 1 of this embodiment (logistics transport speed control agent learning method) will be described with reference to the flowchart in Figure 5.

なお、以下の説明においては、物流搬送路Ｈのモデル化と、セル定義部２ａによるセルＳの定義は既に完了しているものとする。 In the following description, it is assumed that the modeling of the logistics transport route H and the definition of the cells S by the cell definition unit 2a have already been completed.

図５に示すように、まず、ハイパーパラメータが決定される（ステップＳ１）。ここでは、ハイパーパラメータ設定部５によって、学習エージェント３の強化学習に用いるハイパーパラメータ（学習エージェント３の学習率や、報酬計算式で用いる係数）が決定される。 As shown in FIG. 5, the hyperparameters are first determined (step S1). Here, the hyperparameter setting unit 5 determines the hyperparameters used in the reinforcement learning of the learning agent 3 (such as the learning rate of the learning agent 3 and the coefficients used in the reward calculation formula).

続いて、搬送路シミュレータ２及び学習エージェント３の初期化が行われる（ステップＳ２）。ここでは、先の処理によって搬送路シミュレータ２や学習エージェント３が初期設定に対して変化している場合に、搬送路シミュレータ２や学習エージェント３を初期設定の状態に戻す。 Next, the transport path simulator 2 and the learning agent 3 are initialized (step S2). Here, if the transport path simulator 2 or the learning agent 3 has changed from its initial settings due to the previous processing, the transport path simulator 2 or the learning agent 3 is returned to its initial setting state.

つまり、ステップＳ１にてハイパーパラメータが決定されると、ステップＳ２にて搬送路シミュレータ２及び学習エージェント３の初期化が行われる。したがって、本実施形態においては、ハイパーパラメータが更新した場合（後述するステップＳ８からステップＳ１に戻った場合）には、搬送路シミュレータ２や学習エージェント３が初期設定に戻される。 In other words, when the hyperparameters are determined in step S1, the transport path simulator 2 and the learning agent 3 are initialized in step S2. Therefore, in this embodiment, when the hyperparameters are updated (when returning to step S1 from step S8 described later), the transport path simulator 2 and the learning agent 3 are returned to their initial settings.

搬送路シミュレータ２及び学習エージェント３の初期化が完了すると、タイムステップｔが０に設定され（ステップＳ３）、データ収集ステップ（ステップＳ４）が行われる。データ収集ステップでは、学習エージェント３から入力される「行動」に基づいて物流搬送路Ｈの全体の「状態」を搬送路制御部２ｂが算出する。なお、例えば、初期状態において学習エージェント３から入力される「行動」がない場合には、搬送路制御部２ｂは初期値として記憶された「状態」を出力する。 When the initialization of the transport route simulator 2 and the learning agent 3 is completed, the time step t is set to 0 (step S3), and a data collection step (step S4) is performed. In the data collection step, the transport route control unit 2b calculates the overall "state" of the logistics transport route H based on the "action" input from the learning agent 3. Note that, for example, if there is no "action" input from the learning agent 3 in the initial state, the transport route control unit 2b outputs the "state" stored as the initial value.

また、データ収集ステップでは、搬送路シミュレータ２から入力された「状態」に基づいて、学習エージェント３が次のタイムステップで取る「行動」（各々のセルＳの速度指令値）を算出する。ここで、学習エージェント３は、入力された「状態」に対して、価値推定ネットワーク３ａを用いて状態価値を推定する。さらに、学習エージェント３は、状態価値と「状態」とに基づいて、単位時間後に取る「行動」を選択する。 In addition, in the data collection step, the "action" (speed command value of each cell S) that the learning agent 3 will take in the next time step is calculated based on the "state" input from the transport path simulator 2. Here, the learning agent 3 estimates the state value for the input "state" using the value estimation network 3a. Furthermore, the learning agent 3 selects the "action" to be taken after a unit time based on the state value and the "state".

データ収集ステップでは、学習エージェント３から搬送路シミュレータ２に「行動」が入力されると、搬送路シミュレータ２は、セル定義部２ａにて定義された動作に従って単位時間後の「状態」を算出し、現在と単位時間後における「状態」の遷移に基づいて、現在のタイムステップにおける「報酬」を算出する。 In the data collection step, when an "action" is input from the learning agent 3 to the transport path simulator 2, the transport path simulator 2 calculates the "state" after a unit of time according to the operation defined in the cell definition section 2a, and calculates the "reward" for the current time step based on the transition of the "state" between the present and after the unit of time.

ここで、本実施形態においては、搬送路シミュレータ２は、物流搬送路ＨでワークＷの搬送が完了したことで得られる搬送完了報酬と、物流搬送路ＨでワークＷの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて、報酬計算部２ｃにて「報酬」を算出する。 In this embodiment, the transport path simulator 2 calculates a "reward" in the reward calculation unit 2c based on a transport completion reward obtained when the transport of the work W is completed on the logistics transport path H, an acceptance refusal penalty given when the work W cannot be accepted on the logistics transport path H, and an energy consumption increase penalty that increases as the amount of energy consumption related to the transport speed increases.

データ収集ステップでは、搬送路シミュレータ２における「状態」及び「報酬」の算出と、学習エージェント３における「行動」の算出とが予め定められた回数行われる。データ収集ステップでは、上述の過程で得られた「報酬」が学習エージェント３においてデータとして収集される。 In the data collection step, the calculation of the "state" and "reward" in the transport path simulator 2 and the calculation of the "action" in the learning agent 3 are performed a predetermined number of times. In the data collection step, the "reward" obtained in the above process is collected as data in the learning agent 3.

続いて、パラメータ更新ステップ（ステップＳ５）が行われる。パラメータ更新ステップでは、データ収集ステップで収取された「報酬」に基づいて、学習エージェント３の価値推定ネットワーク３ａのパラメータと、方策ネットワーク３ｂとのパラメータの更新が、ネットワーク更新部３ｃによって行われる。ネットワーク更新部３ｃは、ＰＰＯアルゴリズムの更新式に従って、得られる「報酬」が最大化されるように価値推定ネットワーク３ａのパラメータと、方策ネットワーク３ｂのパラメータとを更新する。ここでは、例えば、ネットワーク更新部３ｃは、収集された「報酬」の系列から累積報酬の系列を得て、状態価値の系列を用いてアドバンテージの系列を得ることで、価値推定ネットワーク３ａ及び方策ネットワーク３ｂの更新を行う。 Next, a parameter update step (step S5) is performed. In the parameter update step, the network update unit 3c updates the parameters of the value estimation network 3a and the policy network 3b of the learning agent 3 based on the "reward" collected in the data collection step. The network update unit 3c updates the parameters of the value estimation network 3a and the policy network 3b in accordance with the update formula of the PPO algorithm so as to maximize the obtained "reward". Here, for example, the network update unit 3c updates the value estimation network 3a and the policy network 3b by obtaining a series of accumulated rewards from the series of collected "reward" and obtaining a series of advantages using the series of state values.

パラメータ更新ステップが完了すると、タイムステップの更新が行われる（ステップＳ６）。ここでは、現在のタイムステップｔが１つ増加される。続いて、ステップＳ６において更新されたタイムステップが、予め定められた最大タイムステップ数Ｔ以上であるか否かの判定が行われる（ステップＳ７）。 When the parameter update step is completed, the time step is updated (step S6). Here, the current time step t is incremented by one. Then, it is determined whether the time step updated in step S6 is equal to or greater than a predetermined maximum number of time steps T (step S7).

更新されたタイムステップが最大タイムステップ数Ｔ以上でない場合には、ステップＳ４に戻る。一方で、更新されたタイムステップが最大タイムステップ数Ｔ以上である場合には、学習エージェント３が十分な性能であるか否かの判定が行われる（ステップＳ８）。ここでは、性能評価部４が学習エージェント３の性能を評価かつ判断する。 If the updated time step is not equal to or greater than the maximum number of time steps T, the process returns to step S4. On the other hand, if the updated time step is equal to or greater than the maximum number of time steps T, a determination is made as to whether the learning agent 3 has sufficient performance (step S8). Here, the performance evaluation unit 4 evaluates and judges the performance of the learning agent 3.

ステップＳ８において、学習エージェント３の性能が十分であると判定された場合には、学習エージェント３が学習済みであるとされ、学習エージェント３の強化学習が終了となる。一方で、ステップＳ８において、学習エージェント３の性能が十分でないと判定された場合には、再びステップＳ１に戻ってハイパーパラメータを決定する。このとき、例えば、ベイズ最適化法等のアルゴリズムによってハイパーパラメータが更新される。 If it is determined in step S8 that the performance of the learning agent 3 is sufficient, the learning agent 3 is deemed to have completed learning, and the reinforcement learning of the learning agent 3 is terminated. On the other hand, if it is determined in step S8 that the performance of the learning agent 3 is not sufficient, the process returns to step S1 again to determine the hyperparameters. At this time, the hyperparameters are updated by an algorithm such as the Bayesian optimization method.
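The flow of steps S1 to S8 can be summarized as a nested loop. The sketch below is an illustration under stated assumptions: every callable is a hypothetical stand-in supplied by the caller, the hyperparameter proposal is left abstract rather than implementing Bayesian optimization, and `max_trials` is an added safety cap that the text does not mention.

```python
def train_until_sufficient(propose_hyperparams, initialize, collect_data,
                           update_parameters, is_sufficient,
                           max_time_steps, max_trials):
    """Run the S1-S8 loop; return the trained session, or None if no trial passes.

    All callables are hypothetical placeholders for the units described in the text.
    """
    for trial in range(max_trials):
        hyperparams = propose_hyperparams(trial)   # S1: determine hyperparameters
        session = initialize(hyperparams)          # S2: reset simulator and agent
        t = 0                                      # S3: time step t = 0
        while True:
            batch = collect_data(session)          # S4: data collection step
            update_parameters(session, batch)      # S5: parameter update step
            t += 1                                 # S6: advance the time step
            if t >= max_time_steps:                # S7: reached T?
                break
        if is_sufficient(session):                 # S8: performance evaluation
            return session
    return None
```

Because the agent and simulator are re-initialized in step S2 for every new set of hyperparameters, each trial starts from the initial settings, as described for the return from step S8 to step S1.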

なお、ステップＳ１におけるハイパーパラメータの決定は、例えば作業者や外部機器が行うようにしても良い。このような場合には、物流搬送速度制御エージェント学習装置１にハイパーパラメータ設定部５を設けないようにすることも可能である。 The determination of the hyperparameters in step S1 may be performed, for example, by an operator or an external device. In such a case, it is also possible not to provide the hyperparameter setting unit 5 in the logistics transport speed control agent learning device 1.

ステップＳ８において、学習済みであると判定された学習エージェント３は、物流設備にインストールされ、物流設備に設置された物流搬送路の速度制御を行う。学習済みの学習エージェント３によれば、物流搬送路でワークの搬送が完了すること、ワークのドロップを発生させないことを優先として、搬送速度に関するエネルギ消費量が大きくなることを抑制するように速度制御を行う。このため、学習エージェント３を用いることで、物流設備においてワークの搬送を確実に実施可能であると共に、エネルギの消費量を削減することが可能となる。 The learning agent 3 that is determined in step S8 to have been trained is installed in the logistics facility and controls the speed of the logistics transport route installed in the logistics facility. The trained learning agent 3 performs speed control so as to suppress increases in energy consumption related to the transport speed, while giving priority to completing the transport of workpieces on the logistics transport route and to not causing workpieces to be dropped. Therefore, by using the learning agent 3, it is possible to reliably transport workpieces in the logistics facility and to reduce energy consumption.

このような学習済みの学習エージェント３は、例えば、搬送ユニットにてメンテナンスが実施されない期間あるいはメンテナンスまでの時間的余裕があるような場合には、規定の搬送時間を超えない範囲で、ドロップが発生せずにかつエネルギ消費量が小さくなるように搬送ユニットの速度を制御する。 For example, when there is a period during which no maintenance is performed on the transport unit or there is sufficient time before maintenance is performed, such a trained learning agent 3 controls the speed of the transport unit so that no drops occur and energy consumption is reduced, without exceeding the specified transport time.

また、学習済みの学習エージェント３は、例えば、ある搬送ユニットのメンテナンスが迫っている場合には、この搬送ユニットがメンテナンスに入る前までに上流側における搬送速度を増加させて、メンテナンス対象の搬送ユニットの上流側に位置するワークの数を減少させる。また、学習済みの学習エージェント３は、例えば、ある搬送ユニットがメンテナンスを行っている場合には、この搬送ユニットの上流側におけるワークの搬送速度を遅くすることで、ドロップが発生しない範囲でエネルギ消費量が小さくなるように搬送ユニットの速度を制御する。 Furthermore, for example, when maintenance of a transport unit is approaching, the trained learning agent 3 increases the transport speed on the upstream side before the transport unit undergoes maintenance, thereby reducing the number of workpieces located upstream of the transport unit that is the subject of maintenance. For example, when a transport unit is undergoing maintenance, the trained learning agent 3 slows down the transport speed of the workpieces upstream of the transport unit, thereby controlling the speed of the transport unit so that energy consumption is reduced without causing drops.

以上のような本実施形態の物流搬送速度制御エージェント学習装置１は、物流搬送路を形成する複数の搬送ユニットにおける搬送速度を制御する学習エージェント３を強化学習させる。本実施形態の物流搬送速度制御エージェント学習装置１は、モデル化された物流搬送路Ｈを用いて物流搬送路Ｈの状態及び当該状態に基づく報酬を算出する搬送路シミュレータ２を備えている。また、本実施形態の物流搬送速度制御エージェント学習装置１は、報酬に基づいた評価が大きくなるように搬送速度に関する学習を行う学習エージェント３を備えている。搬送路シミュレータ２は、物流搬送路ＨでワークＷの搬送が完了したことで得られる搬送完了報酬と、物流搬送路ＨでワークＷの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて報酬を算出する。 The logistics transport speed control agent learning device 1 of this embodiment as described above performs reinforcement learning on the learning agent 3 that controls the transport speed of multiple transport units that form a logistics transport route. The logistics transport speed control agent learning device 1 of this embodiment includes a transport route simulator 2 that uses a modeled logistics transport route H to calculate the state of the logistics transport route H and a reward based on the state. The logistics transport speed control agent learning device 1 of this embodiment also includes a learning agent 3 that learns about the transport speed so that the evaluation based on the reward becomes larger. The transport route simulator 2 calculates the reward based on the transport completion reward obtained when the transport of the work W is completed on the logistics transport route H, the acceptance refusal penalty given when the logistics transport route H cannot accept the work W, and the energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

本実施形態の物流搬送速度制御エージェント学習装置１によれば、搬送路シミュレータ２で算出された報酬が学習エージェント３に入力され、報酬に基づいて学習エージェント３が学習を行う。報酬は、物流搬送路でワークの搬送が完了したことで得られる搬送完了報酬と、物流搬送路でワークの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて算出される。つまり、学習エージェント３に入力される報酬は、搬送速度に関するエネルギ消費量が大きくなることで減少する。 According to the logistics transport speed control agent learning device 1 of this embodiment, the reward calculated by the transport route simulator 2 is input to the learning agent 3, and the learning agent 3 learns based on the reward. The reward is calculated based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the amount of energy consumption related to the transport speed increases. In other words, the reward input to the learning agent 3 decreases as the amount of energy consumption related to the transport speed increases.

したがって、本実施形態の物流搬送速度制御エージェント学習装置１によれば、学習エージェント３は、搬送速度に関するエネルギ消費量を小さくするように学習する。また、学習エージェント３に入力される報酬は、物流搬送路でワークの受け付けができない場合に減少する。このため、本実施形態の物流搬送速度制御エージェント学習装置１において、学習エージェント３は、物流搬送路でワークの受け付けができない場合を回避しようと学習する。よって、本実施形態の物流搬送速度制御エージェント学習装置１によれば、物流設備において搬送路に対して渋滞によりワークが受け付けられなくなることを抑制しつつ、エネルギ消費量を削減することが可能になる。 Therefore, according to the logistics transport speed control agent learning device 1 of this embodiment, the learning agent 3 learns to reduce energy consumption related to the transport speed. Furthermore, the reward input to the learning agent 3 decreases when the logistics transport route cannot accept work. For this reason, in the logistics transport speed control agent learning device 1 of this embodiment, the learning agent 3 learns to avoid cases where the logistics transport route cannot accept work. Therefore, according to the logistics transport speed control agent learning device 1 of this embodiment, it is possible to reduce energy consumption while preventing the logistics facility from being unable to accept work due to congestion on the transport route.

また、本実施形態の物流搬送速度制御エージェント学習装置１において、モデル化された物流搬送路Ｈが、単一あるいは複数の搬送ユニットを含む制御対象部を複数（第１制御対象部Ｔ１、第２制御対象部Ｔ２及び第３制御対象部Ｔ３）有している。また、学習エージェント３が、制御対象部ごとに搬送速度を制御する。このため、全ての搬送ユニットを個別に制御するよりも、制御を容易化することが可能となる。 In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the modeled logistics transport route H has multiple control target parts (first control target part T1, second control target part T2, and third control target part T3) that include a single or multiple transport units. Furthermore, the learning agent 3 controls the transport speed for each control target part. This makes it possible to make control easier than controlling all the transport units individually.
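Controlling speed per control target part means the agent outputs one command per part, which is then applied to every cell in that part. A minimal sketch follows; the mapping layout, the cell indices, and the function name are hypothetical and are not taken from the text.

```python
from typing import Dict, List


def expand_section_commands(section_cells: Dict[str, List[int]],
                            section_speeds: Dict[str, float],
                            num_cells: int) -> List[float]:
    """Broadcast one speed command per control target part to its cells.

    section_cells maps a part name (e.g. "T1") to the indices of its cells.
    Cells not assigned to any part keep a command of 0.0 in this sketch.
    """
    cell_speeds = [0.0] * num_cells
    for section, cells in section_cells.items():
        for i in cells:
            cell_speeds[i] = section_speeds[section]
    return cell_speeds
```

With parts T1 = cells 0 and 1, T2 = cell 2, and T3 = cells 3 and 4, the agent chooses three values instead of five, which is the reduction in control dimension the paragraph refers to.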

また、本実施形態の物流搬送速度制御エージェント学習装置１において、搬送路シミュレータ２が、搬送ユニットごとの速度指令値に基づいて消費エネルギ増加罰を算出する。このため、速度指令値が大きくなるほど、搬送ユニットの消費エネルギが増加する。したがって、搬送ユニットごとの速度指令値に基づいて消費エネルギ増加罰を算出することによって、搬送ユニットの速度に応じて罰則を変化させることができる。 In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the transport route simulator 2 calculates the energy consumption increase penalty based on the speed command value for each transport unit. This reflects the fact that the larger the speed command value, the more energy the transport unit consumes. Therefore, by calculating the energy consumption increase penalty based on the speed command value for each transport unit, the penalty can be changed according to the speed of the transport unit.

また、本実施形態の物流搬送速度制御エージェント学習装置１において、搬送路シミュレータ２が、搬送ユニットごとの速度指令値を各々二乗した値の総和に基づいて消費エネルギ増加罰を算出する。このため、搬送ユニットの搬送速度が大きい状態でさらに搬送速度を高めようとした場合の罰則が、搬送ユニットの搬送速度がゼロあるいは小さい状態で搬送速度を高めようとした場合の罰則よりも大きくなる。したがって、例えば、搬送ユニットにおいて、ワークＷが停止した状態やワークＷの搬送速度が小さい状態から一定の速度まで搬送速度を素早く速めることが可能となる一方で、ワークＷの搬送速度が既に速い状態からさらに速めようとする動作を防ぐように学習エージェント３を学習させることができる。 In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the transport path simulator 2 calculates the energy consumption increase penalty based on the sum of the squared values of the speed command values for each transport unit. Therefore, the penalty for attempting to further increase the transport speed when the transport unit's transport speed is high is greater than the penalty for attempting to increase the transport speed when the transport unit's transport speed is zero or low. Therefore, for example, in the transport unit, it is possible to quickly increase the transport speed from a state in which the workpiece W is stopped or the transport speed of the workpiece W is low to a certain speed, while the learning agent 3 can be trained to prevent an action that attempts to further increase the transport speed of the workpiece W from a state in which the transport speed is already high.

また、本実施形態の物流搬送速度制御エージェント学習装置１においては、物流搬送路ＨにおけるワークＷの搬送中にメンテナンスによる停止期間が発生する可能性があるメンテナンス発生搬送ユニット（メンテナンス発生セルＳａ）が、物流搬送路を形成する複数の搬送ユニットに含まれ、単一の制御対象部に含まれるメンテナンス発生搬送ユニット（メンテナンス発生セルＳａ）は１つ以下である。このように、単一の制御対象部に含まれるメンテナンス発生セルＳａを１つ以下とすることで、単一の制御対象部の搬送速度が、複数のメンテナンス発生セルＳａの影響を受けることを抑止することが可能となる。 In addition, in the logistics transport speed control agent learning device 1 of this embodiment, the multiple transport units that form the logistics transport route include maintenance occurrence transport units (maintenance occurrence cells Sa), in which a stoppage period due to maintenance may occur while the workpiece W is being transported on the logistics transport route H, and each control target part contains at most one maintenance occurrence transport unit (maintenance occurrence cell Sa). By limiting the number of maintenance occurrence cells Sa in a single control target part to at most one in this way, it is possible to prevent the transport speed of a single control target part from being affected by multiple maintenance occurrence cells Sa.
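The constraint of at most one maintenance occurrence cell per control target part can be checked mechanically when the route model is defined. The following sketch assumes the same hypothetical mapping from part names to cell indices as a plain dictionary; the function name is also hypothetical.

```python
from typing import Dict, Iterable, List


def sections_with_multiple_maintenance_cells(
        section_cells: Dict[str, List[int]],
        maintenance_cells: Iterable[int]) -> List[str]:
    """Return the names of parts that contain more than one maintenance cell Sa."""
    maintenance = set(maintenance_cells)
    return [name for name, cells in section_cells.items()
            if sum(1 for i in cells if i in maintenance) > 1]
```

An empty result means the partition satisfies the constraint; any returned name identifies a part that would need to be split.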

また、以上のような本実施形態の物流搬送速度制御エージェント学習方法は、物流搬送路を形成する複数の搬送ユニットにおける搬送速度を制御する学習エージェント３を強化学習させる。本実施形態の物流搬送速度制御エージェント学習方法は、搬送路シミュレータ２によって、モデル化された物流搬送路Ｈを用いて物流搬送路Ｈの状態及び当該状態に基づく報酬を算出する。また、本実施形態の物流搬送速度制御エージェント学習方法においては、学習エージェント３が、報酬に基づいた評価が大きくなるように搬送速度に関する学習を行い、搬送路シミュレータ２にて、報酬を算出する。さらに、搬送路シミュレータ２によって、物流搬送路でワークの搬送が完了したことで得られる搬送完了報酬と、物流搬送路でワークの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて報酬を算出する。 The logistics transport speed control agent learning method of this embodiment as described above causes reinforcement learning of the learning agent 3 that controls the transport speed of multiple transport units that form a logistics transport route. In the logistics transport speed control agent learning method of this embodiment, the transport route simulator 2 calculates the state of the logistics transport route H and the reward based on the state using the modeled logistics transport route H. In addition, in the logistics transport speed control agent learning method of this embodiment, the learning agent 3 learns about the transport speed so that the evaluation based on the reward is increased, and the transport route simulator 2 calculates the reward. Furthermore, the transport route simulator 2 calculates the reward based on the transport completion reward obtained when the transport of the work is completed on the logistics transport route, the acceptance refusal penalty given when the logistics transport route cannot accept the work, and the energy consumption increase penalty that increases due to an increase in the energy consumption related to the transport speed.

このような本実施形態の物流搬送速度制御エージェント学習方法によれば、搬送路シミュレータ２で算出された報酬が学習エージェント３に入力され、報酬に基づいて学習エージェント３が学習を行う。報酬は、物流搬送路でワークの搬送が完了したことで得られる搬送完了報酬と、物流搬送路でワークの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて算出される。つまり、学習エージェント３に入力される報酬は、搬送速度に関するエネルギ消費量が大きくなることで減少する。 According to the logistics transport speed control agent learning method of this embodiment, the reward calculated by the transport route simulator 2 is input to the learning agent 3, and the learning agent 3 learns based on the reward. The reward is calculated based on a transport completion reward obtained when the transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when the workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the amount of energy consumption related to the transport speed increases. In other words, the reward input to the learning agent 3 decreases as the amount of energy consumption related to the transport speed increases.

したがって、本実施形態の物流搬送速度制御エージェント学習方法によれば、学習エージェント３は、搬送速度に関するエネルギ消費量を小さくするように学習する。また、学習エージェント３に入力される報酬は、物流搬送路でワークの受け付けができない場合に減少する。このため、本実施形態の物流搬送速度制御エージェント学習方法において、学習エージェント３は、物流搬送路でワークの受け付けができない場合を回避しようと学習する。よって、本実施形態の物流搬送速度制御エージェント学習方法によれば、物流設備において搬送路に対して渋滞によりワークが受け付けられなくなることを抑制しつつ、エネルギ消費量を削減することが可能になる。 Therefore, according to the logistics transport speed control agent learning method of this embodiment, the learning agent 3 learns to reduce energy consumption related to the transport speed. Furthermore, the reward input to the learning agent 3 decreases when the logistics transport route cannot accept work. For this reason, in the logistics transport speed control agent learning method of this embodiment, the learning agent 3 learns to avoid cases where the logistics transport route cannot accept work. Therefore, according to the logistics transport speed control agent learning method of this embodiment, it is possible to reduce energy consumption while preventing the logistics facility from being unable to accept work due to congestion on the transport route.

また、本実施形態においては、上述のように、コンピュータが学習プログラムＰによって、物流搬送速度制御エージェント学習装置１として機能される。この学習プログラムＰは、コンピュータを、搬送路シミュレータ２及び学習エージェント３として機能させる。つまり、学習プログラムＰは、コンピュータを、物流搬送路を形成する複数の搬送ユニットにおける搬送速度を制御する学習エージェント３を強化学習させる物流搬送速度制御エージェント学習装置１として機能させる。また、学習プログラムＰは、コンピュータを、モデル化された物流搬送路Ｈを用いて物流搬送路Ｈの状態及び当該状態に基づく報酬を算出する搬送路シミュレータ２と機能させる。また、学習プログラムＰは、コンピュータを、報酬に基づいた評価が大きくなるように搬送速度に関する学習を行う学習エージェント３として機能させる。さらに、学習プログラムＰは、コンピュータを搬送路シミュレータ２として機能させる場合に、物流搬送路ＨでワークＷの搬送が完了したことで得られる搬送完了報酬と、物流搬送路ＨでワークＷの受け付けができない場合に与えられる受付拒否罰と、搬送速度に関するエネルギ消費量が大きくなることで増加する消費エネルギ増加罰とに基づいて報酬を算出させる。 In addition, in this embodiment, as described above, the learning program P causes the computer to function as the logistics transport speed control agent learning device 1. This learning program P causes the computer to function as the transport route simulator 2 and the learning agent 3. In other words, the learning program P causes the computer to function as the logistics transport speed control agent learning device 1 that performs reinforcement learning on the learning agent 3 that controls the transport speed in multiple transport units that form a logistics transport route. The learning program P also causes the computer to function as the transport route simulator 2 that uses the modeled logistics transport route H to calculate the state of the logistics transport route H and a reward based on the state. The learning program P also causes the computer to function as the learning agent 3 that learns about the transport speed so that the evaluation based on the reward becomes larger. Furthermore, when the learning program P causes the computer to function as the transport route simulator 2, it causes the reward to be calculated based on the transport completion reward obtained when the transport of the workpiece W is completed on the logistics transport route H, the acceptance refusal penalty given when the workpiece W cannot be accepted on the logistics transport route H, and the energy consumption increase penalty that increases as the energy consumption related to the transport speed increases.

According to the learning program P of this embodiment, the reward calculated by the transport route simulator 2 is input to the learning agent 3, and the learning agent 3 learns based on that reward. The reward is calculated based on a transport completion reward obtained when transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when a workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the energy consumption associated with the transport speed increases. In other words, the reward input to the learning agent 3 decreases as the energy consumption associated with the transport speed increases.
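The three-part reward described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the reward calculation formula of the embodiment: the function name `compute_reward`, the weights `w_complete`, `w_reject`, and `w_energy`, and their default values are hypothetical. Only the 0-or-1 indicators, the sum of squared speed command values, and the fact that both penalties lower the reward are taken from the description.

```python
from typing import Sequence


def compute_reward(
    completed: int,
    rejected: int,
    speed_commands: Sequence[float],
    w_complete: float = 1.0,  # assumed weight, not given in the source
    w_reject: float = 1.0,    # assumed weight, not given in the source
    w_energy: float = 0.1,    # assumed weight, not given in the source
) -> float:
    """Return completion reward minus refusal penalty minus energy penalty."""
    if completed not in (0, 1) or rejected not in (0, 1):
        raise ValueError("completed and rejected must each be 0 or 1")
    # Energy consumption increase penalty: sum of squared speed command values.
    energy_penalty = sum(v * v for v in speed_commands)
    return (
        w_complete * completed
        - w_reject * rejected
        - w_energy * energy_penalty
    )
```

With these assumed weights, a step that completes a transport at lower speed commands scores higher than the same step at higher speed commands, and a refused workpiece lowers the reward regardless of speed.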

Therefore, according to the learning program P of this embodiment, the learning agent 3 learns to reduce the energy consumption associated with the transport speed. In addition, the reward input to the learning agent 3 decreases when a workpiece cannot be accepted on the logistics transport route, so the learning agent 3 also learns to avoid such cases. As a result, the learning program P makes it possible to reduce energy consumption while suppressing situations in which workpieces can no longer be accepted on a transport route of the logistics facility because of congestion.

A preferred embodiment of the present invention has been described above with reference to the attached drawings, but it goes without saying that the present invention is not limited to that embodiment. The shapes, combinations, and other details of the components shown in the embodiment are merely examples, and various modifications can be made based on design requirements and the like without departing from the spirit of the present invention.

For example, in the above embodiment, the reward calculation formula calculates the energy consumption increase penalty based on the speed command value of each transport unit. However, the present invention is not limited to this. The energy consumed on the logistics transport route also varies with the acceleration of the transport units: when the acceleration of a transport unit is large, the target speed is reached sooner but more energy is consumed, whereas when the acceleration is small, the target speed is reached later but less energy is consumed. The transport route simulator 2 may therefore be configured to calculate the energy consumption increase penalty based on the acceleration of each transport unit.
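An acceleration-based penalty of this kind might look like the sketch below. The description only states that the penalty may be based on the acceleration of each transport unit; the finite-difference estimate of acceleration, the squared form, the time step `dt`, and the function name `acceleration_energy_penalty` are all assumptions made for illustration.

```python
from typing import Sequence


def acceleration_energy_penalty(
    prev_speeds: Sequence[float],
    next_speeds: Sequence[float],
    dt: float = 1.0,  # assumed control interval, not given in the source
) -> float:
    """Sum of squared per-unit accelerations between two control steps."""
    if len(prev_speeds) != len(next_speeds):
        raise ValueError("speed lists must have one entry per transport unit")
    if dt <= 0:
        raise ValueError("dt must be positive")
    # Acceleration of each unit is estimated as (v_next - v_prev) / dt.
    return sum(((v1 - v0) / dt) ** 2 for v0, v1 in zip(prev_speeds, next_speeds))
```

Under this assumed form, reaching the same target speed in a shorter interval produces a larger penalty, which mirrors the trade-off between ramp-up time and energy consumption described in the text.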

In the above embodiment, the energy consumption increase penalty is calculated based on the sum of the squares of the speed command values of the transport units. However, the present invention is not limited to this. For example, the penalty may be calculated based on the sum of the speed command values each raised to a power of three or more. The penalty may also be calculated separately for each transport unit, or calculated using the largest speed command value among the plurality of transport units.
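The three variations named in this paragraph can be compared side by side in a short sketch. The function names are hypothetical, and taking the absolute value of each speed command (so that odd exponents cannot produce a negative penalty) is an assumption; the description does not say whether speed command values can be negative.

```python
from typing import List, Sequence


def power_sum_penalty(speed_commands: Sequence[float], exponent: int = 3) -> float:
    """Sum of speed command magnitudes raised to the given power."""
    if exponent < 2:
        raise ValueError("exponent must be 2 or more")
    return sum(abs(v) ** exponent for v in speed_commands)


def per_unit_penalties(speed_commands: Sequence[float], exponent: int = 2) -> List[float]:
    """One penalty value per transport unit instead of a single total."""
    return [abs(v) ** exponent for v in speed_commands]


def max_speed_penalty(speed_commands: Sequence[float], exponent: int = 2) -> float:
    """Penalty computed only from the largest speed command magnitude."""
    largest = max((abs(v) for v in speed_commands), default=0.0)
    return largest ** exponent
```

A higher exponent penalizes fast units more steeply relative to slow ones, while the maximum-based form ignores every unit except the fastest.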

In the above embodiment, the transport completion reward obtained when transport of a workpiece is completed on the logistics transport route and the acceptance refusal penalty given when a workpiece cannot be accepted on the logistics transport route each take a value of 0 or 1 in the reward calculation formula. However, the present invention is not limited to this. For example, either or both of the transport completion reward and the acceptance refusal penalty may take intermediate values between 0 and 1.
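The description does not say how intermediate values would be derived. One possibility, offered purely as an assumption, is to use the share of workpieces affected in a step, for example the fraction of offered workpieces that were refused. The helper name `fractional_signal` and its arguments are hypothetical.

```python
def fractional_signal(count: int, total: int) -> float:
    """Return count / total clamped to the range [0, 1].

    Example use (assumed): count = workpieces refused in a step,
    total = workpieces offered in that step.
    """
    if total <= 0:
        # Nothing was offered, so there is nothing to reward or penalize.
        return 0.0
    return min(1.0, max(0.0, count / total))
```

The resulting value could stand in for the 0-or-1 indicator in either the transport completion reward or the acceptance refusal penalty.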

In the above embodiment, the reward calculation formula has three terms: a first term representing the transport completion reward obtained when transport of a workpiece is completed on the logistics transport route; a second term representing the acceptance refusal penalty given when a workpiece cannot be accepted on the logistics transport route; and a third term representing the energy consumption increase penalty, which increases as the energy consumption associated with the transport speed increases. However, the present invention is not limited to this, and a reward calculation formula with four or more terms may also be used.
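A reward with an arbitrary number of terms can be written as a signed weighted sum, as in the sketch below. The function name `combine_reward_terms`, all weights, and in particular the fourth term are hypothetical; the description does not state what an additional term would represent.

```python
from typing import Iterable, Tuple


def combine_reward_terms(terms: Iterable[Tuple[float, float]]) -> float:
    """Sum weight * value pairs; negative weights act as penalties."""
    return sum(weight * value for weight, value in terms)


reward = combine_reward_terms([
    (1.0, 1.0),   # term 1: transport completion reward (completed)
    (-1.0, 0.0),  # term 2: acceptance refusal penalty (no refusal)
    (-0.1, 5.0),  # term 3: energy penalty, e.g. 1.0**2 + 2.0**2
    (-0.5, 0.2),  # term 4: hypothetical extra penalty, not from the source
])
```

Keeping the terms in a list means a further term can be added without changing the function that combines them.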

1: logistics transport speed control agent learning device; 2: transport route simulator; 2a: cell definition unit; 2b: transport route control unit; 2c: reward calculation unit; 3: learning agent; 3a: value estimation network; 3b: policy network; 3c: network update unit; 4: performance evaluation unit; 5: hyperparameter setting unit; H: logistics transport route; P: learning program (logistics transport speed control agent learning program); S: cell (transport unit); Sa: maintenance occurrence cell (maintenance occurrence transport unit); T1: first control target unit (control target unit); T2: second control target unit (control target unit); T3: third control target unit (control target unit); W: workpiece

Claims

1. A logistics transport speed control agent learning device that performs reinforcement learning on a learning agent that controls transport speeds of a plurality of transport units forming a logistics transport route, the device comprising:
a transport route simulator that uses a modeled logistics transport route to calculate a state of the logistics transport route and a reward based on the state; and
the learning agent, which learns about the transport speed so as to increase an evaluation based on the reward,
wherein the transport route simulator calculates the reward based on a transport completion reward obtained when transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty imposed when a workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the energy consumption associated with the transport speed increases.

2. The logistics transport speed control agent learning device according to claim 1, wherein
the modeled logistics transport route has a plurality of control target units, each including one or more of the transport units, and
the learning agent controls the transport speed for each of the control target units.

3. The logistics transport speed control agent learning device according to claim 2, wherein the transport route simulator calculates the energy consumption increase penalty based on a speed command value for each of the transport units.

4. The logistics transport speed control agent learning device according to claim 3, wherein the transport route simulator calculates the energy consumption increase penalty based on the sum of the squares of the speed command values of the transport units.

5. The logistics transport speed control agent learning device according to claim 2, wherein the transport route simulator calculates the energy consumption increase penalty based on the acceleration of each of the transport units.

6. The logistics transport speed control agent learning device according to claim 2, wherein
the plurality of transport units forming the logistics transport route include a maintenance occurrence transport unit that may be stopped for maintenance while workpieces are being transported on the logistics transport route, and
the number of maintenance occurrence transport units included in a single control target unit is at most one.

7. A logistics transport speed control agent learning method for performing reinforcement learning on a learning agent that controls transport speeds of a plurality of transport units forming a logistics transport route, the method comprising:
calculating, by a transport route simulator using a modeled logistics transport route, a state of the logistics transport route and a reward based on the state; and
learning, by the learning agent, about the transport speed so as to increase an evaluation based on the reward,
wherein the transport route simulator calculates the reward based on a transport completion reward obtained when transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty imposed when a workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the energy consumption associated with the transport speed increases.

8. A logistics transport speed control agent learning program that causes a computer to function as a logistics transport speed control agent learning device that performs reinforcement learning on a learning agent that controls transport speeds of a plurality of transport units forming a logistics transport route, the program:
causing the computer to function as a transport route simulator that uses a modeled logistics transport route to calculate a state of the logistics transport route and a reward based on the state;
causing the computer to function as the learning agent, which learns about the transport speed so as to increase an evaluation based on the reward; and
when causing the computer to function as the transport route simulator, causing the reward to be calculated based on a transport completion reward obtained when transport of a workpiece is completed on the logistics transport route, an acceptance refusal penalty given when a workpiece cannot be accepted on the logistics transport route, and an energy consumption increase penalty that increases as the energy consumption associated with the transport speed increases.