JP7308073B2 - Logistics management system

Info

Publication number: JP7308073B2
Application number: JP2019089900A
Authority: JP
Inventors: 和弘小池
Original assignee: Askul Corp
Current assignee: Askul Corp
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2023-07-13
Anticipated expiration: 2039-05-10
Also published as: JP2020187416A

Description

The present invention relates to a logistics management system.

In supply chain logistics management, the so-called Bullwhip Effect (hereinafter "BE") is known. BE is a phenomenon in which, as a result of demand forecasting and decision-making downstream in the supply chain, demand is amplified as it propagates from downstream to upstream. Because this phenomenon leads to excess inventory and stockouts, the mechanism by which BE arises and methods for suppressing it have been studied for many years. Factors that cause BE include price lists, order frequency, return policies, the frequency and depth of price promotions, the degree of information sharing, demand forecasting methods, and allocation rules during shortages.

The Beer Game (hereinafter "BG") is known as one method for examining BE. BG is a simulation game of a serially connected beer supply chain in which each of four players (retailer, wholesaler, distributor, and manufacturer) competes to minimize cost within a fixed period.

The BG process is known as a Markov Decision Process (MDP). Because each player in BG can observe only the orders and goods exchanged with adjacent players and its own inventory level, BG is a so-called Partially Observable Markov Decision Process (POMDP). In BG, each player chooses the action that minimizes cost based on the information it can observe. However, BG is a complex problem because its observation and action spaces are large and it involves non-stationary time series. It has been shown that applying deep reinforcement learning to BG may make it possible to achieve overall optimization of logistics processes in the supply chain (Non-Patent Documents 1 to 3).

Non-Patent Document 1: V. Mnih et al. Human-level control through deep reinforcement learning, doi 10.1038/nature14236 (2015).
Non-Patent Document 2: Afshin Oroojlooyjadid et al. A Deep Q-Network for the Beer Game: A Reinforcement Learning Algorithm to Solve Inventory Optimization Problems, arXiv:1708.05924v2 (2018).
Non-Patent Document 3: Taiki Fuji et al. Deep Multi-Agent Reinforcement Learning using DNN-Weight Evolution to Optimize Supply Chain Performance. doi 10.24251/HICSS.2018.157 (2018).

In mail-order sales over the Internet (online shopping), the gap between the efficiency of information flow in cyberspace and the efficiency of goods flow in physical space is becoming pronounced, and this gap may become a new cause of BE. For example, highly efficient sales measures on an Electronic Commerce (EC) site may amplify demand fluctuations, resulting in delivery delays, stockouts, or excess inventory. Consequently, even with the conventional techniques described above, it may not be possible to achieve overall optimization of the logistics process in a supply chain for online shopping, and no established method for achieving such overall optimization has been proposed.

The technology of the present disclosure has been made in view of the above circumstances, and its object is to provide a technology that supports overall optimization of logistics management.

The prediction device of the present disclosure has an acquisition unit that acquires product information on a product sold through electronic commerce, and a control unit. For a plurality of values of a behavior variable permitted in an environment related to the logistics of the product, the control unit generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and state variables of the environment, and uses the third behavior model to predict the optimal value of the behavior variable in the environment. The prediction device can thereby generate, from two behavior models given conflicting goals in the logistics management of the product, a balanced behavior model suited to overall optimization, and predict the optimal behavior variable.

In the above prediction device, the behavior variable may be the number of units of the product ordered by the seller of the product; the product information acquired by the acquisition unit may be information indicating fluctuations in the numbers of units ordered, shipped, and received over a certain period; and the state variables of the environment may be the sales of the product, the purchase price of the product, a stockout cost associated with stockouts of the product, a sales promotion cost associated with promoting the product, an inventory cost associated with excess inventory of the product, and a delivery cost associated with delivering the product. For each order quantity corresponding to the plurality of values of the behavior variable, the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variables of the environment, and may use the third behavior model to predict the optimal order quantity. As a result, the Bullwhip Effect Index (BEI) is kept from exceeding its threshold while shipments are made so as to satisfy demand as far as possible. This balances the conflicting goals of minimizing inventory cost and maximizing sales, and allows the inventory of the product to be optimized.

In the above prediction device, the behavior variable may be the man-hours related to storing the product in a warehouse; the product information acquired by the acquisition unit may be information indicating the date on which a sales promotion of the product is carried out; and the state variable of the environment may be a sales promotion cost determined according to the day of the week of that date. For each man-hour value corresponding to the plurality of values of the behavior variable, the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variable of the environment, and may use the third behavior model to predict the optimal man-hours. As a result, shipments are made so as to satisfy demand that fluctuates with the sales promotion, while man-hours in the warehouse where the product is stored are kept low. This balances the conflicting goals of maximizing sales and minimizing shipping cost, and allows the man-hours allocated to shipping the product on the promotion date set in the simulation to be optimized.

In the above prediction device, the behavior variable may be a value indicating how readily the product is recommended on the EC site; the product information acquired by the acquisition unit may be information indicating the number of clicks on the product and the number of times it was displayed on the EC site over a certain period; and the state variables of the environment may be the size of the product, the sales of the product, the purchase price of the product, and an inventory cost determined according to the size of the product. For each recommendation value corresponding to the plurality of values of the behavior variable, the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variables of the environment, and may use the third behavior model to predict the optimal value indicating how readily the product is recommended. As a result, products that do not sell because they are not recommended can be distinguished from products that do not sell even when recommended, so that products more likely to sell are recommended, and product recommendation on the EC site can be optimized.

In the above prediction device, the behavior variable may be an incentive given to a user for changing the designated delivery date of the product on the EC site, and the product information acquired by the acquisition unit may be information indicating the delivery destination and delivery date and time of the product, the delivery area of the carrier and the number of destinations it can serve, the time required to deliver the product, and the delivery distance. For each incentive corresponding to the plurality of values of the behavior variable, the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variables of the environment, and may use the third behavior model to predict the optimal incentive. As a result, the size of the incentive offered when prompting a user to change the delivery date of the product on the EC site can be optimized while balancing the two goals of minimizing delivery cost and minimizing incentive cost.

According to the technology of the present disclosure, a technology that supports overall optimization of logistics management can be provided.

FIG. 1 shows an example of a prediction device according to the first embodiment.
FIG. 2A shows the relationship between the players and the supply chain in BG, and FIG. 2B shows the relationship between the players and the supply chain in the Netshop Game executed in the first embodiment.
FIG. 3 shows an example of the simulation algorithm executed in the first embodiment.
FIG. 4 shows an example of the environment parameters set in the first embodiment.
FIG. 5 shows an example of the Netshop Game rewards set in the first embodiment.
FIG. 6 shows an example of the updating of the state variables executed in the first embodiment.
FIG. 7 shows an example of the goal conditions for the meta player in the first embodiment.
FIG. 8 is a graph showing an example of changes in time steps and rewards for the meta player in the first embodiment.
FIG. 9 is a graph showing an example of the transition of rewards earned by the cyber player in the first embodiment.
FIG. 10 is a graph showing an example of the transition of rewards earned by the physical player in the first embodiment.
FIG. 11 is a graph showing an example of the transition of rewards earned by the meta player in the first embodiment.
FIG. 12 shows an example of the simulation results for each player in the first embodiment.
FIG. 13A shows an example of the simulation rewards in Modification 1, FIG. 13B shows an example of the updating of the simulation state variables in Modification 1, and FIG. 13C shows an example of the simulation goal conditions in Modification 1.
FIG. 14A shows an example of the simulation rewards in Modification 2, FIG. 14B shows an example of the updating of the simulation state variables in Modification 2, and FIG. 14C shows an example of the simulation goal conditions in Modification 2.
FIG. 15A shows an example of the simulation rewards in Modification 3, FIG. 15B shows an example of the updating of the simulation state variables in Modification 3, and FIG. 15C shows an example of the simulation goal conditions in Modification 3.

Preferred embodiments of the technology of the present disclosure are described below with reference to the drawings. However, the configurations of the components described below should be changed as appropriate according to the configuration of the device to which the technology is applied and various conditions. The following description is therefore not intended to limit the technical scope of the disclosed technology.

(First embodiment)
FIG. 1 is a diagram showing a schematic configuration of a prediction device according to the first embodiment. As shown in FIG. 1, the prediction device 1 has a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and a communication unit 15.

In online shopping, processes completed in cyberspace, such as EC sites and online transactions, differ in character from processes carried out in physical space, such as logistics warehouses and delivery centers. In cyberspace, for example, 100 units of a product are merely numerical data, so increasing the number sold tenfold to 1000 units through aggressive sales measures is hardly subject to physical constraints. In physical space, by contrast, 100 units of a product are objects with volume and weight, so increasing the number sold to 1000 units is strongly affected by physical constraints such as warehouse capacity and shipping capacity. In addition, in BG there is a lead time on orders placed with upstream players in the supply chain, whereas in online shopping transactions the order lead time can be ignored. For these reasons, when dealing with problems in the logistics process of online shopping, the BG model described above may not yield a solution.

FIG. 2 shows the relationship between the players and the supply chain in BG (FIG. 2A) and in the Netshop Game executed in the first embodiment (FIG. 2B). In the first embodiment, the prediction device 1 executes a simulation called Netshop Game in which the environment illustrated in FIG. 2B is set. As shown in FIG. 2A, in BG the four players (retailer, wholesaler, distributor, and manufacturer) are connected in series. In Netshop Game, the only player is the retailer, and the three players positioned upstream of the retailer in the supply chain (wholesaler, distributor, and manufacturer) are grouped into a single supplier. In the first embodiment, Netshop Game is assumed to handle one type of electronic commerce product. Furthermore, in Netshop Game the retailer is divided into a cyber player, who manages processes occurring in cyberspace, and a physical player, who manages processes occurring in physical space.

The cyber player acts so that the fill rate (Fill Rate; also referred to as FR), which indicates how fully customer demand was supplied without stockouts, exceeds a preset threshold. The cyber player aims to maximize reward in cyberspace. For example, the cyber player actively runs sales promotions such as sales on the EC site. The cyber player also ignores the cost of increased inventory and acts to minimize the opportunity loss caused by stockouts.

The physical player acts so that the Bullwhip Effect Index (also referred to as BEI) falls below a preset threshold. The physical player aims to minimize logistics cost. The physical player ignores the opportunity loss caused by stockouts and, to minimize warehouse and delivery costs, keeps inventory as low as possible and acts so that BE does not grow.

In the first embodiment, the environment shown in FIG. 2B is implemented in the prediction device 1 with OpenAI Gym, and the optimal actions by which the cyber player and the physical player achieve their respective objectives are learned and evaluated with a Deep Q-Network (DQN). The first embodiment further introduces a meta player, which acts to balance the two on the basis of the rewards of the cyber player and the physical player and the goal conditions of Netshop Game. The cyber player and the physical player are examples of the first behavior model and the second behavior model, which calculate mutually different rewards using the product information and the state variables of the environment. The meta player is an example of the third behavior model, which is generated based on reinforcement learning of the first behavior model and the second behavior model.

Next, the settings for each player in Netshop Game are described in detail. First, in Netshop Game, the observable state variable o_t at time step t is defined by equation (1) in terms of six variables.

Here, IL_t is the inventory level at time step t, OO_t is the number of units that have been ordered from the supplier but not yet received, d_t is the customer demand for the product, RS_t is the number of units received from the supplier, SS_t is the number of units shipped to customers, and a_t is the action taken at time step t, that is, the number of units ordered from the supplier. The first embodiment thus uses regression analysis of these six variables.

In Netshop Game, one episode runs from the start of the game until the goal condition is achieved, and the state variables of all time steps in one episode are stored as ho_t, as shown in equation (2).
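As an illustration of the observation o_t and the per-episode history ho_t described above, the following is a minimal sketch and not the implementation in the patent. The class name `Observation` and the list `history` are hypothetical. Index positions 0, 1, and 5 follow the `self.state[...]` fragments quoted from FIG. 6; the positions of the other three fields are an assumption, as are the sample numbers.

```python
from typing import List, NamedTuple


class Observation(NamedTuple):
    """Observable state o_t at one time step (field order partly assumed)."""
    IL: int  # inventory level
    OO: int  # units ordered from the supplier but not yet received
    d: int   # customer demand
    RS: int  # units received from the supplier
    SS: int  # units shipped to customers
    a: int   # action: units ordered from the supplier


# Per-episode history of observations, appended once per time step.
history: List[Observation] = []
history.append(Observation(IL=5, OO=3, d=4, RS=2, SS=4, a=3))  # sample values
history.append(Observation(IL=3, OO=4, d=5, RS=3, SS=5, a=4))  # sample values

latest = history[-1]
print(latest.IL, latest.a)
```

A named tuple keeps positional indexing available, so `latest[0]` and `latest.IL` refer to the same value.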

Next, the action space in Netshop Game is described. An action can also be regarded as a behavior variable permitted in the environment related to the logistics of the product, and each value of the action is a value of the behavior variable. As noted above, the action in this embodiment is the number of units ordered from the supplier. If the action space, that is, the range from the lower limit to the upper limit of the order quantity permitted to the player, is too wide, the processing efficiency of the prediction device 1 may decrease. In this embodiment, therefore, Netshop Game is run with the action space set, as an example, to the discrete set of values [0, 1, 2, ..., 20].
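A minimal sketch of this 21-value action space is shown below. The names `ACTION_SPACE` and `to_order_quantity` are hypothetical. In OpenAI Gym such a space would typically be declared as `spaces.Discrete(21)`, but the source does not show how the environment declares it, so plain Python is used here.

```python
# Permitted order quantities: the discrete set 0, 1, ..., 20.
ACTION_SPACE = list(range(21))


def to_order_quantity(action_index: int) -> int:
    """Map a model output index to an order quantity, rejecting invalid indices."""
    if not 0 <= action_index < len(ACTION_SPACE):
        raise ValueError(f"action index {action_index} is outside the action space")
    return ACTION_SPACE[action_index]
```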

Next, the reward in Netshop Game is described. In BG, the reward is determined by the inventory level at time step t, as shown in equation (3).

Here, for a variable x, (x)^+ = max(0, x) and (x)^- = max(0, -x). When the inventory level is positive, it is multiplied by the inventory cost ch; when it is negative, it is multiplied by the opportunity loss cost cp due to stockouts. The index i is an integer from 1 to 4, each value corresponding to one player. Equation (3) therefore gives the total reward over all players.
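The BG cost described here can be sketched as follows. This is an illustrative reading of the text around equation (3), not a reproduction of the equation itself. The function name and the sample cost values are hypothetical, and the sign convention (a positive cost, which a reward would negate) is an assumption.

```python
from typing import Sequence


def beer_game_cost(
    inventory_levels: Sequence[int],
    holding_costs: Sequence[float],
    shortage_costs: Sequence[float],
) -> float:
    """Sum holding and shortage costs over the players at one time step."""
    total = 0.0
    for il, c_h, c_p in zip(inventory_levels, holding_costs, shortage_costs):
        total += c_h * max(0, il)    # (IL)^+ : positive inventory is held
        total += c_p * max(0, -il)   # (IL)^- : negative inventory is a shortage
    return total


# Four players; the second one is short by two units (sample values).
print(beer_game_cost([3, -2, 0, 1], [0.5] * 4, [1.0] * 4))
```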

In Netshop Game, the selling price, purchase price, sales promotion cost, and delivery cost are added to the reward calculation. Specifically, the cyber player's reward is calculated by equation (4), the physical player's reward by equation (5), and the meta player's reward by equation (6).

Here, s_p is the selling price, c_r is the purchase price, c_s is the sales promotion cost, and c_d is the delivery cost. In Netshop Game, the players cannot observe these values.

As equation (4) shows, the cyber player's reward includes the opportunity loss cost of stockouts and, when the number of units ordered from the supplier exceeds demand, a sales promotion cost for the difference. As equation (5) shows, the physical player's reward includes an inventory cost for the units in stock and a delivery cost for the units shipped to customers. The rewards are deliberately biased, so that the cyber player does not care about excess inventory and the physical player does not care about stockouts, in order to create a situation in which each seeks a locally optimal solution. By contrast, as equation (6) shows, the meta player's reward includes all of the above costs, so that it seeks a globally optimal solution that balances the two orientations.
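The two biased rewards can be sketched as below. Equations (4) and (5) are not reproduced in this text, so this is a hedged reading of the prose. The function names are hypothetical, and it is an assumption that both rewards take the form of sales minus purchase cost minus the player-specific costs.

```python
def cyber_reward(IL, d, RS, SS, action, *, sales_price, purchase_price,
                 shortage_cost, promotion_cost):
    """Cyber player: penalized for stockouts and over-ordering, not for stock."""
    cost = abs(IL) * shortage_cost if IL < 0 else 0.0
    cost += (action - d) * promotion_cost if action > d else 0.0
    return SS * sales_price - RS * purchase_price - cost


def physical_reward(IL, d, RS, SS, action, *, sales_price, purchase_price,
                    stock_cost, delivery_cost):
    """Physical player: penalized for stock and deliveries, not for stockouts."""
    cost = IL * stock_cost if IL > 0 else 0.0
    cost += SS * delivery_cost if SS > 0 else 0.0
    return SS * sales_price - RS * purchase_price - cost


# Sample step: the cyber player sees a stockout, the physical player holds stock.
print(cyber_reward(-2, 5, 6, 5, 8, sales_price=2.0, purchase_price=1.2,
                   shortage_cost=1.0, promotion_cost=0.01))
print(physical_reward(4, 5, 6, 5, 8, sales_price=2.0, purchase_price=1.2,
                      stock_cost=0.5, delivery_cost=0.01))
```

Both functions accept the same positional arguments so that they can be swapped in one environment loop; `d` and `action` are unused by the physical player under this reading.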

Next, the goal condition of each player in Netshop Game is described. In BG, the players compete on the total reward within a predetermined number of time steps, whereas in Netshop Game the following two indicators are set as goal conditions. Each player ends an episode as soon as it achieves either indicator.

One indicator for the goal conditions in Netshop Game is the BEI given by equation (7), and the other is the FR given by equation (8).

Here, demand is an array of the customer demand over the most recent p periods at time step t (p is a positive integer), and shipped is an array of the numbers of units shipped over the most recent p periods at time step t. Var(x) is the variance of variable x, and Mean(x) is its mean. In Netshop Game, whether each player has achieved its goal condition is determined by equations (9) and (10).

Equation (9) is used for the physical player's goal condition, and equation (10) is used for the cyber player's goal condition. Both equation (9) and equation (10) are used for the meta player's goal condition.
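The goal check can be sketched as follows. Equations (7) to (10) are not reproduced in this text, so the formulas below are assumptions chosen to match the stated ingredients (Var, Mean, demand, shipped): FR is taken as total shipped over total demand, and BEI as the variance-to-mean ratio of shipments divided by that of demand. The threshold directions follow the player descriptions (BEI below its threshold, FR above its threshold), and requiring both for the meta player is also an assumption. All function names are hypothetical.

```python
from statistics import mean, pvariance


def fill_rate(demand, shipped):
    """Assumed FR: share of demand over the window that was actually shipped."""
    total = sum(demand)
    return sum(shipped) / total if total > 0 else 1.0


def _dispersion(xs):
    """Variance-to-mean ratio; 0.0 when the mean is not positive."""
    m = mean(xs)
    return pvariance(xs) / m if m > 0 else 0.0


def bullwhip_effect_index(demand, shipped):
    """Assumed BEI: dispersion of shipments relative to dispersion of demand."""
    base = _dispersion(demand)
    return _dispersion(shipped) / base if base > 0 else float("inf")


def goals_met(demand, shipped, bei_threshold=0.95, fillrate_threshold=0.95):
    """Return which players' goal conditions hold for the given window."""
    bei_ok = bullwhip_effect_index(demand, shipped) < bei_threshold
    fr_ok = fill_rate(demand, shipped) > fillrate_threshold
    return {"physical": bei_ok, "cyber": fr_ok, "meta": bei_ok and fr_ok}


print(goals_met([4, 6, 5, 5], [5, 5, 5, 5]))
```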

Next, the customer demand generated in Netshop Game is described. In Netshop Game, the customer demand D_t at time step t is generated, as shown in equation (11), by adding a normally distributed random variable x to the mean demand over the most recent p periods before time step t.

The initial value is chosen at random from the natural numbers 1 to 10.
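A sketch of this demand generator is given below, using only the standard library. It follows the fragment quoted from FIG. 6 (`max(self.demand_min, int(dave + np.random.randn() * self.demand_volatility))`), with `random.gauss` standing in for `np.random.randn`. The upper clip at `demand_max` is an assumption, since the quoted fragment shows only the lower bound. The function names and sample parameter values are hypothetical.

```python
import random


def initial_demand(rng: random.Random) -> int:
    """Initial demand: a natural number drawn uniformly from 1 to 10."""
    return rng.randint(1, 10)


def next_demand(recent_demand, volatility, demand_min, demand_max, rng):
    """Moving average of recent demand plus scaled standard-normal noise."""
    dave = sum(recent_demand) / len(recent_demand)
    d = int(dave + rng.gauss(0.0, 1.0) * volatility)
    # Lower bound as in the quoted fragment; the upper bound is an assumption.
    return max(demand_min, min(demand_max, d))


rng = random.Random(0)
window = [initial_demand(rng)]
for _ in range(5):
    window.append(next_demand(window[-3:], 2.0, 1, 10, rng))
print(window)
```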

FIG. 3 shows an example of the simulation algorithm executed with OpenAI Gym in this embodiment. As shown in FIG. 3, Netshop Game uses the so-called epsilon-greedy algorithm to balance accumulating experience from the results of actions against exploiting that experience, and evaluates states using a DNN (Deep Neural Network). Because the algorithm in FIG. 3 is well known and is also described in the non-patent documents cited above, a detailed description is omitted here.
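For readers unfamiliar with it, epsilon-greedy action selection can be sketched as follows. This is the generic technique, not the algorithm of FIG. 3, which is not reproduced in this text. The function name is hypothetical, and `q_values` stands in for the per-action value estimates that a DNN would output.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest estimated value (first one on ties)
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
print(epsilon_greedy(q_values, 0.1, rng))
```

In DQN training, epsilon is commonly decayed from a high value toward a small one so that early episodes explore and later ones exploit; the schedule used in the patent is not stated here.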

Next, the parameters of each player and the definition of the reward calculation in this embodiment are described. FIG. 4 shows an example of the environment parameters set with OpenAI Gym. As shown in FIG. 4, in the simulation executed in this embodiment, the BEI threshold used in the goal condition ("bei_threshold") is 0.95, the FR threshold ("fillrate_threshold") is 0.95, the inventory cost ("stock_cost") is 0.5, the stockout cost ("shortage_cost") is 1.0, the sales promotion cost ("promotion_cost") is 0.01, and the delivery cost ("delivery_cost") is 0.01. Customer demand is generated using the values "demand_volatility" (demand fluctuation), "demand_min" (minimum value), and "demand_max" (maximum value) shown in the figure. The sales price of the product ("sales_price") is 2.0, and the purchase price ("purchase_price") is 1.2.
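The parameters listed above could be collected in a single configuration object, as in this sketch. The class name `NetshopGameConfig` is hypothetical. The three demand parameters have no defaults because their values are not stated in this text, and the values passed in the example are placeholders, not figures from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetshopGameConfig:
    # Demand generation parameters: values are not given in the text.
    demand_volatility: float
    demand_min: int
    demand_max: int
    # Values stated in the description of FIG. 4.
    bei_threshold: float = 0.95
    fillrate_threshold: float = 0.95
    stock_cost: float = 0.5
    shortage_cost: float = 1.0
    promotion_cost: float = 0.01
    delivery_cost: float = 0.01
    sales_price: float = 2.0
    purchase_price: float = 1.2


# Placeholder demand settings, chosen only so the example runs.
config = NetshopGameConfig(demand_volatility=1.0, demand_min=1, demand_max=10)
print(config.sales_price - config.purchase_price)  # gross margin per unit
```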

FIG. 5 shows an example of the Netshop Game rewards set with OpenAI Gym. As shown in FIG. 5, a reward is set for each of the cyber player, the physical player, and the meta player.

The meta player's reward is described more specifically below. If the product is out of stock, a shortage cost is added for the number of units short ("cost += abs(IL) * self.shortage_cost if IL < 0 else 0" in the figure). If the number of units ordered exceeds the number demanded by customers, a sales promotion cost is added for the excess ("cost += (action - d) * self.promotion_cost if action > d else 0"). An inventory cost is added for each unit currently in stock ("cost += IL * self.stock_cost if IL > 0 else 0"). A delivery cost is added for each unit shipped ("cost += SS * self.delivery_cost if SS > 0 else 0"). Sales are the number of units shipped multiplied by the unit sales price ("sales_price = SS * self.sales_price"), and the purchase amount is the number of units received multiplied by the unit purchase price ("purchase_price = RS * self.purchase_price"). The reward is the value obtained by subtracting the purchase amount and each of the above costs from the sales ("reward = abs(sales_price)-abs(purchase_price)-abs(cost)"). The meta player aims to achieve the goal condition described later on the basis of the reward set in this way.
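
The expressions quoted from FIG. 5 can be assembled into a single function as a rough sketch. The function name and the use of keyword arguments (defaulting to the FIG. 4 values) in place of `self.` attributes are assumptions; the arithmetic follows the quoted lines, in which the purchase amount and the costs are subtracted from the sales.

```python
def meta_player_reward(IL, SS, RS, action, d,
                       stock_cost=0.5, shortage_cost=1.0,
                       promotion_cost=0.01, delivery_cost=0.01,
                       sales_price=2.0, purchase_price=1.2):
    """Reward for one time step.

    IL: inventory level, SS: units shipped, RS: units received,
    action: units ordered this step, d: customer demand.
    """
    cost = 0.0
    cost += abs(IL) * shortage_cost if IL < 0 else 0        # shortage
    cost += (action - d) * promotion_cost if action > d else 0  # over-ordering
    cost += IL * stock_cost if IL > 0 else 0                # holding stock
    cost += SS * delivery_cost if SS > 0 else 0             # delivery
    sales = SS * sales_price
    purchase = RS * purchase_price
    return abs(sales) - abs(purchase) - abs(cost)
```

For example, holding 4 units while shipping 6, receiving 5, and ordering 8 against a demand of 6 gives sales of 12.0, a purchase amount of 6.0, and costs of 2.08.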

Next, FIG. 6 shows an example of how the above state variables are updated in OpenAI Gym. As shown in FIG. 6, each time step uses the current inventory level ("IL = self.state[0]" in the figure), the number of units on order but not yet received ("OO = self.state[1]"), and the number of units ordered in the previous time step ("a = self.state[5]"). Customer demand is generated with random numbers by adding a normally distributed fluctuation to the moving average of demand ("d = max(self.demand_min,int(dave + np.random.randn() * self.demand_volatility))"). With a shipping lead time of 1, the quantity ordered in the previous step is assumed to arrive ("RS = a # received shipment = last order ( shipping lead time = 1 )").

First, the number of units received from the supplier is added to the inventory ("IL += RS" in the figure). Then, if customer demand is less than the inventory ("if d < IL:"), the quantity demanded is shipped to the customer ("SS = d"); if customer demand is greater than or equal to the inventory ("else: # d >= IL"), the entire inventory is shipped ("SS = IL"). The inventory is then reduced by the quantity shipped ("IL -= SS"). Finally, the current order quantity minus the received quantity is added to the number of units on order but not yet received ("OO += action - RS"). The state variables are updated in this way, and the updated values are used for ordering, receiving, and shipping in the next time step.
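
The update sequence quoted from FIG. 6 can be sketched as one function. The function name, the argument list, and the use of a NumPy `Generator` (instead of the `np.random.randn()` call quoted in the figure) are assumptions made so the example is self-contained and reproducible. The sketch also assumes that "IL -= SS" applies in both branches.

```python
import numpy as np


def step_inventory(IL, OO, last_order, action, dave,
                   demand_min, demand_volatility, rng=None):
    """Advance the inventory state by one time step.

    Returns (IL, OO, SS, RS, d) after the update.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Demand: moving average plus normally distributed noise, floored at demand_min.
    d = max(demand_min, int(dave + rng.standard_normal() * demand_volatility))
    RS = last_order           # lead time of 1: the previous order arrives now
    IL += RS                  # receive goods
    SS = d if d < IL else IL  # ship demand if stock allows, else ship all stock
    IL -= SS
    OO += action - RS         # outstanding orders: add new order, remove arrivals
    return IL, OO, SS, RS, d
```

Setting `demand_volatility` to 0 makes the demand deterministic, which is convenient for checking the bookkeeping by hand.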

FIG. 7 shows an example of the goal condition for the meta player. As shown in FIG. 7, the variance of the number of units shipped ("vars4 = np.var(shipped)" in the figure) and the variance of the number of units demanded ("vars2 = np.var(demand)") are calculated over the most recent fixed period up to the current time step (for example, 100 days), and the BEI is calculated from these variances ("bei = vars4 / vars2 if vars2 != 0 else bei_threshold"). If the calculated BEI is smaller than the BEI threshold ("if bei < bei_threshold:"), the BEI condition is regarded as satisfied ("bei_flag = True"). In addition, the average ratio of units shipped to units demanded over the same most recent period (for example, 100 days) is calculated ("fillrate = np.mean(shipped / demand)"). If this average is greater than the FR threshold ("if fillrate > self.fillrate_threshold:"), the fill rate condition is regarded as satisfied ("fill_flag = True"). The goal condition is regarded as achieved when both of these conditions are satisfied ("done = fill_flag & sum_flag" in the figure).
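
A compact sketch of this goal check is given below. The function name and return tuple are assumptions. The figure is quoted as combining "fill_flag & sum_flag", whereas this sketch combines the BEI flag and the fill rate flag, following the prose statement that both of the two conditions described must hold; that reading is an assumption.

```python
import numpy as np


def goal_reached(shipped, demand, bei_threshold=0.95, fillrate_threshold=0.95):
    """Return (done, bei, fillrate) for a recent window of shipments and demand.

    Assumes demand is positive at every step in the window; a zero demand
    would make the per-step ratio shipped / demand undefined.
    """
    shipped = np.asarray(shipped, dtype=float)
    demand = np.asarray(demand, dtype=float)
    vars4 = np.var(shipped)
    vars2 = np.var(demand)
    # BEI: variance of shipments relative to variance of demand.
    bei = vars4 / vars2 if vars2 != 0 else bei_threshold
    bei_flag = bool(bei < bei_threshold)
    fillrate = np.mean(shipped / demand)
    fill_flag = bool(fillrate > fillrate_threshold)
    return bei_flag and fill_flag, float(bei), float(fillrate)
```

Note that shipping exactly the demanded quantity in every step gives a BEI of 1.0, which does not fall below the 0.95 threshold, so the BEI condition requires shipments to vary less than demand.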

Each player repeats the state-variable update shown in FIG. 6 at every time step, aiming to achieve the above goal condition on the basis of the reward set as shown in FIG. 5. FIG. 8 is a graph showing an example of how the meta player's reward changes over the time steps when reinforcement learning is performed in Netshop Game with an upper limit of 50,000 time steps. The horizontal axis represents the time step, and the vertical axis represents the reward obtained by the meta player as described above. As the graph in FIG. 8 shows, the meta player's reward rises toward a constant value as the time steps progress. It can therefore be said that the meta player's learning accumulates as the time steps advance.

FIGS. 9 to 11 are graphs showing an example of changes in demand, order quantity, and inventory in a test in which Netshop Game was run again for 100 episodes using the behavior model of each player trained in the Netshop Game described above. The horizontal axis represents the time step ("step" in the figures), and the vertical axis represents the number of items ("items" in the figures). FIG. 9 shows the transition of the reward obtained by the cyber player, FIG. 10 that of the physical player, and FIG. 11 that of the meta player. Here, a player is regarded as trained when, as shown in FIG. 9, learning has accumulated to the point where the reward obtained by the player remains almost constant. The upper limit on the number of time steps per episode is 1000, and an episode ends when the player completes the action of the 1000th time step without achieving the goal condition. The graph in each figure shows the last 100 steps of each episode, that is, the transition of the obtained reward over the 100 steps preceding the end of the episode. In each figure, "stock" denotes inventory, "demand" denotes demand, and "shipped" denotes shipments.

As the inventory transition in FIG. 9 shows, the cyber player's behavior in Netshop Game tends to cause excess inventory to grow and remain high. In FIG. 10, the inventory stays at 0 over multiple steps, which suggests that the physical player's behavior tends to cause frequent shortages. In FIG. 11, by contrast, the inventory graph is roughly sawtooth-shaped. This can be interpreted as follows: even when the inventory falls to 0, it increases again in the next step or within a few steps, and even when the inventory grows too large, demand and shipments act to push it back down. The meta player's behavior may therefore achieve a balanced inventory level.

FIG. 12 shows, for each player in the above test, the 100-episode averages of the number of steps required to achieve the goal condition, the BEI value, the FR value, and the obtained reward. For the tests graphed in FIGS. 9 to 11, the smaller the number of steps required to achieve the goal condition ("steps" in the figure), the better. A BEI value below 1.0 indicates that the influence of the BE is being suppressed. The FR value indicates the proportion of customer demand that can be shipped, and the closer it is to 1.0, the better. As FIG. 12 shows, the cyber player is optimal if the only objective is to maximize the reward. However, as the inventory transition in FIG. 9 shows, with the cyber player the inventory remains high throughout the episode. One reason for this is that the reward defined for the cyber player contains no element that discourages excess inventory.

As shown in FIG. 12, with the physical player the inventory is kept at a small value, so the BEI is also lower than with the cyber player. However, as the inventory transition in FIG. 10 shows, the inventory can remain at 0 over multiple steps, that is, shortages can persist, and as a result the FR value is also low.

As also shown in FIG. 12, the meta player obtains a lower reward than the cyber player and the physical player, but its BEI value is smaller than theirs, and its FR value lies between that of the cyber player, which produces more excess inventory, and that of the physical player, which produces more shortages. In the inventory graph of FIG. 11, there are stretches spanning multiple steps in which the period and amplitude of the inventory fluctuation are almost constant. This suggests that inventory, receipts, and shipments are in balance, that is, that the meta player has learned appropriate inventory management for stabilizing inventory.

In the first embodiment, the control unit 11 of the prediction device 1 controls the communication unit 15, which serves as an acquisition unit, to communicate with the product information DB (database) 2. The communication unit 15 acquires the information stored in the product information DB 2 on the numbers of units ordered, shipped, and received over a certain period. The information acquired from the product information DB 2 is an example of product information indicating changes in the numbers of units ordered, shipped, and received over a certain period. Based on the information acquired by the communication unit 15, the control unit 11 generates the above behavior model by reinforcement learning in which the reward determined from the state variables is calculated while the order quantity is varied. The control unit 11 then uses this behavior model to simulate the behavior of the meta player in Netshop Game.

Specifically, the control unit 11 repeats the simulation while varying the meta player's order quantity across its range, for example increasing it from 0 to 20 in steps of 1 in the above example. As an example, the control unit 11 sets the upper limit on the number of time steps per episode to 1000, as described above, and ends an episode when the meta player completes the action of the 1000th time step without achieving the goal condition. Based on the results of running 100 episodes for each order quantity, it calculates the 100-episode averages of the number of steps needed to achieve the goal condition, the BEI value, the FR value, and the obtained reward. The control unit 11 outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device (not shown). The calculation results of the prediction device 1 are an example of the optimal order quantity predicted by the prediction device.
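
The sweep described above (order quantities 0 to 20, 100 episodes each, averaged per quantity) can be outlined as follows. `run_episode` is a hypothetical callable, not something defined in this description: it is assumed to run one episode of at most 1000 steps for a given order quantity and return a dict with the keys "steps", "bei", "fillrate", and "reward".

```python
from statistics import mean

METRICS = ("steps", "bei", "fillrate", "reward")


def sweep_order_quantities(run_episode, quantities=range(0, 21), episodes=100):
    """Average each metric over `episodes` runs for every order quantity.

    run_episode(q) is a hypothetical function supplied by the caller.
    """
    results = {}
    for q in quantities:
        runs = [run_episode(q) for _ in range(episodes)]
        results[q] = {k: mean(r[k] for r in runs) for k in METRICS}
    return results
```

The returned mapping corresponds to the per-quantity table that would be stored, displayed, or transmitted; choosing among the quantities is left to the user, as in the description.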

ユーザは、予測装置１による算出結果を確認して、シミュレーションの対象となった商品の実際の発注数を決定することができる。したがって、第１実施形態によれば、販売業者による商品の発注数が、予測装置１によるシミュレーションによる算出結果を基に調整される。これにより、ＢＥＩが閾値を超えないように制御され、かつ需要数をできるだけ満たすように出荷が行われることで、在庫コストの最小化と売上の最大化という、相反する目標のバランスを取り、当該商品の在庫の最適化を図ることができる。 The user can confirm the calculation result by the prediction device 1 and determine the actual number of orders for the simulation target product. Therefore, according to the first embodiment, the number of products ordered by the seller is adjusted based on the calculation result of the simulation performed by the prediction device 1 . This controls the BEI so that it does not exceed the threshold, and the shipments are made to meet demand as much as possible, balancing the conflicting goals of minimizing inventory costs and maximizing sales. Product inventory can be optimized.

This concludes the description of the present embodiment. The configuration and processing of the prediction device 1 are not limited to what is described above, and various changes are possible within a range that remains consistent with the technical idea of the present invention. Modifications of the above embodiment are described below. In the following description, configurations and processes that are the same as those described above are given the same reference numerals, and their detailed description is omitted.

(Modification 1)
In Modification 1, the man-hours needed per day for shipping are determined for the workers of a warehouse that stores the inventory of one product sold on an EC site, in line with the sales promotion strategy for that product. In the simulation executed in this modification, one time step is one day, each day is assigned a day of the week, and the days on which the sales promotion is carried out are assumed to be set in advance. Examples of sales promotion days include days of the month containing a 5 (the 5th, 15th, and 25th) and gotobi (五十日, the days of each month that are multiples of five). The information indicating the sales promotion days may, for example, be stored in the product information DB 2 and acquired from it by the prediction device 1, or may be entered by the user through the operation unit 13. FIG. 13 shows an example of the reward set via OpenAI Gym for this simulation in the prediction device 1 (FIG. 13A) and an example of the state-variable update executed in OpenAI Gym (FIG. 13B). The action (action variable) in this simulation is the man-hours needed to ship the product in one day, selected from 0 up to a predetermined upper limit.

As shown in FIG. 13A, in the simulation executed in this modification, the absolute value of the cumulative difference between demand and the quantity that can be shipped is added to the meta player's cost ("cost += abs(b)" in the figure). It is then determined whether the day of the week of the current step is Saturday or Sunday ("if self.sat_sun_day_flag(y):"). If it is, the difference between the required man-hours and the man-hours available for shipping is multiplied by the weekend coefficient and added to the cost ("cost += abs(f) * self.manpower_cost_holiday"). If it is a weekday, the difference is multiplied by the weekday coefficient and added to the cost ("cost += abs(f) * self.manpower_cost_weekday"). These costs are an example of a sales promotion cost determined according to the day of the week on which the sales promotion is carried out. The negative of the calculated cost is used as the reward. The meta player aims to achieve the goal condition described later on the basis of the reward set in this way.
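
The reward of FIG. 13A, as quoted above, amounts to the following sketch. The function name and the boolean `is_weekend` argument (standing in for `self.sat_sun_day_flag(y)`) are assumptions. The two coefficients are required arguments because their values are not given in the description.

```python
def manpower_reward(b, f, is_weekend,
                    manpower_cost_holiday, manpower_cost_weekday):
    """Negative cost for one day.

    b: cumulative difference between demand and shippable quantity.
    f: required man-hours minus man-hours chosen by the action.
    """
    cost = abs(b)  # penalize accumulated mismatch in either direction
    rate = manpower_cost_holiday if is_weekend else manpower_cost_weekday
    cost += abs(f) * rate  # penalize both under- and over-staffing
    return -cost
```

Because both terms use absolute values, assigning too many man-hours is penalized in the same way as assigning too few.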

As shown in FIG. 13B, in each time step the date is first advanced by one day ("self.step_date += datetime.timedelta(days=1)" in the figure). Next, it is determined whether the current date is a sales promotion day ("t = self.five_six_day_flag(self.step_date.day)"), and the day of the week of the current date is identified ("y = self.step_date.weekday()"). The demand for the product is then determined from the sales promotion flag (t), the day of the week (y), and the moving average (dave) ("d = self.demand_by_date(t,y,dave)"). Next, the number of units shipped per hour is determined ("r = self.avg_productivity"), and the man-hours needed to ship the demanded quantity are calculated ("n = int( d / r )"). The man-hours selected by the action are identified ("m = action"), and the quantity that can be shipped with those man-hours is calculated ("p = m * r"). The difference (f) between the man-hours needed to ship the demanded quantity (n) and the man-hours available for shipping (m) is calculated ("f = n - m"). As can be seen from FIGS. 13A and 13B, this difference (f) is used to calculate the reward. Finally, the cumulative difference between demand and the quantity that can be shipped is calculated ("b = self.state[B] + (d - p)"). The state variables are updated in this way, and in the next time step the product is shipped on the basis of newly generated demand using the updated values.
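
The arithmetic of the FIG. 13B update can be sketched as below. The function name and signature are assumptions. Demand is passed in directly because the internals of `demand_by_date` and of the promotion-day flag are not described, so they are not reproduced here.

```python
import datetime


def step_manpower(step_date, backlog, action, demand, avg_productivity):
    """Advance one day and return (date, weekday, n, p, f, b).

    backlog is the previous cumulative difference (self.state[B] in the figure).
    """
    step_date = step_date + datetime.timedelta(days=1)
    y = step_date.weekday()   # Monday is 0, Sunday is 6
    r = avg_productivity      # units shipped per hour
    n = int(demand / r)       # man-hours needed for the demanded quantity
    m = action                # man-hours chosen by the agent
    p = m * r                 # quantity shippable with m man-hours
    f = n - m                 # staffing gap, used in the reward
    b = backlog + (demand - p)  # cumulative demand/shipment mismatch
    return step_date, y, n, p, f, b
```

For instance, with a demand of 100 units, a productivity of 25 units per hour, and an action of 3 man-hours, 4 man-hours are needed, 75 units can be shipped, and the backlog grows by 25.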

FIG. 13C shows an example of the goal condition for the simulation in this modification. As shown in FIG. 13C, the average ratio of units shipped to units demanded over the most recent fixed period up to the current time step (for example, 100 days) is calculated ("fillrate = np.mean(shipped / demand)" in the figure). If this average is greater than the FR threshold ("if fillrate > self.fillrate_threshold:"), the fill rate condition is regarded as satisfied ("fill_flag = True"). The totals of the daily man-hours n and m over the same period are also calculated ("ssum = np.sum(history,axis=0)"). From these totals, the monthly total of the man-hours n needed to ship the demanded quantity ("nsum = int(ssum[N]/self.months)") and the monthly total of the man-hours m selected by the action ("msum = int(ssum[M]/self.months)") are calculated. If the monthly total of m is less than or equal to the monthly total of n ("if nsum >= msum:"), cost minimization is regarded as achieved and the man-hour condition as satisfied ("sum_flag = True"). The goal condition is regarded as achieved when both of these conditions are satisfied ("done = fill_flag & sum_flag").
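
The goal check of FIG. 13C can be sketched as follows. The function name, the layout of `history` (one row per day with the needed man-hours n and chosen man-hours m as columns), and the column indices `N` and `M` are assumptions. The fill rate threshold defaults to the 0.95 value given for FIG. 4.

```python
import numpy as np


def manpower_goal(shipped, demand, history, months,
                  fillrate_threshold=0.95, N=0, M=1):
    """Return True when both the fill rate and man-hour conditions hold.

    Assumes demand is positive on every day in the window.
    """
    fillrate = np.mean(np.asarray(shipped, float) / np.asarray(demand, float))
    fill_flag = bool(fillrate > fillrate_threshold)
    ssum = np.sum(np.asarray(history), axis=0)  # column totals of n and m
    nsum = int(ssum[N] / months)  # needed man-hours per month
    msum = int(ssum[M] / months)  # chosen man-hours per month
    sum_flag = nsum >= msum       # chosen man-hours do not exceed the need
    return fill_flag and sum_flag
```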

In this modification, the control unit 11 generates a behavior model by reinforcement learning in which the above reward is calculated for each of a plurality of man-hour values, and performs the simulation using the generated behavior model. Specifically, the control unit 11 repeats the simulation while varying the man-hours needed for shipping in one day (one step) from the lower limit to the upper limit, for example increasing them by a predetermined increment from 0 up to the upper limit selectable by the action. As an example, the control unit 11 sets the upper limit on the number of time steps per episode to 1000, as described above, and ends an episode when the meta player completes the action of the 1000th time step without achieving the goal condition. Based on the results of running 100 episodes for each man-hour value, it calculates the 100-episode averages of the number of time steps needed to achieve the goal condition and of the FR value. The control unit 11 then outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device.

ユーザは、予測装置１による算出結果を確認して、シミュレーションの対象となった商品について、販売促進の実施日における出荷に対する人時を決定することができる。したがって、第１実施形態によれば、商品の出荷に割り当てられる人時が、予測装置１によるシミュレーションによる算出結果を基に調整される。これにより、販売促進によって変動する需要数を満たすように出荷を行い、かつ商品が保管される倉庫における人時を低く維持することで、売上最大化と出荷コスト最小化という、相反する目標のバランスを取り、本シミュレーションで設定される販売促進の実施日における当該商品の出荷に割り当てられる人時の最適化を図ることができる。 The user can confirm the calculation result by the prediction device 1 and determine the man hours for shipment on the day of the sales promotion for the simulated product. Therefore, according to the first embodiment, the man-hours assigned to shipping products are adjusted based on the calculation result of the simulation performed by the prediction device 1 . This balances the conflicting goals of maximizing sales and minimizing shipping costs by shipping to meet demand that fluctuates due to sales promotions and keeping man hours low in warehouses where products are stored. , it is possible to optimize the man-hours allocated to shipping the product on the day of the sales promotion set in this simulation.

(Modification 2)
In Modification 2, for products sold on an EC site, dead stock whose inventory level does not change is identified, and it is determined whether the product should be recommended. In the simulation executed in this modification, one time step is one day, and it is assumed that a user uses an information processing terminal to search the Internet for information indicating a product page of the EC site and moves from the search results page to that product page. The number of times information on the product page appears in the search results is the product's view count, and the number of times users move to the product page is its click count. It is also assumed that a size serving as an index of the space one unit of the product occupies in the warehouse is determined in advance.

FIG. 14A shows an example of the reward set via OpenAI Gym for the simulation in the prediction device 1, and FIG. 14B shows an example of the state-variable update executed in OpenAI Gym. The action (action variable) in this simulation is a threshold for deciding whether the product should be recommended; for example, one value is selected from 0.0 to 1.0 in steps of 0.1. This threshold is an example of a value indicating how readily a product is recommended on the EC site.

As shown in FIG. 14A, in the simulation executed in this modification, the click-through rate of the product page over a fixed period, that is, over a predetermined number of consecutive steps, is calculated for the meta player from the number of clicks on the product page ("click_num" in the figure) and the number of times the product page was displayed ("view_num") ("click_through_rate = click_num / view_num"). The reward is the click-through rate multiplied by the profit and the size ("reward = click_through_rate * profit * size"). In other words, the larger the profit or the size, the larger the reward. The meta player aims to achieve the goal condition described later on the basis of the reward set in this way.
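
The reward of FIG. 14A reduces to the short function below. The function name is an assumption, and so is the guard that returns a click-through rate of 0 when the page was never displayed; the figure as quoted divides unconditionally.

```python
def recommend_reward(click_num, view_num, profit, size):
    """Click-through rate weighted by per-unit profit and warehouse size."""
    # Guard against a zero view count (an assumption, not in the figure).
    click_through_rate = click_num / view_num if view_num else 0.0
    return click_through_rate * profit * size
```

Weighting by size means that, at equal click-through rate and profit, a product occupying more warehouse space yields a larger reward.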

As shown in FIG. 14B, the state variables in each time step are the profit per unit of the product (for example, the sales price minus the purchase price and the inventory cost) ("profit" in the figure), the number of clicks on the product page in one day ("click_num_one_day"), the number of times the product page is displayed in one day ("view_num_one_day"), the stock days, that is, the number of days the product has been stored in the warehouse ("stock_days"), the inventory level ("stock_num"), and the threshold selected by the above action ("boost_value"). In this modification, information indicating every value other than "boost_value" is stored in the product information DB 2, recorded day by day over a past period such as the past 30 days. The prediction device 1 acquires this information from the product information DB 2 and, each time the time step advances by one, identifies the next day's values from the acquired information and updates the state variables with them. The state variables are updated in this way, and the updated values are used to decide whether to recommend the product in the next time step.

FIG. 14C shows an example of the goal conditions in the simulation of this modified example. As shown in FIG. 14C, if the click-through rate over the above period is greater than a preset threshold ("if click_through_rate >= click_through_rate_threshold:" in the figure), the click-through rate condition is regarded as satisfied ("click_through_rate_flag = True" in the figure). Likewise, if the average stock days of the product ("stock_days") over the most recent fixed period (for example, 100 days) up to the current time step is greater than a preset threshold ("if stock_days >= stock_days_threshold:" in the figure), the average stock days condition is regarded as satisfied ("stock_days_flag = True" in the figure). The goal condition is regarded as achieved when both conditions are satisfied ("done = click_through_rate_flag & stock_days_flag" in the figure).
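The goal check of FIG. 14C can be sketched as below. The comparisons use `>=`, as in the expressions quoted from the figure, although the prose says "greater than". The function name, the argument layout, and the way the window average is taken are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def goal_reached(
    click_through_rate: float,
    stock_days_history: Sequence[float],
    click_through_rate_threshold: float,
    stock_days_threshold: float,
    window: int = 100,
) -> bool:
    """Return True when both goal conditions of FIG. 14C hold."""
    # Average stock days over the most recent `window` time steps (100 days in the text).
    stock_days = mean(stock_days_history[-window:])
    click_through_rate_flag = click_through_rate >= click_through_rate_threshold
    stock_days_flag = stock_days >= stock_days_threshold
    # done = click_through_rate_flag & stock_days_flag
    return click_through_rate_flag and stock_days_flag
```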

In this modified example, the control unit 11 generates a behavior model by reinforcement learning in which a reward, determined based on the acquired product information, the size of the product, the sales of the product, the purchase price of the product, and the inventory cost determined according to the size of the product, is calculated for each of a plurality of values indicating how readily the product is recommended, and performs a simulation using the generated behavior model. Specifically, the control unit 11 repeats the simulation while changing the threshold for deciding whether to recommend the product from its lower limit to its upper limit, for example from 0.0 to 1.0 in increments of 0.1. Here, the control unit 11 sets the number of time steps in one episode to the number of days in the period covered by the values acquired from the product information DB 2, executes 100 episodes for each threshold, and calculates, for each threshold, the 100-episode averages of the number of time steps until the goal condition is achieved, the click-through rate, and the average stock days. For example, when the values for the past 30 days are acquired from the product information DB 2, one episode has 30 time steps. The control unit 11 outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device.
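The sweep described above, with 11 threshold values and 100 episodes each, can be outlined as follows. The episode itself is passed in as `run_episode`, a hypothetical callable standing in for one simulated episode with the learned behavior model. It is assumed to return the number of time steps to the goal, the click-through rate, and the average stock days.

```python
from statistics import mean
from typing import Callable, Dict, Tuple

# (time steps to goal, click-through rate, average stock days) for one episode
EpisodeResult = Tuple[float, float, float]


def sweep_thresholds(
    run_episode: Callable[[float], EpisodeResult],
    episodes: int = 100,
) -> Dict[float, EpisodeResult]:
    """Average the episode results for each threshold from 0.0 to 1.0 in steps of 0.1."""
    results: Dict[float, EpisodeResult] = {}
    for i in range(11):
        boost_value = round(i * 0.1, 1)  # 0.0, 0.1, ..., 1.0
        runs = [run_episode(boost_value) for _ in range(episodes)]
        steps, rates, stock_days = zip(*runs)
        results[boost_value] = (mean(steps), mean(rates), mean(stock_days))
    return results
```

The returned dictionary holds one row of averages per threshold, which is the kind of result the control unit 11 would store, display, or transmit.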

The user can check the calculation results of the prediction device 1 and decide whether the product targeted by the simulation should be recommended. Thus, according to this modified example, whether a product needs to be recommended on the EC site is determined based on the results calculated by the simulation of the prediction device 1. As a result, products whose stock days exceed the threshold are recommended on the EC site, that is, displayed on the user's information processing terminal. If a situation continues in which the user performs no operation on the recommended product, such as not clicking the displayed product, the product is no longer recommended. By distinguishing products that do not sell because they are not recommended from products that do not sell even when recommended, and recommending the products that are more likely to sell, product recommendation on the EC site can be optimized.

(Modification 3)
In Modification 3, for products sold on the EC site, the incentive given to a user for changing the designated delivery date when ordering a product is determined. Here, an incentive is, for example, points usable on the EC site that are returned to the user for a product order. Incentives are well known, so a detailed description is omitted. The action (behavior variable) in this modified example is the incentive, and one value is selected from, for example, 0.0 to 1.0 in increments of 0.1. The simulation executed in this modified example assumes that one time step is one day and that the user can change the designated delivery date when ordering a product on the EC site.

FIG. 15A shows an example of the simulation reward set by OpenAI Gym in the prediction device 1, and FIG. 15B shows an example of the update of each state variable executed in OpenAI Gym.

As shown in FIG. 15A, in the simulation executed in this modified example, the reward is the value of the man-hours required to deliver the product targeted by the simulation ("reward = click_through_rate * profit * size" in the figure). That is, the greater the profit or the size, the greater the reward. The meta player aims to achieve the goal conditions based on the reward set in this way.

As shown in FIG. 15B, the state variables in one time step are: the order information of the product (order ID, latitude and longitude of the delivery destination, the designated delivery date and delivery time slot currently specified, and the number of parcels) ("order_info : (order_id, latitude, longitude, date, time, parcel_num)" in the figure); the spatio-temporal cluster information for product delivery (the current designated delivery date and delivery time slot, the latitude and longitude of the center of the cluster that delivers the product, the radius of the cluster, the number of delivery destinations the cluster can serve, and the number of delivery destinations currently scheduled within the cluster) ("time_space_cluster_info : [ (date,time,latitude,longitude,radius,max,num) , ….]" in the figure); the total man-hours required for delivery ("man_hour" in the figure); the total number of delivery staff required for delivery ("deivers_num" in the figure); the total number of clusters ("cluster_num" in the figure); the total waiting time, which is time required in the course of delivery ("idle_time" in the figure); the total delivery distance ("distance" in the figure); the difference between the designated delivery date and time and the optimized date and time ("difference" in the figure); and the incentive given to the user when the designated delivery date is changed (for example, a number of points; "incentive : basic_point * action (0.0～1.0)" in the figure). In this modified example, information indicating the values other than "incentive : basic_point * action (0.0～1.0)" is stored in the product information DB 2. A cluster is an example of a delivery area of a delivery company. The prediction device 1 acquires the information on each value over a certain period from the product information DB 2 and, each time the time step advances by one, identifies each value for the next day based on the acquired information and updates each state variable with the identified value. The state variables are changed in this way, and the changed state variables are used to calculate the total man-hours required for delivery in the next time step.
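The tuple-shaped state variables of FIG. 15B map naturally onto small record types. The sketch below is one possible layout, not the layout used in the figure. The field names follow the figure, except `max_num`, which is renamed here from the figure's `max` to avoid shadowing the Python built-in. The string types for `date` and `time`, the `has_capacity` helper, and the range check on `action` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderInfo:
    """order_info : (order_id, latitude, longitude, date, time, parcel_num)"""
    order_id: str
    latitude: float
    longitude: float
    date: str  # designated delivery date (string format assumed)
    time: str  # designated delivery time slot (string format assumed)
    parcel_num: int


@dataclass(frozen=True)
class TimeSpaceCluster:
    """One entry of time_space_cluster_info."""
    date: str
    time: str
    latitude: float   # cluster center
    longitude: float  # cluster center
    radius: float
    max_num: int      # "max" in the figure: destinations the cluster can serve
    num: int          # destinations currently scheduled in the cluster

    def has_capacity(self) -> bool:
        """Hypothetical helper: can the cluster take one more destination?"""
        return self.num < self.max_num


def incentive(basic_point: float, action: float) -> float:
    """incentive = basic_point * action, with action in 0.0 to 1.0."""
    if not 0.0 <= action <= 1.0:
        raise ValueError("action must be between 0.0 and 1.0")
    return basic_point * action
```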

FIG. 15C shows an example of the goal conditions in the simulation of this modified example. As shown in FIG. 15C, the total man-hours required for delivery ("man_hour_sum" in the figure) and the maximum man-hours that can be allocated to that delivery ("max_man_hour_sum" in the figure) are calculated over the most recent fixed period (for example, 100 days) up to the current time step. Based on these man-hours, the utilization rate of the delivery staff in that period is calculated ("man_hour_rate = man_hour / max_man_hour_sum" in the figure). If the calculated utilization rate is at or below a preset threshold ("if man_hour_rate <= man_hour_rate_threshold:" in the figure), the utilization rate condition is regarded as satisfied ("man_hour_rate_flag = True" in the figure). Likewise, if the value of the incentive ("incentive") for that period is at or below a preset threshold ("if incentive <= incentive_threshold:" in the figure), the incentive condition is regarded as satisfied ("incentive_flag = True" in the figure). The goal condition is regarded as achieved when both conditions are satisfied ("done = man_hour_rate_flag & incentive_flag" in the figure).

In this modified example, the control unit 11 generates a behavior model by reinforcement learning in which a reward, determined based on the total man-hours required for delivery calculated from the acquired product information, is calculated for each of a plurality of incentives, and performs a simulation using the generated behavior model. Specifically, the control unit 11 repeats the simulation while changing the value indicating the incentive ratio from its lower limit to its upper limit, for example from 0.0 to 1.0 in increments of 0.1. Here, the control unit 11 sets the number of time steps in one episode to the number of days in the period covered by the values acquired from the product information DB 2, executes 100 episodes for each incentive ratio, and calculates, for each incentive ratio, the 100-episode averages of the number of time steps until the goal condition is achieved and the utilization rate of the delivery staff. For example, when the values for the past 30 days are acquired from the product information DB 2, one episode has 30 time steps. The control unit 11 outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device.
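The "number of time steps until the goal condition is achieved" for one episode can be illustrated with the goal check of FIG. 15C. In this sketch the utilization rate is taken as the ratio of the two window sums. The function name, the argument layout, and the treatment of a window with no allocatable man-hours are assumptions.

```python
from typing import Optional, Sequence


def steps_to_goal(
    man_hours: Sequence[float],
    max_man_hours: Sequence[float],
    incentive: float,
    man_hour_rate_threshold: float,
    incentive_threshold: float,
    window: int = 100,
) -> Optional[int]:
    """Return the first time step (1-based) at which both goal conditions hold, else None."""
    for step in range(1, len(man_hours) + 1):
        start = max(0, step - window)
        man_hour_sum = sum(man_hours[start:step])
        max_man_hour_sum = sum(max_man_hours[start:step])
        if max_man_hour_sum == 0:
            continue  # Assumption: skip steps with no allocatable man-hours.
        man_hour_rate = man_hour_sum / max_man_hour_sum
        man_hour_rate_flag = man_hour_rate <= man_hour_rate_threshold
        incentive_flag = incentive <= incentive_threshold
        if man_hour_rate_flag and incentive_flag:
            return step
    return None
```

Averaging this value over 100 episodes for each incentive ratio gives the per-ratio figure described above.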

The user can check the calculation results of the prediction device 1 and decide the size of the incentive given to a user who changes the designated delivery date when ordering the simulated product on the EC site. Thus, according to this modified example, the size of the incentive on the EC site is adjusted based on the results calculated by the simulation of the prediction device 1. This makes it possible to optimize the size of the incentive given when prompting users on the EC site to change the delivery date of the product, while balancing the two goals of minimizing delivery cost and minimizing incentive cost.

1 Prediction device
11 Control unit
2 Product information DB

Claims

A prediction device comprising:
an acquisition unit that acquires product information on a product for electronic commerce; and
a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and state variables of an environment related to distribution of the product for a plurality of values of a behavior variable allowed in the environment, and that uses the third behavior model to predict an optimal value of the behavior variable in the environment,
wherein the behavior variable is the quantity of the product ordered by a seller of the product,
the product information acquired by the acquisition unit is information indicating changes in the order quantity, the shipment quantity, and the arrival quantity of the product over a certain period,
the state variables of the environment include sales of the product, a purchase price of the product, a stockout cost associated with a stockout of the product, a sales promotion cost associated with sales promotion of the product, an inventory cost associated with excess inventory of the product, and a delivery cost associated with delivery of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variables of the environment for each order quantity corresponding to the plurality of values of the behavior variable, and predicts an optimal order quantity using the third behavior model.

A prediction device comprising:
an acquisition unit that acquires product information on a product for electronic commerce; and
a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and state variables of an environment related to distribution of the product for a plurality of values of a behavior variable allowed in the environment, and that uses the third behavior model to predict an optimal value of the behavior variable in the environment,
wherein the behavior variable is the man-hours associated with storing the product in a warehouse,
the product information acquired by the acquisition unit is information indicating the date on which sales promotion of the product is implemented,
the state variable of the environment is a sales promotion cost determined according to the day of the week on which the sales promotion of the product is implemented, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variable of the environment for each man-hour value corresponding to the plurality of values of the behavior variable, and predicts optimal man-hours using the third behavior model.

A prediction device comprising:
an acquisition unit that acquires product information on a product for electronic commerce; and
a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and state variables of an environment related to distribution of the product for a plurality of values of a behavior variable allowed in the environment, and that uses the third behavior model to predict an optimal value of the behavior variable in the environment,
wherein the behavior variable is a value indicating how readily the product is recommended on an EC site,
the product information acquired by the acquisition unit is information indicating the number of clicks and the number of displays of the product on the EC site over a certain period,
the state variables of the environment are the size of the product, the sales of the product, the purchase price of the product, and an inventory cost determined according to the size of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the state variables of the environment for each value indicating how readily the product is recommended, corresponding to the plurality of values of the behavior variable, and predicts an optimal value indicating how readily the product is recommended using the third behavior model.

A prediction device comprising:
an acquisition unit that acquires product information on a product for electronic commerce; and
a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and state variables of an environment related to distribution of the product for a plurality of values of a behavior variable allowed in the environment, and that uses the third behavior model to predict an optimal value of the behavior variable in the environment,
wherein the behavior variable is an incentive given to a user for changing the designated delivery date of the product on an EC site,
the product information acquired by the acquisition unit is information indicating the delivery destination and the delivery date and time of the product, the delivery area of a delivery company for the product, the number of deliverable delivery destinations, the time required for delivery of the product, and the delivery distance in the delivery of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, and predicts an optimal incentive using the third behavior model.