JP7770261B2

JP7770261B2 - Model learning system and method

Info

Publication number: JP7770261B2
Application number: JP2022098271A
Authority: JP
Inventors: 晶田中; 貴志爲重; 博泰西山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2025-11-14
Anticipated expiration: 2042-06-17
Also published as: JP2023184233A

Description

本発明は、概して、機械学習モデルの学習に関する。 The present invention generally relates to training machine learning models.

In computer systems, there is a service called IaaS (Infrastructure as a Service), which uses virtualization technology to provide digital infrastructure, such as networks and virtual servers with hardware resources (CPU (Central Processing Unit), memory, storage, and the like), on demand via the Internet. Users are commonly charged according to conditions such as the amount of resources used and the duration of use.

また、コンピュータシステムにおいては、リソース量の増加（典型的には、スケールアウト及び／又はスケールアップ）、及び／又は、リソース量の減少（典型的には、スケールイン及び／又はスケールダウン）が行われ得る。 In addition, in a computer system, the amount of resources can be increased (typically, scaled out and/or scaled up) and/or the amount of resources can be decreased (typically, scaled in and/or scaled down).

また、特許文献１には、リソースの使用量を抑えて、仮想マシンの迅速な切り替えを可能にする情報処理装置が記載されている。 Patent Document 1 also describes an information processing device that reduces resource usage and enables rapid switching between virtual machines.

Patent Document 1: Japanese Patent Application Laid-Open No. 2020-64567

昨今、各種データを分析し予測する方法として、機械学習、特にニューラルネットワークを用いたディープラーニングが広く利用されている。機械学習により得られるモデルを用いてデータの予測を実行する場合、予測精度を保つためには最新データに基づくモデルの再学習が必要となる。 Nowadays, machine learning, particularly deep learning using neural networks, is widely used as a method for analyzing and predicting various types of data. When using a model obtained through machine learning to predict data, it is necessary to retrain the model using the latest data to maintain prediction accuracy.

ここで、一例として、ＩａａＳにおける機械学習のモデル再学習に時間がかかる場合を取り上げる。例えば、１週間ごとに新たなデータを入力して再学習を必要とする機械学習のモデルにおいて、モデルの再学習が遅延して数日かかってしまうと、その間に当該モデルの活用機会を失ってしまうことになる。あらかじめモデル再学習にリソース量を大きく割り当てることで遅延を発生させないことは期待できるかもしれないが、過剰なリソース量の割り当てという別の課題が生じ得る。 As an example, let's consider a case where it takes a long time to retrain a machine learning model in IaaS. For example, in a machine learning model that requires retraining by inputting new data every week, if the retraining of the model is delayed and takes several days, opportunities to utilize the model will be lost during that time. While it may be possible to avoid delays by allocating a large amount of resources to model retraining in advance, another issue may arise: the allocation of excessive resources.

特許文献１には、このような課題もその課題の解決手段も開示も示唆もされていない。 Patent Document 1 does not disclose or suggest these issues or the means to solve them.

本発明はこうした背景に鑑みてなされたものであり、モデル学習の遅延回復と過剰なリソース量の割当回避の両方を実現することを目的とする。 The present invention was made in light of this background, and aims to achieve both recovery from delays in model learning and avoidance of excessive resource allocation.

モデル学習システムが、機械学習モデルの学習の都度に、複数種類の計算リソースの各々について、当該機械学習モデルの学習に割り当てられている計算リソース量であるリソース割当量を取得し、且つ、複数種類の計算リソースの各々について、当該機械学習モデルの学習に使用される計算リソース量であるリソース使用量を監視する。モデル学習システムが、機械学習モデルの今回の学習に要する時間を推定し、当該推定された時間が、当該機械学習モデルの過去の学習に要した時間よりも長いか否かの判定である時間判定を行う。当該時間判定の結果が真の場合、モデル学習システムが、当該時間判定の結果が真の場合、複数種類の計算リソースの少なくとも一つについて今回の学習におけるリソース割当量を増加させる。 Each time a machine learning model is trained, the model learning system obtains a resource allocation, which is the amount of computational resources allocated to training the machine learning model, for each of multiple types of computational resources, and monitors resource usage, which is the amount of computational resources used for training the machine learning model, for each of multiple types of computational resources. The model learning system estimates the time required for the current training of the machine learning model and performs a time determination to determine whether the estimated time is longer than the time required for previous training of the machine learning model. If the result of the time determination is true, the model learning system increases the resource allocation for the current training for at least one of the multiple types of computational resources.

その他、本願が開示する課題、及びその解決方法は、発明を実施するための形態の欄、及び図面により明らかにされる。 Furthermore, the problems disclosed in this application and the solutions thereto will be made clear in the detailed description and drawings.

本発明によれば、モデル学習の遅延回復と過剰なリソース量の割当回避の両方を実現することができる。 This invention makes it possible to both recover from delays in model learning and avoid allocating excessive amounts of resources.

FIG. 1 is a diagram illustrating an example of the configuration of a model learning system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a learning history table.
FIG. 3 is a diagram illustrating an example of a learning resource table.
FIG. 4 is a diagram illustrating an example of an execution cost table.
FIG. 5 is an example of a flowchart of a model learning process.
FIG. 6 is an example of a flowchart of a learning monitoring process.
FIG. 7 is an example of a flowchart of a learning control process.
FIG. 8 is a diagram illustrating an example of a model information screen.

In the following description, an "interface apparatus" may be one or more interface devices. The one or more interface devices may be at least one of the following:
- One or more I/O (Input/Output) interface devices. An I/O interface device is an interface device to at least one of an I/O device and a remote display computer. The I/O interface device to the display computer may be a communication interface device. At least one I/O device may be a user interface device, for example, either an input device such as a keyboard or pointing device, or an output device such as a display device.
- One or more communication interface devices. These may be one or more communication interface devices of the same type (e.g., one or more NICs (Network Interface Cards)) or two or more communication interface devices of different types (e.g., a NIC and an HBA (Host Bus Adapter)).

また、以下の説明では、「メモリ」は、一つ以上のメモリデバイスであり、典型的には主記憶デバイスで良い。メモリにおける少なくとも一つのメモリデバイスは、揮発性メモリデバイスであっても良いし不揮発性メモリデバイスであっても良い。 In the following description, "memory" refers to one or more memory devices, typically a primary storage device. At least one of the memory devices may be a volatile memory device or a non-volatile memory device.

また、以下の説明では、「永続記憶装置」は、一つ以上の永続記憶デバイスである。永続記憶デバイスは、典型的には、不揮発性の記憶デバイス（例えば補助記憶デバイス）であり、具体的には、例えば、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）またはＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）である。 In the following description, a "persistent storage device" refers to one or more persistent storage devices. A persistent storage device is typically a non-volatile storage device (e.g., an auxiliary storage device), specifically, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

Also, in the following description, a "storage device" may be at least the memory out of the memory and the persistent storage device.

また、以下の説明では、「プロセッサ」は、一つ以上のプロセッサデバイスである。少なくとも一つのプロセッサデバイスは、典型的には、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）のようなマイクロプロセッサデバイスであるが、ＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）のような他種のプロセッサデバイスでも良い。少なくとも一つのプロセッサデバイスは、シングルコアでも良いしマルチコアでも良い。少なくとも一つのプロセッサデバイスは、プロセッサコアでも良い。少なくとも一つのプロセッサデバイスは、処理の一部または全部を行うハードウェア回路（例えばＦＰＧＡ（Ｆｉｅｌｄ－ＰｒｏｇｒａｍｍａｂｌｅＧａｔｅＡｒｒａｙ）またはＡＳＩＣ（ＡｐｐｌｉｃａｔｉｏｎＳｐｅｃｉｆｉｃＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔ））といった広義のプロセッサデバイスでも良い。 In the following description, a "processor" refers to one or more processor devices. The at least one processor device is typically a microprocessor device such as a CPU (Central Processing Unit), but may also be other types of processor devices such as a GPU (Graphics Processing Unit). The at least one processor device may be single-core or multi-core. The at least one processor device may also be a processor core. The at least one processor device may also be a broader processor device such as a hardware circuit that performs some or all of the processing (for example, an FPGA (Field-Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit)).

In the following description, functions are sometimes described using the expression "yyy unit." A function may be realized by a processor executing one or more computer programs, by one or more hardware circuits (e.g., FPGAs or ASICs), or by a combination of these. When a function is realized by a processor executing a program, the specified processing is performed while using a storage device and/or an interface apparatus as appropriate, and therefore the function may be regarded as at least part of the processor. Processing described with a function as the subject may be processing performed by a processor or by a device having that processor. A program may be installed from a program source. The program source may be, for example, a program distribution computer or a computer-readable recording medium (e.g., a non-transitory recording medium). The description of each function is an example; multiple functions may be combined into a single function, and a single function may be divided into multiple functions.

In the following description, information that yields an output for an input may be described using expressions such as "xxx table." This information may be data of any structure (for example, structured data or unstructured data), or it may be a learning model that generates an output for an input, as typified by a neural network, a genetic algorithm, or a random forest. Therefore, an "xxx table" can also be called "xxx information." In the following description, the structure of each table is an example; one table may be divided into two or more tables, and all or part of two or more tables may be combined into one table.

また、以下の説明において、「モデル」は、機械学習モデルであり、例えば、ニューラルネットワークで良い。 Also, in the following description, a "model" refers to a machine learning model, such as a neural network.

以下、発明を実施するための形態について図面を用いて詳細に説明する。 The following describes in detail the embodiments of the invention using the accompanying drawings.

図１は、本発明の一実施形態に係るモデル学習システム１の構成例を示している。 Figure 1 shows an example configuration of a model learning system 1 according to one embodiment of the present invention.

同図に示すように、モデル学習システム１は、インターフェース装置１０１、記憶装置１０２及びプロセッサ１０３を備えた物理的なコンピュータシステムでもよいし、物理的なコンピュータシステムに基づく論理的なコンピュータシステム（例えば、仮想マシン又はクラウドコンピューティングシステム）でもよい。インターフェース装置１０１に、ユーザ装置３と通信ネットワーク２を介して通信可能に接続されている。通信ネットワーク２は、無線方式又は有線方式の通信手段であり、例えば、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＷＡＮ（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）、インターネット、専用線、各種公衆通信網等である。ユーザ装置３は、パーソナルコンピュータやスマートフォンといった情報処理端末で良い。モデル学習システム１がサーバの一例で良くユーザ装置３がクライアントの一例で良い。 As shown in the figure, the model learning system 1 may be a physical computer system equipped with an interface device 101, a storage device 102, and a processor 103, or it may be a logical computer system based on a physical computer system (e.g., a virtual machine or cloud computing system). The interface device 101 is communicatively connected to a user device 3 via a communication network 2. The communication network 2 is a wireless or wired communication means, such as a LAN (Local Area Network), WAN (Wide Area Network), the Internet, a dedicated line, or various public communication networks. The user device 3 may be an information processing terminal such as a personal computer or smartphone. The model learning system 1 may be an example of a server, and the user device 3 may be an example of a client.

記憶装置１０２に、プロセッサ１０３によりアクセスされるデータやプロセッサ１０３により実行されるコンピュータプログラムが格納される。記憶装置１０２に格納されるデータとして、例えば、学習履歴テーブル４００、学習リソーステーブル４１０及び実行コストテーブル６００がある。記憶装置１０２に格納されるコンピュータプログラムとして、例えば、後述の学習実行部１２を実現するためのコンピュータプログラムと、後述の学習監視部１０、学習制御部１１及び情報出力部１３を実現するためのコンピュータプログラムとがある。学習監視部１０及び学習制御部１１を実現するためのコンピュータプログラムと、情報出力部１３を実現するためのコンピュータプログラムとは別々のプログラムでもよい。 The storage device 102 stores data accessed by the processor 103 and computer programs executed by the processor 103. Data stored in the storage device 102 includes, for example, a learning history table 400, a learning resource table 410, and an execution cost table 600. Computer programs stored in the storage device 102 include, for example, a computer program for implementing the learning execution unit 12 described below, and a computer program for implementing the learning monitoring unit 10, learning control unit 11, and information output unit 13 described below. The computer program for implementing the learning monitoring unit 10 and learning control unit 11 and the computer program for implementing the information output unit 13 may be separate programs.

By the processor 103 executing one or more computer programs stored in the storage device 102, functions such as the learning monitoring unit 10, the learning control unit 11, the learning execution unit 12, and the information output unit 13 are realized. The learning monitoring unit 10 collects the resource usage and execution time (learning time) of model learning. The learning control unit 11 performs control to recover from the delay when model learning is delayed. The information output unit 13 displays model information, which is information about the model, on the user device 3 (that is, it transmits information for displaying the model information to the user device 3).

学習実行部１２は、指定されたリソース割当量の範囲のリソース使用量でモデル学習を実行する。学習実行部１２は、例えば、ＩａａＳを用いて実現されるデジタルインフラで良い。学習実行部１２は、リソース量の増加又は減少の指示に従い、モデル学習に必要なリソース使用量を制御する（例えば、ＩａａＳにおける仮想マシンのＣＰＵの割り当てを増やしたり、メモリの割り当てを減らしたりする）。 The learning execution unit 12 executes model learning with resource usage within the specified resource allocation range. The learning execution unit 12 may be, for example, a digital infrastructure implemented using IaaS. The learning execution unit 12 controls the resource usage required for model learning in accordance with instructions to increase or decrease the amount of resources (for example, by increasing the CPU allocation of a virtual machine in IaaS or decreasing the memory allocation).

インターフェース装置１０１、記憶装置１０２及びプロセッサ１０３のうちの少なくとも一部が、モデル学習に使用される複数種類の計算リソースの一例でよい。具体的には、例えば、記憶装置１０２は、メモリを含み、プロセッサ１０３は、複数のＣＰＵコアを有する一つ以上のＣＰＵを含んでよく、複数種類の計算リソースは、メモリとＣＰＵでよい。モデル学習に使用される計算リソースとして、ポートや通信帯域といったリソースが採用されてもよい。 At least some of the interface device 101, storage device 102, and processor 103 may be examples of multiple types of computational resources used for model learning. Specifically, for example, the storage device 102 may include memory, and the processor 103 may include one or more CPUs with multiple CPU cores, and the multiple types of computational resources may be memory and CPUs. Resources such as ports and communication bandwidth may also be adopted as computational resources used for model learning.

Each time a model is learned, the learning monitoring unit 10 acquires, for each of the multiple types of computational resources, the resource allocation of the model (the amount of computational resources allocated to learning the model), and monitors, for each of the multiple types of computational resources, the resource usage of the model (the amount of computational resources used for learning the model). The learning control unit 11 estimates the time required for the current learning of the model and performs a time determination, which is a determination of whether the estimated time is longer than the time required for past learning of the model (the time identified from the learning history table 400). If the result of the time determination is true, the learning control unit 11 identifies the bottleneck computational resource type based on the resource usage of the multiple types of computational resources in the current learning, and increases the resource allocation in the current learning for the computational resource of the identified type (for example, it instructs the learning execution unit 12 to increase the resource usage for the model).
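The time determination and the allocation increase described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: the text does not say how the bottleneck is identified, so the sketch assumes it is the resource type whose usage is closest to its allocation, and it assumes the allocation is doubled. The names `ResourceState`, `find_bottleneck`, and `control_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResourceState:
    allocated: float  # resource allocation (e.g., CPU cores or GB of memory)
    used: float       # resource usage in the current learning, same unit


def find_bottleneck(resources: Dict[str, ResourceState]) -> str:
    """Assumed criterion: the type with the highest usage-to-allocation ratio."""
    return max(resources, key=lambda name: resources[name].used / resources[name].allocated)


def control_step(estimated_time: float, past_time: float,
                 resources: Dict[str, ResourceState],
                 factor: float = 2.0) -> Optional[str]:
    """Run the time determination; on a true result, raise one allocation.

    Returns the resource type whose allocation was increased, or None
    when the estimated time is not longer than the past learning time.
    """
    if estimated_time <= past_time:      # time determination is false
        return None
    bottleneck = find_bottleneck(resources)
    resources[bottleneck].allocated *= factor  # assumed increase policy
    return bottleneck
```

For example, with a CPU allocation of 4 cores at 3.8 cores used and a memory allocation of 8 GB at 2 GB used, an estimated time longer than the past time would raise only the CPU allocation.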

このモデル学習システム１によれば、今回の学習に要すると推定された時間が過去の学習時間よりも長いと判定された場合に学習のためのリソース割当量が増加するため、モデル学習の遅延回避を実現することができ、且つ、このようなリソース割当量の増加が可能であるためあらかじめ過剰なリソース量を割り当てておくことは不要である。 With this model learning system 1, if it is determined that the time estimated to be required for current learning is longer than the previous learning time, the amount of resources allocated for learning is increased, thereby avoiding delays in model learning. Furthermore, because it is possible to increase the amount of resource allocation in this way, there is no need to allocate excessive amounts of resources in advance.

The estimated learning time is compared with the learning time recorded in the past history. One example of the technical significance of this is as follows: the size of the training data used to learn the same model is considered unlikely to change significantly from one learning to the next. Therefore, by using the past learning time as the comparison target for the estimated learning time, it is possible to efficiently determine whether the current learning will be delayed.

以下、本実施形態を詳細に説明する。なお、以下、説明の便宜上、モデルの精度を「学習精度」と呼ぶ。 This embodiment will be described in detail below. For ease of explanation, the accuracy of the model will be referred to as "learning accuracy."

図２は、学習履歴テーブル４００の一例を示す。 Figure 2 shows an example of a learning history table 400.

同図に示すように、学習履歴テーブル４００は、モデル毎の過去の学習の履歴を表す。例えば、学習履歴テーブル４００は、モデル学習毎にレコードを有する。レコードは、モデル名４０１、Ｅｐｏｃｈ４０２、学習精度４０３、日時４０４、時間４０５、データ数４０６、ＣＰＵ４０７及びメモリ４０８といった情報を保持する。一つのモデル学習を例に取る。 As shown in the figure, the learning history table 400 represents the history of past learning for each model. For example, the learning history table 400 has a record for each model learning. The record holds information such as model name 401, Epoch 402, learning accuracy 403, date and time 404, time 405, number of data 406, CPU 407, and memory 408. Let's take one model learning as an example.

The model name 401 indicates the name of the model that is the target of model learning. The Epoch 402 indicates the number of epochs in the model learning. That is, in this embodiment, a group of training data is used for model learning, and the same training data is input repeatedly. Learning that includes one input of the training data is an "epoch." Learning ends when a predetermined number of epochs (the epoch count) has been performed. Specifically, for example, the learning monitoring unit 10 does the following each time a model is learned:
- It determines the start of the first epoch of the learning to be the start of the learning.
- After the start of the learning, it determines that the model is being learned as long as the number of epochs specified for the learning has not been completed.
- After the start of the learning, it determines that the learning of the model has ended when the number of epochs specified for the learning has been completed.
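The three determinations above amount to a small state function over epoch counts. The sketch below is only illustrative; `LearningState` and `learning_state` are hypothetical names, and the way the monitoring unit learns that the first epoch has started is assumed to be a simple flag.

```python
from enum import Enum


class LearningState(Enum):
    NOT_STARTED = "not started"
    IN_PROGRESS = "learning"
    FINISHED = "finished"


def learning_state(first_epoch_started: bool,
                   epochs_completed: int,
                   epochs_specified: int) -> LearningState:
    """Classify a learning run from its epoch progress."""
    if not first_epoch_started:
        return LearningState.NOT_STARTED
    if epochs_completed >= epochs_specified:
        # The specified number of epochs has been completed.
        return LearningState.FINISHED
    # Started, but the specified number of epochs is not yet complete.
    return LearningState.IN_PROGRESS
```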

The learning accuracy 403 indicates the learning accuracy of the model that has finished learning. For example, at least one of the accuracy rate, the loss, and other elements may be adopted as an element of the learning accuracy. In this embodiment, the learning accuracy is defined by the accuracy rate and the loss. The "end of learning" may correspond to either of the following:
- The predetermined number of epochs has been performed.
- The learning accuracy has converged (an example of a case where learning ends even though the predetermined number of epochs has not been performed).
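The two end conditions can be combined into one check, sketched below. The text does not define convergence, so this sketch assumes that the learning accuracy has converged when the loss changed by less than a tolerance over each of the last few epochs. The function names and the `tolerance` and `patience` parameters are hypothetical.

```python
from typing import Sequence


def has_converged(losses: Sequence[float],
                  tolerance: float = 1e-3,
                  patience: int = 3) -> bool:
    """Assumed criterion: the last `patience` epoch-to-epoch loss changes
    are all smaller than `tolerance`."""
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    return all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))


def learning_finished(epochs_completed: int,
                      epochs_specified: int,
                      losses: Sequence[float]) -> bool:
    """True when either end condition holds."""
    return epochs_completed >= epochs_specified or has_converged(losses)
```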

日時４０４は、レコードが記憶された日時（モデル学習の開始日時）を表す。時間４０５は、学習開始時と学習終了時の日時の差、すなわち、モデル学習に要した時間を表す。データ数４０６は、モデル学習に使用した訓練データを構成するデータ要素数を表す。例えば、訓練データが、複数の画像ファイルの集合の場合、データ数は、画像のファイルの数でよい。 The date and time 404 represents the date and time when the record was stored (the start date and time of model learning). The time 405 represents the difference between the date and time when learning started and the date and time when learning ended, i.e., the time required for model learning. The number of data 406 represents the number of data elements that make up the training data used for model learning. For example, if the training data is a collection of multiple image files, the number of data may be the number of image files.

ＣＰＵ４０７は、モデル学習に割り当てられたＣＰＵ割当量を表す。メモリ４０８は、モデル学習に割り当てられたメモリ割当量を表す。ＣＰＵ割当量及びメモリ割当量がリソース割当量の一例である。 CPU 407 represents the CPU allocation allocated to model learning. Memory 408 represents the memory allocation allocated to model learning. The CPU allocation and memory allocation are examples of resource allocation.
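As a rough illustration of the record layout described for the learning history table 400, the sketch below models one record and a lookup of a past learning time for a model. The field names, units, and the choice of the most recent record as the comparison target are assumptions made for this example; the text only says that the past learning time is identified from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class LearningHistoryRecord:
    model_name: str        # model name 401
    epochs: int            # Epoch 402
    accuracy_rate: float   # learning accuracy 403 (accuracy rate)
    loss: float            # learning accuracy 403 (loss)
    started_at: datetime   # date and time 404
    duration: timedelta    # time 405
    data_count: int        # number of data 406
    cpu_cores: float       # CPU 407 (assumed unit: cores)
    memory_gb: float       # memory 408 (assumed unit: GB)


def latest_learning_time(history: Iterable[LearningHistoryRecord],
                         model_name: str) -> Optional[timedelta]:
    """Return the duration of the most recent learning of the model, if any."""
    records = [r for r in history if r.model_name == model_name]
    if not records:
        return None
    return max(records, key=lambda r: r.started_at).duration
```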

図３は、学習リソーステーブル４１０の一例を示す。 Figure 3 shows an example of a learning resource table 410.

As shown in the figure, the learning resource table 410 represents, for each model, the status of the current learning of that model. For example, the learning resource table 410 has a record for each epoch. The record holds information such as the model name 411, date and time 419, Epoch 412, learning accuracy 413, CPU 414, memory 415, time 416, delay 417, and resource 418. One epoch is taken as an example below.

The model name 411 indicates the name of the model being learned. The Epoch 412 indicates which epoch of the current learning this epoch is. The learning accuracy 413 indicates the learning accuracy of the model being learned (in this embodiment, the accuracy rate and the loss). The CPU 414 indicates the CPU usage associated with learning the model (e.g., CPU usage rate and/or CPU usage time). The memory 415 indicates the memory usage associated with the learning process (e.g., memory usage rate and/or amount of memory used). The time 416 indicates the difference between the date and time of the start of learning and the date and time of measurement, i.e., the time spent on model learning so far.

遅延４１７は、時間４１６を基に推定された学習時間（今回の学習に要すると推定された時間）が同じモデルの過去の学習時間と比べて長い（遅延している）かどうかを示す値である。例えば遅延ありの場合は値が「あり」であり、遅延なしの場合は値が「なし」である。 Delay 417 is a value that indicates whether the learning time estimated based on time 416 (the time estimated to be required for current learning) is longer (delayed) than past learning times for the same model. For example, if there is a delay, the value is "Yes," and if there is no delay, the value is "No."
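One way the delay value could be derived is sketched below. The text does not state how the learning time is estimated from the time 416, so this example assumes a linear extrapolation from the elapsed time and the number of completed epochs, and it assumes the comparison target is the mean of the past learning times. Both function names are hypothetical.

```python
from typing import Sequence


def estimate_total_time(elapsed: float, epochs_done: int, epochs_total: int) -> float:
    """Assumed estimate: average time per completed epoch times the epoch count."""
    if epochs_done <= 0:
        raise ValueError("at least one epoch must be complete")
    return elapsed / epochs_done * epochs_total


def delay_value(elapsed: float, epochs_done: int, epochs_total: int,
                past_times: Sequence[float]) -> str:
    """Return "Yes" if the estimated time exceeds the past reference, else "No"."""
    estimated = estimate_total_time(elapsed, epochs_done, epochs_total)
    reference = sum(past_times) / len(past_times)  # assumed: mean of past times
    return "Yes" if estimated > reference else "No"
```

With 30 minutes elapsed after 3 of 10 epochs, the estimate is 100 minutes, which counts as delayed against past learnings of 90 and 95 minutes.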

リソース４１８は、学習に対するＣＰＵ割当量やメモリ割当量といったリソース割当量を表す。例えば、リソース４１８の値は、ＣＰＵコア数（ＣＰＵ割当量の一例）とメモリサイズ（メモリ割当量）のペアであるリソース量ペアである。 Resource 418 represents resource allocations, such as CPU allocation and memory allocation, for learning. For example, the value of resource 418 is a resource amount pair that pairs the number of CPU cores (an example of CPU allocation) and memory size (memory allocation).

図４は、実行コストテーブル６００の一例を示す。 Figure 4 shows an example of an execution cost table 600.

同図に示すように、実行コストテーブル６００は、複数種類の計算リソースのコストを表す。例えば、実行コストテーブル６００は、リソース量ペア毎にレコードを有する。レコードは、割当リソース６０１及びコスト６０２といった情報を保持する。一つのリソース量ペアを例に取る。 As shown in the figure, the execution cost table 600 represents the costs of multiple types of computing resources. For example, the execution cost table 600 has a record for each resource amount pair. The record holds information such as allocated resources 601 and costs 602. Let's take one resource amount pair as an example.

割当リソース６０１は、当該ペアにおけるＣＰＵ割当量とメモリ割当量とを表す。また、コスト６０２は、当該ペアについて単位時間あたりのリソース利用コスト（実行コスト）を表す。 Allocated resources 601 represent the CPU allocation and memory allocation for the pair. Also, cost 602 represents the resource usage cost (execution cost) per unit time for the pair.
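The execution cost table can be pictured as a mapping from a resource amount pair to a cost per unit time. The sketch below is illustrative only: the pairs, the cost values, the hour as the unit of time, and both helper functions are made up for this example and do not come from the table in the figure.

```python
from typing import Dict, Optional, Tuple

ResourcePair = Tuple[int, int]  # (CPU cores, memory in GB)

# Hypothetical contents: cost per hour for each resource amount pair.
EXECUTION_COST: Dict[ResourcePair, float] = {
    (4, 4): 10.0,
    (8, 8): 20.0,
    (8, 16): 30.0,
    (16, 16): 40.0,
}


def learning_cost(pair: ResourcePair, hours: float) -> float:
    """Cost of running a learning on `pair` for `hours` hours."""
    return EXECUTION_COST[pair] * hours


def cheapest_pair_covering(min_cpu: int, min_memory: int) -> Optional[ResourcePair]:
    """Cheapest pair offering at least the requested CPU cores and memory."""
    candidates = [p for p in EXECUTION_COST if p[0] >= min_cpu and p[1] >= min_memory]
    if not candidates:
        return None
    return min(candidates, key=lambda p: EXECUTION_COST[p])
```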

続いて、以上の構成からなるモデル学習システム１において行われる処理の例を説明する。 Next, we will explain an example of the processing performed in the model learning system 1 configured as described above.

図５は、モデル学習処理のフローチャートである。 Figure 5 is a flowchart of the model learning process.

まず、学習実行部１２は、例えばユーザ装置３からの操作に応答して、学習対象のモデル（例えばニューラルネットワーク）を定義する（ステップ８０１）。ここで、モデルの定義は、学習実行時のモデル作成でもよいし、作成済のモデルの取り込みでも良い。 First, the learning execution unit 12 defines a model (e.g., a neural network) to be learned, for example, in response to an operation from the user device 3 (step 801). Here, the model definition may be the creation of a model during learning execution, or the import of a model that has already been created.

次に、学習実行部１２は、例えばユーザ装置３からの操作に応答して、ステップ８０１で定義されたモデルのエポック数（学習の繰り返しの回数）を指定する（ステップ８０２）。例えば、ニューラルネットワークの学習においては、学習精度を向上させるために、同一の訓練データの入力によるモデル学習を繰り返すことが一般的である。１回の訓練データの入力でのモデル学習が「エポック」である。ステップ８０２では、このエポックの数が指定される。 Next, the learning execution unit 12, for example, in response to an operation from the user device 3, specifies the number of epochs (number of learning repetitions) for the model defined in step 801 (step 802). For example, in neural network learning, it is common to repeatedly learn models by inputting the same training data in order to improve learning accuracy. Model learning with one input of training data is called an "epoch." In step 802, the number of epochs is specified.

次に、学習実行部１２は、学習対象のモデルに訓練データを入力するモデル学習を実行する（ステップ８０３）。つまり、学習実行部１２は、１回のエポックを実行し、エポック数（例えばメモリに記憶されているカウント値）を更新（１インクリメント）する。 Next, the learning execution unit 12 executes model learning, inputting training data into the model to be learned (step 803). That is, the learning execution unit 12 executes one epoch and updates (increments by 1) the number of epochs (e.g., a count value stored in memory).

次に、学習実行部１２は、更新後のエポック数がステップ８０２で指定したエポック数に達したかどうかを判定する（ステップ８０４）。ステップ８０４の判定結果が偽の場合、処理がテップ８０３に戻る。ステップ８０４の判定結果が真の場合、学習対象のモデルの学習が終了する。 Next, the learning execution unit 12 determines whether the number of epochs after the update has reached the number of epochs specified in step 802 (step 804). If the determination result in step 804 is false, the process returns to step 803. If the determination result in step 804 is true, learning of the model to be learned is terminated.
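Steps 803 and 804 form a simple counting loop, sketched below. The function `run_model_learning` and its `train_one_epoch` callback are hypothetical stand-ins for the learning execution unit and the actual per-epoch training, which the sketch does not implement.

```python
from typing import Callable


def run_model_learning(train_one_epoch: Callable[[int], None],
                       epochs_specified: int) -> int:
    """Repeat one-epoch learning until the specified epoch count is reached.

    Returns the number of epochs performed.
    """
    if epochs_specified < 1:
        raise ValueError("epochs_specified must be at least 1")
    epochs_done = 0
    while True:
        train_one_epoch(epochs_done + 1)     # step 803: run one epoch
        epochs_done += 1                     # step 803: increment the count
        if epochs_done >= epochs_specified:  # step 804: count reached?
            return epochs_done
```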

図６は、学習監視処理のフローチャートである。学習監視部１０は、モデル学習処理の実行を契機として、学習実行部１２から呼び出され、学習監視処理を開始する。例えば、呼び出されるタイミングは、図５に示したモデル学習処理の開始時、ステップ８０３の完了時（エポックの完了時）、及び、モデル学習処理の終了時である。 Figure 6 is a flowchart of the learning monitoring process. The learning monitoring unit 10 is called by the learning execution unit 12 when the model learning process is executed, and starts the learning monitoring process. For example, it is called at the start of the model learning process shown in Figure 5, at the completion of step 803 (at the completion of the epoch), and at the end of the model learning process.

When called by the learning execution unit 12, the learning monitoring unit 10 first determines whether the call was made at the start of the model learning process (step 301).

If the determination result in step 301 is true, the learning monitoring unit 10 records the date and time 419 (learning start date and time) for the model being learned in the learning resource table 410 and stores the number of data (the number of data elements in the input training data) in memory (step 302). The learning monitoring unit 10 instructs the learning execution unit 12 to set the resource allocation (step 308). Here, the "resource allocation" in step 308 may be a resource amount pair such as CPU "4 cores" and memory "4 GB." The resource allocation may be selected from the perspective of execution cost, or it may be selected based on an estimate obtained from a statistical calculation on past learning results (e.g., a regression analysis of each of the past number of data 406, CPU 407, and memory 408 of the model being learned). Note that step 308 may be executed at the start of model learning, independently of the learning monitoring process.
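The regression-based selection mentioned for step 308 could look like the sketch below. It assumes an ordinary least-squares line relating the past number of data to a past allocation amount (for example, CPU cores), which is only one possible reading of "regression analysis." The functions `fit_line` and `estimate_allocation` are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All x values are equal: fall back to a constant prediction.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_allocation(past_data_counts: Sequence[float],
                        past_amounts: Sequence[float],
                        new_data_count: float) -> float:
    """Predict an allocation amount for the new number of data."""
    slope, intercept = fit_line(past_data_counts, past_amounts)
    return slope * new_data_count + intercept
```

The predicted amount would still need to be mapped to an available resource amount pair, for instance by rounding up to the next offered size.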

If the determination result in step 301 is false, the learning monitoring unit 10 acquires, from the learning execution unit 12, the resource usage (CPU usage and memory usage) and the resource allocation (CPU allocation and memory allocation) of the model learning process (step 303). Here, the resource usage may be the value at the time of acquisition, or it may be a statistical value (e.g., the average or the maximum) of the resource usage from the previous epoch to the time of measurement. An average is useful because it is close to the actual resource usage of the learning process. The learning monitoring unit 10 acquires, from the learning execution unit 12, information such as the name of the model being learned (model name), the current epoch count (count value), and the learning accuracy (step 304). The learning monitoring unit 10 then determines whether model learning has ended (step 305).
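The choice between an average and a maximum in step 303 can be sketched as a small helper over usage samples collected since the previous epoch. How the samples are collected is outside this sketch, and `summarize_usage` and its `method` parameter are hypothetical.

```python
from statistics import mean
from typing import Sequence


def summarize_usage(samples: Sequence[float], method: str = "average") -> float:
    """Reduce usage samples taken since the previous epoch to one value."""
    if not samples:
        raise ValueError("no usage samples were collected")
    if method == "average":
        return mean(samples)   # close to the typical usage of the process
    if method == "maximum":
        return max(samples)    # captures short peaks in usage
    raise ValueError(f"unknown method: {method}")
```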

If the determination result in step 305 is true, the learning monitoring unit 10 calculates the learning time of the model from the difference between the date and time 419 of the model and the current date and time, and further calculates statistics (e.g., averages) of the CPU allocation and memory allocation from the CPU 414 and memory 415 recorded for each epoch of the model. The learning monitoring unit 10 records, in the learning history table 400, the acquired model name 401, Epoch 402 (number of epochs), learning accuracy 403, and number of data 406, together with the date and time 404, time 405 (the calculated learning time), CPU 407 (the calculated statistic of the CPU allocation), and memory 408 (the calculated statistic of the memory allocation) (step 306).

If the determination result in step 305 is false, the learning monitoring unit 10 calculates the learning time of the model from the difference between the date and time 419 and the current date and time, and records the time 416 in the learning resource table 410 together with the acquired model name 411, Epoch 412, learning accuracy 413, resource usage amount (CPU 414 and memory 415), and resources 418 (step 307).

As described above, each time a model is trained, the learning monitoring unit 10 monitors the model accuracy of the model being trained (for example, by acquiring the model accuracy at every epoch). When learning is complete, the model accuracy, number of data, and other information for the model are recorded in the learning history table 400.

図７は、学習制御処理のフローチャートである。 Figure 7 is a flowchart of the learning control process.

学習制御部１１は、学習監視部１０と同様に、モデル学習処理の実行を契機として、学習実行部１２から呼び出される。例えば、呼び出されるタイミングは、図５に示したモデル学習処理の開始時、ステップ８０３の完了時（エポックの完了時）、及び、モデル学習処理の終了時である。 Like the learning monitoring unit 10, the learning control unit 11 is called by the learning execution unit 12 when the model learning process is executed. For example, it is called at the start of the model learning process shown in Figure 5, at the completion of step 803 (at the completion of an epoch), and at the end of the model learning process.

学習実行部１２から呼び出された学習制御部１１は、学習リソーステーブル４１０から、学習中のモデル（モデル名４１１）の全てのレコードにおける時間４１６、Ｅｐｏｃｈ４１２、リソース使用量（ＣＰＵ４１４及びメモリ４１５）及び学習精度４１３を取得する（ステップ５０１）。 When called by the learning execution unit 12, the learning control unit 11 obtains the time 416, Epoch 412, resource usage (CPU 414 and memory 415), and learning accuracy 413 for all records of the model being learned (model name 411) from the learning resource table 410 (step 501).

続いて、学習制御部１１は、学習履歴テーブル４００から、当該モデルの過去のモデル学習に対応した時間４０５及び学習精度４０３を取得する（ステップ５０２）。 Next, the learning control unit 11 obtains the time 405 and learning accuracy 403 corresponding to past model learning of the model from the learning history table 400 (step 502).

Next, the learning control unit 11 performs a convergence determination, that is, a determination of whether the learning accuracy acquired in step 501 has converged (step 503). If the result of the convergence determination in step 503 is true, the learning control unit 11 instructs the learning execution unit 12 to stop (end) model learning (step 509). If the result of the convergence determination in step 503 is false, the learning control unit 11 decides to continue model learning (for example, it sends no particular instruction to the learning execution unit 12) (step 508). Here, the learning accuracy having converged means that the fluctuation of the learning accuracy 413 has settled within a small range, and may be, for example, either of the following:
- The difference between the learning accuracy 413 (e.g., accuracy rate) corresponding to the latest epoch acquired in step 501 and the learning accuracy 413 (e.g., accuracy rate) corresponding to the previous epoch (the Epoch 412 that is one smaller than the latest Epoch 412) is smaller than a threshold.
- The range (the difference between the maximum and minimum values) of the learning accuracy 413 (e.g., accuracy rate) over the period from the latest epoch back to an epoch a pre-specified number of epochs earlier is smaller than a threshold.
Terminating model learning when the learning accuracy has converged in this way (specifically, terminating it without performing the determination in step 505 described below) avoids delays in model learning and reduces execution cost.
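The two convergence criteria above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment; the function names, the `window` parameter (counted here as the number of epochs included in the range), and the threshold values are assumptions introduced for this example.

```python
from typing import Sequence


def converged_by_last_delta(accuracies: Sequence[float], threshold: float) -> bool:
    """Criterion 1: latest accuracy differs from the previous epoch's by less than threshold."""
    if len(accuracies) < 2:
        return False  # need at least two epochs to compare
    return abs(accuracies[-1] - accuracies[-2]) < threshold


def converged_by_range(accuracies: Sequence[float], window: int, threshold: float) -> bool:
    """Criterion 2: (max - min) over the last `window` epochs is less than threshold."""
    if window < 2 or len(accuracies) < window:
        return False  # not enough epochs recorded yet
    recent = accuracies[-window:]
    return (max(recent) - min(recent)) < threshold
```

Either function could serve as the step 503 check; the range-based variant is less sensitive to a single flat epoch because it looks at several epochs at once.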

If the result of the convergence determination in step 503 is false, the learning control unit 11 estimates the learning time of the model being trained (step 504). One estimation method is to divide the latest time 416 acquired in step 501 by the latest Epoch 412 to obtain the average learning time per epoch, multiply that average by the latest number of epochs (count value), and use the result as the estimate of the time required for the current model learning. As described above, in the current learning, the learning control unit 11 performs the convergence determination (step 503) based on the learning accuracy 413 for each epoch, and estimates the time required for the current learning based on the latest Epoch 412 (count value) and the time 416 required up to the latest epoch. In this way, the information acquired at each epoch can be used for both the convergence determination and the learning time estimation.
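The per-epoch averaging in step 504 can be sketched as below. This sketch assumes that the per-epoch average is projected over the total number of epochs specified for the run (the hypothetical parameter `planned_epochs`); that is an interpretation made for illustration, not something the text states, and the function name is likewise an assumption.

```python
def estimate_training_time(elapsed_seconds: float,
                           completed_epochs: int,
                           planned_epochs: int) -> float:
    """Project the total training time from the time spent so far.

    elapsed_seconds  : time up to the latest epoch (time 416 in the text)
    completed_epochs : latest epoch count (Epoch 412 in the text)
    planned_epochs   : assumed total number of epochs for this run
    """
    if completed_epochs <= 0:
        raise ValueError("at least one completed epoch is required")
    average_per_epoch = elapsed_seconds / completed_epochs
    return average_per_epoch * planned_epochs
```

For example, 600 seconds spent on 3 epochs gives an average of 200 seconds per epoch, so a 10-epoch run would be projected at 2000 seconds.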

Next, the learning control unit 11 performs a time determination, that is, a determination of whether the learning time estimated in step 504 is longer than the past time 405 acquired in step 502, in other words, whether the learning process is delayed (step 505). If the result of the time determination in step 505 is true, the learning control unit 11 instructs the learning execution unit 12 to increase the resource allocation amount assigned to the learning process (step 510).

ここで、今回の学習に要する推定時間と比較される過去の学習時間は、当該モデルの全ての過去の時間４０５のうちの少なくとも一つに基づいて良く、例えば、当該モデルの過去の時間４０５のうち最も長い時間４０５でも良いし、当該モデルの過去の時間４０５の平均値でもよい。また、当該モデルの過去の時間４０５の平均値μと標準偏差σに基づいて統計的に外れた範囲（μ＋３σ以上など）に含まれる推定学習時間が遅延とみなされても良い。 Here, the past learning time compared with the estimated time required for the current learning may be based on at least one of all past times 405 of the model, and may be, for example, the longest time 405 of the model's past times 405, or the average value of the model's past times 405. Furthermore, an estimated learning time that falls within a statistically outlying range (such as μ + 3σ or more) based on the average value μ and standard deviation σ of the model's past times 405 may be considered to be delayed.
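The three baseline options described above (longest past time, average past time, and the statistical outlier bound μ + 3σ) can be sketched as follows. The function names and the `mode` strings are assumptions for this example, and the use of the population standard deviation (`statistics.pstdev`) is also an assumption, since the text does not say which estimator is intended.

```python
from statistics import mean, pstdev
from typing import Sequence


def past_time_baseline(past_times: Sequence[float], mode: str = "max") -> float:
    """Derive the comparison baseline from past learning times (time 405)."""
    if not past_times:
        raise ValueError("no past learning times recorded")
    if mode == "max":
        return max(past_times)
    if mode == "mean":
        return mean(past_times)
    if mode == "mu_plus_3sigma":
        # Assumption: population standard deviation.
        return mean(past_times) + 3 * pstdev(past_times)
    raise ValueError(f"unknown mode: {mode}")


def is_delayed(estimated: float, past_times: Sequence[float], mode: str = "max") -> bool:
    """Time determination (step 505): is the estimate longer than the baseline?"""
    baseline = past_time_baseline(past_times, mode)
    if mode == "mu_plus_3sigma":
        return estimated >= baseline  # "mu + 3 sigma or more" counts as delayed
    return estimated > baseline
```

The μ + 3σ option tolerates ordinary run-to-run variation, so it flags a delay less often than a comparison against the mean.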

また、ここで、リソース割当量の増加については、例えば、閾値よりも使用率の高いＣＰＵ及びメモリに対する割当量をあらかじめ設定した値で増加するのでも良いし、実行コストの低い方のＣＰＵあるいはメモリの割当量を増加するのでも良い。実行コストに基づいてリソース割当量を増やすことは、実行コストテーブル６００を参照して行われても良い。 Increasing the resource allocation amount may involve, for example, increasing the allocation amount for a CPU or memory with a higher utilization rate than a threshold by a preset value, or increasing the allocation amount for a CPU or memory with a lower execution cost. Increasing the resource allocation amount based on the execution cost may be performed by referring to the execution cost table 600.

Furthermore, if the result of the time determination in step 505 is true, the learning control unit 11 may identify the bottleneck computational resource type based on the resource usage amounts of the plurality of types of computational resources in the current learning (in this embodiment, the CPU 414 and memory 415 corresponding to each epoch of the model being trained). For example, the bottleneck may be whichever of the CPU 414 and memory 415 exceeds its threshold or, if both exceed their thresholds, the one whose usage rate exceeds its threshold by the larger margin. In step 510, the learning control unit 11 may increase the resource allocation amount in the current learning for computational resources of the identified type. This is expected to make the increase in resource allocation efficient.
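The bottleneck rule described above can be sketched as follows. The function name, the dictionary-based interface, and the idea of passing one usage rate per resource type (for example, the value from the latest epoch) are assumptions for illustration; the text does not specify how per-epoch values are aggregated.

```python
from typing import Dict, Optional


def find_bottleneck(usage_rates: Dict[str, float],
                    thresholds: Dict[str, float]) -> Optional[str]:
    """Return the resource type that exceeds its threshold by the largest margin.

    Returns None when no resource type exceeds its threshold.
    """
    excess = {
        name: usage_rates[name] - thresholds[name]
        for name in usage_rates
        if usage_rates[name] > thresholds[name]
    }
    if not excess:
        return None
    return max(excess, key=excess.get)
```

With a CPU usage rate of 0.85 and a memory usage rate of 0.97 against thresholds of 0.8 each, both exceed their thresholds, and memory is selected because its margin is larger.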

ステップ５０５の時間判定の結果が偽の場合は、学習制御部１１は、ステップ５０１にて取得した最新の学習精度４１３が、ステップ５０２で取得した過去の学習実績における学習精度よりも高いかどうかの判定である精度判定を行う（ステップ５０６）。ステップ５０６の精度判定の結果が真の場合に、学習制御部１１は、学習処理に割り当てるリソース量を減らすために、学習制御部１１は、学習実行部１２に対してリソース割当量を減少するよう指示する（ステップ５０７）。これにより、リソース割当量の最適化が期待できる。 If the result of the time determination in step 505 is false, the learning control unit 11 performs an accuracy determination to determine whether the latest learning accuracy 413 acquired in step 501 is higher than the learning accuracy in the past learning results acquired in step 502 (step 506). If the result of the accuracy determination in step 506 is true, the learning control unit 11 instructs the learning execution unit 12 to reduce the amount of resources allocated to the learning process (step 507). This is expected to optimize the amount of resource allocation.

ここで、最新の学習精度４１３と比較される過去の学習精度は、当該モデルの全ての過去の学習精度４０３のうちの少なくとも一つに基づいて良く、例えば、過去の学習実績での最も高い学習精度４０３でも良いし、過去の学習実績での学習精度４０３の平均値でも良い。 Here, the past learning accuracy compared with the latest learning accuracy 413 may be based on at least one of all past learning accuracies 403 of the model, and may be, for example, the highest learning accuracy 403 from past learning results, or the average value of the learning accuracies 403 from past learning results.

In addition, the resource allocation amount may be decreased by, for example, reducing the allocation for whichever of the CPU and memory has a usage rate below a threshold to a preset value, or by reducing the allocation for whichever of the CPU and memory has the lower execution cost. Reducing the resource allocation amount based on execution cost may be performed by referring to the execution cost table 600. For example, the learning control unit 11 may identify which type of computational resource, among the plurality of types, yields the greatest cost reduction when its resource allocation amount is maintained, and reduce the resource allocation amounts of the types other than the identified type. Specifically, if the current resource amount pair is CPU "8 cores" and memory "8 GB", then according to the execution cost table 600 illustrated in FIG. 4, maintaining CPU "8 cores" and reducing memory lowers the cost more than maintaining memory "8 GB" and reducing the number of CPU cores. The instruction to decrease the resource allocation amount is therefore an instruction to change CPU "8 cores" and memory "8 GB" to CPU "8 cores" and memory "4 GB". This is expected to reduce the resource allocation amount appropriately.
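The cost-based choice in the example above can be sketched as follows. The cost values in `COST_TABLE` are hypothetical placeholders, since the contents of the execution cost table 600 are not reproduced here, and halving one resource at a time is also an assumption made to keep the example small.

```python
from typing import Dict, Tuple

Allocation = Tuple[int, int]  # (CPU cores, memory in GB)

# Hypothetical costs per allocation pair; NOT the values of execution cost table 600.
COST_TABLE: Dict[Allocation, float] = {
    (8, 8): 10.0,
    (8, 4): 6.0,
    (4, 8): 8.0,
    (4, 4): 4.0,
}


def choose_reduced_allocation(current: Allocation,
                              cost_table: Dict[Allocation, float]) -> Allocation:
    """Reduce exactly one resource type, picking the cheaper resulting pair."""
    cpu, mem = current
    candidates = [(cpu // 2, mem), (cpu, mem // 2)]  # keep memory, or keep CPU
    priced = [c for c in candidates if c in cost_table]
    if not priced:
        return current  # no cheaper pair is listed; leave the allocation unchanged
    return min(priced, key=lambda c: cost_table[c])
```

With these placeholder costs, `choose_reduced_allocation((8, 8), COST_TABLE)` selects `(8, 4)`, which mirrors the 8 cores / 8 GB to 8 cores / 4 GB change described in the text.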

After step 507, the learning control process ends via step 508 described above. Likewise, if the result of the accuracy determination in step 506 is false, the learning control process ends via step 508 described above.
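The branching of the learning control process (steps 503 to 510) can be summarized in a small decision function. This is a sketch for orientation only; the `Action` enum, the function name, and the scalar inputs are assumptions, and the actual determinations may use the baselines and adjustments described elsewhere in this description.

```python
from enum import Enum


class Action(Enum):
    STOP = "stop learning (step 509)"
    INCREASE = "increase resource allocation (step 510)"
    DECREASE_THEN_CONTINUE = "decrease resource allocation (step 507), then continue (step 508)"
    CONTINUE = "continue learning (step 508)"


def decide(converged: bool,
           estimated_time: float,
           past_time: float,
           latest_accuracy: float,
           past_accuracy: float) -> Action:
    """Mirror the order of the determinations in steps 503, 505 and 506."""
    if converged:                        # step 503: convergence determination
        return Action.STOP
    if estimated_time > past_time:       # step 505: time determination
        return Action.INCREASE
    if latest_accuracy > past_accuracy:  # step 506: accuracy determination
        return Action.DECREASE_THEN_CONTINUE
    return Action.CONTINUE
```

The order matters: convergence is checked first so that a converged run is stopped without ever reaching the time determination.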

学習制御処理において、学習制御部１１は、ステップ５０５の時間判定で比較される少なくとも一方の時間を、今回の学習に使用される訓練データの量と当該過去の学習に使用された訓練データの量との差分に基づき調整してよい。例えば、学習制御部１１は、当該モデルの過去の学習に要した時間（例えば時間４０５の統計値）、及び／又は、ステップ５０４で推定された時間を、今回の学習に使用される訓練データの量（ステップ３０２で記録され学習制御部１１により取得されたデータ数）と当該過去の学習に使用された訓練データの量（例えばデータ数４０６の統計値）との差分に基づき調整してよい。ステップ５０５の時間判定は、調整された推定時間が過去の学習時間よりも長いか否か、ステップ５０４で推定された時間が過去の調整後の学習時間よりも長いか否か、又は、調整された推定時間が過去の調整後の学習時間よりも長いか否かの判定でよい。学習時間の長さは、データ数（訓練データの量）の大きさに依存するが、時間判定で比較される少なくとも一方の学習時間が、今回のデータ数と過去のデータ数との差を基に調整されるので、学習遅延が生じているかどうかの判定の正確性の向上が期待できる。 In the learning control process, the learning control unit 11 may adjust at least one of the times compared in the time determination in step 505 based on the difference between the amount of training data used in the current learning and the amount of training data used in the previous learning. For example, the learning control unit 11 may adjust the time required for previous learning of the model (e.g., the statistical value of time 405) and/or the time estimated in step 504 based on the difference between the amount of training data used in the current learning (the number of pieces of data recorded in step 302 and acquired by the learning control unit 11) and the amount of training data used in the previous learning (e.g., the statistical value of data number 406). The time determination in step 505 may be a determination of whether the adjusted estimated time is longer than the previous learning time, whether the time estimated in step 504 is longer than the previous adjusted learning time, or whether the adjusted estimated time is longer than the previous adjusted learning time. The length of the learning time depends on the amount of data (amount of training data), but at least one of the learning times compared in the time judgment is adjusted based on the difference between the amount of data this time and the amount of data in the past, which is expected to improve the accuracy of determining whether a learning delay is occurring.
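One possible form of this adjustment is to scale the past learning time by the ratio of the current data count to the past data count. The text only says the adjustment is based on the difference in data amounts, so the linear scaling below, the function name, and the parameter names are all assumptions made for illustration.

```python
def scale_past_time(past_time_seconds: float,
                    past_data_count: int,
                    current_data_count: int) -> float:
    """Adjust a past learning time to the current amount of training data.

    Assumption: learning time grows linearly with the number of data elements.
    """
    if past_data_count <= 0:
        raise ValueError("past_data_count must be positive")
    return past_time_seconds * (current_data_count / past_data_count)
```

For example, a past run of 1000 seconds on 10,000 data elements is adjusted to 1500 seconds for 15,000 elements, so an estimate of 1400 seconds would no longer be judged as delayed.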

図９は、モデル情報画面の一例を示す図である。 Figure 9 shows an example of the model information screen.

情報出力部１３が、モデル情報（モデルの学習状態に関する情報でありステップ５０５の時間判定の結果に基づく情報を含んだ情報）を表示する（例えば、ユーザ装置３に、モデル情報の表示用情報を送信する）。モデル情報画面７００は、モデル情報の表示画面であり、モデル情報は、モデルの学習状態を表す。このモデル情報画面７００により、ユーザは、モデルの学習状態を把握することができる。 The information output unit 13 displays the model information (information about the learning status of the model, including information based on the result of the time determination in step 505) (for example, by sending display information for the model information to the user device 3). The model information screen 700 is a display screen for the model information, and the model information represents the learning status of the model. This model information screen 700 allows the user to understand the learning status of the model.

具体的には、例えば、情報出力部１３が、学習リソーステーブル４１０と学習履歴テーブル４００から取得した情報に基づき、モデル情報を生成し表示する。モデル情報は、一つ以上のモデル（例えばユーザから指定された一つ以上のモデル）の各々について、モデル名７０１、学習状態７０２、終了予定日時７０３、実行状態７０４、及び実行制御７０５といった情報を有する。 Specifically, for example, the information output unit 13 generates and displays model information based on information obtained from the learning resource table 410 and the learning history table 400. The model information includes information such as model name 701, learning status 702, scheduled completion date and time 703, execution status 704, and execution control 705 for each of one or more models (e.g., one or more models specified by the user).

モデル名７０１は、モデル名を表す。 Model name 701 represents the model name.

学習状態７０２は、モデルの学習状態（例えば、学習中、学習完了）を表す。学習履歴テーブル４００にモデル名が記録されており学習リソーステーブル４１０にモデル名が記録されていないモデルの学習状態７０２は、“学習完了”であり、学習履歴テーブル４００にモデル名が記録されているか否かに関わらず学習リソーステーブル４１０にモデル名が記録されているモデルの学習状態７０２は“学習中”である。 The learning status 702 indicates the learning status of the model (e.g., learning in progress, learning completed). The learning status 702 of a model whose model name is recorded in the learning history table 400 but whose model name is not recorded in the learning resource table 410 is "learning completed," and the learning status 702 of a model whose model name is recorded in the learning resource table 410 is "learning," regardless of whether the model name is recorded in the learning history table 400.
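The rule for the learning status 702 can be sketched as a lookup against the model names present in the two tables. Representing each table as a set of model names, and returning `None` for a model found in neither table, are assumptions for this example.

```python
from typing import Optional, Set


def learning_status(model_name: str,
                    history_models: Set[str],
                    in_progress_models: Set[str]) -> Optional[str]:
    """Derive the learning status 702 from table membership.

    history_models     : model names recorded in the learning history table 400
    in_progress_models : model names recorded in the learning resource table 410
    """
    if model_name in in_progress_models:
        return "learning"            # regardless of whether it is also in table 400
    if model_name in history_models:
        return "learning completed"
    return None                      # assumption: unknown model, no status
```

Checking the learning resource table first is what makes a model that is being retrained show as "learning" even though earlier runs are already in the history.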

The scheduled end date and time 703 represents the date and time at which the learning time estimated based on the learning resource table 410 (the learning time estimated in step 504) will elapse.

実行状態７０４は、学習中のモデルの学習時間が遅延しているかどうかを表す。具体的には、ステップ５０５の時間判定の結果が真の場合に実行状態７０４は“遅延”であり、ステップ５０５の時間判定の結果が偽の場合に実行状態７０４は“順調”で良い。 Execution status 704 indicates whether the learning time of the model being trained is delayed. Specifically, if the result of the time determination in step 505 is true, execution status 704 is "delayed," and if the result of the time determination in step 505 is false, execution status 704 can be "on track."

Execution control 705 indicates the change made when the learning control unit 11 increases or decreases the resource allocation amount. For example, if step 510 is performed and it increases the CPU allocation, execution control 705 may be "CPU scale up", and if step 507 is performed and it decreases the CPU allocation, execution control 705 may be "CPU scale down".

本発明は上記した実施形態に限定されるものではなく、様々な変形例が含まれる。例えば、上記した実施形態は本発明を分かりやすく説明するために詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、ある実施形態の構成の一部を他の実施形態の構成に置き換えることが可能であり、また、ある実施形態の構成に他の実施形態の構成を加えることも可能である。また、各実施形態の構成の一部について、他の構成の追加・削除・置換をすることが可能である。 The present invention is not limited to the above-described embodiments and includes various modifications. For example, the above-described embodiments have been described in detail to clearly explain the present invention, and are not necessarily limited to those including all of the described configurations. Furthermore, it is possible to replace part of the configuration of one embodiment with the configuration of another embodiment, and it is also possible to add the configuration of another embodiment to the configuration of one embodiment. Furthermore, it is possible to add, delete, or replace part of the configuration of each embodiment with other configurations.

また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現しても良い。また、上記の各構成、機能等は、プロセッサがそれぞれの機能を実現するプログラムを解釈し、実行することによりソフトウェアで実現しても良い。各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリや、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）等の記憶部、又は、ＩＣカード、ＳＤカード、ＤＶＤ、ＢＤ等の記録媒体に置くことができる。 Furthermore, some or all of the above-mentioned configurations, functions, processing units, processing means, etc. may be realized in hardware, for example by designing them as integrated circuits. Furthermore, the above-mentioned configurations, functions, etc. may be realized in software by a processor interpreting and executing programs that realize each function. Information such as programs, tables, and files that realize each function can be stored in a storage unit such as memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, DVD, or BD.

また、制御線や情報線は説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。実際には殆ど全ての構成が相互に接続されていると考えても良い。 Furthermore, the control and information lines shown are those deemed necessary for explanation, and do not necessarily represent all control and information lines in the product. In reality, it is safe to assume that almost all components are interconnected.

1: Model learning system; 10: Learning monitoring unit; 11: Learning control unit

Claims

1. A model learning system comprising:
a learning monitoring unit that, each time a machine learning model is trained, acquires, for each of a plurality of types of computational resources, a resource allocation amount, which is the amount of computational resources allocated to the learning of the machine learning model, and monitors, for each of the plurality of types of computational resources, a resource usage amount, which is the amount of computational resources used for the learning of the machine learning model; and
a learning control unit that estimates the time required for the current learning of the machine learning model, performs a time determination to determine whether the estimated time is longer than the time required for past learning of the machine learning model, and, if the result of the time determination is true, increases the resource allocation amount in the current learning for at least one of the plurality of types of computational resources.

2. The model learning system of claim 1, wherein
the learning monitoring unit monitors the model accuracy of the machine learning model each time the machine learning model is trained, and
the learning control unit:
performs a convergence determination to determine whether the model accuracy has converged in the current learning;
terminates the current learning if the result of the convergence determination is true; and
estimates the time required for the current learning if the result of the convergence determination is false.

3. The model learning system of claim 2, wherein
the learning monitoring unit, each time the machine learning model is trained:
determines the start of the first epoch of the learning to be the start of the learning;
determines that the machine learning model is in the middle of learning if, after the start of the learning, the number of epochs specified for the learning has not been completed; and
determines completion of the number of epochs specified for the learning, after the start of the learning, to be the end of the learning of the machine learning model, and
the learning control unit performs the convergence determination based on the model accuracy for each epoch in the current learning, and estimates the time required for the current learning based on the latest epoch count and the time required up to the latest epoch.

4. The model learning system of claim 1, wherein
the learning control unit:
performs, if the result of the time determination is false, an accuracy determination to determine whether the model accuracy in the current learning is higher than the model accuracy in past learning of the machine learning model; and
if the result of the accuracy determination is true, determines, based on the costs of the plurality of types of computational resources, a computational resource type whose resource allocation amount is to be reduced, and reduces the resource allocation amount in the current learning for computational resources of the determined type.

5. The model learning system of claim 4, wherein, if the result of the accuracy determination is true, the learning control unit identifies which type of computational resource, among the plurality of types of computational resources, yields the greatest cost reduction when its resource allocation amount is maintained, and reduces the resource allocation amounts of the types of computational resources other than the identified type.

6. The model learning system of claim 1, wherein the learning control unit adjusts at least one of the times compared in the time determination based on a difference between the amount of training data used in the current learning and the amount of training data used in the past learning.

7. The model learning system of claim 1, further comprising an information output unit that displays model information, which is information regarding the learning state of the machine learning model and includes information based on the result of the time determination.

8. The model learning system of claim 1, wherein, if the result of the time determination is true, the learning control unit identifies a bottleneck computational resource type based on the resource usage amounts of the plurality of types of computational resources in the current learning, and increases the resource allocation amount in the current learning for computational resources of the identified type.

9. A model learning method, wherein:
each time a machine learning model is trained, a computer acquires, for each of a plurality of types of computational resources, a resource allocation amount, which is the amount of computational resources allocated to the learning of the machine learning model, and monitors, for each of the plurality of types of computational resources, a resource usage amount, which is the amount of computational resources used for the learning of the machine learning model;
the computer estimates the time required for the current learning of the machine learning model;
the computer performs a time determination to determine whether the estimated time is longer than the time required for past learning of the machine learning model; and
if the result of the time determination is true, the computer increases the resource allocation amount in the current learning for at least one of the plurality of types of computational resources.

10. A computer program that causes a computer to execute:
each time a machine learning model is trained, acquiring, for each of a plurality of types of computational resources, a resource allocation amount, which is the amount of computational resources allocated to the learning of the machine learning model, and monitoring, for each of the plurality of types of computational resources, a resource usage amount, which is the amount of computational resources used for the learning of the machine learning model;
estimating the time required for the current learning of the machine learning model;
performing a time determination to determine whether the estimated time is longer than the time required for past learning of the machine learning model; and
if the result of the time determination is true, increasing the resource allocation amount in the current learning for at least one of the plurality of types of computational resources.