JP7655446B2

JP7655446B2 - Analysis device and program

Info

Publication number: JP7655446B2
Application number: JP2024507219A
Authority: JP
Inventors: 悠香橋本; ショウオウ
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2025-04-02
Anticipated expiration: 2042-03-14
Also published as: JPWO2023175681A1; WO2023175681A1

Description

The present disclosure relates to an analysis device and a program.

Neural networks are widely applied to, for example, regression problems, classification problems, data generation, and data distribution estimation. In recent years, research has also been conducted on continuous formulations of neural networks (for example, Non-Patent Document 1 and Non-Patent Document 2), and these have been applied to various applications and theoretical analyses. In Neural ODE (Ordinary Differential Equations), described in Non-Patent Document 1, the transformation from input to output is regarded as a continuous dynamical system, and solution methods for continuous differential equations are used. This approach can be regarded as continuization in the layer direction (horizontal direction). In the integral representation theory described in Non-Patent Document 2, the transformation in each layer is made continuous and expressed using an integral operator, so that neural networks can be analyzed theoretically using harmonic analysis. This approach can be regarded as continuization in the unit direction (vertical direction). Such continuization makes operations on continuous functions, such as differentiation and integration, available, which has led to improved performance and more refined theoretical analysis.

Meanwhile, research has been conducted on applying mathematical concepts such as C*-algebras and Hilbert C*-modules to data analysis (for example, Non-Patent Document 3), and basic properties necessary for data analysis have been shown.

Non-Patent Document 1: Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
Non-Patent Document 2: Sonoda, S. and Murata, N. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233-268, 2017.
Non-Patent Document 3: Hashimoto, Y., Ishikawa, I., Ikeda, M., Komura, F., Katsura, T., and Kawahara, Y. Reproducing kernel Hilbert C*-module and kernel mean embeddings. Journal of Machine Learning Research, 22(267):1-56, 2021.

However, existing research on continuous formulations of neural networks has dealt with a single model. Meanwhile, methods that train multiple models simultaneously or learn features common to multiple models, such as ensemble learning and meta-learning, have attracted attention in recent years. A method that targets multiple models and efficiently performs learning related to those models is therefore considered necessary.

The present disclosure has been made in view of the above points and provides a technique for efficiently performing learning related to multiple models.

An analysis device according to one aspect of the present disclosure includes: an input unit configured to input training data for a neural network model f, where f is a neural network model having a parameter θ composed of a plurality of elements that take values in the space A of all continuous functions on a compact space Z; and a parameter optimization unit configured to learn, using the training data, the parameter θ that minimizes a function L_reg that includes a predetermined loss function L.

A technique for efficiently performing learning related to multiple models is provided.

FIG. 1 is a diagram illustrating an example of a hardware configuration of an analysis device according to the present embodiment.
FIG. 2 is a diagram illustrating an example of a functional configuration of the analysis device according to the present embodiment.
FIG. 3 is a flowchart illustrating an example of a parameter optimization process according to the present embodiment.
FIG. 4 is a diagram showing an example of samples.
FIG. 5 is a diagram (part 1) showing an example of density function estimation results.
FIG. 6 is a diagram (part 2) showing an example of density function estimation results.

An embodiment of the present invention is described below. The following describes an analysis device 10 that can efficiently perform learning related to multiple models by connecting those models continuously. To represent the multiple models continuously, mathematical concepts such as C*-algebras and Hilbert C*-modules are used.

With the analysis device 10 according to the present embodiment, learning related to multiple models (that is, learning in which the models interact with one another) can be performed efficiently, and as a result, data with continuity can be analyzed efficiently.

<Theoretical Framework>
The theoretical framework of the present embodiment is described below.

<<Setting the model and its parameters>>
Let A be the space of all continuous functions on a compact space Z. Note that any commutative C*-algebra is isomorphic to such a space A. Consider a neural network model with K hidden layers whose parameters (weights) take values in A.

Let the natural numbers N_0, ..., N_{K+1} denote the dimensions of the layers of the neural network model, where N_0 is the dimension of the input layer and N_{K+1} is the dimension of the output layer.

For i = 1, ..., K+1, let W_i be an N_{i-1} × N_i matrix whose entries take values in A, and let σ_i be a transformation from A^{N_i} to A^{N_i}. σ_i corresponds to an activation function; making it a nonlinear transformation introduces nonlinearity into the transformation performed by the neural network as a whole.

Now consider a neural network, shown in formula (1) below, that takes a vector in A^{N_0} as input and outputs a vector in A^{N_{K+1}}.
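The following is a minimal sketch of how such a network can be evaluated, not the formula of this disclosure itself. It assumes the usual feed-forward composition of the matrices W_i and the activations σ_i, a row-vector convention so that an N_{i-1} × N_i matrix maps N_{i-1} entries to N_i entries, tanh on the hidden layers, and the identity on the output layer. The names `forward`, `W1`, `W2`, and `x_const` and the toy choice Z = [0, 1] are hypothetical. Each element of A is modeled as an ordinary function of z, so fixing z picks out one ordinary neural network from the continuously connected family.

```python
import math
from typing import Callable, List

# An element of A is modeled as a plain Python function on Z (here Z = [0, 1]).
Fn = Callable[[float], float]
Matrix = List[List[Fn]]  # Matrix[j][k]: entry in row j, column k


def forward(weights: List[Matrix], x: List[Fn], z: float) -> List[float]:
    """Evaluate the output of the network for input x at one point z of Z."""
    h = [x_j(z) for x_j in x]  # input vector evaluated at z
    for i, W in enumerate(weights):
        n_out = len(W[0])
        # Row-vector convention (assumption): W is N_{i-1} x N_i.
        h = [sum(h[j] * W[j][k](z) for j in range(len(W))) for k in range(n_out)]
        if i < len(weights) - 1:  # tanh on hidden layers only (assumption)
            h = [math.tanh(v) for v in h]
    return h


# Hypothetical toy model: N_0 = 1, one hidden layer with N_1 = 2, N_2 = 1.
W1: Matrix = [[lambda z: 1.0, lambda z: z]]
W2: Matrix = [[lambda z: 1.0], [lambda z: z]]
x_const: List[Fn] = [lambda z: 0.5]  # constant-function input

print(forward([W1, W2], x_const, 0.0))  # the model selected by z = 0
print(forward([W1, W2], x_const, 1.0))  # the model selected by z = 1
```

Because the weights vary continuously in z, the outputs at nearby values of z are close to each other, which is the sense in which the models are connected.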

Let θ ∈ A^N be the vector obtained by arranging all entries of W_1, ..., W_{K+1}, where N = N_0 N_1 + ... + N_K N_{K+1}. In the following, f is written as f^θ when its dependence on the parameter θ is made explicit. A^N is a space called a Hilbert C*-module, a concept that generalizes Hilbert spaces. In a Hilbert C*-module, an inner product taking values in A is defined. For [a_1, ..., a_N] ∈ A^N and [b_1, ..., b_N] ∈ A^N, the inner product on A^N is defined by the following expression.
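As an illustration of an A-valued inner product, the sketch below assumes the standard inner product of the Hilbert C*-module A^N, in which the value at each z ∈ Z is the sum over i of conj(a_i(z)) · b_i(z). This specific form is an assumption made for the example, and the names `inner`, `a`, and `b` are hypothetical. The result is itself a function on Z, that is, an element of A, rather than a single number.

```python
from typing import Callable, Sequence

# An element of A, modeled as a function on Z (values may be real or complex).
Fn = Callable[[float], complex]


def inner(a: Sequence[Fn], b: Sequence[Fn]) -> Fn:
    """Return <a, b> as a function on Z (assumed form: sum_i conj(a_i) * b_i)."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of components")

    def result(z: float) -> complex:
        return sum(complex(a_i(z)).conjugate() * b_i(z) for a_i, b_i in zip(a, b))

    return result


# Hypothetical example with N = 2.
a = [lambda z: z, lambda z: 1.0]
b = [lambda z: 2.0, lambda z: z * z]
ip = inner(a, b)
print(ip(3.0))  # value of the A-valued inner product at z = 3
```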

With σ_i fixed and a loss function L that depends on the parameter θ specified, the optimal f is obtained by minimizing L when training data is given. The loss function L is a mapping that sends the parameter θ to an element of A. Considering A-valued (function-valued) parameters corresponds to connecting multiple models continuously. The training data includes at least samples representing inputs to the neural network f. When the optimal f is obtained by supervised learning, the training data also includes a teacher signal for each sample (that is, the correct output of the neural network f when that sample is input). The teacher signal is also called a label in tasks such as classification.

<<Setting the loss function>>
So that learning proceeds with interaction among the continuously connected models, L_reg, obtained by adding a regularization term to an ordinary loss function L (for example, mean squared error or cross-entropy error), is minimized. When Z is a finite measure space, L_reg is defined as shown in formula (2) below.

Here, λ ∈ A is a hyperparameter representing the weight of the regularization term.

<<Minimizing the loss function>>
Consider minimizing the loss function L using a gradient method. For a mapping L: A^N → A and a parameter θ ∈ A^N, the A-valued gradient ∇_θ L of L at θ is defined as follows.

"If there exists ξ ∈ A^N such that, for every δ ∈ A^N and every z ∈ Z,

is satisfied, then we define ∇_θ L = ξ."
For example, when L(θ)(z) = F(θ(z), z) for some F: R^N × Z → R (where R denotes the set of real numbers), if the mapping that sends each z ∈ Z to the ordinary gradient ∇_{θ(z)} F(·, z) is continuous, then this mapping is the A-valued gradient ∇_θ L.

The gradient of L_reg(θ) defined in formula (2) above can then be calculated as shown in formula (3) below.

Here, λ' is the vector in A^N whose elements are all λ.

Using ∇_θ L_reg, a gradient method scheme is constructed as shown in formula (4) below.

Here, θ_0 ∈ A^N, and η_t ∈ A is a hyperparameter representing the learning rate. Formula (4) above means that the left-hand side is updated with the right-hand side.

In practice, the elements of A (functions) themselves cannot be handled on a computer, so the calculation is performed, for example, as follows. Take a finite-dimensional subspace V of A^N and a mapping P: A^N → V, and replace ∇_θ L with P(∇_θ L). The hyperparameters λ and η_t are chosen as constant functions. P is determined, for example, by regression at suitable points z_1, ..., z_m ∈ Z. Furthermore, since Z is compact, it can be shown that if the sequence

is Lipschitz continuous uniformly in t, then pointwise convergence of

implies uniform convergence. For this reason, V is taken to be, for example, a finite-dimensional subspace that has Lipschitz continuous functions as its basis.

<Example of hardware configuration of analysis device 10>
FIG. 1 shows an example of the hardware configuration of the analysis device 10 according to the present embodiment. As shown in FIG. 1, the analysis device 10 includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. These hardware components are communicably connected to one another via a bus 109.

The input device 101 is, for example, a keyboard, a mouse, a touch panel, or physical buttons. The display device 102 is, for example, a display or a display panel. The analysis device 10 may lack at least one of the input device 101 and the display device 102.

The external I/F 103 is an interface with an external device such as a recording medium 103a. The analysis device 10 can read from and write to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a flexible disk, a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

The communication I/F 104 is an interface for connecting the analysis device 10 to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily holds programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that retains programs and data even when the power is turned off. The auxiliary storage device 107 is a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The processor 108 is an arithmetic device such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

With the hardware configuration shown in FIG. 1, the analysis device 10 according to the present embodiment can implement the parameter optimization process described later. The hardware configuration shown in FIG. 1 is merely an example, and the hardware configuration of the analysis device 10 is not limited to it. For example, the analysis device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, and may have various hardware other than that illustrated.

<Example of functional configuration of analysis device 10>
FIG. 2 shows an example of the functional configuration of the analysis device 10 according to the present embodiment. As shown in FIG. 2, the analysis device 10 includes a data input unit 201, a parameter optimization unit 202, an output calculation unit 203, and a data output unit 204. Each of these units is implemented, for example, by processing that one or more programs installed in the analysis device 10 cause the processor 108 to execute. The analysis device 10 also includes a storage unit 205. The storage unit 205 is implemented by, for example, the auxiliary storage device 107. The storage unit 205 may instead be implemented by, for example, a storage device connected to the analysis device 10 via a communication network.

The data input unit 201 inputs the training data given to the analysis device 10. The parameter optimization unit 202 updates the parameter θ according to formulas (3) and (4), using the training data input by the data input unit 201. When the parameter optimization unit 202 calculates the gradient ∇_θ L_reg according to formula (3), the output calculation unit 203 calculates the output of f for the samples included in the training data. The data output unit 204 outputs the parameter θ updated by the parameter optimization unit 202. The storage unit 205 stores various kinds of information (for example, the training data, the parameter θ, and the hyperparameters λ and η_t).

<Flow of the parameter optimization process>
The flow of the parameter optimization process according to the present embodiment is described with reference to FIG. 3. In the following, M is a sufficiently large natural number and ε is a sufficiently small positive real number.

The data input unit 201 inputs the given training data (in general, a large amount of training data) (step S101).

The parameter optimization unit 202 initializes t = 0 and initializes the parameter θ_t = θ_0 (step S102). The parameter optimization unit 202 may initialize θ_0 with, for example, random numbers.

The parameter optimization unit 202 determines whether t < M or ||L(θ_t)|| > ε holds (step S103). If it determines that t < M or ||L(θ_t)|| > ε holds, the process proceeds to step S104; otherwise (that is, if t ≥ M and ||L(θ_t)|| ≤ ε), the process proceeds to step S107.

The parameter optimization unit 202 calculates the gradient ∇_θ L_reg according to formula (3) (step S104). In doing so, it replaces ∇_θ L in formula (3) with P(∇_θ L) before evaluating formula (3). When evaluating formula (3), the parameter optimization unit 202 has the output calculation unit 203 calculate the output of f for the samples included in the training data, and thereby calculates P(∇_θ L).

Next, the parameter optimization unit 202 updates the parameter to θ_{t+1} according to formula (4) (step S105).

Next, the parameter optimization unit 202 updates t ← t + 1 (step S106).

The data output unit 204 outputs the parameter θ_t (step S107). The data output unit 204 may output θ_t to any predetermined output destination (for example, the storage unit 205, another program, or another device connected via a communication network). This yields the optimal parameter θ = θ_t and, as a result, the optimal f = f^θ having this parameter.
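The control flow of steps S102 to S107 can be sketched as follows. This is an illustrative sketch under several assumptions: a single-component θ is stored by its values at a few grid points of Z, the norm of L(θ_t) is taken to be the supremum norm over those points, the learning rate is a constant, and the update is assumed to be the usual gradient step θ ← θ − η · gradient. The callable `grad` is a stand-in for the projected gradient of formula (3), and `max_steps` is a safety cap added here that is not part of the described process. All function and variable names are hypothetical.

```python
from typing import Callable, List

Values = List[float]  # a function on Z stored by its values at fixed grid points


def sup_norm(values: Values) -> float:
    """Supremum norm over the grid points (assumed norm on A)."""
    return max(abs(v) for v in values)


def optimize(theta0: Values,
             loss: Callable[[Values], Values],
             grad: Callable[[Values], Values],
             eta: float, M: int, eps: float,
             max_steps: int = 100_000) -> Values:
    t = 0
    theta = list(theta0)                                     # S102: initialize
    while t < M or sup_norm(loss(theta)) > eps:              # S103: continue?
        if t >= max_steps:                                   # safety cap (not in the source)
            break
        g = grad(theta)                                      # S104: gradient
        theta = [th - eta * gi for th, gi in zip(theta, g)]  # S105: update
        t += 1                                               # S106: t <- t + 1
    return theta                                             # S107: output


# Hypothetical toy problem: L(theta)(z) = (theta(z) - z)^2 on three grid points.
grid = [0.0, 0.5, 1.0]
toy_loss = lambda th: [(a - z) ** 2 for a, z in zip(th, grid)]
toy_grad = lambda th: [2.0 * (a - z) for a, z in zip(th, grid)]
theta_opt = optimize([0.0, 0.0, 0.0], toy_loss, toy_grad, eta=0.25, M=50, eps=1e-12)
print(theta_opt)
```

With the continuation condition written as "t < M or norm > ε", the loop runs for at least M iterations and keeps running until the loss norm also falls to ε or below, which is why the sketch adds a cap for losses that never reach ε.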

<Application Examples>
Application examples of the analysis device 10 according to the present embodiment are described below.

<<Application Example 1: Density estimation using normalizing flow>>
Let Z be a compact probability space and D be a probability measure on Z. Let Ω be a probability space, and consider a random variable X on Z × Ω that takes values in R^{N_0} (where R denotes the set of real numbers). For every ω ∈ Ω, X(·, ω) is assumed to be continuous.

In this case, if the random variable X(z, ·) follows a normal distribution for every z ∈ Z, then f = f^θ defined in formula (1) above is learned so that the random variable

follows the distribution of the samples given as training data. Hereinafter, the density function of the distribution of the random variable shown in Math. 9 above is denoted by p_data^{θ,z}.

Let x_1, ..., x_n ∈ R^{N_0} be the samples given as training data. Based on the idea of likelihood maximization, the loss function L is set to

For this L, the parameter θ is optimized by the parameter optimization process shown in FIG. 3. The input and output of f^θ are a vector in A^{N_0} and a vector in A^{N_{K+1}}, respectively; the input is taken to be the constant function with value x_i. As for the output, p_data^{θ,z} should not depend on z if the estimation is correct. Therefore, for example, the function on Z obtained as the output is integrated over Z with respect to D to obtain the density function as the final output.
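The final integration step can be approximated numerically. The sketch below assumes that D is approximated by a discrete measure with weights on finitely many points of Z, and that `p(x, z)` is a hypothetical callable returning the z-dependent density estimate at x. Neither the discretization nor the names come from the source.

```python
from typing import Callable, Sequence


def integrate_over_z(p: Callable[[float, float], float],
                     x: float,
                     points: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Approximate the integral of z -> p(x, z) over Z with respect to D."""
    if len(points) != len(weights):
        raise ValueError("points and weights must have the same length")
    return sum(w * p(x, z) for z, w in zip(points, weights))


# Hypothetical example: uniform weights on three points of Z.
pts = [0.0, 1.0, 2.0]
wts = [1.0 / 3.0] * 3
toy_p = lambda x, z: x + z  # stand-in for a z-dependent density estimate
print(integrate_over_z(toy_p, 1.0, pts, wts))
```

If the estimate is exactly independent of z and the weights sum to one, this average returns the common value unchanged.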

<<Application Example 2: Few-shot learning>>
Few-shot learning is a method for training a model from a small number of samples. Meta-learning is one way to improve the accuracy of few-shot learning. In meta-learning, a meta-model is trained in advance using training data for multiple tasks, thereby learning features common to the tasks. Using these features, a model for a new task can be trained efficiently, and accuracy improves, even when the number of samples for the new task is small. For example, in the method proposed in Reference 1, a mapping Z that maps a task to a representation in a low-dimensional space and a mapping Θ that converts the representation in the low-dimensional space into model parameters are learned as the meta-model. Combining such a meta-model with the parameter optimization process shown in FIG. 3 can further improve the accuracy of few-shot learning.

Consider a classification problem as an example. Let the samples be x_1, ..., x_n ∈ R^{N_0}, and let the label for sample x_i be y_i ∈ R^{N_{K+1}}. Furthermore, let X_i be the constant function with value x_i and Y_i be the constant function with value y_i. Then f = f^θ defined in formula (1) above is trained so that, for each sample x_i ∈ R^{N_0} and its label y_i ∈ R^{N_{K+1}}, it maps the constant function X_i to the constant function Y_i.

Z and Θ of the meta-model are trained in advance, and a representation z_new in the low-dimensional space is obtained for a new task. A neighborhood of z_new is taken as Z, and the initial value θ_0 of the model parameters for the new task is set to the restriction of Θ to Z. For the loss function L, the cross-entropy error

is used, where ⟨·,·⟩ is the A-valued inner product on A^N.

<Evaluation Results>
The results of evaluating the analysis device 10 according to the present embodiment are described below.

Suppose that two sets of 100 samples each are given, as shown in the left and right panels of FIG. 4. The density functions of the distributions followed by these two sets of samples were estimated by the method described in Application Example 1.

We set Z = [-4, 4] × [-4, 4], defined

and took the finite-dimensional subspace V of A^N to be Span{v_1, ..., v_9}^N. P was kernel ridge regression using v_1, ..., v_9. That is, letting G be the matrix whose (i, j) entry is

the i-th component of P(θ) for θ = [θ_1, ..., θ_N]^T is calculated as [v_1, ..., v_9](G + μI)^{-1}[θ_i(z_1), ..., θ_i(z_9)]^T, where μ ≥ 0 is a hyperparameter and I is the identity matrix.
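The projection by kernel ridge regression can be sketched as follows. The expression [v_1, ..., v_9](G + μI)^{-1}[θ_i(z_1), ..., θ_i(z_9)]^T is taken from the text, but the basis functions and the entries of G used here are assumptions: v_j is assumed to be a Gaussian kernel centered at z_j, G is assumed to have entries v_j(z_i), and the nine points are a hypothetical 3 × 3 grid. The helper names `gaussian`, `solve`, and `project` are also hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian(z: Point, c: Point) -> float:
    """Assumed basis function v_c(z) = exp(-|z - c|^2)."""
    return math.exp(-((z[0] - c[0]) ** 2 + (z[1] - c[1]) ** 2))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def project(theta_i: Callable[[Point], float],
            centers: Sequence[Point], mu: float) -> Callable[[Point], float]:
    """Return the i-th component of P(theta) as a function on Z."""
    m = len(centers)
    G = [[gaussian(centers[i], centers[j]) for j in range(m)] for i in range(m)]
    for i in range(m):
        G[i][i] += mu                                   # G + mu * I
    coeffs = solve(G, [theta_i(z) for z in centers])    # (G + mu I)^{-1} theta_i(z_k)
    return lambda z: sum(c * gaussian(z, zc) for c, zc in zip(coeffs, centers))


# Hypothetical 3 x 3 grid of points inside Z = [-4, 4] x [-4, 4].
centers = [(x, y) for x in (-2.0, 0.0, 2.0) for y in (-2.0, 0.0, 2.0)]
theta_fn = lambda z: z[0] + 2.0 * z[1]
projected = project(theta_fn, centers, mu=0.0)
print(projected((0.0, 2.0)), theta_fn((0.0, 2.0)))
```

With μ = 0 the projected function reproduces θ_i at the regression points, while a positive μ trades exact reproduction for smoother coefficients.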

The normal distribution followed by the random variable X(z, ·) was the normal distribution with mean z and standard deviation 1.

Given the samples shown in the left panel of FIG. 4, the left panel of FIG. 5 shows the density function estimated by the method described in Application Example 1, and the right panel of FIG. 5 shows the result of separately training nine models, each using a normal distribution with mean z_i and standard deviation 1 for i = 1, ..., 9, and averaging their outputs. Similarly, given the samples shown in the right panel of FIG. 4, the left panel of FIG. 6 shows the density function estimated by the method described in Application Example 1, and the right panel of FIG. 6 shows the result of separately training nine models, each using a normal distribution with mean z_i and standard deviation 1 for i = 1, ..., 9, and averaging their outputs. The method used to obtain the results in the right panels of FIGS. 5 and 6 corresponds to the method described in Application Example 1 with both λ in formula (2) and the kernel ridge regression hyperparameter μ set to zero.

Comparing the left and right panels of FIGS. 5 and 6 shows that the density function is estimated more smoothly in the left panels than in the right panels. This is because the analysis device 10 according to the present embodiment connects multiple models continuously and exploits their continuity as functions.

The present invention is not limited to the specifically disclosed embodiment above, and various modifications, changes, and combinations with known techniques are possible without departing from the scope of the claims.

[References]
Reference 1: Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

10 Analysis device
101 Input device
102 Display device
103 External I/F
103a Recording medium
104 Communication I/F
105 RAM
106 ROM
107 Auxiliary storage device
108 Processor
109 Bus
201 Data input unit
202 Parameter optimization unit
203 Output calculation unit
204 Data output unit
205 Storage unit

Claims

An analysis device comprising:
an input unit configured to input training data for a neural network model f having a parameter θ composed of a plurality of elements that take values in a space A of all continuous functions on a compact space Z; and
a parameter optimization unit configured to learn, using the training data, the parameter θ that minimizes a function L_reg that includes a predetermined loss function L.

The analysis device according to claim 1, wherein the function L_reg is expressed by a mapping that maps the parameter θ to an element of A, and includes a regularization term that is expressed by an integral of the loss function L over Z.

The analysis device according to claim 1, wherein the parameter optimization unit is configured to:
calculate a gradient ∇_θ L_reg of the function L_reg with respect to the parameter θ by calculating P(∇_θ L) in place of the gradient ∇_θ L of the loss function L with respect to the parameter θ, using a predetermined mapping P: A^N → V, where N is the number of elements of the parameter θ and V is a finite-dimensional subspace of A^N; and
learn the parameter θ by updating the parameter θ using the gradient ∇_θ L_reg.

The analysis device of claim 3, wherein the mapping P is a mapping determined by regression including kernel ridge regression.

A program that causes a computer to execute:
an input procedure of inputting training data for a neural network model f having a parameter θ composed of a plurality of elements that take values in a space A of all continuous functions on a compact space Z; and
a parameter optimization procedure of learning, using the training data, the parameter θ that minimizes a function L_reg that includes a predetermined loss function L.