JP7761155B2

JP7761155B2 - Causal model construction device, causal model construction method, and program

Info

Publication number: JP7761155B2
Application number: JP2024540129A
Authority: JP
Inventors: 洋一松尾; 敬志郎渡辺; 雄介中野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-08-09
Filing date: 2022-08-09
Publication date: 2025-10-28
Anticipated expiration: 2042-08-09
Also published as: WO2024034024A1; JPWO2024034024A1

Description

The present disclosure relates to a causal model construction device, a causal model construction method, and a program.

For ICT (Information and Communication Technology) providers, grasping the state of anomalies that occur within ICT systems and responding to them quickly is an important task. Accordingly, methods for detecting anomalies in ICT systems at an early stage and methods for estimating the location of an anomaly have long been studied.

As a method for estimating anomaly locations, it has been proposed to model, as a causal model using a Bayesian network, the relationship between an anomaly location and the changes that the anomaly causes in data within the ICT system (hereinafter also referred to as "observation data"), and then to estimate the anomaly location from the observation data at the time of the anomaly (Non-Patent Documents 1 to 3). These methods can be classified as either rule-based or data-driven.

The rule-based approach builds the model according to predefined rules. It relies mainly on the knowledge of experts, such as ICT system operators, to model the relationship between anomaly locations and changes in observation data. For example, Non-Patent Document 1 derives from expert knowledge the rule that whether a router is normal or abnormal affects only the observation data of its adjacent links, and constructs a causal model from this rule and the adjacency relationships in the network topology of the ICT system. Non-Patent Document 2 proposes abstract rules called templates to make causal models easier to construct.

The data-driven approach builds the model from data. It uses observation data from past anomalies to model the relationship between the anomaly location and the changes in the observation data at that time. For example, Non-Patent Document 3 models this relationship for a given failure using data from multiple past cases.

Existing methods estimate anomaly locations using syslog and traffic information from the ICT system. In recent years, however, it has become easy to obtain many other types of observation data, such as flow data, telemetry data, and sensor data related to communication equipment. Using these diverse types of observation data is expected to allow anomaly locations to be estimated at a finer granularity.

Non-Patent Document 1: Srikanth Kandula, Dina Katabi, and Jean-Philippe Vasseur. Shrink: A tool for failure diagnosis in IP networks. Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data, pages 173-178, 2005.

Non-Patent Document 2: He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, and Jennifer Yates. G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks. IEEE/ACM Transactions on Networking, 20(6):1734-1747, 2012.

Non-Patent Document 3: Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks. ACM SIGCOMM Computer Communication Review, 39(4):243-254, 2009.

However, constructing a causal model from diverse types of observation data raises the following problems.

Problem 1: Rule-based methods require expert knowledge in advance for modeling. Conventional techniques, however, have used only a very small number of observation data types, and an anomaly that occurs in an ICT system propagates and affects many kinds of observation data. It is therefore difficult to write rules one by one for every relationship between anomalies in the ICT system and the diverse types of observation data.

Problem 2: To input diverse types of observation data into a Bayesian network, each observation data value obtained from the ICT system must be determined to be either normal or abnormal (this is also called binarization). Conventional techniques handle very few types of observation data and target data that is easy to binarize (for example, alert information indicating whether an alert has occurred). When diverse types of observation data are used as input, the normal range of each has its own characteristics, and binarizing the data with these characteristics taken into account is difficult.

Problem 3: Data-driven methods require observation data from past anomalies, but anomalies generally do not occur frequently in ICT systems. Furthermore, as the types of observation data become more diverse, the number of patterns that the observation data can take in response to an anomaly increases. It is generally difficult to collect enough anomaly cases to cover this increase.

Problem 4: In recent years, virtualization of ICT systems has made frequent changes to their network topology increasingly common, and the observation data obtained from the ICT system changes frequently as a result. For this reason, rule-based methods have difficulty writing rules one by one for the relationships between anomalies and observation data, and data-driven methods have difficulty collecting sufficient anomaly cases.

The present disclosure has been made in view of the above, and aims to provide a technique that can construct a causal model from network topology information when constructing a causal model for diverse types of observation data.

A causal model construction device according to one aspect of the present disclosure includes: a collection unit configured to acquire network topology information representing the network topology of an ICT system in which an anomaly location is to be estimated; and a model construction unit configured to construct, using the network topology information, a causal model for estimating the anomaly location from observation data obtained when an anomaly occurs in the ICT system.

A technique is provided that can construct a causal model from network topology information when constructing a causal model for diverse types of observation data.

FIG. 1 is a diagram illustrating an example of the hardware configuration of an anomaly location estimation device according to the present embodiment.
FIG. 2 is a diagram illustrating an example of the functional configuration of the anomaly location estimation device according to the present embodiment.
FIG. 3 is a flowchart illustrating an example of the causal model construction process according to the present embodiment.
FIG. 4 is a flowchart illustrating an example of the anomaly location estimation process according to the present embodiment.

An embodiment of the present invention will now be described. The following embodiment describes an anomaly location estimation device 10 that constructs a causal model from network topology information of an ICT system and uses this causal model to estimate anomaly locations in the ICT system from diverse types of observation data. The anomaly location estimation device 10 according to this embodiment has a "model construction phase", in which a causal model is constructed from the network topology information of the ICT system, and an "anomaly location estimation phase", in which this causal model is used to estimate the anomaly location from observation data obtained when an anomaly occurs. The anomaly location estimation device 10 in the model construction phase may be referred to as, for example, a "model construction device". Network topology information is information that represents the network topology of the ICT system. It is, for example, information expressing a graph structure in which the various devices that make up the ICT system (e.g., routers and servers) are nodes and the communication paths between nodes are links.

<Theoretical configuration>
First, the theoretical configuration of causal model construction in the model construction phase and of anomaly location estimation in the anomaly location estimation phase will be described.

Let i ∈ {1, ..., N} denote the devices that make up the ICT system targeted by causal model construction and anomaly location estimation, and let x_i ∈ {0, 1} denote the state of device i. Here, N is the number of devices, and x_i = 0 represents a normal state while x_i = 1 represents an abnormal state.

Likewise, let j ∈ {1, ..., M} denote the observation data, and let y_j ∈ {0, 1} denote the state of observation data j. Here, M is the number of observation data items, and y_j = 0 represents a normal state while y_j = 1 represents an abnormal state. Examples of observation data j include the various data that can be obtained from the devices that make up the ICT system (e.g., syslog, traffic information, flow data, telemetry data, and sensor data).

A representative node k ∈ {1, ..., N} is introduced for each device i, and the state of representative node k is denoted r_k ∈ {0, 1}. Here, r_k = 0 represents a normal state and r_k = 1 represents an abnormal state.

Note that x_i, y_j, and r_k need not be binary (0 or 1); they may instead take three or more values.

Each device i has exactly one representative node k. Representative node k is a node that represents the state of the observation data that can be obtained from its corresponding device i. The state r_k of representative node k is determined based on the contribution of the observation data to the anomaly (described later). In this embodiment, a causal model is constructed for r_k, which represents the state of the observation data obtainable from the device i corresponding to state x_i (that is, a causal model for the state r_k of representative node k).

In the following, the method for constructing the causal model and the method for estimating the anomaly location are described first, followed by the method for determining the contribution to the anomaly and the states of the representative nodes.

- Method for constructing a causal model and method for estimating anomaly locations
A causal model is constructed by specifying the prior probability P(X = x_1, ..., x_N | α) and the conditional probability P(R = r_1, ..., r_N | X, β, φ). The prior probability represents how likely each device is to enter an abnormal state, and is specified as follows (Equation 1):

Here, α is a hyperparameter representing how likely a device is to enter an abnormal state, and takes a value from 0 to 1 inclusive.

Next, the conditional probability is specified. The conditional probability represents the causal relationships between devices and representative nodes, together with their strength. When a device i has a causal relationship with a representative node k, the relationship is expressed by adding an edge e_{i,k} between x_i and r_k. The causal relationships between the devices X and the representative nodes R are defined as follows using the network topology information (Equation 2):

Here, neig(i) is the set of indices of the nodes adjacent to device i.
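The edge rule can be sketched in a few lines of Python. The equation that defines the edges does not appear in this text, so the rule used below (an edge e_{i,k} exists when k = i or k ∈ neig(i)) is an assumption, read from the embodiment's stated premise that an anomaly in a device affects the observation data of that device and of its adjacent devices. The function name `build_edges` and the dictionary layout of the topology are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_edges(neig: Dict[int, List[int]]) -> Set[Tuple[int, int]]:
    """Return the (i, k) pairs that link device state x_i to representative node r_k.

    Assumption (not confirmed by the text): device i influences its own
    representative node and those of its neighbors in the network topology.
    """
    edges: Set[Tuple[int, int]] = set()
    for i, adjacent in neig.items():
        edges.add((i, i))       # device i affects its own representative node
        for k in adjacent:
            edges.add((i, k))   # device i affects each adjacent device's node
    return edges


# Hypothetical three-device chain: 1 - 2 - 3
topology = {1: [2], 2: [1, 3], 3: [2]}
edges = build_edges(topology)
```

Because the edge set depends only on adjacency, it can be rebuilt whenever the topology changes without collecting new anomaly cases or writing new rules.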

Then, letting E be the set of all edges between the devices X and the representative nodes R, and letting φ_{i,k} be the parameter that indexes edge e_{i,k}, the following holds (Equation 3):

Using φ_{i,k}, the conditional probability is specified as follows (Equation 4):

Here, β is a hyperparameter representing the strength of the causal relationship, and takes a value from 0 to 1 inclusive. Also, δ(·) is a delta function that returns 1 when its input is true and 0 when it is false.

Finally, given the states of the representative nodes, the anomaly location is estimated by solving the following using the prior probability and the conditional probability (Equation 5):

The above expression can be solved by, for example, belief propagation (Reference 1). Hereinafter, in the text of this specification, the estimated anomaly location is denoted "^X".
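To make the estimation step concrete, the following Python sketch finds the most probable device states given the representative node states. It rests on several assumptions, because the equations themselves do not appear in this text: the prior is taken to be an independent Bernoulli(α) per device, the conditional probability is taken to be a noisy-OR with strength β over the edges, and the maximization is done by exhaustive enumeration rather than by belief propagation, which is only feasible for a handful of devices. The function names and the example topology are hypothetical.

```python
from itertools import product
from typing import Dict, Set, Tuple


def prior(x: Dict[int, int], alpha: float) -> float:
    """Assumed prior: each device is abnormal independently with probability alpha."""
    p = 1.0
    for state in x.values():
        p *= alpha if state == 1 else 1.0 - alpha
    return p


def likelihood(r: Dict[int, int], x: Dict[int, int],
               edges: Set[Tuple[int, int]], beta: float) -> float:
    """Assumed noisy-OR form of P(R | X): every abnormal parent device turns
    representative node k abnormal independently with probability beta."""
    p = 1.0
    for k, r_k in r.items():
        p_no_parent_fires = 1.0
        for i, target in edges:
            if target == k and x[i] == 1:
                p_no_parent_fires *= 1.0 - beta
        p_abnormal = 1.0 - p_no_parent_fires
        p *= p_abnormal if r_k == 1 else 1.0 - p_abnormal
    return p


def estimate_anomaly_location(r: Dict[int, int], edges: Set[Tuple[int, int]],
                              alpha: float, beta: float) -> Dict[int, int]:
    """Return the device states maximizing prior * likelihood (brute force)."""
    devices = sorted(r)  # devices and representative nodes share indices 1..N
    best, best_score = {}, -1.0
    for states in product((0, 1), repeat=len(devices)):
        x = dict(zip(devices, states))
        score = prior(x, alpha) * likelihood(r, x, edges, beta)
        if score > best_score:
            best, best_score = x, score
    return best


# Hypothetical chain 1 - 2 - 3 with self edges and neighbor edges.
edges = {(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2), (3, 3)}
r = {1: 1, 2: 1, 3: 0}  # nodes 1 and 2 look abnormal, node 3 looks normal
x_hat = estimate_anomaly_location(r, edges, alpha=0.1, beta=0.9)
```

In this toy case device 1 is selected: it explains the abnormal nodes 1 and 2 while leaving node 3 normal, whereas a fault in device 2 would be expected to disturb node 3 as well.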

As described above, a causal model for diverse types of observation data can be constructed using only the network topology information.

- Method for determining the contribution to an anomaly and the states of the representative nodes
Next, the method for determining the contribution to an anomaly and the states of the representative nodes will be described. Let c be an M-dimensional vector whose element c_j represents the contribution of observation data j to the anomaly. The contribution to an anomaly is a value representing how strongly each observation data item is involved in the anomaly. Observation data obtained from devices close to the device that has become abnormal therefore has a higher contribution than observation data obtained from devices far from it. By using the contribution as input instead of the raw observation data values, it becomes possible to assume that an abnormal device affects only the observation data obtained from devices in its vicinity. Furthermore, because the contribution expresses how strongly each observation data item is involved in the anomaly, the diversity in the characteristics of the observation data need not be considered, and the binarization threshold can be set from the magnitude of the contribution alone.

The contribution to an anomaly can be calculated by applying, for example, the methods described in Reference 3 or Reference 4 to an AutoEncoder (Reference 2) trained on M-dimensional normal observation data.

For example, let the loss function used to train the AutoEncoder be L(v) = ||v - ^v||, where v is the input to the AutoEncoder and ^v is its output. The contribution c can then be calculated as c = argmin_γ { L(v + γ) + λ|γ| }, where λ is a preset constant. This amounts to finding a perturbation γ that lowers the loss (that is, lowers the degree of anomaly) when added to v. The elements that γ has to change in order to lower the degree of anomaly are regarded as the ones contributing to the anomaly. The second term in the expression for c is a penalty term that makes γ sparse.
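The minimization over γ can be illustrated with a small, self-contained Python sketch. Everything specific here is an assumption made for illustration: the trained AutoEncoder is replaced by a toy stand-in (orthogonal projection onto a single direction `U`), the squared reconstruction error is used instead of the plain norm so that the gradient is smooth, and the L1-penalized problem is solved by proximal gradient descent (ISTA). The methods of References 3 and 4 may differ from this.

```python
from typing import List

U = [0.5, 0.5, 0.5, 0.5]  # unit vector spanning the toy "normal" subspace


def reconstruct(v: List[float]) -> List[float]:
    """Toy stand-in for a trained AutoEncoder: orthogonal projection onto U."""
    t = sum(u * x for u, x in zip(U, v))
    return [t * u for u in U]


def soft_threshold(z: float, tau: float) -> float:
    """Proximal operator of tau * |z|; this is what drives entries to exactly zero."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def contribution(v: List[float], lam: float,
                 step: float = 0.5, iters: int = 200) -> List[float]:
    """Approximate argmin over gamma of ||(v+gamma) - reconstruct(v+gamma)||^2 + lam*|gamma|_1."""
    gamma = [0.0] * len(v)
    for _ in range(iters):
        w = [a + g for a, g in zip(v, gamma)]
        w_hat = reconstruct(w)
        # Gradient of the squared error with respect to gamma. This closed form
        # is valid only because reconstruct() is an orthogonal projection.
        grad = [2.0 * (a - b) for a, b in zip(w, w_hat)]
        gamma = [soft_threshold(g - step * d, step * lam)
                 for g, d in zip(gamma, grad)]
    return gamma


# Normal data has equal components; the fourth component is anomalous here.
c = contribution([1.0, 1.0, 1.0, 4.0], lam=0.3)
```

In this toy case only the fourth element of `c` ends up non-zero, which is the sparsity effect of the penalty term: the correction is concentrated on the observation data item that deviates from the normal pattern.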

Next, the method for determining the states of the representative nodes will be described. Let D_s be the set of the top s values when the absolute values of the elements of the contribution c are sorted in descending order. The value of s can be chosen freely; for example, it may be set to the integer part of 1% of the number M of observation data types. Let Ω_s be the set of element indices of c whose values are included in D_s. That is, Ω_s = {j | |c_j| ∈ D_s}.

Then, the state r_k of each representative node is determined by the following expression (Equation 6):

Here, f is a function that returns the set of indices of the observation data obtained from device k.

That is, f(k) is calculated for each k ∈ {1, ..., N}, and if the resulting index set contains at least one element of Ω_s, then r_k = 1; otherwise, r_k = 0.
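The rule just described (take the s largest contributions by absolute value, then mark a representative node abnormal if any of its device's observation data is among them) can be sketched as follows. The function name, the 0-based indexing of observation data, and the example values are hypothetical. When several contributions tie at the cutoff, this sketch keeps exactly s indices, whereas the set definition of Ω_s in the text would include all tied indices.

```python
from typing import Dict, List


def representative_states(c: List[float], f: Dict[int, List[int]],
                          s: int) -> Dict[int, int]:
    """Return r_k for every device k.

    c: contribution of each observation data item (0-based index j).
    f: maps device k to the indices of the observation data obtained from it.
    s: number of top contributions (by absolute value) to keep.
    """
    ranked = sorted(range(len(c)), key=lambda j: abs(c[j]), reverse=True)
    omega_s = set(ranked[:s])
    # r_k = 1 when at least one index of f(k) belongs to omega_s, else 0.
    return {k: 1 if omega_s.intersection(obs) else 0 for k, obs in f.items()}


# Hypothetical example: 3 devices, 2 observation data items each.
c = [0.0, 0.1, 0.0, -2.8, 0.2, 0.0]
f = {1: [0, 1], 2: [2, 3], 3: [4, 5]}
r = representative_states(c, f, s=2)
```

Here the two largest contributions belong to indices 3 and 4, so the representative nodes of devices 2 and 3 are set to 1 and that of device 1 to 0. Only the magnitude of the contribution enters the rule, so no per-data-type threshold is needed.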

<Example of hardware configuration of anomaly location estimation device 10>
FIG. 1 shows an example of the hardware configuration of the anomaly location estimation device 10 according to this embodiment. As shown in FIG. 1, the anomaly location estimation device 10 includes an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. These hardware components are communicably connected to one another via a bus 109.

The input device 101 is, for example, a keyboard, a mouse, a touch panel, or physical buttons. The display device 102 is, for example, a display or a display panel. The anomaly location estimation device 10 may lack at least one of the input device 101 and the display device 102.

The external I/F 103 is an interface to external devices such as a recording medium 103a. The anomaly location estimation device 10 can read from and write to the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a flexible disk, a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

The communication I/F 104 is an interface for connecting the anomaly location estimation device 10 to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily holds programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that retains programs and data even when the power is turned off. The auxiliary storage device 107 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a flash memory. The processor 108 is an arithmetic device such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

With the hardware configuration shown in FIG. 1, the anomaly location estimation device 10 according to this embodiment can carry out the causal model construction process and the anomaly location estimation process described later. The hardware configuration shown in FIG. 1 is only an example, and the hardware configuration of the anomaly location estimation device 10 is not limited to it. For example, the anomaly location estimation device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, may lack some of the illustrated hardware, or may have various hardware other than that illustrated.

<Example of functional configuration of anomaly location estimation device 10>
FIG. 2 shows an example of the functional configuration of the anomaly location estimation device 10 according to this embodiment. As shown in FIG. 2, the anomaly location estimation device 10 includes a collection unit 201, a causal model construction unit 202, a contribution calculation unit 203, an estimation unit 204, and a user interface unit 205. Each of these units is realized, for example, by processing that one or more programs installed in the anomaly location estimation device 10 cause the processor 108 or the like to execute. The anomaly location estimation device 10 also includes an ICT system data DB 301, a causal model DB 302, and a contribution DB 303. Each of these DBs is realized, for example, by the auxiliary storage device 107 or the like.

The collection unit 201 collects the network topology information and each observation data j from the ICT system. The network topology information and observation data collected by the collection unit 201 are stored in the ICT system data DB 301.

The causal model construction unit 202 constructs the causal model (that is, the prior probability P(X = x_1, ..., x_N | α) given by Equation 1 above and the conditional probability P(R = r_1, ..., r_N | X, β, φ) given by Equation 4 above) using the network topology information stored in the ICT system data DB 301. The causal model constructed by the causal model construction unit 202 is stored in the causal model DB 302.

When an anomaly location is to be estimated, the contribution calculation unit 203 calculates the contribution c to the anomaly using each observation data j stored in the ICT system data DB 301. The contribution c calculated by the contribution calculation unit 203 is stored in the contribution DB 303.

The estimation unit 204 estimates the anomaly location ^X using the causal model stored in the causal model DB 302 and the contribution c stored in the contribution DB 303. That is, the estimation unit 204 determines the state r_k of each representative node k from the contribution c, and then uses these states r_k to estimate the anomaly location ^X according to Equation 5 above.

The user interface unit 205 presents the anomaly location ^X estimated by the estimation unit 204 to a user (e.g., an operator of the ICT system).

<Causal model construction process>
The causal model construction process according to this embodiment will be described below with reference to FIG. 3. The causal model construction process is executed in the model construction phase. In the following, it is assumed that the network topology information collected by the collection unit 201 is stored in the ICT system data DB 301.

The causal model construction unit 202 reads the network topology information stored in the ICT system data DB 301 (step S101).

Next, the causal model construction unit 202 constructs the causal model (the prior probability P(X = x_1, ..., x_N | α) given by Equation 1 above and the conditional probability P(R = r_1, ..., r_N | X, β, φ) given by Equation 4 above) using the network topology information read in step S101 (step S102).

Then, the causal model construction unit 202 stores the causal model constructed in step S102 in the causal model DB 302 (step S103).

＜異常箇所推定処理＞
以下、本実施形態に係る異常箇所推定処理について、図４を参照しながら説明する。異常箇所推定処理は、異常箇所推定フェーズで実行される処理である。なお、以下では、ＩＣＴシステムで何等かの異常が発生しており、そのときの各観測データｊが収集部２０１によって収集されてＩＣＴシステムデータＤＢ３０１に格納されているものとする。 <Abnormal location estimation process>
The abnormality location estimation process according to this embodiment will be described below with reference to Fig. 4. The abnormality location estimation process is executed in the abnormality location estimation phase. In the following, it is assumed that some abnormality has occurred in the ICT system, and that each piece of observation data j at that time has been collected by the collection unit 201 and stored in the ICT system data DB 301.

寄与度計算部２０３は、当該異常発生時の各観測データｊを入力する（ステップＳ２０１）。 The contribution calculation unit 203 inputs each observation data j at the time the abnormality occurred (step S201).

次に、寄与度計算部２０３は、上記のステップＳ２０１で入力した各観測データｊを用いて、当該異常への寄与度ｃを計算する（ステップＳ２０２）。すなわち、寄与度計算部２０３は、例えば、ＡｕｔｏＥｎｃｏｄｅｒの学習に使用した損失関数をＬ（ｖ）＝||ｖ－＾ｖ||として、ｃ＝ａｒｇｍｉｎ_γＬ（ｖ＋γ）＋λ｜γ｜により寄与度ｃを計算する。 Next, the contribution calculation unit 203 calculates the contribution c to the anomaly using each piece of observation data j input in step S201 (step S202). That is, the contribution calculation unit 203 calculates the contribution c by c = argmin _γ L(v + γ) + λ|γ|, where L(v) = ∥v - ^v∥ is the loss function used in learning the AutoEncoder, for example.

次に、寄与度計算部２０３は、上記のステップＳ２０２で計算した寄与度ｃを寄与度ＤＢ３０３に格納する（ステップＳ２０３）。 Next, the contribution calculation unit 203 stores the contribution c calculated in step S202 above in the contribution DB 303 (step S203).

次に、推定部２０４は、因果モデルＤＢ３０２に格納されている因果モデルと、寄与度ＤＢ３０３に格納されている寄与度ｃとを用いて、異常箇所＾Ｘを推定する（ステップＳ２０４）。すなわち、推定部２０４は、上記の数６により寄与度ｃから代表ノードｋの状態ｒ_ｋを決定した上で、これら代表ノードｋの状態ｒ_ｋを用いて上記の数５により異常箇所＾Ｘを推定する。 Next, the estimation unit 204 estimates the abnormality location ^X using the causal model stored in the causal model DB 302 and the contribution c stored in the contribution DB 303 (step S204). That is, the estimation unit 204 determines the state r _k of the representative node k from the contribution c using the above equation 6, and then estimates the abnormality location ^X using the above equation 5 using the state r _k of the representative node k.

Finally, the user interface unit 205 outputs the abnormality location ^X estimated in step S204 to the display device 102, such as a display, and presents it to the user (step S205).

<Summary>
As described above, in the model construction phase, the abnormality location estimation device 10 according to this embodiment constructs a causal model (Bayesian network) for the states r_k of the representative nodes k using only network topology information, under the assumption that "when an abnormality occurs in a device, it affects the observation data of that device and of its adjacent devices." In the abnormality location estimation phase, the abnormality location estimation device 10 estimates the abnormality location with this causal model (Bayesian network), using the contribution c calculated from each piece of observation data j at the time the abnormality occurred. As a result, the abnormality location estimation device 10 according to this embodiment can solve problems 1 to 4 above.
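To make the topology-only construction concrete, the following sketch derives prior and conditional probability tables from a list of links. It is an illustration under stated assumptions rather than the embodiment's procedure: the function name `build_causal_model` and the probability values (`p_prior`, `p_self`, `p_adjacent`, `p_other`) are hypothetical, and one representative node per device is assumed.

```python
def build_causal_model(links, p_prior=0.01, p_self=0.9,
                       p_adjacent=0.6, p_other=0.05):
    """Build Bayesian-network parameters from network topology alone.

    links is a list of (device, device) pairs. The returned tables are
    prior[i] = P(device i is abnormal) and
    cond[i][k] = P(representative node k is 1 | device i is abnormal).
    """
    devices = sorted({d for link in links for d in link})
    neighbors = {d: set() for d in devices}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    prior = {d: p_prior for d in devices}
    cond = {}
    for i in devices:
        cond[i] = {}
        for k in devices:
            if k == i:
                cond[i][k] = p_self        # the device's own observation data
            elif k in neighbors[i]:
                cond[i][k] = p_adjacent    # adjacent device is affected
            else:
                cond[i][k] = p_other       # non-adjacent device, weak influence
    return prior, cond


# Hypothetical chain topology A - B - C.
prior, cond = build_causal_model([("A", "B"), ("B", "C")])
print(cond["A"])
```

No past abnormality data enters the construction; only the adjacency relation decides which conditional probability each device-node pair receives.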

In other words, the abnormality location estimation device 10 according to this embodiment addresses both aspects of problem 1. The aspect that "abnormalities propagate and affect various observation data" is solved by using data called the "contribution to the abnormality," and the aspect that "it is difficult to write rules one by one for the relationships between abnormalities and diverse types of observation data" is solved by introducing representative nodes into the Bayesian network.

In addition, using the "contribution to the abnormality" removes the need to consider the normal state of each piece of observation data j, because binarization can be based solely on the magnitude of the contribution value; this solves problem 2. Furthermore, problem 4 is solved because the causal model can be constructed from network topology information alone, and problem 3 does not arise because past abnormality data is not used.

As a result, the abnormality location estimation device 10 according to this embodiment solves problems 1 to 4 above and makes it possible to estimate abnormality locations in an ICT system using a causal model that covers the diverse types of observation data obtainable from that ICT system.

The present invention is not limited to the specifically disclosed embodiment described above; various modifications, alterations, and combinations with known technologies are possible without departing from the scope of the claims.

[References]
Reference 1: Kazuyuki Tanaka, "[Tutorial Lecture] Fundamentals of Probabilistic Information Processing and Belief Propagation Algorithms," IEICE Technical Report, 2004.
Reference 2: M. Sakurada and T. Yairi, "Anomaly detection using autoencoders with nonlinear dimensionality reduction," in Proc. MLSDA, ser. MLSDA'14, 2014, pp. 4-11.
Reference 3: Y. Ikeda, K. Tajiri, Y. Nakano, K. Watanabe, and K. Ishibashi, "Estimation of dimensions contributing to detected anomalies with variational autoencoders," arXiv preprint arXiv:1811.04576, 2018.
Reference 4: S. Lundberg and S.-I. Lee, "A Unified Approach to Interpreting Model Predictions," in Proc. NIPS, 2017.

10 Abnormality location estimation device
101 Input device
102 Display device
103 External I/F
103a Recording medium
104 Communication I/F
105 RAM
106 ROM
107 Auxiliary storage device
108 Processor
109 Bus
201 Collection unit
202 Causal model construction unit
203 Contribution calculation unit
204 Estimation unit
205 User interface unit
301 ICT system data DB
302 Causal model DB
303 Contribution DB

Claims

1. A causal model construction device comprising:
a collection unit configured to acquire network topology information representing a network topology of an ICT system in which an abnormality location is to be estimated; and
a model construction unit configured to construct, using the network topology information, a causal model for estimating the abnormality location from observation data obtained when an abnormality occurs in the ICT system,
wherein the model construction unit is configured to construct, as the causal model, a Bayesian network defined by a prior probability representing the likelihood that a device constituting the ICT system falls into an abnormal state, and a conditional probability representing a causal relationship, and the degree of the causal relationship, between the device and the state of a representative node representing the state of the observation data obtained from the device.

2. A causal model construction method executed by a computer, the method comprising:
a collection procedure of acquiring network topology information representing a network topology of an ICT system in which an abnormality location is to be estimated; and
a model construction procedure of constructing, using the network topology information, a causal model for estimating the abnormality location from observation data obtained when an abnormality occurs in the ICT system,
wherein the model construction procedure constructs, as the causal model, a Bayesian network defined by a prior probability representing the likelihood that a device constituting the ICT system falls into an abnormal state, and a conditional probability representing a causal relationship, and the degree of the causal relationship, between the device and the state of a representative node representing the state of the observation data obtained from the device.

3. A program for causing a computer to execute the causal model construction method according to claim 2.