JP7564447B2

JP7564447B2 - Method and program for determining cause of abnormality

Info

Publication number: JP7564447B2
Application number: JP2021031957A
Authority: JP
Inventors: 淳一樋口; 武司児玉; 仁上野
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-03-01
Filing date: 2021-03-01
Publication date: 2024-10-09
Anticipated expiration: 2041-03-01
Also published as: JP2022133094A

Description

本発明は、異常要因判定方法および異常要因判定プログラムに関する。 The present invention relates to an abnormality cause determination method and an abnormality cause determination program.

情報処理システムの動作状況を監視装置によって監視して、異常の発生を検知できるようにする技術は、広く普及している。異常の発生を検知する方法としては、例えば、情報処理システムに含まれるリソースの使用状況を示すメトリックを用いる方法がある。また、このような異常検知技術では、異常が検知された場合に、その異常の発生要因を判定することが求められる。異常の発生要因を判定する方法としては、例えば、情報処理システムに対して実行されたイベントのログを解析する方法が挙げられる。 Technology that uses a monitoring device to monitor the operating status of an information processing system and detect the occurrence of an abnormality is widely used. One method of detecting the occurrence of an abnormality is, for example, to use metrics that indicate the usage status of resources included in the information processing system. Furthermore, with such anomaly detection technology, when an abnormality is detected, it is necessary to determine the cause of the abnormality. One method of determining the cause of the abnormality is, for example, to analyze the log of events executed on the information processing system.

また、情報処理システムの監視や異常要因の解析に関しては、次のような技術が提案されている。例えば、監視対象システムから継続的に監視データを取得してシステムの挙動をモデル化した挙動モデルを作成し、連続して作成された挙動モデルの差に基づいて挙動が変化した期間を推測し、ユーザに通知する障害分析システムが提案されている。また、システム内の機器の入出力とアプリケーションプログラムの変数との対応を示す変数リレーション情報を生成し、機器の異常発生を検知すると、当該機器の入出力に関する変数を変数リレーション情報に基づいて特定し、特定された変数に関連するイベントの情報を発生イベント情報から抽出して表示する異常解析支援システムも提案されている。 Furthermore, the following technologies have been proposed for monitoring information processing systems and analyzing the causes of anomalies. For example, a fault analysis system has been proposed that continuously acquires monitoring data from a monitored system to create a behavior model that models the system's behavior, estimates the period during which the behavior changed based on the difference between the successively created behavior models, and notifies the user. Another proposed anomaly analysis support system generates variable relation information that indicates the correspondence between the input/output of equipment in a system and the variables of an application program, and, when an equipment anomaly is detected, identifies variables related to the input/output of the equipment based on the variable relation information, and extracts and displays information on events related to the identified variables from the event information that has occurred.

国際公開ＷＯ２０１４／１８４９３４号 International Publication No. WO 2014/184934
特開２０１７－２２７９７３号公報 JP 2017-227973 A

ところで、上記のような監視装置が、情報処理システムの異常を検知すると、情報処理システムに対して実行されたイベントのログを取得し、取得したログの内容に基づいて異常の発生要因を判定することが考えられている。通常、異常が検知された場合、その異常発生要因となり得るイベントは、検知時刻の直前に実行されていることが多い。しかし、イベントの実行によって異常が発生してから、その異常が検知されるまでに長い時間がかかるケースもある。このようなケースでは、異常発生要因となり得るイベントのログをデータベースから検索する検索期間を、異常が検知された時刻を終端とする長い期間に設定しないと、適切なイベントのログを取得できない。しかし、検索期間が長くなるほど、検索対象となるログの数が増大し、検索にかかる時間が長くなって、その結果として異常発生要因の判定にかかる時間が長くなるという問題がある。 One conceivable approach is for a monitoring device such as the one described above, upon detecting an abnormality in the information processing system, to acquire logs of events executed on the information processing system and determine the cause of the abnormality based on the contents of the acquired logs. Usually, when an abnormality is detected, the event that may have caused it was executed immediately before the detection time. However, there are also cases where a long time elapses between the occurrence of an abnormality due to the execution of an event and the detection of that abnormality. In such cases, the appropriate event log cannot be acquired unless the search period used to search the database for logs of events that may have caused the abnormality is set to a long period ending at the time the abnormality was detected. However, the longer the search period, the greater the number of logs to be searched and the longer the search takes, which in turn lengthens the time required to determine the cause of the abnormality.

１つの側面では、本発明は、異常発生要因の判定時間を短縮することが可能な異常要因判定方法および異常要因判定プログラムを提供することを目的とする。 In one aspect, the present invention aims to provide an abnormality cause determination method and an abnormality cause determination program that can shorten the time required to determine the cause of an abnormality.

１つの案では、コンピュータが、それぞれ情報処理システムに含まれるリソースの使用状況を示す複数のメトリックのうち、第１のメトリックに基づいて第１の時刻に異常が検知された場合、複数のメトリックのうち第１のメトリックを除くメトリックの中から、第１の時刻の直前において対応するリソースが不使用状態であることを示す１以上の第２のメトリックを特定し、１以上の第２のメトリックのそれぞれが示す使用状況に基づき、第１の時刻の直前から過去に遡って対応するリソースが不使用状態から使用状態に変化する第２の時刻を１以上の第２のメトリックのそれぞれについて特定し、特定された第２の時刻のうち最も古い第３の時刻から第１の時刻までを検索期間として指定して、情報処理システムに対して実行されたイベントのログが蓄積されたデータベースから、検索期間において実行された、第１のメトリックに基づく異常の要因候補となる候補イベントのログを取得し、取得したログが示す候補イベントに基づいて第１のメトリックに基づく異常の発生要因を判定する、異常要因判定方法が提供される。 In one proposal, a method for determining the cause of an anomaly is provided, in which, when an anomaly is detected at a first time based on a first metric among a plurality of metrics indicating the usage status of resources included in an information processing system, a computer identifies one or more second metrics from among the plurality of metrics excluding the first metric, which indicate that the corresponding resource was unused immediately before the first time, and, based on the usage status indicated by each of the one or more second metrics, identifies a second time at which the corresponding resource changes from an unused state to a used state by going back from immediately before the first time for each of the one or more second metrics, specifying a search period from a third time, which is the oldest of the identified second times, to the first time, and acquiring logs of candidate events that were executed during the search period and are candidate causes of the anomaly based on the first metric from a database in which logs of events executed on the information processing system are accumulated, and determining the cause of the anomaly based on the first metric based on the candidate events indicated by the acquired logs.

また、１つの案では、上記の異常要因判定方法と同様の処理をコンピュータに実行させる異常要因判定プログラムが提供される。 In one proposal, an abnormality cause determination program is provided that causes a computer to execute processing similar to the abnormality cause determination method described above.

１つの側面では、異常発生要因の判定時間を短縮できる。 In one aspect, the time required to determine the cause of an abnormality can be shortened.

第１の実施の形態に係る異常要因判定装置を示す図である。 FIG. 1 is a diagram illustrating an abnormality cause determination device according to a first embodiment.
第２の実施の形態に係る情報処理システムの構成例を示す図である。 FIG. 2 is a diagram illustrating an example of a configuration of an information processing system according to a second embodiment.
監視装置のハードウェア構成例を示す図である。 FIG. 3 is a diagram illustrating an example of a hardware configuration of a monitoring device.
運用管理装置および監視装置が備える処理機能の構成例を示す図である。 FIG. 4 is a diagram illustrating an example of a configuration of processing functions included in an operation management device and a monitoring device.
メトリックデータベースのデータ構成例を示す図である。 FIG. 5 is a diagram illustrating an example of a data structure of a metric database.
判定ルールデータベースのデータ構成例を示す図である。 FIG. 6 is a diagram illustrating an example of a data structure of a determination rule database.
判定結果データベースのデータ構成例を示す図である。 FIG. 7 is a diagram illustrating an example of a data structure of a determination result database.
異常発生の要因判定処理についての比較例を示す第１の図である。 FIG. 8 is a first diagram illustrating a comparative example of a process for determining the cause of an abnormality.
異常発生の要因判定処理についての比較例を示す第２の図である。 FIG. 9 is a second diagram illustrating a comparative example of a process for determining the cause of an abnormality.
第２の実施の形態における異常発生の要因判定処理を示す図である。 FIG. 10 is a diagram illustrating a process for determining the cause of an abnormality in the second embodiment.
第２の実施の形態における監視装置の処理手順を示すフローチャートの例である。 FIG. 11 is an example of a flowchart illustrating a processing procedure of the monitoring device according to the second embodiment.
変形例における異常発生の要因判定処理を示す図である。 FIG. 12 is a diagram illustrating a process for determining the cause of an abnormality in a modified example.
変形例における監視装置の処理手順を示すフローチャートの例である。 FIG. 13 is an example of a flowchart illustrating a processing procedure of the monitoring device according to the modified example.

以下、本発明の実施の形態について図面を参照して説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.
〔第１の実施の形態〕 First Embodiment
図１は、第１の実施の形態に係る異常要因判定装置を示す図である。図１に示す異常要因判定装置１は、図示しない情報処理システムの動作状況を監視し、異常が検知された場合にその異常の発生要因を判定する装置である。異常要因判定装置１は、例えば、サーバ装置やパーソナルコンピュータなどのコンピュータとして実現される。この場合、以下で説明する異常要因判定装置１の処理は、例えば、異常要因判定装置１が備えるプロセッサが所定のプログラムを実行することで実現される。 FIG. 1 is a diagram showing an abnormality cause determination device according to a first embodiment. The abnormality cause determination device 1 shown in FIG. 1 is a device that monitors the operating status of an information processing system (not shown) and, when an abnormality is detected, determines the cause of the abnormality. The abnormality cause determination device 1 is realized, for example, as a computer such as a server device or a personal computer. In this case, the processing of the abnormality cause determination device 1 described below is realized, for example, by a processor included in the abnormality cause determination device 1 executing a predetermined program.

異常要因判定装置１は、メトリックデータベース（ＤＢ）２からメトリックを取得可能になっている。メトリックデータベース２には、それぞれ上記の情報処理システムに含まれるリソースの使用状況を示す複数のメトリックが、情報処理システムから逐次収集されて蓄積される。例えば、対応するリソースがＣＰＵ（Central Processing Unit）の場合、メトリックとしてはＣＰＵ使用率、ＣＰＵ待ち時間などがある。対応するリソースがメモリの場合、メトリックとしてはメモリ使用量、メモリスワップアウト量などがある。対応するリソースがネットワークインタフェースの場合、メトリックとしてはネットワーク使用量、パケットロス数などがある。 The abnormality cause determination device 1 is capable of acquiring metrics from a metric database (DB) 2. In the metric database 2, a plurality of metrics, each indicating the usage status of a resource included in the above-mentioned information processing system, are successively collected from the information processing system and accumulated. For example, if the corresponding resource is a CPU (Central Processing Unit), the metrics include CPU usage rate and CPU wait time. If the corresponding resource is memory, the metrics include memory usage and memory swap-out amount. If the corresponding resource is a network interface, the metrics include network usage and number of packet losses.

異常要因判定装置１は、メトリックデータベース２に蓄積された複数のメトリックの中から、特定のメトリックを定期的に取得し、取得したメトリックの値に基づいて情報処理システムにおける異常を検知できる。また、異常要因判定装置１は、異常を検知した場合に、その異常の発生要因を判定するためにメトリックデータベース２内の他のメトリックを取得することもできる。 The abnormality cause determination device 1 periodically acquires a specific metric from among multiple metrics stored in the metric database 2, and can detect an abnormality in the information processing system based on the value of the acquired metric. In addition, when the abnormality cause determination device 1 detects an abnormality, it can also acquire other metrics in the metric database 2 to determine the cause of the abnormality.

また、イベントログデータベース（ＤＢ）３には、情報処理システムに対して実行されたイベントのログが蓄積される。異常要因判定装置１は、検知された異常の発生要因を判定するために、検索条件を指定して、検索条件に合致するイベントのログをイベントログデータベース３から取得できる。なお、イベントログデータベース３に対する検索処理自体は、異常要因判定装置１で実行されてもよいし、異常要因判定装置１の外部に接続された他の装置で実行されてもよい。 In addition, the event log database (DB) 3 accumulates logs of events that have been executed on the information processing system. In order to determine the cause of a detected abnormality, the abnormality factor determination device 1 can specify search conditions and obtain logs of events that match the search conditions from the event log database 3. Note that the search process for the event log database 3 itself may be executed by the abnormality factor determination device 1, or may be executed by another device connected externally to the abnormality factor determination device 1.

一方、図１の右側に示すタイムチャート４は、あるメトリックに基づいて異常が検知された場合における他のメトリックやイベントの状況の例を示す。以下、このタイムチャート４に示された例を用いて、異常要因判定装置１の処理を説明する。 On the other hand, time chart 4 shown on the right side of FIG. 1 shows an example of the status of other metrics and events when an abnormality is detected based on a certain metric. Below, the processing of the abnormality cause determination device 1 will be explained using the example shown in time chart 4.

異常要因判定装置１は、メトリックデータベース２に蓄積されたメトリックのうち、１以上の特定のメトリックに基づいて、情報処理システムにおける異常の有無を判定する。ここでは例として、特定のメトリックに基づいて異常の有無を判定する判定処理が、所定時間間隔の判定時刻ごとに実行されるものとする。この場合、ある判定時刻における判定処理は、前回の判定時刻から現判定時刻までの期間にメトリックデータベース２に蓄積されたメトリックに基づいて実行される。 The abnormality cause determination device 1 determines whether or not an abnormality exists in the information processing system based on one or more specific metrics among the metrics stored in the metric database 2. As an example, the determination process for determining whether or not an abnormality exists based on a specific metric is executed at each determination time at a predetermined time interval. In this case, the determination process at a certain determination time is executed based on the metrics stored in the metric database 2 during the period from the previous determination time to the current determination time.
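
The periodic determination described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the text does not say how an abnormality is judged from a metric, so a simple upper threshold is assumed, and the names `Sample` and `detect_anomaly` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    time: int      # collection time of the metric value (arbitrary units)
    value: float   # metric value, e.g. a usage rate


def detect_anomaly(samples, prev_time, now, threshold):
    """Judge only the samples accumulated in (prev_time, now].

    The threshold test is an assumption made for this sketch.
    """
    window = [s for s in samples if prev_time < s.time <= now]
    return any(s.value > threshold for s in window)


samples = [Sample(1, 0.20), Sample(2, 0.30), Sample(3, 0.95)]
first_check = detect_anomaly(samples, prev_time=0, now=2, threshold=0.9)
second_check = detect_anomaly(samples, prev_time=2, now=3, threshold=0.9)
```

Each call looks only at the samples collected since the previous determination time, which mirrors the "previous determination time to current determination time" window in the text.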

図１のタイムチャート４では、メトリックＭ１（第１のメトリック）に基づいて異常の有無が判定されている例を示している。異常要因判定装置１は、メトリックＭ１から、上記の判定時刻のうち時刻Ｔ１，Ｔ２，Ｔ３，Ｔ４では異常を検知しなかったが、時刻Ｔ５（第１の時刻）で異常を検知したとする（ステップＳ１）。 The time chart 4 in FIG. 1 shows an example in which the presence or absence of an abnormality is determined based on metric M1 (first metric). The abnormality cause determination device 1 detects no abnormality at times T1, T2, T3, and T4 among the above determination times from metric M1, but detects an abnormality at time T5 (first time) (step S1).

すると、異常要因判定装置１は、メトリックデータベース２に蓄積された複数のメトリックのうち、メトリックＭ１を除く他のメトリックの中から、時刻Ｔ５の直前において対応するリソースが不使用状態であることを示す１以上のメトリック（第２のメトリック）を特定する。図１のタイムチャート４では、メトリックＭ１を除くメトリックＭ２～Ｍ４の中から、時刻Ｔ５の直前の判定時刻である時刻Ｔ４において対応するリソースが未使用状態であることを示すメトリックＭ２，Ｍ３が特定されたとする（ステップＳ２）。 Then, the abnormality cause determination device 1 identifies one or more metrics (second metrics) that indicate that the corresponding resource is unused immediately before time T5 from among the multiple metrics stored in the metric database 2, except for metric M1. In the time chart 4 of FIG. 1, it is assumed that metrics M2 and M3 that indicate that the corresponding resource is unused at time T4, which is the determination time immediately before time T5, are identified from among metrics M2 to M4, except for metric M1 (step S2).
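
Step S2 can be sketched as a filter over the stored metrics. The sketch assumes that a metric value of 0 means the corresponding resource is unused, which the text does not state; the function name and the sample values are hypothetical, with times 1 to 5 standing in for T1 to T5.

```python
def find_unused_metrics(usage, anomaly_metric, prev_time):
    """Return the metrics, other than the one that raised the anomaly,
    whose resource is unused at prev_time (assumed here to mean value == 0)."""
    return [
        name
        for name, series in usage.items()
        if name != anomaly_metric
        and prev_time in series
        and series[prev_time] == 0
    ]


# metric name -> {determination time: usage value}
usage = {
    "M1": {1: 10, 2: 12, 3: 11, 4: 10, 5: 90},
    "M2": {1: 5, 2: 0, 3: 0, 4: 0, 5: 7},
    "M3": {1: 4, 2: 6, 3: 0, 4: 0, 5: 3},
    "M4": {1: 8, 2: 8, 3: 9, 4: 8, 5: 8},
}
second_metrics = find_unused_metrics(usage, anomaly_metric="M1", prev_time=4)
```

With this sample data the result contains M2 and M3, matching the time chart example, while M4 is excluded because its resource is in use at time 4.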

次に、異常要因判定装置１は、特定されたメトリックＭ２，Ｍ３のそれぞれが示す使用状況に基づき、時刻Ｔ５の直前（ここでは時刻Ｔ４）から過去に遡って対応するリソースが不使用状態から使用状態に変化する時刻（第２の時刻）を、メトリックＭ２，Ｍ３のそれぞれについて特定する（ステップＳ３）。 Next, based on the usage status indicated by each of the identified metrics M2 and M3, the abnormality cause determination device 1 identifies, for each of the metrics M2 and M3, the time (second time) at which the corresponding resource changes from an unused state to a used state, going back from just before time T5 (here, time T4) (step S3).

図１のタイムチャート４では、メトリックＭ２については、時刻Ｔ２から時刻Ｔ１までの期間で対応するリソースが使用状態に変化している。このため、メトリックＭ２についての上記時刻としては時刻Ｔ１が特定される。また、メトリックＭ３については、時刻Ｔ３から時刻Ｔ２までの期間で対応するリソースが使用状態に変化している。このため、メトリックＭ３についての上記時刻としては時刻Ｔ２が特定される。 In time chart 4 of FIG. 1, for metric M2, the corresponding resource changes to a used state in the period from time T2 to time T1. Therefore, time T1 is identified as the above time for metric M2. Also, for metric M3, the corresponding resource changes to a used state in the period from time T3 to time T2. Therefore, time T2 is identified as the above time for metric M3.
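
The backward scan of step S3 can be sketched as follows. As before, this is an illustration under the assumption that a value greater than 0 means the resource is in use; `last_used_time` is a hypothetical name, and times 1 to 5 stand in for T1 to T5.

```python
def last_used_time(series, start_time):
    """Walk back from start_time and return the most recent determination
    time at which the resource was in use (value > 0), or None if the
    resource was never in use in the recorded history."""
    for t in sorted((t for t in series if t <= start_time), reverse=True):
        if series[t] > 0:
            return t
    return None


m2 = {1: 5, 2: 0, 3: 0, 4: 0, 5: 7}
m3 = {1: 4, 2: 6, 3: 0, 4: 0, 5: 3}
second_time_m2 = last_used_time(m2, start_time=4)
second_time_m3 = last_used_time(m3, start_time=4)
```

For the sample series, the scan stops at time 1 for M2 and at time 2 for M3, which corresponds to T1 and T2 in the time chart.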

次に、異常要因判定装置１は、ステップＳ３で特定された時刻Ｔ１，Ｔ２のうち、最も古い時刻Ｔ１を選択し、選択した時刻Ｔ１から、異常が検知された時刻Ｔ５までを検索期間として指定する。そして、異常要因判定装置１は、イベントログデータベース３から、指定された検索期間において実行された、メトリックＭ１に基づく異常の要因候補となる候補イベントのログを取得する（ステップＳ４）。ここで、検知された異常の要因候補となる候補イベントは、例えば、異常検知の元になったメトリックに応じてあらかじめ決められている。 Next, the abnormality cause determination device 1 selects the oldest time T1 from the times T1 and T2 identified in step S3, and specifies the search period from the selected time T1 to the time T5 at which the abnormality was detected. Then, the abnormality cause determination device 1 obtains from the event log database 3 a log of candidate events that were executed during the specified search period and are candidates for the cause of the abnormality based on the metric M1 (step S4). Here, the candidate events that are candidates for the cause of the detected abnormality are predetermined, for example, according to the metric that was the basis for the abnormality detection.

ステップＳ４では、時刻Ｔ１から時刻Ｔ５までの検索期間と候補イベントとを検索条件としてイベントログデータベース３が検索されることで、検索条件に合致する候補イベントのログが取得される。なお、前述のように、イベントログデータベース３の検索処理自体は、異常要因判定装置１で実行されてもよいし、異常要因判定装置１の外部に接続された他の装置で実行されてもよい。 In step S4, the event log database 3 is searched using the search period from time T1 to time T5 and the candidate event as search conditions, and logs of candidate events that match the search conditions are obtained. As described above, the search process of the event log database 3 itself may be executed by the abnormality factor determination device 1, or may be executed by another device connected to the outside of the abnormality factor determination device 1.
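
The search of step S4 can be sketched as a filter on period and event type. The in-memory list stands in for the event log database, and the `EventLog` fields and event type names such as `driver_update` are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventLog:
    time: float       # execution time of the event
    event_type: str   # kind of event that was executed


def search_candidate_logs(logs, second_times, detected_at, candidate_types):
    """Search from the oldest second time up to the detection time and
    keep only logs whose event type is a candidate cause."""
    start = min(second_times)
    return [
        log
        for log in logs
        if start <= log.time <= detected_at and log.event_type in candidate_types
    ]


logs = [
    EventLog(0.5, "driver_update"),
    EventLog(2.5, "driver_update"),
    EventLog(3.5, "vm_create"),
]
hits = search_candidate_logs(
    logs, second_times=[1, 2], detected_at=5, candidate_types={"driver_update"}
)
```

Only the log at time 2.5 is returned: the log at 0.5 predates the search period, and the log at 3.5 is not a candidate event type for this anomaly.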

図１のタイムチャート４では、時刻Ｔ２から時刻Ｔ３の期間において、異常の要因となったイベントが実行され、このイベントに対応するログ５がイベントログデータベース３に登録されたとする。この場合、ステップＳ４では、候補イベントのログとしてログ５が取得される。すると、異常要因判定装置１は、ステップＳ４で取得したログ５が示す候補イベントに基づいて、メトリックＭ１に基づく異常の発生要因を判定する（ステップＳ５）。 In the time chart 4 of FIG. 1, an event that caused an abnormality was executed in the period from time T2 to time T3, and a log 5 corresponding to this event was registered in the event log database 3. In this case, in step S4, the log 5 is acquired as a log of a candidate event. The abnormality cause determination device 1 then determines the cause of the occurrence of the abnormality based on the metric M1, based on the candidate event indicated by the log 5 acquired in step S4 (step S5).

ここで、情報処理システムの異常が検知された場合、その異常発生要因となり得るイベントは、検知時刻の直前に実行されていることが多い。このようなイベントのログをイベントログデータベース３から取得するためには、ログの検索期間を異常の判定周期に相当する期間に設定すれば十分である。 When an abnormality is detected in the information processing system, the event that may have caused the abnormality is often executed immediately before the time of detection. In order to obtain logs of such events from the event log database 3, it is sufficient to set the log search period to a period equivalent to the abnormality determination cycle.

一方、図１のタイムチャート４に示した例では、ログ５が示すイベントの実行によって異常が発生してから、その異常が検知されるまでに長い時間がかかっている。このようなイベントのログをイベントログデータベース３から取得するためには、ログの検索期間をより長くする必要がある。しかし、ログの検索期間が長くなるほど、検索対象となるログの数が増大し、検索にかかる時間が長くなる。その結果、異常発生要因の判定にかかる時間が長くなってしまう。 On the other hand, in the example shown in time chart 4 of Figure 1, it takes a long time from when an abnormality occurs due to the execution of the event shown in log 5 until the abnormality is detected. In order to obtain such an event log from event log database 3, the log search period needs to be made longer. However, the longer the log search period, the greater the number of logs to be searched, and the longer the time required for the search. As a result, it takes longer to determine the cause of the abnormality.

異常が発生してから検知されるまでに長い時間がかかるケースとしては、リソースが使用されていない期間において、そのリソースに関する異常が発生しているケースがある。より具体的には、あるイベントの実行によってあるリソースに関する異常が発生したが、その時点ではリソースが使用されておらず、その後にリソースの使用が開始された時点で異常事象が出現し、異常が検知される、というケースがある。 One case in which a long time elapses between the occurrence of an abnormality and its detection is the case where an abnormality related to a resource occurs during a period in which that resource is not being used. More specifically, the execution of a certain event causes an abnormality related to a certain resource, but the resource is not in use at that time; the abnormal symptom appears only when use of the resource starts later, and the abnormality is detected at that point.

図１のタイムチャート４に示した例では、ログ５が示すイベントが実行されたとき、そのイベントに関係するリソースが使用されておらず、その後に時刻Ｔ５の直前でリソースの使用が開始されたことで、時刻Ｔ５で異常が検知された、と考えることができる。 In the example shown in time chart 4 of Figure 1, when the event shown in log 5 was executed, the resource related to that event was not in use, and then just before time T5, the resource started to be used, which is why an abnormality was detected at time T5.

そこで、異常要因判定装置１は、メトリックＭ１を除く他のメトリックの中から、時刻Ｔ５の直前において対応するリソースが不使用状態であることを示すメトリックＭ２，Ｍ３を特定する。次に、異常要因判定装置１は、特定されたメトリックＭ２，Ｍ３のそれぞれについて、時刻Ｔ５の直前から過去に遡って対応するリソースが不使用状態から使用状態に変化する時刻Ｔ１，Ｔ２を特定する。そして、異常要因判定装置１は、特定された時刻Ｔ１，Ｔ２のうち最も古い時刻を、ログの検索期間の開始時刻に決定する。 The abnormality cause determination device 1 therefore identifies, from among the metrics other than metric M1, metrics M2 and M3 that indicate that the corresponding resource was unused immediately before time T5. Next, for each of the identified metrics M2 and M3, the abnormality cause determination device 1 goes back in time from immediately before time T5 and identifies the times T1 and T2 at which the corresponding resource changes from the unused state to the used state. The abnormality cause determination device 1 then sets the older of the identified times T1 and T2 as the start time of the log search period.

このような処理により、異常が検知された時刻Ｔ５の直前まで不使用状態になっていたリソースに関係するイベントのログをすべて検索対象に含めることができるように、検索期間の開始時刻が決定される。これにより、検索期間を必要最小限の長さに設定できる。このため、検索期間の長さを抑制しながら、検知された異常の発生要因となり得る候補イベントのログを取得できる可能性が高まる。したがって、イベントログデータベース３の検索にかかる時間を短縮し、それによって異常要因判定装置１による異常の検知から異常発生要因の判定までにかかる時間を短縮しつつ、その判定精度を高めることができる。 Through this processing, the start time of the search period is determined so that all event logs related to resources that were unused until just before the time T5 when the abnormality was detected can be included in the search. This allows the search period to be set to the minimum length necessary. This increases the possibility of obtaining logs of candidate events that may be the cause of the detected abnormality while keeping the length of the search period down. This reduces the time required to search the event log database 3, thereby shortening the time required from when the abnormality cause determination device 1 detects an abnormality to when it determines the cause of the abnormality, while improving the accuracy of the determination.

〔第２の実施の形態〕 Second Embodiment
図２は、第２の実施の形態に係る情報処理システムの構成例を示す図である。図２に示す情報処理システムは、運用管理装置１００と監視装置２００とを含む。 FIG. 2 is a diagram showing an example of the configuration of an information processing system according to the second embodiment. The information processing system shown in FIG. 2 includes an operation management device 100 and a monitoring device 200.

運用管理装置１００は、ＩＣＴ（Information and Communication Technology）インフラストラクチャ１１０の運用を管理する。以下、ＩＣＴインフラストラクチャを「ＩＣＴインフラ」と略称する。ＩＣＴインフラ１１０は、コンピュータやネットワーク機器などの各種の情報処理機器を含む。例えば、ＩＣＴインフラ１１０がクラウドサービスを提供するものである場合、ＩＣＴインフラ１１０には、クラウドサーバとして動作するサーバ装置や、サーバ装置間を接続するネットワーク機器などが含まれる。 The operation management device 100 manages the operation of an ICT (Information and Communication Technology) infrastructure 110. Hereinafter, the Information and Communication Technology infrastructure is referred to as the "ICT infrastructure." The ICT infrastructure 110 includes various information processing devices such as computers and network devices. For example, if the ICT infrastructure 110 provides a cloud service, the ICT infrastructure 110 includes server devices that operate as cloud servers and network devices that connect the server devices.

運用管理装置１００は、ＩＣＴインフラ１１０に含まれる各情報処理機器に対する、運用管理に関する各種のイベント（運用イベント）を実行する。運用イベントは、ＩＣＴインフラ１１０における各種の構成変更や設定変更を行う処理であり、例えば、サーバ装置上で動作する仮想マシンの作成、削除、マイグレーションや、ドライバなどのプログラムの更新などがある。運用管理装置１００は、運用イベントを実行するとともに、実行した運用イベントに関するログをデータベースに記録する。 The operation management device 100 executes various events related to operation management (operation events) on each information processing device included in the ICT infrastructure 110. Operation events are processes that make configuration changes and setting changes in the ICT infrastructure 110, such as creating, deleting, and migrating virtual machines running on a server device, and updating programs such as drivers. The operation management device 100 executes operation events and records logs of the executed operation events in a database.

また、運用管理装置１００は、ＩＣＴインフラ１１０に含まれる各情報処理機器の稼働状態を監視し、各情報処理機器からリソースに関するメトリックを収集する。メトリックは、プロセッサやメモリなどの監視対象のリソースの動作状態を示す情報であり、例えば、リソースの動作状態を評価するための尺度を与える。 The operation management device 100 also monitors the operating status of each information processing device included in the ICT infrastructure 110, and collects resource-related metrics from each information processing device. Metrics are information that indicate the operating status of monitored resources such as processors and memory, and provide, for example, a scale for evaluating the operating status of resources.

監視装置２００は、運用管理装置１００を介してＩＣＴインフラ１１０の稼働状態を監視し、異常が検知された場合にはその発生要因を解析する。具体的には、監視装置２００は、運用管理装置１００によって収集されたメトリックを取得し、動作の正常性を判定する。異常が検知された場合、監視装置２００は、運用管理装置１００から運用イベントのログを取得し、異常発生の契機となり得る運用イベントを特定する。監視装置２００は、特定された運用イベントに基づいて異常発生要因を判定する。 The monitoring device 200 monitors the operating status of the ICT infrastructure 110 via the operation management device 100 and, if an abnormality is detected, analyzes its cause. Specifically, the monitoring device 200 acquires the metrics collected by the operation management device 100 and determines whether operation is normal. If an abnormality is detected, the monitoring device 200 acquires operation event logs from the operation management device 100 and identifies the operation event that may have triggered the abnormality. The monitoring device 200 then determines the cause of the abnormality based on the identified operation event.

図３は、監視装置のハードウェア構成例を示す図である。監視装置２００は、例えば、図３に示すようなコンピュータとして実現される。 FIG. 3 is a diagram showing an example of a hardware configuration of the monitoring device. The monitoring device 200 is realized, for example, as a computer as shown in FIG. 3.
図３に示す監視装置２００は、プロセッサ２０１、ＲＡＭ（Random Access Memory）２０２、ＨＤＤ（Hard Disk Drive）２０３、ＧＰＵ（Graphics Processing Unit）２０４、入力インタフェース（Ｉ／Ｆ）２０５、読み取り装置２０６および通信インタフェース（Ｉ／Ｆ）２０７を備える。 The monitoring device 200 shown in FIG. 3 includes a processor 201, a RAM (Random Access Memory) 202, an HDD (Hard Disk Drive) 203, a GPU (Graphics Processing Unit) 204, an input interface (I/F) 205, a reading device 206, and a communication interface (I/F) 207.

プロセッサ２０１は、監視装置２００全体を統括的に制御する。プロセッサ２０１は、例えば、ＣＰＵ、ＭＰＵ（Micro Processing Unit）、ＤＳＰ（Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）またはＰＬＤ（Programmable Logic Device）である。また、プロセッサ２０１は、ＣＰＵ、ＭＰＵ、ＤＳＰ、ＡＳＩＣ、ＰＬＤのうちの２以上の要素の組み合わせであってもよい。 The processor 201 performs overall control of the entire monitoring device 200. The processor 201 is, for example, a CPU, an MPU (Micro Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a PLD (Programmable Logic Device). The processor 201 may also be a combination of two or more elements of a CPU, an MPU, a DSP, an ASIC, or a PLD.

ＲＡＭ２０２は、監視装置２００の主記憶装置として使用される。ＲＡＭ２０２には、プロセッサ２０１に実行させるＯＳ（Operating System）プログラムやアプリケーションプログラムの少なくとも一部が一時的に格納される。また、ＲＡＭ２０２には、プロセッサ２０１による処理に必要な各種データが格納される。 The RAM 202 is used as the main storage device of the monitoring device 200. The RAM 202 temporarily stores at least a portion of the OS (Operating System) program and application programs to be executed by the processor 201. The RAM 202 also stores various data necessary for processing by the processor 201.

ＨＤＤ２０３は、監視装置２００の補助記憶装置として使用される。ＨＤＤ２０３には、ＯＳプログラム、アプリケーションプログラム、および各種データが格納される。なお、補助記憶装置としては、ＳＳＤ（Solid State Drive）などの他の種類の不揮発性記憶装置を使用することもできる。 The HDD 203 is used as an auxiliary storage device for the monitoring device 200. The HDD 203 stores the OS program, application programs, and various data. Note that other types of non-volatile storage devices, such as a solid state drive (SSD), can also be used as the auxiliary storage device.

ＧＰＵ２０４には、表示装置２０４ａが接続されている。ＧＰＵ２０４は、プロセッサ２０１からの命令にしたがって、画像を表示装置２０４ａに表示させる。表示装置としては、液晶ディスプレイや有機ＥＬ（ElectroLuminescence）ディスプレイなどがある。 A display device 204a is connected to the GPU 204. The GPU 204 displays an image on the display device 204a in accordance with an instruction from the processor 201. Examples of the display device include a liquid crystal display and an organic EL (ElectroLuminescence) display.

入力インタフェース２０５には、入力装置２０５ａが接続されている。入力インタフェース２０５は、入力装置２０５ａから出力される信号をプロセッサ２０１に送信する。入力装置２０５ａとしては、キーボードやポインティングデバイスなどがある。ポインティングデバイスとしては、マウス、タッチパネル、タブレット、タッチパッド、トラックボールなどがある。 The input interface 205 is connected to the input device 205a. The input interface 205 transmits a signal output from the input device 205a to the processor 201. Examples of the input device 205a include a keyboard and a pointing device. Examples of the pointing device include a mouse, a touch panel, a tablet, a touch pad, and a trackball.

読み取り装置２０６には、可搬型記録媒体２０６ａが脱着される。読み取り装置２０６は、可搬型記録媒体２０６ａに記録されたデータを読み取ってプロセッサ２０１に送信する。可搬型記録媒体２０６ａとしては、光ディスク、半導体メモリなどがある。 A portable recording medium 206a is detachably attached to the reading device 206. The reading device 206 reads data recorded on the portable recording medium 206a and transmits it to the processor 201. Examples of the portable recording medium 206a include an optical disk and a semiconductor memory.

通信インタフェース２０７は、ネットワーク２０７ａを介して、運用管理装置１００などの他の装置との間でデータの送受信を行う。 The communication interface 207 transmits and receives data to and from other devices such as the operation management device 100 via a network 207a.
以上のようなハードウェア構成によって、監視装置２００の処理機能を実現することができる。なお、運用管理装置１００についても、例えば、図３に示すような構成のコンピュータとして実現することができる。 The above hardware configuration can realize the processing functions of the monitoring device 200. The operation management device 100 can also be realized, for example, as a computer having a configuration as shown in FIG. 3.

図４は、運用管理装置および監視装置が備える処理機能の構成例を示す図である。 FIG. 4 is a diagram illustrating an example of the configuration of processing functions provided in the operation management device and the monitoring device.
まず、運用管理装置１００は、イベント実行部１０１、イベントログ検索部１０２およびメトリック収集部１０３を備える。イベント実行部１０１、イベントログ検索部１０２およびメトリック収集部１０３の処理は、例えば、運用管理装置１００が備える図示しないプロセッサが所定のプログラムを実行することで実現される。また、運用管理装置１００の図示しない記憶装置（例えばＲＡＭ）には、イベントログデータベース（ＤＢ）１０４とメトリックデータベース（ＤＢ）１０５とが記憶される。 First, the operation management device 100 includes an event execution unit 101, an event log search unit 102, and a metric collection unit 103. The processing of the event execution unit 101, the event log search unit 102, and the metric collection unit 103 is realized, for example, by a processor (not shown) included in the operation management device 100 executing a predetermined program. In addition, an event log database (DB) 104 and a metric database (DB) 105 are stored in a storage device (not shown; for example, RAM) of the operation management device 100.

イベント実行部１０１は、ＩＣＴインフラ１１０に含まれる各情報処理機器に対する運用イベントを実行する。イベント実行部１０１は、実行された運用イベントに関するログをイベントログデータベース１０４に登録する。運用イベントのログには、実行された処理内容を示す情報や、実行の成否を示す情報、実行された時刻などの情報が含まれる。 The event execution unit 101 executes operation events for each information processing device included in the ICT infrastructure 110. The event execution unit 101 registers a log of the executed operation events in the event log database 104. The operation event log includes information indicating the content of the executed process, information indicating whether the execution was successful, information indicating the time of execution, etc.
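
A log record carrying the three kinds of information listed above might look like the following sketch. The field names and the in-memory list standing in for the event log database 104 are hypothetical; the text does not give a schema at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationEventLog:
    executed_at: float   # time at which the operation event was executed
    content: str         # description of the processing that was executed
    succeeded: bool      # whether the execution succeeded


event_log_db = []  # stand-in for the event log database 104


def register(log):
    """Append one operation event log to the stand-in database."""
    event_log_db.append(log)


register(OperationEventLog(executed_at=2.5, content="update network driver", succeeded=True))
```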

イベントログ検索部１０２は、例えば監視装置２００からの検索依頼に応じて、イベントログデータベース１０４を検索し、検索された運用イベントのログを返信する。 The event log search unit 102 searches the event log database 104 in response to a search request from, for example, the monitoring device 200, and returns the operation event logs that were found.
メトリック収集部１０３は、ＩＣＴインフラ１１０に含まれる各情報処理機器からメトリックを収集し、収集されたメトリックをメトリックデータベース１０５に登録する。メトリックとしては、例えば、サーバ装置におけるＣＰＵ待ち時間、ＣＰＵ使用率、メモリスワップアウト量、パケットロス数、ネットワーク使用率などが収集される。 The metric collection unit 103 collects metrics from each information processing device included in the ICT infrastructure 110, and registers the collected metrics in the metric database 105. The collected metrics include, for example, CPU wait time, CPU usage rate, memory swap-out amount, number of packet losses, and network usage rate in the server devices.

次に、監視装置２００は、メトリック取得部２１１、正常性判定部２１２および要因判定部２１３を備える。メトリック取得部２１１、正常性判定部２１２および要因判定部２１３の処理は、例えば、監視装置２００が備えるプロセッサ２０１が所定のプログラムを実行することで実現される。また、監視装置２００の記憶装置（例えばＲＡＭ２０２）には、メトリックデータベース（ＤＢ）２２１、判定ルールデータベース（ＤＢ）２２２および判定結果データベース（ＤＢ）２２３が記憶される。 Next, the monitoring device 200 includes a metric acquisition unit 211, a normality determination unit 212, and a factor determination unit 213. The processing of the metric acquisition unit 211, the normality determination unit 212, and the factor determination unit 213 is realized, for example, by the processor 201 included in the monitoring device 200 executing a predetermined program. In addition, a metric database (DB) 221, a determination rule database (DB) 222, and a determination result database (DB) 223 are stored in a storage device (for example, RAM 202) of the monitoring device 200.

メトリック取得部２１１は、運用管理装置１００のメトリックデータベース１０５に登録されたメトリックを取得して、メトリックデータベース２２１に登録する。 The metric acquisition unit 211 acquires the metrics registered in the metric database 105 of the operation management device 100, and registers them in the metric database 221.
正常性判定部２１２は、メトリックデータベース２２１に登録されたメトリックに基づいて、メトリックに関する正常性判定処理を定期的に実行する。この正常性判定処理、直近の一定時間内に運用管理装置１００によって収集されたメトリックを用いて実行される。正常性判定部２１２は、メトリックの異常が検知されると、そのメトリック（異常検知メトリック）を要因判定部２１３に通知する。 The normality determination unit 212 periodically executes normality determination processing for metrics based on the metrics registered in the metric database 221. This normality determination processing is executed using the metrics collected by the operation management device 100 within the most recent fixed period of time. When an abnormality in a metric is detected, the normality determination unit 212 notifies the cause determination unit 213 of that metric (the anomaly detection metric).

判定ルールデータベース２２２には、異常検知メトリックと、そのメトリックの異常発生の要因となり得る運用イベントと、異常発生要因とが、あらかじめ対応付けて登録されている。要因判定部２１３は、判定ルールデータベース２２２に基づいて、正常性判定部２１２から通知された異常検知メトリックについての異常発生の要因となり得る運用イベント（要因イベント）を特定する。 In the judgment rule database 222, anomaly detection metrics, operational events that may be the cause of an anomaly in the metric, and anomaly occurrence causes are registered in advance in association with each other. The cause determination unit 213 identifies an operational event (causing event) that may be the cause of an anomaly in the anomaly detection metric notified by the normality determination unit 212 based on the judgment rule database 222.

要因判定部２１３は、現時刻から所定時間だけ前の時刻までの期間に実行された要因イベントのログをイベントログデータベース１０４から検索するように、イベントログ検索部１０２に依頼する。要因判定部２１３は、イベントログデータベース１０４から要因イベントのログが検索された場合、判定ルールデータベース２２２から、検索されたログが示す運用イベントに対応する異常発生要因を抽出し、異常発生要因の判定結果を判定結果データベース２２３に登録する。 The cause determination unit 213 requests the event log search unit 102 to search the event log database 104 for logs of cause events that were executed during the period from a predetermined time before the current time up to the current time. When a log of a cause event is found in the event log database 104, the cause determination unit 213 extracts, from the determination rule database 222, the abnormality occurrence cause corresponding to the operation event indicated by the found log, and registers the determination result of the abnormality occurrence cause in the determination result database 223.

図５は、メトリックデータベースのデータ構成例を示す図である。この図５では監視装置２００のメトリックデータベース２２１について示すが、運用管理装置１００のメトリックデータベース１０５も同様のデータ構成を有する。 Figure 5 is a diagram showing an example of the data structure of a metric database. Figure 5 shows the metric database 221 of the monitoring device 200, but the metric database 105 of the operation management device 100 also has a similar data structure.

メトリックデータベース２２１には、メトリックが収集された収集時刻に対して、メトリックの種別（監視項目）ごとのメトリックの値が対応付けて登録される。図５の例では、メトリックの項目として、ホスト＃１のＣＰＵ使用率、ホスト＃１のＮＩＣ（Network Interface Card）＃１におけるネットワーク使用率、ホスト＃１のＮＩＣ＃２におけるネットワーク使用率が登録されている。この例では、少なくとも、仮想マシンが動作するサーバ装置であるホスト＃１が、ＣＰＵや２つのＮＩＣ＃１，＃２を備えているものとする。 In the metric database 221, metric values for each metric type (monitoring item) are registered in association with the collection time when the metric was collected. In the example of FIG. 5, the following metric items are registered: CPU usage of host #1, network usage of NIC (Network Interface Card) #1 of host #1, and network usage of NIC #2 of host #1. In this example, it is assumed that host #1, which is a server device on which a virtual machine runs, is equipped with at least a CPU and two NICs #1 and #2.
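As a rough illustration of the layout described for FIG. 5, the sketch below holds the metric database as a mapping from collection time to per-item values. The item names (such as `host1.cpu_usage`), the date, and all numeric values are assumptions made for illustration only; the description does not specify a storage format.

```python
from datetime import datetime

# Hypothetical stand-in for metric database 221 (FIG. 5):
# collection time -> {monitoring item: value in percent}.
# Item names and values are illustrative assumptions, not taken from the patent.
metric_db = {
    datetime(2021, 3, 1, 10, 0): {
        "host1.cpu_usage": 35.0,
        "host1.nic1.net_usage": 30.0,
        "host1.nic2.net_usage": 20.0,
    },
    datetime(2021, 3, 1, 10, 3): {
        "host1.cpu_usage": 40.0,
        "host1.nic1.net_usage": 30.0,
        "host1.nic2.net_usage": 0.0,
    },
}


def series(db, item):
    """Return the (collection time, value) pairs of one monitoring item, oldest first."""
    return [(ts, row[item]) for ts, row in sorted(db.items()) if item in row]
```

Keying the rows by collection time keeps the per-item time series easy to extract, which is what the later search-period determination relies on.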

図６は、判定ルールデータベースのデータ構成例を示す図である。判定ルールデータベース２２２は、異常が検知されたメトリック（異常検知メトリック）から、異常発生要因を推定するために参照されるデータベースである。判定ルールデータベース２２２には、異常検知メトリックに対して、異常発生の要因となり得る運用イベントである要因イベントと、異常発生の要因とが対応付けて登録される。これらの情報は、判定ルールデータベース２２２にあらかじめ登録される。 Figure 6 is a diagram showing an example of the data configuration of the judgment rule database. The judgment rule database 222 is a database that is referenced to estimate the cause of an anomaly from a metric in which an anomaly has been detected (anomaly detection metric). In the judgment rule database 222, a causal event, which is an operational event that may be the cause of an anomaly, and the cause of the anomaly are registered in association with each anomaly detection metric. This information is registered in advance in the judgment rule database 222.

図６の例では、異常検知メトリックがＣＰＵ待ち時間の場合に、要因イベントとして仮想マシン（Virtual Machine：ＶＭ）のマイグレーションが考えられ、そのマイグレーションによるＣＰＵの競合が異常発生要因になり得ることが登録されている。また、異常検知メトリックがメモリスワップアウト量の場合に、要因イベントとして仮想マシンのマイグレーションが考えられ、そのマイグレーションによるメモリの競合が異常発生要因になり得ることが登録されている。 In the example of Figure 6, when the anomaly detection metric is CPU wait time, the migration of a virtual machine (VM) is considered as a causal event, and it is registered that CPU contention due to the migration may be a cause of the anomaly. Also, when the anomaly detection metric is memory swap-out amount, the migration of a virtual machine is considered as a causal event, and it is registered that memory contention due to the migration may be a cause of the anomaly.

さらに、異常検知メトリックがパケットロス数の場合に、要因イベントとして仮想マシンのマイグレーションが考えられ、そのマイグレーションによるネットワークの競合が異常発生要因になり得ることが登録されている。また、異常検知メトリックがパケットロス数の場合には他の例として、要因イベントとしてＮＩＣドライバの更新が考えられ、そのＮＩＣドライバの不具合が異常発生要因になり得ることが登録されている。 Furthermore, when the anomaly detection metric is the packet loss count, it is registered that the migration of a virtual machine is considered as a causal event, and that network contention due to the migration may be a cause of the anomaly. As another example for the packet loss count, it is registered that a NIC driver update is considered as a causal event, and that a malfunction of the NIC driver may be a cause of the anomaly.
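The four rules described for FIG. 6 can be sketched as a small lookup table. The identifier strings below (`cpu_wait_time`, `vm_migration`, and so on) and the helper names are assumptions chosen for this example; only the metric, event, and cause pairings come from the description.

```python
# Hypothetical encoding of the FIG. 6 rules as
# (anomaly detection metric, cause event, cause) rows.
# The identifier strings are assumptions; the pairings follow the text.
RULES = [
    ("cpu_wait_time", "vm_migration", "CPU contention"),
    ("memory_swap_out", "vm_migration", "memory contention"),
    ("packet_loss_count", "vm_migration", "network contention"),
    ("packet_loss_count", "nic_driver_update", "NIC driver malfunction"),
]


def cause_events(metric):
    """Cause events registered for an anomaly detection metric."""
    return [event for m, event, _ in RULES if m == metric]


def cause_of(metric, event):
    """Cause registered for a (metric, cause event) pair, or None if absent."""
    for m, e, cause in RULES:
        if m == metric and e == event:
            return cause
    return None
```

Note that one metric can map to several cause events, so the lookup by metric returns a list while the lookup by metric and event returns a single cause.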

図７は、判定結果データベースのデータ構成例を示す図である。判定結果データベース２２３には、判定結果を示す情報として、異常検知時刻、監視ホスト名、監視箇所、異常検知メトリックおよび要因が対応付けて登録されている。異常検知時刻は、異常が検知された時刻を示す。監視ホスト名は、監視対象のホストを示す。監視箇所は、そのホストにおける監視対象の箇所を示す。異常検知メトリックは、異常が検知されたメトリックを示す。要因は、判定された異常発生要因を示す。 Figure 7 is a diagram showing an example of the data configuration of the judgment result database. In the judgment result database 223, the abnormality detection time, monitoring host name, monitoring location, abnormality detection metric, and cause are registered in association with each other as information indicating the judgment result. The abnormality detection time indicates the time when the abnormality was detected. The monitoring host name indicates the host to be monitored. The monitoring location indicates the location to be monitored on that host. The abnormality detection metric indicates the metric at which the abnormality was detected. The cause indicates the determined cause of the abnormality.

次に、図８、図９を用いて、異常発生の要因判定処理についての比較例を説明する。 Next, a comparative example of the process for determining the cause of an abnormality will be described with reference to FIGS. 8 and 9.
図８は、異常発生の要因判定処理についての比較例を示す第１の図である。 FIG. 8 is a first diagram showing a comparative example of the process for determining the cause of an abnormality.
監視装置２００の正常性判定部２１２は、運用管理装置１００によって収集されたメトリックに基づいて、ＩＣＴインフラ１１０の稼働状況の正常性を判定する。このような正常性の判定時刻は一定時間間隔で設定され、正常性判定部２１２は、判定時刻を基準とした直近の一定時間に収集されたメトリックに基づいて、正常性の判定を行う。図８では例として、３分間隔で正常性の判定時刻が設定されている。 The normality determination unit 212 of the monitoring device 200 determines the normality of the operating status of the ICT infrastructure 110 based on the metrics collected by the operation management device 100. The normality determination times are set at fixed intervals, and the normality determination unit 212 determines normality based on the metrics collected during the most recent fixed period up to the determination time. In FIG. 8, as an example, the normality determination times are set at 3-minute intervals.

収集された複数項目のメトリックの中には、正常性判定のために使用される１以上の特定のメトリックがあらかじめ決められている。図８では、正常性判定のために使用されるメトリックとして、ＣＰＵ使用率、メモリスワップアウト量、パケットロス数が例示されている。なお、メモリスワップアウト量は、一定期間（前回の判定時刻から現在の判定時刻までの期間）においてメモリからＨＤＤやＳＳＤに退避されたデータの量を示し、パケットロス数は、一定期間に発生したパケットロスの回数を示す。 Among the multiple metrics collected, one or more specific metrics to be used for normality determination are predetermined. In FIG. 8, CPU usage, memory swap-out amount, and number of packet losses are exemplified as metrics to be used for normality determination. The memory swap-out amount indicates the amount of data evacuated from memory to a HDD or SSD during a certain period (the period from the previous determination time to the current determination time), and the number of packet losses indicates the number of packet losses that occurred during the certain period.

正常性判定部２１２は、例えば、メトリックごとに設定された判定閾値に基づき、メトリックの値が判定閾値を超えた場合、あるいは判定閾値未満になった場合に、そのメトリックについての異常が検知されたと判定する。例えば、図８に示したＣＰＵ使用率やメモリスワップアウト量、パケットロス数の場合、値が判定閾値を超えた場合に異常検知と判定される。なお、実際には、互いに関連する複数項目のメトリックの値に基づいて正常性（および異常検知）が判定されてもよい。例えば、一定期間でのパケットロス数と、一定期間での送信パケット数の相関関係に基づいて、正常か異常かが判定されてもよい。 For example, based on a judgment threshold set for each metric, the normality determination unit 212 determines that an abnormality has been detected for that metric when the metric value exceeds the judgment threshold or falls below the judgment threshold. For example, in the case of the CPU usage rate, memory swap-out amount, and number of packet losses shown in FIG. 8, an abnormality is determined to have been detected when the value exceeds the judgment threshold. Note that normality (and abnormality detection) may actually be determined based on the values of multiple metrics that are related to each other. For example, normality or abnormality may be determined based on the correlation between the number of packet losses in a certain period of time and the number of transmitted packets in a certain period of time.
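The per-metric threshold check described above can be sketched as follows. The threshold values, the metric names, and the `"above"`/`"below"` direction labels are all assumptions made for this example; the description only states that an abnormality is detected when a value exceeds, or falls below, a threshold set for each metric.

```python
# Hypothetical per-metric rules: (direction, threshold). All numbers are made up.
THRESHOLDS = {
    "cpu_usage": ("above", 90.0),
    "memory_swap_out": ("above", 500.0),
    "packet_loss_count": ("above", 100),
}


def detect_anomalies(values, thresholds=THRESHOLDS):
    """Return the names of metrics whose value crosses its threshold."""
    anomalous = []
    for name, value in values.items():
        if name not in thresholds:
            continue  # not one of the predetermined metrics
        direction, limit = thresholds[name]
        if (direction == "above" and value > limit) or (
            direction == "below" and value < limit
        ):
            anomalous.append(name)
    return anomalous
```

For instance, `detect_anomalies({"cpu_usage": 40.0, "packet_loss_count": 250})` flags only the packet loss count under these assumed thresholds. The correlation-based determination mentioned in the text is not covered by this sketch.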

正常性判定部２１２によってあるメトリックについて異常が検知されると、要因判定部２１３は、判定ルールデータベース２２２を参照して、異常が検知されたメトリックについての異常発生の要因となり得る運用イベント（要因イベント）を特定する。図８の例では、１０時９分においてパケットロス数についての異常が検知されたとする。ここで、図６に示した判定ルールデータベース２２２の例では、パケットロス数に対して要因イベントとしてＶＭマイグレーションとＮＩＣドライバ更新とが登録されている。したがって、図８の例では要因イベントとしてＶＭマイグレーションとＮＩＣドライバ更新が特定される。 When the normality determination unit 212 detects an abnormality in a certain metric, the cause determination unit 213 refers to the determination rule database 222 to identify an operational event (causing event) that may be the cause of the abnormality in the metric in which the abnormality was detected. In the example of FIG. 8, it is assumed that an abnormality in the number of packet losses was detected at 10:09. Here, in the example of the determination rule database 222 shown in FIG. 6, VM migration and NIC driver update are registered as causing events for the number of packet losses. Therefore, in the example of FIG. 8, VM migration and NIC driver update are identified as causing events.

また、要因判定部２１３は、現在の判定時刻から前回の判定時刻までの期間（１０時６分から１０時９分までの期間）に実行された要因イベントのログの検索を、運用管理装置１００のイベントログ検索部１０２に依頼する。図８の例では、ＮＩＣドライバを更新したことを示すログＬＧ１が検索されたとする。この場合、要因判定部２１３は、判定ルールデータベース２２２からパケットロス数およびＮＩＣドライバ更新に対応付けられた異常発生の要因を抽出する。図６の判定ルールデータベース２２２に基づく場合、要因としてＮＩＣドライバの不具合が抽出される。要因判定部２１３は、このような異常発生要因の判定結果を判定結果データベース２２３に登録する。 The cause determination unit 213 also requests the event log search unit 102 of the operation management device 100 to search for logs of cause events that were executed during the period between the previous determination time and the current determination time (the period from 10:06 to 10:09). In the example of FIG. 8, it is assumed that a log LG1 indicating that the NIC driver was updated is found. In this case, the cause determination unit 213 extracts, from the determination rule database 222, the cause of the abnormality associated with the packet loss count and the NIC driver update. Based on the determination rule database 222 of FIG. 6, a malfunction of the NIC driver is extracted as the cause. The cause determination unit 213 registers this determination result in the determination result database 223.

ここで、ＩＣＴインフラ１１０で発生する異常は、ＩＣＴインフラ１１０の運用管理において実行される構成変更や設定変更のイベント（運用イベント）を契機として発生することが多い。上記処理によれば、異常が検知されたメトリックに関連する運用イベントのログに基づいて異常発生要因が判定されるので、要因判定精度を高めることができる。 Anomalies occurring in the ICT infrastructure 110 are often triggered by configuration changes or setting change events (operational events) executed in the operational management of the ICT infrastructure 110. According to the above process, the cause of the anomaly is determined based on the log of the operational event related to the metric in which the anomaly was detected, thereby improving the accuracy of determining the cause.

ところが、上記の方法では、次の図９に例示するような場合に、適切な要因イベントのログを検索により取得できず、異常判定要因を正確に判定できないという問題がある。 However, with the above method, in a case such as the one illustrated in FIG. 9, the log of the appropriate cause event cannot be retrieved by the search, and the cause of the abnormality therefore cannot be determined accurately.
図９は、異常発生の要因判定処理についての比較例を示す第２の図である。異常の事象中には、要因イベントの実行に伴って異常が発生したときに、すぐには異常が検知されず、時間が経過してから異常が検知されるものがある。その例として、要因イベントの実行によりあるリソースに異常が発生したが、その時点でリソースが使用されておらず、その後にリソースが使用された時点で異常が検知される、というものがある。 FIG. 9 is a second diagram showing a comparative example of the process for determining the cause of an abnormality. Some abnormalities that arise from the execution of a cause event are not detected immediately, but only after some time has passed. One such example is when an abnormality occurs in a certain resource due to the execution of a cause event, but the resource is not being used at that time, and the abnormality is detected only when the resource is used later.

図９の例では、１０時９分から１２分までの期間に、ＮＩＣ＃１のドライバを更新するという要因イベントが実行され、これに伴ってＮＩＣ＃１のドライバ（またはＮＩＣ＃１）の動作に異常が発生したとする。ただし、この時点でＮＩＣ＃１のドライバは使用されていなかった（ＮＩＣ＃１で通信が行われていなかった）とする。この場合、ＮＩＣ＃１による通信ではパケットロスが発生しないので、パケットロス数というメトリックからは異常は検知されない。 In the example of Figure 9, let us say that a causal event to update the driver of NIC#1 was executed between 10:09 and 10:12, which caused an abnormality in the operation of the driver of NIC#1 (or NIC#1). However, let us say that the driver of NIC#1 was not being used at this point (no communication was taking place via NIC#1). In this case, no packet loss occurs in the communication via NIC#1, so no abnormality is detected from the metric of the number of packet losses.

しかし、その後の１０時１５分から１８分までの期間においてＮＩＣ＃１による通信が開始されたとする。ＮＩＣ＃１のドライバ（またはＮＩＣ＃１）には異常が発生しているので、ＮＩＣ＃１によって開始された通信ではパケットロスが発生する。このため、１０時１８分における正常性判定処理で、パケットロス数から異常が検知される。このように、要因イベントの実行から長い時間遅れて異常が検知されるケースがある。 However, suppose that communication using NIC #1 is subsequently started between 10:15 and 10:18. Because an abnormality has occurred in the driver for NIC #1 (or NIC #1), packet loss occurs in the communication started by NIC #1. For this reason, the abnormality is detected from the number of packet losses in the normality determination process at 10:18. In this way, there are cases where an abnormality is detected a long time after the execution of the causal event.

ここで、図８で説明したように、イベントログデータベース１０４から運用イベントのうち要因イベントのログを検索する期間を、正常性の判定周期に相当する時間とする。この場合、図９において１０時１８分にパケットロス数から異常が検出されると、その直前の３分間がログの検索期間（Ｐ１とする）となる。しかし、検索期間Ｐ１においてはＮＩＣ＃１のドライバ更新を示すログＬＧ２を取得できないので、異常発生要因を判定できない。 As explained in FIG. 8, the period during which the event log database 104 is searched for logs of causal events among operational events is set to a time period corresponding to the normality determination cycle. In this case, when an abnormality is detected from the number of packet losses at 10:18 in FIG. 9, the log search period (P1) is the three minutes immediately preceding that time. However, during search period P1, log LG2 indicating the driver update for NIC#1 cannot be obtained, and therefore the cause of the abnormality cannot be determined.

このような問題を解決する方法としては、要因イベントのログの検索期間を長くする方法が考えられる。例えば図９に示すように、より長い検索期間Ｐ２を設定することで、ＮＩＣ＃１のドライバ更新を示すログＬＧ２を取得できるようになる。しかし、検索期間を長くするほど、イベントログデータベース１０４における検索対象のイベントログ数が多くなり、大量のイベントログの中から検索条件に合致する要因イベントのログを検索しなければならなくなる。このため、運用管理装置１００における検索処理にかかる時間が長くなり、それによって監視装置２００による異常発生要因の判定処理全体にかかる時間も長くなってしまう。また、運用管理装置１００における検索処理負荷が増大することで、場合によっては運用管理装置１００による運用イベントの実行処理に支障が出る可能性もある。 A possible method for solving such a problem is to lengthen the search period for logs of causal events. For example, as shown in FIG. 9, by setting a longer search period P2, it becomes possible to obtain log LG2 indicating a driver update for NIC#1. However, the longer the search period, the greater the number of event logs to be searched in the event log database 104, and it becomes necessary to search for logs of causal events that match the search conditions from among a large number of event logs. This increases the time required for search processing in the operation management device 100, which in turn increases the time required for the entire process of determining the cause of the abnormality by the monitoring device 200. Furthermore, an increase in the search processing load in the operation management device 100 may, in some cases, cause problems in the execution processing of operation events by the operation management device 100.
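The trade-off between the short period P1 and the longer period P2 can be illustrated with a toy event log. The log entries, the 10:10 execution time (assumed to fall in the 10:09-10:12 window from the text), and the 10:03 start chosen for P2 are all assumptions for this sketch.

```python
from datetime import datetime


def at(hour, minute):
    """Build a timestamp on an arbitrary example date."""
    return datetime(2021, 3, 1, hour, minute)


# Hypothetical event log: (execution time, event type, target).
EVENT_LOG = [
    (at(10, 1), "vm_migration", "vm7"),
    (at(10, 10), "nic_driver_update", "host1.nic1"),  # plays the role of LG2
]


def search_logs(log, start, end, event_types):
    """Return entries of the given types executed within [start, end]."""
    return [e for e in log if start <= e[0] <= end and e[1] in event_types]


# P1: the 3 minutes before detection at 10:18. P2: an assumed longer period.
p1 = search_logs(EVENT_LOG, at(10, 15), at(10, 18), {"nic_driver_update"})
p2 = search_logs(EVENT_LOG, at(10, 3), at(10, 18), {"nic_driver_update"})
```

Here `p1` is empty while `p2` contains the driver update entry. In a real event log the wider period would also force the search to scan many more entries, which is the cost the text describes.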

図１０は、第２の実施の形態における異常発生の要因判定処理を示す図である。本実施の形態において、監視装置２００の要因判定部２１３は、次のような手順で要因イベントログの検索期間を決定する。この図１０では、図９と同様にＮＩＣ＃１のドライバ更新に起因する異常がパケットロス数から検知されたものとする。 Figure 10 is a diagram showing the process of determining the cause of an abnormality in the second embodiment. In this embodiment, the cause determination unit 213 of the monitoring device 200 determines the search period for the cause event log in the following procedure. In this Figure 10, it is assumed that an abnormality caused by a driver update of NIC#1 has been detected from the number of packet losses, similar to Figure 9.

１０時１８分にパケットロス数から異常が検知されると、要因判定部２１３は、その時刻を要因イベントログの検索期間の終了時刻Ｔｅとする。また、要因判定部２１３は、メトリックデータベース２２１を参照し、パケットロス数とは異なる他のメトリックの中から、直前の正常性判定時刻において対応するリソースが使用されていないことを示すメトリックを特定する。図１０の例では、他のメトリックとして、リソースの使用量を示すメトリックであるＣＰＵ使用率およびネットワーク使用率が存在するとする。これらのメトリックは、数値が０の場合にリソースが使用されていないことを示す。このため、図１０の例では、数値が０であるメトリックとして、ＮＩＣ＃１のネットワーク使用率と、ＮＩＣ＃２のネットワーク使用率が特定される。 When an abnormality is detected from the packet loss count at 10:18, the cause determination unit 213 sets that time as the end time Te of the search period for the cause event log. The cause determination unit 213 also refers to the metric database 221 and identifies a metric that indicates that the corresponding resource was not in use at the immediately preceding normality determination time from among other metrics other than the packet loss count. In the example of FIG. 10, it is assumed that the other metrics include CPU utilization and network utilization, which are metrics that indicate the amount of resource usage. These metrics indicate that the resource is not in use when the numerical value is 0. For this reason, in the example of FIG. 10, the network utilization of NIC #1 and the network utilization of NIC #2 are identified as metrics with a numerical value of 0.

次に、要因判定部２１３は、特定されたメトリックのそれぞれについて過去に遡って数値を取得し、数値が０より大きい値に転じた時刻を特定する。これにより、メトリックに対応するリソースが使用状態であった期間の終端が特定される。図１０の例では、１０時６分においてＮＩＣ＃１のネットワーク使用率が０％から３０％に転じており、１０時９分においてＮＩＣ＃２のネットワーク使用率が０％から２０％に転じている。したがって、数値が０より大きい値に転じた時刻として、ＮＩＣ＃１のネットワーク使用率については１０時６分が特定され、ＮＩＣ＃２のネットワーク使用率については１０時９分が特定される。 Next, the cause determination unit 213 retrieves the values of each identified metric going back in time, and identifies the time at which, tracing backward, the value first turns to a value greater than 0. This identifies the end of the period during which the resource corresponding to the metric was in use. In the example of FIG. 10, tracing backward, the network usage rate of NIC #1 turns from 0% to 30% at 10:06, and the network usage rate of NIC #2 turns from 0% to 20% at 10:09. Therefore, 10:06 is identified for the network usage rate of NIC #1, and 10:09 is identified for the network usage rate of NIC #2.

要因判定部２１３は、このようにして特定された時刻の中から最も古い時刻を特定し、その時刻を要因イベントログの検索期間の開始時刻Ｔｓとする。図１０の例では、ＮＩＣ＃１のネットワーク使用率についての時刻である１０時６分が、検索期間の開始時刻Ｔｓと特定される。これにより、開始時刻Ｔｓから前述の終了時刻Ｔｅまでの期間が検索期間に決定される。このような検索期間から要因イベントログが検索されることで、要因判定部２１３は、ＮＩＣ＃１のドライバ更新を示すログＬＧ２を取得でき、異常発生要因を正確に判定できる。 The cause determination unit 213 identifies the oldest time from among the times identified in this manner, and sets this time as the start time Ts of the search period for the cause event log. In the example of FIG. 10, 10:06, which is the time for the network usage rate of NIC #1, is identified as the start time Ts of the search period. As a result, the period from the start time Ts to the aforementioned end time Te is determined as the search period. By searching for the cause event log from such a search period, the cause determination unit 213 can obtain log LG2 indicating a driver update for NIC #1, and can accurately determine the cause of the abnormality.
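The determination of the start time Ts and end time Te can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the metric names, the sample values other than the 30% (NIC #1, 10:06) and 20% (NIC #2, 10:09) points, and the fallback to a one-cycle period when no idle resource exists are all assumptions.

```python
from datetime import datetime


def at(hour, minute):
    """Build a timestamp on an arbitrary example date."""
    return datetime(2021, 3, 1, hour, minute)


# Usage metrics sampled every 3 minutes (percent). Only NIC #1 = 30 at 10:06 and
# NIC #2 = 20 at 10:09 follow the text; every other value is assumed.
HISTORY = {
    "host1.cpu_usage": {at(10, 6): 40, at(10, 9): 45, at(10, 12): 40, at(10, 15): 50},
    "host1.nic1.net_usage": {at(10, 6): 30, at(10, 9): 0, at(10, 12): 0, at(10, 15): 0},
    "host1.nic2.net_usage": {at(10, 6): 20, at(10, 9): 20, at(10, 12): 0, at(10, 15): 0},
}


def last_in_use(samples, before):
    """Walking back from `before`, return the first time whose value is above 0."""
    for ts in sorted((t for t in samples if t <= before), reverse=True):
        if samples[ts] > 0:
            return ts
    return None


def search_period(history, prev_time, detect_time):
    """Return (Ts, Te) for the cause event log search."""
    candidates = []
    for samples in history.values():
        if samples.get(prev_time) != 0:
            continue  # resource was in use (or unsampled) just before detection
        ts = last_in_use(samples, prev_time)
        if ts is not None:
            candidates.append(ts)
    # Assumed fallback: with no idle resource, use the one-cycle period.
    start = min(candidates) if candidates else prev_time
    return start, detect_time


t_start, t_end = search_period(HISTORY, prev_time=at(10, 15), detect_time=at(10, 18))
```

With this data the per-metric times are 10:06 for NIC #1 and 10:09 for NIC #2, so the oldest, 10:06, becomes Ts and the detection time 10:18 becomes Te.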

前述のように、あるリソースに関する異常の発生から検知までに時間がかかる場合、その異常は、リソースが使用されていない期間に実行された運用イベントを契機として発生した可能性がある。上記の処理では、メトリックの値が０より大きい値に転じた時刻のうち、最も古い時刻が検索期間の開始時刻とされる。これにより、異常が検知される直前まで使用されていない状態になっていたリソースに関係する運用イベントのログを、すべて検索対象に含めることができる。すなわち、要因イベントログの検索期間を必要最小限の長さに設定できる。このため、検索期間の長さを抑制しながら、検知された異常の発生の契機となった運用イベントのログを取得できる可能性が高まる。したがって、運用管理装置１００における検索処理時間を短縮し、それによって監視装置２００による異常発生要因の判定処理にかかる時間を短縮しつつ、その判定精度を高めることができる。また、異常発生要因の判定精度を高めつつ、運用管理装置１００における検索処理負荷を抑制できる。 As mentioned above, if it takes a long time from the occurrence of an abnormality related to a certain resource to its detection, the abnormality may have been triggered by an operation event executed during a period when the resource was not in use. In the above process, the oldest time among the times at which the metric value turned to a value greater than 0 is set as the start time of the search period. This makes it possible to include all logs of operation events related to resources that were not in use until just before the abnormality was detected in the search target. In other words, the search period for the cause event log can be set to the minimum required length. This increases the possibility of obtaining the log of the operation event that triggered the occurrence of the detected abnormality while suppressing the length of the search period. Therefore, the search processing time in the operation management device 100 can be shortened, thereby shortening the time required for the monitoring device 200 to determine the cause of the abnormality while improving the accuracy of the determination. In addition, the search processing load in the operation management device 100 can be suppressed while improving the accuracy of the determination of the cause of the abnormality.

なお、図１０では、異常の発生から検知までに時間がかかる例として、パケットロス数の異常検知に応じて、他のメトリックとしてネットワーク使用率の数値変化が解析される例を示した。他の例としては、メトリックとしてＣＰＵ待ち時間から異常が検知された場合に、他のメトリックとしてＣＰＵ使用量の数値変化が解析される場合が考えられる。 In addition, FIG. 10 shows an example in which it takes time from the occurrence of an anomaly to its detection, and in response to the detection of an anomaly in the number of packet losses, the numerical change in network utilization is analyzed as another metric. As another example, when an anomaly is detected from the CPU wait time as a metric, the numerical change in CPU usage is analyzed as another metric.

図１１は、第２の実施の形態における監視装置の処理手順を示すフローチャートの例である。図１１の処理は、正常性の判定時刻ごとに実行される。 FIG. 11 is an example of a flowchart showing the processing procedure of the monitoring device in the second embodiment. The processing in FIG. 11 is executed at each normality determination time.
［ステップＳ１１］メトリック取得部２１１は、運用管理装置１００のメトリックデータベース１０５から、現判定時刻から前回の判定時刻までの期間に収集されたメトリックを取得し、メトリックデータベース２２１に登録する。 [Step S11] The metric acquisition unit 211 acquires, from the metric database 105 of the operation management device 100, the metrics collected during the period between the previous determination time and the current determination time, and registers them in the metric database 221.

［ステップＳ１２］正常性判定部２１２は、ステップＳ１１で登録されたメトリックのうちあらかじめ決められた１以上のメトリックに基づいて、ＩＣＴインフラ１１０の正常性を判定する。メトリックに基づいて異常が検知された場合、処理がステップＳ１３に進められる。この場合、異常が検知されたメトリックが正常性判定部２１２から要因判定部２１３に通知される。そして、ステップＳ１３～Ｓ１７の処理は、通知されたメトリックごとに実行される。一方、いずれのメトリックからも異常が検知されなかった場合、図１１の処理が終了される。 [Step S12] The normality determination unit 212 determines the normality of the ICT infrastructure 110 based on one or more predetermined metrics from the metrics registered in step S11. If an abnormality is detected based on the metrics, the process proceeds to step S13. In this case, the normality determination unit 212 notifies the cause determination unit 213 of the metric in which the abnormality was detected. Then, the processes of steps S13 to S17 are executed for each notified metric. On the other hand, if no abnormality is detected from any of the metrics, the process of FIG. 11 is terminated.

［ステップＳ１３］要因判定部２１３は、判定ルールデータベース２２２に基づいて、異常が検知されたメトリックに対応する要因イベント（異常発生要因の候補となる運用イベント）を特定する。 [Step S13] The cause determination unit 213 identifies a cause event (an operational event that is a candidate for the cause of the abnormality) corresponding to the metric in which the abnormality was detected, based on the determination rule database 222.

［ステップＳ１４］要因判定部２１３は、異常が検知されたメトリックとは異なる他のメトリックの中から、異常検知時刻の直前の正常性判定時刻において、対応するリソースが不使用状態であることを示すメトリックを特定する。例えば、リソースの使用量を示すメトリックの中から、異常検知時刻の直前の正常性判定時刻において数値が０であるメトリックを特定する。 [Step S14] The cause determination unit 213 identifies, from among metrics other than the metric in which the abnormality was detected, a metric that indicates that the corresponding resource was unused at the normality determination time immediately prior to the abnormality detection time. For example, from among metrics that indicate the amount of resource usage, it identifies a metric whose numerical value is 0 at the normality determination time immediately prior to the abnormality detection time.

［ステップＳ１５］要因判定部２１３は、メトリックデータベース２２１から、ステップＳ１４で特定された各メトリックについて過去に遡って数値を取得する。そして、要因判定部２１３は、各メトリックについて、リソースの使用状態が不使用状態から使用状態に変化した時刻を特定する。上記のようにリソースの使用量を示すメトリックの場合、メトリックの値が０からそれより大きい値に転じた時刻が特定される。 [Step S15] The cause determination unit 213 retrieves, from the metric database 221, the values of each metric identified in step S14, going back in time. Then, for each metric, the cause determination unit 213 identifies the time at which, tracing backward, the resource state turns from the unused state to the in-use state. For a metric that indicates the amount of resource usage as described above, this is the time at which, tracing backward, the metric value turns from 0 to a value greater than 0.

なお、リソースの使用量を示すメトリックを用いた場合、ステップＳ１４，Ｓ１５では、メトリックの値が０か、それより大きいかという判定基準が用いられたが、この判定基準としては０より大きい判定閾値が用いられてもよい。例えば、判定閾値を０．０１とし、ステップＳ１４では数値が０．０１以下のメトリックが特定され、ステップＳ１５ではメトリックの値が０．０１以下から０．０１を超えた時刻が特定されてもよい。 When a metric indicating resource usage is used, in steps S14 and S15, the criterion used is whether the metric value is 0 or greater, but a judgment threshold greater than 0 may be used as the judgment criterion. For example, the judgment threshold may be 0.01, and in step S14, metrics whose numerical values are 0.01 or less may be identified, and in step S15, the time when the metric value changes from 0.01 or less to exceed 0.01 may be identified.
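The effect of using a threshold above 0 can be seen in a short sketch. The sample series and the helper names are assumptions; only the threshold value 0.01 comes from the text.

```python
def is_idle(value, threshold=0.01):
    """Treat a usage value at or below the threshold as 'resource not in use'."""
    return value <= threshold


def last_in_use_index(values, threshold=0.01):
    """Scanning from the newest sample, return the index of the first non-idle one."""
    for i in range(len(values) - 1, -1, -1):
        if not is_idle(values[i], threshold):
            return i
    return None


# Hypothetical NIC usage samples, oldest first, with small residual noise.
nic1 = [30.0, 0.004, 0.0, 0.008]
```

With the threshold at 0.01 the newest sample counts as idle and the scan goes back to index 0. With a threshold of exactly 0, the residual 0.008 would already count as in use, so the metric would not be selected in step S14 at all.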

［ステップＳ１６］要因判定部２１３は、ステップＳ１５で特定された時刻の中から最も古い時刻を特定し、その時刻を要因イベントログの検索期間の開始時刻Ｔｓに決定する。 [Step S16] The cause determination unit 213 identifies the oldest time from among the times identified in step S15, and determines that time as the start time Ts of the search period for the cause event log.

［ステップＳ１７］要因判定部２１３は、現判定時刻（終了時刻Ｔｅ）から上記の開始時刻Ｔｓまでの期間を検索期間とし、この検索期間と、ステップＳ１３で特定された要因イベントの識別情報とを引数で指定して、運用管理装置１００に対してイベントログの検索を依頼する。運用管理装置１００のイベントログ検索部１０２は、指定された検索期間に収集された運用イベントのログの中から、指定された要因イベントのログを抽出して、監視装置２００に返信する。要因判定部２１３は、抽出された要因イベントのログを受信し、取得する。 [Step S17] The cause determination unit 213 sets the period from the current determination time (end time Te) to the above-mentioned start time Ts as the search period, and requests the operation management device 100 to search for event logs by specifying this search period and the identification information of the cause event identified in step S13 as arguments. The event log search unit 102 of the operation management device 100 extracts the log of the specified cause event from the logs of operation events collected during the specified search period, and returns it to the monitoring device 200. The cause determination unit 213 receives and acquires the log of the extracted cause event.

［ステップＳ１８］要因判定部２１３は、判定ルールデータベース２２２を参照し、異常が検知されたメトリック（異常検知メトリック）と、ステップＳ１７で取得されたログが示す要因イベントとに対応付けられた要因を取得する。要因判定部２１３は、取得された要因を異常発生要因と判定し、その判定結果を出力する。例えば、判定結果は、異常検知時刻、監視ホスト名、監視箇所、異常検知メトリック、および取得された要因の組み合わせとして判定結果データベース２２３に登録される。 [Step S18] The cause determination unit 213 refers to the determination rule database 222 and obtains the cause associated with the metric in which the abnormality was detected (anomaly detection metric) and the cause event indicated by the log obtained in step S17. The cause determination unit 213 determines that the obtained cause is the abnormality occurrence cause and outputs the determination result. For example, the determination result is registered in the determination result database 223 as a combination of the abnormality detection time, monitoring host name, monitoring location, anomaly detection metric, and the obtained cause.

ここで、監視ホスト名および監視箇所は、異常検知メトリック、ステップＳ１７で取得されたログが示す要因イベントの内容、これらに基づく異常発生要因の少なくとも１つ、または２つ以上の組み合わせから特定される。例えば、要因イベントがＮＩＣドライバ更新の場合、更新されたＮＩＣドライバに対応するＮＩＣが監視箇所として特定され、そのＮＩＣが搭載されたホスト（サーバ装置）の名前が監視ホスト名として特定される。また、異常検知メトリックがＣＰＵ待ち時間、要因イベントがＶＭマイグレーションの場合、監視箇所はＣＰＵ待ち時間の検出対象とされたＣＰＵとして特定され、そのＣＰＵが搭載されたホストの名前が監視ホスト名として特定される。 Here, the monitoring host name and the monitoring location are identified from at least one of, or a combination of two or more of, the following: the anomaly detection metric, the content of the cause event indicated by the log acquired in step S17, and the abnormality cause derived from these. For example, if the cause event is a NIC driver update, the NIC corresponding to the updated NIC driver is identified as the monitoring location, and the name of the host (server device) on which that NIC is installed is identified as the monitoring host name. Also, if the anomaly detection metric is the CPU wait time and the cause event is a VM migration, the monitoring location is identified as the CPU for which the CPU wait time was measured, and the name of the host on which that CPU is installed is identified as the monitoring host name.

なお、ステップＳ１７の検索で複数の要因イベントのログが取得された場合、ステップＳ１８では、各要因イベントに基づく異常発生要因が、それぞれ可能性のある異常発生要因として出力されればよい。 If the search in step S17 returns logs of multiple causal events, in step S18, the causes of the abnormality based on each causal event may be output as possible causes of the abnormality.
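Steps S13, S17, and S18 can be sketched together as follows, including the case where several cause event logs are found. The log fields (`host`, `location`), the event identifiers, and the extra `firmware_check` entry are assumptions for illustration; the result fields mirror those listed for the determination result database.

```python
from datetime import datetime


def at(hour, minute):
    """Build a timestamp on an arbitrary example date."""
    return datetime(2021, 3, 1, hour, minute)


# Hypothetical rule table: (metric, cause event) -> cause.
RULES = {
    ("packet_loss_count", "vm_migration"): "network contention",
    ("packet_loss_count", "nic_driver_update"): "NIC driver malfunction",
}

# Hypothetical event log entries; field names are assumptions.
EVENT_LOG = [
    {"time": at(10, 7), "event": "vm_migration", "host": "host1", "location": "NIC#2"},
    {"time": at(10, 10), "event": "nic_driver_update", "host": "host1", "location": "NIC#1"},
    {"time": at(10, 11), "event": "firmware_check", "host": "host1", "location": "BMC"},
]


def judge(metric, start, end, log=EVENT_LOG, rules=RULES):
    """Return one result record per cause event log found in [start, end]."""
    wanted = {event for (m, event) in rules if m == metric}  # step S13
    hits = [e for e in log if start <= e["time"] <= end and e["event"] in wanted]  # step S17
    return [  # step S18: every hit is reported as a possible cause
        {
            "detected_at": end,
            "host": e["host"],
            "location": e["location"],
            "metric": metric,
            "cause": rules[(metric, e["event"])],
        }
        for e in hits
    ]


results = judge("packet_loss_count", at(10, 6), at(10, 18))
```

Both the VM migration and the driver update fall inside the period in this toy log, so two candidate causes are returned, while the unrelated `firmware_check` entry is ignored because no rule names it.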

〔第２の実施の形態の変形例〕 [Modification of the second embodiment]
第２の実施の形態における監視装置２００の処理の一部は、以下のように変形されてもよい。 A part of the processing of the monitoring device 200 in the second embodiment may be modified as follows.

図１２は、変形例における異常発生の要因判定処理を示す図である。この図１２では、図９、図１０と同様にＮＩＣ＃１のドライバ更新に起因する異常がパケットロス数から検知されたものとする。 Figure 12 is a diagram showing the process of determining the cause of an abnormality in a modified example. In this Figure 12, it is assumed that an abnormality caused by a driver update of NIC#1 is detected from the number of packet losses, as in Figures 9 and 10.

図９、図１０、図１２のように異常の発生から検知までに時間がかかるケースでは、使用されていない状態のリソースに関して異常が発生した後、そのリソースの使用が開始されることで異常が検知される。そこで、要因判定部２１３は、メトリックの異常が検知されると、それとは異なる他のメトリックの中から、メトリックの値に基づき、その直前の正常性判定時刻から現判定時刻までの期間に対応するリソースの使用が開始されたメトリックを特定する。そして、要因判定部２１３は、特定されたメトリックについて過去に遡って数値を取得し、取得した数値に基づき、対応するリソースが使用状態であった期間の終端を特定して、要因イベントログの検索期間の開始時刻を決定する。 In cases where it takes time from the occurrence of an abnormality to its detection, as in FIGS. 9, 10, and 12, the abnormality occurs in a resource that is not in use and is detected once use of that resource begins. Thus, when an abnormality in a metric is detected, the cause determination unit 213 identifies, from among the other metrics and based on their values, a metric whose corresponding resource began to be used during the period from the immediately preceding normality determination time to the current determination time. The cause determination unit 213 then retrieves the values of the identified metric going back in time and, based on those values, identifies the end of the period during which the corresponding resource was previously in use, and thereby determines the start time of the search period for the cause event log.

In the example of FIG. 12, when an abnormality is detected from the number of packet losses at 10:18, the cause determination unit 213 identifies, from among the other metrics that indicate resource usage (that is, metrics other than the number of packet losses), a metric whose value was 0 at the immediately preceding normality determination time and exceeds 0 at the current determination time. In FIG. 12, the network utilization of NIC#1 is identified as such a metric. The cause determination unit 213 then acquires the values of the identified network utilization going back in time and identifies the time at which, in this backward trace, the value turns from 0 to a value greater than 0. In FIG. 12, the network utilization of NIC#1 turns from 0% to 30% at 10:06, so 10:06 is identified as that time and is set as the start time Ts of the search period.
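The backward trace in this example can be sketched in Python as follows. This is a minimal illustration, not the implementation of the cause determination unit 213: the function name `last_in_use_time`, the 6-minute sampling interval, the date, and all sample values other than the 30% reading at 10:06 and the detection at 10:18 are assumptions.

```python
from datetime import datetime


def last_in_use_time(samples, detection_time, threshold=0.0):
    """Scan backward from just before detection_time and return the most
    recent sample time whose value exceeds threshold, or None if none does."""
    for time, value in sorted(samples, reverse=True):
        if time >= detection_time:
            continue  # only look at samples before the detection time
        if value > threshold:
            return time
    return None


day = datetime(2021, 3, 1)  # arbitrary date for the example

# Hypothetical samples of the NIC#1 network utilization (%).
nic1_utilization = [
    (day.replace(hour=9, minute=54), 25.0),
    (day.replace(hour=10, minute=0), 28.0),
    (day.replace(hour=10, minute=6), 30.0),   # last reading while in use
    (day.replace(hour=10, minute=12), 0.0),   # unused just before detection
    (day.replace(hour=10, minute=18), 20.0),  # in use again at detection
]

ts = last_in_use_time(nic1_utilization, day.replace(hour=10, minute=18))
print(ts.strftime("%H:%M"))  # 10:06
```

With these sample values, the scan skips the 0% reading at 10:12 and stops at the 30% reading at 10:06, which becomes the start time Ts of the search period.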

The above process narrows down the metrics whose value changes are analyzed to identify the end of the period during which the corresponding resource was in use. This reduces the processing load for determining the start time of the search period and shortens the processing time. In addition, by identifying metrics whose corresponding resource started to be used immediately before the abnormality detection time, the analysis of value changes can be limited to metrics that are likely to be related to the detected abnormality. The processing time for determining the search period can therefore be shortened without lowering the accuracy of determining the cause of the abnormality, and as a result the overall time of the cause determination process can be shortened.

FIG. 13 is an example of a flowchart showing the processing procedure of the monitoring device in the modification. In this modification, among the processing steps of the flowchart shown in FIG. 11, step S14 is replaced with step S14a described below.

[Step S14a] The cause determination unit 213 identifies, from among the metrics other than the metric in which the abnormality was detected, a metric whose corresponding resource was in an unused state at the normality determination time immediately before the abnormality detection time and has changed to a used state at the abnormality detection time. For example, from among the metrics indicating the amount of resource usage, it identifies a metric whose value was 0 at the normality determination time immediately before the abnormality detection time and is greater than 0 at the abnormality detection time.

In the subsequent step S15, each metric identified in step S14a becomes a target of value acquisition. Compared with the second embodiment, this narrows down the metrics whose values are acquired.
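The selection in step S14a can be sketched as a simple filter over two snapshots of metric values. The following Python example is only an illustration under stated assumptions: the function name `select_newly_used_metrics`, the metric names, and the sample values are hypothetical, and the `threshold` parameter generalizes the value 0 to a predetermined value in the manner of the claims.

```python
def select_newly_used_metrics(previous, current, detected_metric, threshold=0.0):
    """Return the names of metrics, other than the detected one, whose value
    was at or below threshold at the previous determination time and is
    above threshold at the abnormality detection time."""
    return [
        name for name, value in current.items()
        if name != detected_metric
        and name in previous
        and previous[name] <= threshold
        and value > threshold
    ]


# Hypothetical metric values at the preceding normality determination time.
previous = {
    "nic1_packet_loss": 0,
    "nic1_utilization": 0.0,
    "nic2_utilization": 40.0,
    "disk_io": 0.0,
}
# Hypothetical metric values at the abnormality detection time.
current = {
    "nic1_packet_loss": 12,
    "nic1_utilization": 20.0,
    "nic2_utilization": 45.0,
    "disk_io": 0.0,
}

targets = select_newly_used_metrics(previous, current, "nic1_packet_loss")
print(targets)  # ['nic1_utilization']
```

Here `nic2_utilization` is excluded because its resource was already in use, and `disk_io` is excluded because its resource is still unused, so only one metric remains as a target of value acquisition.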

The processing functions of the devices described in the above embodiments (for example, the abnormality cause determination device 1, the operation management device 100, and the monitoring device 200) can be realized by a computer. In that case, a program describing the processing of the functions that each device should have is provided, and the processing functions are realized on the computer by executing that program. The program describing the processing can be recorded on a computer-readable recording medium. Computer-readable recording media include magnetic storage devices, optical discs, and semiconductor memories. Magnetic storage devices include hard disk drives (HDDs) and magnetic tapes. Optical discs include CDs (Compact Discs), DVDs (Digital Versatile Discs), and Blu-ray Discs (BD, registered trademark).

To distribute the program, for example, portable recording media such as DVDs and CDs on which the program is recorded are sold. The program can also be stored in a storage device of a server computer and transferred from the server computer to other computers via a network.

A computer that executes the program stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. Alternatively, each time the program is transferred from a server computer connected via a network, the computer can sequentially execute processing according to the received program.

1 Abnormality cause determination device
2 Metric database
3 Event log database
4 Time chart
5 Log
M1 to M4 Metrics
S1 to S5 Steps
T1 to T5 Times

Claims

A method for determining an abnormality cause, in which a computer executes a process comprising:
when an abnormality is detected at a first time based on a first metric among a plurality of metrics indicating usage status of resources included in an information processing system, identifying one or more second metrics from among the plurality of metrics excluding the first metric, the second metrics indicating that a corresponding resource was in an unused state immediately before the first time;
identifying, for each of the one or more second metrics, a second time at which, tracing back into the past from immediately before the first time, the corresponding resource changes from an unused state to a used state, based on the usage status indicated by each of the one or more second metrics;
specifying a search period from a third time, which is the oldest of the identified second times, to the first time, and acquiring, from a database in which logs of events executed in the information processing system are accumulated, logs of candidate events that are candidates for the cause of the abnormality detected based on the first metric and that were executed during the search period; and
determining the cause of the abnormality detected based on the first metric on the basis of the candidate events indicated by the acquired logs.

The method for determining an abnormality cause according to claim 1, wherein:
in identifying the one or more second metrics, a metric whose value is equal to or less than a predetermined value immediately before the first time is identified as the one or more second metrics from among the plurality of metrics excluding the first metric; and
in identifying the second time, a time at which, tracing back from immediately before the first time, the value of each of the one or more second metrics changes from being equal to or less than the predetermined value to exceeding the predetermined value is identified as the second time.

The method for determining an abnormality cause according to claim 1, wherein:
the computer further identifies one or more third metrics from among the one or more second metrics, the third metrics indicating that a corresponding resource has changed from an unused state to a used state at the first time; and
in identifying the second time, the second time is identified for each of the one or more third metrics.

The method for determining an abnormality cause according to claim 3, wherein:
in identifying the one or more second metrics, a metric whose value is equal to or less than a predetermined value immediately before the first time is identified as the one or more second metrics from among the plurality of metrics excluding the first metric;
in identifying the one or more third metrics, a metric whose value has changed from equal to or less than the predetermined value to more than the predetermined value at the first time is identified as the one or more third metrics from among the one or more second metrics; and
in identifying the second time, a time at which, tracing back from immediately before the first time, the value of each of the one or more third metrics changes from being equal to or less than the predetermined value to exceeding the predetermined value is identified as the second time.

The method for determining an abnormality cause according to claim 1, wherein:
the computer executes, at each determination time occurring at a predetermined time interval, a determination process for determining the presence or absence of an abnormality based on each of a plurality of specific metrics among the plurality of metrics; and
in identifying the one or more second metrics, when an abnormality is detected based on the first metric among the specific metrics at the first time among the determination times, the one or more second metrics indicating that a corresponding resource was in an unused state at the determination time immediately before the first time are identified from among the plurality of metrics excluding the first metric.

An abnormality cause determination program that causes a computer to execute a process comprising:
when an abnormality is detected at a first time based on a first metric among a plurality of metrics indicating usage status of resources included in an information processing system, identifying one or more second metrics from among the plurality of metrics excluding the first metric, the second metrics indicating that a corresponding resource was in an unused state immediately before the first time;
identifying, for each of the one or more second metrics, a second time at which, tracing back into the past from immediately before the first time, the corresponding resource changes from an unused state to a used state, based on the usage status indicated by each of the one or more second metrics;
specifying a search period from a third time, which is the oldest of the identified second times, to the first time, and acquiring, from a database in which logs of events executed in the information processing system are accumulated, logs of candidate events that are candidates for the cause of the abnormality detected based on the first metric and that were executed during the search period; and
determining the cause of the abnormality detected based on the first metric on the basis of the candidate events indicated by the acquired logs.