JP4456082B2

JP4456082B2 - Failure prediction system and failure prediction program

Info

Publication number: JP4456082B2
Application number: JP2006017158A
Authority: JP
Inventors: 泰治俣野; 秀和濱砂
Original assignee: 株式会社日立情報システムズ
Priority date: 2006-01-26
Filing date: 2006-01-26
Publication date: 2010-04-28
Anticipated expiration: 2026-01-26
Also published as: JP2007199976A

Description

The present invention relates to a failure prediction system and a failure prediction program that can predict the occurrence of failures in a computer system. In particular, it relates to a failure prediction system and a failure prediction program that efficiently calculate the probability of failure of a service system in a computer system with many connected clients, and that proactively notify the persons in charge and other concerned parties of the times at which a failure is highly probable.

In general, a computer system that connects to many clients and provides system services such as ASP (Application Service Provider) services is known to suffer unexpected failures, and the resulting service outages, caused by increased load on resources, hardware faults, or human error during routine or ad hoc work. In the prior art, such failures were handled only after they had occurred, by calling in the responsible department or vendor according to the part that had failed.

As a technique for responding to such failures, the prior art constantly monitors the operating status using a monitoring system's SNMP (the standardized management protocol for TCP/IP network environments) or ICMP (a control protocol provided to support the functions of the TCP/IP protocol), and contacts the person in charge when a threshold is exceeded or when node alive monitoring receives no ICMP reply packet. With this monitoring method, however, the failure is very likely to have occurred before the person responsible for the failed part can begin to respond (arrive on site).

A technique that detects an impending failure and automatically takes measures against it, by providing a computer with means for determining failures related to its operating status and with failure avoidance means that performs processing based on an avoidance method, is described in the following patent document.
Japanese Patent Laid-Open No. 10-49219

Because the prior art described above monitors resources such as CPU and memory together with running processes, and predicts client failures from their trends, it cannot predict failures caused by human error during work or failures from unforeseeable causes such as disk or file corruption.

An object of the present invention is to eliminate these shortcomings of the prior art and to provide a failure prediction system and program that can estimate in advance the dates and times at which failures in a computer system become more likely, so that preventive maintenance and the staffing of persons in charge can be carried out efficiently.

To achieve the above object, a first feature of the present invention is a failure prediction system that predicts the occurrence of failures in a computer system connected to a plurality of clients, comprising:
a failure information database that stores, for each of a plurality of failures that occurred in the past, sample data including a phenomenon category that classifies the phenomenon caused by the failure by failure phenomenon type, a factor category that classifies the cause of the failure by failure factor type, and the downtime caused by the failure so classified;
a failure prediction function unit that takes as the downtime the time from occurrence to recovery of each failure classified by phenomenon category and factor category in the failure information database, calculates the sum of the downtimes of the failures that share the same phenomenon category and factor category as the total downtime, and uses the total downtime to calculate the failure occurrence probability for failures of that phenomenon category and factor category;
a parameter setting file that sets the time variable k used when the failure prediction function unit performs its calculation;
a failure prediction database that stores failure prediction data, calculated by the failure prediction function unit, including the failure occurrence probability for failures of the same phenomenon category and factor category; and
a failure occurrence high probability notification function unit that notifies a preset destination when the failure occurrence probability in the failure prediction data stored in the failure prediction database exceeds a preset threshold,
wherein, where k is the time variable, X is the time from an arbitrary past date and time to the present, x is the total downtime, and P' is the linear probability, the failure prediction function unit calculates the linear probability P' by the formula "100 × (x + k) / X" while incrementing k by "k + 1" until k reaches a predetermined value, and applies tanh function processing to the variation of the calculated linear probability P' so that the maximum failure occurrence probability stays below 100%, thereby calculating the failure occurrence probability up to the point at which the time variable k reaches the predetermined value.
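The calculation named in the first feature can be illustrated with a minimal Python sketch. Only the formula 100 × (x + k) / X comes from the text. The starting value k = 1, the function names, and in particular the mapping 100 × tanh(P'/100) are assumptions made for illustration, since this passage states only that tanh is applied so that the probability stays below 100%.

```python
import math


def linear_probabilities(total_downtime, elapsed, k_max):
    """Return P'(k) = 100 * (x + k) / X for k = 1 .. k_max.

    total_downtime is x and elapsed is X, both in the same unit (delta t).
    Starting k at 1 is an assumption; the text only says k is incremented.
    """
    if elapsed <= 0:
        raise ValueError("elapsed time X must be positive")
    return [100.0 * (total_downtime + k) / elapsed for k in range(1, k_max + 1)]


def squash_with_tanh(p_linear):
    """Map a linear probability (%) into the range below 100%.

    Assumed mapping: 100 * tanh(P' / 100). Mathematically tanh is below 1
    for every finite input, so the result stays below 100%.
    """
    return 100.0 * math.tanh(p_linear / 100.0)


def failure_probabilities(total_downtime, elapsed, k_max):
    """Linear probabilities followed by the assumed tanh mapping."""
    return [squash_with_tanh(p) for p in linear_probabilities(total_downtime, elapsed, k_max)]
```

With x = 10 and X = 1000, the first linear probabilities are 1.1%, 1.2%, and 1.3%; the tanh step barely changes such small values and only flattens the curve as P' approaches and exceeds 100%.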

A second feature of the present invention is that the failure prediction system further comprises a phenomenon category master that stores the phenomenon categories, which classify the phenomena caused by failures by failure phenomenon type, and a factor category master that stores the factor categories, which classify the causes of failures by failure factor type, and that the failure prediction database stores the total downtime for every combination of the phenomenon categories and factor categories stored in the phenomenon category master and the factor category master.

Further, a third feature of the present invention is a failure prediction program for a failure prediction system that predicts the occurrence of failures in a computer system connected to a plurality of clients, the system comprising:
a failure information database that stores, for each of a plurality of failures that occurred in the past, sample data including a phenomenon category that classifies the phenomenon caused by the failure by failure phenomenon type, a factor category that classifies the cause of the failure by failure factor type, and the downtime caused by the failure so classified;
a failure prediction function unit that takes as the downtime the time from occurrence to recovery of each failure classified by phenomenon category and factor category in the failure information database, calculates the sum of the downtimes of the failures that share the same phenomenon category and factor category as the total downtime, and uses the total downtime to calculate the failure occurrence probability for failures of that phenomenon category and factor category;
a parameter setting file that sets the time variable k used when the failure prediction function unit performs its calculation;
a failure prediction database that stores failure prediction data, calculated by the failure prediction function unit, including the failure occurrence probability for failures of the same phenomenon category and factor category; and
a failure occurrence high probability notification function unit that notifies a preset destination when the failure occurrence probability in the failure prediction data stored in the failure prediction database exceeds a preset threshold,
the program causing the failure prediction function unit to realize a function in which, where k is the time variable, X is the time from an arbitrary past date and time to the present, x is the total downtime, and P' is the linear probability, the linear probability P' is calculated by the formula "100 × (x + k) / X" while k is incremented by "k + 1" until k reaches a predetermined value, and tanh function processing is applied to the variation of the calculated linear probability P' so that the maximum failure occurrence probability stays below 100%, thereby calculating the failure occurrence probability up to the point at which the time variable k reaches the predetermined value.

A fourth feature of the present invention is that the failure prediction system further comprises a phenomenon category master that stores the phenomenon categories, which classify the phenomena caused by failures by failure phenomenon type, and a factor category master that stores the factor categories, which classify the causes of failures by failure factor type, and that the failure prediction database stores the total downtime for every combination of the phenomenon categories and factor categories stored in the phenomenon category master and the factor category master.

In the failure prediction system and failure prediction method according to the present invention, the failure prediction function unit calculates the total downtime from the downtimes recorded in the failure information database for each combination of phenomenon category and factor category. From that combination and its total downtime it then uses a linear probability calculation to determine which combinations of phenomenon category and factor category will fail in the future, the dates on which those failures will occur, and their failure occurrence probabilities. The failure occurrence high probability notification function unit notifies a preset destination when the failure occurrence probability in the failure prediction data stored in the failure prediction database exceeds a preset threshold. Together these functions make it possible to predict failures that are likely to occur in the future. In particular, the present invention calculates the total downtime for each combination of phenomenon and factor observed in past failures, and performs the linear probability calculation on the premise that a combination with a large total downtime has a high probability of failing again, so that the phenomenon and factor of failures that are likely to occur in the future can be predicted.

An embodiment of the failure prediction system and program according to the present invention is described in detail below with reference to the drawings.
FIG. 1 is a configuration diagram of a failure prediction system to which the failure prediction program according to the first embodiment of the present invention is applied.
FIG. 2 is a flowchart showing a first example of the processing performed by the failure prediction function and the failure occurrence high probability notification function of this embodiment.
FIG. 3 is a flowchart showing the detailed operation of the system time acquisition processing of this embodiment.
FIG. 4 is a flowchart showing the detailed operation of the number-of-time-units calculation processing of this embodiment.
FIG. 5 is a flowchart showing the detailed operation of the processing that acquires the combinations of phenomenon and factor categories in this embodiment.
FIG. 6 is a diagram showing the concept of the processing that calculates the total downtime of failures of a given phenomenon and factor in this embodiment.
FIG. 7 is a flowchart showing the detailed operation of the processing that calculates the total downtime of failures of a given phenomenon and factor in this embodiment.
FIG. 8 is a flowchart showing the detailed operation of the linear probability calculation processing up to the future N in this embodiment.
FIG. 9 is a graph of the tanh function used in the tanh function application processing of this embodiment.
FIG. 10 is a flowchart showing the detailed operation of the tanh function application processing of this embodiment.
FIG. 11 is a flowchart showing the detailed operation of the failure occurrence high probability notification processing and the "notification flag = 1" processing of this embodiment.
FIG. 12 is a diagram showing the structure of the failure prediction database of this embodiment, with sample data.
FIG. 13 is a diagram showing the structure of the failure information database of this embodiment, with sample data.
FIG. 14 is a diagram showing the structure of the phenomenon category master of this embodiment, with sample data.
FIG. 15 is a diagram showing the structure of the factor category master of this embodiment, with sample data.
FIG. 16 is a diagram showing the structure of the parameter setting file of this embodiment, with sample data.
FIG. 17 is a diagram showing the structure of the work database 108 of this embodiment, with sample data.
FIG. 18 is a diagram showing the structure of the work database 109 of this embodiment, with sample data.
FIG. 19 is a diagram showing the structure of the linear probability data of this embodiment, with sample data.
<Description of overall configuration>

As shown in FIG. 1, the failure prediction system 101 to which the failure prediction program of this embodiment is applied comprises: a failure prediction function unit 102 that reads sample data from a failure information database (DB) 111, which stores as sample data the failure information collected from other computer systems through a failure information system 112; a failure prediction database (DB) 104 that stores the failure prediction information produced by the failure prediction function unit 102; a failure occurrence high probability notification function unit 103 that receives reports from the failure prediction function unit 102 and notifies the relevant persons in charge and other concerned parties of the failure prediction information; a phenomenon category master 105 that stores the phenomena of past failures classified into phenomenon categories; a factor category master 106 that stores the factors (causes) of past failures classified into factor categories; a parameter setting file 107 for setting the parameters, that is, the thresholds, used to analyze the phenomenon information and factor information stored in the phenomenon category master 105 and the factor category master 106; a work database (DB) 108 and a work database (DB) 109, both described later; and a linear probability database (DB) 110 that stores the linear probability data used to calculate the probability of failure with reference to the data stored in the phenomenon category master 105.

As shown in FIG. 13, the failure information DB 111 stores past failure information as sample data. Each record holds the items "failure number", "affected member code", "case name", "occurrence date and time", "recovery date and time", "downtime", "phenomenon category", "factor category", "work category", "phenomenon details", "factor details", "response", and "recurrence prevention measure". For example, it records that the failure with failure number "00000001" affected the member with affected member code "JPNGX0001"; that the case, named "program bug", occurred at 23:30 on August 10, 2002 and was recovered at 01:10 on August 11, 2002; that the downtime was "1 hour"; and that the phenomenon category was "0001", the factor category "0004", the work category "0000021912", the phenomenon details "wrong number of order data records", the factor details "insufficient specification check", the response "rolled back", and the recurrence prevention measure "specification review".
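A record of the failure information DB can be pictured as a small data structure. The following Python sketch is illustrative only: the class name, field names, and types are hypothetical, and only a subset of the items listed in the text is modeled. The stored downtime is kept as its own field rather than derived from the two timestamps, because the text does not say how the stored value is rounded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailureRecord:
    """One row of the failure information DB (hypothetical field names)."""
    failure_no: str
    member_code: str
    case_name: str
    occurred_at: datetime
    recovered_at: datetime
    downtime_hours: int   # stored value, kept exactly as recorded
    phenomenon: str       # phenomenon category code
    factor: str           # factor category code

    def outage_interval(self) -> timedelta:
        """Elapsed time between occurrence and recovery."""
        return self.recovered_at - self.occurred_at


# The sample row described in the text.
sample = FailureRecord(
    failure_no="00000001",
    member_code="JPNGX0001",
    case_name="program bug",
    occurred_at=datetime(2002, 8, 10, 23, 30),
    recovered_at=datetime(2002, 8, 11, 1, 10),
    downtime_hours=1,
    phenomenon="0001",
    factor="0004",
)
```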

In this way, the failure information DB 111 of this embodiment holds, as a database, for a plurality of failures that occurred in the past, the "phenomenon category" that classifies the phenomenon caused by each failure by failure phenomenon type, the "factor category" that classifies its cause by failure factor type, the "downtime" for which the system was stopped by the failure, the "occurrence time" of the failure, and so on.

The failure prediction DB 104 stores failure prediction information about failures predicted for the future. As shown in FIG. 12, each record holds the date and time (year, month, day, hour, minute, second) at which the predicted failure occurs, the "phenomenon category", the "factor category", and the "failure occurrence probability". For example, it records that a failure with phenomenon category "0001" and factor category "0001" will occur at 00:00:00 on January 1, 2005 with a failure occurrence probability of "10%". The failure prediction DB 104 thus stores the results (predicted date, phenomenon category, factor category, and failure occurrence probability) of the prediction processing described later, which is performed on the past failure cases stored in the failure information DB 111. The linear probability used to obtain the failure occurrence probability in this embodiment is preferably based on a generalized linear model from statistics (a model that predicts a numeric response variable from a linear combination of explanatory variables, which may be numerically transformed values or factor variables), but it is not limited to this.

The phenomenon category master 105 classifies the phenomena of failures by phenomenon type, associating each "phenomenon category" with a "phenomenon name". As shown in FIG. 14, for example, phenomenon category "0001" is stored with the phenomenon name "system down", phenomenon category "0005" with "file corruption", and phenomenon category "0008" with "security deficiency".

The factor category master 106 classifies the causes of failures by factor type, associating each "factor category" with a "factor name". As shown in FIG. 15, for example, factor category "0001" is stored with the factor name "center hardware failure", factor category "0005" with "failure due to external attack", and factor category "0008" with "communication line failure".

The parameter setting file 107 sets the parameters (thresholds) used to analyze the phenomenon information and factor information, together with the notification destinations. As shown in FIG. 16, it holds the items "unit time", which sets the time unit for process start-up and calculation; "past M years", which sets how many past years of data are used as the sample; "future N years", which sets how many future years the probability is calculated for; "failure probability danger threshold", which sets the probability at which a calculated result is judged dangerous and must be notified; and "notification destination mail address", which sets the mail destinations. For example, with the unit time set to "1 second", it specifies that, based on the sample data of the past 10 years, mail is sent to hitachi@dokono.ne.jp and other addresses when the failure probability within the next 5 years is at or above the danger threshold of 60%.
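The items of the parameter setting file map naturally onto a small configuration object. The sketch below is an assumption about how such settings might be held in memory; the class name, field names, and validation rules are hypothetical, and the file format itself is not described in the text. The sample values are the ones given for FIG. 16.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PredictionParameters:
    """In-memory view of the parameter setting file (hypothetical names)."""
    unit_time_seconds: int            # delta t; the sample value means 1 second
    past_years: int                   # M: years of history used as sample data
    future_years: int                 # N: years ahead to calculate
    danger_threshold_percent: float   # failure probability danger threshold
    notify_addresses: Tuple[str, ...]

    def __post_init__(self):
        # Basic sanity checks; these rules are illustrative assumptions.
        if self.unit_time_seconds <= 0:
            raise ValueError("unit time must be positive")
        if self.past_years <= 0 or self.future_years <= 0:
            raise ValueError("M and N must be positive")
        if not 0 <= self.danger_threshold_percent <= 100:
            raise ValueError("threshold must be a percentage")
        if not self.notify_addresses:
            raise ValueError("at least one notification address is required")


# Sample values from the text: 1 second, past 10 years, future 5 years, 60%.
SAMPLE = PredictionParameters(1, 10, 5, 60.0, ("hitachi@dokono.ne.jp",))
```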

As shown in FIG. 17, the work database 108 stores the associations between "phenomenon category" and "factor category"; for example, it records that the factor categories "0001" to "0003" correspond to the phenomenon category "0001". As shown in FIG. 18, the work database 109 stores the associations among "failure number", "phenomenon category", "factor category", and "downtime"; for example, it records that the failure with failure number "001928372" has phenomenon category "0001", factor category "0004", and a downtime of "1" hour.

The linear probability database 110 stores the "linear probability" for each combination of "phenomenon category", "factor category", and "elapsed time". As shown in FIG. 19, for example, it registers that the linear probability of a failure with phenomenon category "0001" and factor category "0001" is "2%" at an elapsed time of "1" hour, and "6%" at an elapsed time of "8" hours.

When a failure occurs, the failure information system 112 accepts input of the failure's occurrence date and time, recovery date and time, phenomenon category, factor category, downtime, other detailed failure information, and the person responsible for handling the failure, and stores and updates this past failure information in the failure information DB 111 as sample data.

The failure prediction function unit 102 is started at every unit time (the unit, whether hours, minutes, or seconds, is variable and is set in the parameter setting file 107) and has the following functions. First, it updates the sample data of failure information, such as the failure occurrence date and time. Second, taking as input the sample data stored in the failure information DB 111 for the past M years set in the parameter setting file 107, it calculates the failure occurrence probability for every unit time up to the future N years set in the parameter setting file 107, referring to the phenomenon category master 105, the factor category master 106, the work DB 108, the work DB 109, and the linear probability database 110, and updates the failure prediction DB 104 with the resulting failure prediction information. Third, if during this calculation the failure occurrence probability in the failure prediction information exceeds the threshold set in the parameter setting file 107 (for example, 60%), it passes the predicted date and time, phenomenon category, and factor category, together with the list of mail addresses set in the parameter setting file 107, to the failure occurrence high probability notification function unit 103.

The failure occurrence high probability notification function unit 103 sends the predicted date and time, phenomenon category, and factor category of the failure prediction information received from the failure prediction function unit 102 by e-mail to the persons in charge and other concerned parties 113 on the mail address list.
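The hand-off from prediction to notification can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the names `Prediction` and `build_notifications` and the message wording are hypothetical, and no mail is actually sent. The text describes the condition both as exceeding the threshold and as "60% or more"; the sketch uses a strict comparison, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Prediction:
    """One failure prediction row (hypothetical field names)."""
    timestamp: str             # predicted date and time, e.g. "2005/01/01 00:00:00"
    phenomenon: str            # phenomenon category code
    factor: str                # factor category code
    probability_percent: float


def build_notifications(
    predictions: Sequence[Prediction],
    threshold_percent: float,
    addresses: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (address, message) pairs for predictions above the threshold."""
    messages = []
    for p in predictions:
        # Strict comparison: "exceeds the threshold" (assumed reading).
        if p.probability_percent > threshold_percent:
            body = (
                f"High failure probability predicted at {p.timestamp}: "
                f"phenomenon {p.phenomenon}, factor {p.factor}"
            )
            for address in addresses:
                messages.append((address, body))
    return messages
```

For example, with a threshold of 60% and predictions of 10% and 75%, only the 75% prediction produces a message for each address on the list.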

In this way, the failure prediction system 101 of this embodiment obtains information on past failures from the failure information system 112, analyzes it according to the parameters set in the parameter setting file 107, and registers it in the category masters 105 and 106 as phenomenon category information and factor category information. When the failure prediction function unit 102 finds that the failure occurrence probability for any unit time up to N years in the future exceeds the threshold set in the parameter setting file 107, the failure occurrence high probability notification function unit 103 notifies the persons in charge of the predicted date and time, phenomenon category, and factor category. Expected failures can thus be reported to the persons in charge in advance, so that preventive measures can be taken efficiently.
<Detailed operation>

<Definitions>
Next, the operation of the failure prediction function unit 102 and the failure occurrence high probability notification function unit 103 is described in detail with reference to FIG. 2. First, the variables are defined as follows.
(1) From the failure record data of the past M years, the failure occurrence probability is calculated in units of Δt (hours, minutes, or seconds) up to N years into the future.
(2) Time at present (process start-up): t (YYYY/MM/DD HH:MM:SS or similar, in units of Δt).
(3) Number of time units from M years ago to t: X (Δt)
(4) Number of time units from t to N years in the future: Y (Δt)
(5) Total downtime for the current phenomenon category and factor category: x (Δt)
(6) Downtimes of the individual failures of the current phenomenon category and factor category: τ1, τ2, τ3, ..., τn (Δt)
As a precondition, failures of the phenomenon and factor in question are assumed to have occurred n times (0 ≤ n < ∞) in the past (from M years ago to the present). Because the primary key of the failure information DB 111 extends to the affected customer code, records are grouped by failure number, and the record with the largest downtime in each group is used.
(7) Current phenomenon category: PHENOME(I)
(8) Current factor category: FACTOR(J)
(9) Failure probability threshold: K (%)
<Forecast → Notification operation outline>

The failure prediction function 102 and the high failure probability notification function unit 103 according to this embodiment calculate future failure occurrence probabilities from past failure record information and notify the persons in charge and other parties concerned when a threshold is exceeded. This operation is described below with reference to FIG. 2 and other figures.
[Step S201]

The system first acquires the current system time in step S201. For example, t = YYYYMMDDHHMMSS when Δt is 1 second, and t = YYYYMMDDHHMM when Δt is 1 minute. Δt is set in the parameter setting file 107. As shown in FIG. 16, the parameter setting file 107 holds the items unit time / past M years / future N years / failure probability danger threshold / notification destination mail address, registered in a form such as 00001/10 years/5 years/60%/hitachi@dokono.ne.jp. A unit time of "00001" means 1 second.

In this embodiment, the unit of Δt is 1 hour, 1 minute, or 1 second, but finer units such as milliseconds or nanoseconds can also be handled as long as the system clock provides such values. This can be done by adding further branches for finer time units to the processing shown in FIG. 3, described later. Because the system date and time serves as the reference point for the data operations and calculations covering the past M years and the future N years, it should be as accurate as possible; for example, it may be kept synchronized using an NTP server.
[Step S202]

Next, in step S202, the failure prediction function 102 and the high failure probability notification function unit 103 calculate the number of time units from M years in the past to the current time t, and from t to N years in the future, where M and N are preset in the parameter setting file 107. The detailed operation of step S202 is described later with reference to FIG. 4.

When Δt is 1 second, these values are calculated as follows:
X = M × 60 × 60 × 24 × 365
Y = N × 60 × 60 × 24 × 365

When Δt is 1 minute, they are calculated as follows:
X = M × 60 × 24 × 365
Y = N × 60 × 24 × 365
[Step S203]

Next, in step S203, the failure prediction function 102 or the like clears all records in the work database 108.
[Step S204]

Next, in step S204, the system registers every combination of the records in the phenomenon category master 105 and the factor category master 106 in the work DB 108, and reads the first pair as PHENOME(I) = the current phenomenon category and FACTOR(J) = the current factor category.

Step S204 loops until all records in the work DB 108 have been read. Because the work DB 108 holds every combination of the records in the phenomenon category master 105 and the factor category master 106, it stores m × n records when the masters contain m and n records, respectively.

In the second and subsequent iterations of the loop, this step reads the record following the current phenomenon category / factor category pair.
Step S204 is described in detail later with reference to FIG. 5.
[Step S205]

Next, the system reads the failure phenomenon category, factor category, and downtime from the past failure information stored in the failure history DB 111, and calculates the total downtime x for the phenomenon category PHENOME(I) and factor category FACTOR(J) from M years in the past to the current system time (x = τ1 + τ2 + ... + τn, that is, the sum of τr for r = 1 to n). Details of step S205 are described later with reference to FIGS. 6 and 7.
[Step S206]

Next, in step S206, the system determines whether the total downtime x is 0. If x = 0, that is, if no failure of this phenomenon category / factor category pair has ever occurred in the past, the process proceeds to step S217, which sets 0 in the "failure probability" item of the records of the failure prediction DB 104 for this phenomenon category and factor category up to N years in the future. If x ≠ 0, the process proceeds to step S207.
[Step S207]

Next, in step S207, the linear probability up to N years in the future is calculated for the phenomenon category PHENOME(I) and factor category FACTOR(J). The linear failure occurrence probability P' at time t + k (k = 1, 2, ...) is P' = 100 × (x + k) / X. P' is calculated for every time unit up to N years in the future and temporarily stored in the linear probability database 110. Because x + k increases proportionally, P' can be expected to exceed 100. In the subsequent steps, the tanh function is therefore used to convert the linear (increasing) probability into the actual failure occurrence probability. The detailed operation of step S207 is described later with reference to FIG. 8.
[Step S208]

Next, in step S208, the system sets the notification flag to 0. The notification flag records whether the high failure probability notification function shown in FIG. 1 has already issued a notification that the threshold was exceeded for the phenomenon and factor in question.

The reason is as follows. Because the failure occurrence probability increases over time, once the probability for a given phenomenon category and factor category exceeds the threshold, the probability for every subsequent time unit also exceeds it, and alarm information for every time unit up to N years ahead would be sent. The notification flag prevents this: once a notification has been issued for a phenomenon category and factor category whose probability exceeded the threshold, no alarm is sent for the second and subsequent time units.
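The role of the notification flag can be illustrated with a small sketch. It assumes that the probabilities P for one phenomenon/factor pair are already available as a sequence ordered by elapsed time; the function name `first_alerts` and its return shape are illustrative and do not appear in the source.

```python
from typing import Iterable, List, Tuple


def first_alerts(probabilities: Iterable[float],
                 threshold: float) -> List[Tuple[int, float]]:
    """Return the (k, P) pairs that would trigger a notification.

    k is the 1-based elapsed-time index. Because of the flag, at most
    one pair is returned for a phenomenon/factor combination.
    """
    notified = False                 # notification flag, 0 after step S208
    alerts = []
    for k, p in enumerate(probabilities, start=1):
        if p >= threshold and not notified:   # checks of S211 and S215
            alerts.append((k, p))             # S216: send the notification
            notified = True                   # S216: set the flag to 1
    return alerts
```

Without the `notified` check, every later time unit would also produce an alert, which is the flood of alarms the flag is meant to prevent.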
[Step S209]

Next, the data in the linear probability database 110 created in step S207 is read one record at a time, and it is determined whether the end of the data has been reached. If the linear probability database 110 contains no data, or there is no further record to read after the last record, the process proceeds to step S214; if there is a record to read, the process proceeds to step S210. Step S209 loops with step S213 until the end of the data in the linear probability database 110.
[Step S210]

Next, in step S210, the system applies the tanh function to the record that was read and calculates the failure occurrence probability P (P = 100 × tanh[P' / 100]). Details of step S210 are described later with reference to FIGS. 9 and 10.
[Step S211]

Next, in step S211, it is determined whether the failure occurrence probability P calculated in step S210 is equal to or greater than the threshold set in the parameter setting file 107. If P is equal to or greater than the threshold, the process proceeds to step S215; if P is smaller than the threshold, the process proceeds to step S212.
[Step S215]

Next, in step S215, the system determines whether the notification flag is "1". If the flag is 1, that is, the failure occurrence probability exceeds the threshold but notification has already been made, no notification is performed and the process proceeds to step S212. If the flag is not 1, that is, the probability exceeds the threshold and no notification has yet been made, the process proceeds to the high failure probability notification processing of step S216.
[Step S216]

When step S215 determines that the notification flag is not "1", the failure occurrence probability and its time (current system time + elapsed time), the phenomenon category, and the factor category are automatically sent by e-mail to the persons in charge of, and other parties concerned with, the system subject to failure prediction, and the notification flag is set to "1". The detailed operation of step S216 is described later with reference to FIG. 11.
[Step S212]

Following step S211 or S215, the system writes the current predicted failure time (system time + elapsed time), the phenomenon category, the factor category, and the failure occurrence probability P to the failure prediction DB 104. The primary key for this update consists of the predicted failure time (system time + elapsed time), the phenomenon category, and the factor category; an existing record is updated, and a new record is registered otherwise. The failure prediction DB 104 stores the failure prediction information shown in FIG. 12.
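The "update if present, register otherwise" behavior of step S212 corresponds to what is commonly called an upsert. The following is a minimal sketch using an in-memory SQLite table; the table name, column names, and the choice of SQLite are assumptions made for illustration, since the source does not specify the database product or schema beyond the three key items and the probability.

```python
import sqlite3


def open_prediction_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the failure prediction DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE prediction (
               predicted_at TEXT NOT NULL,
               phenomenon   TEXT NOT NULL,
               factor       TEXT NOT NULL,
               probability  REAL NOT NULL,
               PRIMARY KEY (predicted_at, phenomenon, factor))"""
    )
    return conn


def upsert_prediction(conn, predicted_at, phenomenon, factor, probability):
    """Update the row with this key if it exists, otherwise insert it."""
    conn.execute(
        "INSERT OR REPLACE INTO prediction VALUES (?, ?, ?, ?)",
        (predicted_at, phenomenon, factor, probability),
    )
```

Calling `upsert_prediction` twice with the same predicted time, phenomenon, and factor leaves a single row holding the most recent probability.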
[Step S213]

Next, the next record in the linear probability database 110 is read, and the process loops back to step S209.
[Step S214]
In step S214, the system determines whether a further phenomenon category / factor category pair record remains in the work DB 108. If one remains, the processing from step S204 to step S214 is repeated; when the loop over all records stored in the work DB 108 is complete, the processing ends.

In this way, the system calculates the future linear failure occurrence probability P' in step S207 from the relationship between each combination of phenomenon category and factor category contained in the past failure record information and the total downtime x of the failures caused by that combination, and can notify the persons in charge and other parties concerned when the failure occurrence probability P derived from it exceeds the predetermined threshold. The detailed processing of each step is described below.
<Description of System Time Acquisition Processing Step S201>

The system time acquisition processing of step S201 is described next with reference to FIG. 3.
As shown in FIG. 3, this processing first acquires the predefined unit time (the Δt item) from the parameter setting file 107 and then acquires the current system date and time (at the start of the processing in FIG. 2) in that unit. It determines whether Δt is "0001" (step S301); if so, the process proceeds to step S302, and otherwise to step S303.

In step S302, the system time is set in t (t = time[YYYYMMDDHHMMSS], where time() is a function that acquires the system date and time).

In the processing beginning at step S301, time(YYYYMMDDHHMMSS) acquires the system date and time down to the second, time(YYYYMMDDHHMM) down to the minute, and time(YYYYMMDDHH) down to the hour.

In step S303, if Δt is "0002", the process proceeds to step S304; otherwise, it proceeds to step S305.
In step S304, the system time is set in t [t = time(YYYYMMDDHHMM)].

In step S305, if Δt is "0003", the process proceeds to step S306; otherwise, it proceeds to step S307.
In step S306, the system time is set in t [t = time(YYYYMMDDHH)].
In step S307, error processing is performed because the failure prediction system cannot recognize the Δt value in the parameter setting file. This error occurs when Δt is set to a unit time that the system does not recognize, and log information to that effect is output.

In this way, this embodiment acquires the system time in the unit time predefined in the parameter setting file 107 (the Δt item: seconds, minutes, or hours), and the total time is calculated on the basis of the acquired time.
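The branching of FIG. 3 can be sketched as a lookup from the Δt code to a timestamp format. This is an illustrative reading, not the patented implementation: the codes "0001", "0002", and "0003" are taken from the description above, while the function name `system_time`, the use of Python's `strftime`, and raising `ValueError` for the error branch are assumptions.

```python
from datetime import datetime
from typing import Optional

# Delta-t code -> timestamp format (seconds, minutes, hours).
TIME_FORMATS = {
    "0001": "%Y%m%d%H%M%S",  # S302: down to the second
    "0002": "%Y%m%d%H%M",    # S304: down to the minute
    "0003": "%Y%m%d%H",      # S306: down to the hour
}


def system_time(delta_t: str, now: Optional[datetime] = None) -> str:
    """Return the system time t truncated to the unit selected by delta_t."""
    try:
        fmt = TIME_FORMATS[delta_t]
    except KeyError:
        # S307: unit time not recognized by the system.
        raise ValueError(f"unrecognized unit time: {delta_t!r}") from None
    return (now or datetime.now()).strftime(fmt)
```

Supporting finer units would amount to adding further entries to the table, which mirrors the remark that FIG. 3 can be extended with additional branches.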
<Description of Time Count Calculation Processing Step S202>

FIG. 4 illustrates the detailed operation of step S202, the time-count calculation processing described with reference to FIG. 2.

The time-count calculation processing of step S202 obtains the number of time units from M years ago to the present (the start of the processing in FIG. 2) and from the present to N years in the future. The granularity obtained (hours, minutes, or seconds) depends on the unit time defined in the parameter setting file 107.

The processing first acquires the Δt, M, and N items from the parameter setting file 107. Step S401 determines whether Δt is "0001"; if so, the process proceeds to step S402, and otherwise to step S403.

In step S402, the time from M years in the past to the present is set in X, and the time from the present to N years in the future is set in Y. These values are calculated as X = M × 60 × 60 × 24 × 365 and Y = N × 60 × 60 × 24 × 365.

Next, step S403 determines whether Δt is "0002"; if so, the process proceeds to step S404, and otherwise to step S405.

In step S404, X and Y are set in the same way, calculated as X = M × 60 × 24 × 365 and Y = N × 60 × 24 × 365.

Step S405 determines whether Δt is "0003"; if so, the process proceeds to step S406, and otherwise to step S407.

In step S406, X and Y are again set in the same way, calculated as X = M × 24 × 365 and Y = N × 24 × 365.

If the failure prediction system cannot recognize the Δt value in the parameter setting file, error processing is performed in step S407. This error occurs when Δt is set to a unit time that the system does not recognize, and log information to that effect is output.
In this way, this processing calculates the length of the past sampling period from the past date and time, and the length of the prediction period from the future date and time.
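As a worked illustration of steps S402, S404, and S406, the sketch below computes X and Y for each Δt code using the 365-day year from the formulas above. The function and constant names are hypothetical, and the `ValueError` stands in for the error processing of step S407.

```python
# Number of delta-t units in one day for each unit-time code.
UNITS_PER_DAY = {
    "0001": 24 * 60 * 60,  # seconds (S402)
    "0002": 24 * 60,       # minutes (S404)
    "0003": 24,            # hours   (S406)
}


def time_counts(delta_t: str, past_years: int, future_years: int):
    """Return (X, Y): unit counts for the past M and future N years."""
    if delta_t not in UNITS_PER_DAY:
        # S407: unit time not recognized by the system.
        raise ValueError(f"unrecognized unit time: {delta_t!r}")
    per_year = UNITS_PER_DAY[delta_t] * 365   # leap days are ignored
    return past_years * per_year, future_years * per_year
```

With the sample settings M = 10 and N = 5, a Δt of 1 second gives X = 315,360,000 and Y = 157,680,000, so a per-second loop over the future period is long; coarser units shrink it by factors of 60 and 3,600.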
<Description of phenomenon / factor category combination acquisition processing step S204>

The phenomenon/factor category combination acquisition processing of step S204 is described next with reference to FIG. 5.

Because failure occurrence probabilities are calculated for every combination of the records stored in the phenomenon category master 105 and the factor category master 106, step S204 acquires the phenomenon categories and factor categories so that the loop covers all of their combinations, and stores them in the work DB 108.

Specifically, if the phenomenon category master 105 contains m records and the factor category master 106 contains n records, there are m × n combinations. With the phenomenon category denoted PHENOME(I) and the factor category denoted FACTOR(J), step S501 loops over each pair (PHENOME(I), FACTOR(J)) [where I = 1, 2, ..., m and J = 1, 2, ..., n] and thereby registers m × n records in the work DB 108.
In this way, this processing creates a record for each combination of phenomenon category and factor category. In this embodiment, the number of records created is called the total record count.
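The m × n pairing is a Cartesian product of the two masters. A minimal sketch, assuming the masters are simple lists of category names and the work DB is represented by a list of tuples (both assumptions for illustration):

```python
from itertools import product
from typing import List, Sequence, Tuple


def build_work_records(phenomena: Sequence[str],
                       factors: Sequence[str]) -> List[Tuple[str, str]]:
    """Return every (phenomenon, factor) pair, phenomenon-major order."""
    return list(product(phenomena, factors))
```

For example, two phenomenon categories and three factor categories yield six work records, and the first record pairs the first phenomenon with the first factor.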
<Description of Total Downtime Calculation Processing Step S205 of the Phenomenon / Cause Failure>

As shown in the conceptual diagram of FIG. 6, the total downtime calculation processing of step S205 takes the main processing start time t as the reference point, assumes that failures of a given phenomenon category / factor category occurred n times between M years in the past and the present (reference point t), and obtains the sum of their downtimes, τ1 + τ2 + ... + τn. As described above, it reads the failure phenomenon category, factor category, and downtime from the past failure information stored in the failure history DB 111, acquires each downtime τr (r = 1, ..., n) of the phenomenon category between M years in the past and the current system time, and calculates their sum as the total downtime x. If a failure was already in progress at the point M years in the past (that is, its recovery date and time falls between M years ago and the present), the downtime item of that failure record is included in the sum regardless of when the failure began.

The information read from the failure information DB 111 (FIG. 13) contains many failure-related items, and its primary key consists of the failure number and the affected customer code. When several records share the same failure number (because the affected customer code is part of the key), they are sorted in ascending order of downtime, the largest downtime value is adopted for the group, and the sum is then taken over the groups.

As shown in FIG. 7, the calculation of step S205 proceeds as follows. Step S701 extracts from the failure information DB 111 the records that match the phenomenon category / factor category pair stored in the work DB 108 and satisfy the condition (M years in the past) ≤ recovery date and time ≤ t, and registers all of them in the work DB 109. Step S702 calculates, in the time unit Δt registered in the parameter setting file 107, the sum of the "downtime" values of all records stored in the work DB 109 and sets it in the total downtime x. Step S703 clears all records in the work DB 109. Executing these steps in order yields the total downtime x (= τ1 + τ2 + ... + τn).
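The extraction, grouping, and summation described for step S205 can be sketched as follows. The record layout (dictionary keys such as `failure_no` and `recovered_at`) and the sample data are invented for illustration; only the filtering condition on the recovery time and the rule of keeping the largest downtime per failure number come from the description.

```python
from datetime import datetime


def total_downtime(records, phenomenon, factor, window_start, window_end):
    """Sum downtime over failures of one phenomenon/factor pair.

    Records sharing a failure number (one per affected customer) are
    collapsed to the largest downtime before summing.
    """
    longest_by_failure = {}
    for rec in records:
        if (rec["phenomenon"], rec["factor"]) != (phenomenon, factor):
            continue
        if not window_start <= rec["recovered_at"] <= window_end:
            continue
        number = rec["failure_no"]
        longest_by_failure[number] = max(
            longest_by_failure.get(number, 0), rec["downtime"])
    return sum(longest_by_failure.values())


# Hypothetical sample data: failure 1 affected two customers.
RECORDS = [
    {"failure_no": 1, "customer": "A", "phenomenon": "P1", "factor": "F1",
     "recovered_at": datetime(2005, 3, 1, 10, 0), "downtime": 30},
    {"failure_no": 1, "customer": "B", "phenomenon": "P1", "factor": "F1",
     "recovered_at": datetime(2005, 3, 1, 10, 15), "downtime": 45},
    {"failure_no": 2, "customer": "A", "phenomenon": "P1", "factor": "F1",
     "recovered_at": datetime(2005, 9, 9, 8, 0), "downtime": 10},
    {"failure_no": 3, "customer": "C", "phenomenon": "P1", "factor": "F2",
     "recovered_at": datetime(2005, 9, 9, 8, 0), "downtime": 99},
    {"failure_no": 4, "customer": "A", "phenomenon": "P1", "factor": "F1",
     "recovered_at": datetime(1990, 1, 1, 0, 0), "downtime": 500},
]
```

In this sample, failure 1 contributes 45 rather than 75, failure 3 belongs to another factor, and failure 4 recovered before the window, so the total for (P1, F1) over the ten years ending 2006-01-26 is 55. Taking the maximum directly gives the same result as sorting in ascending order and keeping the last value.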
<Description of Linear Probability Calculation Processing Step S207>

Next, step S207, which performs the linear probability calculation up to N years in the future, is described with reference to FIG. 8. The principle is as follows: the total downtime x is obtained in step S205 for each combination of failure phenomenon and cause, and, on the premise that a combination with a larger total downtime x has a higher probability of failure, the failure occurrence probability is calculated by a linear probability calculation. The linear probability is preferably based on a generalized linear model in statistics (a model that predicts a numerical response variable from a linear combination of numerically transformed or factor explanatory variables), but it is not limited to this, and other probability calculation methods may be used.

Concretely, the linear probability calculation of step S207 operates as follows. Step S801 sets the variable k to 1 (k is the elapsed time from the present; its unit is 1 second, 1 minute, or 1 hour, depending on Δt in the parameter setting file 107). Step S802 calculates the linear probability P' = 100 × (x + k) / X and outputs the phenomenon category, factor category, elapsed time k, and linear probability to the linear probability database 110. Step S803 sets k = k + 1 to advance to the next point in the time series. Step S804 determines whether k is greater than Y: if k > Y, the processing ends; if k ≤ Y, the process returns to step S802 and loops. In other words, the linear probability is calculated and output repeatedly until the end of the N-year future period calculated in step S202 is reached.
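The loop of steps S801 to S804 can be written as a small generator. This is a sketch under the assumption that the linear probability database is replaced by a stream of (k, P') pairs; the function name is illustrative.

```python
from typing import Iterator, Tuple


def linear_probabilities(x: float, X: float, Y: int) -> Iterator[Tuple[int, float]]:
    """Yield (k, P') for k = 1..Y, where P' = 100 * (x + k) / X."""
    k = 1                                # S801
    while k <= Y:                        # S804: stop once k exceeds Y
        yield k, 100 * (x + k) / X       # S802
        k += 1                           # S803
```

Because only k changes between iterations, P' grows by the constant step 100 / X, and nothing bounds it at 100; for instance, x = 99 and X = 100 already give P' = 101 at k = 2.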

Because the numerator (x + k) of the formula in step S802 increases while the denominator X is constant, P' increases linearly and can be expected to exceed 100 (%). This embodiment therefore performs the tanh function application processing of step S210, which applies the increasing function tanh(x) scaled so that 100% is its asymptote. With step S210, the linear probability P' is converted, as shown in FIG. 9, into a characteristic that keeps increasing while approaching 100% asymptotically. tanh (hyperbolic tangent) is the hyperbolic tangent function.
<Description of Tanh Function Application Processing Step S210>

As shown in FIG. 10, the tanh function application processing uses the curve shown in FIG. 9, f(x) = 100 (*1) × tanh(x / 100 (*2)). Step S901 substitutes the linear probability P' of each record in the linear probability database 110 created in FIG. 8 for x, thereby applying the tanh function. The multiplication by 100 at (*1) makes f(x) = 100 (%) the asymptote; the division by 100 at (*2) slows the progression along the time axis of the time series, in balance with (*1). The tanh function application processing of step S210 is executed by the failure prediction function unit 102 or by a tanh function application processing unit (not shown).
<Description of High Failure Probability Notification Processing and Notification Flag = 1 Processing Step S216>

Step S216 is the high failure probability notification processing in FIG. 2 together with the processing that sets the notification flag to 1. As shown in FIG. 11, step S1001 acquires the mail address group item from the parameter setting file 107 and passes all of the mail addresses, the predicted failure time, the phenomenon category, the factor category, and the threshold to the e-mail system, which sends a mail to every address; step S1002 then sets the notification flag to 1.
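The hand-off to the e-mail system in step S1001 might look like the following sketch, which only builds the message and leaves delivery to whatever mail system is in use. The sender address, subject wording, and body layout are assumptions; the source specifies only which items are passed.

```python
from email.message import EmailMessage
from typing import Sequence


def build_alert(addresses: Sequence[str], predicted_at: str,
                phenomenon: str, factor: str, threshold: float,
                sender: str = "failure-prediction@example.invalid") -> EmailMessage:
    """Build one alert mail addressed to every configured recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(addresses)
    msg["Subject"] = f"High failure probability: {phenomenon} / {factor}"
    msg.set_content(
        f"Predicted time: {predicted_at}\n"
        f"Phenomenon category: {phenomenon}\n"
        f"Factor category: {factor}\n"
        f"Threshold exceeded: {threshold}%\n"
    )
    return msg
```

The caller would pass the returned message to its mail transport and then set the notification flag to 1, as in step S1002.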

In this way, when deriving future failure occurrence probabilities from past failure records, the failure prediction system according to this embodiment focuses on the phenomena and factors underlying failures. Taking their frequency (the total downtime due to failures) as the sample data and the operating time of the system subject to failure prediction as the population data, it calculates the future failure occurrence probability for each combination of phenomenon and factor, in time series and at each time unit (seconds, minutes, hours, and so on), from the downtime caused by the failures of that combination. By notifying an administrator or the like when the calculated probability exceeds a predetermined threshold, failures that will occur in the future can be predicted. That is, the total downtime due to failures is calculated for each combination of phenomenon and factor observed in the past, and a combination with a large total downtime is assumed to be both more likely to cause a failure and likely to do so sooner. According to this embodiment, the past M years, the future N years, the high failure probability threshold, the phenomenon categories, and the factor categories are held as master data and can be changed, which makes the system highly versatile and allows anything from rough to fine-grained prediction.

The present invention is not limited to the embodiment described above and may be modified in various ways without departing from its gist. For example, DBs that consume a considerable amount of resources in this embodiment, such as the work DB 108 and the work DB 109, may instead be provided as work files.

FIG. 1 is a configuration diagram of a failure prediction system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a first processing operation example of the failure prediction function and the failure occurrence high probability notification function of this system example.
FIG. 3 is a flowchart showing the detailed operation of the system time acquisition process of this example.
FIG. 4 is a flowchart showing the detailed operation of the number-of-hours calculation process of this example.
FIG. 5 is a flowchart showing the process of acquiring combinations of phenomenon and factor classifications in this example.
FIG. 6 is a diagram showing the concept of the total downtime calculation process for failures of a given phenomenon and factor in this example.
FIG. 7 is a flowchart showing the detailed operation of the total downtime calculation process for failures of a given phenomenon and factor in this example.
FIG. 8 is a flowchart showing the detailed operation of the linear probability calculation process up to the future N in this example.
FIG. 9 is a graph of the tanh function used in the tanh function application process of this example.
FIG. 10 is a flowchart showing the detailed operation of the tanh function application process of this example.
FIG. 11 is a flowchart showing the detailed operation of the failure occurrence high probability notification process and the notification flag = 1 process of this example.
FIG. 12 is a diagram showing the structure and sample data of the failure prediction database of this example.
FIG. 13 is a diagram showing the structure and sample data of the failure information database of this example.
FIG. 14 is a diagram showing the structure and sample data of the phenomenon classification master of this example.
FIG. 15 is a diagram showing the structure and sample data of the factor classification master of this example.
FIG. 16 is a diagram showing the structure and sample data of the parameter setting file of this example.
FIG. 17 is a diagram showing the structure and sample data of a work database of this example.
FIG. 18 is a diagram showing the structure and sample data of a work database of this example.
FIG. 19 is a diagram showing the structure and sample data of the linear probability data of this example.

Explanation of symbols

101: failure prediction system, 102: failure prediction function unit, 103: failure occurrence high probability notification function unit, 104: failure prediction database, 105: phenomenon classification master, 106: factor classification master, 107: parameter setting file, 108: work database, 109: work database, 110: linear probability database, 111: failure information database, 112: failure information system.

Claims

1. A failure prediction system for predicting the occurrence of failures in a computer system connected to a plurality of clients, comprising:
a failure information database that stores, for each of a plurality of failures that occurred in the past, sample data including a phenomenon classification that classifies the failure phenomenon by failure phenomenon type, a factor classification that classifies the cause of the failure by failure factor type, and the downtime caused by the failure classified under that phenomenon classification and factor classification;
a failure prediction function unit that takes as the downtime the time from the occurrence of a failure classified under a phenomenon classification and factor classification stored in the failure information database until its recovery, calculates as the total downtime the sum of the downtimes caused by the plurality of failures having the same phenomenon classification and factor classification, and uses the total downtime to calculate the failure occurrence probability of a failure classified under that same phenomenon classification and factor classification;
a parameter setting file that sets a time variable k used when the failure prediction function unit performs the calculation;
a failure prediction database that stores failure prediction data calculated by the failure prediction function unit, the data including the failure occurrence probability of failures having the same phenomenon classification and factor classification; and
a failure occurrence high probability notification function unit that notifies a preset destination when the failure occurrence probability in the failure prediction data stored in the failure prediction database exceeds a preset threshold,
wherein, where k is the time variable, X is the time from an arbitrary past date and time up to the present, x is the total downtime, and P' is the linear probability, the failure prediction function unit calculates the linear probability P' using the formula "100 × (x + k) / X" while incrementing the time variable k by 1 ("k + 1") until k reaches a predetermined value, and executes a tanh function application process that applies the function tanh to the variation of the calculated linear probability P' so that the maximum value of the occurrence probability is less than 100%, thereby calculating the failure occurrence probability until the time variable k reaches the predetermined value.

2. The failure prediction system according to claim 1, further comprising a phenomenon classification master that stores phenomenon classifications obtained by classifying the phenomena by failure phenomenon type, and a factor classification master that stores factor classifications obtained by classifying the causes of failures by failure factor type, wherein the failure prediction database stores the total downtime for every combination of the phenomenon classifications and factor classifications stored in the phenomenon classification master and the factor classification master.

3. A failure prediction program for a failure prediction system that predicts the occurrence of failures in a computer system connected to a plurality of clients, the failure prediction system comprising:
a failure information database that stores, for each of a plurality of failures that occurred in the past, sample data including a phenomenon classification that classifies the failure phenomenon by failure phenomenon type, a factor classification that classifies the cause of the failure by failure factor type, and the downtime caused by the failure classified under that phenomenon classification and factor classification;
a failure prediction function unit that takes as the downtime the time from the occurrence of a failure classified under a phenomenon classification and factor classification stored in the failure information database until its recovery, calculates as the total downtime the sum of the downtimes caused by the plurality of failures having the same phenomenon classification and factor classification, and uses the total downtime to calculate the failure occurrence probability of a failure classified under that same phenomenon classification and factor classification;
a parameter setting file that sets a time variable k used when the failure prediction function unit performs the calculation;
a failure prediction database that stores failure prediction data calculated by the failure prediction function unit, the data including the failure occurrence probability of failures having the same phenomenon classification and factor classification; and
a failure occurrence high probability notification function unit that notifies a preset destination when the failure occurrence probability in the failure prediction data stored in the failure prediction database exceeds a preset threshold,
the program causing the failure prediction function unit to realize a function in which, where k is the time variable, X is the time from an arbitrary past date and time up to the present, x is the total downtime, and P' is the linear probability, the linear probability P' is calculated using the formula "100 × (x + k) / X" while the time variable k is incremented by 1 ("k + 1") until k reaches a predetermined value, and a tanh function application process is executed that applies the function tanh to the variation of the calculated linear probability P' so that the maximum value of the occurrence probability is less than 100%, thereby calculating the failure occurrence probability until the time variable k reaches the predetermined value.

4. The failure prediction program according to claim 3, wherein the failure prediction system is provided with a phenomenon classification master that stores phenomenon classifications obtained by classifying the phenomena by failure phenomenon type, and a factor classification master that stores factor classifications obtained by classifying the causes of failures by failure factor type, and the failure prediction database stores the total downtime for every combination of the phenomenon classifications and factor classifications stored in the phenomenon classification master and the factor classification master.