JP4848979B2 - Monitoring system, monitoring method and program

Info

Publication number: JP4848979B2
Application number: JP2007057347A
Authority: JP
Inventors: 和之進鹿田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-03-07
Filing date: 2007-03-07
Publication date: 2011-12-28
Anticipated expiration: 2027-03-07
Also published as: JP2008217682A

Description

The present invention relates to a monitoring system, a monitoring method, and a program, and more particularly to a technique for improving reliability in fault monitoring of an information processing system.

A monitoring system (monitoring device) that monitors an information processing system for failures must be highly reliable, and various measures have been devised to improve that reliability. For example, Patent Document 1 discloses a monitoring system with improved reliability: it avoids overflow caused by concentration of load and of information on the system monitoring device, and, unlike a configuration with a single monitoring device, it does not lose the ability to manage system information if that device fails. This is a server/client system in which the server performs health checks on the clients to check whether they are functioning normally.

Patent Document 2 discloses a monitoring system whose monitoring device can be restarted remotely when it fails. This system comprises a monitoring device that monitors a monitored device, and a log collection and transmission device that receives log data from the monitoring device, detects a failure of the monitoring device from the reception state of the log data, and notifies a remote administrator of the failure by e-mail.

Furthermore, Patent Document 3 describes an information processing system that can efficiently monitor the states of a plurality of processing devices. In this system, each processing device on the network has monitoring means that monitors the device itself and acquires monitoring information, and transmission means that transmits the monitoring information to a monitoring device. The monitoring device receives the monitoring information transmitted from each of the processing devices; it has receiving means for this purpose and display means for displaying the information on a display device. The monitoring information receiving means, display means, and display window of the monitoring device are provided for each monitored processing device.

Patent Document 4 describes a system management apparatus in which the manager is provided with management data storage means, resource usage relation retrieval means, and agent operation means, and which stops or resumes the applications that use the affected resource when a failure occurs, or is repaired, at an agent.

As a related technique, Patent Document 5 describes a failure detection technique for a line adapter in which the path to the higher-level device is duplicated and health checks are performed from the higher-level device on the lower-level device.

Patent Document 1: JP 2000-20427 A
Patent Document 2: JP 2002-169706 A
Patent Document 3: JP-A-6-12288
Patent Document 4: JP-A-8-190528
Patent Document 5: JP-A-9-54739

In a conventional monitoring system, the monitoring means on a client machine monitors monitored objects such as processes and performance information, and notifies the output device of a server machine when it detects an abnormality. However, if the monitoring means or the event notification path itself fails, events concerning the monitored objects cannot be notified. For example, in Patent Document 2, a failure of the monitoring device is reported to the remote administrator by e-mail, but if the transmission device fails, the remote administrator cannot detect that failure. In Patent Document 3, if the monitoring device itself fails, monitoring information cannot be received and monitoring results cannot be displayed. Similarly, in Patent Document 4, if the manager fails, management itself becomes impossible.

In the monitoring server/client system described in Patent Document 1, on the other hand, the server performs health checks on the clients to check whether they are functioning normally, and reliability is improved by multiplexing the servers. However, although multiplexing the managers safeguards the path between agent and manager, this system does not consider the case in which the path between the manager and the output unit fails to function normally. The monitoring system as a whole therefore cannot be said to provide monitoring with sufficient reliability.

Accordingly, an object of the present invention is to provide a monitoring system, a monitoring method, and a program that perform monitoring with reliability ensured for the system as a whole, even when the monitoring means or the event notification path has failed.

A monitoring system according to one aspect of the present invention comprises first, second, and third information processing device groups that include first, second, and third processing devices, respectively, in which the third processing device monitors the second processing device and the second processing device monitors the first processing device. The first processing device includes a first monitoring unit that monitors an object to be monitored, and a second monitoring unit that monitors the first monitoring unit. The second processing device includes a first management unit that manages the monitoring results of the first monitoring unit, and a third monitoring unit that monitors the second monitoring unit and the first management unit. The third processing device includes a second management unit that manages the monitoring results of the first management unit, a fourth monitoring unit that monitors the third monitoring unit and the second management unit, and an output unit that outputs the monitoring results of the second management unit and the fourth monitoring unit.

In the monitoring system of the present invention, the second monitoring unit may determine whether the first monitoring unit is normal. When it determines that the first monitoring unit is abnormal, it issues an event indicating that the first monitoring unit is down and attempts to restart the process of the first monitoring unit. If the process starts within a fixed time, it issues a restart-success event for the first monitoring unit; if the process does not start within that time, it issues a restart-failure event. The issued events are notified to the third monitoring unit.

In the monitoring system of the present invention, the third processing device may further include a fifth monitoring unit that monitors the fourth monitoring unit and notifies the first management unit of the monitoring result, and the first management unit may additionally manage the monitoring result from the fifth monitoring unit.

In the monitoring system of the present invention, the second information processing device group may include a fourth processing device comprising a first transfer unit that transfers the monitoring result of the first monitoring unit to the first management unit, and a sixth monitoring unit that monitors the second monitoring unit and the first transfer unit. In that case, the first management unit additionally manages the monitoring result from the first transfer unit, and the third monitoring unit additionally monitors the sixth monitoring unit.

In the monitoring system of the present invention, the first management unit may request the first monitoring unit to issue a health check event, determine whether a health check event is notified within a fixed time from both or either of the first monitoring unit and the first transfer unit, and issue the determination result to the second management unit. The first monitoring unit issues the health check event to the first management unit and to the first transfer unit. If the second management unit does not obtain the determination result within a predetermined time, it issues to the output unit an event indicating that the periodic report from the first management unit has not arrived.

In the monitoring system of the present invention, the third information processing device group may further include a fifth processing device that is configured in the same way as the third processing device and manages the fourth processing device.

In the monitoring system of the present invention, the first information processing device group may include a plurality of first processing devices, and the second and fourth processing devices may each manage one or more of the first processing devices.

In the monitoring system of the present invention, the first information processing device group may include a plurality of first processing devices, and the second information processing device group may further include a plurality of sixth processing devices. Each sixth processing device comprises a second transfer unit that transfers the monitoring results of the first monitoring units of one or more of the first processing devices, and a seventh monitoring unit that monitors the second transfer unit and the second monitoring units of one or more of the first processing devices. In that case, the second processing device monitors and manages any of the sixth processing devices instead of monitoring and managing the first processing device, and the fourth processing device monitors any of the sixth processing devices and transfers its monitoring results instead of monitoring the first processing device and transferring its monitoring results.

In the monitoring system of the present invention, the second information processing device group may further include a plurality of seventh processing devices. Each seventh processing device comprises a third transfer unit that transfers the transfer results of the second transfer units of one or more of the sixth processing devices, and an eighth monitoring unit that monitors the third transfer unit and the seventh monitoring units of one or more of the sixth processing devices. In that case, the second processing device monitors and manages any of the seventh processing devices instead of any of the sixth processing devices, and the fourth processing device monitors any of the seventh processing devices and transfers its monitoring results instead of doing so for any of the sixth processing devices.

A monitoring method according to another aspect of the present invention is a monitoring method for a monitoring system that comprises first, second, and third information processing device groups including first, second, and third processing devices, respectively, in which the third processing device monitors the second processing device and the second processing device monitors the first processing device. The method includes, in the first processing device, a first monitoring step of monitoring an object to be monitored and a second monitoring step of monitoring the monitoring status in the first monitoring step; in the second processing device, a first management step of determining whether there is a monitoring result from the first monitoring step and a third monitoring step of monitoring the status of the second monitoring step and the first management step; and, in the third processing device, a second management step of monitoring whether there is a determination result from the first management step, a fourth monitoring step of monitoring the status of the third monitoring step and the second management step, and an output step of outputting the monitoring results of the second management step and the fourth monitoring step.

A program according to still another aspect of the present invention is a program for the computers constituting a monitoring system that comprises first, second, and third information processing device groups including first, second, and third processing devices, respectively, in which the third processing device monitors the second processing device and the second processing device monitors the first processing device. The program causes the computer constituting the first processing device to make the first processing device function as a first monitoring unit that monitors an object to be monitored and as a second monitoring unit that monitors the first monitoring unit. It causes the computer constituting the second processing device to make the second processing device function as a first management unit that determines whether there is a monitoring result from the first monitoring unit and as a third monitoring unit that monitors the second monitoring unit and the first management unit. It causes the computer constituting the third processing device to make the third processing device function as a second management unit that monitors whether there is a determination result from the first management unit, as a fourth monitoring unit that monitors the third monitoring unit and the second management unit, and as an output unit that outputs the monitoring results of the second management unit and the fourth monitoring unit.

According to the present invention, two monitoring systems are used to monitor that the entire monitoring system is functioning normally, so that events concerning the monitored object can be reported to the output unit more reliably. Monitoring with sufficient reliability can therefore be performed.

A monitoring system according to an embodiment of the present invention comprises a first information processing device group (agent layer 1 in FIG. 1), a second information processing device group (manager layer 2 in FIG. 1), and a third information processing device group (view layer 3 in FIG. 1), which respectively include a first processing device (agent machine 10 in FIG. 1), a second processing device (manager machine 20a in FIG. 1), and a third processing device (view machine 30 in FIG. 1). The third processing device monitors the second processing device, and the second processing device monitors the first processing device. The first, second, and third processing devices each include a first monitoring system (monitoring means 12 and event management means 22 and 32 in FIG. 1) and a second monitoring system (meta monitoring means 11, 21a, and 31 in FIG. 1), each of which has at least an independent monitoring function and conveys monitoring information. The second monitoring system functions to monitor the first monitoring system.

The first processing device (agent machine 10 in FIG. 1) includes a first monitoring unit (monitoring means 12 in FIG. 1) that monitors an object to be monitored (monitored object 13 in FIG. 1), and a second monitoring unit (meta monitoring means 11 in FIG. 1) that monitors the first monitoring unit. The second processing device (manager machine 20a in FIG. 1) includes a first management unit (event management means 22 in FIG. 1) that manages the monitoring results of the first monitoring unit, and a third monitoring unit (meta monitoring means 21a in FIG. 1) that monitors the second monitoring unit and the first management unit. The third processing device (view machine 30 in FIG. 1) includes a second management unit (event management means 32 in FIG. 1) that manages the monitoring results of the first management unit, a fourth monitoring unit (meta monitoring means 31 in FIG. 1) that monitors the third monitoring unit and the second management unit, and an output unit (output device 33 in FIG. 1) that outputs the monitoring results of the second management unit and the fourth monitoring unit.

The second monitoring unit (meta monitoring means 11 in FIG. 1) may determine whether the first monitoring unit (monitoring means 12 in FIG. 1) is normal. When it determines that the first monitoring unit is abnormal, it issues an event indicating that the first monitoring unit is down and attempts to restart the process of the first monitoring unit. If the process starts within a fixed time, it issues a restart-success event for the first monitoring unit; if the process does not start within that time, it issues a restart-failure event. The issued events are notified to the third monitoring unit (meta monitoring means 21a in FIG. 1).
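The restart handling described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the implementation of the embodiment: the function name `supervise_once`, the event names, and the callbacks `is_alive`, `restart`, and `notify` are all hypothetical, with `notify` standing in for the notification to the third monitoring unit. The clock and sleep functions are parameters only so that the timeout can be exercised without waiting.

```python
import time
from typing import Callable


def supervise_once(
    is_alive: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """One pass of a (hypothetical) second monitoring unit over the first.

    Returns True if the monitored unit is alive, or comes back within
    timeout_s after a restart attempt; returns False otherwise.
    """
    if is_alive():
        return True

    # The monitored unit is judged abnormal: report it, then try to restart.
    notify("MONITOR_DOWN")
    restart()

    # Wait up to timeout_s for the restarted process to come up.
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_alive():
            notify("RESTART_SUCCEEDED")
            return True
        sleep(poll_s)

    notify("RESTART_FAILED")
    return False
```

In this sketch a down event is always followed by exactly one of the two restart events, so the receiver of `notify` can tell a recovered monitoring unit from one that stayed down.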

Further, the third processing device (view machine 30a in FIG. 1) may further include a fifth monitoring unit (monitoring means 34 in FIG. 1) that monitors the fourth monitoring unit (meta monitoring means 31 in FIG. 1) and notifies the first management unit (event management means 22 in FIG. 1) of the monitoring result, and the first management unit may additionally manage the monitoring result from the fifth monitoring unit.

The second information processing device group (manager layer 2 in FIG. 1) may include a fourth processing device (manager machine 20b in FIG. 1) comprising a first transfer unit (event transfer means 23 in FIG. 1) that transfers the monitoring result of the first monitoring unit (monitoring means 12 in FIG. 1) to the first management unit (event management means 22 in FIG. 1), and a sixth monitoring unit (meta monitoring means 21b in FIG. 1) that monitors the second monitoring unit (meta monitoring means 11 in FIG. 1) and the first transfer unit. In that case, the first management unit additionally manages the monitoring result from the first transfer unit, and the third monitoring unit additionally monitors the sixth monitoring unit.

Further, the first management unit (event management means 22 in FIG. 1) may request the first monitoring unit to issue a health check event, determine whether a health check event is notified within a fixed time from both or either of the first monitoring unit and the first transfer unit, and issue the determination result to the second management unit (event management means 32 in FIG. 1). The first monitoring unit issues the health check event to the first management unit and to the first transfer unit. If the second management unit does not obtain the determination result within a predetermined time, it issues to the output unit an event indicating that the periodic report from the first management unit has not arrived.
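The two time-window checks in this health check flow can be sketched as follows. This is an illustrative Python sketch under assumptions: the names `HealthCheckResult`, `judge_health_check`, and `report_missing`, the status strings, and the use of plain timestamps in seconds are all hypothetical and are not taken from the specification. `judge_health_check` corresponds to the determination made by the first management unit, and `report_missing` to the check made by the second management unit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthCheckResult:
    direct_ok: bool   # event arrived directly from the first monitoring unit
    relayed_ok: bool  # event arrived via the first transfer unit

    @property
    def status(self) -> str:
        if self.direct_ok and self.relayed_ok:
            return "BOTH_PATHS_OK"
        if self.direct_ok or self.relayed_ok:
            return "ONE_PATH_OK"
        return "NO_RESPONSE"


def judge_health_check(
    requested_at: float,
    direct_at: Optional[float],
    relayed_at: Optional[float],
    window_s: float,
) -> HealthCheckResult:
    """First management unit: did each path answer within window_s?"""

    def in_time(arrived_at: Optional[float]) -> bool:
        return arrived_at is not None and 0 <= arrived_at - requested_at <= window_s

    return HealthCheckResult(in_time(direct_at), in_time(relayed_at))


def report_missing(
    last_result_at: Optional[float], now: float, window_s: float
) -> Optional[str]:
    """Second management unit: flag a determination result that never came."""
    if last_result_at is None or now - last_result_at > window_s:
        return "PERIODIC_REPORT_MISSING"
    return None
```

Keeping the two checks separate mirrors the text: the first distinguishes a failure of one notification path from a failure of both, while the second catches the case in which the first management unit itself stops reporting.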

The third information processing device group (view layer 3 in FIG. 1) may further include a fifth processing device (view machine 30b in FIG. 1) that is configured in the same way as the third processing device (view machine 30a in FIG. 1) and manages the fourth processing device (manager machine 20b in FIG. 1).

Further, the first information processing device group (agent layer 1a in FIG. 7) may include a plurality of first processing devices (agent machines 10a, 10b, 10c, and 10d in FIG. 7), and the second and fourth processing devices may each manage one or more of the first processing devices.

The first information processing device group may include a plurality of first processing devices, and the second information processing device group (manager layer 2a in FIG. 7) may further include a plurality of sixth processing devices (manager machines 20c, 20d, 20f, and 20g in FIG. 7). Each sixth processing device comprises a second transfer unit (event transfer means 23c, 23d, 23f, and 23g in FIG. 7) that transfers the monitoring results of the first monitoring units of one or more of the first processing devices, and a seventh monitoring unit (meta monitoring means 21c, 21d, 21f, and 21g in FIG. 7) that monitors the second transfer unit and the second monitoring units of one or more of the first processing devices. In that case, the second processing device monitors and manages any of the sixth processing devices instead of monitoring and managing the first processing device, and the fourth processing device monitors any of the sixth processing devices and transfers its monitoring results instead of monitoring the first processing device and transferring its monitoring results.

Further, the second information processing device group may further include a plurality of seventh processing devices (manager machines 20e and 20h in FIG. 7). Each seventh processing device comprises a third transfer unit (event transfer means 23e and 23h in FIG. 7) that transfers the transfer results of the second transfer units of one or more of the sixth processing devices (manager machines 20c, 20d, 20f, and 20g in FIG. 7), and an eighth monitoring unit (meta monitoring means 21e and 21h in FIG. 7) that monitors the third transfer unit and the seventh monitoring units of one or more of the sixth processing devices. In that case, the second processing device (manager machine 20j in FIG. 7) monitors and manages any of the seventh processing devices instead of any of the sixth processing devices, and the fourth processing device (manager machine 20i in FIG. 7) monitors any of the seventh processing devices and transfers its monitoring results instead of doing so for any of the sixth processing devices.
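To illustrate how such a multi-stage arrangement of manager machines scales, the following Python sketch counts the manager machines needed at each stage of one monitoring path. It rests on assumptions that are not in the specification: every manager machine is taken to handle at most `fan_in` lower-level machines (a hypothetical parameter), and the duplication of the manager layer into two systems is ignored. The function name `managers_per_stage` is likewise hypothetical.

```python
from math import ceil
from typing import List


def managers_per_stage(num_agents: int, fan_in: int) -> List[int]:
    """Count manager machines per stage, lowest stage first.

    Each manager machine is assumed to handle at most fan_in machines of
    the stage below it; stages are added until a single machine remains.
    """
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    stages: List[int] = []
    remaining = num_agents
    while True:
        remaining = ceil(remaining / fan_in)
        stages.append(remaining)
        if remaining == 1:
            return stages
```

For example, `managers_per_stage(4, 2)` returns `[2, 1]`: four agent machines handled two apiece need two lower-stage manager machines and one machine above them.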

A monitoring system configured in this way monitors the health of the entire system mainly through the functions described in (1) to (3) below.

(1) Function for monitoring the monitoring means (12 in FIG. 1)
The meta monitoring means (11 in FIG. 1) monitors the monitoring means (12 in FIG. 1). When it detects an abnormal operation or a process down of the monitoring means, it sends a meta event toward the output device (33 in FIG. 1) and attempts recovery by restarting the monitoring means (12 in FIG. 1). Here, a meta event is an event issued by a meta monitoring means to notify the monitoring operator of an abnormality.

(2) Function for monitoring the entire monitoring system
Each machine is provided with meta monitoring means so that the health of the system is monitored. The meta monitoring means of an upper layer (for example, meta monitoring means 21a in FIG. 1 for the manager layer) monitors the meta monitoring means of the lower layer (11 in FIG. 1) and the other means in its own layer (22 and 21b in FIG. 1). If the lower-layer meta monitoring means or another means in the same layer is abnormal, it sends a meta event toward the output device (33 in FIG. 1) via the meta monitoring means of the next higher layer (31 in FIG. 1). This monitoring function allows the entire monitoring system to be monitored.

(3) Multiplexing function in the manager layer (2a in FIG. 7)
There is a limit to the number of agents that one manager machine can monitor, so a large-scale system cannot be managed by a single manager machine. The manager layer is therefore arranged in multiple stages, with a plurality of manager machines deployed to distribute the load, so that a large number of agents can be monitored.
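The upward notification of meta events in function (2) can be sketched as a chain of meta monitors, each of which checks its targets and hands any abnormality to the meta monitor of the next higher layer until it reaches the output device. This is an illustrative Python sketch under assumptions: the class `MetaMonitor`, its fields, the health-check callables, and the message format are hypothetical, and the output device is modelled as a plain list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MetaMonitor:
    name: str
    # Monitored means, keyed by name; each callable returns True if healthy.
    targets: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    # Meta monitor of the next higher layer, if any.
    upper: Optional["MetaMonitor"] = None
    # Output device, attached only to the top-layer meta monitor.
    output: Optional[List[str]] = None

    def report(self, meta_event: str) -> None:
        """Pass a meta event upward until it reaches the output device."""
        if self.upper is not None:
            self.upper.report(meta_event)
        elif self.output is not None:
            self.output.append(meta_event)

    def check(self) -> None:
        """Check every target and report each abnormal one as a meta event."""
        for target, healthy in self.targets.items():
            if not healthy():
                self.report(f"{self.name}: {target} abnormal")


output_device: List[str] = []
view = MetaMonitor("meta-31", output=output_device)
manager = MetaMonitor(
    "meta-21a",
    targets={
        "meta-11": lambda: False,        # lower-layer meta monitor is down
        "event-mgmt-22": lambda: True,   # same-layer means are healthy
        "meta-21b": lambda: True,
    },
    upper=view,
)
manager.check()
```

After `manager.check()`, `output_device` holds the single entry `"meta-21a: meta-11 abnormal"`, which reached it through `view` rather than through the ordinary event path.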

Embodiments are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a monitoring system according to a first embodiment of the present invention. In FIG. 1, the monitoring system consists of an agent layer 1, which is a set of machines monitored by the manager layer 2; the manager layer 2, which is a set of machines that perform the monitoring; and a view layer 3, which includes machines that mainly display events occurring anywhere in the monitoring system. Each machine corresponds to an information processing device configured to execute programs and thereby implement predetermined functions.

The agent layer 1 includes an agent machine 10. The agent machine 10 includes a monitored object 13, which corresponds to a process or a log file, a monitoring unit 12 that monitors the monitored object 13, and a meta monitoring unit 11 that monitors the monitoring unit 12.

The manager layer 2 includes manager machines 20a and 20b that monitor the agent machine 10. In operations management it is common to duplicate the manager layer, and the manager layer is duplicated here as well. However, an active-active configuration, in which both machines are in service, is adopted rather than a cluster configuration. The manager machine 20a includes an event management unit 22 and a meta monitoring unit 21a. The manager machine 20b includes an event transfer unit 23 and a meta monitoring unit 21b.

The view layer 3 includes view machines 30a and 30b. The view machine 30a includes an event management unit 32, a meta monitoring unit 31, an output device 33, and a monitoring unit 34. The view machine 30b is configured in the same way as the view machine 30a.

Next, each part is described. In the manager machine 20a, the event management unit 22 forwards events (event notifications) sent from the monitoring unit 12 to the event management unit 32. The event management unit 22 also periodically requests the monitoring unit 12 to send a health check event for confirming normal operation, compares the health check event that arrives via the event transfer unit 23 with the one that arrives directly from the monitoring unit 12, and notifies the event management unit 32 of the corresponding event and the health check event.

The meta monitoring unit 21a monitors the meta monitoring unit 11 and the event management unit 22, and on confirming an abnormality sends an event (meta event notification) to the meta monitoring unit 31. The meta monitoring unit 21a has the same decision function as the event management unit 22: based on the health check events arriving from the meta monitoring unit 11 and those relayed by the meta monitoring unit 21b, it sends events to the meta monitoring unit 31.

In the manager machine 20b, the event transfer unit 23 transfers health check events sent from the monitoring unit 12 to the event management unit 22 and to the view machine 30b. Ordinary events are transferred to the view machine 30b, which is connected to the manager machine 20b.

The meta monitoring unit 21b monitors the meta monitoring unit 11 and the event transfer unit 23, and on confirming an abnormality notifies the view machine 30b. It also transfers events sent from the meta monitoring unit 11 to the view machine 30b, and transfers health check events to the meta monitoring unit 21a and the view machine 30b.

In the view machine 30a, the event management unit 32 forwards the events and health check events sent from the event management unit 22 to the output device 33. If a health check event that should be issued periodically does not arrive, it judges that the event notification path is abnormal and sends an event reporting the abnormality to the output device 33.

The meta monitoring unit 31 monitors the meta monitoring unit 21a and the event management unit 32, and on confirming an abnormality sends an event reporting the abnormality to the output device 33. The meta monitoring unit 31 has the same decision function as the event management unit 32, and sends events to the output device 33 based on the health check events sent by the meta monitoring unit 21a.

The monitoring unit 34 is equivalent to the monitoring unit 12 of the agent layer 1 and exists to monitor the meta monitoring unit 31. The monitoring unit 34 thus makes it possible to monitor the top-level meta monitoring unit 31.

Next, the operation of the monitoring system according to the first embodiment of the present invention is described for each of the following functions (1) to (3).

(1) Function for monitoring the monitoring unit 12
FIG. 2 is a flowchart showing the operation of the agent machine. First, the meta monitoring unit 11 determines whether the monitoring unit 12 is normal (step S11). If it judges the monitoring unit 12 to be abnormal (N in step S11), it issues an event indicating that the monitoring unit 12 is down (step S12). It then attempts to restart the process of the monitoring unit 12 (step S13). If the process starts within a fixed time (Y in step S14), it issues an event reporting that the restart of the monitoring unit 12 succeeded (step S15); if the process does not start within the fixed time (N in step S14), it issues an event reporting that the restart failed (step S16). The issued events pass through the upper-layer machines and are presented to the monitoring operator on the output device 33. Operating in this way makes it possible to confirm that the meta monitoring units monitoring the entire system are functioning normally.
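The flow of steps S11 to S16 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `supervise_once`, the callback parameters, the event names, and the timeout values are all assumptions introduced here.

```python
import time
from typing import Callable


def supervise_once(
    is_healthy: Callable[[], bool],   # hypothetical health probe for the monitoring unit
    restart: Callable[[], None],      # hypothetical restart action
    emit: Callable[[str], None],      # hypothetical event sink (toward the upper layer)
    restart_timeout_s: float = 5.0,   # assumed "fixed time" for step S14
    poll_interval_s: float = 0.1,
) -> None:
    """Run one pass of the FIG. 2 flow (steps S11 to S16)."""
    # S11: nothing to do while the monitoring unit is normal.
    if is_healthy():
        return
    # S12: report that the monitoring unit is down.
    emit("MONITOR_DOWN")
    # S13: attempt to restart its process.
    restart()
    # S14: wait up to restart_timeout_s for the process to come back.
    deadline = time.monotonic() + restart_timeout_s
    while time.monotonic() < deadline:
        if is_healthy():
            emit("MONITOR_RESTART_SUCCEEDED")  # S15
            return
        time.sleep(poll_interval_s)
    emit("MONITOR_RESTART_FAILED")  # S16
```

The down event is emitted before the restart attempt, so the operator sees the failure even when the restart later succeeds.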

To judge whether the monitoring unit 12 is normal or abnormal, the monitoring unit 12 first periodically writes a log entry to a log file 14 to prove that it is itself operating normally ((A) in FIG. 3). As shown in FIG. 4, each entry in the log file 14 contains an id (identification), the date and time, the name of the monitoring unit, and details. Next, the meta monitoring unit 11 periodically reads the log file 14 and judges the monitoring unit 12 to be abnormal if no entry has been added since the previous read or if an entry reporting an abnormality is present ((B) in FIG. 3). If the monitoring unit 12 is not functioning normally, the meta monitoring unit 11 issues an event to the upper-layer meta monitoring unit 21a ((C) in FIG. 3).
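The log-based check can be illustrated with a short sketch. The field names follow the log format described for FIG. 4; the `LogEntry` type, the `check_log` function, the assumption that ids increase with each new entry, and the convention that a detail of "OK" means normal are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LogEntry:
    id: int            # assumed to increase with every new entry
    timestamp: str     # date and time
    monitor_name: str  # name of the monitoring unit
    detail: str        # hypothetical convention: "OK" means normal


def check_log(entries: Iterable[LogEntry], last_seen_id: int) -> Tuple[bool, int]:
    """Return (is_normal, new_last_seen_id) for one periodic read."""
    new_entries = [e for e in entries if e.id > last_seen_id]
    # No entry since the previous read: the monitoring unit is judged abnormal.
    if not new_entries:
        return False, last_seen_id
    newest_id = max(e.id for e in new_entries)
    # Any new entry that reports an abnormality also counts as abnormal.
    if any(e.detail != "OK" for e in new_entries):
        return False, newest_id
    return True, newest_id
```

A caller would keep `new_last_seen_id` between reads and issue an event to the upper-layer meta monitoring unit whenever `is_normal` is false.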

(2) Function for monitoring the event notification path
FIG. 5 is a flowchart showing the operation of the manager machine 20a. First, the event management unit 22 requests the monitoring unit 12 to issue a health check event (step S21). On receiving this request, the monitoring unit 12 issues a health check event to the event management unit 22 and to the event transfer unit 23 (steps S22 and S23). The event transfer unit 23 transfers the health check event to the event management unit 22 (step S24). If health check events arrive from both the event transfer unit 23 and the monitoring unit 12 within a fixed time, the event management unit 22 judges the event notification path to be normal (Y in step S25) and proceeds to step S31.

If health check events are not received from both sources (N in step S25), the event management unit 22 checks whether a health check event is received within the fixed time from either the monitoring unit 12 or the event transfer unit 23 (step S26). If none arrives within the fixed time (N in step S26), it issues an event indicating that the periodic report from the agent machine 10 has not arrived (step S30) and proceeds to step S31. If one arrives within the fixed time (Y in step S26), it checks whether the health check event came from the event transfer unit 23 (step S27). If it did, that is, if no health check event arrived directly from the monitoring unit 12, it issues an event indicating that the periodic report via the manager machine 20a has not arrived (step S28). If the health check event via the manager machine 20b is the one that did not arrive, it issues an event indicating that the periodic report via the manager machine 20b has not arrived (step S29).
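The decision in steps S25 to S30 reduces to a classification over two observations: whether the health check event arrived directly from the monitoring unit 12, and whether it arrived via the event transfer unit 23. The sketch below shows that classification only; the function name and the event labels are assumptions, and waiting for the fixed time is left to the caller.

```python
from typing import Optional


def classify_path(got_direct: bool, got_via_transfer: bool) -> Optional[str]:
    """Map the two arrival flags to the event issued in FIG. 5, if any."""
    # S25 (Y): both copies arrived, so the notification path is normal.
    if got_direct and got_via_transfer:
        return None
    # S26 (N): neither copy arrived within the fixed time.
    if not got_direct and not got_via_transfer:
        return "NO_REPORT_FROM_AGENT_10"      # S30
    # S27: exactly one copy arrived; decide which path is missing.
    if got_via_transfer:
        return "NO_REPORT_VIA_MANAGER_20A"    # S28: direct copy missing
    return "NO_REPORT_VIA_MANAGER_20B"        # S29: relayed copy missing
```

Because two independent paths carry the same health check event, a single missing copy identifies which path failed rather than only that something failed.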

Next, the event management unit 22 issues a health check event to the event management unit 32 (step S31). If the event management unit 32 does not receive a health check event from the event management unit 22 within a fixed time (N in step S32), it issues to the output device 33 an event indicating that the periodic report from the event management unit 22 has not arrived (step S33).
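The check in steps S32 and S33 is a timeout on a periodic message. A minimal sketch follows, assuming a hypothetical `HeartbeatWatchdog` class, an assumed timeout value, and timestamps passed in by the caller so that the logic does not depend on a particular clock.

```python
from typing import Optional


class HeartbeatWatchdog:
    """Tracks the last health check event received from one source."""

    def __init__(self, source: str, timeout_s: float, now: float) -> None:
        self.source = source        # hypothetical label for the sender
        self.timeout_s = timeout_s  # assumed "fixed time" of step S32
        self.last_seen = now

    def heartbeat(self, now: float) -> None:
        # Called whenever a health check event arrives.
        self.last_seen = now

    def check(self, now: float) -> Optional[str]:
        # S32 (N): nothing arrived within the timeout, so report it (S33).
        if now - self.last_seen > self.timeout_s:
            return f"NO_REPORT_FROM_{self.source}"
        return None
```

For example, a watchdog created with `HeartbeatWatchdog("EVENT_MANAGER_22", timeout_s=60.0, now=0.0)` returns no event while heartbeats keep arriving and returns the not-arrived event once more than 60 seconds pass without one.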

The meta monitoring unit 21a has the same function and likewise uses health check events to check the health of the event notification path. This makes it possible to confirm whether the notification path for events issued by the monitoring unit 12 is functioning normally.

(3) Function for monitoring the entire monitoring system
FIG. 6 is a flowchart showing the operation of the meta monitoring units. An upper-layer meta monitoring unit monitors the lower-layer meta monitoring unit and the units in its own layer. When it finds an abnormality (N in step S41), it immediately issues an event indicating that the lower-layer meta monitoring unit or the same-layer unit is down, and sends it toward the output device 33 (step S42).

Whether a monitored target is normal or abnormal is judged in the same way as when the meta monitoring unit 11 monitors the monitoring unit 12, as shown in FIG. 3. That is, the meta monitoring unit 31 monitors the event management unit 32 and the meta monitoring unit 21a; the meta monitoring unit 21a monitors the event management unit 22, the meta monitoring unit 21b, and the meta monitoring unit 11; the meta monitoring unit 21b monitors the event transfer unit 23 and the meta monitoring unit 11; and the meta monitoring unit 11 monitors the monitoring unit 12.

Finally, the top-level meta monitoring unit 31 in the view layer 3 is monitored by the monitoring unit 34, which is equivalent to the monitoring unit 12 of the agent layer 1. If an abnormality is found, it is reported along the path monitoring unit 34 → event management unit 22 → event management unit 32 → output device 33. From the viewpoint of the monitoring unit 34, the meta monitoring unit 31 is treated as a monitored object.
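The monitoring relationships stated above form a directed graph, and one way to see the role of the monitoring unit 34 is to ask which units have no watcher. The sketch below encodes only the relationships named in the text; the dictionary layout, the unit labels, and the `unwatched` helper are assumptions made for illustration.

```python
from typing import Dict, List, Set

# watcher -> units it monitors, as listed in the description of FIG. 1
WATCHES: Dict[str, List[str]] = {
    "meta_monitor_31": ["event_manager_32", "meta_monitor_21a"],
    "meta_monitor_21a": ["event_manager_22", "meta_monitor_21b", "meta_monitor_11"],
    "meta_monitor_21b": ["event_transfer_23", "meta_monitor_11"],
    "meta_monitor_11": ["monitor_12"],
    "monitor_34": ["meta_monitor_31"],
}


def unwatched(watches: Dict[str, List[str]]) -> Set[str]:
    """Return every unit that appears in the graph but has no watcher."""
    watched = {target for targets in watches.values() for target in targets}
    units = set(watches) | watched
    return units - watched
```

With the relationships as listed, `unwatched(WATCHES)` contains only `monitor_34`; removing the `monitor_34` entry leaves `meta_monitor_31` as the one unit nobody watches, which is the gap the monitoring unit 34 closes.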

With a monitoring system that operates as described above, the monitoring operator can check the events displayed on the output device 33 and judge whether monitoring is functioning normally, not only for the monitored object 13 but for the monitoring system as a whole. This is because each upper-layer meta monitoring unit monitors the lower-layer meta monitoring unit and, on finding an abnormality, immediately issues an event that is sent to the output device 33.

In addition, when part of the system fails, the abnormality can be detected quickly and monitoring is not disrupted. This is because the path for reporting abnormalities is duplicated as an event path and a meta event path, and because the function for detecting abnormalities in the units is multiplexed across the manager layer and the view layer.

FIG. 7 is a block diagram showing the configuration of a monitoring system according to a second embodiment of the present invention. In FIG. 7, the same reference numerals as in FIG. 1 denote the same elements, and their description is omitted. The monitoring system differs significantly from the first embodiment shown in FIG. 1, however, in that the manager layer 2a is configured in three stages and the agent layer 1a contains many agent machines.

The agent layer 1a includes agent machines 10a, 10b, 10c, and 10d. Each of the agent machines 10a, 10b, 10c, and 10d has the same configuration as the agent machine 10 in FIG. 1.

The manager layer 2a includes manager machines 20c, 20d, 20e, 20f, 20g, 20h, 20i, and 20j. In the manager layer 2a, the manager machines 20c, 20d, 20f, and 20g form the first stage, the manager machines 20e and 20h form the second stage, and the manager machines 20i and 20j form the third stage. The meta monitoring units 21c to 21i and the event transfer units 23c to 23i of the manager machines 20c to 20i have the same configurations as the meta monitoring unit 21b and the event transfer unit 23 in FIG. 1, respectively. The meta monitoring unit 21j and the event management unit 22j of the manager machine 20j have the same configurations as the meta monitoring unit 21a and the event management unit 22 in FIG. 1, respectively.

The manager machines 20c and 20d monitor the agent machines 10a and 10b, respectively. The manager machines 20f and 20g monitor the agent machines 10c and 10d, respectively. The manager machine 20e monitors the manager machines 20c and 20f. The manager machine 20h monitors the manager machines 20d and 20g. The manager machine 20j monitors the manager machines 20e and 20i. The manager machine 20i monitors the manager machine 20h.
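The manager-to-manager relationships just listed can be written down as a small table, from which the chain of machines above any manager follows directly. The sketch below covers only the manager machines; the `MONITORS` table layout and the `escalation_path` helper are assumptions for illustration, and the agent machines are left out.

```python
from typing import Dict, List

# monitoring manager -> manager machines it monitors (second embodiment, FIG. 7)
MONITORS: Dict[str, List[str]] = {
    "20e": ["20c", "20f"],
    "20h": ["20d", "20g"],
    "20i": ["20h"],
    "20j": ["20e", "20i"],
}


def escalation_path(machine: str, monitors: Dict[str, List[str]]) -> List[str]:
    """Follow 'is monitored by' links upward from a manager machine."""
    # Invert the table; each manager here has exactly one monitoring manager.
    parent = {child: p for p, children in monitors.items() for child in children}
    path = [machine]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path
```

For instance, `escalation_path("20d", MONITORS)` gives `["20d", "20h", "20i", "20j"]`, and `escalation_path("20c", MONITORS)` gives `["20c", "20e", "20j"]`.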

The example above shows the manager machines in the manager layer 2a arranged in three stages. If the manager machines 20i and 20j have sufficient processing capacity, two stages may be used instead. That is, the manager machines 20e and 20h may be omitted, with the manager machine 20j directly managing the manager machines 20c and 20f and the manager machine 20i directly managing the manager machines 20d and 20g.

In general, a single manager machine can monitor only a limited number of agent machines. Therefore, as shown in FIG. 7, configuring the manager layer in multiple stages and arranging several manager machines to distribute the load increases the number of agent machines that can be monitored. A monitoring system with this configuration can perform monitoring with sufficient reliability for the system as a whole, and can also handle the monitoring of thousands of agent machines in a large-scale data center or the like.
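As a rough illustration of why staging helps, consider an idealized single tree in which every manager machine can monitor at most a fixed number of machines directly. The sketch below computes how many stages such a tree needs; the per-manager limit is a hypothetical parameter (the description gives no figure), and the duplicated managers of the embodiments are ignored.

```python
def stages_needed(num_agents: int, fan_out: int) -> int:
    """Smallest number of manager stages whose tree can cover num_agents.

    fan_out is a hypothetical limit on how many machines one manager
    can monitor directly; it is not a value taken from the description.
    """
    if num_agents < 1 or fan_out < 2:
        raise ValueError("num_agents must be >= 1 and fan_out must be >= 2")
    stages, capacity = 1, fan_out
    # Each added stage multiplies the reachable agent count by fan_out.
    while capacity < num_agents:
        capacity *= fan_out
        stages += 1
    return stages
```

Under the assumed limit of 100 machines per manager, 100 agents fit in one stage and 4,000 agents need two.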

FIG. 1 is a block diagram showing the configuration of the monitoring system according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the agent machine.
FIG. 3 is a diagram schematically showing the operation when a meta monitoring unit monitors a monitoring unit.
FIG. 4 is a diagram showing the structure of the log file.
FIG. 5 is a flowchart showing the operation of the manager machine.
FIG. 6 is a flowchart showing the operation of the meta monitoring unit.
FIG. 7 is a block diagram showing the configuration of the monitoring system according to the second embodiment of the present invention.

Explanation of symbols

1, 1a Agent layer
2, 2a Manager layer
3 View layer
10, 10a, 10b, 10c, 10d Agent machine
11 Meta monitoring unit
12 Monitoring unit
13 Monitored object
14 Log file
20a, 20b, 20c, 20d, 20e, 20f, 20g, 20h, 20i, 20j Manager machine
21a, 21b, 21c, 21d, 21e, 21f, 21g, 21h, 21i, 21j Meta monitoring unit
22, 22j Event management unit
23, 23c, 23d, 23e, 23f, 23g, 23h, 23i Event transfer unit
30a, 30b View machine
31 Meta monitoring unit
32 Event management unit
33 Output device
34 Monitoring unit

Claims

A monitoring system comprising first, second, and third information processing device groups that include first, second, and third processing devices, respectively, in which the third processing device monitors the second processing device and the second processing device monitors the first processing device, wherein
the first processing device includes:
a first monitoring unit that monitors an object to be monitored; and
a second monitoring unit that monitors the first monitoring unit for abnormality and notifies the second processing device of the monitoring result, and that notifies the second processing device of the monitoring result of the object obtained by the first monitoring unit,
the second processing device includes:
a first management unit; and
a third monitoring unit that monitors the second monitoring unit and the first management unit for abnormality and notifies the third processing device of the monitoring result, and that further notifies the third processing device of the monitoring result notified from the second monitoring unit,
the third processing device includes:
a second management unit;
a fourth monitoring unit that monitors the third monitoring unit and the second management unit for abnormality to obtain a monitoring result, and that treats the monitoring result notified from the third monitoring unit as a new monitoring result; and
an output unit that outputs the monitoring results of the second management unit and the fourth monitoring unit,
the first management unit requests the first monitoring unit to issue a health check event, determines whether the health check event is notified from the first monitoring unit within a predetermined time, and issues the determination result to the second management unit,
the first monitoring unit issues the health check event to the first management unit, and
if the determination result is not obtained within a predetermined time, the second management unit issues to the output unit, as a monitoring result indicating an abnormality, an event indicating that the regular notification from the first management unit has not arrived.

The monitoring system according to claim 1, wherein the second monitoring unit determines whether the first monitoring unit is normal; if it determines that the first monitoring unit is abnormal, issues an event indicating that the first monitoring unit is down and attempts to restart the process of the first monitoring unit; if the process starts within a predetermined time, issues an event indicating that the restart of the first monitoring unit succeeded; if the process does not start within the predetermined time, issues an event indicating that the restart of the first monitoring unit failed; and notifies the third monitoring unit of the issued events.

The third processing apparatus further includes a fifth monitoring unit that monitors the abnormality of the fourth monitoring unit and notifies the first management unit of a monitoring result,
The monitoring system according to claim 1, wherein the first management unit further notifies the second management unit of a monitoring result in the fifth monitoring unit.

The second information processing device group includes:
A first transfer unit that transfers a monitoring result of the first monitoring unit to the first management unit;
A sixth monitoring unit for monitoring an abnormality of the second monitoring unit and the first transfer unit;
A fourth processing device comprising:
The first management unit further notifies the second management unit of a monitoring result from the first transfer unit,
The monitoring system according to claim 1, wherein the third monitoring unit further monitors an abnormality of the sixth monitoring unit.

The monitoring system according to claim 4, wherein
the first management unit determines whether the health check event is notified from the first transfer unit within a predetermined time and issues the determination result to the second management unit,
the first monitoring unit issues the health check event to the first transfer unit, and
if the determination result is not obtained within a predetermined time, the second management unit issues to the output unit, as a monitoring result indicating an abnormality, an event indicating that the regular notification from the first management unit has not arrived.

The monitoring system according to claim 4, wherein the third information processing device group further includes a fifth processing device that is configured in the same manner as the third processing device and has a notification function and a monitoring function with respect to the fourth processing device.

The first information processing device group includes a plurality of the first processing devices,
The monitoring system according to claim 4 or 6, wherein each of the second and fourth processing devices has a notification function and a monitoring function with one or more of the first processing devices.

The first information processing device group includes a plurality of the first processing devices,
The second information processing device group further includes a plurality of sixth processing devices,
Each of the sixth processing devices is
A second transfer unit that transfers a monitoring result of one or more of the first monitoring units of the plurality of first processing devices;
A seventh monitoring unit that monitors any one or more of the second monitoring unit and the second transfer unit of the plurality of first processing devices;
With
The second processing device has a notification function and a monitoring function with any of the plurality of sixth processing devices instead of the first processing device,
Instead of monitoring the first processing device and transferring the monitoring result of the first processing device, the fourth processing device monitors one of the plurality of sixth processing devices and 7. The monitoring system according to claim 4, wherein transfer is performed.

The monitoring system according to claim 8, wherein
the second information processing device group further includes a plurality of seventh processing devices,
each of the seventh processing devices includes:
a third transfer unit that transfers the transfer result of the second transfer unit of one or more of the plurality of sixth processing devices; and
an eighth monitoring unit that monitors the seventh monitoring unit of one or more of the plurality of sixth processing devices and the third transfer unit,
the second processing device has a notification function and a monitoring function with respect to any of the plurality of seventh processing devices instead of any of the plurality of sixth processing devices, and
instead of monitoring the plurality of sixth processing devices and transferring their monitoring results, the fourth processing device monitors the plurality of seventh processing devices and transfers their monitoring results.

A monitoring method in a monitoring system that comprises first, second, and third information processing device groups including first, second, and third processing devices, respectively, and in which the third processing device monitors the second processing device and the second processing device monitors the first processing device, the method including:
in the first processing device,
a first monitoring step of monitoring an object to be monitored; and
a second monitoring step of monitoring the first monitoring step for abnormality and notifying the second processing device of the monitoring result, and of notifying the second processing device of the monitoring result of the object obtained in the first monitoring step;
in the second processing device,
a first management step; and
a third monitoring step of monitoring the second monitoring step and the first management step for abnormality and notifying the third processing device of the monitoring result, and of further notifying the third processing device of the monitoring result notified in the second monitoring step; and
in the third processing device,
a second management step;
a fourth monitoring step of monitoring the third monitoring step and the second management step for abnormality to obtain a monitoring result, and of treating the monitoring result notified in the third monitoring step as a new monitoring result; and
an output step of outputting the monitoring results of the second management step and the fourth monitoring step,
wherein, in the first management step, the first monitoring step is requested to issue a health check event, it is determined whether the health check event is notified from the first monitoring step within a predetermined time, and the determination result is issued to the second management step,
in the first monitoring step, the health check event is issued to the first management step, and
in the second management step, if the determination result is not obtained within a predetermined time, an event indicating that the regular notification from the first management step has not arrived is issued to the output step as a monitoring result indicating an abnormality.

The second monitoring step includes
Determining whether the monitoring result in the first monitoring step is normal;
If it is determined that the monitoring result in the first monitoring step is abnormal, issuing an event indicating that it is down in the first monitoring step;
Attempting to restart the process in the first monitoring step and issuing a successful restart event in the first monitoring step to the third monitoring step if started within a certain time;
Issuing a restart failure event in the first monitoring step to the third monitoring step if it has not started within a predetermined time;
The monitoring method according to claim 10, further comprising:

The third processing device includes a fifth monitoring step of monitoring an abnormality in the fourth monitoring step and notifying a monitoring result to the first management step;
The monitoring method according to claim 10, wherein in the first management step, a monitoring result in the fifth monitoring step is further notified to the second management step.

The second information processing device group further includes a fourth processing device,
A first transfer step in which the fourth processing device transfers a monitoring result of the first monitoring step to the first management step;
A sixth monitoring step in which the fourth processing device monitors an abnormality in the second monitoring step and the first transfer step;
Including
In the first management step, further notifying the monitoring result from the first transfer step to the second management step;
In the third monitoring step, further monitoring the abnormality in the sixth monitoring step;
The monitoring method according to claim 12, further comprising:

The monitoring method according to claim 13, wherein
in the first management step, it is determined whether the health check event is notified from the first transfer step within a predetermined time, and the determination result is issued to the second management step,
in the first monitoring step, the health check event is issued to the first transfer step, and
in the second management step, if the determination result is not obtained within a predetermined time, an event indicating that the regular notification from the first management step has not arrived is issued to the output step as a monitoring result indicating an abnormality.

The monitoring method according to claim 13, wherein the third information processing device group further includes a fifth processing device configured similarly to the third processing device, and
the fifth processing device has a notification function and a monitoring function with respect to the fourth processing device.

The monitoring method according to claim 13 or 15, wherein the first information processing device group includes a plurality of the first processing devices, and
each of the second and fourth processing devices has a notification function and a monitoring function with respect to one or more of the first processing devices.

The first information processing device group includes a plurality of the first processing devices,
The second information processing device group further includes a plurality of sixth processing devices,
A second transfer step in which each of the sixth processing devices transfers a monitoring result of the first monitoring step in any one or more of the plurality of first processing devices;
A seventh monitoring step in which each of the sixth processing devices monitors the second monitoring step and the second transfer step in any one or more of the plurality of first processing devices;
Including
The second processing device has a notification function and a monitoring function with respect to any of the plurality of sixth processing devices instead of the first processing device,
Instead of monitoring the first processing device and transferring the monitoring result of the first processing device, the fourth processing device monitors any of the plurality of sixth processing devices and transfers the monitoring result thereof. 16. The monitoring method according to claim 13.

The monitoring method according to claim 17, wherein the second information processing device group further includes a plurality of seventh processing devices,
the method further comprising:
A third transfer step in which each of the seventh processing devices transfers the transfer result of the second transfer step of any one or more of the plurality of sixth processing devices; and
An eighth monitoring step in which each of the seventh processing devices monitors the seventh monitoring step and the third transfer step for any one or more of the plurality of sixth processing devices,
wherein:
The second processing device has a notification function and a monitoring function with respect to any of the plurality of seventh processing devices instead of any of the plurality of sixth processing devices; and
Instead of monitoring any of the plurality of sixth processing devices and transferring the monitoring result thereof, the fourth processing device monitors any of the plurality of seventh processing devices and transfers the monitoring result thereof.
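
The tiered transfer described in the last two claims, in which results pass from the sixth processing devices through the seventh processing devices to the fourth processing device and on to the first management step, can be pictured with a small sketch. It assumes each tier simply forwards a result to a single upstream node. The `Relay` class and the node names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Relay:
    """One hop in the transfer chain (hypothetical model of a transfer step)."""
    name: str
    upstream: Optional["Relay"] = None
    # Only the final node in the chain stores results here.
    received: List[Tuple[str, List[str]]] = field(default_factory=list)

    def transfer(self, result: str, path: Optional[List[str]] = None) -> None:
        # Record this hop, then forward upstream or store at the chain's end.
        hops = (path or []) + [self.name]
        if self.upstream is None:
            self.received.append((result, hops))
        else:
            self.upstream.transfer(result, hops)


# Assumed topology: sixth -> seventh -> fourth -> first management step.
manager = Relay("first_management")
fourth = Relay("fourth", manager)
seventh = Relay("seventh", fourth)
sixth = Relay("sixth", seventh)

sixth.transfer("process_down")
print(manager.received)
```

The sketch covers only the forwarding path; the monitoring that each tier performs on the tier below it is not modeled.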