JP7707640B2 - Cluster system, monitoring system, monitoring method, and program

Info

Publication number: JP7707640B2
Application number: JP2021080395A
Authority: JP
Inventors: 大輝木本
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2025-07-15
Anticipated expiration: 2041-05-11
Also published as: JP2022174535A

Description

The present disclosure relates to a cluster system, a monitoring system, a monitoring method, and a program.

When a company or similar organization builds an internal network, it may use a cluster system to ensure scalability and availability. A cluster system manages the server devices and other equipment within it using predetermined policies or specific parameters. A server device whose availability is not ensured by the cluster system falls outside the cluster system's management, so the policies applied to the cluster system do not apply to it. For such a server device, recovery from a failure follows a procedure different from the one used when a failure occurs in a server device within the cluster system.

Patent Document 1 discloses a configuration in which multiple computers connected via a network perform distributed processing. When determining the output order of data, the computers disclosed in Patent Document 1 use partial-order delivery, which keeps the data output from each computer consistent and lets processing continue even if some of the computers fail.

Patent Document 2 discloses a system configuration having two computers that process multiple functions in a distributed manner and a common auxiliary storage device. Patent Document 2 discloses a backup operation mode in which, if one computer fails, the other computer takes over and operates the functions that were running on the failed computer.

Patent Document 1: JP 2020-187526 A
Patent Document 2: Japanese Patent Application Publication No. H09-244910

When an internal network or the like includes multiple cluster systems, a server device that is outside the scope of cluster management may be shared and managed by those cluster systems. In this case, when a failure occurs in the server device, each cluster system executes its own recovery operation on it, so the recovery operations overlap or conflict and an appropriate recovery operation is not performed. The computer disclosed in Patent Document 2 takes over functions according to a predetermined procedure when a failure occurs, so multiple recovery operations are never executed on the failed computer. Consequently, the recovery operation disclosed in Patent Document 2 cannot solve the problem that an appropriate recovery operation is not performed when a failure occurs in a server device that multiple cluster systems share and manage.

One object of the present disclosure is to provide a cluster system, a monitoring system, a monitoring method, and a program that can execute an appropriate recovery operation on a server device shared by multiple cluster systems when a failure occurs in that server device.

A cluster system according to a first aspect of the present disclosure includes: a management unit that manages a monitoring state of a server device in each of a plurality of cluster systems, and an execution state indicating a first cluster system that executes a recovery operation on the server device when the server device is in an abnormal state; a monitoring unit that monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and also reflects, in the monitoring state, monitoring results of the server device received from other cluster systems; a determination unit that, when the monitoring result of at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation on the server device according to the same determination criterion as that used by the other cluster systems that manage the monitoring state, and reflects the determination result in the execution state; and a control unit that determines, according to the managed execution state, whether to execute the recovery operation on the server device.

A monitoring system according to a second aspect of the present disclosure includes a plurality of cluster systems and a server device managed by the plurality of cluster systems. Each of the cluster systems: manages a monitoring state of the server device in each of the plurality of cluster systems, and an execution state indicating a first cluster system that executes a recovery operation on the server device when the server device is in an abnormal state; monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and also reflects, in the monitoring state, monitoring results of the server device received from other cluster systems; when the monitoring result of at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation on the server device according to the same determination criterion as that used by the other cluster systems that manage the monitoring state, and reflects the determination result in the execution state; and determines, according to the managed execution state, whether to execute the recovery operation on the server device.

A monitoring method according to a third aspect of the present disclosure includes: managing a monitoring state of a server device in each of a plurality of cluster systems, and an execution state indicating a first cluster system that executes a recovery operation on the server device when the server device is in an abnormal state; monitoring whether the server device is in a normal state or an abnormal state, reflecting the monitoring result in the monitoring state, and also reflecting, in the monitoring state, monitoring results of the server device received from other cluster systems; when the monitoring result of at least one of the plurality of cluster systems indicates an abnormal state, determining the first cluster system that executes the recovery operation on the server device according to the same determination criterion as that used by the other cluster systems that manage the monitoring state, and reflecting the determination result in the execution state; and determining, according to the managed execution state, whether to execute the recovery operation on the server device.

A program according to a fourth aspect of the present disclosure causes a computer to: manage a monitoring state of a server device in each of a plurality of cluster systems, and an execution state indicating a first cluster system that executes a recovery operation on the server device when the server device is in an abnormal state; monitor whether the server device is in a normal state or an abnormal state, reflect the monitoring result in the monitoring state, and also reflect, in the monitoring state, monitoring results of the server device received from other cluster systems; when the monitoring result of at least one of the plurality of cluster systems indicates an abnormal state, determine the first cluster system that executes the recovery operation on the server device according to the same determination criterion as that used by the other cluster systems that manage the monitoring state, and reflect the determination result in the execution state; and determine, according to the managed execution state, whether to execute the recovery operation on the server device.

The present disclosure makes it possible to provide a cluster system, a monitoring system, a monitoring method, and a program that can execute an appropriate recovery operation on a server device shared by multiple cluster systems when a failure occurs in that server device.

FIG. 1 is a configuration diagram of a cluster system according to a first embodiment.
FIG. 2 is a configuration diagram of a monitoring system according to a second embodiment.
FIG. 3 is a diagram showing a monitoring map according to the second embodiment.
FIG. 4 is a diagram explaining the values set in the monitoring state of the monitoring map according to the second embodiment.
FIG. 5 is a diagram explaining the values set in the execution state of the monitoring map according to the second embodiment.
FIG. 6 is a diagram showing the flow of recovery operation execution processing according to the second embodiment.
FIG. 7 is a diagram showing the flow of recovery operation execution processing according to the second embodiment.
FIG. 8 is a diagram showing the transition of values set in the monitoring map according to the second embodiment.
FIG. 9 is a diagram showing the transition of values set in the monitoring map according to the second embodiment.
FIG. 10 is a diagram showing the flow of recovery operation execution processing according to the second embodiment.
FIG. 11 is a diagram showing the flow of recovery operation execution processing according to the second embodiment.
FIG. 12 is a diagram showing the transition of values set in the monitoring map according to the second embodiment.
FIG. 13 is a configuration diagram of a cluster system according to each embodiment.

(Embodiment 1)
Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. A configuration example of a cluster system 10 according to the first embodiment will be described with reference to FIG. 1. The cluster system 10 is a system in which one or more computer devices operate in cooperation to achieve flexible scalability or high availability. The cluster system 10 may be a system in which multiple computer devices perform distributed processing. Alternatively, the cluster system 10 may be a system having one computer device that performs active operation and a backup computer device for that active computer device. The components of the cluster system 10 described below may be functions that are distributed across and executed by multiple computer devices, or functions that are executed by the one computer device that performs active operation.

A computer device is a device that operates by having a processor execute a program stored in a memory. The computer device may be, for example, a server device.

The cluster system 10, which is a computer device or a collection of computer devices, has a management unit 11, a monitoring unit 12, a determination unit 13, and a control unit 14. These components of the cluster system 10 may be software or modules whose processing is carried out by a processor executing a program stored in a memory. Alternatively, the components of the cluster system 10 may be hardware, such as a circuit or a chip.

The management unit 11 manages the monitoring state of the server device in each of the multiple cluster systems, and the execution state indicating the first cluster system that executes a recovery operation on the server device when the server device is in an abnormal state. Each of the multiple cluster systems may achieve scalability or availability using a policy or system configuration different from those of the other cluster systems. The server device is a computer device that falls outside the set of computer devices each cluster system manages to ensure scalability or availability. The server device may be, for example, a DNS (Domain Name System) server device. The server device is nevertheless managed by each cluster system. In other words, when a failure occurs in the server device, each cluster system detects the failure, and each cluster system executes a recovery operation on the server device.

The monitoring state indicates the monitoring result in each cluster system, for example whether the server device is in a normal state or an abnormal state. The abnormal state may be, for example, a state in which a fault or failure has occurred in the server device. The recovery operation may be, for example, restarting some of the functions, services, or applications of the server device, or restarting the server device itself. The execution state indicates, for example, which cluster system executes the recovery operation on the failed server device.

The management unit 11 may, for example, manage the monitoring state and the execution state for each cluster system. Specifically, the management unit 11 may use a database to manage flag information indicating the monitoring state and the execution state of each cluster system.
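As a minimal sketch of how such per-cluster flag information could be held in a database, the following uses an in-memory SQLite table. The table name, column names, and the integer value 0 used as a default flag are assumptions made for illustration; the embodiment does not specify a schema or a database product.

```python
import sqlite3

# Hypothetical schema: one row per cluster system, holding its two flags.
SCHEMA = """
CREATE TABLE monitoring_flags (
    cluster_id       INTEGER PRIMARY KEY,
    monitoring_state INTEGER NOT NULL DEFAULT 0,
    execution_state  INTEGER NOT NULL DEFAULT 0
)
"""


def open_flag_store(cluster_ids):
    """Create an in-memory store with one default row per cluster system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO monitoring_flags (cluster_id) VALUES (?)",
        [(cid,) for cid in cluster_ids],
    )
    conn.commit()
    return conn


def set_monitoring_state(conn, cluster_id, state):
    """Record the monitoring result reported for one cluster system."""
    conn.execute(
        "UPDATE monitoring_flags SET monitoring_state = ? WHERE cluster_id = ?",
        (state, cluster_id),
    )
    conn.commit()


def read_flags(conn):
    """Return {cluster_id: (monitoring_state, execution_state)}."""
    rows = conn.execute(
        "SELECT cluster_id, monitoring_state, execution_state "
        "FROM monitoring_flags ORDER BY cluster_id"
    )
    return {cid: (mon, exe) for cid, mon, exe in rows}
```

Keeping one row per cluster system means that a result received from another cluster system updates only that system's row, which mirrors the per-cluster management described above.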

The monitoring unit 12 monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and also reflects, in the monitoring state, the monitoring results of the server device received from other cluster systems.

For example, the monitoring unit 12 may send a message to the server device and determine whether the server device is in a normal state or an abnormal state depending on whether a response message is received. Alternatively, if the server device is a DNS server device, the monitoring unit 12 may send a virtual host name to the DNS server device and determine whether the server device is in a normal state or an abnormal state depending on whether address information for the virtual host name is received.
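The DNS-based check described above can be sketched as follows. The function `check_dns_server`, the string constants, and the injected `resolve` callable are hypothetical names introduced for illustration; in practice `resolve` would be whatever client sends the query to the monitored DNS server device, which the embodiment does not specify.

```python
from typing import Callable, Optional

NORMAL = "normal"
ABNORMAL = "abnormal"


def check_dns_server(
    resolve: Callable[[str], Optional[str]], virtual_host: str
) -> str:
    """Return NORMAL if address information comes back, otherwise ABNORMAL."""
    try:
        address = resolve(virtual_host)
    except OSError:
        # Socket-based resolvers report timeouts and lookup failures
        # as OSError subclasses, so both are treated as "no answer".
        return ABNORMAL
    return NORMAL if address else ABNORMAL
```

Passing the resolver in as a parameter keeps the decision rule (an answer means normal, no answer means abnormal) separate from the transport used to reach the server device.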

The monitoring unit 12 reflects its monitoring result in the monitoring state of the server device for the cluster system 10, which is managed by the management unit 11. The monitoring unit 12 also receives monitoring results of the server device from cluster systems other than the cluster system 10. That is, the other cluster systems monitor the server device in the same way as the monitoring unit 12. On receiving a monitoring result, the monitoring unit 12 reflects it in the monitoring state of the server device for the corresponding other cluster system, which is managed by the management unit 11.

When the monitoring result of at least one of the multiple cluster systems indicates an abnormal state, the determination unit 13 determines the cluster system that executes the recovery operation on the server device. The determination unit 13 makes this determination according to the same determination criterion as that used by the other cluster systems that manage the monitoring state. Once it has determined the cluster system that executes the recovery operation, the determination unit 13 reflects the determination result in the execution state managed by the management unit 11.

Each cluster system may use a different method to monitor the server device. As a result, some cluster systems may detect an abnormal state of the server device while others do not.

The determination criterion is a criterion by which the cluster system that executes the recovery operation can be uniquely determined. For example, the determination criterion may define a priority for each cluster system, and the determination unit 13 may select the cluster system with the highest priority as the cluster system that executes the recovery operation. The multiple cluster systems hold the same determination criterion. In other words, the multiple cluster systems share a single determination criterion.
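A priority-based criterion of this kind can be sketched as a pure function. The names `decide_recovery_cluster`, `abnormal_detected`, and `priority_rank` are hypothetical, and two points are assumptions rather than statements from the text: rank 1 is treated as the highest priority, and the highest-priority cluster system is chosen regardless of which cluster system detected the abnormality.

```python
from typing import Mapping, Optional


def decide_recovery_cluster(
    abnormal_detected: Mapping[str, bool],
    priority_rank: Mapping[str, int],
) -> Optional[str]:
    """Pick the cluster system that should execute the recovery operation.

    Returns None while no cluster system reports an abnormal state.
    """
    if not any(abnormal_detected.values()):
        return None
    # Sort key is (rank, name): the name breaks ties, so the result is
    # unique even if two cluster systems were given the same rank.
    return min(priority_rank, key=lambda name: (priority_rank[name], name))
```

Because the function depends only on its inputs, every cluster system that evaluates it over the same monitoring state and the same ranks reaches the same answer without further coordination.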

The control unit 14 determines, according to the execution state, whether to execute the recovery operation on the server device. If the execution state indicates that the cluster system 10 is to execute the recovery operation, the control unit 14 executes the recovery operation on the server device. If the execution state indicates that another cluster system is to execute the recovery operation, the control unit 14 does not execute the recovery operation on the server device.

As described above, the cluster system 10 manages the monitoring state of the server device for all cluster systems, including the cluster system 10 itself. As a result, even if the cluster system 10 fails to detect an abnormal state of the server device, it can learn that another cluster system has detected one.

Furthermore, the cluster system 10 determines the cluster system that executes the recovery operation on the server device in which an abnormal state has been detected, using the same determination criterion as that held by the other cluster systems. This allows the multiple cluster systems, including the cluster system 10, to uniquely determine the cluster system that executes the recovery operation. As a result, the recovery operation on the server device in the abnormal state is not executed redundantly by multiple cluster systems. In other words, each cluster system can appropriately determine the cluster system that executes the recovery operation on the server device in the abnormal state.
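The effect summarized above can be illustrated with a small in-memory simulation. The `Cluster` class, the `simulate` helper, the cluster names, and the rank table are all hypothetical; message exchange between cluster systems is replaced by direct method calls, and rank 1 is assumed to mean the highest priority.

```python
# Shared determination criterion: the same table is held by every cluster.
PRIORITY_RANK = {"cluster10": 1, "cluster20": 2, "cluster30": 3}


class Cluster:
    def __init__(self, name):
        self.name = name
        # Monitoring state per cluster system: True means "abnormal reported".
        self.monitoring = {n: False for n in PRIORITY_RANK}
        self.designated = None   # execution state: who should recover
        self.recoveries = 0      # how many recovery operations this cluster ran

    def record(self, reporter, abnormal):
        """Monitoring unit: reflect an own or received monitoring result."""
        self.monitoring[reporter] = abnormal

    def decide(self):
        """Determination unit: apply the shared criterion locally."""
        if any(self.monitoring.values()):
            self.designated = min(PRIORITY_RANK, key=PRIORITY_RANK.get)

    def control(self):
        """Control unit: recover only if this cluster is the designated one."""
        if self.designated == self.name:
            self.recoveries += 1


def simulate(detector):
    """Let one cluster detect the abnormality and return recoveries per cluster."""
    clusters = {name: Cluster(name) for name in PRIORITY_RANK}
    for cluster in clusters.values():
        cluster.record(detector, True)   # result shared with every cluster
    for cluster in clusters.values():
        cluster.decide()
        cluster.control()
    return {name: c.recoveries for name, c in clusters.items()}
```

In this sketch, `simulate("cluster20")` yields one recovery operation in total, executed by `cluster10`, even though `cluster10` did not detect the abnormality itself.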

(Embodiment 2)
Next, a configuration example of a monitoring system according to the second embodiment will be described with reference to FIG. 2. The monitoring system in FIG. 2 includes a cluster system 10, a cluster system 20, a cluster system 30, and a shared server device 40. The cluster system 10, the cluster system 20, the cluster system 30, and the shared server device 40 may be included in, for example, a single in-house system.

The cluster system 10, the cluster system 20, the cluster system 30, and the shared server device 40 are connected via a network. The network may be, for example, an IP network. The cluster system 20 and the cluster system 30 have the same configuration as the cluster system 10. The shared server device 40 is a server device that falls outside the set of computer devices managed to ensure scalability or availability in the cluster system 10, the cluster system 20, and the cluster system 30. The shared server device 40 is nevertheless managed by the cluster system 10, the cluster system 20, and the cluster system 30. The shared server device 40 may be, for example, a DNS server device.

For example, to access the cluster system 20 or 30, the cluster system 10 may obtain address information identifying the cluster system 20 or 30 from the shared server device 40 operating as a DNS server device. Accessing the cluster system 20 may mean accessing any of the computer devices managed within the cluster system 20. Alternatively, it may mean accessing the computer device in the cluster system 20 that has the function of communicating with other cluster systems.

Next, the monitoring maps managed by the cluster system 10, the cluster system 20, and the cluster system 30 will be described with reference to FIG. 3. The following description focuses on the monitoring map managed by the cluster system 10, but the monitoring maps managed by the cluster system 20 and the cluster system 30 have the same structure.

The cluster system 10 manages the monitoring map in the management unit 11. The monitoring map associates each cluster system with a monitoring state, an execution state, and an execution order. The values in the cluster system column are the identification information of the cluster systems and show that the cluster system 10, the cluster system 20, and the cluster system 30 shown in FIG. 2 are managed in the monitoring map.

The values in the execution order column indicate the order in which the recovery operation is executed. The cluster system set to 1 has the highest priority for executing the recovery operation, and the cluster system set to 3 has the lowest priority.

The values set in the monitoring state will be described with reference to FIG. 4. These values may also be referred to as flag information. FIG. 4 shows that the monitoring state takes one of three values: normal, paused, and abnormal. The flag for normal is 0, the flag for paused is 1, and the flag for abnormal is 2. Normal indicates that the shared server device 40 is not in an abnormal state, that is, no fault or failure has occurred in the shared server device 40. Paused indicates that monitoring of the shared server device 40 is temporarily stopped. Abnormal indicates that the shared server device 40 is not normal, that is, a fault or failure has occurred in the shared server device 40.

Next, the values set in the execution state will be described with reference to FIG. 5. These values may also be referred to as flag information. FIG. 5 shows that the execution state takes one of four values: not executed, preparing to execute, executing, and executed. The flag for not executed is 0, the flag for preparing to execute is 1, the flag for executing is 2, and the flag for executed is 3. Not executed indicates that no recovery operation is executed on the shared server device 40 in the abnormal state. Preparing to execute indicates that preparations are under way to execute a recovery operation on the shared server device 40 in the abnormal state. Executing indicates that a recovery operation on the shared server device 40 in the abnormal state is in progress. Executed indicates that the recovery operation on the shared server device 40 in the abnormal state has been completed.
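The monitoring map and its flag values can be sketched as the following data structure. The flag values follow FIGS. 4 and 5 as described above; the class names, the `new_monitoring_map` helper, the all-normal initial state, and the particular execution order assigned to each cluster system are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class MonitoringState(IntEnum):
    NORMAL = 0     # no fault or failure detected
    PAUSED = 1     # monitoring temporarily stopped
    ABNORMAL = 2   # fault or failure detected


class ExecutionState(IntEnum):
    NOT_EXECUTED = 0   # no recovery operation is executed
    PREPARING = 1      # preparing to execute the recovery operation
    EXECUTING = 2      # recovery operation in progress
    EXECUTED = 3       # recovery operation completed


@dataclass
class MapEntry:
    execution_order: int   # 1 = highest priority for recovery
    monitoring_state: MonitoringState = MonitoringState.NORMAL
    execution_state: ExecutionState = ExecutionState.NOT_EXECUTED


def new_monitoring_map():
    """Build a map keyed by cluster identifier (illustrative order values)."""
    return {10: MapEntry(1), 20: MapEntry(2), 30: MapEntry(3)}
```

Using `IntEnum` keeps the stored values identical to the numeric flags in the figures while giving each value a readable name in code.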

Next, the flow of recovery operation execution processing when only the cluster system 10 detects an abnormality in the shared server device 40 will be described with reference to FIGS. 6 and 7. The transition of the values set in the monitoring map will also be described with reference to FIG. 8. FIG. 8 shows that the execution order of the cluster system 10 is 1, that of the cluster system 20 is 2, and that of the cluster system 30 is 3. FIG. 8 also associates each step in FIGS. 6 and 7 at which the monitoring map is updated with the flag information in the monitoring map.

First, the cluster system 10 detects that the shared server device 40 is in an abnormal state (S11). For example, the cluster system 10 determines that the shared server device 40 is in an abnormal state when it cannot obtain address information corresponding to a virtual host name from the shared server device 40.

Next, the cluster system 10 sends the cluster system 20 and the cluster system 30 a message indicating that it has detected an abnormal state of the shared server device 40 (S12).

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおける監視状態を更新する（Ｓ１３）。例えば、クラスタシステム１０は、異常状態を検出したことを示すメッセージを送信したことを契機に監視マップを更新する。また、クラスタシステム２０及びクラスタシステム３０は、異常状態を検出したことを示すメッセージを受信したことを契機に監視マップを更新する。図６においては、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０が監視マップを更新するタイミングが同一であることを示しているが、完全に同一のタイミングに監視マップの更新が行われなくてもよい。以下の説明においても同様に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０において実行される処理のタイミングが同一であることが示されていても、完全に同一のタイミングでなくてもよい。 Next, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status in the monitoring map (S13). For example, cluster system 10 updates the monitoring map when it sends the message indicating that an abnormal state has been detected. Cluster system 20 and cluster system 30 update the monitoring map when they receive that message. FIG. 6 shows cluster system 10, cluster system 20, and cluster system 30 updating the monitoring map at the same timing, but the updates do not have to occur at exactly the same timing. Similarly, in the following description, even where processes executed in cluster system 10, cluster system 20, and cluster system 30 are shown at the same timing, the timing does not have to be exactly the same.

具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ１２の列に示されるように、クラスタシステム１０の監視状態を２に設定する。 Specifically, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status of cluster system 10 to 2, as shown in the step S12 column of the monitoring map in Figure 8.

また、図６においては、クラスタシステム１０は、メッセージを送信した後に、監視マップを更新しているが、ステップＳ１１において異常状態を検出し、ステップＳ１２においてメッセージを送信する前に、監視マップを更新してもよい。 In addition, in FIG. 6, the cluster system 10 updates the monitoring map after sending the message, but it may also detect an abnormal condition in step S11 and update the monitoring map before sending the message in step S12.
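
Steps S12 and S13 can be sketched with one monitoring-map replica per cluster system. This is an illustration under stated assumptions: the names `Entry`, `new_map`, and `on_abnormal_detected` are hypothetical, the map layout is a guess at what FIG. 8 tabulates, and the meaning of monitoring status values 0 and 1 is inferred from steps S29 and S21 rather than quoted from a definition.

```python
from dataclasses import dataclass

MONITOR_NORMAL = 0    # assumed: value after the reset in step S29
MONITOR_PAUSED = 1    # assumed: value set in step S21 (monitoring stopped)
MONITOR_ABNORMAL = 2  # value set in step S13 for the detecting cluster system


@dataclass
class Entry:
    order: int          # execution order (1 = highest priority)
    monitor: int = 0    # monitoring status
    execution: int = 0  # execution state (0..3)


def new_map() -> dict[int, Entry]:
    """Initial map: execution order 1, 2, 3 for cluster systems 10, 20, 30."""
    return {10: Entry(1), 20: Entry(2), 30: Entry(3)}


def on_abnormal_detected(mon_map: dict[int, Entry], detector: int) -> None:
    """Applied by the sender after sending and by receivers on receipt (S13)."""
    mon_map[detector].monitor = MONITOR_ABNORMAL


# Each cluster system keeps its own replica and applies the same update.
maps = {cid: new_map() for cid in (10, 20, 30)}
for replica in maps.values():
    on_abnormal_detected(replica, detector=10)
```

Because every replica applies the same update for the same message, the replicas stay equal even when the updates happen at slightly different times.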

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０に対する監視処理を実行する（Ｓ１４）。図６においては、クラスタシステム１０のみが共有サーバ装置４０の異常状態を検出する例について説明するため、クラスタシステム２０及びクラスタシステム３０は、ステップＳ１４において異常状態を検出しなかったとする。 Next, cluster system 10, cluster system 20, and cluster system 30 execute a monitoring process for shared server device 40 (S14). In FIG. 6, an example is described in which only cluster system 10 detects an abnormal state of shared server device 40, so it is assumed that cluster system 20 and cluster system 30 did not detect an abnormal state in step S14.

次に、クラスタシステム２０は、クラスタシステム１０及びクラスタシステム３０へ監視結果を含むメッセージを送信する（Ｓ１５）。さらに、クラスタシステム３０は、クラスタシステム１０及びクラスタシステム２０へ監視結果を示すメッセージを送信する（Ｓ１６）。クラスタシステム２０及びクラスタシステム３０は、共有サーバ装置４０が正常であることを示すメッセージを送信する。また、図６は、ステップＳ１５においてクラスタシステム２０がメッセージを送信した後に、クラスタシステム３０がステップＳ１６においてメッセージを送信する例を示しているが、ステップＳ１５及びＳ１６の順番は逆であってもよい。もしくは、ステップＳ１５及びＳ１６は、実質的に同一のタイミングに実行されてもよい。 Next, cluster system 20 transmits a message including the monitoring result to cluster system 10 and cluster system 30 (S15). Furthermore, cluster system 30 transmits a message indicating the monitoring result to cluster system 10 and cluster system 20 (S16). Cluster system 20 and cluster system 30 transmit a message indicating that shared server device 40 is normal. Also, while FIG. 6 shows an example in which cluster system 20 transmits a message in step S15, and then cluster system 30 transmits a message in step S16, the order of steps S15 and S16 may be reversed. Alternatively, steps S15 and S16 may be executed at substantially the same timing.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおける監視状態を更新する（Ｓ１７）。クラスタシステム１０は、クラスタシステム２０及びクラスタシステム３０から受信した監視結果を監視マップの監視状態に反映する。クラスタシステム２０は、ステップＳ１４における監視結果及びクラスタシステム３０から受信した監視結果を監視マップの監視状態に反映する。クラスタシステム３０は、ステップＳ１４における監視結果及びクラスタシステム２０から受信した監視結果を監視マップの監視状態に反映する。 Next, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status in the monitoring map (S17). Cluster system 10 reflects the monitoring results received from cluster system 20 and cluster system 30 in the monitoring status of the monitoring map. Cluster system 20 reflects its own monitoring result from step S14 and the monitoring result received from cluster system 30 in the monitoring status of the monitoring map. Cluster system 30 reflects its own monitoring result from step S14 and the monitoring result received from cluster system 20 in the monitoring status of the monitoring map.

具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ１７の列に示されるように、ステップＳ１２における監視状態と同様の状態の監視マップを有する。 Specifically, cluster system 10, cluster system 20, and cluster system 30 have monitoring maps in a state similar to the monitoring state in step S12, as shown in the step S17 column of the monitoring map in Figure 8.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、回復動作を実行するクラスタシステムを決定し、監視マップの実行状態を更新する（Ｓ１８）。クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、異常状態を検出したクラスタシステムの中から回復動作を実行するクラスタシステムを決定する。クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、複数のクラスタシステムが共有サーバ装置４０の異常状態を検出した場合、実行順序に従って回復動作を実行するクラスタシステムを決定する。図６においては、共有サーバ装置４０の異常状態を検出したのはクラスタシステム１０のみである。そのため、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、回復動作を実行するクラスタシステムをクラスタシステム１０として、監視マップの実行状態を更新する。 Next, cluster system 10, cluster system 20, and cluster system 30 determine the cluster system that will execute the recovery operation and update the execution state of the monitoring map (S18). Cluster system 10, cluster system 20, and cluster system 30 determine the cluster system that will execute the recovery operation from among the cluster systems that detected the abnormal state. When multiple cluster systems detect an abnormal state of the shared server device 40, cluster system 10, cluster system 20, and cluster system 30 determine the cluster system that will execute the recovery operation according to the execution order. In FIG. 6, only cluster system 10 detected the abnormal state of the shared server device 40. Therefore, cluster system 10, cluster system 20, and cluster system 30 update the execution state of the monitoring map by specifying cluster system 10 as the cluster system that will execute the recovery operation.

具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ１８の列に示されるように、クラスタシステム１０の実行状態を１に設定する。つまり、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、クラスタシステム１０が回復動作の実行準備中であるとする。 Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 10 to 1, as shown in the column for step S18 in the monitoring map of FIG. 8. In other words, cluster system 10, cluster system 20, and cluster system 30 consider that cluster system 10 is preparing to execute a recovery operation.

次に、クラスタシステム２０は、回復動作を実行しないため、共有サーバ装置４０の監視を一時的に停止することを示すメッセージをクラスタシステム１０及びクラスタシステム３０へ送信する（Ｓ１９）。また、クラスタシステム３０も、共有サーバ装置４０の監視を一時的に停止することを示すメッセージをクラスタシステム１０及びクラスタシステム２０へ送信する（Ｓ２０）。ステップＳ１９及びＳ２０は、実行される順番が逆であってもよく、実質的に同一のタイミングに行われてもよい。回復動作が実行された場合、共有サーバ装置４０の再起動が行われることがある。この場合、回復動作を実行しないクラスタシステムが共有サーバ装置４０の監視を行っていた場合、共有サーバ装置４０に異常状態が発生していると認識し、共有サーバ装置４０の異常状態を検出することがある。そのため、回復動作を実行しないクラスタシステムは、監視を一時的に停止することによって、回復動作中の共有サーバ装置４０に関する異常状態の検出を回避することができる。 Next, the cluster system 20 transmits a message to the cluster systems 10 and 30 indicating that the monitoring of the shared server device 40 will be temporarily stopped since the recovery operation will not be performed (S19). The cluster system 30 also transmits a message to the cluster systems 10 and 20 indicating that the monitoring of the shared server device 40 will be temporarily stopped (S20). Steps S19 and S20 may be performed in the opposite order, or may be performed at substantially the same timing. When the recovery operation is performed, the shared server device 40 may be restarted. In this case, if a cluster system that does not perform a recovery operation is monitoring the shared server device 40, it may recognize that an abnormal state has occurred in the shared server device 40 and detect the abnormal state of the shared server device 40. Therefore, the cluster system that does not perform a recovery operation can avoid detecting an abnormal state regarding the shared server device 40 during the recovery operation by temporarily stopping monitoring.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム２０及びクラスタシステム３０の監視状態を更新する（Ｓ２１）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ２１の列に示されるように、クラスタシステム２０及びクラスタシステム３０の監視状態を１に設定する。 Next, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status of cluster system 20 and cluster system 30 in the monitoring map (S21). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status of cluster system 20 and cluster system 30 to 1, as shown in the column of step S21 in the monitoring map in FIG. 8.
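
Steps S19 to S21 can be sketched as follows: every cluster system other than the selected executor is marked as having paused monitoring. The helper name `pause_non_executors` is hypothetical, and the sketch only models the map update, not the messages exchanged in S19 and S20.

```python
def pause_non_executors(mon_map: dict[int, dict[str, int]],
                        executor: int) -> list[int]:
    """Set monitoring status 1 (paused) for all cluster systems except the executor."""
    paused = []
    for cid, entry in mon_map.items():
        if cid != executor:
            entry["monitor"] = 1  # value shown for step S21
            paused.append(cid)
    return sorted(paused)


state = {
    10: {"order": 1, "monitor": 2, "execution": 1},
    20: {"order": 2, "monitor": 0, "execution": 0},
    30: {"order": 3, "monitor": 0, "execution": 0},
}
paused = pause_non_executors(state, executor=10)
```

Pausing first means a restart of the shared server device during recovery is not reported as a fresh abnormality by the other cluster systems.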

次に、クラスタシステム１０は、クラスタシステム２０及びクラスタシステム３０へ、回復動作を開始することを示すメッセージを送信する（Ｓ２２）。 Next, cluster system 10 sends a message to cluster system 20 and cluster system 30 indicating that recovery operations are to begin (S22).

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム１０の実行状態を実行中に更新する（Ｓ２３）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ２３の列に示されるように、クラスタシステム１０の実行状態を２に設定する。また、クラスタシステム１０は、ステップＳ２２において回復動作を開始することを示すメッセージを送信する前に、クラスタシステム１０の実行状態を２に設定してもよい。 Next, cluster system 10, cluster system 20, and cluster system 30 update the execution state of cluster system 10 in the monitoring map to running (S23). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 10 to 2, as shown in the column of step S23 of the monitoring map in FIG. 8. Cluster system 10 may also set the execution state of cluster system 10 to 2 before sending a message indicating the start of recovery operation in step S22.

次に、クラスタシステム１０は、共有サーバ装置４０に対する回復動作を実行する（Ｓ２４）。例えば、クラスタシステム１０は、共有サーバ装置４０が有する一部のアプリケーションを再起動してもよく、共有サーバ装置４０を再起動してもよい。次に、クラスタシステム１０は、共有サーバ装置４０に対する回復動作を完了する（Ｓ２５）。 Then, the cluster system 10 executes a recovery operation on the shared server device 40 (S24). For example, the cluster system 10 may restart some of the applications held by the shared server device 40, or may restart the shared server device 40. Next, the cluster system 10 completes the recovery operation on the shared server device 40 (S25).

次に、クラスタシステム１０は、共有サーバ装置４０に対する回復動作が完了したことを示すメッセージを、クラスタシステム２０及びクラスタシステム３０へ送信する（Ｓ２６）。 Next, cluster system 10 sends a message to cluster system 20 and cluster system 30 indicating that the recovery operation for shared server device 40 has been completed (S26).

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム１０の実行状態を実行済に更新する（Ｓ２７）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ２７の列に示されるように、クラスタシステム１０の実行状態を３に設定する。また、クラスタシステム１０は、ステップＳ２６において回復動作が完了したことを示すメッセージを送信する前に、クラスタシステム１０の実行状態を３に設定してもよい。 Next, cluster system 10, cluster system 20, and cluster system 30 update the execution state of cluster system 10 in the monitoring map to "executed" (S27). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 10 to 3, as shown in the column for step S27 in the monitoring map of FIG. 8. Cluster system 10 may also set its own execution state to 3 before sending the message indicating that the recovery operation has been completed in step S26.
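
The execution-state updates driven by the messages of steps S22 and S26 can be sketched as a small transition table. The message names `recovery_started` and `recovery_completed` and the function `apply_message` are hypothetical; the disclosure describes the messages but does not name them, and the strict check on the previous state is an added safeguard, not something the description requires.

```python
# message -> (expected current execution state, new execution state)
TRANSITIONS = {
    "recovery_started": (1, 2),    # S22 -> S23: preparing -> executing
    "recovery_completed": (2, 3),  # S26 -> S27: executing -> executed
}


def apply_message(mon_map: dict[int, dict[str, int]],
                  sender: int, message: str) -> None:
    """Update the sender's execution state in this replica of the monitoring map."""
    expected, new = TRANSITIONS[message]
    current = mon_map[sender]["execution"]
    if current != expected:
        raise ValueError(f"unexpected state {current} for {message}")
    mon_map[sender]["execution"] = new


replica = {10: {"order": 1, "monitor": 2, "execution": 1}}
apply_message(replica, 10, "recovery_started")
apply_message(replica, 10, "recovery_completed")
```

The sending cluster system can apply the same function to its own replica either before or after it sends the message, which matches the ordering freedom noted for steps S23 and S27.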

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０の監視を実行する（Ｓ２８）。クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０が正常に動作していると判定すると、監視マップの監視状態及び実行状態をリセットする（Ｓ２９）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図８の監視マップのステップＳ２９の列に示されるように、監視状態及び実行状態に０を設定する。 Next, cluster system 10, cluster system 20, and cluster system 30 execute monitoring of shared server device 40 (S28). When cluster system 10, cluster system 20, and cluster system 30 determine that shared server device 40 is operating normally, they reset the monitoring status and execution status of the monitoring map (S29). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status and execution status to 0, as shown in the column of step S29 of the monitoring map in FIG. 8.
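
Steps S28 and S29 can be sketched as a conditional reset. The function name `reset_if_normal` and the shape of `results` (cluster id mapped to `True` when that cluster system still sees an abnormality) are assumptions made for illustration.

```python
def reset_if_normal(mon_map: dict[int, dict[str, int]],
                    results: dict[int, bool]) -> bool:
    """Reset monitoring status and execution state to 0 when no abnormality remains.

    results[cid] is True when cluster system cid detected an abnormal state in S28.
    """
    if any(results.values()):
        return False
    for entry in mon_map.values():
        entry["monitor"] = 0
        entry["execution"] = 0
    return True


after_s27 = {
    10: {"order": 1, "monitor": 2, "execution": 3},
    20: {"order": 2, "monitor": 1, "execution": 0},
    30: {"order": 3, "monitor": 1, "execution": 0},
}
was_reset = reset_if_normal(after_s27, {10: False, 20: False, 30: False})
```

The execution order is deliberately left untouched by the reset, so the next abnormality is handled with the same priorities.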

続いて、クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出した場合における回復動作の実行処理の流れについて説明する。例えば、クラスタシステム１０が先に共有サーバ装置４０の異常状態を検出し、その後、クラスタシステム２０が共有サーバ装置４０の異常状態を検出する場合について説明する。 Next, the flow of the recovery operation execution process when cluster system 10 and cluster system 20 both detect an abnormal state of shared server device 40 will be described. As an example, a case will be described in which cluster system 10 first detects an abnormal state of shared server device 40, and then cluster system 20 detects the abnormal state of shared server device 40.

クラスタシステム１０及びクラスタシステム２０が共有サーバ装置４０の異常状態を検出した場合の回復動作の実行処理の流れは、図６及び図７と同様である。ここでは、クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出した場合における、監視マップに設定される値の遷移について、クラスタシステム１０が異常状態を検出した場合との差異を説明する。 The flow of recovery operation execution processing when cluster system 10 and cluster system 20 detect an abnormal state of shared server device 40 is the same as in Figures 6 and 7. Here, we will explain the difference in the transition of values set in the monitoring map when cluster system 10 and cluster system 20 detect an abnormal state of shared server device 40 compared to when cluster system 10 detects an abnormal state.

クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出した場合における回復動作の実行処理の流れについて、図６のステップＳ１１からＳ１３までは、クラスタシステム１０のみが異常状態を検出した場合と同様である。 In the flow of the recovery operation execution process when cluster system 10 and cluster system 20 detect an abnormal state of shared server device 40, steps S11 to S13 in FIG. 6 are the same as when only cluster system 10 detects an abnormal state.

クラスタシステム２０は、図６のステップＳ１４において共有サーバ装置４０の異常状態を検出する。さらに、クラスタシステム２０は、ステップＳ１５において、共有サーバ装置４０の異常状態を検出したことを示すメッセージをクラスタシステム１０及びクラスタシステム３０へ送信する。 The cluster system 20 detects an abnormal state of the shared server device 40 in step S14 of FIG. 6. Furthermore, in step S15, the cluster system 20 transmits a message to the cluster system 10 and the cluster system 30 indicating that an abnormal state of the shared server device 40 has been detected.

この場合、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図９のステップＳ１７の列に示されるように、クラスタシステム１０及びクラスタシステム２０の監視状態を２に設定する。 In this case, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status of cluster system 10 and cluster system 20 to 2, as shown in the column for step S17 in FIG. 9.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、ステップＳ１８において、共有サーバ装置４０に対する回復動作を実行するクラスタシステムを決定する。ステップＳ１７の時点において、共有サーバ装置４０の異常状態を検出したクラスタシステムは、クラスタシステム１０及びクラスタシステム２０である。また、クラスタシステム１０は、実行順序に１が設定されているため、実行順序の優先度は、クラスタシステム２０よりも高い。そのため、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０に対する回復動作を実行するクラスタシステムとしてクラスタシステム１０の監視マップの実行状態を更新する。 Next, in step S18, cluster system 10, cluster system 20, and cluster system 30 determine the cluster system that will perform recovery operations on shared server device 40. At the time of step S17, the cluster systems that have detected the abnormal state of shared server device 40 are cluster system 10 and cluster system 20. Also, because cluster system 10 has an execution order set to 1, its execution order priority is higher than that of cluster system 20. Therefore, cluster system 10, cluster system 20, and cluster system 30 update the execution status of the monitoring map for cluster system 10 as the cluster system that will perform recovery operations on shared server device 40.

具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図９の監視マップのステップＳ１８の列に示されるように、クラスタシステム１０の実行状態を１に設定する。つまり、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、クラスタシステム１０が回復動作の実行準備中であるとする。 Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 10 to 1, as shown in the column for step S18 in the monitoring map of FIG. 9. In other words, cluster system 10, cluster system 20, and cluster system 30 consider that cluster system 10 is preparing to execute a recovery operation.

ステップＳ１９以降については、クラスタシステム１０のみが異常状態を検出した場合のステップＳ１９以降の処理と同様であるため、詳細な説明を省略する。 Steps S19 and onwards are the same as the processing from step S19 onwards when only the cluster system 10 detects an abnormal state, so a detailed explanation is omitted.

続いて、クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出し、さらに、回復動作において共有サーバ装置４０が正常状態へ遷移しなかった場合における回復動作の実行処理の流れについて説明する。この場合、図６及び図７のステップＳ２８までの処理は、クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出した場合の処理と同様であるため詳細な説明を省略する。以下に、図１０及び図１１を用いて、ステップＳ２８以降の処理について説明する。 Next, the flow of the recovery operation execution process will be described when the cluster systems 10 and 20 detect an abnormal state of the shared server device 40 and the shared server device 40 does not transition to a normal state during the recovery operation. In this case, the process up to step S28 in Figures 6 and 7 is similar to the process when the cluster systems 10 and 20 detect an abnormal state of the shared server device 40, so a detailed description will be omitted. The process from step S28 onwards will be described below with reference to Figures 10 and 11.

図１０は、図７のステップＳ２８以降の処理を示している。クラスタシステム１０及びクラスタシステム２０は、ステップＳ２８において共有サーバ装置４０の監視を実行すると、共有サーバ装置４０の異常状態を検出する（Ｓ３１）。つまり、クラスタシステム１０が共有サーバ装置４０に対して回復動作を実行したが、共有サーバ装置４０の異常状態は回復していない。 Figure 10 shows the process after step S28 in Figure 7. When the cluster system 10 and the cluster system 20 execute monitoring of the shared server device 40 in step S28, they detect an abnormal state of the shared server device 40 (S31). In other words, the cluster system 10 executes a recovery operation on the shared server device 40, but the abnormal state of the shared server device 40 has not been recovered.

次に、クラスタシステム１０は、クラスタシステム２０及びクラスタシステム３０へ共有サーバ装置４０が異常状態であることを検出したことを示すメッセージを送信する（Ｓ３２）。さらに、クラスタシステム２０も、クラスタシステム１０及びクラスタシステム３０へ共有サーバ装置４０が異常状態であることを検出したことを示すメッセージを送信する（Ｓ３３）。また、異常状態を検出していないクラスタシステム３０も、異常状態を検出していないことを示す監視結果をクラスタシステム１０及びクラスタシステム２０へ送信してもよい。 Next, cluster system 10 transmits a message to cluster system 20 and cluster system 30 indicating that it has detected that the shared server device 40 is in an abnormal state (S32). Furthermore, cluster system 20 also transmits a message to cluster system 10 and cluster system 30 indicating that it has detected that the shared server device 40 is in an abnormal state (S33). Furthermore, cluster system 30, which has not detected an abnormal state, may also transmit a monitoring result indicating that it has not detected an abnormal state to cluster system 10 and cluster system 20.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム１０及びクラスタシステム２０の監視状態を更新する（Ｓ３４）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図９のステップＳ２７の列に示されている監視マップの状態から、図１２のステップＳ３４の列に示されている監視マップの状態へ更新する。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２におけるクラスタシステム１０及びクラスタシステム２０の監視状態を２に更新する。 Next, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status of cluster system 10 and cluster system 20 in the monitoring map (S34). Specifically, cluster system 10, cluster system 20, and cluster system 30 update the monitoring map status shown in the column of step S27 in FIG. 9 to the monitoring map status shown in the column of step S34 in FIG. 12. Specifically, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status of cluster system 10 and cluster system 20 in FIG. 12 to 2.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、回復動作を実行するクラスタシステムを決定し、監視マップの実行状態を更新する（Ｓ３５）。ステップＳ３１において、クラスタシステム１０及びクラスタシステム２０が、共有サーバ装置４０の異常状態を検出している。また、図１２のステップＳ３４の列における実行状態には、クラスタシステム１０に３が設定されており、クラスタシステム１０における回復動作が実行済であることが示されている。そのため、ステップＳ３５においては、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、実行順序が２に設定されているクラスタシステム２０を、回復動作を実行するクラスタシステムとする。 Next, cluster system 10, cluster system 20, and cluster system 30 determine the cluster system that will execute the recovery operation, and update the execution status of the monitoring map (S35). In step S31, cluster system 10 and cluster system 20 detect an abnormal state of shared server device 40. In addition, in the execution status in the column of step S34 in FIG. 12, 3 is set for cluster system 10, indicating that the recovery operation in cluster system 10 has been executed. Therefore, in step S35, cluster system 10, cluster system 20, and cluster system 30 determine that cluster system 20, whose execution order is set to 2, is the cluster system that will execute the recovery operation.

具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２のステップＳ３５の列におけるクラスタシステム２０の実行状態を１に更新する。 Specifically, cluster system 10, cluster system 20, and cluster system 30 update the execution status of cluster system 20 to 1 in the column of step S35 in FIG. 12.

次に、クラスタシステム１０は、回復動作を実行しないため、共有サーバ装置４０の監視を一時的に停止することを示すメッセージをクラスタシステム２０及びクラスタシステム３０へ送信する（Ｓ３６）。また、クラスタシステム３０も、共有サーバ装置４０の監視を一時的に停止することを示すメッセージをクラスタシステム１０及びクラスタシステム２０へ送信する（Ｓ３７）。ステップＳ３６及びＳ３７は、実行される順番が逆であってもよく、実質的に同一のタイミングに行われてもよい。 Next, cluster system 10 transmits a message to cluster system 20 and cluster system 30 indicating that it will temporarily stop monitoring shared server device 40 since it will not perform a recovery operation (S36). Cluster system 30 also transmits a message to cluster system 10 and cluster system 20 indicating that it will temporarily stop monitoring shared server device 40 (S37). Steps S36 and S37 may be performed in the opposite order, or may be performed at substantially the same time.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム１０及びクラスタシステム３０の監視状態を更新する（Ｓ３８）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２の監視マップのステップＳ３８の列に示されるように、クラスタシステム１０及びクラスタシステム３０の監視状態を１に設定する。 Next, cluster system 10, cluster system 20, and cluster system 30 update the monitoring status of cluster system 10 and cluster system 30 in the monitoring map (S38). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status of cluster system 10 and cluster system 30 to 1, as shown in the column of step S38 in the monitoring map of FIG. 12.

次に、クラスタシステム２０は、クラスタシステム１０及びクラスタシステム３０へ、回復動作を開始することを示すメッセージを送信する（Ｓ３９）。 Next, cluster system 20 sends a message to cluster system 10 and cluster system 30 indicating that recovery operations are to begin (S39).

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム２０の実行状態を実行中に更新する（Ｓ４０）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２の監視マップのステップＳ４０の列に示されるように、クラスタシステム２０の実行状態を２に設定する。また、クラスタシステム２０は、ステップＳ３９において回復動作を開始することを示すメッセージを送信する前に、クラスタシステム２０の実行状態を２に設定してもよい。 Next, cluster system 10, cluster system 20, and cluster system 30 update the execution state of cluster system 20 in the monitoring map to "executing" (S40). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 20 to 2, as shown in the column of step S40 in the monitoring map of FIG. 12. Cluster system 20 may also set its own execution state to 2 before sending the message indicating the start of the recovery operation in step S39.

次に、クラスタシステム２０は、共有サーバ装置４０に対する回復動作を実行する（Ｓ４１）。次に、クラスタシステム２０は、共有サーバ装置４０に対する回復動作を完了する（Ｓ４２）。 Next, the cluster system 20 executes recovery operations for the shared server device 40 (S41). Next, the cluster system 20 completes the recovery operations for the shared server device 40 (S42).

次に、クラスタシステム２０は、共有サーバ装置４０に対する回復動作が完了したことを示すメッセージを、クラスタシステム１０及びクラスタシステム３０へ送信する（Ｓ４３）。 Next, cluster system 20 sends a message to cluster system 10 and cluster system 30 indicating that the recovery operation for shared server device 40 has been completed (S43).

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップにおけるクラスタシステム２０の実行状態を実行済に更新する（Ｓ４４）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２の監視マップのステップＳ４４の列に示されるように、クラスタシステム２０の実行状態を３に設定する。また、クラスタシステム２０は、ステップＳ４３において回復動作が完了したことを示すメッセージを送信する前に、クラスタシステム２０の実行状態を３に設定してもよい。 Next, cluster system 10, cluster system 20, and cluster system 30 update the execution state of cluster system 20 in the monitoring map to "executed" (S44). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the execution state of cluster system 20 to 3, as shown in the column of step S44 of the monitoring map in FIG. 12. Cluster system 20 may also set the execution state of cluster system 20 to 3 before sending a message indicating that the recovery operation has been completed in step S43.

次に、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０の監視を実行する（Ｓ４５）。クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０が正常に動作していると判定すると、監視マップの監視状態及び実行状態をリセットする（Ｓ４６）。具体的には、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、図１２の監視マップのステップＳ４６の列に示されるように、監視状態及び実行状態に０を設定する。 Next, cluster system 10, cluster system 20, and cluster system 30 execute monitoring of shared server device 40 (S45). When cluster system 10, cluster system 20, and cluster system 30 determine that shared server device 40 is operating normally, they reset the monitoring status and execution status of the monitoring map (S46). Specifically, cluster system 10, cluster system 20, and cluster system 30 set the monitoring status and execution status to 0, as shown in the column of step S46 of the monitoring map in FIG. 12.

以上説明したように、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０が保有する監視マップは、同一となる。また、監視マップには、回復動作を実行する順序が定められている。そのため、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、監視マップを用いることによって、回復動作を実行するクラスタシステムを一意に決定することができる。これより、クラスタシステム１０、クラスタシステム２０、及びクラスタシステム３０は、共有サーバ装置４０に対して重複した回復動作を実行することがなく、共有サーバ装置４０に対して適切に回復動作を実行することができる。 As described above, the monitoring map held by cluster system 10, cluster system 20, and cluster system 30 is the same. Furthermore, the monitoring map defines the order in which recovery operations are performed. Therefore, by using the monitoring map, cluster system 10, cluster system 20, and cluster system 30 can uniquely determine the cluster system that will perform the recovery operation. As a result, cluster system 10, cluster system 20, and cluster system 30 can appropriately perform recovery operations on shared server device 40 without performing duplicate recovery operations on shared server device 40.

さらに、回復動作を実行しないクラスタシステムは、一時的に共有サーバ装置４０の監視を停止する。これにより、回復動作を実行しないクラスタシステムは、回復動作を実行中のサーバ装置を異常状態であると検出することを回避することができる。 Furthermore, a cluster system that does not perform a recovery operation temporarily stops monitoring the shared server device 40. This allows the cluster system that does not perform a recovery operation to avoid detecting a server device that is performing a recovery operation as being in an abnormal state.

また、実施の形態２にかかる監視システムにおいては、それぞれのクラスタシステムが監視マップを有することによって、上位サーバ装置もしくはリーダーとなるサーバ装置は不要である。これにより、一般的な分散処理において実行されるリーダーを決定するまでのシーケンス等を排除することが可能となり、上位サーバ装置等を設置するためのコストを低減することができる。 In addition, in the monitoring system according to the second embodiment, since each cluster system has a monitoring map, there is no need for a host server device or a leader server device. This makes it possible to eliminate the sequence leading up to the leader determination that is executed in general distributed processing, and reduces the cost of installing a host server device, etc.

図１３は、１台のコンピュータ装置として動作するクラスタシステム１０の構成例を示すブロック図である。図１３を参照すると、クラスタシステム１０は、ネットワークインタフェース１２０１、プロセッサ１２０２、及びメモリ１２０３を含む。ネットワークインタフェース１２０１は、ネットワークノード（e.g., eNB、MME、P-GW）と通信するために使用されてもよい。ネットワークインタフェース１２０１は、例えば、IEEE 802.3 seriesに準拠したネットワークインタフェースカード（NIC）を含んでもよい。ここで、eNBはevolved Node B、MMEはMobility Management Entity、P-GWはPacket Data Network Gatewayを表す。IEEEは、Institute of Electrical and Electronics Engineersを表す。 Fig. 13 is a block diagram showing an example of the configuration of a cluster system 10 that operates as a single computer device. Referring to Fig. 13, the cluster system 10 includes a network interface 1201, a processor 1202, and a memory 1203. The network interface 1201 may be used to communicate with a network node (e.g., eNB, MME, P-GW). The network interface 1201 may include, for example, a network interface card (NIC) that complies with the IEEE 802.3 series. Here, eNB stands for evolved Node B, MME stands for Mobility Management Entity, and P-GW stands for Packet Data Network Gateway. IEEE stands for Institute of Electrical and Electronics Engineers.

プロセッサ１２０２は、メモリ１２０３からソフトウェア（コンピュータプログラム）を読み出して実行することで、上述の実施形態においてフローチャートを用いて説明されたクラスタシステム１０の処理を行う。プロセッサ１２０２は、例えば、マイクロプロセッサ、MPU、又はCPUであってもよい。プロセッサ１２０２は、複数のプロセッサを含んでもよい。 The processor 1202 reads and executes software (computer programs) from the memory 1203 to perform the processing of the cluster system 10 described using the flowcharts in the above-mentioned embodiment. The processor 1202 may be, for example, a microprocessor, an MPU, or a CPU. The processor 1202 may include multiple processors.

メモリ１２０３は、揮発性メモリ及び不揮発性メモリの組み合わせによって構成される。メモリ１２０３は、プロセッサ１２０２から離れて配置されたストレージを含んでもよい。この場合、プロセッサ１２０２は、図示されていないI/O（Input/Output）インタフェースを介してメモリ１２０３にアクセスしてもよい。 Memory 1203 is composed of a combination of volatile memory and non-volatile memory. Memory 1203 may include storage located away from processor 1202. In this case, processor 1202 may access memory 1203 via an I/O (Input/Output) interface not shown.

図１３の例では、メモリ１２０３は、ソフトウェアモジュール群を格納するために使用される。プロセッサ１２０２は、これらのソフトウェアモジュール群をメモリ１２０３から読み出して実行することで、上述の実施形態において説明されたクラスタシステム１０の処理を行うことができる。 In the example of FIG. 13, the memory 1203 is used to store a group of software modules. The processor 1202 can perform the processing of the cluster system 10 described in the above embodiment by reading and executing the group of software modules from the memory 1203.

図１３を用いて説明したように、上述の実施形態におけるクラスタシステム１０が有するプロセッサの各々は、図面を用いて説明されたアルゴリズムをコンピュータに行わせるための命令群を含む１又は複数のプログラムを実行する。 As explained using FIG. 13, each of the processors in the cluster system 10 in the above-described embodiment executes one or more programs that include a set of instructions for causing a computer to execute the algorithm explained using the drawing.

上述の例において、プログラムは、コンピュータに読み込まれた場合に、実施形態で説明された１又はそれ以上の機能をコンピュータに行わせるための命令群（又はソフトウェアコード）を含む。プログラムは、非一時的なコンピュータ可読媒体又は実体のある記憶媒体に格納されてもよい。限定ではなく例として、コンピュータ可読媒体又は実体のある記憶媒体は、random-access memory（RAM）、read-only memory（ROM）、フラッシュメモリ、solid-state drive（SSD）又はその他のメモリ技術、CD-ROM、digital versatile disc（DVD）、Blu-ray（登録商標）ディスク又はその他の光ディスクストレージ、磁気カセット、磁気テープ、磁気ディスクストレージ又はその他の磁気ストレージデバイスを含む。プログラムは、一時的なコンピュータ可読媒体又は通信媒体上で送信されてもよい。限定ではなく例として、一時的なコンピュータ可読媒体又は通信媒体は、電気的、光学的、音響的、またはその他の形式の伝搬信号を含む。 In the above examples, the program includes instructions (or software code) that, when loaded into a computer, cause the computer to perform one or more functions described in the embodiments. The program may be stored on a non-transitory computer-readable medium or a tangible storage medium. By way of example and not limitation, computer-readable media or tangible storage media include random-access memory (RAM), read-only memory (ROM), flash memory, solid-state drive (SSD) or other memory technology, CD-ROM, digital versatile disc (DVD), Blu-ray® disk or other optical disk storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage device. The program may be transmitted on a transitory computer-readable medium or communication medium. By way of example and not limitation, the transitory computer-readable medium or communication medium includes electrical, optical, acoustic, or other forms of propagated signals.

なお、本開示は上記実施の形態に限られたものではなく、趣旨を逸脱しない範囲で適宜変更することが可能である。 Note that this disclosure is not limited to the above-described embodiment, and can be modified as appropriate without departing from the spirit and scope of the present disclosure.

Some or all of the above embodiments may also be described as in the following appendices, but are not limited to them.
(Appendix 1)
A cluster system comprising:
a management unit that manages a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
a monitoring unit that monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and reflects a monitoring result of the server device received from another cluster system in the monitoring state;
a determination unit that, when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state, and reflects the determination result in the execution state; and
a control unit that determines, according to the managed execution state, whether to execute the recovery operation for the server device.
(Appendix 2)
The cluster system according to Appendix 1, wherein the monitoring unit stops monitoring the server device when the execution state indicates that the other cluster system executes the recovery operation for the server device.
(Appendix 3)
The cluster system according to Appendix 2, wherein the monitoring unit updates the monitoring state of at least one second cluster system that does not execute the recovery operation for the server device to information indicating that monitoring of the server device is stopped.
(Appendix 4)
The cluster system according to any one of Appendices 1 to 3, wherein the determination criterion defines a priority order of the first cluster system that executes the recovery operation.
(Appendix 5)
The cluster system according to any one of Appendices 1 to 4, wherein the determination unit determines, from among at least one third cluster system that has detected that the server device is in an abnormal state among the plurality of cluster systems, the first cluster system that executes the recovery operation for the server device according to the determination criterion.
(Appendix 6)
The cluster system according to any one of Appendices 1 to 5, wherein the recovery operation is a restart of an application provided on the server device or a restart of the server device.
(Appendix 7)
The cluster system according to any one of Appendices 1 to 6, wherein, when the server device is a DNS server device, the monitoring unit determines whether the DNS server device is in a normal state or an abnormal state according to whether address resolution of a virtual host name has succeeded.
(Appendix 8)
A monitoring system comprising:
a plurality of cluster systems; and
a server device managed by the plurality of cluster systems,
wherein each of the cluster systems:
manages a monitoring state of the server device in the plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and reflects a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state, and reflects the determination result in the execution state; and
determines, according to the managed execution state, whether to execute the recovery operation for the server device.
(Appendix 9)
The monitoring system according to Appendix 8, wherein each of the cluster systems stops monitoring the server device when the execution state indicates that the other cluster system executes the recovery operation for the server device.
(Appendix 10)
A monitoring method executed in a cluster system, the method comprising:
managing a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitoring whether the server device is in a normal state or an abnormal state;
reflecting the monitoring result in the monitoring state, and reflecting a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determining the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state;
reflecting the determination result in the execution state; and
determining, according to the managed execution state, whether to execute the recovery operation for the server device.
(Appendix 11)
A program that causes a computer to execute:
managing a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitoring whether the server device is in a normal state or an abnormal state;
reflecting the monitoring result in the monitoring state, and reflecting a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determining the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state;
reflecting the determination result in the execution state; and
determining, according to the managed execution state, whether to execute the recovery operation for the server device.
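
For illustration only, the determination described in the appendices above can be sketched as follows. This is a minimal sketch, not the implementation of the embodiments: the cluster names (`cluster10`, `cluster20`, `cluster30`), the `PRIORITY` table, and all function names are hypothetical, and it assumes that every cluster system already holds the same monitoring state after exchanging monitoring results. Under that assumption, each cluster system applying the same priority-based criterion arrives at the same executor of the recovery operation.

```python
import socket

NORMAL, ABNORMAL, STOPPED = "normal", "abnormal", "stopped"

# Hypothetical shared determination criterion: a lower number means a higher
# priority. Every cluster system is assumed to hold an identical copy.
PRIORITY = {"cluster10": 1, "cluster20": 2, "cluster30": 3}


def dns_is_healthy(virtual_host, resolver=socket.gethostbyname):
    """Treat the DNS server as normal only if the virtual host name resolves."""
    try:
        resolver(virtual_host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def decide_executor(monitoring_state, priority=PRIORITY):
    """Return the cluster that should run recovery, or None if none is abnormal.

    Candidates are limited to clusters that detected the abnormal state; the
    candidate with the highest priority (lowest number) is chosen.
    """
    detectors = [c for c, s in monitoring_state.items() if s == ABNORMAL]
    if not detectors:
        return None
    return min(detectors, key=lambda c: priority[c])


def stop_other_monitors(monitoring_state, executor):
    """Mark every cluster except the executor as having stopped monitoring."""
    return {c: (s if c == executor else STOPPED)
            for c, s in monitoring_state.items()}


def should_recover(own_name, executor):
    """Control step: act only if the execution state names this cluster."""
    return executor == own_name


# The same monitoring state as seen by every cluster system.
state = {"cluster10": NORMAL, "cluster20": ABNORMAL, "cluster30": ABNORMAL}
executor = decide_executor(state)
print(executor)  # cluster20
print(stop_other_monitors(state, executor))
print([should_recover(name, executor) for name in state])  # [False, True, False]
```

In this sketch `cluster10` has the highest priority but did not detect the abnormal state, so it is not a candidate, and `cluster20` is selected. Only the cluster system whose name matches the execution state performs the recovery operation, which avoids duplicated restarts of the shared server device.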

REFERENCE SIGNS LIST
10 Cluster system
11 Management unit
12 Monitoring unit
13 Determination unit
14 Control unit
20 Cluster system
30 Cluster system
40 Shared server device

Claims

1. A cluster system comprising:
a management unit that manages a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
a monitoring unit that monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and reflects a monitoring result of the server device received from another cluster system in the monitoring state;
a determination unit that, when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state, and reflects the determination result in the execution state; and
a control unit that determines, according to the managed execution state, whether to execute the recovery operation for the server device.

2. The cluster system according to claim 1, wherein the monitoring unit stops monitoring the server device when the execution state indicates that the other cluster system executes the recovery operation for the server device.

3. The cluster system according to claim 2, wherein the monitoring unit updates the monitoring state of at least one second cluster system that does not execute the recovery operation for the server device to information indicating that monitoring of the server device is stopped.

4. The cluster system according to claim 1, wherein the determination criterion defines a priority order of the first cluster system that executes the recovery operation.

5. The cluster system according to claim 1, wherein the determination unit determines, from among at least one first cluster system that has detected that the server device is in an abnormal state among the plurality of cluster systems, the first cluster system that executes the recovery operation for the server device according to the determination criterion.

6. The cluster system according to claim 1, wherein the recovery operation is a restart of an application provided on the server device or a restart of the server device.

7. The cluster system according to claim 1, wherein, when the server device is a DNS server device, the monitoring unit determines whether the DNS server device is in a normal state or an abnormal state according to whether address resolution of a virtual host name has succeeded.

8. A monitoring system comprising:
a plurality of cluster systems; and
a server device managed by the plurality of cluster systems,
wherein each of the cluster systems:
manages a monitoring state of the server device in the plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitors whether the server device is in a normal state or an abnormal state, reflects the monitoring result in the monitoring state, and reflects a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state, and reflects the determination result in the execution state; and
determines, according to the managed execution state, whether to execute the recovery operation for the server device.

9. A monitoring method executed in a cluster system, wherein the cluster system:
manages a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitors whether the server device is in a normal state or an abnormal state;
reflects the monitoring result in the monitoring state, and reflects a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determines the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state;
reflects the determination result in the execution state; and
determines, according to the managed execution state, whether to execute the recovery operation for the server device.

10. A program that causes a computer to execute:
managing a monitoring state of a server device in a plurality of cluster systems and an execution state indicating a first cluster system that executes a recovery operation for the server device when the server device is in an abnormal state;
monitoring whether the server device is in a normal state or an abnormal state;
reflecting the monitoring result in the monitoring state, and reflecting a monitoring result of the server device received from another cluster system in the monitoring state;
when a monitoring result in at least one of the plurality of cluster systems indicates an abnormal state, determining the first cluster system that executes the recovery operation for the server device according to a determination criterion identical to the determination criterion used by the other cluster system that manages the monitoring state;
reflecting the determination result in the execution state; and
determining, according to the managed execution state, whether to execute the recovery operation for the server device.