JP6073211B2

JP6073211B2 - Server monitoring method and server monitoring system

Info

Publication number: JP6073211B2
Application number: JP2013249326A
Authority: JP
Inventors: 裕介桐栄; 勲大西; 成剛松崎; 昌男稗田; 陽一郎中川; 正一岡部
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2013-12-02
Filing date: 2013-12-02
Publication date: 2017-02-01
Anticipated expiration: 2033-12-02
Also published as: JP2015106357A

Description

The present invention relates to a server monitoring method and a server monitoring system, and more specifically to a technique for accurately detecting non-response failures in a clustered server system.

Computer systems operated by financial institutions and the like are, given their importance, often mission-critical systems for which casual downtime is not acceptable. Such systems therefore demand a high level of fault tolerance together with quick and accurate failure response. Various techniques have accordingly been proposed for monitoring the occurrence of failures so that the subsequent response can begin promptly.

One such proposal is a system that includes a status database; a status collector that establishes a first connection with a monitored device, receives status items from the monitored device, and stores them in the status database; an interface to a remote device; and a command dispatcher that establishes a third connection with the monitored device and sends commands to it (see Patent Document 1).

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2010-537563

However, in computer systems such as those described above, where the various servers are arranged in an active/standby configuration using an HA (High Availability) cluster or the like, there are many cases in which non-response failures (hangs and slowdowns) cannot be detected. In such cases the switchover to the standby node that would normally follow failure detection is never executed, and the result is often a service outage. The considerable risk of service outages in mission-critical computer systems is a serious problem, so it is necessary to detect non-response failures in the system accurately and to prepare an environment in which recovery work after a failure can be carried out quickly.

An object of the present invention is therefore to provide a technique that enables accurate detection of non-response failures in a clustered server system.

The server monitoring method of the present invention that solves the above problem is characterized in that a first server included in a clustered server system executes: a process of transmitting, to a second server accessed by the first server, a message requesting a predetermined response, and measuring the elapsed time from the transmission of the message; a process of monitoring, for each transmission of the message, a first event in which the elapsed time until the predetermined response is received from the second server in response to the message exceeds a first reference time, and notifying a predetermined terminal that the second server is in a slowdown state when the number of consecutive occurrences of the first event exceeds a predetermined reference count; and a process of monitoring, for each transmission of the message, a second event in which the predetermined response could not be received from the second server within a second reference time, which is equal to or longer than the first reference time, from the transmission of the message, and notifying the predetermined terminal that the second server is in a hang-up state when the second event is detected.

The server monitoring system of the present invention is characterized by comprising an arithmetic device that executes, in a first server included in a clustered server system: a process of transmitting, to a second server accessed by the first server, a message requesting a predetermined response, and measuring the elapsed time from the transmission of the message; a process of monitoring, for each transmission of the message, a first event in which the elapsed time until the predetermined response is received from the second server in response to the message exceeds a first reference time, and notifying a predetermined terminal that the second server is in a slowdown state when the number of consecutive occurrences of the first event exceeds a predetermined reference count; and a process of monitoring, for each transmission of the message, a second event in which the predetermined response could not be received from the second server within a second reference time, which is equal to or longer than the first reference time, from the transmission of the message, and notifying the predetermined terminal that the second server is in a hang-up state when the second event is detected.

According to the present invention, non-response failures in a clustered server system can be detected accurately.

FIG. 1 is a network configuration diagram including the server monitoring system of the present embodiment.
FIG. 2 is a diagram showing a hardware configuration example of the AP server in the present embodiment.
FIG. 3 is a diagram showing a hardware configuration example of the DB server in the present embodiment.
FIG. 4 is a diagram showing a hardware configuration example of the monitoring console in the present embodiment.
FIG. 5 is a diagram showing a configuration example of the health check table of the present embodiment.
FIG. 6 is a flowchart showing an example of the processing procedure of the server monitoring method in the present embodiment.
FIG. 7 is an explanatory diagram showing an example of the monitoring concept in the present embodiment.

--- System configuration ---
An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a network configuration diagram including the server monitoring system 10 of the present embodiment. The server monitoring system 10 shown in FIG. 1 is a computer system that enables accurate detection of non-response failures in a clustered server system 20. As an example, the clustered server system 20 in this embodiment is assumed to be a core system operated by a financial institution, so the server monitoring system 10 of this embodiment monitors that core system for non-response failures. The monitoring target is of course not limited to financial-institution systems; the server monitoring system 10 is applicable to clustered server systems in other industries as well.

The server system 20 illustrated in FIG. 1 is connected through a firewall 21 to an external network 120, to which customer terminals of the financial institution and the like are connected. In the server system 20, processing requests received from customer terminals through the firewall 21 are distributed to the AP servers 100 by a load balancer 300. The cluster configuration of the server system 20 consists of two AP servers 100A and 100B, which are the distribution destinations of the load balancer 300, and one DB server 200, which is associated with the AP servers 100A and 100B and is the target of their data reads and writes; this set is provided in two systems, an active system and a standby system. For reference, FIG. 1 also shows AP servers 100C and 100D as the standby AP servers 100.

The two AP servers 100A and 100B in FIG. 1 correspond to the first server of the present invention. In the following description, they are referred to collectively as the AP server 100 when they are described as the first server without distinction. The DB server 200 corresponds to the second server of the present invention. This embodiment illustrates a configuration in which two AP servers 100A and 100B are associated with one DB server 200, but the configuration is not limited to this; the server monitoring method of this embodiment can be applied without any particular change even when three or more AP servers 100 are associated with one DB server 200.

A monitoring console 400 used by, for example, the financial institution's system administrators is connected to each of the AP servers 100A and 100B, and the results of the server monitoring method of this embodiment are output to this monitoring console 400.

Next, examples of the hardware configuration of each device constituting the server monitoring system 10 of this embodiment are described. FIG. 2 shows a hardware configuration example of the AP server 100 in this embodiment. The AP server 100 includes: a storage device 101 composed of a suitable non-volatile storage device such as a hard disk drive; a memory 103 composed of a volatile storage device such as RAM; an arithmetic device 104 such as a CPU, which executes a program 102 held in the storage device 101 (for example by reading it into the memory 103) to perform overall control of the device itself as well as various determination, computation, and control processes; and a communication device 105, which connects to an internal network 130 and handles communication with other devices (the load balancer 300 and the DB server 200).

In addition to an OS that any information processing device naturally includes (not shown; the same applies hereinafter), the storage device 101 stores at least the program 102, which implements the functions required of the AP server 100 in the server monitoring system 10 of this embodiment. In this embodiment, the health check shell 110 is implemented when the AP server 100 executes the program 102. The health check shell 110 is a shell that resides on each of the AP servers 100A and 100B and realizes the main functions of the server monitoring system 10. It issues a very simple SQL statement that requests only a simple value corresponding to the AP server 100 on which it resides, and passes the response obtained from the DB server 200 to a predetermined monitoring application 150 on the AP server 100 (an existing application that comprehensively collects application and system information using SNMP-based network monitoring information). Through the monitoring application 150, information about non-response failures is presented on the monitoring console 400.

Next, a hardware configuration example of the DB server 200 is described. FIG. 3 shows a hardware configuration example of the DB server 200 in this embodiment. Like the AP server 100, the DB server 200 has the configuration of a general server device, and in addition holds in its storage device 201 a health check table 225 that is shared among the DB servers 200. The health check table 225 associates a predetermined value with each AP server 100. Its details are described later.

Next, a hardware configuration example of the monitoring console 400 is described. FIG. 4 shows a hardware configuration example of the monitoring console 400 in this embodiment. In addition to the hardware configuration of a general information processing device, as found in the AP server 100 and the DB server 200, the monitoring console 400 includes an input device 405 that accepts key input and voice input from predetermined users such as the financial institution's system administrators, and an output device 406 such as a display that shows notifications from the health check shell 110 of the AP server 100.

Next, the functions of the server monitoring system 10 of this embodiment are described. As noted above, the functions described below are implemented by executing programs provided in the appropriate information processing devices that make up the server monitoring system 10. In this embodiment, the health check shell 110 on the AP server 100 takes the lead, and the DB server 200 and the monitoring console 400 cooperate with it. If the server monitoring system 10 is taken to consist of the AP server 100 alone, the functions of the health check shell 110 are the functions of the server monitoring system 10. If the server monitoring system 10 is taken to include the DB server 200 and the monitoring console 400 in addition to the AP server 100, the functions of the health check shell 110, the DB server 200, and the monitoring console 400 together are the functions of the server monitoring system 10.

The health check shell 110 residing on each AP server 100 has a function of transmitting a message requesting a predetermined response, for example at 60-second intervals, to the DB server 200 accessed by its host AP server 100 (that is, simply a DB server 200 with which the host AP server 100 can communicate), and measuring the elapsed time from the transmission of each message.
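The elapsed-time measurement described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `timed_probe`, `send_probe`, and the injectable `clock` parameter are hypothetical names introduced here, and `send_probe` stands in for whatever call actually sends the message to the DB server 200 and waits for the reply.

```python
import time
from typing import Any, Callable, Tuple


def timed_probe(
    send_probe: Callable[[], Any],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Any, float]:
    """Send one health check message and measure the elapsed time.

    send_probe is a hypothetical callable that sends the message and
    blocks until the response arrives. A monotonic clock is used so
    that wall-clock adjustments do not distort the measurement.
    """
    start = clock()
    response = send_probe()
    elapsed_sec = clock() - start
    return response, elapsed_sec
```

In a resident process this function would be called once per interval (60 seconds in the example above). The sketch does not enforce an upper bound on the wait; a real probe would also need a timeout so that a hung DB server cannot block the caller indefinitely.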

Each health check shell 110 also has a function of monitoring, for each transmission of the message, a first event in which the elapsed time until the predetermined response is received from the DB server 200 exceeds, for example, 2 seconds (the first reference time), and notifying the monitoring console 400 that the DB server 200 is in a slowdown state when the number of consecutive occurrences of the first event exceeds, for example, two (the predetermined reference count).

Each health check shell 110 also has a function of monitoring, for each transmission of the message, a second event in which the predetermined response could not be received from the DB server 200 within, for example, 35 seconds (the second reference time) of the transmission of the message, and notifying the monitoring console 400 that the DB server 200 is in a hang-up state when the second event is detected.
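The two thresholds can be illustrated with a small classifier. This is a sketch under stated assumptions: the 2-second and 35-second values are the examples given in the text, while the names `ProbeResult` and `classify_probe`, and the convention that `None` means "no response arrived", are introduced here for illustration only.

```python
from enum import Enum
from typing import Optional

FIRST_REFERENCE_SEC = 2.0    # example first reference time from the text
SECOND_REFERENCE_SEC = 35.0  # example second reference time from the text


class ProbeResult(Enum):
    NORMAL = "normal"
    FIRST_EVENT = "first_event"    # response arrived, but too slowly
    SECOND_EVENT = "second_event"  # no response within the second reference time


def classify_probe(elapsed_sec: Optional[float]) -> ProbeResult:
    """Classify one probe; elapsed_sec is None when no response arrived."""
    if elapsed_sec is None or elapsed_sec > SECOND_REFERENCE_SEC:
        return ProbeResult.SECOND_EVENT
    if elapsed_sec > FIRST_REFERENCE_SEC:
        return ProbeResult.FIRST_EVENT
    return ProbeResult.NORMAL
```

A response at exactly 2 seconds is treated as normal because the text says the elapsed time must exceed the first reference time; likewise a response at exactly 35 seconds still counts as received within the second reference time.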

As the message sent to the DB server 200 at, for example, 60-second intervals, each health check shell 110 transmits a message requesting that the DB server respond with a predetermined value corresponding to the AP server 100 on which the health check shell 110 resides. In a format that follows the communication protocol between the servers, this message preferably contains only the ID (identification information) of that AP server 100. The predetermined value requested in the response is a single-digit number assigned in advance to each of the AP servers 100A and 100B.

In this case, the health check table 225 in the storage device 201 of the DB server 200 defines the correspondence between the ID of each AP server 100A, 100B and the single-digit number that represents that AP server (see FIG. 5). When the DB server 200 receives the above message from a health check shell 110, it looks up the identification information of the AP server 100A or 100B indicated by the message (the host of the health check shell 110) in the health check table 225, identifies the single-digit number corresponding to that AP server 100, and returns it as the response to the health check shell 110 of that AP server 100.
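The table lookup can be sketched with an in-memory SQLite database standing in for the DB server 200. The table name `health_check`, the column names, the AP server IDs `AP100A` and `AP100B`, and the values 1 and 2 are all assumptions made for this illustration; the text specifies only that each AP server ID maps to a single-digit number.

```python
import sqlite3
from typing import Optional


def build_health_check_table(conn: sqlite3.Connection) -> None:
    """Create a stand-in for the health check table 225 (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE health_check ("
        "ap_id TEXT PRIMARY KEY, value INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO health_check (ap_id, value) VALUES (?, ?)",
        [("AP100A", 1), ("AP100B", 2)],  # illustrative IDs and values
    )
    conn.commit()


def lookup_value(conn: sqlite3.Connection, ap_id: str) -> Optional[int]:
    """Return the single-digit value for ap_id, or None if the ID is unknown."""
    row = conn.execute(
        "SELECT value FROM health_check WHERE ap_id = ?", (ap_id,)
    ).fetchone()
    return None if row is None else row[0]
```

Keeping the query to a single-row primary-key lookup matches the intent stated in the text: the probe should be as cheap as possible, so that a slow reply reflects the state of the server rather than the cost of the query.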

The monitoring console 400, which receives the slowdown and hang-up notifications from the health check shells 110 of the AP servers 100, has the following function. When, for a given DB server 200, it has received a notification of a slowdown state or a hang-up state from the health check shells 110 of all the AP servers 100A and 100B associated with that DB server 200, it displays on the output device 406 that a slowdown state or hang-up state has occurred in that DB server 200.

When, for a given DB server 200, the monitoring console 400 has received a notification of a slowdown state or a hang-up state from only some of the AP servers associated with that DB server 200 (the AP server 100A or the AP server 100B), it has a function of displaying on the output device 406 that a slowdown state or hang-up state has occurred in the AP server 100 that did not send the notification.

When, for a given DB server 200, the monitoring console 400 has received a notification of a slowdown state or a hang-up state from all the AP servers 100 associated with that DB server 200, it also has a function of displaying on the output device 406 both that a slowdown state or hang-up state has occurred in the DB server 200 and a message prompting the user to log in to the DB server 200 and check its state.

--- Processing procedure example ---
The actual procedure of the server monitoring method in this embodiment is described below with reference to the drawings. The operations corresponding to the server monitoring method described below are realized by the health check shell 110 on each of the AP servers 100A and 100B included in the server monitoring system 10. The health check shell 110 consists of code for performing the operations described below.

FIG. 6 is a flowchart showing an example of the processing procedure of the server monitoring method in this embodiment, and FIG. 7 is an explanatory diagram showing an example of the monitoring concept. First, the health check shell 110 residing on each AP server 100 transmits a message requesting a predetermined response, for example at 60-second intervals, to the DB server 200 accessed by its host AP server 100 (that is, simply a DB server 200 with which the host AP server 100 can communicate), and measures the elapsed time from the transmission of each message (s100). In a format that follows the communication protocol between the servers, the message contains only the ID (identification information) of the AP server 100 on which the health check shell 110 resides, and requests that only the single-digit number corresponding to that AP server 100 be returned.

Next, each health check shell 110 monitors, for each transmission of the message, a first event in which the elapsed time until the single-digit number corresponding to its host AP server 100 (the predetermined response) is received from the DB server 200 exceeds, for example, 2 seconds, and counts the occurrences (s101). If the count shows that the number of consecutive occurrences of the first event has exceeded, for example, two and reached three (s102: Y), the health check shell 110 notifies the monitoring console 400 that the DB server 200 is in a slowdown state (s103). If the first event has been counted, for example, twice in a row but the DB server 200 then responds within, for example, 2 seconds to the next message, the health check shell 110 clears the count of "2" (held in the memory 103 of the AP server 100 or the like) and starts counting consecutive occurrences of the first event again from zero.
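The consecutive-occurrence counting in steps s101 to s103 can be sketched as a small counter. The class name `SlowdownCounter` and its method are hypothetical; the reference count of 2 is the example from the text. What happens to the count after a notification has been sent is not stated in the text, so this sketch simply keeps counting, which is an assumption.

```python
class SlowdownCounter:
    """Track consecutive first events and report when to notify (s101 to s103)."""

    def __init__(self, reference_count: int = 2) -> None:
        self.reference_count = reference_count  # example value from the text
        self.consecutive = 0

    def record(self, is_first_event: bool) -> bool:
        """Record one probe; return True when a slowdown should be notified."""
        if is_first_event:
            self.consecutive += 1
        else:
            # A timely response clears the count, as described in the text.
            self.consecutive = 0
        # Notify only when the count exceeds the reference count (third in a row).
        return self.consecutive > self.reference_count
```

For example, two slow responses followed by a timely one reset the counter, and the next notification requires three further slow responses in a row.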

The health check shell 110 also monitors, for each transmission of the message, a second event in which the single-digit number could not be received from the DB server 200 within, for example, 35 seconds of the transmission of the message. When it detects the second event (s104: Y), it notifies the monitoring console 400 that the DB server 200 is in a hang-up state (s105). The monitoring concept of this embodiment is as shown in FIG. 7.

Meanwhile, the monitoring console 400 receives the slowdown or hang-up notification from the health check shell 110 of the AP server 100 (s106) and determines whether the notification concerns a slowdown state or a hang-up state (s107). If the notification concerns a slowdown state (s107: slowdown), the monitoring console 400 further determines whether, for the DB server 200 in question, it has received a slowdown notification from the health check shells 110 of all the AP servers 100 associated with that DB server 200, that is, from both the AP server 100A and the AP server 100B (s108).

If the monitoring console 400 determines that it has received a slowdown notification from the health check shells 110 of all the AP servers 100 associated with the DB server 200, that is, from both the AP server 100A and the AP server 100B (s108: all), it displays on the output device 406 that a slowdown state has occurred in that DB server 200 (s109).

If, on the other hand, it determines that it has received a slowdown notification from only some of the AP servers 100 associated with the DB server 200, for example only from the health check shell 110 of the AP server 100A (s108: some), it displays on the output device 406 that a slowdown state has occurred in that AP server 100A (s110).

If the notification received in step s107 concerns a hang-up state (s107: hang-up), the monitoring console 400 further determines whether, for the DB server 200 in question, it has received a hang-up notification from the health check shells 110 of all the AP servers 100 associated with that DB server 200, that is, from both the AP server 100A and the AP server 100B (s111).

If the monitoring console 400 determines that it has received a hang-up notification from only some of the AP servers 100 associated with the DB server 200, for example only from the health check shell 110 of the AP server 100A (s111: some), it displays on the output device 406 that a hang-up state has occurred in that AP server 100A (s112).

If, on the other hand, it determines that it has received a hang-up notification from the health check shells 110 of all the AP servers 100 associated with the DB server 200, that is, from both the AP server 100A and the AP server 100B (s111: all), the monitoring console 400 displays on the output device 406 a message prompting the user to log in to that DB server 200 and check its state (s113), and ends the process.
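The console-side branching in steps s107 to s113 can be sketched as one function. The function name `triage`, the string labels `"slowdown"` and `"hangup"`, the server IDs, and the message wording are all assumptions for illustration; the branching itself follows the procedure described above, in which a notification from every associated AP server points to the DB server and a notification from only some points to the reporting AP server.

```python
from typing import Iterable, Optional


def triage(
    db_id: str,
    associated_aps: Iterable[str],
    reporting_aps: Iterable[str],
    state: str,
) -> Optional[str]:
    """Decide what the console displays for one DB server (s107 to s113).

    state is "slowdown" or "hangup" (hypothetical labels).
    Returns the display message, or None if nothing was reported.
    """
    associated = set(associated_aps)
    reporting = set(reporting_aps) & associated
    if not reporting:
        return None
    label = "slowdown" if state == "slowdown" else "hang-up"
    if reporting == associated:
        # Every associated AP server reported: attribute it to the DB server.
        if state == "hangup":
            return (
                f"{db_id}: hang-up suspected; "
                f"log in to {db_id} and check its state"
            )
        return f"{db_id}: slowdown"
    # Only some AP servers reported: attribute it to those AP servers.
    return f"{', '.join(sorted(reporting))}: {label}"
```

Comparing the set of reporting AP servers with the full set of associated AP servers keeps the logic unchanged when three or more AP servers share one DB server.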

When the monitoring console 400 performs this processing, it can isolate which server in the clustered server system 20 is the source of a non-response failure and present that to the user. Because the user can reliably identify the failure source and the nature of the failure, the subsequent failure response can be carried out accurately and quickly. In addition, when a DB server 200 hangs up, the basic procedure is to check whether login is possible and thereby judge which DB controller in that server is suspected of failing; the console can prompt the user to take this step. The user can therefore recognize concretely what failure response is needed and carry it out quickly.

The best mode for carrying out the present invention has been described specifically above; however, the present invention is not limited to it, and various modifications can be made without departing from its gist.

According to the present embodiment, non-response failures in a clustered server system can be detected accurately.

The description in this specification makes at least the following clear. In the server monitoring system of the present embodiment, the first server may transmit, as the message to the second server, a message requesting a response containing a predetermined value corresponding to that first server. The second server may hold in a storage device a table that associates a predetermined value with the first server and, on receiving the message from the first server, check the identification information of the first server indicated by the message against the table, identify the predetermined value corresponding to the first server, and return that value to the first server as the response.

With this arrangement, monitoring for non-response failures such as slowdowns and hang-ups does not itself contribute to the excessive processing-data volume that often causes such failures. The check requires only an extremely light operation, namely obtaining a single value from a table with a fixed, simple data structure and returning it, so servers can be monitored quickly and accurately.
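The table lookup on the second server can be illustrated with a dictionary. This is a minimal sketch under stated assumptions: the specification does not give the table layout, so the identifiers, the integer values, and the names `HEALTH_CHECK_TABLE` and `handle_health_check` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical contents of the health check table held on the DB server:
# first-server identifier -> predetermined value to send back.
HEALTH_CHECK_TABLE: Dict[str, int] = {"AP100A": 1, "AP100B": 2}


def handle_health_check(sender_id: str) -> Optional[int]:
    """Look up the predetermined value for the sending first server.

    Returns None for an unknown sender (an assumption; the specification
    does not say how unknown senders are handled).
    """
    return HEALTH_CHECK_TABLE.get(sender_id)
```

A single keyed lookup of this kind keeps the health check independent of the size of the business data, which is the low-load property the paragraph above describes.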

Further, suppose that in the clustered server system a plurality of the first servers are associated with each second server. In this case, each first server in the server monitoring system transmits the message to the second server associated with that first server. The second server receives the message from each first server, checks the identification information of the first server indicated by the message against the table, identifies the predetermined value corresponding to that first server, and returns that value to that first server as the response. Each first server notifies a predetermined terminal that the second server associated with it is in a slowdown state when the number of consecutive occurrences of the first event exceeds a predetermined reference number. Each first server also monitors, for each transmission of the message, a second event in which no predetermined response is received from the associated second server within a second reference time, equal to or greater than the first reference time, from the transmission of the message, and notifies the predetermined terminal that the associated second server is in a hang-up state when the second event is detected. The predetermined terminal receives from the first servers notifications that the second server is in the slowdown state or the hang-up state. If, for one second server, the terminal has received such a notification from all the first servers associated with that second server, it may display on an output device that the slowdown state or the hang-up state has occurred in the second server. If, for one second server, the terminal has received such a notification from only some of the first servers associated with that second server, it may display on the output device that the slowdown state or the hang-up state has occurred in the first server that has not transmitted the notification.
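The per-transmission monitoring performed by each first server can be sketched as a small state machine. This is an illustrative sketch only: the class name `HealthCheckMonitor`, the parameter names, and the choice to reset the slow-response counter after a hang-up are assumptions not stated in the specification. The caller is assumed to measure the elapsed time itself and to pass `None` when no response arrived.

```python
from typing import Optional


class HealthCheckMonitor:
    """Classifies each health check result as slowdown, hang-up, or normal."""

    def __init__(self, t1: float, t2: float, reference_count: int) -> None:
        if t2 < t1:
            raise ValueError("second reference time must be >= first reference time")
        self.t1 = t1                      # first reference time (seconds)
        self.t2 = t2                      # second reference time (seconds)
        self.reference_count = reference_count
        self.consecutive_slow = 0         # consecutive first events so far

    def observe(self, elapsed: Optional[float]) -> Optional[str]:
        """elapsed: seconds until the response, or None if none arrived.

        Returns "hangup", "slowdown", or None (nothing to notify).
        """
        if elapsed is None or elapsed > self.t2:
            # Second event: no response within the second reference time.
            self.consecutive_slow = 0     # assumption: hang-up resets the count
            return "hangup"
        if elapsed > self.t1:
            # First event: response arrived, but later than t1.
            self.consecutive_slow += 1
            if self.consecutive_slow > self.reference_count:
                return "slowdown"
            return None
        # Timely response breaks the run of consecutive first events.
        self.consecutive_slow = 0
        return None
```

With `HealthCheckMonitor(1.0, 5.0, 2)`, two slow responses in a row produce no notification, the third produces `"slowdown"`, and a missing response produces `"hangup"` immediately.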

With this arrangement, in a clustered server system, the source of a non-response failure can be isolated to a particular server and presented to the user side (the predetermined terminal). Because the user can reliably identify the failure source and the nature of the failure, the subsequent failure response can be carried out accurately and quickly.

Further, in the above case, if the predetermined terminal has received, for one second server, a hang-up notification from all the first servers associated with that second server, it may display on the output device that the hang-up state has occurred in the second server, together with a message prompting the user to log in to the second server and check its status.

When a DB server (second server) hangs up, the basic procedure is to check whether login is possible and thereby judge which DB controller in that server is suspected of failing; this arrangement can prompt the user to take that step. The user can therefore recognize concretely what failure response is needed and carry it out quickly.

10 Server monitoring system
20 Clustered server system
100 AP server (first server)
101 Storage device
102 Program
103 Memory
104 Arithmetic device
105 Communication device
110 Health check shell
120 Network
200 DB server (second server)
225 Health check table (table)
300 Load balancer
400 Monitoring console (predetermined terminal)

Claims

A server monitoring method characterized in that a first server included in a clustered server system executes:
a process of transmitting, to a second server accessed by the first server, a message requesting a predetermined response, and measuring the elapsed time from the transmission of the message;
a process of monitoring, for each transmission of the message, a first event in which the elapsed time until a predetermined response to the message is received from the second server exceeds a first reference time, and notifying a predetermined terminal that the second server is in a slowdown state when the number of consecutive occurrences of the first event exceeds a predetermined reference number; and
a process of monitoring, for each transmission of the message, a second event in which no predetermined response is received from the second server within a second reference time, equal to or greater than the first reference time, from the transmission of the message, and notifying the predetermined terminal that the second server is in a hang-up state when the second event is detected.

The server monitoring method according to claim 1, wherein:
the first server transmits, as the message to the second server, a message requesting a response containing a predetermined value corresponding to the first server; and
the second server holds in a storage device a table in which a predetermined value is associated with the first server and,
on receiving the message from the first server, executes a process of checking the identification information of the first server indicated by the message against the table, identifying the predetermined value corresponding to the first server, and returning that value to the first server as the response.

The server monitoring method according to claim 2, wherein, in the clustered server system, a plurality of the first servers are associated with each second server, and wherein:
each of the first servers transmits the message to the second server associated with that first server;
the second server receives the message from each of the first servers, checks the identification information of the first server indicated by the message against the table, identifies the predetermined value corresponding to that first server, and returns that value to that first server as the response;
each of the first servers notifies the predetermined terminal that the second server associated with that first server is in a slowdown state when the number of consecutive occurrences of the first event exceeds the predetermined reference number;
each of the first servers monitors, for each transmission of the message, a second event in which no predetermined response is received from the second server associated with that first server within the second reference time, equal to or greater than the first reference time, from the transmission of the message, and notifies the predetermined terminal that the second server associated with that first server is in a hang-up state when the second event is detected;
the predetermined terminal receives from the first servers notifications that the second server is in the slowdown state or the hang-up state;
when, for one second server, the predetermined terminal has received the notification of the slowdown state or the hang-up state from all the first servers associated with that second server, it displays on an output device that the slowdown state or the hang-up state has occurred in the second server; and
when, for one second server, the predetermined terminal has received the notification of the slowdown state or the hang-up state from only some of the first servers associated with that second server, it displays on the output device that the slowdown state or the hang-up state has occurred in the first server that has not transmitted the notification.

The server monitoring method according to claim 3, wherein,
when the predetermined terminal has received, for one second server, the notification of the hang-up state from all the first servers associated with that second server, it displays on the output device that the hang-up state has occurred in the second server, together with a message prompting the user to log in to the second server and check its status.

A server monitoring system characterized in that a first server included in a clustered server system comprises an arithmetic device that executes:
a process of transmitting, to a second server accessed by the first server, a message requesting a predetermined response, and measuring the elapsed time from the transmission of the message;
a process of monitoring, for each transmission of the message, a first event in which the elapsed time until a predetermined response to the message is received from the second server exceeds a first reference time, and notifying a predetermined terminal that the second server is in a slowdown state when the number of consecutive occurrences of the first event exceeds a predetermined reference number; and
a process of monitoring, for each transmission of the message, a second event in which no predetermined response is received from the second server within a second reference time, equal to or greater than the first reference time, from the transmission of the message, and notifying the predetermined terminal that the second server is in a hang-up state when the second event is detected.