JP7585659B2

JP7585659B2 - Monitoring system, monitoring method, program, and fault-tolerant server

Info

Publication number: JP7585659B2
Application number: JP2020141150A
Authority: JP
Inventors: 良行桜井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2024-11-19
Anticipated expiration: 2040-08-24
Also published as: JP2022036778A

Description

This disclosure relates to a monitoring system and the like.

One method for identifying, among the hard disks that make up a RAID (Redundant Arrays of Inexpensive Disks) system, a hard disk that has not yet failed but whose performance has degraded is to measure the response time of each hard disk. With this method, a hard disk whose response time is equal to or greater than a predetermined threshold is determined to have degraded performance. However, a response time may exceed the threshold because of incidental factors such as a temporary load. As a result, a hard disk may be judged abnormal even when nothing is actually wrong with it, leading to unnecessary disk replacement.

Patent Document 1 discloses a method for finding a storage device in a latent-failure state within a storage apparatus by detecting a storage device whose disk load is equal to or below a threshold and whose response time is equal to or above a threshold. The method of Patent Document 1 still determines that a storage device may be faulty when the response time exceeds the threshold because of an incidental factor other than disk load.

After an abnormal tendency is detected in a hard disk, the hard disk may be diagnosed to confirm the abnormal state. One method, for example, repeats the same measurement several times, confirms that only a specific hard disk shows degraded performance in every measurement, and then diagnoses that hard disk as abnormal. Another method compares average response times and diagnoses a hard disk whose average response time exceeds a threshold as abnormal.

Patent Document 2 discloses that when a disk error such as a timeout occurs, the disk is put into a provisionally degraded state and four kinds of disk diagnostic processing are performed.

Patent Document 1: JP 2019-036163 A
Patent Document 2: JP 2002-108573 A

Because hard disk diagnostic processing and business processing run in parallel, a method that diagnoses a hard disk without disconnecting it can degrade the performance of the entire system and interfere with business processing. Furthermore, repeatedly issuing diagnostic input/output to a hard disk whose performance may already be degraded, and measuring its response time, places an additional load on the system. In other words, the more thoroughly a hard disk is diagnosed, the greater the load on the system and the greater the risk of interfering with business processing.

The diagnostic processing of Patent Document 2 is performed in parallel with commands from the host computer, so it must run alongside business processing on the same disk array device. The diagnostic method of Patent Document 2 may therefore affect the performance of business processing.

One objective of this disclosure is to provide a storage device monitoring system and the like that avoid affecting business processing.

The monitoring system according to the present disclosure comprises: a processing means for issuing an IO (Input Output) request to each of the storage devices respectively included in synchronized first and second subsystems; a determination means for determining, based on the difference in response times to the IO requests, an abnormal tendency of the storage device with the longer response time; and a synchronization control means for, when an abnormal tendency is determined, releasing the synchronization of the first and second subsystems and controlling the first and second subsystems so that they can operate independently. The processing means performs, in the subsystem that includes the storage device determined to have an abnormal tendency, a diagnostic process for determining whether or not that storage device has an abnormality.

The monitoring method according to the present disclosure issues an IO (Input Output) request to each of the storage devices respectively included in synchronized first and second subsystems, and determines, based on the difference in response times to the IO requests, an abnormal tendency of the storage device with the longer response time. When an abnormal tendency is determined, the synchronization of the first and second subsystems is released, and the first and second subsystems are controlled so that they can operate independently. In the subsystem that includes the storage device determined to have an abnormal tendency, a diagnostic process is performed to determine whether or not that storage device has an abnormality.

The program according to the present disclosure causes a computer to execute: a process of issuing an IO (Input Output) request to each of the storage devices respectively included in synchronized first and second subsystems; a process of determining, based on the difference in response times to the IO requests, an abnormal tendency of the storage device with the longer response time; a process of, when an abnormal tendency is determined, releasing the synchronization of the first and second subsystems and controlling the first and second subsystems so that they can operate independently; and a diagnostic process of determining, in the subsystem that includes the storage device determined to have an abnormal tendency, whether or not that storage device has an abnormality.

The fault-tolerant server according to the present disclosure comprises a first subsystem having a first storage device, a first CPU module, and a first fault-tolerant (FT) controller, and a second subsystem having a second storage device, a second CPU module, and a second FT controller. The first and second storage devices each respond to an IO (Input Output) request issued by a monitoring system. When an abnormal tendency of the first storage device is determined based on the difference in response times to the IO requests, the first FT controller controls the release of synchronization between the first and second subsystems, an abnormality of the first storage device is diagnosed in the first subsystem, and other processing is performed in the second subsystem.

According to this disclosure, it is possible to provide a storage device monitoring system and the like that avoid affecting business processing.

FIG. 1 is a diagram showing the configuration of the FT server 1 according to the present disclosure.
FIG. 2 is a diagram showing the FT server 1 in a synchronized state.
FIG. 3 is a diagram showing the FT server 1 with synchronization released.
FIG. 4A is a block diagram illustrating the configuration of the monitoring system 50 according to the first embodiment.
FIG. 4B is a flowchart showing an example of the operation of the monitoring system 50 according to the first embodiment.
FIG. 5 is a flowchart showing the abnormal tendency detection process according to the first embodiment.
FIG. 6 is a flowchart showing the operations of the system 10 and the system 20 when synchronization is released.
FIG. 7 is a flowchart showing the diagnostic process according to the first embodiment.
The remaining drawings are, in order: a block diagram illustrating the configuration of the monitoring system 50 according to the second embodiment; a flowchart showing the abnormal tendency detection process according to the second embodiment; a flowchart showing the diagnostic process according to the second embodiment; a flowchart showing the operations of the system 10 and the system 20 when synchronization is released; and a block diagram showing an example of the hardware configuration of the monitoring system 50.

Mission-critical systems must keep providing service even when a failure occurs, so fault-tolerant technology is introduced. A fault-tolerant server (FT server) is known as a computer that employs fault-tolerant technology.

The hardware components that make up an FT server are duplicated. If one of the hardware components fails, the failed part is logically isolated. Because the part that is operating normally continues processing, duplication improves fault tolerance.

The monitoring system according to the present disclosure can be used, for example, to monitor the hard disks of an FT server. FIG. 1 is a diagram showing the configuration of the FT server 1 according to the present disclosure. The FT server 1 has two subsystems, the system 10 and the system 20.

The system 10 has a CPU module 11, an FT controller 12, and an IO module 13. The CPU module 11 has a CPU (Central Processing Unit) and memory. The IO module 13 has IO (Input Output) devices, including a NIC (Network Interface Card) and a hard disk 14.

The system 20 has a CPU module 21, an FT controller 22, and an IO module 23. The CPU module 21 has a CPU and memory. The IO module 23 has IO devices, including a NIC and a hard disk 24.

The CPU module 11 and the CPU module 21 are controlled by the FT controller 12 and the FT controller 22 so that they operate synchronously on the same clock. The IO devices of the IO module 13 and the IO module 23 are made redundant in software: the NICs use teaming, and the hard disks use mirroring.

FIG. 2 is a diagram showing the FT server 1 in a synchronized state. As shown in FIG. 2, when the FT server 1 is operating in a synchronized state, the CPU module 11 and the CPU module 21 are controlled by the FT controller 12 and the FT controller 22 so that they operate synchronously on the same clock. An OS (Operating System) 30 runs on the FT server 1, and business processing is performed on the OS 30. If a hardware failure occurs in one of the systems, the failed part is logically isolated and the system that is operating normally can continue processing.

When the FT server 1 is operating in a synchronized state without any hardware failure, intentionally and temporarily releasing its synchronization makes it possible to split the FT server 1 into two systems that each operate independently.

FIG. 3 is a diagram showing the FT server 1 with synchronization released. As shown in FIG. 3, when synchronization is intentionally released, the CPU module 11 operates in the system 10, and IO processing is performed for each IO device of the IO module 13. An OS 31 runs on the system 10. In the system 20, the CPU module 21 operates, and IO processing is performed for each IO device of the IO module 23. An OS 32 runs on the system 20.

[First embodiment]
[Configuration]
FIG. 4A is a block diagram illustrating the configuration of the monitoring system 50 according to the first embodiment. The monitoring system 50 is connected, for example, to the FT server 1 shown in FIG. 1 by wire or wirelessly, and monitors the hard disk 14 and the hard disk 24 of the FT server 1. The monitoring system 50 includes a processing unit 51, a recording unit (not shown), a determination unit 53, and a synchronization control unit 54. The processing unit 51, the determination unit 53, and the synchronization control unit 54 are embodiments of the processing means, the determination means, and the synchronization control means according to the present disclosure, respectively. The hard disk 14 and the hard disk 24 are each an embodiment of the storage device according to the present disclosure.

The processing unit 51 issues an IO request to each of the storage devices of the two synchronized subsystems (hereinafter this may be written simply as "issuing an IO" or "IO issuance"). Specifically, the processing unit 51 periodically issues IOs to, for example, the hard disk 14 and the hard disk 24. The processing unit 51 issues these IOs in order to measure the response time from sending an IO request until receiving the response to it, and has the recording unit record the measured response times. In addition, the processing unit 51 performs a diagnostic process, in the subsystem that includes the storage device determined to have an abnormal tendency, to diagnose whether or not that storage device has an abnormality.
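The measurement described above can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the function names `measure_read_response_time` and `probe_once`, the use of a small file read as the probe IO, and the in-memory dictionary standing in for the recording unit are all assumptions. A real probe would also need to bypass the operating system's page cache, which this sketch does not attempt.

```python
from __future__ import annotations

import time


def measure_read_response_time(path: str, size: int = 4096) -> float:
    """Return the elapsed seconds for one small read from `path` (hypothetical probe IO)."""
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        f.read(size)
    return time.perf_counter() - start


def probe_once(paths: dict[str, str], log: dict[str, list[float]]) -> None:
    """Issue one probe IO per device and append each response time to `log`.

    `log` plays the role of the recording unit; its layout is an assumption.
    """
    for name, path in paths.items():
        log.setdefault(name, []).append(measure_read_response_time(path))
```

Calling `probe_once` on a timer would correspond to the periodic IO issuance; the scheduling itself is left out here.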

The recording unit is, for example, an auxiliary storage device such as a RAM (Random Access Memory). The recording unit may also be implemented by a storage device such as a hard disk.

The determination unit 53 determines, based on the difference in response times to the IO requests, an abnormal tendency of the storage device with the longer response time. Specifically, the determination unit 53 monitors, for example, the response times of each hard disk recorded in the recording unit and determines whether the difference in response times is equal to or greater than a threshold. If the difference in response times exceeds the threshold, the determination unit 53 sends the synchronization control unit 54 a notification that the synchronization of the FT server 1 needs to be released, together with information on which hard disk has the abnormal tendency. The determination unit 53 also makes the determination that confirms, based on the result of the hard disk diagnostic process, whether or not the hard disk actually has a performance-degradation abnormality.

When an abnormal tendency of one of the storage devices is determined, the synchronization control unit 54 releases the synchronization of the two subsystems and controls the system 10 and the system 20 so that they can operate independently. Specifically, the synchronization control unit 54 controls the synchronization of the FT server 1 based on, for example, the notification from the determination unit 53. When releasing synchronization, it designates the system whose hard disk shows no abnormal tendency (shorter response time) as the business continuation side and the system whose hard disk shows an abnormal tendency (longer response time) as the hard disk diagnosis side. The synchronization of the FT server 1 is controlled via the FT controller 12 and the FT controller 22.

[Operation]
FIG. 4B is a flowchart showing an example of the operation of the monitoring system 50 according to the first embodiment. First, the processing unit 51 issues an IO request to each of the storage devices included in the two synchronized subsystems, the system 10 and the system 20 (step S101). The determination unit 53 determines, based on the difference in response times to the IO requests, an abnormal tendency of the storage device with the longer response time (step S102). When an abnormal tendency of a storage device is determined, the synchronization control unit 54 releases the synchronization between the system 10 and the system 20 and controls the two subsystems so that they can operate independently (step S103). The processing unit 51 diagnoses, in the subsystem that includes the storage device determined to have an abnormal tendency, whether or not that storage device has an abnormality (step S104).

A specific example of the processing flow according to the first embodiment is described with reference to FIGS. 5 to 7.

FIG. 5 is a flowchart showing the process for detecting a hard disk with an abnormal tendency (check C1) according to the first embodiment. First, the processing unit 51 checks whether the FT server 1 is operating in a synchronized state. If it is not (step S1; NO), the processing unit 51 notifies a display unit (not shown) of an error indicating that the FT server 1 is not operating in a synchronized state (step S2). If it is (step S1; YES), the processing unit 51 periodically issues IOs to the hard disk 14 and the hard disk 24 (step S3). The processing unit 51 measures each response time and has the recording unit record it (step S4).

Next, the determination unit 53 monitors the response times. Specifically, the determination unit 53 monitors whether the difference between the response times of the hard disks exceeds a threshold. To watch for an abnormal tendency toward performance degradation, the determination unit 53 may, for example, compute the difference between the response times of the hard disks and monitor whether it exceeds a threshold (M ms, where ms is milliseconds). Alternatively, the determination unit 53 may compute the ratio of the response times of the hard disks and monitor whether it exceeds a threshold (a factor of N, where 1 < N). Other indicators may be used as long as they serve the purpose of monitoring hard disk performance degradation. If the difference in response times exceeds the threshold, the determination unit 53 detects that one of the hard disks has an abnormal tendency toward performance degradation.
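The two indicators named above (an absolute difference of M ms, or a ratio of N with 1 < N) can be sketched as follows. The function name `has_abnormal_tendency` and its parameters are hypothetical, and allowing either or both thresholds to be configured at once is a choice of this sketch, not something the disclosure specifies.

```python
from __future__ import annotations


def has_abnormal_tendency(
    t_a_ms: float,
    t_b_ms: float,
    diff_threshold_ms: float | None = None,  # "M ms" in the text
    ratio_threshold: float | None = None,    # "N times" in the text, must satisfy 1 < N
) -> bool:
    """Return True when the two response times differ by more than a configured threshold."""
    if ratio_threshold is not None and ratio_threshold <= 1:
        raise ValueError("ratio_threshold must be greater than 1")
    slower, faster = max(t_a_ms, t_b_ms), min(t_a_ms, t_b_ms)
    # Indicator 1: absolute difference between the response times.
    if diff_threshold_ms is not None and slower - faster > diff_threshold_ms:
        return True
    # Indicator 2: ratio of the slower to the faster response time.
    if ratio_threshold is not None and faster > 0 and slower / faster > ratio_threshold:
        return True
    return False
```

For example, with response times of 5 ms and 30 ms, a difference threshold of 20 ms is exceeded (25 ms), and so is a ratio threshold of 4 (a factor of 6).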

If the difference in response times does not exceed the threshold (step S5; NO), the monitoring system 50 checks the synchronization state of the FT server 1 again, and the determination unit 53 waits for the next IO issuance.

If the difference in response times exceeds the threshold (step S5; YES), the determination unit 53 determines which hard disk has the longer response time. On detecting an abnormal tendency toward performance degradation, the determination unit 53 notifies the synchronization control unit 54 that the synchronization of the FT server 1 needs to be released. If the response time of the hard disk 24 is longer (step S6; YES), the synchronization control unit 54 releases the synchronization of the FT server 1 with the system 10 as the business continuation side and the system 20 as the hard disk diagnosis side (step S7). If the response time of the hard disk 14 is longer (step S6; NO), the synchronization control unit 54 releases the synchronization of the FT server 1 with the system 20 as the business continuation side and the system 10 as the hard disk diagnosis side (step S8).
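The role assignment in steps S6 to S8 amounts to a single comparison. The sketch below assumes it is called only after step S5 has already answered YES; `SplitPlan`, `plan_split`, and the string labels for the two systems are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class SplitPlan(NamedTuple):
    business_side: str   # system that keeps running business processing
    diagnosis_side: str  # system that diagnoses its own hard disk


def plan_split(t_hdd14_ms: float, t_hdd24_ms: float) -> SplitPlan:
    """Pick the roles for the two systems once an abnormal tendency has been detected."""
    if t_hdd24_ms > t_hdd14_ms:
        # Step S6 YES -> step S7: hard disk 24 is slower, so system 20 is diagnosed.
        return SplitPlan(business_side="system10", diagnosis_side="system20")
    # Step S6 NO -> step S8: hard disk 14 is slower, so system 10 is diagnosed.
    return SplitPlan(business_side="system20", diagnosis_side="system10")
```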

FIG. 6 is a flowchart showing the operations of the system 10 and the system 20 when synchronization is released. FIG. 6 is described using, as an example, the case where an abnormal tendency toward performance degradation has been detected in the hard disk 24 on the system 20 side.

First, the system 10 and the system 20 of the FT server 1 are operating in a synchronized state (step S11). Under the control of the synchronization control unit 54, the FT controller 12 instructs the FT controller 22 to release synchronization (step S12). In response, the FT controller 22 releases synchronization (step S16), and the FT server 1 is split into the system 10 and the system 20, which operate independently. The system 10 continues business operations (step S13), and the system 20 stops business operations (step S17). Because the system 10 continues business operations, it keeps using the IP address that was used while the FT server 1 was operating in synchronization. A different IP address is temporarily assigned on the system 20 side.

In the system 20, hard disk diagnostic process A1 is executed on the hard disk 24, in which the abnormal tendency was detected, to confirm whether or not the hard disk really has an abnormality (step S18). If hard disk diagnostic process A1 determines that the hard disk 24 has no abnormality (step S19; NO), the system 20 notifies the system 10 over the network that the hard disk diagnostic process has finished (step S20). When the system 10 receives the notification (step S14), the FT controller 12 instructs the FT controller 22 to synchronize (step S15). The system 20 receives the synchronization instruction and synchronizes using the system 10, which has continued business processing, as the base (step S21). When the synchronization process completes, the FT server 1 returns to the synchronized state (step S23).

If hard disk diagnostic process A1 determines that the hard disk 24 has an abnormality (step S19; YES), the system 20 notifies a display unit (not shown) of an error indicating that the hard disk needs to be replaced (step S22).
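The sequence of FIG. 6 can be traced with a small sketch that only records which steps run on each branch. It models no real FT controller; `FinalState`, `run_split_cycle`, and the `REPLACEMENT_REQUIRED` label are hypothetical, and the `diagnose` callback stands in for hard disk diagnostic process A1.

```python
from enum import Enum
from typing import Callable, List, Tuple


class FinalState(Enum):
    SYNCHRONIZED = "synchronized"
    REPLACEMENT_REQUIRED = "replacement_required"  # hypothetical label for the S22 outcome


def run_split_cycle(diagnose: Callable[[], bool]) -> Tuple[FinalState, List[str]]:
    """Trace the FIG. 6 flow for the case where hard disk 24 is the suspect.

    `diagnose` returns True when diagnostic process A1 finds an abnormality.
    """
    steps = [
        "S11: systems 10 and 20 run synchronized",
        "S12: FT controller 12 instructs FT controller 22 to release synchronization",
        "S16: FT controller 22 releases synchronization",
        "S13: system 10 continues business (keeps the original IP address)",
        "S17: system 20 stops business (gets a temporary IP address)",
        "S18: system 20 runs diagnostic process A1 on hard disk 24",
    ]
    if diagnose():
        steps.append("S22: report that the hard disk needs replacement")
        return FinalState.REPLACEMENT_REQUIRED, steps
    steps += [
        "S20: system 20 notifies system 10 that the diagnosis finished",
        "S14: system 10 receives the notification",
        "S15: FT controller 12 instructs FT controller 22 to synchronize",
        "S21: system 20 synchronizes with system 10 as the base",
        "S23: FT server 1 is back in the synchronized state",
    ]
    return FinalState.SYNCHRONIZED, steps
```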

FIG. 7 is a flowchart showing the diagnostic process (hard disk diagnostic process A1) according to the first embodiment. Hard disk diagnostic process A1 is performed on the hard disk 24, which has been determined to have an abnormal tendency toward performance degradation, to check whether the tendency was detected because of an incidental factor or whether a performance-degradation abnormality has really occurred.

Because hard disk diagnostic process A1 is performed after the synchronization of the FT server 1 is released, independently of the system on the business continuation side, a detailed diagnosis can be carried out without affecting business operations. The first embodiment shows a diagnostic method that issues the same kind of IO as check C1 multiple times and judges by the average response time, but the diagnostic method is not limited to this. For example, more detailed diagnoses such as a full-surface read test or a full-surface write test of the hard disk are also possible.

In hard disk diagnostic process A1, the processing unit 51 first issues an IO to the hard disk 24 (step S31), measures the response time, and has the recording unit record it (step S32). The monitoring system 50 repeats steps S31 and S32 until the specified number of measurements (X) has been completed (step S33; NO). After the specified number of measurements has been completed, the monitoring system 50 moves on to the determination process by the determination unit 53 (step S33; YES).

The determination unit 53 calculates the average response time of the hard disk 24 from its X measured response times (step S34). The determination unit 53 also calculates the average response time of the hard disk 14 from the most recent X response times of the hard disk 14 recorded in the recording unit (step S35).

The determination unit 53 compares the average response times of the hard disk 14 and the hard disk 24. If the average response time of the hard disk 24 is longer and the difference exceeds a threshold (step S36; YES), the determination unit 53 determines that the hard disk 24 has an abnormality (step S37). Otherwise (step S36; NO), it determines that the hard disk 24 has no abnormality. The threshold used here may be the same as the one used in check C1, but a smaller threshold than in check C1 may be used for a more accurate diagnosis.
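Steps S34 to S36 reduce to comparing two averages. The sketch below is a minimal illustration under stated assumptions: the function name `diagnose_a1`, its parameters, and the use of an absolute difference in milliseconds as the threshold are choices made here, and the sketch reads the reference average as belonging to the hard disk on the business continuation side.

```python
from statistics import mean
from typing import Sequence


def diagnose_a1(
    suspect_times_ms: Sequence[float],  # X fresh measurements of the suspect disk (S31-S33)
    healthy_log_ms: Sequence[float],    # recorded history of the other disk
    x: int,
    threshold_ms: float,
) -> bool:
    """Return True when the suspect disk is judged abnormal (step S36 YES)."""
    if x < 1 or len(suspect_times_ms) < x or len(healthy_log_ms) < x:
        raise ValueError("need at least x measurements for both disks")
    suspect_avg = mean(suspect_times_ms[:x])  # step S34
    healthy_avg = mean(healthy_log_ms[-x:])   # step S35: most recent x records
    # Step S36: the suspect must be slower, and by more than the threshold.
    return suspect_avg > healthy_avg and suspect_avg - healthy_avg > threshold_ms
```

With `x = 3`, suspect measurements of 30, 32, and 28 ms average to 30 ms; if the other disk's last three records are 5, 4, and 6 ms (average 5 ms), a 20 ms threshold is exceeded and the disk is judged abnormal.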

[Effects]
According to the monitoring system 50 of the first embodiment, it is possible to avoid the impact on business processing of degraded performance in one of the storage devices included in the two synchronized subsystems. This is because, in the monitoring system 50, the processing unit 51 issues an IO request to each storage device, the determination unit 53 determines an abnormal tendency of a storage device based on the difference in response times, and the synchronization control unit 54 releases the synchronization of the two subsystems. It is also because the subsystems whose synchronization has been released can each operate independently, and the processing unit 51 performs the diagnostic process for determining whether or not the storage device has an abnormality in the subsystem that includes the storage device determined to have an abnormal tendency.

According to the first embodiment, it is possible to avoid both a performance drop in the entire FT server 1 and an impact on business processing caused by a performance drop in one of the hard disks. This is because the monitoring system 50 measures the response time of IO requests to each hard disk to monitor for an abnormal trend of performance degradation, and releases the synchronization of the FT server 1 if an abnormal trend is detected.

In addition, according to the first embodiment, diagnosing a hard disk that shows an abnormal trend makes it possible to confirm whether an abnormality has actually occurred in that hard disk, thereby avoiding unnecessary replacement of the hard disk.

Furthermore, the business-side system can concentrate on business processing, and the hard-disk-diagnosis-side system can perform a more detailed diagnosis of the hard disk without affecting business processing. This is because the synchronization of the FT server 1 is released and the FT server 1 is divided into the system 10 and the system 20, which operate independently.

[Second Embodiment]
The first embodiment described a case in which the synchronization of the FT server 1 is released when an abnormal trend of performance degradation is detected in a hard disk at the check C1 stage. This gives priority to preventing hard disk performance degradation from interfering with business processing (performance priority mode).

The second embodiment describes a case in which synchronization is released while the load status of the systems 10 and 20, such as CPU load and IO load, is monitored. In the second embodiment, even if the monitoring system 50 detects an abnormal trend of performance degradation in a hard disk, it does not release the synchronization of the FT server 1 and continues diagnosing the hard disk as long as the system load is lower than a threshold and the impact on business processing is minor. Releasing the synchronization of the FT server 1 only when the load exceeds the threshold makes it possible to prioritize maintaining the synchronized state of the FT server 1 (synchronization priority mode).

[Configuration]
FIG. 8 is a block diagram illustrating the configuration of the monitoring system 50 according to the second embodiment. In FIG. 8, the monitoring system 50 is connected, by wire or wirelessly, to the system 10 and the system 20, which are two subsystems capable of operating in synchronization. Descriptions of components that are the same as in the monitoring system 50 according to the first embodiment are omitted. The monitoring system 50 according to the second embodiment differs from that of the first embodiment in that it includes the recording unit 52 and further includes a load measurement unit 55.

The load measurement unit 55 measures a system load, such as CPU load or IO load. If the system load does not exceed a predetermined threshold when an abnormal trend is determined, the synchronization control unit 54 does not release the synchronization of the subsystems, and the processing unit 51 performs the diagnostic process on the storage device in the synchronized subsystems. If the system load exceeds the predetermined threshold, the synchronization control unit 54 releases the synchronization of the subsystems.

In the second embodiment, the determination unit 53 monitors the load status measured by the load measurement unit 55. When a storage device shows an abnormal trend of performance degradation and the load exceeds the threshold, the determination unit 53 sends the synchronization control unit 54 a notification that the subsystems need to be desynchronized, together with information on which storage device shows the abnormal trend.

[Operation]
The processing flow according to the second embodiment for monitoring the FT server 1 will be described with reference to the flowcharts of FIGS. 9 to 11.

FIG. 9 is a flowchart showing the processing of the monitoring system 50 for detecting a hard disk with an abnormal trend of performance degradation according to the second embodiment. First, the processing unit 51 checks whether the FT server 1 is operating in a synchronized state. If it is not (step S41; NO), the processing unit 51 notifies a display unit (not shown) of an error indicating that the FT server 1 is not operating in a synchronized state (step S42). If it is (step S41; YES), the processing unit 51 periodically issues IO to the hard disk 14 and the hard disk 24 (step S43), measures the response time of each, and has the recording unit 52 record it (step S44).

The determination unit 53 monitors whether the difference between the response times of the hard disks exceeds a threshold. The threshold for monitoring hard disk performance degradation may be the same as in the first embodiment. If the difference in response time exceeds the threshold, the determination unit 53 detects that one of the hard disks has an abnormal trend of performance degradation.

If the difference in response time does not exceed the threshold (step S45; NO), the process returns to step S41 to check the synchronization state, and the determination unit 53 waits for the next IO to be issued.

If the difference in response time exceeds the threshold (step S45; YES), the determination unit 53 monitors whether the load exceeds a threshold. The loads to be monitored, such as CPU load or IO load, may be selected according to the characteristics of the business as those needed to gauge the degree of impact on business processing. A single load may be monitored, or a combination of loads may be monitored as needed.
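As a rough illustration of monitoring one load or a combination of loads, the following Python sketch checks a set of selected metrics against per-metric thresholds. The metric names and the rule that exceeding any one selected threshold counts as exceeding the load are assumptions; the description does not state how a combination of loads is evaluated.

```python
def load_exceeds(measured, thresholds):
    """Return True if any monitored load exceeds its threshold.

    measured:   mapping of metric name -> current value (e.g. "cpu", "io").
    thresholds: mapping of metric name -> limit; only metrics listed here
                are monitored, so a single entry monitors a single load.
    The "any metric" rule is an assumption, not taken from the disclosure.
    """
    missing = set(thresholds) - set(measured)
    if missing:
        raise KeyError(f"no measurement for: {sorted(missing)}")
    return any(measured[name] > limit for name, limit in thresholds.items())
```

Monitoring only CPU load then amounts to passing a one-entry `thresholds` mapping, while a business that is sensitive to both CPU and IO load would list both.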

If the load exceeds the threshold (step S46; YES), the determination unit 53 determines which hard disk has the longer response time. If the response time of the hard disk 24 is longer (step S47; YES), the synchronization control unit 54 releases the synchronization of the FT server 1 with the system 10 as the business continuation side and the system 20 as the hard disk diagnosis side (step S48). If the response time of the hard disk 14 is longer (step S47; NO), the synchronization control unit 54 releases the synchronization of the FT server 1 with the system 20 as the business continuation side and the system 10 as the hard disk diagnosis side (step S49).

If the load does not exceed the threshold (step S46; NO), the FT server 1 performs the hard disk diagnostic process A2 while remaining in the synchronized state (step S50). If the hard disk diagnostic process A2 determines that a hard disk is abnormal (step S51; YES), the FT server 1 detaches the abnormal hard disk from the mirroring and issues an error notification indicating that the hard disk needs to be replaced (step S52). At this time, the CPU modules of the FT server 1 continue to operate in the synchronized state. Other necessary hardware components besides the CPU modules may also continue to operate in the synchronized state.

If the hard disk diagnostic process A2 determines that the hard disks are normal (step S51; NO), the process returns to step S41 to check the synchronization state and then waits for the next IO to be issued.
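The branching in steps S45 to S50 can be summarized in a short Python sketch. The `Action` enum, the function name, and the scalar load value are hypothetical names introduced for illustration; only the step numbers and the roles of systems 10 and 20 come from the description.

```python
from enum import Enum


class Action(Enum):
    KEEP_MONITORING = "keep_monitoring"    # S45 NO: return to S41
    DIAGNOSE_IN_SYNC = "diagnose_in_sync"  # S46 NO: diagnostic process A2 (S50)
    RELEASE_SYNC = "release_sync"          # S46 YES: S48 or S49


def decide(rt_14, rt_24, diff_threshold, load, load_threshold):
    """Return (action, business_side, diagnosis_side) for one measurement.

    rt_14, rt_24: response times of hard disks 14 and 24.
    The two sides are system numbers (10 or 20), or None when the
    synchronization is not released.
    """
    if abs(rt_24 - rt_14) <= diff_threshold:    # S45 NO
        return (Action.KEEP_MONITORING, None, None)
    if load <= load_threshold:                  # S46 NO
        return (Action.DIAGNOSE_IN_SYNC, None, None)
    if rt_24 > rt_14:                           # S47 YES -> S48
        return (Action.RELEASE_SYNC, 10, 20)
    return (Action.RELEASE_SYNC, 20, 10)        # S47 NO -> S49
```

For example, `decide(10, 40, 20, 0.9, 0.8)` selects the release of synchronization with the system 10 continuing business processing, because the hard disk 24 is the slower disk and the load is above its threshold.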

FIG. 10 is a flowchart showing the diagnostic process (hard disk diagnostic process A2) according to the second embodiment. The hard disk diagnostic process A2 is performed when one of the hard disks has been determined to have an abnormal trend of performance degradation, in order to check whether the trend is due to incidental factors or whether a performance degradation abnormality has actually occurred. The hard disk diagnostic process A2 is performed while the FT server 1 is synchronized.

In the hard disk diagnostic process A2, IO is first issued to the hard disk 14 and the hard disk 24, and the response times are measured and recorded (steps S61 and S62). Next, whether the load exceeds the threshold is monitored.

If the load exceeds the threshold (step S63; YES), it is determined which hard disk has the longer response time. If the response time of the hard disk 24 is longer (step S71; YES), the synchronization of the FT server 1 is released with the system 10 as the business continuation side and the system 20 as the hard disk diagnosis side (step S72). If the response time of the hard disk 14 is longer (step S71; NO), the synchronization of the FT server 1 is released with the system 20 as the business continuation side and the system 10 as the hard disk diagnosis side (step S73).

If the load does not exceed the threshold (step S63; NO), this process is repeated until the specified number of measurements (X times) has been completed (step S64; NO). Once the specified number of measurements has been completed, the process proceeds to the determination step (step S64; YES).

In the determination step, the average response times of the hard disk 14 and the hard disk 24 over the specified number of measurements (X times) are first calculated (step S65). The average response times of the hard disk 14 and the hard disk 24 are compared, and if the difference does not exceed the threshold (step S66; NO), the hard disks are determined to be normal (step S70).

If the difference between the average response times of the hard disk 14 and the hard disk 24 exceeds the threshold (step S66; YES), the two averages are compared to see which hard disk has the longer average response time. If the average response time of the hard disk 14 is longer (step S67; NO), the hard disk 14 is determined to be abnormal (step S68). If the average response time of the hard disk 24 is longer (step S67; YES), the hard disk 24 is determined to be abnormal (step S69).
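The loop of the hard disk diagnostic process A2 can be sketched in Python as follows. The callables `measure` and `get_load`, the tuple return values, and the choice to compare the most recent pair of response times in step S71 are assumptions made for illustration; the description does not say which measurement step S71 uses.

```python
from statistics import mean


def diagnose_in_sync(measure, get_load, count, diff_threshold, load_threshold):
    """Sketch of diagnostic process A2 (steps S61 to S73).

    measure():  returns (rt_14, rt_24), one response time per hard disk.
    get_load(): returns the current system load.
    Returns ("release_sync", disk), ("abnormal", disk) or ("normal", None),
    where disk is 14 or 24.
    """
    if count < 1:
        raise ValueError("count (X) must be at least 1")
    times_14, times_24 = [], []
    for _ in range(count):                       # S64: repeat X times
        rt_14, rt_24 = measure()                 # S61, S62
        times_14.append(rt_14)
        times_24.append(rt_24)
        if get_load() > load_threshold:          # S63 YES
            # S71 (assumption: compare the latest measurement)
            return ("release_sync", 24 if rt_24 > rt_14 else 14)
    avg_14, avg_24 = mean(times_14), mean(times_24)   # S65
    if abs(avg_24 - avg_14) <= diff_threshold:        # S66 NO -> S70
        return ("normal", None)
    return ("abnormal", 24 if avg_24 > avg_14 else 14)  # S67 -> S68 / S69
```

Averaging over `count` measurements is what filters out a one-off slow response, while the load check inside the loop lets the process hand over to the desynchronized diagnosis as soon as business processing is at risk.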

FIG. 11 is a flowchart showing the operations performed by the system 10 and the system 20 when the synchronization of the FT server 1 is released because the difference in response time between the hard disks exceeds its threshold and the load exceeds its threshold. The following describes, as an example, the case in which an abnormal trend of performance degradation is detected in the hard disk 24 on the system 20 side.

First, the system 10 and the system 20 of the FT server 1 are operating in a synchronized state (step S81). When an abnormal trend of performance degradation is detected in the hard disk 24, the load also exceeds the threshold, and a notification that the FT server 1 needs to be desynchronized is received, the FT controller 12 instructs the FT controller 22 to release synchronization (step S82). This instruction divides the FT server 1 into the system 10 and the system 20, which operate independently (step S86). The system 10 continues business processing (step S83), and the system 20 stops business processing (step S87). Because the system 10 continues business processing, it keeps using the IP address that was used while the FT server 1 was operating in synchronization. A different IP address is temporarily assigned to the system 20.

The system 20 executes the hard disk diagnostic process A1 on the hard disk 24, in which the abnormal trend was detected, to determine whether the hard disk is actually abnormal. If the hard disk diagnostic process A1 determines that the hard disk 24 is normal (step S89; NO), the system 20 notifies the system 10 via the network that the hard disk diagnostic process has ended (step S90). On receiving the notification (step S84), the system 10 issues a synchronization instruction from the FT controller 12 to the FT controller 22 (step S85). The system 20 receives the synchronization instruction and synchronizes using the system 10, which has continued business processing, as the base (step S91). When the synchronization process is completed, the FT server 1 returns to the synchronized state (step S92).

If the hard disk diagnostic process A1 determines that the hard disk 24 is abnormal (step S89; YES), the FT server 1 issues an error notification indicating that the hard disk is abnormal and needs to be replaced (step S93). At this time, the hard disk 24 determined to be abnormal is placed in a state in which it cannot be synchronized (step S94), and the system 10 is notified via the network that the hard disk diagnostic process has ended (step S90). On receiving the notification (step S84), the system 10 issues a synchronization instruction from the FT controller 12 to the FT controller 22 (step S85). The system 20 receives the synchronization instruction and synchronizes using the system 10, which has continued business processing, as the base (step S91). When the synchronization process is completed, the hardware components of the FT server 1 other than the hard disk return to the synchronized state (step S92).
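The sequence of FIG. 11 for a suspect hard disk 24 can be traced with a small state model. This is only an illustrative Python sketch: the `FtServerState` class, its field names, and the log strings are hypothetical, and the actual FT controllers, network notifications, and IP address handling are not modeled.

```python
from dataclasses import dataclass


@dataclass
class FtServerState:
    synchronized: bool = True        # S81: systems 10 and 20 in sync
    business_side: tuple = (10, 20)  # systems performing business processing
    disk_24_syncable: bool = True    # False once hard disk 24 is excluded


def run_split_diagnosis(state, disk_24_abnormal):
    """Apply the FIG. 11 sequence and return a log of the steps taken."""
    log = []
    # S82, S86, S83, S87: release sync; system 10 continues business.
    state.synchronized = False
    state.business_side = (10,)
    log.append("S82/S86: sync released; system 10 continues, system 20 diagnoses")
    if disk_24_abnormal:             # S89 YES
        state.disk_24_syncable = False
        log.append("S93/S94: replacement error; hard disk 24 excluded from sync")
    log.append("S90/S84/S85: diagnosis ended; resynchronization instructed")
    # S91, S92: resynchronize from system 10. When disk 24 was excluded,
    # `synchronized` refers to the components other than that hard disk.
    state.synchronized = True
    state.business_side = (10, 20)
    log.append("S91/S92: resynchronized using system 10 as the base")
    return log
```

In both outcomes the server ends up synchronized again; the only lasting difference is whether the hard disk 24 remains eligible for synchronization.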

[Effects]
According to the second embodiment, the present disclosure can also be applied when priority is to be given to the synchronized state of the FT server 1. This is because the FT server 1 diagnoses the hard disks while maintaining its synchronization until the load exceeds the threshold. If a hard disk is determined to be abnormal, only that hard disk is detached from the mirroring.

In the second embodiment, when the load exceeds the threshold, the FT server 1 releases its synchronization and diagnoses the hard disk. The FT server 1 makes a hard disk that the diagnosis determines to be abnormal ineligible for synchronization, and returns the other hardware components to the synchronized state.

[Hardware Configuration]
In each of the embodiments described above, each component of the monitoring system 50 represents a functional block. Some or all of the components of the monitoring system 50 may be realized by any combination of a computer 500 and a program. FIG. 12 is a block diagram showing an example of the hardware configuration of the monitoring system 50. Referring to FIG. 12, the computer 500 includes, for example, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, a RAM (Random Access Memory) 503, a program 504, a storage device 505, a drive device 507, a communication interface 508, an input device 509, an input/output interface 511, and a bus 512.

The program 504 includes instructions for implementing each function of the monitoring system 50. The program 504 is stored in advance in the ROM 502, the RAM 503, or the storage device 505. The CPU 501 implements each function of the monitoring system 50 by executing the instructions included in the program 504. For example, the CPU 501 of the monitoring system 50 implements the functions of the monitoring system 50 by executing the instructions included in the program 504. The RAM 503 may also store data processed by each function of the monitoring system 50. For example, the response times to IO requests may be stored in the RAM 503 of the computer 500.

The drive device 507 reads from and writes to a recording medium 506. The communication interface 508 provides an interface with a communication network. The input device 509 is, for example, a mouse or a keyboard, and accepts information input from a user. The output device 510 is, for example, a display, and outputs (displays) information to the user. The input/output interface 511 provides an interface with peripheral devices. The bus 512 connects these hardware components. The program 504 may be supplied to the CPU 501 via the communication network, or may be stored in advance on the recording medium 506, read by the drive device 507, and supplied to the CPU 501. For example, the computer 500 and the subsystems in the embodiments described above may be connected via the communication network or via the input/output interface 511.

Note that the hardware configuration shown in FIG. 12 is an example; other components may be added, and some components may be omitted.

[Modifications]
The monitoring system 50 can be realized in various ways. The monitoring system 50 may be realized using the resources of the FT server 1. For example, the program of the monitoring system 50 may be installed on each of the hard disks 14 and 24. Also, for example, the monitoring system 50 may be realized by any combination of a computer and a program that differs for each component. Alternatively, multiple components of the monitoring system 50 may be realized by any combination of a single computer and a program.

Furthermore, some or all of the components of the monitoring system 50 may be realized by general-purpose or dedicated circuitry including a processor or the like, or by a combination of these. The circuitry may be configured as a single chip, or as multiple chips connected via a bus. Some or all of the components of the monitoring system 50 may be realized by a combination of the above-mentioned circuitry and a program.

In addition, when some or all of the components of the monitoring system 50 are realized by multiple computers, circuits, or the like, the multiple computers, circuits, or the like may be arranged centrally or in a distributed manner.

Although the present disclosure has been described above with reference to the embodiments, the present disclosure is not limited to those embodiments. Various modifications that a person skilled in the art can understand may be made to the configuration and details of the present disclosure within its scope. Furthermore, the configurations of the embodiments can be combined with each other without departing from the scope of the present disclosure.

REFERENCE SIGNS LIST
1 FT server
10, 20 Subsystem
11, 21 CPU module
12, 22 FT controller
13, 23 IO module
14, 24 Hard disk
50 Monitoring system
51 Processing unit
52 Recording unit
53 Determination unit
54 Synchronization control unit
55 Load measurement unit

Claims

A monitoring system for monitoring a fault-tolerant server having a first subsystem and a second subsystem, comprising:
a processing means for issuing an IO (Input Output) request to a storage device included in each of the first and second subsystems, which perform business processing in synchronization with each other;
a determination means for determining, based on a difference in response time to the IO request, an abnormal tendency of the storage device having the longer response time;
a load measuring means for measuring a load of the first and second subsystems; and
a synchronization control means for, when the load exceeds a predetermined threshold at the time the determination means determines that there is an abnormal tendency, releasing the synchronization between the first and second subsystems and controlling the first and second subsystems so that they can operate independently,
wherein the processing means:
when the load exceeds the predetermined threshold at the time the abnormal tendency is determined, performs, in whichever of the desynchronized first and second subsystems includes the storage device determined to have the abnormal tendency, a diagnostic process for determining whether or not the storage device has an abnormality; and
when the load does not exceed the predetermined threshold at the time the abnormal tendency is determined, performs the diagnostic process while the first and second subsystems remain synchronized.

The monitoring system of claim 1, wherein:
the processing means determines whether the first and second subsystems are operating in synchronization and, when it is determined that they are operating in synchronization, periodically issues the IO requests and measures and records the response times of the IO requests; and
the determination means determines whether the load exceeds the predetermined threshold when the difference in the response times exceeds a threshold.

A monitoring method for monitoring a fault-tolerant server having a first subsystem and a second subsystem, comprising:
issuing an IO (Input Output) request to a storage device provided in each of the first and second subsystems, which perform business processing in synchronization with each other;
determining, based on a difference in response time to the IO request, an abnormal tendency of the storage device having the longer response time;
measuring a load of the first and second subsystems;
when the load exceeds a predetermined threshold at the time the abnormal tendency is determined, releasing the synchronization of the first and second subsystems, controlling the first and second subsystems so that they can operate independently, and performing, in the subsystem including the storage device determined to have the abnormal tendency, a diagnostic process for determining whether or not there is an abnormality in the storage device; and
when the load does not exceed the predetermined threshold at the time the abnormal tendency is determined, performing the diagnostic process without releasing the synchronization between the first and second subsystems, in a state in which the first and second subsystems are synchronized.

A program for monitoring a fault-tolerant server having a first subsystem and a second subsystem, the program causing a computer to execute:
a process of issuing an IO (Input Output) request to a storage device included in each of the first and second subsystems, which perform business processing in synchronization with each other;
a process of determining, based on a difference in response time to the IO request, an abnormal tendency of the storage device having the longer response time;
a process of measuring a load of the first and second subsystems;
a process of releasing the synchronization between the first and second subsystems and controlling the first and second subsystems so that they can operate independently when the load exceeds a predetermined threshold at the time the abnormal tendency is determined;
a diagnostic process of diagnosing, in whichever of the desynchronized first and second subsystems includes the storage device determined to have the abnormal tendency, whether or not there is an abnormality in the storage device; and
a diagnostic process of diagnosing whether or not there is an abnormality in the storage device while the first and second subsystems are in a synchronized state, without releasing the synchronization between the first and second subsystems, when the load does not exceed the predetermined threshold at the time the abnormal tendency is determined.

A fault-tolerant server comprising:
a first subsystem having a first storage device, a first CPU module, and a first FT (Fault Tolerant) controller; and
a second subsystem having a second storage device, a second CPU module, and a second FT controller, wherein:
the first and second storage devices respectively included in the first and second subsystems, which perform business processing in synchronization with each other, respond to an IO (Input Output) request issued from a monitoring system;
when an abnormal tendency of the first storage device is determined based on a difference in response time to the IO request and the load of the first and second subsystems exceeds a predetermined threshold, the first FT controller controls the release of synchronization of the first and second subsystems, and a diagnosis of an abnormality of the first storage device is performed in the first subsystem from which synchronization has been released; and
when an abnormal tendency of the first storage device is determined based on the difference in response time to the IO request and the load of the first and second subsystems does not exceed the predetermined threshold, the diagnosis of an abnormality of the first storage device is performed in a state in which the first and second subsystems are synchronized.