JP7632633B2

JP7632633B2 - Virtualization system recovery device and virtualization system recovery method

Info

Publication number: JP7632633B2
Application number: JP2023531190A
Authority: JP
Inventors: 健太篠原; 紀貴堀米; 真生上野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2025-02-19
Anticipated expiration: 2041-06-29
Also published as: WO2023275984A1; JPWO2023275984A1

Description

本発明は、仮想マシンやコンテナをベースとするコンピューティング基盤において、コンテナやコンテナ上で動作するアプリケーションの異常検知及び障害復旧を実現する仮想化システム復旧装置及び仮想化システム復旧方法に関する。 The present invention relates to a virtualization system recovery device and a virtualization system recovery method that detects abnormalities and recovers failures in containers and applications running on containers in a computing platform based on virtual machines or containers.

上述した仮想マシンは、物理コンピュータと同機能をソフトウェアで実現したコンピュータである。コンテナは、アプリケーションを「コンテナ」と呼ばれる環境にパッケージ化して作成され、コンテナエンジン上で作動する仮想化技術である。従来のコンテナ系の技術では、主に後述するクーバネテス（kubernetes）が持つ後述のLiveness/Readiness Probe機能（プローブ機能ともいう）により、コンテナやコンテナ上で動作するアプリケーションの異常検知及び障害復旧が実現されている。 The virtual machine mentioned above is a computer that realizes the same functions as a physical computer using software. A container is a virtualization technology that is created by packaging an application in an environment called a "container" and runs on a container engine. In conventional container-based technologies, anomaly detection and failure recovery of containers and applications running on containers are mainly achieved by the Liveness/Readiness Probe function (also called the probe function) of Kubernetes, which will be described later.

クーバネテスは、Docker等のコンテナを作成してクラスタ化するコンテナ仮想化ソフトウェアであり、且つオープンソースソフトウェアである。Liveness Probe機能は、コンテナを再起動する等の制御を行い、Readiness Probe機能は、コンテナがリクエストを受け付けるか否か等の制御を行うものである。この種の従来技術として非特許文献１に記載の技術がある。 Kubernetes is container virtualization software that creates and clusters containers such as Docker containers, and it is also open source software. The Liveness Probe function performs control such as restarting a container, while the Readiness Probe function performs control such as whether or not a container accepts requests. An example of this type of prior art is the technology described in Non-Patent Document 1.

“Liveness Probe、Readiness ProbeおよびStartup Probeを使用する,”kubernetes，［online］，［令和３年６月１０日検索］，インターネット〈https://kubernetes.io/ja/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/〉"Using Liveness Probes, Readiness Probes, and Startup Probes," kubernetes, [online], [Retrieved June 10, 2021], Internet, https://kubernetes.io/ja/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

ところで、上述したコンテナに限らず仮想化技術領域としての仮想化システムにおいては、仮想化システム内の障害が検知されて発報されたアラートに基づいて、人力による復旧作業等が行われている。しかし、アラート発報後に人力で復旧作業を行うので、障害発生から正常化までの時間短縮が難しい。 By the way, in virtualization systems, which are not limited to the containers mentioned above but are part of the virtualization technology field, manual recovery work is performed based on the alerts that are issued when faults in the virtualization system are detected. However, since manual recovery work is performed after an alert is issued, it is difficult to reduce the time from the occurrence of a fault to normalization.

その障害をクーバネテスが持つ異常検知及び障害復旧を行うプローブ機能で障害を検知して復旧させる場合、障害監視を行う周期を、予め定められた１秒等の遅い周期にしか設定できない。このため、極力早い異常検知及び障害復旧が必要な場合に、デフォルト状態のクーバネテスが持つ異常検知復旧機能よりも早く、異常検知及び障害復旧を行うことができない、という課題があった。 When such a fault is detected and recovered using the probe function of Kubernetes, which performs anomaly detection and failure recovery, the fault monitoring period can only be set to a predetermined, slow period such as one second. As a result, when anomaly detection and failure recovery are required as quickly as possible, they cannot be performed faster than the anomaly detection and recovery function that Kubernetes provides in its default state.

本発明は、このような事情に鑑みてなされたものであり、仮想化システムで発生した障害をコンテナ仮想化ソフトウェアが持つ異常検知復旧機能よりも、早く異常検知及び障害復旧を行うことを課題とする。The present invention was made in consideration of the above circumstances, and its objective is to detect abnormalities and recover from failures that occur in a virtualization system more quickly than the anomaly detection and recovery functions of the container virtualization software.

上記課題を解決するため、本発明の仮想化システム復旧装置は、物理マシン上にコンテナ仮想化ソフトウェアにより仮想的に作成され、当該仮想的に作成されるコンテナをクラスタ化して配置する計算資源クラスタと、前記仮想的に作成され、前記クラスタ化されたコンテナの配置及び動作に係る制御を管理するクラスタ管理部と、各々が、前記計算資源クラスタ及び前記クラスタ管理部を有して構成される複数のクラスタと、前記複数のクラスタ毎に配置され、且つ前記仮想的に作成された計算資源クラスタ及びクラスタ管理部の外部に前記仮想的に作成され、前記コンテナの異常を検知する内部異常検知部と、前記複数のクラスタの外部に前記仮想的に作成され、前記内部異常検知部でのコンテナの異常検知時に当該異常のコンテナが配置されたクラスタを異常と検知する外部異常検知部と、前記複数のクラスタの外部に前記仮想的に作成され、前記外部異常検知部で検知された異常のクラスタに係る通信振分中止指示をＤＮＳ（Domain Name System）へ通知する振分先切替部と、を備えるとともに、前記コンテナに係るアプリケーションの名称を示すドメイン名と、前記クラスタ毎のＩＰ（Internet Protocol）アドレスとを対応付けて管理する前記ＤＮＳを、当該クラスタを構成するサーバの外部に備え、前記ＤＮＳは、前記通信振分中止指示で示される異常のクラスタのＩＰアドレスを消去することを特徴とする。 In order to solve the above problems, a virtualization system recovery device of the present invention includes: a computational resource cluster that is virtually created on a physical machine by container virtualization software and in which the virtually created containers are clustered and arranged; a cluster management unit that is virtually created and that manages control related to the arrangement and operation of the clustered containers; a plurality of clusters each including the computational resource cluster and the cluster management unit; an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computational resource cluster and cluster management unit, and detects an anomaly in a container; an external anomaly detection unit that is virtually created outside the plurality of clusters and that, when the internal anomaly detection unit detects an anomaly in a container, detects the cluster in which the abnormal container is arranged as abnormal; and a distribution destination switching unit that is virtually created outside the plurality of clusters and that notifies a DNS (Domain Name System) of a communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit. The DNS, which manages a domain name indicating the name of the application associated with the container in association with the IP (Internet Protocol) address of each cluster, is provided outside the servers constituting the cluster, and the DNS erases the IP address of the abnormal cluster indicated by the communication distribution stop instruction.
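The DNS behavior stated above can be pictured with a small sketch. It assumes, purely for illustration, that the DNS record table is a mapping from a domain name to the list of per-cluster IP addresses; the function name `erase_cluster_ip`, the domain `app.example.com`, and the addresses are hypothetical and do not come from the specification.

```python
from typing import Dict, List


def erase_cluster_ip(records: Dict[str, List[str]], failed_ip: str) -> Dict[str, List[str]]:
    """Return a copy of the record table with the failed cluster's IP removed.

    `records` maps a domain name (the application name) to the IP addresses
    of the clusters serving it. This data layout is an assumption.
    """
    return {
        domain: [ip for ip in ips if ip != failed_ip]
        for domain, ips in records.items()
    }


# Hypothetical table: one application served by two clusters.
table = {"app.example.com": ["192.0.2.10", "192.0.2.20"]}

# A communication distribution stop instruction names the abnormal cluster's IP.
updated = erase_cluster_ip(table, "192.0.2.10")
print(updated)
```

After the erasure, name resolution for the application can only return the address of the remaining cluster, which is how traffic stops being distributed to the abnormal cluster.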

本発明によれば、仮想化システムで発生した障害をコンテナ仮想化ソフトウェアが持つ異常検知復旧機能よりも、早く異常検知及び障害復旧を行うことができる。 According to the present invention, it is possible to detect abnormalities and recover from failures that occur in a virtualization system more quickly than the abnormality detection and recovery functions inherent in container virtualization software.

本発明の実施形態に係る仮想化システム復旧装置の構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of a virtualization system recovery device according to an embodiment of the present invention.
本実施形態の仮想化システム復旧装置における障害対応デプロイ指示部によるエンドポイント設定部とＰｏｄとを１：１の構成としてデプロイした際の構成を示すブロック図である。 FIG. 2 is a block diagram showing the configuration when an endpoint setting unit and a Pod are deployed in a 1:1 configuration by the failure response deployment instruction unit in the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のＰｏｄによるコンテナの第１異常検知処理を説明するためのブロック図である。 FIG. 3 is a block diagram for explaining the first abnormality detection process of a container by a Pod of the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のワーカーノード毎に備えられたルーティングテーブルによる第２異常検知処理を説明するためのブロック図である。 FIG. 4 is a block diagram for explaining the second abnormality detection process using the routing table provided for each worker node of the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のワーカーノード毎に備えられた仮想スイッチのデーモンの監視による第３異常検知処理を説明するためのブロック図である。 FIG. 5 is a block diagram for explaining the third abnormality detection process by monitoring the daemon of the virtual switch provided for each worker node of the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のワーカーノード毎に備えられたコンテナランタイムのデーモンの監視による第４異常検知処理を説明するためのブロック図である。 FIG. 6 is a block diagram for explaining the fourth abnormality detection process by monitoring the daemon of the container runtime provided for each worker node of the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のワーカーノード毎の監視による第５異常検知処理を説明するためのブロック図である。 FIG. 7 is a block diagram for explaining the fifth abnormality detection process by monitoring each worker node of the virtualization system recovery device of this embodiment.
本実施形態の仮想化システム復旧装置のコンテナシステムのクラスタに外付けされたＤＢの監視による第６異常検知処理を説明するためのブロック図である。 FIG. 8 is a block diagram for explaining the sixth abnormality detection process by monitoring a DB externally attached to a cluster of the container system of the virtualization system recovery device of this embodiment.
外部異常検知部によって複数のクラスタの障害発生に係る異常検知の処理について説明する構成を示すブロック図である。 FIG. 9 is a block diagram showing a configuration for explaining the process by which the external anomaly detection unit detects anomalies related to the occurrence of failures in a plurality of clusters.
本実施形態の仮想化システム復旧装置の異常対応処理を説明するためのブロック図である。 FIG. 10 is a block diagram for explaining the abnormality handling process of the virtualization system recovery device of this embodiment.
ＤＮＳレコードテーブルのドメイン名と解決先ＩＰアドレスとの対応関係を示す図である。 FIG. 11 is a diagram showing the correspondence between domain names and resolved IP addresses in a DNS record table.
ＤＮＳレコードテーブルの障害クラスタのＩＰアドレスの消去様態を示す図である。 FIG. 12 is a diagram showing how the IP address of a failed cluster is erased from the DNS record table.
本実施形態の仮想化システム復旧装置の異常対応処理の動作を説明するためのフローチャートである。 FIG. 13 is a flowchart for explaining the operation of the abnormality handling process of the virtualization system recovery device of this embodiment.
本発明の実施形態の変形例１に係る仮想化システム復旧装置の構成を示すブロック図である。 FIG. 14 is a block diagram showing the configuration of a virtualization system recovery device according to Modification 1 of the embodiment of the present invention.
本発明の実施形態の変形例２に係る仮想化システム復旧装置の構成を示すブロック図である。 FIG. 15 is a block diagram showing the configuration of a virtualization system recovery device according to Modification 2 of the embodiment of the present invention.
本実施形態に係る仮想化システム復旧装置の機能を実現するコンピュータの一例を示すハードウェア構成図である。 FIG. 16 is a hardware configuration diagram showing an example of a computer that realizes the functions of the virtualization system recovery device according to this embodiment.

以下、本発明の実施形態を、図面を参照して説明する。但し、本明細書の全図において機能が対応する構成部分には同一符号を付し、その説明を適宜省略する。
＜実施形態の構成＞
図１は、本発明の実施形態に係る仮想化システム復旧装置の構成を示すブロック図である。 Hereinafter, an embodiment of the present invention will be described with reference to the drawings. However, in all the drawings in this specification, components having corresponding functions are given the same reference numerals, and the description thereof will be omitted as appropriate.
<Configuration of the embodiment>
FIG. 1 is a block diagram showing the configuration of a virtualization system recovery device according to an embodiment of the present invention.

図１に示すコンテナシステム２０は、コンテナがクラスタ化された複数のクラスタ（本例では第１クラスタ１２Ａ及び第２クラスタ１２Ｂとする）により構成された仮想化システムである。第１クラスタ１２Ａは、クラスタ管理部１４Ａ及び計算資源クラスタ１５Ａを有して構成されている。第２クラスタ１２Ｂは、クラスタ管理部１４Ｂ及び計算資源クラスタ１５Ｂを有して構成されている。 The container system 20 shown in Figure 1 is a virtualization system composed of multiple clusters (in this example, a first cluster 12A and a second cluster 12B) in which containers are clustered. The first cluster 12A is composed of a cluster management unit 14A and a computational resource cluster 15A. The second cluster 12B is composed of a cluster management unit 14B and a computational resource cluster 15B.

クラスタ管理部１４Ａ，１４Ｂは、通信振分部１４ａと、計算資源操作部１４ｂと、計算資源管理部１４ｃと、コンテナ構成受付部１４ｄと、コンテナ配置先決定部１４ｅと、コンテナ管理部１４ｆとを備えて構成されている。計算資源クラスタ１５Ａ，１５Ｂは、複数のアプリケーション１５ａ，１５ｂを備えて構成されている。The cluster management units 14A and 14B are configured to include a communication distribution unit 14a, a computational resource operation unit 14b, a computational resource management unit 14c, a container configuration reception unit 14d, a container placement destination determination unit 14e, and a container management unit 14f. The computational resource clusters 15A and 15B are configured to include a plurality of applications 15a and 15b.

なお、クラスタ管理部１４Ａ，１４Ｂは、クラスタ管理部１４とも称し、計算資源クラスタ１５Ａ，１５Ｂは、計算資源クラスタ１５とも称す。 The cluster management units 14A and 14B are also referred to as the cluster management units 14, and the computational resource clusters 15A and 15B are also referred to as the computational resource cluster 15.

図１に示す仮想化システム復旧装置（復旧装置ともいう）１０は、コンテナシステム２０において障害が発生したコンテナの異常検知と障害復旧を行うものである。この復旧装置１０は、クラスタ管理部１４Ａ，１４Ｂと、計算資源クラスタ１５Ａ，１５Ｂと、内部異常検知部１７Ａ，１７Ｂと、異常復旧対応部１８Ａ，１８Ｂと、障害対応デプロイ指示部１９Ａ，１９Ｂと、振分先切替部２１と、外部異常検知部２３とを備えて構成されている。 The virtualization system recovery device (also referred to as the recovery device) 10 shown in Figure 1 detects anomalies and recovers from failures in containers in a container system 20. This recovery device 10 is configured with cluster management units 14A, 14B, computational resource clusters 15A, 15B, internal anomaly detection units 17A, 17B, anomaly recovery response units 18A, 18B, failure response deployment instruction units 19A, 19B, an allocation destination switching unit 21, and an external anomaly detection unit 23.

なお、内部異常検知部１７Ａ，１７Ｂは、内部異常検知部１７とも称し、異常復旧対応部１８Ａ，１８Ｂは、異常復旧対応部１８とも称し、障害対応デプロイ指示部１９Ａ，１９Ｂは、障害対応デプロイ指示部１９とも称す。 The internal anomaly detection units 17A and 17B are also referred to as the internal anomaly detection unit 17, the abnormality recovery response units 18A and 18B are also referred to as the abnormality recovery response unit 18, and the fault response deployment instruction units 19A and 19B are also referred to as the fault response deployment instruction unit 19.

各クラスタ１２Ａ，１２Ｂの内部には、内部異常検知部１７と、異常復旧対応部１８と、障害対応デプロイ指示部１９とが配備されている。各クラスタ１２Ａ，１２Ｂの外部には、振分先切替部２１及び外部異常検知部２３が配備されている。但し、内部異常検知部１７、異常復旧対応部１８、障害対応デプロイ指示部１９、振分先切替部２１及び外部異常検知部２３は、コンテナ仮想化ソフトウェアにより仮想的に作成されるクラスタ管理部１４及び計算資源クラスタ１５の外部に配備されている。また、内部異常検知部１７、異常復旧対応部１８及び障害対応デプロイ指示部１９は、振分先切替部２１及び外部異常検知部２３と同様に、各クラスタ１２Ａ，１２Ｂの外部に配備してもよい。Inside each cluster 12A, 12B, an internal anomaly detection unit 17, an anomaly recovery response unit 18, and a failure response deployment instruction unit 19 are provided. Outside each cluster 12A, 12B, an allocation destination switching unit 21 and an external anomaly detection unit 23 are provided. However, the internal anomaly detection unit 17, the anomaly recovery response unit 18, the failure response deployment instruction unit 19, the allocation destination switching unit 21, and the external anomaly detection unit 23 are provided outside the cluster management unit 14 and the computational resource cluster 15 that are virtually created by the container virtualization software. In addition, the internal anomaly detection unit 17, the anomaly recovery response unit 18, and the failure response deployment instruction unit 19 may be provided outside each cluster 12A, 12B, like the allocation destination switching unit 21 and the external anomaly detection unit 23.

第１及び第２クラスタ１２Ａ，１２Ｂは実質上同構成であるため、第１クラスタ１２Ａを代表して機能構成を説明する。 Since the first and second clusters 12A, 12B have substantially the same configuration, the functional configuration will be explained using the first cluster 12A as a representative.

計算資源クラスタ１５は、複数のアプリケーション１５ａ，１５ｂを備えて構成されている。アプリケーション１５ａ，１５ｂは、言い換えれば、１又は複数のコンテナの集合体の管理単位としてのＰｏｄ（図３に示すＰｏｄ１５ａ，１５ｂ参照）である。Ｐｏｄは、クーバネテス（コンテナ仮想化ソフトウェア）で実行できるアプリケーションの最小単位である。つまり、Ｐｏｄとしてのアプリケーション１５ａ，１５ｂでコンテナを作成してクラスタ化し、このクラスタをコンテナエンジン上で作動させるようになっている。この計算資源クラスタ１５は、物理マシン上にコンテナ仮想化ソフトウェアにより仮想的に作成され、当該仮想的に作成されるコンテナをクラスタ化して配置するものである。 The computational resource cluster 15 is configured with multiple applications 15a and 15b. In other words, the applications 15a and 15b are Pods (see Pods 15a and 15b in FIG. 3), each serving as a management unit for a collection of one or more containers. A Pod is the smallest unit of an application that can be executed by Kubernetes (container virtualization software). That is, containers are created and clustered by the applications 15a and 15b as Pods, and this cluster is operated on a container engine. The computational resource cluster 15 is virtually created on a physical machine by the container virtualization software, and the virtually created containers are clustered and arranged in it.

クラスタ管理部１４は、上記仮想的に作成され、上記クラスタ化されたコンテナの配置及び動作に係る制御を管理するものである。このクラスタ管理部１４は、通信振分部１４ａと、計算資源操作部１４ｂと、計算資源管理部１４ｃと、コンテナ構成受付部１４ｄと、コンテナ配置先決定部１４ｅと、コンテナ管理部１４ｆとを備えて構成されている。The cluster management unit 14 manages the control related to the placement and operation of the virtually created and clustered containers. The cluster management unit 14 is configured with a communication distribution unit 14a, a computational resource operation unit 14b, a computational resource management unit 14c, a container configuration reception unit 14d, a container placement destination determination unit 14e, and a container management unit 14f.

このような構成の復旧装置１０において、障害対応デプロイ指示部（デプロイ指示部ともいう）１９は、図２に示すエンドポイント（終点）設定部１４ｊ，１４ｋとＰｏｄ１５ａ，１５ｂとを、１：１の構成としてデプロイ（配置）する処理を行う。エンドポイント設定部１４ｊ，１４ｋは、複数のＰｏｄ１５ａ，１５ｂ毎に対応付けられ、各Ｐｏｄ１５ａ，１５ｂへのトラフィックの振分割合（％）が設定され、通信データの終点となる。In the recovery device 10 configured as above, the failure response deployment instruction unit (also referred to as the deployment instruction unit) 19 performs a process of deploying (placing) the endpoint (end point) setting units 14j, 14k and the Pods 15a, 15b shown in Fig. 2 in a 1:1 configuration. The endpoint setting units 14j, 14k are associated with each of the multiple Pods 15a, 15b, and the traffic allocation ratio (%) to each of the Pods 15a, 15b is set, and the endpoint setting units 14j, 14k become the end points of communication data.

図１に示す内部異常検知部１７は、コンテナシステム２０内の１又は複数のコンテナであるＰｏｄ（アプリケーション）１５ａ，１５ｂの異常を検知する。The internal anomaly detection unit 17 shown in Figure 1 detects abnormalities in Pods (applications) 15a, 15b, which are one or more containers within the container system 20.

異常復旧対応部１８は、内部異常検知部１７で異常が検知されたＰｏｄ（例えばＰｏｄ１５ａ）に対応付けられたデプロイ指示部１９のウエイト値を０％に変更して、異常Ｐｏｄ１５ａを切り離すための変更コマンドを通信振分部１４ａへ送信する。また、異常復旧対応部１８は、その切り離したＰｏｄ１５ａを復旧する場合、復旧対象のＰｏｄ１５ａへのトラフィックを予め定められた所定トラフィック値まで徐々に上げるための復旧コマンドを通信振分部１４ａへ送信する。The abnormality recovery response unit 18 changes the weight value of the deployment instruction unit 19 associated with a Pod (e.g., Pod 15a) in which an abnormality has been detected by the internal abnormality detection unit 17 to 0%, and transmits a change command to the communication distribution unit 14a to separate the abnormal Pod 15a. When restoring the separated Pod 15a, the abnormality recovery response unit 18 transmits a recovery command to the communication distribution unit 14a to gradually increase traffic to the Pod 15a to be restored up to a predetermined traffic value.

クラスタ管理部１４において、通信振分部１４ａは、ルータであり、異常復旧対応部１８からの変更コマンド又は復旧コマンドを該当する各部１４ｂ～１４ｆへ振り分けて通知する。また、通信振分部１４ａは、後述のエンドポイント設定部１４ｊ，１４ｋ毎に設定されるトラフィック振分割合を示すウエイト値（％）をもとに、送信先のエンドポイント設定部１４ｊ，１４ｋ（後述）へのトラフィックの振り分けを行う。In the cluster management unit 14, the communication distribution unit 14a is a router that distributes and notifies the change commands or recovery commands from the abnormality recovery response unit 18 to the corresponding units 14b to 14f. The communication distribution unit 14a also distributes traffic to the destination endpoint setting units 14j and 14k (described later) based on a weight value (%) indicating the traffic distribution ratio set for each of the endpoint setting units 14j and 14k (described later).
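The weight-based distribution and the detach and recovery commands described above can be sketched as follows. This is a minimal illustration under assumptions: the specification does not say how weights are stored, how the ramp is stepped, or how an endpoint is chosen, so the dictionary layout, the fixed `step`, and the weighted random choice are all hypothetical.

```python
import random
from typing import Dict, Iterator


def detach(weights: Dict[str, float], endpoint: str) -> Dict[str, float]:
    """Return a copy in which the abnormal endpoint's weight is set to 0 %."""
    updated = dict(weights)
    updated[endpoint] = 0.0
    return updated


def ramp_up(start: float, target: float, step: float) -> Iterator[float]:
    """Yield weights rising gradually from `start` to the predetermined `target`."""
    if step <= 0:
        raise ValueError("step must be positive")
    weight = start
    while weight < target:
        weight = min(weight + step, target)
        yield weight


def pick_endpoint(weights: Dict[str, float], rng: random.Random) -> str:
    """Choose a destination endpoint in proportion to its weight (assumed policy)."""
    names = [name for name, w in weights.items() if w > 0]
    if not names:
        raise RuntimeError("no endpoint has a positive weight")
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Endpoint setting units 14j and 14k, each initially receiving 50 % of traffic.
weights = detach({"14j": 50.0, "14k": 50.0}, "14j")
print(weights, pick_endpoint(weights, random.Random(0)))
print(list(ramp_up(0.0, 50.0, 20.0)))
```

With the weight of 14j at 0 %, every request goes to 14k. During recovery, the weight of 14j would be raised through each value yielded by `ramp_up` rather than restored in one step.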

コンテナ構成受付部（受付部ともいう）１４ｄは、計算資源クラスタ１５にコンテナをデプロイする構成情報を外部サーバ等から受け取る。The container configuration reception unit (also referred to as the reception unit) 14d receives configuration information for deploying a container to the computational resource cluster 15 from an external server, etc.

コンテナ配置先決定部（配置先決定部ともいう）１４ｅは、受付部１４ｄで受け付けた構成情報をもとに、どのコンテナを、どのワーカーノード（計算資源クラスタ１５）に配置するかを決める。The container placement destination determination unit (also referred to as the placement destination determination unit) 14e determines which container to place on which worker node (computational resource cluster 15) based on the configuration information received by the reception unit 14d.

コンテナ管理部１４ｆは、コンテナが正常に動作中か否か等をチェックする。 The container management unit 14f checks whether the container is operating normally or not.

計算資源管理部１４ｃは、ワーカーノードが動作可能か否か、ワーカーノードを構成するサーバの計算資源の使用量、ＣＰＵ（Central Processing Unit）残量等を把握して管理する。The computational resource management unit 14c grasps and manages whether a worker node is operational, the amount of computational resources used by the servers that make up the worker node, the remaining CPU (Central Processing Unit) capacity, etc.

計算資源操作部１４ｂは、あるコンテナに対して一定量のＣＰＵ等の計算資源を、所定量割り当てる操作、言い換えれば、ストレージ容量の割り当て、ＣＰＵ時間、コンテナが使用可能なメモリ容量等を割り当てる操作を行う。The computational resource operation unit 14b performs operations to allocate a certain amount of computational resources, such as a CPU, to a certain container, in other words, to allocate storage capacity, CPU time, memory capacity available to the container, etc.

次に、復旧装置１０の内部異常検知部１７によるコンテナシステム２０のコンテナに係る各種の異常検知処理（第１～第６異常検知処理）について、図３～図８を参照して説明する。Next, various abnormality detection processes (first to sixth abnormality detection processes) related to containers in the container system 20 performed by the internal abnormality detection unit 17 of the recovery device 10 will be explained with reference to Figures 3 to 8.

＜第１異常検知処理＞
図３は、本実施形態の仮想化システム復旧装置１０のＰｏｄ（アプリケーション）１５ａ，１５ｂによるコンテナの第１異常検知処理を説明するためのブロック図である。但し、各Ｐｏｄ１５ａ，１５ｂは、１又は複数のコンテナを構成している。 <First abnormality detection process>
FIG. 3 is a block diagram for explaining the first abnormality detection process of a container by the Pods (applications) 15a and 15b of the virtualization system recovery device 10 of the present embodiment. Each of the Pods 15a and 15b consists of one or more containers.

図３において、コンテナシステム２０内には、仮想マシンによってマスタノード１４Ａと、インフラノード１４Ｂと、ワーカーノード１５Ａ，１５Ｂとが構成され、各々が仮想スイッチ｛ＯＶＳ（Open vSwitch）｝３０によって接続されるようになっている。但し、仮想スイッチは、ＯＶＳ以外に他の仮想スイッチであってもよい。マスタノード１４Ａ及びインフラノード１４Ｂは、クラスタ管理部１４Ａ，１４Ｂ（図１）に対応し、ワーカーノード１５Ａ，１５Ｂは、計算資源クラスタ１５Ａ，１５Ｂ（図１）に対応している。 In FIG. 3, within the container system 20, a master node 14A, an infrastructure node 14B, and worker nodes 15A and 15B are configured by virtual machines, and each is connected by a virtual switch {OVS (Open vSwitch)} 30. However, the virtual switch may be a virtual switch other than OVS. The master node 14A and the infrastructure node 14B correspond to the cluster management units 14A and 14B (FIG. 1), and the worker nodes 15A and 15B correspond to the computational resource clusters 15A and 15B (FIG. 1).

更に、マスタノード１４Ａ及びワーカーノード１５Ａで１つ目のクラスタ１２が構成され、インフラノード１４Ｂ及びワーカーノード１５Ｂで２つ目のクラスタ１２が構成されている。これらのクラスタ１２でコンテナシステム２０が構成されているとする。 Furthermore, a first cluster 12 is formed by the master node 14A and the worker node 15A, and a second cluster 12 is formed by the infrastructure node 14B and the worker node 15B. It is assumed that these clusters 12 form a container system 20.

コンテナシステム２０の外部には、図１の構成と同様に内部異常検知部１７が配置されている。図３では内部異常検知部１７をワーカーノード１５Ａ，１５Ｂ毎に合計２つ記載しているが、１つであってもよい。マスタノード１４Ａ、インフラノード１４Ｂ、ワーカーノード１５Ａ，１５Ｂ、及び内部異常検知部１７は、ネットワーク２２によって対向装置２４に接続されている。対向装置２４は、コンテナシステム２０に対して要求信号等を送信する外部サーバ等の通信装置である。An internal anomaly detection unit 17 is arranged outside the container system 20, similar to the configuration in Figure 1. In Figure 3, a total of two internal anomaly detection units 17 are shown for each worker node 15A, 15B, but there may be only one. The master node 14A, infrastructure node 14B, worker nodes 15A, 15B, and internal anomaly detection unit 17 are connected to a counterpart device 24 via a network 22. The counterpart device 24 is a communication device such as an external server that transmits request signals, etc. to the container system 20.

内部異常検知部１７は、往復矢印Ｙ１，Ｙ２で示すポーリングによって、所定のコマンド（例えば「sudo crictl ps」）をワーカーノード１５Ａ，１５ＢのＰｏｄ１５ａ，１５ｂへ送信し、コマンドに応じてＰｏｄ１５ａ，１５ｂから返信されてくる応答結果により正常か異常かを判断する。このポーリング実試験においては、ポーリングを１０回実行した際の往復時間の平均値が０．０６秒であった。 The internal anomaly detection unit 17 sends a predetermined command (e.g., "sudo crictl ps") to the Pods 15a and 15b of the worker nodes 15A and 15B by polling, as indicated by the round-trip arrows Y1 and Y2, and judges whether each Pod is normal or abnormal based on the response returned from the Pods 15a and 15b. In an actual polling test, the average round-trip time over 10 polls was 0.06 seconds.

内部異常検知部１７での異常判断は、ポーリングによってＰｏｄ１５ａ，１５ｂから返信されてくるコマンド応答結果に記載された正常又は異常を示す文字列を読み取って行う。例えば、文字列の「Running」はコンテナ（Ｐｏｄ１５ａ，１５ｂ）の動作が正常であることを示し、「Running」以外の文字列は異常であることを示す。このため、内部異常検知部１７は、コマンド応答結果に「Running」が記載の場合はコンテナ（Ｐｏｄ１５ａ，１５ｂ）の動作が正常と判断し、「Running」以外の文字列が記載の場合は異常と判断する。The internal anomaly detection unit 17 judges whether an abnormality has occurred by reading a character string indicating normality or abnormality that is written in the command response result returned from Pod 15a, 15b by polling. For example, the character string "Running" indicates that the operation of the container (Pod 15a, 15b) is normal, and a character string other than "Running" indicates an abnormality. For this reason, the internal anomaly detection unit 17 judges that the operation of the container (Pod 15a, 15b) is normal if the command response result contains "Running," and judges that an abnormality has occurred if the command response result contains a character string other than "Running."
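The polling and judgment rule of the first abnormality detection process can be sketched as below. The command transport is abstracted as a caller-supplied function `send_command` (a hypothetical name), so the sketch does not assume how the command actually reaches the worker node. It also records the mean round-trip time, in the spirit of the 10-poll measurement mentioned above.

```python
import time
from typing import Callable, List, Tuple


def judge_container(response: str) -> bool:
    """Normal (True) only if the response contains the string 'Running'."""
    return "Running" in response


def poll(send_command: Callable[[], str], rounds: int = 10,
         interval_s: float = 0.0) -> Tuple[List[bool], float]:
    """Poll `rounds` times; return the verdicts and the mean round-trip time in seconds."""
    verdicts: List[bool] = []
    elapsed: List[float] = []
    for _ in range(rounds):
        start = time.perf_counter()
        response = send_command()  # e.g. the result of running "sudo crictl ps"
        elapsed.append(time.perf_counter() - start)
        verdicts.append(judge_container(response))
        time.sleep(interval_s)
    return verdicts, sum(elapsed) / len(elapsed)


# Stand-in transport that always reports a running container (illustrative text only).
verdicts, mean_rtt = poll(lambda: "container-1 Running app", rounds=3)
print(verdicts, mean_rtt)
```

Because the transport is injected, the same loop could be reused with any of the other commands by swapping the judgment function.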

＜第２異常検知処理＞
次に、図４は、本実施形態の仮想化システム復旧装置１０のワーカーノード１５Ａ，１５Ｂ毎に備えられたルーティングテーブル１５ｃによる第２異常検知処理を説明するためのブロック図である。 <Second abnormality detection process>
Next, FIG. 4 is a block diagram for explaining a second abnormality detection process by the routing table 15c provided for each of the worker nodes 15A, 15B of the virtualization system recovery device 10 of this embodiment.

ルーティングテーブル（テーブルともいう）１５ｃは、対向装置２４からネットワーク２２を介して、ワーカーノード１５Ａ，１５ＢのＰｏｄ１５ａ，１５ｂへ送信されるパケットの送信先のコンテナを、送信先を示す経路情報で管理している。このテーブル１５ｃの送信先管理が正しくないと、適切なコンテナにパケットが届かないこととなる。このため、内部異常検知部１７でテーブル１５ｃの送信先管理の正常又は異常を検知するようにした。 The routing table (also simply called the table) 15c manages, by means of route information indicating the destination, the destination container of each packet sent from the counterpart device 24 via the network 22 to the Pods 15a and 15b of the worker nodes 15A and 15B. If the destination management in the table 15c is incorrect, packets will not reach the appropriate container. For this reason, the internal anomaly detection unit 17 is configured to detect whether the destination management in the table 15c is normal or abnormal.

但し、ルーティングテーブル１５ｃは、「iptables」と「nftables」の一対のテーブルから構成されている。この他、ルーティングテーブル１５ｃは、「iptables」のみ、又は「nftables」のみで構成されていてもよい。 Note that the routing table 15c is composed of a pair of tables, "iptables" and "nftables". Alternatively, the routing table 15c may be composed of only "iptables" or only "nftables".

内部異常検知部１７は、往復矢印Ｙ３，Ｙ４で示すポーリングによって、所定のコマンドをワーカーノード１５Ａ，１５Ｂの各テーブル１５ｃへ送信し、コマンドに応じて各テーブル１５ｃから返信されてくる応答結果により正常か異常かを判断する。The internal anomaly detection unit 17 sends a specified command to each table 15c of the worker nodes 15A and 15B by polling as indicated by the reciprocating arrows Y3 and Y4, and determines whether the table is normal or abnormal based on the response results returned from each table 15c in response to the command.

上記所定のコマンドは、「sudo iptables -L | wc -l」、及び、「sudo nft list ruleset」の一対である。コマンド「sudo iptables -L | wc -l」がテーブル１５ｃの「iptables」に通知され、コマンド「sudo nft list ruleset」が「nftables」に通知される。そして、「iptables」及び「nftables」の各テーブルがコマンドに応じた応答を内部異常検知部１７へ返信するようになっている。 The predetermined commands are the pair "sudo iptables -L | wc -l" and "sudo nft list ruleset". The command "sudo iptables -L | wc -l" is sent to "iptables" in the table 15c, and the command "sudo nft list ruleset" is sent to "nftables". Each of the "iptables" and "nftables" tables then returns a response to the command to the internal anomaly detection unit 17.

一対のコマンドによるポーリング実試験においては、ポーリングを１０回実行した際の往復時間の平均値が、コマンド「sudo iptables -L | wc -l」の場合に０．０３秒であり、コマンド「sudo nft list ruleset」の場合に０．０８秒であった。 In an actual polling test using the pair of commands, the average round-trip time over 10 polls was 0.03 seconds for the command "sudo iptables -L | wc -l" and 0.08 seconds for the command "sudo nft list ruleset".

内部異常検知部１７での異常判断は、各テーブル１５ｃから返信されてくるコマンド応答結果に、送信先の経路情報が記載されていれば正常と判断し、何も記載されていなければ異常と判断する。The internal anomaly detection unit 17 judges anomalies as follows: if the command response result returned from each table 15c contains destination route information, it is judged to be normal; if nothing is contained, it is judged to be abnormal.
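The judgment rule for the pair of tables can be sketched as a per-table emptiness check. This is only an illustration of the stated rule (content present means normal, nothing present means abnormal); the function names and the sample response strings are hypothetical, and the sketch does not model what the real commands print.

```python
from typing import Dict


def has_route_info(response: str) -> bool:
    """Normal (True) if the response contains anything other than whitespace."""
    return bool(response.strip())


def judge_tables(responses: Dict[str, str]) -> Dict[str, bool]:
    """Judge each table ('iptables', 'nftables') from its own command response."""
    return {table: has_route_info(text) for table, text in responses.items()}


# Illustrative responses: one table answered with content, the other with nothing.
verdicts = judge_tables({"iptables": "route info (placeholder)", "nftables": "   \n"})
print(verdicts)
```

Judging the two tables separately keeps the sketch usable for the single-table configurations ("iptables" only or "nftables" only) that the text also allows.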

＜第３異常検知処理＞
次に、図５は、本実施形態の仮想化システム復旧装置１０のワーカーノード１５Ａ，１５Ｂ毎に備えられた仮想スイッチ３０のデーモンの監視による第３異常検知処理を説明するためのブロック図である。なお、仮想スイッチ３０のデーモンを、ＯＶＳデーモンとも称す。 <Third abnormality detection process>
FIG. 5 is a block diagram for explaining the third abnormality detection process by monitoring the daemon of the virtual switch 30 provided for each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of the present embodiment. The daemon of the virtual switch 30 is also called the OVS daemon.

デーモンは、仮想スイッチ３０においてパケットの送信先を管理するプログラムである。内部異常検知部１７で、ＯＶＳデーモンを監視し、パケットが適正に送信されていれば正常、送信されていなければ異常と検知するようにした。 The daemon is a program that manages the destinations of packets in the virtual switch 30. The internal anomaly detection unit 17 monitors the OVS daemon and detects a normal state if packets are being sent properly and an abnormal state if they are not.

内部異常検知部１７は、往復矢印Ｙ５，Ｙ６で示すポーリングによって、所定のコマンド（例えば「ps aux | grep ovs-vswitchd | grep "db.sock" | wc -l」）をワーカーノード１５Ａ，１５Ｂ毎の仮想スイッチ３０へ送信し、コマンドに応じて仮想スイッチ３０から返信されてくる応答結果により正常か異常かを判断する。 The internal anomaly detection unit 17 sends a predetermined command (for example, "ps aux | grep ovs-vswitchd | grep "db.sock" | wc -l") to the virtual switch 30 of each worker node 15A, 15B by polling, as indicated by the round-trip arrows Y5 and Y6, and judges whether the virtual switch 30 is normal or abnormal based on the response returned from it.

このポーリング実試験においては、ポーリングを１０回実行した際の往復時間の平均値が０．０３秒であった。 In this actual polling test, the average round trip time when polling was performed 10 times was 0.03 seconds.

内部異常検知部１７での異常判断は、各仮想スイッチ３０から返信されてくるコマンド応答結果に、送信先に係る例えば「db.sockプロセス」が記載されていれば正常と判断し、記載されていなければ異常と判断する。The internal anomaly detection unit 17 judges an anomaly as follows: if the command response result returned from each virtual switch 30 contains, for example, a "db.sock process" related to the destination, it is judged as normal; if it does not contain such a process, it is judged as abnormal.
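The filtering performed by the example command can be mirrored in a few lines. The sketch below takes a process listing as text, keeps the lines that mention both `ovs-vswitchd` and `db.sock`, and treats a count of one or more as normal. The threshold and the sample listing are assumptions made for illustration; the listing is not captured from a real system.

```python
from typing import Iterable


def count_matching(ps_output: str,
                   patterns: Iterable[str] = ("ovs-vswitchd", "db.sock")) -> int:
    """Count lines containing every pattern, like chained grep followed by a line count."""
    wanted = tuple(patterns)
    return sum(
        1 for line in ps_output.splitlines()
        if all(p in line for p in wanted)
    )


def ovs_daemon_is_normal(ps_output: str) -> bool:
    """Normal (True) if at least one matching process line is present (assumed threshold)."""
    return count_matching(ps_output) >= 1


# Illustrative process listing, not real output.
sample = (
    "root  812  ovs-vswitchd unix:/var/run/openvswitch/db.sock\n"
    "root  900  sshd -D\n"
)
print(count_matching(sample), ovs_daemon_is_normal(sample))
```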

＜第４異常検知処理＞
次に、図６は、本実施形態の仮想化システム復旧装置１０のワーカーノード１５Ａ，１５Ｂ毎に備えられたコンテナランタイム１５ｄのデーモンの監視による第４異常検知処理を説明するためのブロック図である。なお、上記デーモンは、ｃｒｉｏデーモンとも称し、コンテナランタイム１５ｄの一例である。ｃｒｉｏ（ｃｒｉ－ｏ）は、コンテナ型仮想化技術で使われるオープンソースのコミュニティ主導型のコンテナエンジンである。 <Fourth abnormality detection process>
FIG. 6 is a block diagram for explaining the fourth abnormality detection process by monitoring the daemon of the container runtime 15d provided for each of the worker nodes 15A and 15B of the virtualization system recovery device 10 of the present embodiment. The daemon is also called the crio daemon and is an example of the container runtime 15d. crio (cri-o) is an open source, community-driven container engine used in container-based virtualization technology.

コンテナランタイム１５ｄは、Ｐｏｄ１５ａ，１５ｂのコンテナを起動する役割を担うので、コンテナランタイム１５ｄを監視することでコンテナが正常に起動しているか否かを検知できる。そこで、内部異常検知部１７で、ｃｒｉｏデーモンを監視し、コンテナが起動していれば正常、起動していなければ異常と検知するようにした。 The container runtime 15d is responsible for starting the containers of the Pods 15a and 15b, so monitoring the container runtime 15d makes it possible to detect whether the containers have started normally. Therefore, the internal anomaly detection unit 17 monitors the crio daemon and detects a normal state if the containers are running and an abnormal state if they are not.

内部異常検知部１７は、往復矢印Ｙ７，Ｙ８で示すポーリングによって、所定のコマンド（例えば「systemctl status crio | grep Active」）をワーカーノード１５Ａ，１５Ｂ毎のコンテナランタイム１５ｄへ送信し、コマンドに応じて各コンテナランタイム１５ｄから返信されてくる応答結果により正常か異常かを判断する。 The internal anomaly detection unit 17 sends a predetermined command (e.g., "systemctl status crio | grep Active") to the container runtime 15d of each worker node 15A, 15B by polling, as indicated by the round-trip arrows Y7 and Y8, and judges whether the state is normal or abnormal based on the response returned from each container runtime 15d.

内部異常検知部１７での異常判断は、各コンテナランタイム１５ｄから返信されてくるコマンド応答結果において、ｃｒｉｏデーモンの起動状態を示す”active(running)”が記載されていれば正常と判断し、”active(running)”以外の記載であれば異常と判断する。 The internal anomaly detection unit 17 makes this judgment as follows: if the command response result returned from each container runtime 15d contains "active (running)", which indicates that the crio daemon is running, the state is judged to be normal; if it contains anything other than "active (running)", the state is judged to be abnormal.
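
The judgment rule of the fourth abnormality detection process can be sketched as follows. This is a minimal Python sketch, assuming the response text has already been collected from the worker node; the function name `is_crio_active` is hypothetical and does not appear in the embodiment. Ignoring spaces when comparing is also an assumption, made so that both "active (running)" and "active(running)" match.

```python
def is_crio_active(response: str) -> bool:
    """Return True when the 'Active:' line reports active (running)."""
    for line in response.splitlines():
        stripped = line.strip()
        if stripped.startswith("Active:"):
            # Drop spaces so "active (running)" and "active(running)" both match.
            state = stripped[len("Active:"):].replace(" ", "")
            return state.startswith("active(running)")
    # A response without any "Active:" line is treated as abnormal (assumption).
    return False
```

Any other state on the "Active:" line, such as an inactive or failed daemon, makes the function return False, which corresponds to the abnormal judgment.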

＜第５異常検知処理＞ <Fifth abnormality detection process>
次に、図７は、本実施形態の仮想化システム復旧装置１０のワーカーノード１５Ａ，１５Ｂ毎の監視による第５異常検知処理を説明するためのブロック図である。 Next, FIG. 7 is a block diagram for explaining a fifth abnormality detection process in which the virtualization system recovery device 10 of this embodiment monitors each of the worker nodes 15A, 15B.

但し、ワーカーノード１５Ａ，１５Ｂが、物理マシン３２による仮想化技術（仮想マシン）で作成されている構成を前提とする。この構成の場合、仮想マシンの外側の物理マシン３２上に内部異常検知部１７が存在し、この内部異常検知部１７で仮想マシンが起動していればコンテナが正常と検知し、起動していなければコンテナが異常と検知するようにした。However, this assumes a configuration in which worker nodes 15A and 15B are created using virtualization technology (virtual machines) using a physical machine 32. In this configuration, an internal anomaly detection unit 17 exists on the physical machine 32 outside the virtual machines, and this internal anomaly detection unit 17 detects that the container is normal if the virtual machine is running, and detects that the container is abnormal if the virtual machine is not running.

内部異常検知部１７は、往復矢印Ｙ９，Ｙ１０で示すポーリングによって、所定のコマンド（例えば「sudo virsh list」）をワーカーノード１５Ａ，１５Ｂ毎へ送信し、コマンドに応じて各ワーカーノード１５Ａ，１５Ｂから返信されてくる応答結果により正常か異常かを判断する。The internal anomaly detection unit 17 sends a specified command (e.g., "sudo virsh list") to each worker node 15A, 15B by polling as indicated by the reciprocating arrows Y9, Y10, and determines whether the node is normal or abnormal based on the response returned from each worker node 15A, 15B in response to the command.

内部異常検知部１７での異常判断は、各ワーカーノード１５Ａ，１５Ｂから返信されてくるコマンド応答結果において、対象のワーカーノード１５Ａ，１５Ｂの起動状態を示す”running”が記載されていれば正常と判断し、”running”以外の記載であれば異常と判断する。The internal anomaly detection unit 17 judges an anomaly as follows: if the command response result returned from each worker node 15A, 15B contains "running", which indicates the startup status of the target worker node 15A, 15B, then it is judged to be normal; if it contains anything other than "running", then it is judged to be abnormal.
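
The fifth judgment can be illustrated by parsing the table that `virsh list` prints. This is a minimal Python sketch under stated assumptions: the output has the usual "Id / Name / State" columns, and the function name `vm_is_running` and the domain names used in the usage note are hypothetical.

```python
def vm_is_running(virsh_output: str, domain: str) -> bool:
    """Return True when the row for `domain` reports the state 'running'."""
    for line in virsh_output.splitlines():
        fields = line.split()
        # Data rows look like: <Id> <Name> <State...>
        if len(fields) >= 3 and fields[1] == domain:
            # The state may span several words, so join the remaining fields.
            return " ".join(fields[2:]) == "running"
    # A worker node that is not listed at all is treated as abnormal.
    return False
```

For example, with one row per worker node virtual machine, `vm_is_running(output, "worker-a")` would be evaluated once for each node on every polling cycle.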

＜第６異常検知処理＞ <Sixth abnormality detection process>
次に、図８は、本実施形態の仮想化システム復旧装置１０のコンテナシステム２０のクラスタ１２に外付けされたＤＢ（Data Base）２６ａ，２６ｂの監視による第６異常検知処理を説明するためのブロック図である。 Next, FIG. 8 is a block diagram for explaining a sixth abnormality detection process that monitors the DBs (Data Bases) 26a, 26b externally attached to the cluster 12 of the container system 20 of the virtualization system recovery device 10 of this embodiment.

各クラスタ１２Ａ，１２Ｂ（図１）の外付けの装置として、コンテナに係るデータを記憶するＤＢ（外部ＤＢともいう）２６ａ，２６ｂを、ネットワーク２２を介してワーカーノード１５Ａ，１５Ｂに接続する構成がある。この際、内部異常検知部１７もネットワーク２２を介してワーカーノード１５Ａ，１５Ｂに接続されている。 As an external device of each cluster 12A, 12B (FIG. 1), DBs (also called external DBs) 26a, 26b that store data related to containers are configured to be connected to the worker nodes 15A, 15B via the network 22. In this case, the internal anomaly detection unit 17 is also connected to the worker nodes 15A, 15B via the network 22.

ここで、各クラスタ１２Ａ，１２Ｂがネットワーク２２を介して相互に接続される構成もあるので、図８に示すように、内部異常検知部１７がネットワーク２２を介してクラスタ１２に接続されていても、図１に示したと同様に、各クラスタ１２Ａ，１２Ｂ内の内部異常検知部１７と位置付ける。Here, since there is a configuration in which each cluster 12A, 12B is connected to each other via a network 22, even if the internal anomaly detection unit 17 is connected to the cluster 12 via the network 22 as shown in Figure 8, it is positioned as the internal anomaly detection unit 17 within each cluster 12A, 12B, as shown in Figure 1.

内部異常検知部１７は、往復矢印Ｙ１１，Ｙ１２で示すポーリングによって、ネットワーク２２を介して外部ＤＢ２６ａ，２６ｂに所定のコマンドを送信し、コマンドに応じて各外部ＤＢ２６ａ，２６ｂから返信されてくる応答結果により正常か異常かを判断する。この場合のコマンドは、外部ＤＢ２６ａ，２６ｂの種類に依存したものとなる。The internal abnormality detection unit 17 transmits a predetermined command to the external DBs 26a and 26b via the network 22 by polling as indicated by the reciprocating arrows Y11 and Y12, and judges whether the external DBs 26a and 26b are normal or abnormal based on the response results returned from the external DBs 26a and 26b in response to the command. The command in this case depends on the type of the external DBs 26a and 26b.

応答結果としては、応答・死活監視に係る結果と、コネクション数上限オーバーに係る結果とがある。応答・死活監視は、外部ＤＢ２６ａ，２６ｂが正常に起動しているか否かを監視するものである。つまり、内部異常検知部１７は、応答結果に、外部ＤＢ２６ａ，２６ｂが正常に起動していない内容が記載されていれば異常と判断する。 Response results include results related to response/alive monitoring and results related to the number of connections exceeding the upper limit. Response/alive monitoring monitors whether the external DBs 26a, 26b are running normally. In other words, the internal anomaly detection unit 17 determines that an anomaly has occurred if the response result indicates that the external DBs 26a, 26b are not running normally.

コネクション数上限オーバーは、外部ＤＢ２６ａ，２６ｂが接続されているコンテナ数が、予め定められた閾値を超えていることを表す。つまり、内部異常検知部１７は、応答結果に、外部ＤＢ２６ａ，２６ｂの接続コンテナ数が閾値を超えていることが記載されていれば異常と判断する。 The "Connection limit exceeded" indicates that the number of containers connected to external DBs 26a and 26b exceeds a predetermined threshold. In other words, the internal anomaly detection unit 17 determines that an anomaly has occurred if the response result indicates that the number of connected containers to external DBs 26a and 26b exceeds the threshold.
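
The two response results of the sixth process (response/alive monitoring and the connection limit check) can be combined into one judgment. This is a minimal Python sketch; because the actual command depends on the type of external DB, the `DbResponse` record, its fields, and the threshold parameter are hypothetical stand-ins for whatever that command returns.

```python
from dataclasses import dataclass


@dataclass
class DbResponse:
    alive: bool        # result of the response/alive check
    connections: int   # number of containers currently connected


def db_is_abnormal(resp: DbResponse, max_connections: int) -> bool:
    """Return True when the external DB should be judged abnormal."""
    if not resp.alive:
        # The DB is not running normally.
        return True
    # Abnormal only when the count exceeds the predetermined threshold.
    return resp.connections > max_connections
```

A count equal to the threshold is still judged normal here, following the wording that the number of connected containers must exceed the threshold.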

このポーリング実試験においては、ポーリング往復時間は外部ＤＢ２６ａ，２６ｂの種類に依存したものとなる。 In this actual polling test, the polling round trip time depends on the type of external DB 26a, 26b.

＜複数クラスタ異常検知１＞ <Anomaly detection for multiple clusters 1>
次に、図９に示す外部異常検知部２３によって、複数のクラスタ１２Ａ，１２Ｂにおいて障害が発生した場合に、その障害に係る異常検知の処理について説明する。但し、クラスタ１２Ａ，１２Ｂ毎の異常検知１は、上述した第１～第６異常検知の何れか１つであるとする。 Next, the process by which the external anomaly detection unit 23 shown in FIG. 9 detects an anomaly when a fault occurs in the plurality of clusters 12A and 12B will be described. It is assumed that the anomaly detection 1 for each of the clusters 12A and 12B is any one of the first to sixth anomaly detections described above.

図９に示すように、外部異常検知部２３は、第１クラスタ１２Ａの内部異常検知部１７Ａと、第２クラスタ１２Ｂの内部異常検知部１７Ｂとに接続されている。外部異常検知部２３は、内部異常検知部１７Ａ，１７Ｂで上記第１～第６異常検知の何れか１つに係る異常が検知された際に、矢印Ｙ３１ａ又はＹ３１ｂで示すように、異常のアプリケーション１５ａ，１５ｂに係るコンテナが配置されたクラスタ１２Ａ又は１２Ｂを異常と検知する。 As shown in FIG. 9, the external anomaly detection unit 23 is connected to the internal anomaly detection unit 17A of the first cluster 12A and the internal anomaly detection unit 17B of the second cluster 12B. When the internal anomaly detection unit 17A or 17B detects an anomaly by any one of the first to sixth anomaly detections, the external anomaly detection unit 23 detects as abnormal the cluster 12A or 12B in which the container of the abnormal application 15a, 15b is placed, as shown by the arrow Y31a or Y31b.

＜複数クラスタの異常検知２＞ <Anomaly detection for multiple clusters 2>
図９に示す外部異常検知部２３は、第１クラスタ１２Ａにおけるクラスタ管理部１４Ａの通信振分部１４ａと、第２クラスタ１２Ｂにおけるクラスタ管理部１４Ｂの通信振分部１４ａとに接続されている。通信振分部１４ａは、クラスタ管理部１４の信号入力部分に配備されており、入力信号を後段へ振り分けて出力すると共に、クラスタ１２Ａ，１２Ｂ毎の確認通信に応じて、クラスタ１２Ａ，１２Ｂの正常時に応答を返信する。 The external anomaly detection unit 23 shown in FIG. 9 is connected to the communication distribution unit 14a of the cluster management unit 14A in the first cluster 12A and to the communication distribution unit 14a of the cluster management unit 14B in the second cluster 12B. The communication distribution unit 14a is provided at the signal input portion of the cluster management unit 14. It distributes input signals to the subsequent stage and, when the cluster 12A or 12B is normal, returns a response to the confirmation communication for that cluster.

外部異常検知部２３は、双方向矢印Ｙ３３ａ，Ｙ３３ｂで示すように、クラスタ１２Ａ，１２Ｂ毎の通信振分部１４ａと一定周期でクラスタ１２Ａ，１２Ｂ毎の確認通信を行い、正常に応答が帰ってくるか否かを検知する。応答が帰ってこない場合に、該当クラスタ１２Ａ，１２Ｂの異常と検知する。As indicated by the bidirectional arrows Y33a and Y33b, the external anomaly detection unit 23 performs confirmation communication for each cluster 12A and 12B at regular intervals with the communication distribution unit 14a for each cluster 12A and 12B, and detects whether a normal response is returned. If no response is returned, it detects an anomaly in the corresponding cluster 12A or 12B.

この複数クラスタの異常検知２では、内部異常検知部１７Ａ，１７Ｂを介さず各クラスタ１２Ａ，１２Ｂの異常検知が可能となる。 In this multiple cluster anomaly detection 2, anomaly detection in each cluster 12A, 12B is possible without going through the internal anomaly detection units 17A, 17B.

＜複数クラスタの異常検知３＞ <Anomaly detection for multiple clusters 3>
複数クラスタの異常検知３は、外部異常検知部２３が、上記異常検知１，２の双方によって、各クラスタ１２Ａ，１２Ｂの異常検知を行う処理である。この処理では、各クラスタ１２Ａ，１２Ｂの異常検知を、より適正に行うことができる。 Anomaly detection for multiple clusters 3 is a process in which the external anomaly detection unit 23 detects anomalies in each of the clusters 12A and 12B using both anomaly detections 1 and 2 described above. This process allows anomalies in each of the clusters 12A and 12B to be detected more appropriately.
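
One way to picture the combination of anomaly detections 1 and 2 is the following Python sketch. The text does not state how the two results are merged, so combining them with a logical OR is an assumption; the function name `detect_abnormal_clusters`, the `confirm` callback (standing in for the periodic confirmation communication with the communication distribution unit 14a), and the cluster labels are likewise hypothetical.

```python
from typing import Callable, Dict, Set


def detect_abnormal_clusters(
    internal_reports: Dict[str, bool],
    confirm: Callable[[str], bool],
) -> Set[str]:
    """Return the clusters judged abnormal by detection 1 or detection 2.

    internal_reports maps a cluster label to True when its internal anomaly
    detection unit reported an anomaly (detection 1).
    confirm(cluster) returns True when the confirmation communication was
    answered (detection 2).
    """
    abnormal: Set[str] = set()
    for cluster, anomaly_reported in internal_reports.items():
        responded = confirm(cluster)
        # Assumption: either signal alone is enough to mark the cluster.
        if anomaly_reported or not responded:
            abnormal.add(cluster)
    return abnormal
```

In a periodic loop, this function would be called once per confirmation cycle, with `confirm` applying whatever timeout the deployment chooses.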

図１０は、本実施形態の仮想化システム復旧装置１０の異常対応処理を説明するためのブロック図である。異常対応処理を行う異常検知は、上記複数クラスタの異常検知１～３の何れか１つである。 Figure 10 is a block diagram for explaining the anomaly response processing of the virtualization system recovery device 10 of this embodiment. The anomaly detection for which the anomaly response processing is performed is any one of the anomaly detections 1 to 3 of the multiple clusters described above.

振分先切替部２１は、復旧装置１０の外部のＤＮＳ（Domain Name System）２５に接続されている。ＤＮＳ２５は、各クラスタ１２Ａ，１２Ｂのアプリケーション１５ａ，１５ｂの名称を示すドメイン（又はドメイン名）と、各クラスタ１２Ａ，１２Ｂの通信振分部１４ａの住所に該当する解決先ＩＰ（Internet Protocol）アドレスとを対応付けて管理するサーバである。このＤＮＳ２５は、ドメインとＩＰアドレスとを相互に変換するものであり、ＤＮＳレコードテーブル２５ａを備えている。The distribution destination switching unit 21 is connected to a DNS (Domain Name System) 25 external to the recovery device 10. The DNS 25 is a server that manages the correspondence between the domains (or domain names) indicating the names of the applications 15a and 15b of each cluster 12A and 12B and the resolved IP (Internet Protocol) addresses corresponding to the addresses of the communication distribution units 14a of each cluster 12A and 12B. This DNS 25 converts between domains and IP addresses and includes a DNS record table 25a.

図１１に示すように、ＤＮＳレコードテーブル（テーブルともいう）２５ａは、ドメイン名と解決先ＩＰアドレスとを対応付けて記憶している。本例では、テーブル２５ａにおいて、クラスタ１２Ａ，１２Ｂ毎のアプリケーション１５ａのドメイン名としての「Ｓｖｃ１．ｎｅｔ」に、クラスタ１２Ａ，１２Ｂ毎の通信振分部１４ａの解決先ＩＰアドレスとしての「第１クラスタ１２ＡのＩＰアドレス」及び「第２クラスタ１２ＢのＩＰアドレス」が対応付けられている。As shown in Figure 11, the DNS record table (also called table) 25a stores a correspondence between a domain name and a resolved IP address. In this example, in table 25a, "Svc1.net" as the domain name of application 15a for each cluster 12A, 12B is associated with "IP address of first cluster 12A" and "IP address of second cluster 12B" as resolved IP addresses of communication distribution unit 14a for each cluster 12A, 12B.

この対応付け関係は、「Ｓｖｃ１．ｎｅｔ」のアプリケーション１５ａが、第１クラスタ１２Ａ又は第２クラスタ１２Ｂで作動することを表している。 This correspondence relationship indicates that application 15a of "Svc1.net" runs in either the first cluster 12A or the second cluster 12B.

更に、テーブル２５ａにおいて、クラスタ１２Ａ，１２Ｂ毎のアプリケーション１５ｂのドメイン名としての「Ｓｖｃ２．ｎｅｔ」に、クラスタ１２Ａ，１２Ｂ毎の通信振分部１４ａの解決先ＩＰアドレスとしての「第１クラスタ１２ＡのＩＰアドレス」及び「第２クラスタ１２ＢのＩＰアドレス」が対応付けられている。Furthermore, in table 25a, "Svc2.net" as the domain name of application 15b for each cluster 12A, 12B is associated with "IP address of first cluster 12A" and "IP address of second cluster 12B" as the resolved IP addresses of communication distribution unit 14a for each cluster 12A, 12B.

この対応付け関係は、「Ｓｖｃ２．ｎｅｔ」のアプリケーション１５ｂが、第１クラスタ１２Ａ又は第２クラスタ１２Ｂで作動することを表している。 This correspondence relationship indicates that application 15b "Svc2.net" runs in either the first cluster 12A or the second cluster 12B.

このようなテーブル２５ａを備えるＤＮＳ２５に、外部サーバ（図示せず）が解決先ＩＰアドレスを問い合わせると、ＤＮＳ２５から、各クラスタ１２Ａ，１２Ｂの双方のＩＰアドレスが返信されてくる。このため、外部サーバは、各クラスタ１２Ａ，１２Ｂの何れにもデータ送信が可能となる。When an external server (not shown) queries DNS 25, which has such a table 25a, for a resolved IP address, DNS 25 returns the IP addresses of both clusters 12A and 12B. This allows the external server to send data to either cluster 12A or 12B.
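
The DNS record table 25a and the query described above can be modeled as a simple mapping. This is a minimal Python sketch; the IP addresses are placeholders from the documentation range 192.0.2.0/24 and the names `DNS_RECORDS` and `resolve` are hypothetical.

```python
from typing import Dict, List

# Hypothetical contents of table 25a: each domain name maps to the
# resolved IP addresses of the first and second clusters.
DNS_RECORDS: Dict[str, List[str]] = {
    "Svc1.net": ["192.0.2.10", "192.0.2.20"],
    "Svc2.net": ["192.0.2.10", "192.0.2.20"],
}


def resolve(records: Dict[str, List[str]], domain: str) -> List[str]:
    """Return every resolved IP address registered for the domain."""
    return list(records.get(domain, []))
```

Because both addresses are returned for a registered domain, the caller is free to send data to either cluster.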

ここで、図１０に示す外部異常検知部２３において、複数クラスタ１２Ａ，１２Ｂの異常検知１～３の何れか１つの異常（例えば第２クラスタ１２Ｂの異常）が検知されたとする。外部異常検知部２３は、その第２クラスタ１２Ｂの異常検知を矢印Ｙ３４で示すように、振分先切替部２１へ通知する。 Here, suppose that the external anomaly detection unit 23 shown in FIG. 10 detects an anomaly (for example, an anomaly in the second cluster 12B) by any one of the anomaly detections 1 to 3 for the multiple clusters 12A and 12B. The external anomaly detection unit 23 notifies the distribution destination switching unit 21 of the detected anomaly in the second cluster 12B, as shown by arrow Y34.

振分先切替部２１は、第２クラスタ１２Ｂへの通信振り分けを中止する指示（通信振分中止指示）を、矢印Ｙ３５で示すようにＤＮＳ２５へ通知する。ＤＮＳ２５は、通信振分中止指示に応じて、図１２に示すテーブル２５ａにおけるドメイン名の「Ｓｖｃ１．ｎｅｔ」及び「Ｓｖｃ２．ｎｅｔ」の双方に対応付けられた解決先ＩＰアドレスにおいて、第２クラスタ１２ＢのＩＰアドレスを消去する処理を行う。The destination switching unit 21 notifies the DNS 25 of an instruction to stop the distribution of communication to the second cluster 12B (communication distribution stop instruction), as shown by arrow Y35. In response to the communication distribution stop instruction, the DNS 25 performs a process of deleting the IP address of the second cluster 12B in the resolved IP addresses associated with both the domain names "Svc1.net" and "Svc2.net" in the table 25a shown in FIG. 12.

＜実施形態の動作＞ <Operation of the embodiment>
次に、異常対応処理の動作を、図１３に示すフローチャートを参照して説明する。 Next, the operation of the anomaly response processing will be described with reference to the flowchart shown in FIG. 13.

図１３に示すステップＳ１において、第２クラスタ１２Ｂのアプリケーション１５ａ，１５ｂに障害（×印）が発生し、この異常が内部異常検知部１７Ｂで検知されたとする。この場合、内部異常検知部１７Ｂで第２クラスタ１２Ｂの異常が矢印Ｙ３１ｂに示すように、外部異常検知部２３へ通知される。 In step S1 shown in FIG. 13, suppose that a fault (marked with an x) occurs in the applications 15a and 15b of the second cluster 12B and that this anomaly is detected by the internal anomaly detection unit 17B. In this case, the internal anomaly detection unit 17B notifies the external anomaly detection unit 23 of the anomaly in the second cluster 12B, as shown by the arrow Y31b.

ステップＳ２において、外部異常検知部２３は、上記通知によって第２クラスタ１２Ｂの異常を検知し、矢印Ｙ３４で示すように、振分先切替部２１へ通知する。In step S2, the external anomaly detection unit 23 detects an abnormality in the second cluster 12B based on the above notification and notifies the allocation destination switching unit 21, as indicated by arrow Y34.

ステップＳ３において、振分先切替部２１は、第２クラスタ１２Ｂへの通信振分中止指示を、矢印Ｙ３５で示すようにＤＮＳ２５へ通知する。 In step S3, the distribution destination switching unit 21 notifies DNS 25 of an instruction to stop distribution of communication to the second cluster 12B, as shown by arrow Y35.

ステップＳ４において、ＤＮＳ２５は、図１２に示すテーブル２５ａにおけるドメイン名の「Ｓｖｃ１．ｎｅｔ」及び「Ｓｖｃ２．ｎｅｔ」の双方に対応付けられた解決先ＩＰアドレスにおいて、第２クラスタ１２ＢのＩＰアドレスを消去する。これによって、テーブル２５ａのドメイン名の「Ｓｖｃ１．ｎｅｔ」及び「Ｓｖｃ２．ｎｅｔ」の双方に対応付けられた解決先ＩＰアドレスは、第１クラスタ１２ＡのＩＰアドレスのみとなる。In step S4, DNS 25 erases the IP address of the second cluster 12B from the resolved IP addresses associated with both the domain names "Svc1.net" and "Svc2.net" in table 25a shown in Figure 12. As a result, the resolved IP address associated with both the domain names "Svc1.net" and "Svc2.net" in table 25a becomes only the IP address of the first cluster 12A.

このため、ステップＳ５において、外部サーバがＤＮＳ２５に解決先ＩＰアドレスを問い合わせた場合に、ＤＮＳ２５からは第１クラスタ１２ＡのＩＰアドレスのみが返信される。言い換えれば、障害が発生した第２クラスタ１２Ｂへはアクセス出来なくなるので、第２クラスタ１２Ｂへの通信が止められることになる。Therefore, in step S5, when an external server queries DNS 25 for a resolved IP address, only the IP address of the first cluster 12A is returned from DNS 25. In other words, since the second cluster 12B where the failure occurred cannot be accessed, communication to the second cluster 12B is stopped.
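
Steps S3 to S5 can be traced with two small classes. This is a minimal Python sketch and not the implementation of the embodiment: the class names `Dns` and `DestinationSwitcher`, their methods, the cluster labels, and the placeholder IP addresses are all hypothetical.

```python
from typing import Dict, List


class Dns:
    """Toy stand-in for DNS 25 and its record table 25a."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self.records = {domain: list(ips) for domain, ips in records.items()}

    def stop_distribution(self, cluster_ip: str) -> None:
        # Step S4: erase the abnormal cluster's IP address from every domain.
        for domain, ips in self.records.items():
            self.records[domain] = [ip for ip in ips if ip != cluster_ip]

    def resolve(self, domain: str) -> List[str]:
        # Step S5: answer a query with the remaining resolved IP addresses.
        return list(self.records.get(domain, []))


class DestinationSwitcher:
    """Toy stand-in for the distribution destination switching unit 21."""

    def __init__(self, dns: Dns, cluster_ips: Dict[str, str]) -> None:
        self.dns = dns
        self.cluster_ips = cluster_ips

    def on_cluster_anomaly(self, cluster: str) -> None:
        # Step S3: send the communication distribution stop instruction.
        self.dns.stop_distribution(self.cluster_ips[cluster])
```

After `on_cluster_anomaly("12B")` is called, a query for "Svc1.net" or "Svc2.net" returns only the address registered for the first cluster, so new traffic no longer reaches the failed cluster.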

＜実施形態の効果＞ <Effects of the embodiment>
本発明の実施形態に係る仮想化システム復旧装置１０の効果について説明する。 The effects of the virtualization system recovery device 10 according to the embodiment of the present invention will be described.

（１ａ）復旧装置１０は、計算資源クラスタ１５と、クラスタ管理部１４と、複数のクラスタ１２Ａ，１２Ｂと、内部異常検知部１７と、外部異常検知部２３とを備える。計算資源クラスタ１５は、物理マシン上にコンテナ仮想化ソフトウェアにより仮想的に作成され、当該仮想的に作成されるコンテナをクラスタ化して配置する。クラスタ管理部１４は、仮想的に作成され、クラスタ化されたコンテナの配置及び動作に係る制御を管理する。 (1a) The recovery device 10 includes a computational resource cluster 15, a cluster management unit 14, a plurality of clusters 12A, 12B, an internal anomaly detection unit 17, and an external anomaly detection unit 23. The computational resource cluster 15 is virtually created on a physical machine by container virtualization software, and the virtually created containers are arranged in a cluster. The cluster management unit 14 manages control related to the arrangement and operation of the virtually created and clustered containers.

各クラスタ１２Ａ，１２Ｂは、計算資源クラスタ１５及びクラスタ管理部１４を有して構成される。内部異常検知部１７は、クラスタ１２Ａ，１２Ｂ毎に配置され、且つ仮想的に作成された計算資源クラスタ１５及びクラスタ管理部１４の外部に仮想的に作成され、コンテナの異常を検知する。外部異常検知部２３は、各クラスタ１２Ａ，１２Ｂの外部に仮想的に作成され、内部異常検知部１７でのコンテナの異常検知時に当該異常のコンテナが配置されたクラスタを異常と検知する構成とした。Each cluster 12A, 12B is configured to have a computational resource cluster 15 and a cluster management unit 14. The internal anomaly detection unit 17 is arranged for each cluster 12A, 12B, and is virtually created outside the virtually created computational resource cluster 15 and cluster management unit 14 to detect anomalies in the container. The external anomaly detection unit 23 is virtually created outside each cluster 12A, 12B, and is configured to detect, when the internal anomaly detection unit 17 detects an anomaly in a container, the cluster in which the abnormal container is located as being abnormal.

この構成によれば、クラスタ１２Ａ，１２Ｂ毎の内部異常検知部１７でコンテナの異常が検知された際に、外部異常検知部２３で、その異常のコンテナが配置されたクラスタを異常と検知するようにした。内部異常検知部１７及び外部異常検知部２３は、クラスタ管理部１４及び計算資源クラスタ１５を仮想的に作成するコンテナ仮想化ソフトウェアに関与しない。このため、各クラスタ１２Ａ，１２Ｂで発生した障害をコンテナ仮想化ソフトウェアが持つ異常検知復旧機能よりも、早く異常検知できる。この早い異常検知によってクラスタの障害に係るコンテナ等を早く復旧できる。 According to this configuration, when the internal anomaly detection unit 17 for each cluster 12A, 12B detects an anomaly in a container, the external anomaly detection unit 23 detects the cluster in which the abnormal container is placed as abnormal. The internal anomaly detection unit 17 and the external anomaly detection unit 23 are not involved in the container virtualization software that virtually creates the cluster management unit 14 and the computational resource cluster 15. As a result, a failure that occurs in each cluster 12A, 12B can be detected more quickly than the anomaly detection and recovery function of the container virtualization software. This early anomaly detection allows containers etc. related to the cluster failure to be quickly recovered.

（２ａ）クラスタ管理部１４は、当該クラスタ管理部１４の信号入力部分に配備され、入力信号を後段へ振り分けて出力すると共に、クラスタの確認通信に応じて当該クラスタの正常時に応答を返信する通信振分部１４ａを備える。外部異常検知部２３は、通信振分部１４ａに所定周期でクラスタの確認通信を行い、応答が返信されない場合にクラスタの異常と検知する構成とした。 (2a) The cluster management unit 14 is provided at a signal input portion of the cluster management unit 14, and includes a communication distribution unit 14a that distributes and outputs an input signal to a subsequent stage and returns a response to a cluster confirmation communication when the cluster is normal. The external anomaly detection unit 23 is configured to perform cluster confirmation communication with the communication distribution unit 14a at a predetermined period and detect an anomaly in the cluster when no response is returned.

この構成によれば、クラスタ１２Ａ，１２Ｂ毎の内部異常検知部１７を介さず各クラスタの異常を検知できる。 With this configuration, abnormalities in each cluster can be detected without going through the internal anomaly detection unit 17 for each cluster 12A, 12B.

（３ａ）外部異常検知部２３は、内部異常検知部１７でのコンテナの異常検知時に当該コンテナが配置されたクラスタを異常と検知する処理と、通信振分部１４ａに所定周期でクラスタの確認通信を行い、応答が返信されない場合にクラスタの異常と検知する処理との双方の処理によって異常を検知する構成とした。 (3a) The external anomaly detection unit 23 is configured to detect anomalies by both a process of detecting an anomaly in the cluster in which the container is placed when the internal anomaly detection unit 17 detects an anomaly in the container, and a process of sending a cluster confirmation communication to the communication distribution unit 14a at a predetermined period and detecting an anomaly in the cluster if no response is returned.

この構成によれば、各クラスタの異常検知を、より適正に行うことができる。 With this configuration, anomaly detection for each cluster can be performed more accurately.

（４ａ）コンテナに係るアプリケーションの名称を示すドメイン名と、クラスタ毎のＩＰアドレスとを対応付けて管理するＤＮＳ２５を、各クラスタ１２Ａ，１２Ｂを構成するサーバの外部に備える。各クラスタ１２Ａ，１２Ｂの外部に仮想的に作成され、外部異常検知部２３で検知された異常のクラスタに係る通信振分中止指示をＤＮＳ２５へ通知する振分先切替部を、クラスタ毎に備える。ＤＮＳ２５は、通信振分中止指示で示される異常のクラスタのＩＰアドレスを消去する構成とした。 (4a) A DNS 25 that associates and manages a domain name indicating the name of an application related to a container with an IP address for each cluster is provided outside the servers that constitute each cluster 12A, 12B. A distribution destination switching unit that is virtually created outside each cluster 12A, 12B and notifies the DNS 25 of a communication distribution stop instruction related to an abnormal cluster detected by the external anomaly detection unit 23 is provided for each cluster. The DNS 25 is configured to erase the IP address of the abnormal cluster indicated by the communication distribution stop instruction.

この構成によれば、ＤＮＳ２５で管理されるクラスタのＩＰアドレスにおいて、外部異常検知部２３で異常検知されたクラスタのＩＰアドレスが消去される。このため、外部サーバがＤＮＳ２５にクラスタのＩＰアドレスを問い合わせた場合に、障害が発生したクラスタのＩＰアドレスにはアクセス出来なくなる。つまり、異常のクラスタへの通信を止めることができる。外部異常検知部２３、振分先切替部及びＤＮＳ２５は、上述したコンテナ仮想化ソフトウェアに関与しない。このため、クラスタで発生した障害をコンテナ仮想化ソフトウェアが持つ異常検知復旧機能よりも、早く異常検知できるので、この異常検知されたクラスタの障害に係るコンテナ等を早く復旧できる。 According to this configuration, the IP address of the cluster in which an abnormality has been detected by the external anomaly detection unit 23 is erased from the IP addresses of the clusters managed by DNS 25. Therefore, when an external server queries DNS 25 for the IP address of the cluster, it becomes impossible to access the IP address of the cluster in which the failure has occurred. In other words, communication to the abnormal cluster can be stopped. The external anomaly detection unit 23, the allocation destination switching unit, and DNS 25 are not involved in the container virtualization software described above. Therefore, a failure that has occurred in a cluster can be detected earlier than the anomaly detection and recovery function of the container virtualization software, and therefore containers etc. related to the failure of the cluster in which the abnormality has been detected can be quickly recovered.

＜実施形態の変形例１＞ <Modification 1 of the embodiment>
図１４は、本発明の実施形態の変形例１に係る仮想化システム復旧装置１０Ａの構成を示すブロック図である。 FIG. 14 is a block diagram showing the configuration of a virtualization system recovery device 10A according to Modification 1 of the embodiment of the present invention.

図１４に示す変形例１の復旧装置１０Ａが、復旧装置１０（図１０）と異なる点は、振分先切替部２１からの矢印Ｙ３５で示す通信振分中止指示を、ＤＮＳ２５の他に、各クラスタ１２Ａ，１２Ｂの通信振分部１４ａへも通知するようにしたことにある。The recovery device 10A of variant example 1 shown in Figure 14 differs from the recovery device 10 (Figure 10) in that the communication distribution stop instruction indicated by arrow Y35 from the distribution destination switching unit 21 is notified not only to DNS 25 but also to the communication distribution units 14a of each cluster 12A, 12B.

通信振分部１４ａは、その通知された通信振分中止指示で示される第１クラスタ１２Ａ又は第２クラスタ１２Ｂの通信を停止する。つまり、各クラスタ１２Ａ，１２Ｂへの通信は、必ず入力側の通信振分部１４ａを経由して行われるので、その通信振分部１４ａの通信機能を通信振分中止指示に応じて停止するようにした。The communication distribution unit 14a stops communication with the first cluster 12A or the second cluster 12B indicated by the notified communication distribution stop instruction. In other words, since communication to each cluster 12A, 12B is always performed via the communication distribution unit 14a on the input side, the communication function of the communication distribution unit 14a is stopped in response to the communication distribution stop instruction.

この構成によれば、異常のクラスタ（例えば第２クラスタ１２Ｂ）に係る通信振分中止指示を、異常クラスタ１２Ｂの通信振分部１４ａへ通知して、通信振分部１４ａの通信機能を停止できる。この停止によって、異常クラスタ１２Ｂへはアクセス出来なくなる。このため、外部サーバのＤＮＳ２５への問い合わせを省略できる。 According to this configuration, the communication distribution stop instruction for the abnormal cluster (e.g., the second cluster 12B) can be notified to the communication distribution unit 14a of the abnormal cluster 12B to stop the communication function of that communication distribution unit 14a. Once it is stopped, the abnormal cluster 12B can no longer be accessed. As a result, the external server's query to the DNS 25 can be omitted.
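
The behavior of the communication distribution unit 14a in Modification 1 can be sketched as a gate that closes when the stop instruction names its own cluster. This is a minimal Python sketch; the class `CommunicationDistributor`, its methods, and the string returned by `forward` are hypothetical and only illustrate the stop-on-instruction idea.

```python
from typing import Optional


class CommunicationDistributor:
    """Toy stand-in for the communication distribution unit 14a of one cluster."""

    def __init__(self, cluster: str) -> None:
        self.cluster = cluster
        self.stopped = False

    def on_stop_instruction(self, target_cluster: str) -> None:
        # Stop only when the instruction names this unit's own cluster.
        if target_cluster == self.cluster:
            self.stopped = True

    def forward(self, message: str) -> Optional[str]:
        # All traffic into the cluster passes through here, so a stopped
        # unit blocks access to the whole cluster.
        if self.stopped:
            return None
        return f"{self.cluster}:{message}"
```

An instruction that names a different cluster leaves the unit untouched, which matches the same instruction being notified to the units of every cluster.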

＜実施形態の変形例２＞ <Modification 2 of the embodiment>
図１５は、本実施形態の変形例２に係る仮想化システム復旧装置１０Ｂの構成を示すブロック図である。 FIG. 15 is a block diagram showing the configuration of a virtualization system recovery device 10B according to Modification 2 of this embodiment.

図１５に示す変形例２の復旧装置１０Ｂが、復旧装置１０（図１０）と異なる点は、第１クラスタ１２Ａに内部ＤＮＳ２５Ａを備えると共に、第２クラスタ１２Ｂに内部ＤＮＳ２５Ｂを備え、振分先切替部２１からの矢印Ｙ３５で示す通信振分中止指示を、ＤＮＳ２５の他に、各内部ＤＮＳ２５Ａ，２５Ｂへも通知するようにしたことにある。The recovery device 10B of variant example 2 shown in Figure 15 differs from the recovery device 10 (Figure 10) in that it has an internal DNS 25A in the first cluster 12A and an internal DNS 25B in the second cluster 12B, and the communication distribution stop instruction indicated by arrow Y35 from the distribution destination switching unit 21 is notified not only to DNS 25 but also to each internal DNS 25A, 25B.

内部ＤＮＳ２５Ａ，２５Ｂは、ＤＮＳ２５と同様にＤＮＳレコードテーブル２５ａを備えるが、異なる点は、テーブル２５ａをキャッシュメモリに備えている。従って、内部ＤＮＳ２５Ａ，２５Ｂでは、テーブル２５ａの情報が所定時間で消去される。しかし、内部ＤＮＳ２５Ａ，２５Ｂは、その消去後に、必要に応じてＤＮＳ２５から必要情報を取得可能となっている。 Like the DNS 25, the internal DNSs 25A and 25B have a DNS record table 25a; the difference is that they hold the table 25a in cache memory. Consequently, the information in the table 25a of the internal DNSs 25A and 25B is erased after a predetermined time. After that erasure, however, the internal DNSs 25A and 25B can obtain the necessary information from the DNS 25 as needed.
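
The cache behavior of the internal DNSs 25A and 25B can be sketched as a time-limited cache in front of the external DNS. This is a minimal Python sketch under stated assumptions: the class `InternalDnsCache`, the `upstream` callback standing in for a query to DNS 25, the `ttl_seconds` parameter, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class InternalDnsCache:
    """Caches answers from an upstream resolver for a limited time."""

    def __init__(
        self,
        upstream: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, domain: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(domain)
        if entry is not None and now - entry[0] < self._ttl:
            # The cached record is still within the predetermined time.
            return list(entry[1])
        # The record was never cached or has expired: ask the upstream DNS.
        ips = list(self._upstream(domain))
        self._entries[domain] = (now, ips)
        return list(ips)
```

Passing the clock as a parameter is a design choice that makes expiry easy to exercise without waiting in real time.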

内部異常検知部１７Ａ（又は内部異常検知部１７Ｂ）は、第２クラスタ１２Ｂの異常が検知された際の通信振分中止指示（矢印Ｙ３５参照）に応じて、図１２に示すテーブル２５ａの第２クラスタ１２ＢのＩＰアドレスを消去する処理を行う。The internal anomaly detection unit 17A (or the internal anomaly detection unit 17B) performs a process of erasing the IP address of the second cluster 12B from the table 25a shown in FIG. 12 in response to a communication distribution stop instruction (see arrow Y35) when an abnormality is detected in the second cluster 12B.

この構成によれば、外部サーバが、クラスタ毎の内部ＤＮＳ２５Ａ，２５Ｂに各クラスタ１２Ａ，１２ＢのＩＰアドレスの問い合わせを行うことができるので、外部のＤＮＳ２５への負荷を減少できる。 With this configuration, an external server can query the internal DNS 25A, 25B for each cluster for the IP addresses of each cluster 12A, 12B, thereby reducing the load on the external DNS 25.

＜ハードウェア構成＞ <Hardware configuration>
上述した実施形態に係る仮想化システム復旧装置１０，１０Ａ，１０Ｂの何れか１つは、例えば図１６に示すような構成のコンピュータ１００によって実現される。コンピュータ１００は、ＣＰＵ（Central Processing Unit）１０１、ＲＯＭ（Read Only Memory）１０２、ＲＡＭ（Random Access Memory）１０３、ＨＤＤ（Hard Disk Drive）１０４、入出力Ｉ／Ｆ（Interface）１０５、通信Ｉ／Ｆ１０６、及びメディアＩ／Ｆ１０７を有する。 Any one of the virtualization system recovery devices 10, 10A, 10B according to the above-described embodiments is realized by, for example, a computer 100 configured as shown in FIG. 16. The computer 100 has a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an HDD (Hard Disk Drive) 104, an input/output I/F (Interface) 105, a communication I/F 106, and a media I/F 107.

ＣＰＵ１０１は、ＲＯＭ１０２又はＨＤＤ１０４に記憶されたプログラムに基づき作動し、各機能部の制御を行う。ＲＯＭ１０２は、コンピュータ１００の起動時にＣＰＵ１０１により実行されるブートプログラムや、コンピュータ１００のハードウェアに係るプログラム等を記憶する。The CPU 101 operates based on a program stored in the ROM 102 or the HDD 104, and controls each functional unit. The ROM 102 stores a boot program executed by the CPU 101 when the computer 100 is started, programs related to the hardware of the computer 100, and the like.

ＣＰＵ１０１は、入出力Ｉ／Ｆ１０５を介して、プリンタやディスプレイ等の出力装置１１１及び、マウスやキーボード等の入力装置１１０を制御する。ＣＰＵ１０１は、入出力Ｉ／Ｆ１０５を介して、入力装置１１０からデータを取得し、又は、生成したデータを出力装置１１１へ出力する。The CPU 101 controls an output device 111 such as a printer or a display, and an input device 110 such as a mouse or a keyboard, via the input/output I/F 105. The CPU 101 acquires data from the input device 110, or outputs generated data to the output device 111, via the input/output I/F 105.

ＨＤＤ１０４は、ＣＰＵ１０１により実行されるプログラム及び当該プログラムによって使用されるデータ等を記憶する。通信Ｉ／Ｆ１０６は、通信網１１２を介して図示せぬ他の装置からデータを受信してＣＰＵ１０１へ出力し、また、ＣＰＵ１０１が生成したデータを、通信網１１２を介して他の装置へ送信する。The HDD 104 stores programs executed by the CPU 101 and data used by the programs. The communication I/F 106 receives data from other devices (not shown) via the communication network 112 and outputs the data to the CPU 101, and also transmits data generated by the CPU 101 to other devices via the communication network 112.

メディアＩ／Ｆ１０７は、記録媒体１１３に格納されたプログラム又はデータを読み取り、ＲＡＭ１０３を介してＣＰＵ１０１へ出力する。ＣＰＵ１０１は、目的の処理に係るプログラムを、メディアＩ／Ｆ１０７を介して記録媒体１１３からＲＡＭ１０３上にロードし、ロードしたプログラムを実行する。記録媒体１１３は、ＤＶＤ（Digital Versatile Disc）、ＰＤ（Phase change rewritable Disk）等の光学記録媒体、ＭＯ（Magneto Optical disk）等の光磁気記録媒体、磁気記録媒体、導体メモリテープ媒体又は半導体メモリ等である。The media I/F 107 reads the program or data stored in the recording medium 113 and outputs it to the CPU 101 via the RAM 103. The CPU 101 loads the program related to the target processing from the recording medium 113 onto the RAM 103 via the media I/F 107, and executes the loaded program. The recording medium 113 is an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disc), a magneto-optical recording medium such as an MO (Magneto Optical disc), a magnetic recording medium, a conductive memory tape medium, or a semiconductor memory, etc.

例えば、コンピュータ１００が実施形態に係る仮想化システム復旧装置１０，１０Ａ，１０Ｂの何れか１つとして機能する場合、コンピュータ１００のＣＰＵ１０１は、ＲＡＭ１０３上にロードされたプログラムを実行することにより、仮想化システム復旧装置１０の機能を実現する。また、ＨＤＤ１０４には、ＲＡＭ１０３内のデータが記憶される。ＣＰＵ１０１は、目的の処理に係るプログラムを記録媒体１１３から読み取って実行する。この他、ＣＰＵ１０１は、他の装置から通信網１１２を介して目的の処理に係るプログラムを読み込んでもよい。 For example, when the computer 100 functions as any one of the virtualization system recovery devices 10, 10A, and 10B according to the embodiment, the CPU 101 of the computer 100 executes a program loaded onto the RAM 103 to realize the functions of the virtualization system recovery device 10. The HDD 104 stores the data in the RAM 103. The CPU 101 reads a program for the target processing from the recording medium 113 and executes it. Alternatively, the CPU 101 may read a program for the target processing from another device via the communication network 112.

＜効果＞ <Effects>
（１）物理マシン上にコンテナ仮想化ソフトウェアにより仮想的に作成され、当該仮想的に作成されるコンテナをクラスタ化して配置する計算資源クラスタと、前記仮想的に作成され、前記クラスタ化されたコンテナの配置及び動作に係る制御を管理するクラスタ管理部と、各々が、前記計算資源クラスタ及び前記クラスタ管理部を有して構成される複数のクラスタと、前記複数のクラスタ毎に配置され、且つ前記仮想的に作成された計算資源クラスタ及びクラスタ管理部の外部に前記仮想的に作成され、前記コンテナの異常を検知する内部異常検知部と、前記複数のクラスタの外部に前記仮想的に作成され、前記内部異常検知部でのコンテナの異常検知時に当該異常のコンテナが配置されたクラスタを異常と検知する外部異常検知部とを備えることを特徴とする仮想化システム復旧装置である。 (1) A virtualization system recovery device comprising: a computing resource cluster that is virtually created on a physical machine by container virtualization software and that clusters and arranges the virtually created containers; a cluster management unit that is virtually created and manages control related to the arrangement and operation of the clustered containers; a plurality of clusters, each of which is configured to have the computing resource cluster and the cluster management unit; an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computing resource cluster and cluster management unit, and detects an abnormality in the container; and an external anomaly detection unit that is virtually created outside the plurality of clusters and that, when an abnormality in a container is detected by the internal anomaly detection unit, detects as abnormal the cluster in which the abnormal container is arranged.

この構成によれば、複数のクラスタ毎の内部異常検知部でコンテナの異常が検知された際に、外部異常検知部で、その異常のコンテナが配置されたクラスタを異常と検知するようにした。内部異常検知部及び外部異常検知部は、クラスタ管理部及び計算資源クラスタを仮想的に作成するコンテナ仮想化ソフトウェアに関与しない。このため、複数のクラスタで発生した障害をコンテナ仮想化ソフトウェアが持つ異常検知復旧機能よりも、早く異常検知できる。この早い異常検知によってクラスタの障害に係るコンテナ等を早く復旧できる。 According to this configuration, when the internal anomaly detection unit of any of the multiple clusters detects an anomaly in a container, the external anomaly detection unit detects the cluster in which the abnormal container is placed as abnormal. The internal anomaly detection unit and the external anomaly detection unit are not involved in the container virtualization software that virtually creates the cluster management unit and the computing resource cluster. As a result, a failure that occurs in the multiple clusters can be detected earlier than by the anomaly detection and recovery function of the container virtualization software. This early detection allows the containers and other components affected by the cluster failure to be recovered quickly.

(2) The virtualization system recovery device according to (1) above, wherein the cluster management unit includes a communication distribution unit that is arranged at a signal input portion of the cluster management unit, distributes and outputs input signals to subsequent stages, and returns a response to a cluster confirmation communication when the cluster is normal, and wherein the external anomaly detection unit performs the cluster confirmation communication with the communication distribution unit at a predetermined period and detects the cluster as abnormal when the response is not returned.

According to this configuration, an anomaly in each cluster can be detected without going through the internal anomaly detection unit of each of the plurality of clusters.
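As a non-limiting illustration of configuration (2), the periodic cluster confirmation communication might be sketched as follows. The function names, the probe callables, and the use of `OSError` to represent a missing response are assumptions made for illustration; the specification does not prescribe a protocol, a period, or a timeout.

```python
import time
from typing import Callable, Dict, List


def poll_clusters(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run one round of confirmation communication and return abnormal cluster ids.

    Each probe stands for one confirmation communication to a cluster's
    communication distribution unit. It returns True when a response came
    back; returning False or raising OSError is treated as "no response".
    """
    abnormal = []
    for cluster_id, probe in probes.items():
        try:
            responded = probe()
        except OSError:  # covers timeouts and refused connections
            responded = False
        if not responded:
            abnormal.append(cluster_id)
    return abnormal


def monitor(probes, period_s: float, rounds: int, on_abnormal) -> None:
    """Repeat the confirmation communication at a fixed period (hypothetical loop)."""
    for _ in range(rounds):
        for cluster_id in poll_clusters(probes):
            on_abnormal(cluster_id)
        time.sleep(period_s)


def healthy_probe() -> bool:
    return True


def silent_probe() -> bool:
    raise TimeoutError("no response from communication distribution unit")


probes = {"12A": healthy_probe, "12B": silent_probe}
detected = []
monitor(probes, period_s=0.0, rounds=2, on_abnormal=detected.append)
print(detected)  # ['12B', '12B']
```

Passing the probes in as callables keeps the sketch free of any real network access; an actual deployment would replace them with whatever confirmation communication the communication distribution unit answers.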

(3) The virtualization system recovery device according to (2) above, wherein the external anomaly detection unit detects an anomaly by both of the following processes: a process of detecting, when the internal anomaly detection unit detects an anomaly in a container, the cluster in which that container is arranged as abnormal; and a process of performing the cluster confirmation communication with the communication distribution unit at a predetermined period and detecting the cluster as abnormal when the response is not returned.

(4) The virtualization system recovery device according to any one of (1) to (3) above, wherein a DNS (Domain Name System) that manages a domain name indicating the name of an application related to the container in association with an IP (Internet Protocol) address of each cluster is provided outside the servers constituting the clusters; a distribution destination switching unit that is virtually created outside the plurality of clusters and that notifies the DNS of a communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit is provided for each cluster; and the DNS erases the IP address of the abnormal cluster indicated by the communication distribution stop instruction.

According to this configuration, among the cluster IP addresses managed by the DNS, the IP address of the cluster detected as abnormal by the external anomaly detection unit is erased. Therefore, when an external server queries the DNS for a cluster IP address, it can no longer reach the IP address of the failed cluster. In other words, communication to the abnormal cluster can be stopped. The external anomaly detection unit, the distribution destination switching unit, and the DNS are not involved with the container virtualization software described above. A failure occurring in a cluster can therefore be detected earlier than by the anomaly detection and recovery function of the container virtualization software, and the containers and the like affected by the failure of that cluster can be recovered earlier.
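As a non-limiting illustration of configuration (4), the erasure of an abnormal cluster's IP address from the DNS might be sketched with an in-memory table. The class, its methods, the domain name, and the IP addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; the specification states only that the DNS associates a domain name with an IP address for each cluster and erases the address of the abnormal cluster.

```python
from typing import Dict, List


class DnsRecordTable:
    """Hypothetical in-memory model of the DNS record table 25a."""

    def __init__(self, records: Dict[str, Dict[str, str]]):
        # Layout assumed for illustration: domain name -> {cluster id -> IP address}.
        self._records = {domain: dict(ips) for domain, ips in records.items()}

    def stop_distribution(self, cluster_id: str) -> None:
        """Handle a communication distribution stop instruction for one cluster."""
        for ips in self._records.values():
            ips.pop(cluster_id, None)  # erase the abnormal cluster's address, if present

    def resolve(self, domain: str) -> List[str]:
        """Return the IP addresses still registered for the domain."""
        return sorted(self._records.get(domain, {}).values())


dns = DnsRecordTable({
    "app.example.com": {"12A": "192.0.2.10", "12B": "192.0.2.20"},
})
before = dns.resolve("app.example.com")
dns.stop_distribution("12A")  # cluster 12A was detected as abnormal
after = dns.resolve("app.example.com")
print(before, after)  # ['192.0.2.10', '192.0.2.20'] ['192.0.2.20']
```

After the stop instruction, a query for the application's domain name yields only the address of the remaining cluster, which is the behaviour the configuration relies on to steer external servers away from the failed cluster.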

(5) The virtualization system recovery device according to (4) above, wherein the distribution destination switching unit notifies the communication distribution units in the plurality of clusters of the communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit, and the communication distribution unit stops its communication function when notified of the communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit.

According to this configuration, the communication distribution stop instruction for the abnormal cluster is notified to the communication distribution unit of the abnormal cluster, so that the communication function of that communication distribution unit can be stopped. Once it is stopped, the abnormal cluster can no longer be accessed. The external server's query to the DNS can therefore be omitted.
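As a non-limiting illustration of configuration (5), a communication distribution unit that stops its communication function on receiving the stop instruction might be sketched as follows. The class, its methods, the round-robin distribution, and the rule that a unit stops only when the instruction names its own cluster are assumptions made for illustration and are not stated in the specification.

```python
from typing import List


class CommunicationDistributionUnit:
    """Hypothetical model of a communication distribution unit 14a."""

    def __init__(self, cluster_id: str, backends: List[str]):
        self.cluster_id = cluster_id
        self.backends = list(backends)
        self.active = True
        self._next = 0

    def handle_stop_instruction(self, abnormal_cluster_id: str) -> None:
        # Assumption: every unit receives the instruction, and only the unit
        # belonging to the abnormal cluster stops its communication function.
        if abnormal_cluster_id == self.cluster_id:
            self.active = False

    def respond_to_confirmation(self) -> bool:
        """Answer a cluster confirmation communication; a stopped unit does not respond."""
        return self.active

    def route(self, request: str) -> str:
        """Distribute a request to a subsequent stage (round robin, for illustration)."""
        if not self.active:
            raise ConnectionRefusedError(f"cluster {self.cluster_id} is stopped")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend


units = [
    CommunicationDistributionUnit("12A", ["app-a1"]),
    CommunicationDistributionUnit("12B", ["app-b1", "app-b2"]),
]
for unit in units:
    unit.handle_stop_instruction("12A")  # cluster 12A was detected as abnormal

print([unit.active for unit in units])  # [False, True]
```

Because the stopped unit refuses requests directly, a client is turned away from the abnormal cluster even if it still holds that cluster's IP address.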

(6) The virtualization system recovery device according to (4) above, wherein each of the plurality of clusters includes an internal DNS that, like the DNS, manages a domain name indicating the name of an application related to the container in association with an IP address of each cluster, and the communication distribution stop instruction from the distribution destination switching unit is notified to the internal DNS.

According to this configuration, an external server can query the internal DNS of each cluster for the cluster IP address, which reduces the load on the external DNS.

In addition, the specific configuration may be modified as appropriate without departing from the spirit of the present invention.

10, 10A, 10B Virtualization system recovery device
12A First cluster (cluster)
12B Second cluster (cluster)
14A, 14B Cluster management unit
14a Communication distribution unit
14b Computing resource operation unit
14c Computing resource management unit
14d Container configuration reception unit
14e Container placement destination determination unit
14f Container management unit
15A, 15B Computing resource cluster
15a, 15b Application
17A, 17B Internal anomaly detection unit
21 Distribution destination switching unit
23 External anomaly detection unit
25 DNS
25a DNS record table
25A, 25B Internal DNS

Claims

1. A virtualization system recovery device comprising:
a computing resource cluster that is virtually created on a physical machine by container virtualization software and that clusters and arranges the virtually created containers;
a cluster management unit that manages control related to the arrangement and operation of the virtually created and clustered containers;
a plurality of clusters, each of which includes the computing resource cluster and the cluster management unit;
an internal anomaly detection unit that is arranged for each of the plurality of clusters, is virtually created outside the virtually created computing resource cluster and cluster management unit, and detects an anomaly in a container;
an external anomaly detection unit that is virtually created outside the plurality of clusters and, when the internal anomaly detection unit detects an anomaly in a container, detects the cluster in which the abnormal container is arranged as abnormal;
a distribution destination switching unit that is virtually created outside the plurality of clusters and notifies a DNS (Domain Name System) of a communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit; and
the DNS, which manages a domain name indicating a name of an application related to the container in association with an IP (Internet Protocol) address of each cluster and is provided outside a server constituting the cluster,
wherein the DNS deletes the IP address of the abnormal cluster indicated by the communication distribution stop instruction.

2. The virtualization system recovery device according to claim 1, wherein
the cluster management unit includes a communication distribution unit that is arranged at a signal input portion of the cluster management unit, distributes and outputs input signals to subsequent stages, and returns a response to a cluster confirmation communication when the cluster is normal, and
the external anomaly detection unit performs the cluster confirmation communication with the communication distribution unit at a predetermined period and detects the cluster as abnormal when the response is not returned.

3. The virtualization system recovery device according to claim 2, wherein the external anomaly detection unit detects an anomaly by both a process of detecting, when the internal anomaly detection unit detects an anomaly in a container, the cluster in which that container is arranged as abnormal, and a process of performing the cluster confirmation communication with the communication distribution unit at a predetermined period and detecting the cluster as abnormal when the response is not returned.

4. The virtualization system recovery device according to claim 1, wherein
the distribution destination switching unit notifies communication distribution units in the plurality of clusters of the communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit, and
the communication distribution unit stops its communication function when notified of the communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit.

5. The virtualization system recovery device according to claim 1, further comprising, for each of the plurality of clusters, an internal DNS that, like the DNS, manages a domain name indicating a name of an application related to the container in association with an IP address of each cluster,
wherein the communication distribution stop instruction from the distribution destination switching unit is notified to the internal DNS.

6. A virtualization system recovery method performed by a virtualization system recovery device,
the virtualization system recovery device comprising:
a plurality of clusters in which containers virtually created on a physical machine by container virtualization software are clustered and arranged;
a virtual internal anomaly detection unit and a virtual external anomaly detection unit provided outside the plurality of clusters; and
a DNS that manages a domain name indicating a name of an application related to the container in association with an IP address of each cluster and is provided outside a server constituting the cluster,
the method comprising, by the virtualization system recovery device:
a step in which the internal anomaly detection unit detects an anomaly in a container for each of the plurality of clusters;
a step in which the external anomaly detection unit, when the internal anomaly detection unit detects an anomaly in a container, detects the cluster in which the abnormal container is arranged as abnormal;
a step of notifying the DNS of a communication distribution stop instruction for the abnormal cluster detected by the external anomaly detection unit; and
a step in which the DNS deletes the IP address of the abnormal cluster indicated by the communication distribution stop instruction.