JP7728899B2 - Synchronously replicating datasets and other managed objects to a cloud-based storage system

Info

Publication number: JP7728899B2
Application number: JP2024001666A
Authority: JP
Inventors: ボーツ，パー; コルグローブ，ジョン; ドリスコル，アラン; グランウォルド，デイヴィッド; ホジソン，スティーヴン; カー，ロナルド
Original assignee: ピュアストレージ，インコーポレイテッド
Priority date: 2017-03-10
Filing date: 2024-01-10
Publication date: 2025-08-25
Anticipated expiration: 2038-01-31
Also published as: JP2020514902A; JP7419439B2; JP2025172070A; US11086555B1; EP3961365C0; JP2024041875A; AU2018230871A1; EP4571481A1; CA3054040A1; EP3414653A1; EP4733925A1; CN110392876A; JP2022122993A; US20180260125A1; AU2022268336A1; EP3961365B1; EP4571481B1; AU2025200352A1; WO2018164782A1; JP7086093B2

Description

[0001] A diagram illustrating a first example system for data storage according to some embodiments.
[0002] A diagram illustrating a second example system for data storage according to some embodiments.
[0003] A diagram illustrating a third example system for data storage according to some embodiments.
[0004] A diagram illustrating a fourth example system for data storage according to some embodiments.
[0005] A perspective view of a storage cluster having multiple storage nodes and internal storage coupled to each storage node to provide network-attached storage, according to some embodiments.
[0006] A block diagram illustrating an interconnect switch coupling multiple storage nodes, according to some embodiments.
[0007] A multi-level block diagram illustrating the contents of a storage node and the contents of one of the non-volatile solid-state storage units, according to some embodiments.
[0008] A diagram illustrating a storage server environment that uses embodiments of the storage nodes and storage units of some of the preceding figures, according to some embodiments.
[0009] A blade hardware block diagram illustrating a control plane, a compute plane, and a storage plane, as well as authorities interacting with underlying physical resources, according to some embodiments.
[0010] A diagram illustrating an elasticity software layer in blades of a storage cluster, according to some embodiments.
[0011] A diagram illustrating authorities and storage resources in blades of a storage cluster, according to some embodiments.
[0012] A diagram of a storage system coupled for data communication with a cloud service provider, according to some embodiments of the present disclosure.
[0013] A diagram of a storage system according to some embodiments of the present disclosure.
[0014] A block diagram illustrating multiple storage systems that support a pod, according to some embodiments of the present disclosure.
[0015] A block diagram illustrating multiple storage systems that support a pod, according to some embodiments of the present disclosure.
[0016] A block diagram illustrating multiple storage systems that support a pod, according to some embodiments of the present disclosure.
[0017] A flowchart illustrating an example method for establishing a synchronous replication relationship between two or more storage systems, according to some embodiments of the present disclosure.
[0018] A flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems, according to some embodiments of the present disclosure.
[0019] A flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems, according to some embodiments of the present disclosure.
[0020] A flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems, according to some embodiments of the present disclosure.
[0021] A flowchart illustrating an example method for servicing I/O operations directed to a dataset that is synchronized across multiple storage systems, according to some embodiments of the present disclosure.
[0022] A flowchart illustrating an additional example method for servicing I/O operations directed to a dataset that is synchronized across multiple storage systems, according to some embodiments of the present disclosure.
[0023] A flowchart illustrating an additional example method for servicing I/O operations directed to a dataset that is synchronized across multiple storage systems, according to some embodiments of the present disclosure.
[0024] A flowchart illustrating an additional example method for servicing I/O operations directed to a dataset that is synchronized across multiple storage systems, according to some embodiments of the present disclosure.
[0025] A flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0026] A flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0027] A flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0028] A flowchart illustrating an example method for recovery for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0029] A flowchart illustrating an example method for recovery for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0030] A flowchart illustrating an example method for recovery for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0031] A flowchart illustrating an example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0032] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0033] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0034] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0035] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0036] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0037] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0038] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0039] A flowchart illustrating an additional example method for resynchronization for storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0040] A flowchart illustrating an example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0041] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0042] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0043] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0044] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0045] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0046] A flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems, according to some embodiments of the present disclosure.
[0047] A flowchart illustrating an example method for automatic storage system configuration for mediation services, according to some embodiments of the present disclosure.
[0048] A flowchart illustrating an example method for automatic storage system configuration for mediation services, according to some embodiments of the present disclosure.
[0049] A flowchart illustrating an example method for automatic storage system configuration for mediation services, according to some embodiments of the present disclosure.
[0050] A flowchart illustrating an example method for automatic storage system configuration for mediation services, according to some embodiments of the present disclosure.
[0051] A diagram of a metadata representation that may be implemented as a structured collection of metadata objects that together may represent a logical volume of storage data, or a portion of a logical volume, according to some embodiments of the present disclosure.
[0052] A flowchart illustrating an example method for synchronizing metadata among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0053] A flowchart illustrating an example method for synchronizing metadata among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0054] A flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0055] A flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0056] A flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0057] A flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0058] A flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.
[0059] A flowchart illustrating an example method for synchronizing metadata among storage systems that synchronously replicate a dataset, according to some embodiments of the present disclosure.

[0060] Example methods, apparatus, and products for synchronously replicating datasets and other managed objects to a cloud-based storage system in accordance with embodiments of the present disclosure are described with reference to the accompanying drawings, beginning with FIG. 1A. FIG. 1A illustrates an example system for data storage according to some embodiments. System 100 (also referred to herein as a "storage system") includes numerous elements for purposes of illustration rather than limitation. Note that system 100 may include the same, more, or fewer elements, configured in the same or a different manner, in other embodiments.

[0061] System 100 includes a number of computing devices 164A-164B. Computing devices (also referred to herein as "client devices") may be embodied as, for example, servers in a data center, workstations, personal computers, notebooks, and the like. Computing devices 164A-164B may be coupled for data communication to one or more storage arrays 102A-102B through a storage area network ("SAN") 158 or a local area network ("LAN") 160.

[0062] SAN 158 may be implemented with a variety of data communication fabrics, devices, and protocols. For example, the fabrics for SAN 158 may include Fibre Channel, Ethernet, InfiniBand, Serial Attached Small Computer System Interface ("SAS"), and the like. Data communication protocols for use with SAN 158 may include Advanced Technology Attachment ("ATA"), Fibre Channel Protocol, Small Computer System Interface ("SCSI"), Internet Small Computer System Interface ("iSCSI"), HyperSCSI, Non-Volatile Memory Express ("NVMe") over Fabrics, and the like. Note that SAN 158 is provided for illustration rather than limitation. Other data communication couplings may be implemented between computing devices 164A-164B and storage arrays 102A-102B.

[0063] LAN 160 may also be implemented with a variety of fabrics, devices, and protocols. For example, the fabrics for LAN 160 may include Ethernet (802.3), wireless (802.11), and the like. Data communication protocols for use in LAN 160 may include Transmission Control Protocol ("TCP"), User Datagram Protocol ("UDP"), Internet Protocol ("IP"), Hypertext Transfer Protocol ("HTTP"), Wireless Access Protocol ("WAP"), Handheld Device Transfer Protocol ("HDTP"), Session Initiation Protocol ("SIP"), Real Time Protocol ("RTP"), and the like.

[0064] Storage arrays 102A-102B may provide persistent data storage for computing devices 164A-164B. In embodiments, storage array 102A may be contained in a chassis (not shown), and storage array 102B may be contained in another chassis (not shown). Storage arrays 102A and 102B may include one or more storage array controllers 110 (also referred to herein as "controllers"). A storage array controller 110 may be embodied as a module of automated computing machinery comprising computer hardware, computer software, or a combination of the two. In some embodiments, storage array controllers 110 may be configured to carry out various storage tasks. Storage tasks may include writing data received from computing devices 164A-164B to storage arrays 102A-102B, erasing data from storage arrays 102A-102B, retrieving data from storage arrays 102A-102B and providing it to computing devices 164A-164B, monitoring and reporting disk utilization and performance, performing redundancy operations such as RAID or RAID-like data redundancy operations, compressing data, encrypting data, and so forth.

[0065] Storage array controller 110 may be implemented in a variety of ways, including as a field programmable gate array ("FPGA"), a programmable logic chip ("PLC"), an application specific integrated circuit ("ASIC"), a system on a chip ("SOC"), or any computing device that includes discrete components such as a processing device, a central processing unit, computer memory, or various adapters. Storage array controller 110 may include, for example, a data communications adapter configured to support communications via SAN 158 or LAN 160. In some embodiments, storage array controller 110 may be independently coupled to LAN 160. In embodiments, storage array controller 110 may include an I/O controller or the like that couples storage array controller 110, for data communications, through a midplane (not shown) to persistent storage resources 170A-170B (also referred to herein as "storage resources"). Persistent storage resources 170A-170B may include any number of storage drives 171A-171F (also referred to herein as "storage devices") and any number of non-volatile random access memory ("NVRAM") devices (not shown).

[0066] In some embodiments, the NVRAM devices of persistent storage resources 170A-170B may be configured to receive, from storage array controller 110, data to be stored on storage drives 171A-171F. In some examples, the data may originate from computing devices 164A-164B. In some examples, writing data to an NVRAM device may be carried out more quickly than writing data directly to storage drives 171A-171F. In embodiments, storage array controller 110 may be configured to use the NVRAM devices as a quickly accessible buffer for data destined to be written to storage drives 171A-171F. The latency of write requests that use the NVRAM devices as a buffer may be improved relative to a system in which storage array controller 110 writes data directly to storage drives 171A-171F. In some embodiments, the NVRAM devices may be implemented with computer memory in the form of high-bandwidth, low-latency RAM. An NVRAM device is referred to as "non-volatile" because it may receive or include its own power source that maintains the state of the RAM after main power to the NVRAM device is lost. Such a power source may be a battery, one or more capacitors, or the like. In response to a power loss, the NVRAM device may be configured to write the contents of the RAM to persistent storage, such as storage drives 171A-171F.
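The buffering behavior described in paragraph [0066] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `WriteBufferSketch`, its methods, and the use of in-memory dictionaries to stand in for the NVRAM device and the storage drives are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBufferSketch:
    """Hypothetical model of an NVRAM device fronting slower storage drives."""
    nvram: dict = field(default_factory=dict)   # staged writes: address -> bytes
    drives: dict = field(default_factory=dict)  # persistent storage: address -> bytes

    def write(self, address: int, data: bytes) -> str:
        # The write is acknowledged once it is staged in NVRAM,
        # before the data reaches the storage drives.
        self.nvram[address] = data
        return "ack"

    def read(self, address: int):
        # Staged data is newer than whatever the drives currently hold.
        if address in self.nvram:
            return self.nvram[address]
        return self.drives.get(address)

    def flush(self) -> int:
        # Destage every buffered write to the drives and empty the buffer.
        count = len(self.nvram)
        self.drives.update(self.nvram)
        self.nvram.clear()
        return count

    def on_power_loss(self) -> int:
        # Assumption: the backup power source keeps the RAM alive long
        # enough for the buffered contents to be written out.
        return self.flush()
```

In this sketch the latency benefit comes from `write` returning as soon as the data is staged, while `on_power_loss` models the device writing its RAM contents to persistent storage.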

[0067] In embodiments, storage drives 171A-171F may refer to any device configured to record data persistently, where "persistently" or "persistent" refers to a device's ability to maintain recorded data after a loss of power. In some embodiments, storage drives 171A-171F may correspond to non-disk storage media. For example, storage drives 171A-171F may be one or more solid-state drives ("SSDs"), flash memory-based storage, any type of solid-state non-volatile memory, or any other type of non-mechanical storage device. In other embodiments, storage drives 171A-171F may include mechanical or spinning hard disks, such as hard disk drives ("HDDs").

[0068] In some embodiments, storage array controller 110 may be configured to offload device management responsibilities from storage drives 171A-171F in storage arrays 102A-102B. For example, storage array controller 110 may manage control information that may describe the state of one or more memory blocks in storage drives 171A-171F. The control information may indicate, for example, that a particular memory block has failed and should no longer be written to, that a particular memory block contains boot code for storage array controller 110, the number of program-erase ("P/E") cycles that have been performed on a particular memory block, the age of the data stored in a particular memory block, the type of data stored in a particular memory block, and so forth. In some embodiments, the control information may be stored with the associated memory block as metadata. In other embodiments, the control information for storage drives 171A-171F may be stored in one or more particular memory blocks of storage drives 171A-171F that are selected by storage array controller 110. The selected memory blocks may be tagged with an identifier indicating that they contain control information. Storage array controller 110 may use the identifier, in conjunction with storage drives 171A-171F, to quickly identify the memory blocks that contain control information. For example, storage array controller 110 may issue a command to locate memory blocks that contain control information. Note that the control information may be so large that parts of it are stored in multiple locations, that the control information may be stored in multiple locations for redundancy, for example, or that the control information may otherwise be distributed across multiple memory blocks in storage drives 171A-171F.

[0069] In embodiments, storage array controller 110 may offload device management responsibilities from storage drives 171A-171F of storage arrays 102A-102B by retrieving, from storage drives 171A-171F, control information describing the state of one or more memory blocks in storage drives 171A-171F. Retrieving the control information may be carried out, for example, by storage array controller 110 querying storage drives 171A-171F for the location of the control information for a particular storage drive 171A-171F. Storage drives 171A-171F may be configured to execute instructions that enable them to identify the location of the control information. The instructions may be executed by a controller (not shown) associated with or otherwise located on storage drives 171A-171F, and may cause storage drives 171A-171F to scan a portion of each memory block to identify the memory blocks that store control information for storage drives 171A-171F. Storage drives 171A-171F may respond by sending storage array controller 110 a response message that includes the location of the control information. In response to receiving the response message, storage array controller 110 may issue a request to read the data stored at the address associated with the location of the control information for storage drives 171A-171F.
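The query-and-read exchange in paragraph [0069] can be sketched as follows. This is an illustrative sketch only: the tag value `CONTROL_TAG`, the class `DriveSketch`, and the function `retrieve_control_info` are hypothetical names, and a drive is modeled simply as a list of byte strings, one per memory block.

```python
CONTROL_TAG = b"CTRL"  # hypothetical identifier marking control-information blocks


class DriveSketch:
    """Hypothetical storage drive modeled as a list of memory blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def locate_control_info(self):
        # Drive side: scan only the leading portion of each block for the
        # tag and report the addresses (here, indices) of tagged blocks.
        n = len(CONTROL_TAG)
        return [i for i, block in enumerate(self.blocks) if block[:n] == CONTROL_TAG]

    def read(self, address):
        return self.blocks[address]


def retrieve_control_info(drive):
    # Controller side: ask the drive where the control information lives,
    # then issue reads against the addresses in the response.
    locations = drive.locate_control_info()
    return [drive.read(address)[len(CONTROL_TAG):] for address in locations]
```

The sketch keeps the two roles separate: the drive only answers "where", and the controller performs the subsequent reads, mirroring the response message followed by a read request.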

[0070] In other embodiments, storage array controller 110 may further offload device management responsibilities from storage drives 171A-171F by performing storage drive management operations in response to receiving the control information. Storage drive management operations may include, for example, operations that are typically performed by storage drives 171A-171F (e.g., by a controller (not shown) associated with a particular storage drive 171A-171F). Storage drive management operations may include, for example, ensuring that data is not written to failed memory blocks within storage drives 171A-171F, ensuring that data is written to memory blocks within storage drives 171A-171F in such a way that an appropriate wear rate is achieved, and so forth.
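The two management operations named in paragraph [0070] can be sketched as follows. The selection policy shown (pick the healthy block with the fewest P/E cycles) is an assumption chosen for illustration, as are the names `BlockInfo`, `choose_block`, and `write_block`; counting one P/E cycle per write is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    """Hypothetical per-block control information held by the controller."""
    block_id: int
    failed: bool = False
    pe_cycles: int = 0


def choose_block(blocks):
    # Never select a failed block; among healthy blocks prefer the least
    # worn one, breaking ties by the lower block id.
    healthy = [b for b in blocks if not b.failed]
    if not healthy:
        raise RuntimeError("no healthy memory blocks available")
    return min(healthy, key=lambda b: (b.pe_cycles, b.block_id))


def write_block(blocks, data, store):
    # Controller-side write path: select a target, store the data, and
    # update the wear counter in the control information.
    target = choose_block(blocks)
    store[target.block_id] = data
    target.pe_cycles += 1  # simplification: one cycle charged per write
    return target.block_id
```

Because the controller holds the control information, the drive itself does not need to track failed blocks or wear in this sketch.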

[0071] In embodiments, storage arrays 102A-102B may implement two or more storage array controllers 110. For example, storage array 102A may include storage array controller 110A and storage array controller 110B. At a given instance, a single storage array controller 110 (e.g., storage array controller 110A) of storage system 100 may be designated with primary status (also referred to herein as the "primary controller"), and the other storage array controllers 110 (e.g., storage array controller 110B) may be designated with secondary status (also referred to herein as the "secondary controller"). The primary controller may have particular rights, such as permission to alter data in persistent storage resources 170A-170B (e.g., writing data to persistent storage resources 170A-170B). At least some of the rights of the primary controller may supersede the rights of the secondary controller. For example, the secondary controller may not have permission to alter data in persistent storage resources 170A-170B while the primary controller holds that right. The status of storage array controllers 110 may change. For example, storage array controller 110A may be designated with secondary status, and storage array controller 110B may be designated with primary status.
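The primary/secondary arrangement can be illustrated with a small sketch in which only the controller holding primary status may modify data and the two statuses can be exchanged. The `Controller` class, the `swap_roles` helper, and the dictionary standing in for persistent storage are hypothetical names chosen for this example.

```python
from enum import Enum


class Status(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class Controller:
    def __init__(self, name, status):
        self.name = name
        self.status = status

    def write(self, store, key, value):
        # Only the controller currently holding primary status may modify data.
        if self.status is not Status.PRIMARY:
            raise PermissionError(f"{self.name} is secondary and may not modify data")
        store[key] = value


def swap_roles(a, b):
    """Exchange primary/secondary status between two controllers."""
    a.status, b.status = b.status, a.status


store = {}  # stands in for the persistent storage resources
ctrl_a = Controller("110A", Status.PRIMARY)
ctrl_b = Controller("110B", Status.SECONDARY)
ctrl_a.write(store, "blk0", b"x")
swap_roles(ctrl_a, ctrl_b)
ctrl_b.write(store, "blk1", b"y")
```

After the swap, a write attempted through `ctrl_a` raises `PermissionError`, which models the primary controller's rights superseding those of the secondary.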

[0072] In some embodiments, a primary controller, such as storage array controller 110A, may serve as the primary controller for one or more storage arrays 102A-102B, and a second controller, such as storage array controller 110B, may serve as the secondary controller for the one or more storage arrays 102A-102B. For example, storage array controller 110A may be the primary controller for storage array 102A and storage array 102B, and storage array controller 110B may be the secondary controller for storage arrays 102A and 102B. In some embodiments, storage array controllers 110C and 110D (also referred to as "storage processing modules") may have neither primary nor secondary status. Storage array controllers 110C and 110D, implemented as storage processing modules, may act as a communication interface between the primary and secondary controllers (e.g., storage array controllers 110A and 110B, respectively) and storage array 102B. For example, storage array controller 110A of storage array 102A may send a write request to storage array 102B via SAN 158. The write request may be received by both storage array controllers 110C and 110D of storage array 102B. Storage array controllers 110C and 110D facilitate the communication, for example by sending the write request to the appropriate storage drive 171A-171F. Note that, in some embodiments, storage processing modules may be used to increase the number of storage drives controlled by the primary and secondary controllers.

[0073] In embodiments, storage array controller 110 is communicatively coupled, via a midplane (not shown), to one or more storage drives 171A-171F and to one or more NVRAM devices (not shown) that are included as part of storage arrays 102A-102B. Storage array controller 110 may be coupled to the midplane via one or more data communication links, and the midplane may be coupled to storage drives 171A-171F and the NVRAM devices via one or more data communication links. The data communication links described herein are collectively illustrated by data communication links 108A-108D and may include, for example, a Peripheral Component Interconnect Express ("PCIe") bus.

[0074] FIG. 1B illustrates an example system for data storage according to some embodiments. Storage array controller 101 illustrated in FIG. 1B may be similar to storage array controllers 110 described with respect to FIG. 1A. In one example, storage array controller 101 may be similar to storage array controller 110A or storage array controller 110B. Storage array controller 101 includes numerous elements for purposes of illustration rather than limitation. Note that storage array controller 101 may include the same, more, or fewer elements configured in the same or a different manner in other embodiments. Note also that elements of FIG. 1A may be referenced below to help illustrate features of storage array controller 101.

[0075] Storage array controller 101 may include one or more processing devices 104 and random access memory ("RAM") 111. Processing device 104 (or controller 101) represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, processing device 104 (or controller 101) may be a complex instruction set computing ("CISC") microprocessor, a reduced instruction set computing ("RISC") microprocessor, a very long instruction word ("VLIW") microprocessor, a processor implementing other instruction sets, or a processor implementing a combination of instruction sets. Processing device 104 (or controller 101) may also be one or more special-purpose processing devices such as an application specific integrated circuit ("ASIC"), a field programmable gate array ("FPGA"), a digital signal processor ("DSP"), a network processor, or the like.

[0076] Processing device 104 may be connected to RAM 111 via data communication link 106, which may be embodied as a high-speed memory bus such as a Double Data Rate 4 ("DDR4") bus. Stored in RAM 111 is operating system 112. In some embodiments, instructions 113 are stored in RAM 111. Instructions 113 may include computer program instructions for performing operations in a direct-mapped flash storage system. In one embodiment, a direct-mapped flash storage system is one that addresses data blocks within flash drives directly, without an address translation performed by the storage controllers of the flash drives.

[0077] In embodiments, storage array controller 101 includes one or more host bus adapters 103A-103C that are coupled to processing device 104 via data communication links 105A-105C. In embodiments, host bus adapters 103A-103C may be computer hardware that connects a host system (e.g., the storage array controller) to other networks and storage arrays. In some examples, host bus adapters 103A-103C may be a Fibre Channel adapter that enables storage array controller 101 to connect to a SAN, an Ethernet adapter that enables storage array controller 101 to connect to a LAN, or the like. Host bus adapters 103A-103C may be coupled to processing device 104 via data communication links 105A-105C such as, for example, a PCIe bus.

[0078] In embodiments, storage array controller 101 may include host bus adapter 114 that is coupled to expander 115. Expander 115 may be used to attach a host system to a larger number of storage drives. Expander 115 may be, for example, a SAS expander utilized to enable host bus adapter 114 to attach to storage drives in an embodiment where host bus adapter 114 is embodied as a SAS controller.

[0079] In embodiments, storage array controller 101 may include switch 116 coupled to processing device 104 via data communication link 109. Switch 116 may be a computer hardware device that creates multiple endpoints out of a single endpoint, thereby enabling multiple devices to share a single endpoint. Switch 116 may be, for example, a PCIe switch that is coupled to a PCIe bus (e.g., data communication link 109) and presents multiple PCIe connection points to the midplane.

[0080] In embodiments, storage array controller 101 includes data communication link 107 for coupling storage array controller 101 to other storage array controllers. In some examples, data communication link 107 may be a QuickPath Interconnect (QPI) interconnect.

[0081] A traditional storage system that uses traditional flash drives may implement a process across the flash drives that are part of the traditional storage system. For example, a higher-level process of the storage system may initiate and control a process across the flash drives. However, a flash drive of the traditional storage system may include its own storage controller that also performs the process. Thus, for the traditional storage system, both a higher-level process (e.g., initiated by the storage system) and a lower-level process (e.g., initiated by a storage controller of the storage system) may be performed.

[0082] To resolve various deficiencies of a traditional storage system, operations may be performed by higher-level processes and not by the lower-level processes. For example, the flash storage system may include flash drives that do not include storage controllers that provide the process. Thus, the operating system of the flash storage system itself may initiate and control the process. This may be accomplished by a direct-mapped flash storage system that addresses data blocks within the flash drives directly, without an address translation performed by the storage controllers of the flash drives.

[0083] The operating system of the flash storage system may identify and maintain a list of allocation units across multiple flash drives of the flash storage system. The allocation units may be entire erase blocks or multiple erase blocks. The operating system may maintain a map or address range that directly maps addresses to erase blocks of the flash drives of the flash storage system.
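One simple way to picture a map that ties addresses directly to erase blocks is plain integer arithmetic over a fixed layout. The sketch below assumes, purely for illustration, that every erase block has the same size and that every drive holds the same number of erase blocks; the function name and parameters are hypothetical and the disclosure does not prescribe this layout.

```python
def locate(address, erase_block_size, blocks_per_drive):
    """Map a flat system address to (drive index, erase block index, byte offset)."""
    block_index, offset = divmod(address, erase_block_size)
    drive, block = divmod(block_index, blocks_per_drive)
    return drive, block, offset


# Example layout: 4096-byte erase blocks, 4 erase blocks per drive.
first = locate(0, 4096, 4)
later = locate(4096 * 5 + 10, 4096, 4)
```

Because the mapping is computed by the system itself, no per-drive translation layer is consulted to find the erase block behind an address.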

[0084] Direct mapping to the erase blocks of the flash drives may be used to rewrite data and erase data. For example, the operations may be performed on one or more allocation units that include first data and second data, where the first data is to be retained and the second data is no longer being used by the flash storage system. The operating system may initiate the process to write the first data to new locations within other allocation units, erase the second data, and mark the allocation units as available for use for subsequent data. Thus, the process may be performed only by the higher-level operating system of the flash storage system, without an additional lower-level process being performed by controllers of the flash drives.
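The rewrite-and-erase process can be sketched as a small reclamation routine: retained data is copied to another allocation unit, the source unit is erased, and the source is marked available. The `AllocationUnit` class, its fields, and the `reclaim` function are assumptions made for this example.

```python
class AllocationUnit:
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.entries = {}       # key -> data held in this unit
        self.available = True   # True when the unit is erased and free for new data


def reclaim(source, destination, live_keys):
    """Copy retained data to another unit, erase the source, and mark it available."""
    for key in live_keys:
        destination.entries[key] = source.entries[key]
    destination.available = False
    source.entries.clear()      # stands in for erasing the underlying erase block(s)
    source.available = True


au0 = AllocationUnit(0)
au0.entries = {"first": b"keep", "second": b"stale"}
au0.available = False
au1 = AllocationUnit(1)

# "first" is still in use; "second" is no longer referenced by the system.
reclaim(au0, au1, ["first"])
```

All three steps run in the system-level routine; nothing in the sketch relies on the drive performing its own background relocation.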

[0085] Advantages of the process being performed only by the operating system of the flash storage system include increased reliability of the flash drives of the flash storage system, because unnecessary or redundant write operations are not performed during the process. One possible point of novelty here is the concept of initiating and controlling the process at the operating system of the flash storage system. In addition, the process can be controlled by the operating system across multiple flash drives. This is in contrast to the process being performed by a storage controller of a flash drive.

[0086] A storage system may consist of two storage array controllers that share a set of drives for failover purposes, or it may consist of a single storage array controller that provides a storage service utilizing multiple drives, or it may consist of a distributed network of storage array controllers, each with some number of drives or some amount of flash storage, where the storage array controllers in the network collaborate to provide a complete storage service and collaborate on various aspects of the storage service, including storage allocation and garbage collection.

[0087] FIG. 1C illustrates a third example system 117 for data storage according to some embodiments. System 117 (also referred to herein as a "storage system") includes numerous elements for purposes of illustration rather than limitation. Note that system 117 may include the same, more, or fewer elements configured in the same or a different manner in other embodiments.

[0088] In one embodiment, system 117 includes a dual Peripheral Component Interconnect ("PCI") flash storage device 118 with separately addressable fast-write storage. System 117 may include storage device controller 119. In one embodiment, storage device controller 119 may be a CPU, an ASIC, an FPGA, or any other circuitry that may implement the control structures necessary according to the present disclosure. In one embodiment, system 117 includes flash memory devices (e.g., including flash memory devices 120a-120n) operatively coupled to various channels of storage device controller 119. Flash memory devices 120a-120n may be presented to controller 119 as an addressable collection of flash pages, erase blocks, and/or control elements sufficient to allow storage device controller 119 to program and retrieve various aspects of the flash. In one embodiment, storage device controller 119 may perform operations on flash memory devices 120a-120n, including storing and retrieving the data content of pages, arranging and erasing any blocks, tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells, tracking and predicting error codes and faults within the flash memory, controlling voltage levels associated with programming and retrieving the contents of flash cells, and so forth.

[0089] In one embodiment, system 117 may include RAM 121 to store separately addressable fast-write data. In one embodiment, RAM 121 may be one or more separate discrete devices. In another embodiment, RAM 121 may be integrated into storage device controller 119 or into multiple storage device controllers. RAM 121 may also be utilized for other purposes, such as temporary program memory for a processing device (e.g., a CPU) in storage device controller 119.

[0090] In one embodiment, system 117 may include stored energy device 122, such as a rechargeable battery or a capacitor. Stored energy device 122 may store energy sufficient to power storage device controller 119, some amount of the RAM (e.g., RAM 121), and some amount of the flash memory (e.g., flash memory 120a-120n) for sufficient time to write the contents of the RAM to flash memory. In one embodiment, storage device controller 119 may write the contents of the RAM to flash memory if the storage device controller detects a loss of external power.

[0091] In one embodiment, system 117 includes two data communication links 123a, 123b. In one embodiment, data communication links 123a, 123b may be PCI interfaces. In another embodiment, data communication links 123a, 123b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Data communication links 123a, 123b may be based on the Non-Volatile Memory Express ("NVMe") or NVMe over Fabrics ("NVMf") specifications, which allow external connection to storage device controller 119 from other components in storage system 117. Note that, for convenience, data communication links may be interchangeably referred to herein as PCI buses.

[0092] System 117 may also include an external power source (not shown), which may be provided over one or both data communication links 123a, 123b, or which may be provided separately. An alternative embodiment includes a separate flash memory (not shown) dedicated for use in storing the contents of RAM 121. Storage device controller 119 may present a logical device over a PCI bus, which may include an addressable fast-write logical device, or a distinct part of the logical address space of storage device 118, which may be presented as PCI memory or as persistent storage. In one embodiment, operations to store into the device are directed into RAM 121. On power failure, storage device controller 119 may write the stored content associated with the addressable fast-write logical storage to flash memory (e.g., flash memory 120a-120n) for long-term persistent storage.

[0093] In one embodiment, the logical device may include some presentation of some or all of the content of flash memory devices 120a-120n, where that presentation allows a storage system including storage device 118 (e.g., storage system 117) to directly address flash memory pages and directly reprogram erase blocks from storage system components that are external to the storage device through the PCI bus. The presentation may also allow one or more of the external components to control and retrieve other aspects of the flash memory, including some or all of the following: tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells across all the flash memory devices; tracking and predicting error codes and faults within and across the flash memory devices; controlling voltage levels associated with programming and retrieving the contents of flash cells; and so forth.

[0094] In one embodiment, stored energy device 122 may be sufficient to ensure completion of in-progress operations on flash memory devices 120a-120n, and stored energy device 122 may power storage device controller 119 and the associated flash memory devices (e.g., 120a-120n) for those operations as well as for storing the fast-write RAM to flash memory. Stored energy device 122 may be used to store accumulated statistics and other parameters kept and tracked by flash memory devices 120a-120n and/or storage device controller 119. Separate capacitors or stored energy devices (such as smaller capacitors near or embedded within the flash memory devices themselves) may be used for some or all of the operations described herein.

[0095] Various schemes may be used to track and optimize the lifespan of the stored energy component, such as adjusting voltage levels over time or partially discharging stored energy device 122 and measuring the corresponding discharge characteristics. If the available energy decreases over time, the effective available capacity of the addressable fast-write storage may be decreased to ensure that it can be written safely based on the currently available stored energy.
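As a rough illustration of how effective fast-write capacity might be tied to available stored energy, the sketch below estimates how many bytes could be flushed to flash before the energy runs out and caps the usable capacity accordingly. The linear energy model, the parameter names, the example figures, and the 0.8 safety margin are all assumptions for this example; they are not taken from the disclosure.

```python
def effective_fast_write_capacity(available_joules, flush_power_watts,
                                  flush_bytes_per_second, nominal_capacity_bytes,
                                  safety_margin=0.8):
    """Bytes of fast-write RAM that can be flushed to flash on stored energy alone."""
    usable_joules = available_joules * safety_margin
    hold_up_seconds = usable_joules / flush_power_watts
    flushable_bytes = int(hold_up_seconds * flush_bytes_per_second)
    # Never advertise more than the RAM physically present.
    return min(nominal_capacity_bytes, flushable_bytes)


GIB = 1024 ** 3
# A fresh energy device versus one whose measured capacity has degraded.
fresh = effective_fast_write_capacity(400.0, 25.0, 1_000_000_000, 8 * GIB)
aged = effective_fast_write_capacity(50.0, 25.0, 1_000_000_000, 8 * GIB)
```

With these example figures, the fresh device supports the full nominal capacity, while the aged device supports only what 1.6 seconds of hold-up time can flush, so the advertised fast-write capacity would shrink.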

[0096] FIG. 1D illustrates a fourth example system 124 for data storage according to some embodiments. In one embodiment, system 124 includes storage controllers 125a, 125b. In one embodiment, storage controllers 125a, 125b are operatively coupled to dual PCI storage devices 119a, 119b and 119c, 119d, respectively. Storage controllers 125a, 125b may be operatively coupled (e.g., via storage network 130) to some number of host computers 127a-127n.

[0097] In one embodiment, two storage controllers (e.g., 125a and 125b) provide storage services, such as a block storage array (e.g., SCSI), a file server, an object server, a database, or a data analytics service. Storage controllers 125a, 125b may provide services through some number of network interfaces (e.g., 126a-126d) to host computers 127a-127n outside of storage system 124. Storage controllers 125a, 125b may provide integrated services or an application entirely within storage system 124, forming a converged storage and compute system. Storage controllers 125a, 125b may utilize the fast-write memory within or across storage devices 119a-119d to journal in-progress operations, to ensure that the operations are not lost on a power failure, storage controller removal, storage controller or storage system shutdown, or some fault of one or more software or hardware components within storage system 124.

[0098] In one embodiment, controllers 125a, 125b operate as PCI masters to one or the other of PCI buses 128a, 128b. In another embodiment, 128a and 128b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Other storage system embodiments may operate storage controllers 125a, 125b as multi-masters for both PCI buses 128a, 128b. Alternatively, a PCI/NVMe/NVMf switching infrastructure or fabric may connect multiple storage controllers. Some storage system embodiments may allow storage devices to communicate with each other directly rather than communicating only with storage controllers. In one embodiment, storage device controller 119a may be operable, under direction from storage controller 125a, to synthesize data to be stored in the flash memory devices from data that has been stored in RAM (e.g., RAM 121 of FIG. 1C) and to transfer that data. For example, a recalculated version of the RAM content may be transferred after a storage controller has determined that an operation has fully committed across the storage system, or when the fast-write memory on the device has reached a certain used capacity, or after a certain amount of time, in order to improve the safety of the data or to release addressable fast-write capacity for reuse. This mechanism may be used, for example, to avoid a second transfer over a bus (e.g., 128a, 128b) from storage controllers 125a, 125b. In one embodiment, the recalculation may include compressing data, attaching indexing or other metadata, combining multiple data segments together, performing erasure code calculations, and so forth.
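The recalculation step can be sketched as a function that combines staged segments, compresses them, and attaches simple index metadata before the result is written to flash. The record layout and the function name are assumptions for this example; `zlib` is used only as a stand-in compressor, and erasure code calculation is deliberately left out of the sketch.

```python
import zlib


def recompute_for_flash(segments):
    """Combine staged segments, compress them, and attach simple index metadata."""
    combined = b"".join(segments)
    index = []
    offset = 0
    for seg in segments:
        index.append((offset, len(seg)))  # where each segment sits in the combined data
        offset += len(seg)
    return {
        "index": index,
        "crc32": zlib.crc32(combined),    # integrity check over the uncompressed data
        "payload": zlib.compress(combined),
    }


record = recompute_for_flash([b"aaaa", b"bb"])
restored = zlib.decompress(record["payload"])
```

Because the transformation runs where the staged data already resides, the data does not need to cross the bus from the storage controller a second time.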

[0099] In one embodiment, under direction from storage controllers 125a, 125b, storage device controllers 119a, 119b may be operable to compute data from data stored in RAM (e.g., RAM 121 of FIG. 1C) and transfer it to other storage devices without involvement of storage controllers 125a, 125b. This operation may be used to mirror data stored in one controller 125a to another controller 125b, or it may be used to offload compression, data aggregation, and/or erasure coding calculations and transfers to storage devices, in order to reduce the load on the storage controllers or on storage controller interfaces 129a, 129b to PCI buses 128a, 128b.

[0100] Storage device controller 119 may include mechanisms for implementing high-availability primitives for use by other parts of a storage system external to the dual PCI storage device 118. For example, reservation or exclusion primitives may be provided so that, in a storage system with two storage controllers providing a highly available storage service, one storage controller may prevent the other storage controller from accessing or continuing to access the storage device. This could be used, for example, when one controller detects that the other controller is not functioning properly, or when the interconnect between the two storage controllers may itself not be functioning properly.
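
A reservation or exclusion primitive of the kind described above might look like the following sketch. The class name, method names, and controller identifiers are hypothetical; the description does not specify an interface, and a real device would enforce this in its controller firmware rather than with an in-process lock.

```python
import threading
from typing import Optional


class StorageDeviceReservation:
    """Hypothetical exclusion primitive held by a single storage device."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._holder: Optional[str] = None  # None means no reservation is held

    def reserve(self, controller_id: str) -> bool:
        """Take the reservation if it is free or already held by the caller."""
        with self._lock:
            if self._holder in (None, controller_id):
                self._holder = controller_id
                return True
            return False

    def preempt(self, controller_id: str) -> None:
        """Forcibly take the reservation, fencing off the previous holder."""
        with self._lock:
            self._holder = controller_id

    def may_access(self, controller_id: str) -> bool:
        """Access is allowed when nobody holds the reservation or the caller does."""
        with self._lock:
            return self._holder in (None, controller_id)
```

In this sketch, a controller that suspects its peer has failed would call `preempt`, after which the peer's `may_access` check fails and the device can reject its requests.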

[0101] In one embodiment, a storage system for use with dual PCI direct-mapped storage devices having separately addressable fast write storage includes systems that manage erase blocks or groups of erase blocks as allocation units for storing data on behalf of the storage service, for storing metadata (e.g., indexes, logs, etc.) associated with the storage service, or for proper management of the storage system itself. Flash pages, which may be a few kilobytes in size, may be written as data arrives or when the storage system is to persist data for long intervals of time (e.g., above a defined threshold of time). To commit data more quickly, or to reduce the number of writes to the flash memory devices, the storage controllers may first write data into the separately addressable fast write storage on one or more storage devices.

[0102] In one embodiment, storage controllers 125a, 125b may initiate the use of erase blocks within and across storage devices (e.g., 118) in accordance with the age and expected remaining lifespan of the storage devices, or based on other statistics. Storage controllers 125a, 125b may also initiate garbage collection and data migration between storage devices in accordance with pages that are no longer needed, as well as to manage flash page and erase block lifespans and to manage overall system performance.

[0103] In one embodiment, storage system 124 may utilize mirroring and/or erasure coding schemes as part of storing data into addressable fast write storage and/or as part of writing data into allocation units associated with erase blocks. Erasure codes may be used across storage devices, as well as within erase blocks or allocation units, or within and across flash memory devices on a single storage device, to provide redundancy against single or multiple storage device failures or to protect against internal corruption of flash memory pages resulting from flash memory operations or from degradation of flash memory cells. Mirroring and erasure coding at various levels may be used to recover from multiple types of failures that occur separately or in combination.
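
As a concrete illustration of redundancy against a single device failure, the sketch below uses one XOR parity shard, the simplest erasure code. This is a stand-in chosen for brevity, not the scheme used by the described system, which may tolerate multiple failures; the function names are assumptions, and all shards are assumed to have equal length.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: List[bytes]) -> bytes:
    """Compute a single parity shard over equal-length data shards."""
    return reduce(xor_bytes, shards)


def rebuild_missing(shards: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Recover at most one missing shard (marked None) from the parity shard."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one missing shard")
    if not missing:
        return [s for s in shards if s is not None]
    survivors = [s for s in shards if s is not None]
    # XOR of the parity with every surviving shard yields the lost shard.
    recovered = reduce(xor_bytes, survivors, parity)
    return [recovered if s is None else s for s in shards]
```

Placing each shard and the parity on a different storage device gives protection against the loss of any one device; protecting against two or more simultaneous losses requires a stronger code such as Reed-Solomon.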

[0104] The embodiments depicted with reference to FIGS. 2A through 2G illustrate a storage cluster that stores user data, such as user data originating from one or more user or client systems or other sources external to the storage cluster. The storage cluster distributes user data across storage nodes housed within a chassis, or across multiple chassis, using erasure coding and redundant copies of metadata. Erasure coding refers to a method of data protection or reconstruction in which data is stored across a set of different locations, such as disks, storage nodes, or geographic locations. Flash memory is one type of solid-state memory that may be integrated with the embodiments, although the embodiments may be extended to other types of solid-state memory or other storage media, including non-solid-state memory. Control of storage locations and workloads is distributed across the storage locations in a clustered peer-to-peer system. Tasks such as mediating communications between the various storage nodes, detecting when a storage node has become unavailable, and balancing I/O (inputs and outputs) across the various storage nodes are all handled on a distributed basis.
Data is laid out or distributed across multiple storage nodes in data fragments or stripes, which in some embodiments supports data recovery. Data ownership is reassigned within the cluster regardless of input and output patterns. This architecture, described in more detail below, allows storage nodes within a cluster to fail while the system remains operational, since data can be reconstructed from other storage nodes and therefore remains available for input and output operations. In various embodiments, storage nodes may be referred to as cluster nodes, blades, or servers.

[0105] A storage cluster may be contained within a chassis, i.e., an enclosure housing one or more storage nodes. A mechanism to provide power to each storage node, such as a power distribution bus, and a communication mechanism, such as a communication bus that enables communication between the storage nodes, are included within the chassis. The storage cluster can run as an independent system in one location according to some embodiments. In one embodiment, a chassis contains at least two instances of both the power distribution bus and the communication bus, each of which may be enabled or disabled independently. The internal communication bus may be an Ethernet bus; however, other technologies such as PCIe, InfiniBand, and others are equally suitable. The chassis provides a port for an external communication bus for enabling communication between multiple chassis, directly or through a switch, and with client systems.
External communication may use technologies such as Ethernet, InfiniBand, Fibre Channel, etc. In some embodiments, the external communication bus uses different communication bus technologies for inter-chassis communication and client communication. When a switch is deployed within or between chassis, the switch may perform the function of translating between multiple protocols or technologies. When multiple chassis are connected to define a storage cluster, the storage cluster may be accessed by clients using a proprietary interface or a standard interface, such as Network File System (NFS), Common Internet File System (CIFS), Small Computer System Interface (SCSI), or Hypertext Transfer Protocol (HTTP). Translation from the client protocol may occur at the switch, the chassis external communication bus, or within each storage node. In some embodiments, multiple chassis may be coupled or connected to each other through an aggregator switch. Some and/or all of the coupled or connected chassis may be designated as a storage cluster. As mentioned above, each chassis may have multiple blades, and each blade has a media access control (MAC) address, but in some embodiments the storage cluster is presented to the external network as having a single cluster IP address and a single MAC address.

[0106] Each storage node may be one or more storage servers, and each storage server is connected to one or more non-volatile solid-state memory units, which may be referred to as storage units or storage devices. One embodiment includes a single storage server in each storage node and between one and eight non-volatile solid-state memory units; however, this example is not meant to be limiting. The storage server may include a processor, DRAM, and interfaces for the internal communication bus and for power distribution for each of the power buses. Inside the storage node, the interfaces and the storage units share a communication bus, e.g., PCI Express, in some embodiments. The non-volatile solid-state memory units may directly access the internal communication bus interface through a storage node communication bus, or may request the storage node to access the bus interface. The non-volatile solid-state memory unit contains an embedded CPU, a solid-state storage controller, and a quantity of solid-state mass storage, e.g., between 2 and 32 terabytes ("TB") in some embodiments.
An embedded volatile storage medium, such as DRAM, and an energy storage device are included in the non-volatile solid-state memory unit. In some embodiments, the energy storage device is a capacitor, supercapacitor, or battery that enables transferring a subset of the DRAM contents to a stable storage medium in the case of power loss. In some embodiments, the non-volatile solid-state memory unit is constructed with a storage-class memory, such as phase-change memory or magnetoresistive random access memory ("MRAM"), that substitutes for DRAM and enables a reduced power hold-up apparatus.

[0107] One of many features of the storage nodes and non-volatile solid-state storage is the ability to proactively rebuild data in a storage cluster. The storage nodes and non-volatile solid-state storage can determine when a storage node or non-volatile solid-state storage in the storage cluster is unreachable, independent of whether there is an attempt to read data involving that storage node or non-volatile solid-state storage. The storage nodes and non-volatile solid-state storage then cooperate to recover and rebuild the data in at least partially new locations. This constitutes a proactive rebuild, in that the system rebuilds data without waiting until the data is needed for a read access initiated from a client system employing the storage cluster. These and further details of the storage memory and its operation are discussed below.
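
The proactive rebuild described above can be sketched in two steps: detect unreachable units without waiting for a client read, then plan where each lost shard should be rebuilt. Heartbeat timestamps as the detection mechanism, the function names, and the unit and segment identifiers are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def find_unreachable(last_heartbeat: Dict[str, float],
                     now: float, timeout: float) -> Set[str]:
    """Units whose last heartbeat is older than the timeout count as unreachable."""
    return {unit for unit, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rebuilds(placement: Dict[str, List[str]],
                  all_units: Set[str],
                  unreachable: Set[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each affected segment to (lost_unit, replacement_unit) pairs.

    A replacement must be reachable and must not already hold a shard of the
    same segment. If too few candidates exist, some lost shards stay unplanned.
    """
    plan: Dict[str, List[Tuple[str, str]]] = {}
    for segment, holders in sorted(placement.items()):
        lost = [u for u in holders if u in unreachable]
        if not lost:
            continue
        candidates = [u for u in sorted(all_units)
                      if u not in unreachable and u not in holders]
        plan[segment] = list(zip(lost, candidates))
    return plan
```

The plan is computed from membership information alone, which is what makes the rebuild proactive: no client read has to fail first.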

[0108] FIG. 2A is a perspective view of a storage cluster 161, with multiple storage nodes 150 and internal solid-state memory coupled to each storage node to provide network-attached storage or a storage area network, in accordance with some embodiments. A network-attached storage, storage area network, storage cluster, or other storage memory could include one or more storage clusters 161, each having one or more storage nodes 150, in a flexible and reconfigurable arrangement of both the physical components and the amount of storage memory provided thereby. The storage cluster 161 is designed to fit in a rack, and one or more racks can be set up and populated as desired for the storage memory.
The storage cluster 161 includes a chassis 138 having multiple slots 142. It should be understood that the chassis 138 may also be referred to as a case, enclosure, or rack unit. While other slot counts are readily devised, in one embodiment, the chassis 138 has 14 slots 142. For example, some embodiments have four slots, eight slots, 16 slots, 32 slots, or other suitable numbers of slots. In some embodiments, each slot 142 may house one storage node 150. The chassis 138 includes a flap 148 that can be utilized to mount the chassis 138 on a rack. The fans 144 provide air circulation for cooling the storage nodes 150 and their components, although other cooling components could be used, or embodiments without cooling components could be devised. The switch fabric 146 couples the storage nodes 150 in the chassis 138 to each other and to a network for communication to memory. In the embodiment shown herein, the slots 142 to the left of the switch fabric 146 and fans 144 are shown occupied by storage nodes 150, while the slots 142 to the right of the switch fabric 146 and fans 144 are empty and available for insertion of storage nodes 150 for illustrative purposes. This configuration is exemplary, and one or more storage nodes 150 could occupy the slots 142 in a variety of additional configurations. The configuration of storage nodes need not be contiguous or adjacent in some embodiments. Storage node 150 is hot-pluggable, meaning that storage node 150 can be inserted into or removed from slot 142 in chassis 138 without shutting down or powering down the system. When storage node 150 is inserted or removed from slot 142, the system recognizes the change and automatically reconfigures to accommodate the change. In some embodiments, the reconfiguration includes restoring redundancy and/or rebalancing data or load.

[0109] Each storage node 150 can have multiple components. In the embodiment shown here, storage node 150 includes a printed circuit board 159 populated by a CPU 156, i.e., a processor, a memory 154 coupled to the CPU 156, and a non-volatile solid-state storage 152 coupled to the CPU 156, although other mountings and/or components could be used in further embodiments. Memory 154 holds instructions that are executed by the CPU 156 and/or data operated on by the CPU 156. As further explained below, the non-volatile solid-state storage 152 includes flash or, in further embodiments, other types of solid-state memory.

[0110] Referring to FIG. 2A, storage cluster 161 is scalable, meaning that storage capacity with non-uniform storage sizes is readily added, as described above. One or more storage nodes 150 can be plugged into or removed from each chassis, and the storage cluster self-configures in some embodiments. Plug-in storage nodes 150, whether installed in a chassis as delivered or added later, can have different sizes. For example, in one embodiment a storage node 150 can have any multiple of 4 TB, e.g., 8 TB, 12 TB, 16 TB, 32 TB, etc. In further embodiments, a storage node 150 could have any multiple of other storage amounts or capacities. The storage capacity of each storage node 150 is broadcast and influences decisions about how to stripe the data. For maximum storage efficiency, an embodiment can self-configure as wide as possible in the stripe, subject to a predetermined requirement of continued operation with the loss of up to one, or up to two, non-volatile solid-state storage units 152 or storage nodes 150 within the chassis.
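
The trade-off behind "as wide as possible" can be shown with simple arithmetic. The sketch below assumes one shard per available unit and one redundancy shard per tolerated loss; it ignores the non-uniform capacities mentioned above, and the function name is hypothetical.

```python
from typing import Tuple


def widest_stripe(unit_count: int, tolerated_losses: int) -> Tuple[int, int, float]:
    """Return (data_shards, redundancy_shards, storage_efficiency).

    The stripe spans every available unit. Each tolerated loss costs one
    redundancy shard, so efficiency rises as the stripe gets wider.
    """
    if tolerated_losses < 0 or unit_count <= tolerated_losses:
        raise ValueError("need more units than tolerated losses")
    data_shards = unit_count - tolerated_losses
    return data_shards, tolerated_losses, data_shards / unit_count
```

Under these assumptions, ten units with two tolerated losses give eight data shards and two redundancy shards, so 80% of raw capacity holds user data; four units with the same tolerance give only 50%.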

[0111] FIG. 2B is a block diagram showing communication interconnects 171A-171F and a power distribution bus 172 coupling multiple storage nodes 150. Referring back to FIG. 2A, in some embodiments the communication interconnects 171A-171F can be included in or implemented with the switch fabric 146. Where multiple storage clusters 161 occupy a rack, the communication interconnects 171A-171F can be included in or implemented with a top-of-rack switch in some embodiments. As illustrated in FIG. 2B, storage cluster 161 is enclosed within a single chassis 138. External port 176 is coupled to storage nodes 150 through communication interconnects 171A-171F, while external port 174 is coupled directly to a storage node. External power port 178 is coupled to power distribution bus 172.
Storage nodes 150 may include varying amounts and differing capacities of non-volatile solid-state storage 152, as described with reference to FIG. 2A. In addition, one or more storage nodes 150 may be compute-only storage nodes, as illustrated in FIG. 2B. Authorities 168 are implemented on the non-volatile solid-state storage 152, for example as lists or other data structures stored in memory. In some embodiments, the authorities are stored within the non-volatile solid-state storage 152 and supported by software executing on a controller or other processor of the non-volatile solid-state storage 152. In a further embodiment, authorities 168 are implemented on the storage nodes 150, for example as lists or other data structures stored in memory 154, and supported by software executing on the CPU 156 of the storage node 150. Authorities 168 control how and where data is stored in the non-volatile solid-state storage 152 in some embodiments. This control assists in determining which type of erasure coding scheme is applied to the data and which storage nodes 150 have which portions of the data. Each authority 168 may be assigned to a non-volatile solid-state storage 152. In various embodiments, each authority may control a range of inode numbers, segment numbers, or other data identifiers that are assigned to data by a file system, by the storage nodes 150, or by the non-volatile solid-state storage 152.

[0112] Every piece of data, and every piece of metadata, has redundancy in the system in some embodiments. In addition, every piece of data and every piece of metadata has an owner, which may be referred to as an authority. If that authority is unreachable, for example through failure of a storage node, there is a plan of succession for how to find that data or that metadata. In various embodiments, there are redundant copies of authorities 168. Authorities 168 have a relationship to storage nodes 150 and non-volatile solid-state storage 152 in some embodiments. Each authority 168, covering a range of data segment numbers or other identifiers of the data, may be assigned to a specific non-volatile solid-state storage 152.
In some embodiments, the authorities 168 for all such ranges are distributed over the non-volatile solid-state storage 152 of a storage cluster. Each storage node 150 has a network port that provides access to the non-volatile solid-state storage(s) 152 of that storage node 150. Data can be stored in a segment, which is associated with a segment number, and that segment number is an indirection for a configuration of a RAID (redundant array of independent disks) stripe in some embodiments. The assignment and use of the authorities 168 thus establishes an indirection to data. Indirection may be referred to as the ability to reference data indirectly, in this case via an authority 168, in accordance with some embodiments. A segment identifies a set of non-volatile solid-state storage 152 and a local identifier into the set of non-volatile solid-state storage 152 that may contain data. In some embodiments, the local identifier is an offset into the device and may be reused sequentially by multiple segments. In other embodiments, the local identifier is unique for a specific segment and never reused. The offsets in the non-volatile solid-state storage 152 are applied to locating data for writing to or reading from the non-volatile solid-state storage 152 (in the form of a RAID stripe). Data is striped across multiple units of non-volatile solid-state storage 152, which may include or be different from the non-volatile solid-state storage 152 having the authority 168 for a particular data segment.

[0113] If there is a change in where a particular segment of data is located, e.g., during a data move or a data reconstruction, the authority 168 for that data segment should be consulted at the non-volatile solid-state storage 152 or storage node 150 having that authority 168. To locate a particular piece of data, embodiments calculate a hash value for a data segment or apply an inode number or a data segment number. The output of this operation points to a non-volatile solid-state storage 152 having the authority 168 for that particular piece of data. In some embodiments there are two stages to this operation. The first stage maps an entity identifier (ID), e.g., a segment number, inode number, or directory number, to an authority identifier. This mapping may include a calculation such as a hash or a bit mask.
The second stage maps the authority identifier to a particular non-volatile solid-state storage 152, which may be done through an explicit mapping. The operation is repeatable, so that when the calculation is performed, the result reliably points to the particular non-volatile solid-state storage 152 having that authority 168. The operation may include the set of reachable storage nodes as input. If the set of reachable non-volatile solid-state storage units changes, the optimal set changes. In some embodiments, the persisted value is the current assignment (which is always true), and the calculated value is the target assignment that the cluster will attempt to reconfigure towards. This calculation may be used to determine the optimal non-volatile solid-state storage 152 for an authority in the presence of a set of non-volatile solid-state storage 152 that are reachable and constitute the same cluster. The calculation also determines an ordered set of peer non-volatile solid-state storage 152 that will also record the authority-to-non-volatile-solid-state-storage mapping, so that the authority may be determined even if the assigned non-volatile solid-state storage is unreachable. A duplicate or substitute authority 168 may be consulted if a specific authority 168 is unavailable in some embodiments.
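
The two-stage lookup can be sketched as follows. Stage one hashes an entity ID to an authority ID; stage two consults an explicit (persisted) assignment and falls back to a computed ranking of reachable units. The ranking here uses rendezvous (highest-random-weight) hashing, which is a technique chosen for this illustration and not one named in the description; all function names and identifiers are likewise assumptions.

```python
import hashlib
from typing import Dict, Iterable, List


def _stable_hash(text: str) -> int:
    """Deterministic 64-bit hash, identical across processes and runs."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def authority_for(entity_id: str, num_authorities: int) -> int:
    """Stage 1: map an entity ID (segment, inode, directory) to an authority ID."""
    return _stable_hash(entity_id) % num_authorities


def ranked_units(authority_id: int, reachable_units: Iterable[str]) -> List[str]:
    """Ordered candidate units for an authority; the first entry is the target."""
    return sorted(reachable_units,
                  key=lambda unit: _stable_hash(f"{authority_id}:{unit}"),
                  reverse=True)


def locate(entity_id: str, num_authorities: int,
           assignment: Dict[int, str], reachable_units: Iterable[str]) -> str:
    """Stage 2: use the persisted assignment, else the best reachable unit."""
    reachable = set(reachable_units)
    authority_id = authority_for(entity_id, num_authorities)
    assigned = assignment.get(authority_id)
    if assigned in reachable:
        return assigned
    return ranked_units(authority_id, reachable)[0]
```

Because every node evaluates the same deterministic functions over the same inputs, each arrives at the same answer without coordination, and the tail of the ranked list is a natural choice for the ordered set of peers that also record the mapping.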

[0114] With reference to FIGS. 2A and 2B, two of the many tasks of the CPU 156 on a storage node 150 are to break up write data and to reassemble read data. When the system has determined that data is to be written, the authority 168 for that data is located as above. When the segment ID for the data is already determined, the request to write is forwarded to the non-volatile solid-state storage 152 currently determined to be the host of the authority 168 determined from the segment. The host CPU 156 of the storage node 150 on which the non-volatile solid-state storage 152 and corresponding authority 168 reside then breaks up, or shards, the data and transmits the data out to various non-volatile solid-state storage 152. The transmitted data is written as a data stripe in accordance with an erasure coding scheme. In some embodiments, data is requested to be pulled, and in other embodiments, data is pushed. In reverse, when data is read, the authority 168 for the segment ID containing the data is located as described above.
The host CPU 156 of the storage node 150 on which the non-volatile solid-state storage 152 and corresponding authority 168 reside requests data from the non-volatile solid-state storage and corresponding storage node pointed to by the authority. In some embodiments, the data is read from flash storage as a data stripe. The host CPU 156 of the storage node 150 then reassembles the read data, corrects any errors (if present) according to an appropriate erasure coding scheme, and transfers the reassembled data to the network. In additional embodiments, some or all of these tasks may be handled within the non-volatile solid-state storage 152. In some embodiments, the segment host requests that data be sent to the storage node 150 by requesting a page from storage and then sending the data to the storage node that made the original request.
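
As a rough illustration of the split-and-reassemble tasks described above, the following sketch shards a buffer into equal data shards plus a single XOR parity shard and rebuilds the buffer when one shard is missing. Single XOR parity is a deliberately simple stand-in for the erasure coding scheme, which the description does not specify at this point; the function names are hypothetical.

```python
from typing import List, Optional


def shard_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size data shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytearray(shard_len)
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return shards + [bytes(parity)]


def reassemble(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data; at most one shard (data or parity) may be None."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    shards = list(shards)
    if missing:
        shard_len = len(next(s for s in shards if s is not None))
        rebuilt = bytearray(shard_len)
        # XOR of all surviving shards equals the missing shard.
        for shard in shards:
            if shard is not None:
                for i, byte in enumerate(shard):
                    rebuilt[i] ^= byte
        shards[missing[0]] = bytes(rebuilt)
    # Drop the parity shard and the zero padding added during sharding.
    return b"".join(shards[:-1])[:original_len]
```

A production scheme such as Reed-Solomon coding tolerates more than one lost shard; the sketch only shows the shape of the write-side split and the read-side repair.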

[0115] In some systems, such as UNIX-style file systems, data is handled with an index node, or inode, which specifies a data structure that represents an object in the file system. The object could be, for example, a file or a directory. Metadata may accompany the object as attributes such as permission data and a creation timestamp, among other attributes. A segment number could be assigned to all or a portion of such an object in the file system. In other systems, data segments are handled with a segment number assigned elsewhere. For purposes of discussion, the unit of distribution is an entity, and an entity can be a file, a directory, or a segment. That is, an entity is a unit of data or metadata stored by the storage system. Entities are grouped into sets called authorities. Each authority has an authority owner, which is a storage node that has the exclusive right to update the entities in the authority. In other words, a storage node contains the authority, and the authority in turn contains entities.

[0116] According to some embodiments, a segment is a logical container of data. A segment is an address space between the medium address space and the physical flash locations; that is, data segment numbers reside in this address space. A segment may also contain metadata that enables data redundancy to be restored (rewritten to different flash locations or devices) without the involvement of higher-level software. In one embodiment, the internal format of a segment contains client data and medium mappings to determine the position of that data. Each data segment is protected, for example against memory and other failures, by breaking the segment into a number of data and parity shards, where applicable. The data and parity shards are distributed, that is, striped, across the non-volatile solid-state storages 152 coupled to the host CPUs 156 (see FIGS. 2E and 2G) according to an erasure coding scheme. In some embodiments, the term segment refers to the container and its place in the address space of segments. The term stripe refers to the same set of shards as a segment and includes how the shards are distributed along with redundancy or parity information, according to some embodiments.

[0117] A series of address space translations takes place across the entire storage system. At the top are the directory entries (file names), which link to inodes. Inodes point into the medium address space, where data is logically stored. Medium addresses may be mapped through a series of indirect mediums to spread the load of large files or to implement data services such as deduplication or snapshots. Segment addresses are then translated into physical flash locations. Physical flash locations have an address range bounded by the amount of flash in the system, according to some embodiments. Medium addresses and segment addresses are logical containers, and in some embodiments use identifiers of 128 bits or more so as to be practically infinite, with the time to reuse calculated to exceed the expected life of the system. Addresses from the logical containers are allocated hierarchically in some embodiments. Initially, each non-volatile solid-state storage unit 152 may be assigned a range of the address space. Within this assigned range, the non-volatile solid-state storage 152 can allocate addresses without synchronizing with other non-volatile solid-state storages 152.
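
The hierarchical allocation described above can be sketched as a two-level allocator: a top-level space hands each storage unit a disjoint range, and each unit then allocates from its own range with no coordination. The class names, the range size, and the sequential allocation policy are assumptions made for illustration only.

```python
class UnitAllocator:
    """Allocates addresses from one unit's private range, with no locking
    or messaging to other units."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.limit = base + size
        self.cursor = base

    def allocate(self) -> int:
        if self.cursor >= self.limit:
            raise RuntimeError("range exhausted; a new range must be requested")
        address = self.cursor
        self.cursor += 1
        return address


class AddressSpace:
    """Hands out disjoint ranges of a 128-bit logical address space."""

    BITS = 128

    def __init__(self, range_size: int) -> None:
        self.range_size = range_size
        self.next_base = 0

    def grant_range(self) -> UnitAllocator:
        # This is the only step that needs coordination between units.
        base = self.next_base
        if base + self.range_size > 1 << self.BITS:
            raise RuntimeError("address space exhausted")
        self.next_base += self.range_size
        return UnitAllocator(base, self.range_size)
```

With ranges of 2**64 addresses, for example, the 128-bit space holds 2**64 such ranges, which is the sense in which the space is practically infinite.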

[0118] Data and metadata are stored using a set of underlying storage layouts that are optimized for varying workload patterns and storage devices. These layouts incorporate multiple redundancy schemes, compression formats, and index algorithms. Some of these layouts store information about authorities and authority masters, while others store file metadata and file data. The redundancy schemes include error correction codes that tolerate corrupted bits within a single storage device (e.g., a NAND flash chip), erasure codes that tolerate the failure of multiple storage nodes, and replication schemes that tolerate data center or regional failures. In some embodiments, a low-density parity check ("LDPC") code is used within a single storage unit. Reed-Solomon encoding is used within a storage cluster, and mirroring is used within a storage grid, in some embodiments. Metadata may be stored using an ordered log-structured index (e.g., a log-structured merge tree), and large data may not be stored in a log-structured layout.

[0119] To maintain consistency across multiple copies of an entity, the storage nodes agree implicitly, through calculation, on two things: (1) the authority that contains the entity, and (2) the storage node that contains the authority. The assignment of entities to authorities can be done by pseudo-randomly assigning entities to authorities, by splitting entities into ranges based on an externally produced key, or by placing a single entity into each authority. Examples of pseudo-random schemes are linear hashing and the Replication Under Scalable Hashing ("RUSH") family of hashes, which includes Controlled Replication Under Scalable Hashing ("CRUSH"). In some embodiments, pseudo-random assignment is utilized only for assigning authorities to nodes, because the set of nodes can change. The set of authorities cannot change, and therefore any subjective function may be applied in these embodiments. Some placement schemes automatically place authorities on storage nodes, while other placement schemes rely on an explicit mapping of authorities to storage nodes. In some embodiments, a pseudo-random scheme is utilized to map from each authority to a set of candidate authority owners. A pseudo-random data distribution function related to CRUSH may assign authorities to storage nodes and create a list of where the authorities are assigned. Each storage node has a copy of the pseudo-random data distribution function and can arrive at the same calculation for distributing, and later finding or locating, an authority. In some embodiments, each of the pseudo-random schemes requires the reachable set of storage nodes as input in order to conclude the same target nodes. Once an entity has been placed in an authority, the entity may be stored on physical devices so that no expected failure leads to unexpected data loss. In some embodiments, a rebalancing algorithm attempts to store the copies of all entities within an authority in the same layout and on the same set of machines.
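
Two of the entity-to-authority assignment options listed above, pseudo-random assignment and range splitting on an external key, can be sketched as follows. The hash choice (SHA-256), the fixed authority count of 128, and the function names are illustrative assumptions; the description names linear hashing and the RUSH family but does not prescribe this code.

```python
import bisect
import hashlib
from typing import List

NUM_AUTHORITIES = 128  # hypothetical; the set of authorities is fixed


def authority_by_hash(entity_id: str, num_authorities: int = NUM_AUTHORITIES) -> int:
    """Pseudo-randomly assign an entity to one of a fixed set of authorities."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_authorities


def authority_by_range(key: str, split_points: List[str]) -> int:
    """Assign an entity by the range its externally produced key falls in.

    split_points must be sorted; range i covers keys from split_points[i - 1]
    (inclusive) up to split_points[i] (exclusive).
    """
    return bisect.bisect_right(split_points, key)
```

Plain modulo hashing is acceptable here only because the number of authorities does not change; mapping authorities onto a changing set of nodes needs a scheme that limits reshuffling.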

[0120] Examples of expected failures include device failures, stolen machines, data center fires, and regional disasters such as nuclear or geological events. Different failures lead to different levels of acceptable data loss. In some embodiments, a stolen storage node affects neither the security nor the reliability of the system. Depending on the system configuration, on the other hand, a regional event could lead to no loss of data, to a few seconds or minutes of lost updates, or even to complete data loss.

[0121] In the embodiments, the placement of data for storage redundancy is independent of the placement of authorities for data consistency. In some embodiments, the storage nodes that contain authorities do not contain any persistent storage. Instead, the storage nodes are connected to non-volatile solid-state storage units that do not contain authorities. The communications interconnect between the storage nodes and the non-volatile solid-state storage units consists of multiple communication technologies and has non-uniform performance and fault tolerance characteristics. In some embodiments, as mentioned above, the non-volatile solid-state storage units are connected to the storage nodes via PCI Express, the storage nodes are connected to each other within a single chassis using an Ethernet backplane, and the chassis are connected to each other to form a storage cluster. In some embodiments, the storage cluster is connected to clients using Ethernet or Fibre Channel. If multiple storage clusters are configured into a storage grid, the storage clusters are connected using the Internet or other long-distance networking links, such as a "macro-scale" link or a private link that does not traverse the Internet.

[0122] The authority owner has the exclusive right to modify entities, to migrate entities from one non-volatile solid-state storage unit to another, and to add and remove copies of entities. This allows the redundancy of the underlying data to be maintained. When an authority owner fails, is about to be decommissioned, or is overloaded, the authority is transferred to a new storage node. Transient failures make it important to ensure that all non-faulty machines agree on the new authority location. The ambiguity that arises from transient failures can be resolved manually, by a remote system administrator or by a local hardware administrator (e.g., by physically removing the failed machine from the cluster or by pressing a button on the failed machine), or automatically, by a consensus protocol such as Paxos or by a hot-warm failover scheme. In some embodiments, a consensus protocol is used and failover is automatic. If too many failures or replication events occur in too short a period, the system enters a self-preservation mode and halts replication and data movement activities until an administrator intervenes, according to some embodiments.
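
The self-preservation behavior described above can be pictured as a sliding-window event counter that latches a paused state until an administrator clears it. The thresholds, class name, and method names below are hypothetical; the description does not state how "too many" or "too short" are defined.

```python
from collections import deque


class SelfPreservationGuard:
    """Pauses replication and data movement after a burst of failure events."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.events = deque()
        self.paused = False

    def record_event(self, now: float) -> bool:
        """Record a failure or replication event; return True if paused."""
        self.events.append(now)
        # Forget events that have fallen out of the sliding window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        if len(self.events) > self.max_events:
            self.paused = True  # latched: only an administrator can clear it
        return self.paused

    def administrator_resume(self) -> None:
        """Manual intervention: clear history and resume normal activity."""
        self.events.clear()
        self.paused = False
```

Timestamps are passed in explicitly so the behavior can be exercised without a real clock.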

[0123] As authorities are transferred between storage nodes and authority owners update the entities in their authorities, the system transfers messages between the storage nodes and the non-volatile solid-state storage units. With regard to persistent messages, messages that have different purposes are of different types. Depending on the type of the message, the system maintains different ordering and durability guarantees. As persistent messages are processed, they are temporarily stored in multiple durable and non-durable storage hardware technologies. In some embodiments, messages are stored in RAM, in NVRAM, and on NAND flash devices, and a variety of protocols are used to make efficient use of each storage medium. Latency-sensitive client requests may be persisted in replicated NVRAM and then later in NAND, while background rebalancing operations are persisted directly to NAND.
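
The type-dependent persistence path described above can be sketched as a small router: latency-sensitive client requests are staged in replicated NVRAM and destaged to NAND later, while background rebalancing messages go straight to NAND. The in-memory lists stand in for real media, and the replica count of three, the enum, and the method names are assumptions for illustration.

```python
from enum import Enum


class MessageKind(Enum):
    CLIENT_REQUEST = "client_request"            # latency-sensitive
    BACKGROUND_REBALANCE = "background_rebalance"


class PersistencePath:
    """Chooses where a persistent message is first made durable."""

    def __init__(self, nvram_replicas: int = 3) -> None:
        self.nvram = [[] for _ in range(nvram_replicas)]  # replicated staging
        self.nand = []                                    # long-term media

    def persist(self, kind: MessageKind, payload: bytes) -> str:
        if kind is MessageKind.CLIENT_REQUEST:
            # Fast path: copy into every NVRAM replica before acknowledging.
            for replica in self.nvram:
                replica.append(payload)
            return "nvram"
        # Background work tolerates NAND latency, so it skips NVRAM.
        self.nand.append(payload)
        return "nand"

    def destage(self) -> int:
        """Move staged NVRAM messages to NAND, then free the NVRAM copies."""
        staged = list(self.nvram[0])
        self.nand.extend(staged)
        for replica in self.nvram:
            replica.clear()
        return len(staged)
```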

[0124] Persistent messages are persistently stored before being transmitted. This allows the system to continue serving client requests despite failures and component replacement. Although many hardware components contain unique identifiers that are visible to system administrators, the manufacturer, the hardware supply chain, and the ongoing monitoring and quality control infrastructure, applications running on top of the infrastructure addresses virtualize those addresses. These virtualized addresses do not change over the lifetime of the storage system, regardless of component failures and replacements. This allows each component of the storage system to be replaced over time without reconfiguration and without disrupting the processing of client requests. In other words, the system supports non-disruptive updates.

[0125] In some embodiments, the virtualized addresses are stored with sufficient redundancy. A continuous monitoring system correlates hardware and software status with the hardware identifiers. This allows failures caused by faulty components and manufacturing details to be detected and predicted. In some embodiments, the monitoring system also enables the proactive transfer of authorities and entities away from affected devices before a failure occurs, by removing the component from the critical path.

[0126] FIG. 2C is a multi-level block diagram illustrating the contents of a storage node 150 and the contents of a non-volatile solid-state storage 152 of the storage node 150. In some embodiments, data is communicated to and from the storage node 150 by a network interface controller ("NIC") 202. Each storage node 150 has a CPU 156 and one or more non-volatile solid-state storages 152, as described above. Moving down one level in FIG. 2C, each non-volatile solid-state storage 152 has relatively fast non-volatile solid-state memory, such as non-volatile random access memory ("NVRAM") 204, and flash memory 206. In some embodiments, the NVRAM 204 may be a component that does not require program/erase cycles (DRAM, MRAM, PCM), and may be a memory that can support being written far more often than it is read. Moving down another level in FIG. 2C, the NVRAM 204 is implemented in one embodiment as high-speed volatile memory, such as dynamic random access memory (DRAM) 216, backed up by energy storage 218. In the event of a power failure, the energy storage 218 provides enough electrical power to keep the DRAM 216 powered long enough for its contents to be transferred to the flash memory 206. In some embodiments, the energy storage 218 is a capacitor, supercapacitor, battery, or other device that supplies enough energy to enable the contents of the DRAM 216 to be transferred to a stable storage medium in the event of power loss. The flash memory 206 is implemented as multiple flash dies 222, which may be referred to as a package of flash dies 222 or an array of flash dies 222. It should be understood that the flash dies 222 could be packaged in any number of ways: a single die per package, multiple dies per package (i.e., a multi-chip package), in hybrid packages, as bare dies on a printed circuit board or other substrate, as encapsulated dies, and so on. In the embodiment shown, the non-volatile solid-state storage 152 has a controller 212 or other processor, and an input/output (I/O) port 210 coupled to the controller 212. The I/O port 210 is coupled to the CPU 156 and/or the network interface controller 202 of the flash storage node 150. A flash input/output (I/O) port 220 is coupled to the flash dies 222, and a direct memory access unit (DMA) 214 is coupled to the controller 212, the DRAM 216, and the flash dies 222. In the embodiment shown, the I/O port 210, controller 212, DMA unit 214, and flash I/O port 220 are implemented on a programmable logic device ("PLD") 208, for example a field programmable gate array (FPGA). In this embodiment, each flash die 222 has pages, organized as 16 kB (kilobyte) pages 224, and a register 226 through which data can be written to or read from the flash die 222. In further embodiments, other types of solid-state memory are used in place of, or in addition to, the flash memory illustrated within the flash die 222.

[0127] A storage cluster 161, in the various embodiments disclosed herein, can be contrasted with storage arrays in general. The storage nodes 150 are part of a collection that creates the storage cluster 161. Each storage node 150 owns a slice of the data and the computing required to provide that data. Multiple storage nodes 150 cooperate to store and retrieve the data. The storage memory or storage devices generally used in storage arrays are less involved in processing and manipulating the data. Storage memory or storage devices in a storage array receive commands to read, write, or erase data. They are not aware of the larger system in which they are embedded, or of what the data means. Storage memory or storage devices in a storage array can include various types of storage memory, such as RAM, solid-state drives, and hard disk drives. The storage units 152 described herein have multiple interfaces that are active simultaneously and serve multiple purposes. In some embodiments, some of the functionality of a storage node 150 is shifted into a storage unit 152, transforming the storage unit 152 into a combination of storage unit 152 and storage node 150. Placing computing (relative to the stored data) into the storage unit 152 moves this computing closer to the data itself. The various system embodiments have a hierarchy of storage node layers with different capabilities. By contrast, in a storage array, a controller owns and knows everything about all of the data that it manages in a shelf or in storage devices. In a storage cluster 161, as described herein, multiple controllers in multiple storage units 152 and/or storage nodes 150 cooperate in various ways (e.g., for erasure coding, data sharding, metadata communication and redundancy, storage capacity expansion or contraction, and data recovery).

[0128] FIG. 2D shows a storage server environment that uses embodiments of the storage nodes 150 and storage units 152 of FIGS. 2A to 2C. In this version, each storage unit 152 has a processor such as the controller 212 (see FIG. 2C), an FPGA (field programmable gate array), flash memory 206, and NVRAM 204 (which is supercapacitor-backed DRAM 216, see FIGS. 2B and 2C) on a PCIe (peripheral component interconnect express) board in a chassis 138 (see FIG. 2A). The storage unit 152 may be implemented as a single board containing storage, and may be the largest tolerable failure domain inside the chassis. In some embodiments, up to two storage units 152 may fail and the device will continue with no data loss.

[0129] In some embodiments, the physical storage is divided into named regions based on application usage. The NVRAM 204 is a contiguous block of reserved memory in the DRAM 216 of the storage unit 152, and is backed by NAND flash. The NVRAM 204 is logically divided into multiple memory regions that are written to for two as spools (e.g., spool_region). Space within the NVRAM 204 spools is managed by each authority 168 independently. Each device provides an amount of storage space to each authority 168, and that authority 168 further manages lifetimes and allocations within that space. Examples of a spool include distributed transactions or notions. When the primary power to a storage unit 152 fails, onboard supercapacitors provide a short duration of power hold-up. During this hold-up interval, the contents of the NVRAM 204 are flushed to the flash memory 206. On the next power-on, the contents of the NVRAM 204 are recovered from the flash memory 206.
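
The hold-up flush and power-on recovery described above can be modeled in a few lines. The sketch keeps a per-authority spool in simulated DRAM-backed NVRAM, copies it to a simulated flash image on power loss, and restores it on power-up. The class and method names are hypothetical, and real hardware would bound the flush by the energy available in the supercapacitor.

```python
from typing import Dict, List, Optional


class StorageUnit:
    """Toy model of NVRAM (DRAM plus hold-up energy) backed by flash."""

    def __init__(self) -> None:
        # authority id -> records spooled in that authority's NVRAM space
        self.nvram: Dict[int, List[bytes]] = {}
        self.flash_image: Optional[Dict[int, List[bytes]]] = None

    def spool(self, authority_id: int, record: bytes) -> None:
        """Append a record to the spool space managed by one authority."""
        self.nvram.setdefault(authority_id, []).append(record)

    def on_power_loss(self) -> None:
        """During the hold-up interval, flush NVRAM contents to flash."""
        self.flash_image = {a: list(r) for a, r in self.nvram.items()}
        self.nvram = {}  # DRAM contents vanish once hold-up energy is spent

    def on_power_up(self) -> None:
        """On the next power-on, recover NVRAM contents from flash."""
        if self.flash_image is not None:
            self.nvram = self.flash_image
            self.flash_image = None
```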

[0130] As for the storage unit controller, the responsibility of the logical "controller" is distributed across each of the blades containing authorities 168. This distribution of logical control is shown in FIG. 2D as a host controller 242, a mid-tier controller 244, and storage unit controller(s) 246. Management of the control plane and the storage plane is handled independently, although the parts may be physically co-located on the same blade. Each authority 168 effectively serves as an independent controller. Each authority 168 provides its own data and metadata structures and its own background workers, and maintains its own lifecycle.

[0131] FIG. 2E is a blade 252 hardware block diagram showing a control plane 254, compute and storage planes 256, 258, and authorities 168 interacting with underlying physical resources, using embodiments of the storage nodes 150 and storage units 152 of FIGS. 2A to 2C in the storage server environment of FIG. 2D. The control plane 254 is partitioned into a number of authorities 168, which can use the compute resources in the compute plane 256 to run on any of the blades 252. The storage plane 258 is partitioned into a set of devices, each of which provides access to flash 206 and NVRAM 204 resources.

[0132] In the compute and storage planes 256, 258 of FIG. 2E, the authorities 168 interact with the underlying physical resources (i.e., devices). From the point of view of an authority 168, its resources are striped over all of the physical devices. From the point of view of a device, it provides resources to all authorities 168, irrespective of where the authorities happen to run. Each authority 168 has allocated, or has been allocated, one or more partitions 260 of storage memory in the storage units 152, for example partitions 260 in flash memory 206 and NVRAM 204. Each authority 168 uses the allocated partitions 260 that belong to it for writing or reading user data. Authorities can be associated with differing amounts of the physical storage of the system. For example, one authority 168 could have a larger number of partitions 260, or larger partitions 260, in one or more storage units 152 than one or more other authorities 168.

[0133] FIG. 2F illustrates the elasticity software layers in the blades 252 of a storage cluster, according to some embodiments. In the elasticity structure, the elasticity software is symmetric; that is, the compute module 270 of each blade runs the same three layers of processes shown in FIG. 2F. The storage managers 274 execute read and write requests from other blades 252 for data and metadata stored in the NVRAM 204 and flash 206 of the local storage unit 152. The authorities 168 fulfill client requests by issuing the necessary reads and writes to the blades 252 on whose storage units 152 the corresponding data or metadata resides. The endpoints 272 parse client connection requests received from the switch fabric 146 supervisory software, relay the client connection requests to the authorities 168 responsible for fulfillment, and relay the responses of the authorities 168 to the clients. The symmetric three-layer structure enables a high degree of concurrency in the storage system. Elasticity scales out efficiently and reliably in these embodiments. In addition, elasticity implements a unique scale-out technique that balances work evenly across all resources regardless of client access pattern, and that maximizes concurrency by eliminating much of the need for the inter-blade coordination that typically occurs with conventional distributed locking.

[0134] Still referring to FIG. 2F, authorities 168 running in the compute modules 270 of a blade 252 perform the internal operations required to fulfill client requests. One feature of elasticity is that authorities 168 are stateless, i.e., they cache active data and metadata in their own blades' 252 DRAMs for fast access, but they store every update in their NVRAM 204 partitions on three separate blades 252 until the update has been written to flash 206. All storage system writes to NVRAM 204 are, in some embodiments, made in triplicate to partitions on three separate blades 252. With triple-mirrored NVRAM 204 and persistent storage protected by parity and Reed-Solomon RAID checksums, the storage system can survive the concurrent failure of two blades 252 with no loss of data, metadata, or access to either.
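For illustration only, the triple-mirrored NVRAM staging described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `Blade`, `stage_update`, and `recover` are hypothetical, the choice of which three blades receive a copy is simplified to "the first three live blades", and the later write to flash is omitted.

```python
from dataclasses import dataclass, field

MIRROR_COUNT = 3  # the text describes NVRAM writes going to three separate blades


@dataclass
class Blade:
    """Hypothetical blade holding per-authority NVRAM partitions."""
    name: str
    alive: bool = True
    nvram: dict = field(default_factory=dict)  # authority_id -> list of updates


def stage_update(blades, authority_id, update):
    """Mirror one update into the authority's NVRAM partition on three live blades."""
    targets = [b for b in blades if b.alive][:MIRROR_COUNT]
    if len(targets) < MIRROR_COUNT:
        raise RuntimeError("not enough live blades to mirror the update")
    for blade in targets:
        blade.nvram.setdefault(authority_id, []).append(update)
    return [b.name for b in targets]


def recover(blades, authority_id):
    """Read the staged updates back from any surviving mirror."""
    for blade in blades:
        if blade.alive and authority_id in blade.nvram:
            return list(blade.nvram[authority_id])
    raise RuntimeError("all mirrors lost")
```

Because each update exists on three blades, marking any two of them as failed still leaves one copy for `recover` to return, which mirrors the two-blade failure tolerance stated in the paragraph.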

[0135] Because authorities 168 are stateless, they can migrate between blades 252. Each authority 168 has a unique identifier. NVRAM 204 and flash 206 partitions are associated with the authorities' 168 identifiers, not with the blades 252 on which the authorities happen to be running. Thus, when an authority 168 migrates, it continues to manage the same storage partitions from its new location. When a new blade 252 is installed in an embodiment of the storage cluster, the system automatically rebalances load by partitioning the new blade's 252 storage for use by the system's authorities 168, migrating selected authorities 168 to the new blade 252, starting endpoints 272 on the new blade 252, and including them in the switch fabric's 146 client connection distribution algorithm.
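The relationship between authority identifiers, partitions, and blades can be sketched as follows. This is an illustrative Python model under stated assumptions: the `Cluster` class, its method names, and the "even count per blade" rebalancing rule are hypothetical, and the step of partitioning the new blade's storage and the endpoint start-up are omitted. The point shown is that partitions are keyed by authority identifier, so migration changes only where an authority runs.

```python
from collections import Counter


class Cluster:
    """Hypothetical model: partitions are keyed by authority ID, not by blade."""

    def __init__(self, blades):
        self.blades = list(blades)
        self.running_on = {}   # authority_id -> blade that currently executes it
        self.partitions = {}   # authority_id -> set of (blade, partition_no)

    def add_authority(self, authority_id, blade, partitions):
        self.running_on[authority_id] = blade
        self.partitions[authority_id] = set(partitions)

    def migrate(self, authority_id, new_blade):
        # Only the execution location changes; partition ownership is untouched.
        self.running_on[authority_id] = new_blade

    def add_blade(self, blade):
        """Install a blade and move authorities onto it until load is roughly even."""
        self.blades.append(blade)
        target = len(self.running_on) // len(self.blades)
        load = Counter(self.running_on.values())
        for authority_id in sorted(self.running_on):
            if load[blade] >= target:
                break
            source = self.running_on[authority_id]
            if load[source] > target:
                self.migrate(authority_id, blade)
                load[source] -= 1
                load[blade] += 1
```

With six authorities split evenly over two blades, calling `add_blade` with a third blade moves one authority from each original blade twice over until each blade runs two, while the `partitions` mapping is left unchanged.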

[0136] From their new locations, migrated authorities 168 persist the contents of their NVRAM 204 partitions on flash 206, process read and write requests from other authorities 168, and fulfill the client requests that endpoints 272 direct to them. Similarly, if a blade 252 fails or is removed, the system redistributes its authorities 168 among the system's remaining blades 252. The redistributed authorities 168 continue to perform their original functions from their new locations.

[0137] FIG. 2G depicts authorities 168 and storage resources in blades 252 of a storage cluster, in accordance with some embodiments. Each authority 168 is exclusively responsible for a partition of the flash 206 and NVRAM 204 on each blade 252. The authority 168 manages the content and integrity of its partitions independently of other authorities 168. Authorities 168 compress incoming data and preserve it temporarily in their NVRAM 204 partitions, and then consolidate, RAID-protect, and persist the data in segments of the storage in their flash 206 partitions. As the authorities 168 write data to flash 206, storage managers 274 perform the necessary flash translation to optimize write performance and maximize media longevity. In the background, authorities 168 "garbage collect," that is, reclaim space occupied by data that clients have made obsolete by overwriting it. It should be appreciated that because the authorities' 168 partitions are disjoint, there is no need for distributed locking to execute client writes or to perform background functions.
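The write path described above (compress, stage in NVRAM, persist to a flash segment, garbage collect overwritten data) can be sketched as follows. This is a simplified Python illustration under stated assumptions: the `Authority` class and its methods are hypothetical, plain dictionaries stand in for NVRAM and flash partitions, RAID protection and flash translation are omitted, and `zlib` is used only as a stand-in compressor.

```python
import zlib


class Authority:
    """Hypothetical authority that owns its NVRAM and flash partitions exclusively."""

    def __init__(self, authority_id, flush_threshold=2):
        self.authority_id = authority_id
        self.flush_threshold = flush_threshold
        self.nvram = []        # staged (key, compressed bytes) pairs
        self.flash = {}        # segment number -> {key: compressed bytes}
        self.index = {}        # key -> segment number holding the latest version
        self.next_segment = 0

    def write(self, key, data):
        # Compress incoming data and stage it in NVRAM first.
        self.nvram.append((key, zlib.compress(data)))
        if len(self.nvram) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Persist everything staged in NVRAM as one new flash segment."""
        if not self.nvram:
            return
        segment = self.next_segment
        self.next_segment += 1
        self.flash[segment] = dict(self.nvram)
        for key, _ in self.nvram:
            self.index[key] = segment
        self.nvram.clear()

    def read(self, key):
        # The newest staged copy wins over anything already on flash.
        for staged_key, blob in reversed(self.nvram):
            if staged_key == key:
                return zlib.decompress(blob)
        return zlib.decompress(self.flash[self.index[key]][key])

    def garbage_collect(self):
        """Drop flash entries made obsolete by overwrites; return the count reclaimed."""
        reclaimed = 0
        for segment in list(self.flash):
            entries = self.flash[segment]
            live = {k: v for k, v in entries.items() if self.index.get(k) == segment}
            reclaimed += len(entries) - len(live)
            if live:
                self.flash[segment] = live
            else:
                del self.flash[segment]
        return reclaimed
```

Because every structure in the sketch belongs to exactly one `Authority` instance, none of the methods needs a lock shared with other authorities, which is the property the paragraph attributes to disjoint partitions.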

[0138] The embodiments described herein may utilize various software, communication, and/or networking protocols. In addition, the configuration of the hardware and/or software may be adjusted to accommodate various protocols. For example, the embodiments may utilize Active Directory, which is a database-based system that provides authentication, directory, policy, and other services in a WINDOWS™ environment. In these embodiments, LDAP (Lightweight Directory Access Protocol) is one example application protocol for querying and modifying items in directory service providers such as Active Directory. In some embodiments, a network lock manager ("NLM") is utilized as a facility that works in cooperation with the Network File System ("NFS") to provide System V style advisory file and record locking over a network. The Server Message Block ("SMB") protocol, one version of which is also known as the Common Internet File System ("CIFS"), may be integrated with the storage systems discussed herein. SMB operates as an application-layer network protocol typically used for providing shared access to files, printers, and serial ports, and miscellaneous communications between nodes on a network. SMB also provides an authenticated inter-process communication mechanism. AMAZON™ S3 (Simple Storage Service) is a web service offered by Amazon Web Services, and the systems described herein may interface with Amazon S3 through web services interfaces (REST (representational state transfer), SOAP (simple object access protocol), and BitTorrent). A RESTful API (application programming interface) breaks down a transaction to create a series of small modules. Each module addresses a particular underlying part of the transaction. The control or permissions provided with these embodiments, especially for object data, may include utilization of an access control list ("ACL"). The ACL is a list of permissions attached to an object, and the ACL specifies which users or system processes are granted access to objects, as well as what operations are allowed on given objects. The systems may utilize Internet Protocol version 6 ("IPv6"), as well as IPv4, for the communications protocol that provides an identification and location system for computers on networks and routes traffic across the Internet. The routing of packets between networked systems may include equal-cost multi-path routing ("ECMP"), which is a routing strategy where next-hop packet forwarding to a single destination can occur over multiple "best paths" that tie for top place in routing metric calculations. Multi-path routing can be used in conjunction with most routing protocols because it is a per-hop decision limited to a single router. The software may support multi-tenancy, which is an architecture in which a single instance of a software application serves multiple customers. Each customer may be referred to as a tenant. Tenants may be given the ability to customize some parts of the application, but may not customize the application's code, in some embodiments. The embodiments may maintain audit logs. An audit log is a document that records an event in a computing system. In addition to documenting what resources were accessed, audit log entries typically include destination and source addresses, a timestamp, and user login information for compliance with various regulations. The embodiments may support various key management policies, such as encryption key rotation. In addition, the system may support dynamic root passwords or some variation of dynamically changing passwords.
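As an illustration of the ECMP strategy mentioned above, the following Python sketch selects among next hops that tie for the best routing metric. It rests on assumptions not stated in the text: a lower metric is treated as better, and a stable hash of the flow tuple is used to pick one of the tied hops so that packets of one flow stay on one path. The function names and the sample addresses are hypothetical.

```python
import hashlib


def equal_cost_next_hops(routes):
    """Return, in sorted order, the next hops that tie for the best (lowest) metric.

    `routes` maps next-hop address -> routing metric.
    """
    best = min(routes.values())
    return sorted(hop for hop, metric in routes.items() if metric == best)


def pick_next_hop(routes, flow):
    """Pick one of the tied next hops using a stable hash of the flow tuple."""
    hops = equal_cost_next_hops(routes)
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]
```

The decision uses only the local routing table and the packet's own flow identity, which is consistent with the statement that multi-path routing is a per-hop decision limited to a single router.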

[0139] FIG. 3A sets forth a diagram of a storage system 306 that is coupled for data communications with a cloud service provider 302 in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage system 306 depicted in FIG. 3A may be similar to the storage systems described above with reference to FIGS. 1A-1D and FIGS. 2A-2G. In some embodiments, the storage system 306 depicted in FIG. 3A may be embodied as a storage system that includes imbalanced active/active controllers, as a storage system that includes balanced active/active controllers, as a storage system that includes active/active controllers where less than all of each controller's resources are utilized such that each controller has reserve resources that may be used to support failover, as a storage system that includes fully active/active controllers, as a storage system that includes dataset-segregated controllers, as a storage system that includes dual-layer architectures with front-end controllers and back-end integrated storage controllers, as a storage system that includes scale-out clusters of dual-controller arrays, as well as combinations of such embodiments.

[0140] In the example depicted in FIG. 3A, the storage system 306 is coupled to the cloud service provider 302 via a data communications link 304. The data communications link 304 may be embodied as a dedicated data communications link, as a data communications pathway that is provided through the use of one or more data communications networks such as a wide area network ("WAN") or local area network ("LAN"), or as some other mechanism capable of transporting digital information between the storage system 306 and the cloud service provider 302. Such a data communications link 304 may be fully wired, fully wireless, or some aggregation of wired and wireless data communications pathways. In such an example, digital information may be exchanged between the storage system 306 and the cloud service provider 302 via the data communications link 304 using one or more data communications protocols. For example, digital information may be exchanged using the Handheld Device Transfer Protocol ("HDTP"), Hypertext Transfer Protocol ("HTTP"), Internet Protocol ("IP"), Real-Time Transport Protocol ("RTP"), Transmission Control Protocol ("TCP"), User Datagram Protocol ("UDP"), Wireless Application Protocol ("WAP"), or other protocols.

[0141] The cloud service provider 302 depicted in FIG. 3A may be embodied, for example, as a system and computing environment that provides services to users of the cloud service provider 302 through the sharing of computing resources via the data communications link 304. The cloud service provider 302 may provide on-demand access to a shared pool of configurable computing resources such as computer networks, servers, storage, applications, and services. The shared pool of configurable resources may be rapidly provisioned and released to a user of the cloud service provider 302 with minimal management effort. Generally, the user of the cloud service provider 302 is unaware of the exact computing resources utilized by the cloud service provider 302 to provide the services. Although in many cases such a cloud service provider 302 may be accessible via the Internet, readers of skill in the art will recognize that any system that abstracts the use of shared resources to provide services to a user through any data communications link may be considered a cloud service provider 302.

[0142] In the example depicted in FIG. 3A, the cloud service provider 302 may be configured to provide a variety of services to the storage system 306 and users of the storage system 306 through the implementation of various service models. For example, the cloud service provider 302 may be configured to provide such services through the implementation of an infrastructure as a service ("IaaS") service model, in which the cloud service provider 302 offers computing infrastructure such as virtual machines and other resources as a service to subscribers. In addition, the cloud service provider 302 may be configured to provide such services through the implementation of a platform as a service ("PaaS") service model, in which the cloud service provider 302 offers a development environment to application developers. Such a development environment may include, for example, an operating system, programming-language execution environment, database, web server, or other components that may be utilized by application developers to develop and run software solutions on a cloud platform. Furthermore, the cloud service provider 302 may be configured to provide such services through the implementation of a software as a service ("SaaS") service model, in which the cloud service provider 302 offers application software and databases, as well as the platforms that are used to run the applications, to the storage system 306 and users of the storage system 306. This provides the storage system 306 and its users with on-demand software, eliminates the need to install and run the application on local computers, and may simplify maintenance and support of the application. The cloud service provider 302 may be further configured to provide such services through the implementation of an authentication as a service ("AaaS") service model, in which the cloud service provider 302 offers authentication services that can be used to secure access to applications, data sources, or other resources. The cloud service provider 302 may also be configured to provide such services through the implementation of a storage as a service model, in which the cloud service provider 302 offers access to its storage infrastructure for use by the storage system 306 and users of the storage system 306. Readers will appreciate that the cloud service provider 302 may be configured to provide additional services to the storage system 306 and users of the storage system 306 through the implementation of additional service models, as the service models described above are included only for explanatory purposes and in no way represent a limitation of the services that may be offered by the cloud service provider 302 or a limitation as to the service models that may be implemented by the cloud service provider 302.

[0143] In the example depicted in FIG. 3A, the cloud service provider 302 may be embodied, for example, as a private cloud, as a public cloud, or as a combination of a private cloud and public cloud. In an embodiment in which the cloud service provider 302 is embodied as a private cloud, the cloud service provider 302 may be dedicated to providing services to a single organization rather than providing services to multiple organizations. In an embodiment where the cloud service provider 302 is embodied as a public cloud, the cloud service provider 302 may provide services to multiple organizations. Public cloud and private cloud deployment models may differ and may come with various advantages and disadvantages. For example, because a public cloud deployment involves the sharing of a computing infrastructure across different organizations, such a deployment may not be ideal for organizations with security concerns, mission-critical workloads, uptime requirements, and so on. While a private cloud deployment can address some of these issues, a private cloud deployment may require on-premises staff to manage the private cloud. In still alternative embodiments, the cloud service provider 302 may be embodied as a mix of private and public cloud services with a hybrid cloud deployment.

[0144] Although not explicitly depicted in FIG. 3A, readers will appreciate that additional hardware components and additional software components may be necessary to facilitate the delivery of cloud services to the storage system 306 and users of the storage system 306. For example, the storage system 306 may be coupled to (or even include) a cloud storage gateway. Such a cloud storage gateway may be embodied, for example, as a hardware-based or software-based appliance that is located on premises with the storage system 306. Such a cloud storage gateway may operate as a bridge between local applications that are executing on the storage system 306 and remote, cloud-based storage that is utilized by the storage system 306. Through the use of a cloud storage gateway, organizations may move temporary iSCSI or NAS data to the cloud service provider 302, thereby enabling the organization to save space on its on-premises storage systems. Such a cloud storage gateway may be configured to emulate a disk array, a block-based device, a file server, or other storage system that can translate SCSI commands, file server commands, or other appropriate commands into REST-space protocols that facilitate communications with the cloud service provider 302.
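The translation role of such a gateway can be illustrated with a short Python sketch. This is a hypothetical example only: the `RestRequest` type, the `translate_block_command` function, the 512-byte block size, and the `/volumes/.../blocks/...` path layout are all assumptions made for illustration and do not correspond to any particular cloud provider's API. The sketch builds a description of a REST-style request rather than performing network I/O.

```python
from dataclasses import dataclass

BLOCK_SIZE = 512  # assumed logical block size in bytes


@dataclass(frozen=True)
class RestRequest:
    """Hypothetical description of a REST-style request to cloud storage."""
    method: str
    path: str
    body: bytes = b""


def translate_block_command(volume, op, lba, data=b""):
    """Map a block-level READ or WRITE at a logical block address to a REST-style request."""
    path = f"/volumes/{volume}/blocks/{lba}"
    if op == "WRITE":
        if len(data) != BLOCK_SIZE:
            raise ValueError("a WRITE must carry exactly one block of data")
        return RestRequest("PUT", path, data)
    if op == "READ":
        return RestRequest("GET", path)
    raise ValueError(f"unsupported operation: {op}")
```

A local application continues to issue block commands, and only the gateway needs to know how those commands map onto requests understood by the remote storage.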

[0145] In order to enable the storage system 306 and users of the storage system 306 to make use of the services provided by the cloud service provider 302, a cloud migration process may take place during which data, applications, or other elements from an organization's local systems (or even from another cloud environment) are moved to the cloud service provider 302. In order to successfully migrate data, applications, or other elements to the cloud service provider's 302 environment, middleware such as a cloud migration tool may be utilized to bridge gaps between the cloud service provider's 302 environment and an organization's environment. Such cloud migration tools may also be configured to address the potentially high network costs and long transfer times associated with migrating large volumes of data to the cloud service provider 302, as well as the security concerns associated with transmitting sensitive data to the cloud service provider 302 over data communications networks. In order to further enable the storage system 306 and users of the storage system 306 to make use of the services provided by the cloud service provider 302, a cloud orchestrator may also be used to arrange and coordinate automated tasks in pursuit of creating a consolidated process or workflow. Such a cloud orchestrator may perform tasks such as configuring various components, whether those components are cloud components or on-premises components, as well as managing the interconnections between such components. The cloud orchestrator can simplify the inter-component communication and connections to ensure that links are correctly configured and maintained.

[0146] In the example depicted in FIG. 3A, and as described briefly above, the cloud service provider 302 may be configured to provide services to the storage system 306 and users of the storage system 306 through the usage of a SaaS service model, in which the cloud service provider 302 offers application software and databases, as well as the platforms that are used to run the applications, to the storage system 306 and users of the storage system 306. This provides the storage system 306 and its users with on-demand software, eliminates the need to install and run the application on local computers, and may simplify maintenance and support of the application. Such applications may take many forms in accordance with various embodiments of the present disclosure. For example, the cloud service provider 302 may be configured to provide access to data analytics applications to the storage system 306 and users of the storage system 306. Such data analytics applications may be configured, for example, to receive telemetry data reported home by the storage system 306. Such telemetry data may describe various operating characteristics of the storage system 306 and may be analyzed, for example, to determine the health of the storage system 306, to identify workloads that are executing on the storage system 306, to predict when the storage system 306 will run out of various resources, and to recommend configuration changes, hardware or software upgrades, workflow migrations, or other actions that may improve the operation of the storage system 306.
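One of the analyses mentioned above, predicting when a storage system will run out of a resource, can be illustrated with a simple trend fit. The Python sketch below is an assumption made for illustration, not a method described in the text: it fits a least-squares line to hypothetical (day, bytes used) telemetry samples and extrapolates to the capacity limit. The function name `days_until_full` and the sample format are hypothetical.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    `samples` is a list of (day, used) pairs. A least-squares line is fitted
    to the samples; None is returned when usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples need distinct timestamps")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # flat or shrinking usage never reaches capacity
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - samples[-1][0])
```

A linear fit is only the simplest possible model; an analytics application could substitute any forecasting technique without changing how the telemetry is collected.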

[0147] The cloud service provider 302 may also be configured to provide access to virtualized computing environments to the storage system 306 and users of the storage system 306. Such virtualized computing environments may be embodied, for example, as a virtual machine or other virtualized computer hardware platform, virtual storage devices, virtualized computer network resources, and so on. Examples of such virtualized environments can include virtual machines that are created to emulate an actual computer, virtualized desktop environments that separate a logical desktop from a physical machine, virtualized file systems that allow uniform access to different types of concrete file systems, and many others.

[0148] For further explanation, FIG. 3B sets forth a diagram of a storage system 306 in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage system 306 depicted in FIG. 3B may be similar to the storage systems described above with reference to FIGS. 1A-1D and FIGS. 2A-2G, as the storage system may include many of the components described above.

[0149] The storage system 306 depicted in FIG. 3B may include storage resources 308, which may be embodied in many forms. For example, in some embodiments the storage resources 308 can include nano-RAM or another form of nonvolatile random access memory that utilizes carbon nanotubes deposited on a substrate. In some embodiments, the storage resources 308 may include 3D crosspoint non-volatile memory in which bit storage is based on a change of bulk resistance, in conjunction with a stackable cross-gridded data access array. In some embodiments, the storage resources 308 may include flash memory, including single-level cell ("SLC") NAND flash, multi-level cell ("MLC") NAND flash, triple-level cell ("TLC") NAND flash, quad-level cell ("QLC") NAND flash, and others. In some embodiments, the storage resources 308 may include non-volatile magnetoresistive random-access memory ("MRAM"), including spin transfer torque ("STT") MRAM, in which data is stored through the use of magnetic storage elements. In some embodiments, the example storage resources 308 may include non-volatile phase-change memory ("PCM"), which may have the ability to hold multiple bits in a single cell because cells can achieve a number of distinct intermediary states. In some embodiments, the storage resources 308 may include quantum memory that allows for the storage and retrieval of photonic quantum information. In some embodiments, the example storage resources 308 may include resistive random-access memory ("ReRAM"), in which data is stored by changing the resistance across a dielectric solid-state material. In some embodiments, the storage resources 308 may include storage class memory ("SCM"), in which solid-state nonvolatile memory may be manufactured at a high density using some combination of sub-lithographic patterning techniques, multiple bits per cell, multiple layers of devices, and so on. Readers will appreciate that other forms of computer memories and storage devices may be utilized by the storage systems described above, including DRAM, SRAM, EEPROM, universal memory, and many others. The storage resources 308 depicted in FIG. 3A may be embodied in a variety of form factors, including but not limited to, dual in-line memory modules ("DIMMs"), non-volatile dual in-line memory modules ("NVDIMMs"), M.2, U.2, and others.

[0150] The example storage system 306 shown in FIG. 3B may implement a variety of storage architectures. For example, storage systems according to some embodiments of the present disclosure may utilize block storage, where data is stored in blocks and each block essentially acts as an individual hard drive. Storage systems according to some embodiments of the present disclosure may utilize object storage, where data is managed as objects. Each object may include the data itself, a variable amount of metadata, and a globally unique identifier, and object storage can be implemented at multiple levels (e.g., device level, system level, interface level). Storage systems according to some embodiments of the present disclosure may utilize file storage, where data is stored in a hierarchical structure. Such data is saved in files and folders and is presented in the same format both to the system that stores it and to the system that retrieves it.
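By way of illustration only, the object storage model described above (an object comprising the data itself, a variable amount of metadata, and a globally unique identifier) might be sketched as follows. The names `StoredObject` and `ObjectStore` are hypothetical and do not appear in the disclosure, and an in-memory dictionary stands in for the underlying storage resources.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical object: data, variable metadata, and a globally unique ID."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    # uuid4 is used here only as one convenient source of unique identifiers.
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ObjectStore:
    """Hypothetical in-memory object store keyed by object identifier."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store the data as a new object and return its identifier."""
        obj = StoredObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        """Retrieve the object (data plus metadata) by its identifier."""
        return self._objects[object_id]
```

In this sketch an object is addressed only by its identifier rather than by a block address or a path in a hierarchy, which is the distinction the paragraph above draws between object, block, and file storage.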

[0151] The example storage system 306 shown in FIG. 3B may be implemented as a storage system in which additional storage resources can be added using a scale-up model, using a scale-out model, or through some combination thereof. In a scale-up model, additional storage may be added by adding additional storage devices. In a scale-out model, however, additional storage nodes may be added to a cluster of storage nodes, and such storage nodes may include additional processing resources, additional networking resources, and so on.

[0152] The storage system 306 shown in FIG. 3B also includes communications resources 310 that may be useful in facilitating data communication between components within the storage system 306, as well as between the storage system 306 and computing devices external to the storage system 306. The communications resources 310 may be configured to utilize a variety of different protocols and data communication fabrics to facilitate data communication between components within the storage system, as well as between the storage system and computing devices external to the storage system. For example, the communications resources 310 may include Fibre Channel ("FC") technologies, such as FC fabrics and FC protocols that may transport SCSI commands over an FC network. The communications resources 310 may also include FC over Ethernet ("FCoE") technologies, in which FC frames are encapsulated and transmitted over an Ethernet network. The communications resources 310 may also include InfiniBand ("IB") technologies, in which a switched fabric topology is utilized to facilitate transmission between channel adapters. The communications resources 310 may also include NVM Express ("NVMe") and NVMe over Fabrics ("NVMeoF") technologies, through which non-volatile storage media attached via a PCI Express ("PCIe") bus may be accessed. The communications resources 310 may also include mechanisms for accessing the storage resources 308 within the storage system 306 using Serial Attached SCSI ("SAS"), Serial ATA ("SATA") bus interfaces for connecting the storage resources 308 within the storage system 306 to host bus adapters within the storage system 306, Internet Small Computer Systems Interface ("iSCSI") technologies to provide block-level access to the storage resources 308 within the storage system 306, and other communications resources that may be useful in facilitating data communication between components within the storage system 306, as well as between the storage system 306 and computing devices external to the storage system 306.

[0153] The storage system 306 shown in FIG. 3B also includes processing resources 312 that may be useful in executing computer program instructions and performing other computational tasks within the storage system 306. The processing resources 312 may include one or more application-specific integrated circuits ("ASICs") that are customized for some particular purpose, as well as one or more central processing units ("CPUs"). The processing resources 312 may also include one or more digital signal processors ("DSPs"), one or more field-programmable gate arrays ("FPGAs"), one or more systems on a chip ("SoCs"), or other forms of processing resources 312. The storage system 306 may utilize the processing resources 312 to perform a variety of tasks, including, but not limited to, supporting the execution of the software resources 314 that are described in more detail below.

[0154] The storage system 306 shown in FIG. 3B also includes software resources 314 that, when executed by the processing resources 312 within the storage system 306, may perform various tasks. The software resources 314 may include, for example, one or more modules of computer program instructions that, when executed by the processing resources 312 within the storage system 306, are useful in carrying out various data protection techniques to preserve the integrity of data stored within the storage system. The reader will understand that such data protection techniques may be carried out, for example, by system software executing on computer hardware within the storage system, by a cloud services provider, or in other ways. Such data protection techniques may include, for example: data archiving techniques, in which data that is no longer actively used is moved to a separate storage device or separate storage system for long-term retention; data backup techniques, through which data stored in the storage system may be copied and stored in a distinct location to avoid data loss in the event of equipment failure or some other form of catastrophe affecting the storage system; data replication techniques, through which data stored in the storage system is replicated to another storage system so that the data is accessible via multiple storage systems; data snapshotting techniques, through which the state of data within the storage system is captured at various points in time; data and database cloning techniques, through which duplicate copies of data and databases may be created; and other data protection techniques. Through the use of such data protection techniques, business continuity and disaster recovery objectives may be met, because a failure of the storage system need not result in the loss of data stored in the storage system.

[0155] The software resources 314 may also include software that is useful in implementing software-defined storage ("SDS"). In such an example, the software resources 314 may include one or more modules of computer program instructions that, when executed, are useful in policy-based provisioning and management of data storage that is independent of the underlying hardware. Such software resources 314 may be useful in implementing storage virtualization to separate the storage hardware from the software that manages the storage hardware.

[0156] The software resources 314 may also include software that is useful in facilitating and optimizing I/O operations directed to the storage resources 308 within the storage system 306. For example, the software resources 314 may include software modules that carry out various data reduction techniques, such as data compression, data deduplication, and others. The software resources 314 may also include software modules that intelligently group I/O operations together to facilitate better use of the underlying storage resources 308, software modules that perform data migration operations to migrate data out of the storage system, and software modules that perform other functions. Such software resources 314 may be implemented as one or more software containers or in many other ways.
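By way of illustration only, the data reduction techniques named above (data deduplication and data compression) might be combined as in the following sketch. Fixed-size chunking, SHA-256 digests, and zlib compression are assumptions chosen for brevity and are not taken from the disclosure; the name `ReducingStore` is hypothetical.

```python
import hashlib
import zlib


class ReducingStore:
    """Hypothetical store that deduplicates fixed-size chunks and compresses them."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk digest -> compressed chunk bytes

    def write(self, data):
        """Store data and return a recipe: the ordered list of chunk digests."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Deduplication: an identical chunk is stored only once.
            if digest not in self.chunks:
                # Compression: each unique chunk is stored compressed.
                self.chunks[digest] = zlib.compress(chunk)
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Reassemble the original data from a recipe of chunk digests."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in recipe)
```

Writing sixteen bytes that contain the same four-byte pattern three times with `chunk_size=4`, for example, produces a four-entry recipe while only two unique chunks are retained.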

[0157] The reader will understand that the various components shown in FIG. 3B may be grouped into one or more optimized computing packages as converged infrastructures. Such converged infrastructures may include pools of computing, storage, and networking resources that can be shared by multiple applications and managed collectively using policy-driven processes. Such converged infrastructures may minimize compatibility issues between the various components within the storage system 306 while also reducing various costs associated with establishing and operating the storage system 306. Such converged infrastructures may be implemented with a converged infrastructure reference architecture, with standalone appliances, with a software-driven hyper-converged approach (e.g., hyper-converged infrastructures), or in other ways.

[0158] The reader will understand that the storage system 306 shown in FIG. 3B may be useful for supporting various types of software applications. For example, by providing storage resources to such applications, the storage system 306 may be useful in supporting artificial intelligence ("AI") applications, database applications, DevOps projects, electronic design automation tools, event-driven software applications, high-performance computing applications, simulation applications, high-speed data capture and analysis applications, machine learning applications, media production applications, media serving applications, picture archiving and communication systems ("PACS") applications, software development applications, virtual reality applications, augmented reality applications, and many other types of applications.

[0159] The storage systems described above may operate to support a wide variety of applications. Because the storage systems include compute resources, storage resources, and a wide variety of other resources, they may be well suited to support resource-intensive applications such as AI applications. Such AI applications may enable a device to perceive its environment and take actions that maximize its chance of success at some goal. Examples of such AI applications may include IBM Watson, Microsoft Oxford, Google DeepMind, Baidu Minwa, and others. The storage systems described above may also be well suited to support other types of resource-intensive applications, such as machine learning applications. Machine learning applications may perform various types of data analysis to automate analytical model building. Using algorithms that iteratively learn from data, machine learning applications can enable computers to learn without being explicitly programmed.

[0160] In addition to the resources already described, the storage systems described above may also include graphics processing units ("GPUs"), sometimes referred to as visual processing units ("VPUs"). Such GPUs may be implemented as specialized electronic circuits that rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Such GPUs may be included within any of the computing devices that are part of the storage systems described above, including as one of many individually scalable components of a storage system. Other examples of individually scalable components of such a storage system may include storage components, memory components, compute components (e.g., CPUs, FPGAs, ASICs), networking components, software components, and others. In addition to GPUs, the storage systems described above may also include neural network processors ("NNPs") for use in various aspects of neural network processing. Such NNPs may be used in place of (or in addition to) GPUs and may be independently scalable.

[0161] As described above, the storage systems described herein may be configured to support artificial intelligence applications, machine learning applications, big data analytics applications, and many other types of applications. The rapid growth in these sorts of applications is being driven by three technologies: deep learning (DL), GPU processors, and big data. Deep learning is a computing model that makes use of massively parallel neural networks inspired by the human brain. Instead of experts handcrafting software, a deep learning model writes its own software by learning from many examples. A GPU is a modern processor with thousands of cores, well suited to running algorithms that loosely represent the parallel nature of the human brain.

[0162] Advances in deep neural networks have ignited a new wave of algorithms and tools for data scientists to tap into their data with artificial intelligence (AI). With improved algorithms, larger datasets, and various frameworks (including open-source software libraries for machine learning across a range of tasks), data scientists are tackling new use cases such as autonomous driving vehicles, natural language processing, and many others. Training deep neural networks, however, requires both high-quality input data and large amounts of computation. GPUs are massively parallel processors capable of operating on large amounts of data simultaneously. When GPUs are combined into a multi-GPU cluster, a high-throughput pipeline may be required to feed input data from storage to the compute engines. Deep learning is more than just constructing and training models. There is also an entire data pipeline that must be designed for the scale, iteration, and experimentation that a data science team needs in order to succeed.

[0163] Data is the heart of modern AI and deep learning algorithms. Before training can begin, one problem that must be addressed revolves around collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment may be required to continuously collect, clean, transform, label, and store large amounts of data. Adding additional high-quality data points translates directly into more accurate models and better insights. Data samples may undergo a series of processing steps, including, but not limited to: 1) ingesting the data from an external source into the training system and storing the data in raw form; 2) cleaning and transforming the data into a format convenient for training, including linking data samples to the appropriate labels; 3) exploring parameters and models, quickly testing with a smaller dataset, and iterating to converge on the most promising models to push into the production cluster; 4) executing training phases to select random batches of input data, including both new and older samples, and feeding those batches into production GPU servers for computation to update model parameters; and 5) evaluating, including using a holdback portion of the data not used in training in order to evaluate model accuracy on the holdout data. This lifecycle may apply to any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks may rely on CPUs instead of GPUs, but the data ingest and training workflows may be the same. The reader will understand that a single shared storage data hub creates a coordination point throughout the lifecycle without the need for extra data copies among the ingest, preprocessing, and training stages. Ingested data is rarely used for only one purpose, and shared storage gives the flexibility to train multiple different models or to apply traditional analytics to the data.
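By way of illustration only, steps 4) and 5) above (selecting random batches of input data for training, and holding back a portion of the data for evaluation) might be sketched as follows. The function names and the use of Python's standard `random` module are assumptions made for this sketch; a real training system would read samples from the shared storage data hub rather than from an in-memory list.

```python
import random


def split_holdout(samples, holdout_fraction=0.2, seed=0):
    """Return (training, holdout); the holdout portion is never used in training."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def random_batches(training, batch_size, num_batches, seed=0):
    """Yield random batches drawn from all training samples, new and old alike."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        # Each batch is sampled without replacement from the full training set,
        # so every stored sample must remain readable at any time.
        yield rng.sample(training, min(batch_size, len(training)))
```

Because each batch is drawn from the entire training set, every sample may be read at any time, which is one reason random read performance of the storage layer matters for this workflow.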

[0164] The reader will understand that each stage in the AI data pipeline may have varying requirements from the data hub (e.g., the storage system or collection of storage systems). Scale-out storage systems must deliver uncompromising performance for all manner of access types and patterns: from small, metadata-heavy files to large files, from random to sequential access patterns, and from low to high concurrency. The storage systems described above may serve as an ideal AI data hub because the systems may service unstructured workloads. In the first stage, data is ideally ingested and stored on the same data hub that the following stages will use, in order to avoid excess data copying. The next two steps can be done on standard compute servers that optionally include a GPU, and then, in the fourth and last stage, full training production jobs are run on powerful GPU-accelerated servers. Often, there is a production pipeline alongside an experimental pipeline operating on the same dataset. Further, the GPU-accelerated servers can be used independently for different models or joined together to train on one larger model, even spanning multiple systems for distributed training. If the shared storage tier is slow, data must be copied to local storage for each phase, resulting in time wasted staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to that of data stored locally on the server node, while also having the simplicity and performance to enable all pipeline stages to operate concurrently.

[0165] A data scientist works to improve the usefulness of the trained model through a wide variety of approaches: more data, better data, smarter training, and deeper models. In many cases, there are teams of data scientists sharing the same datasets and working in parallel to produce new and improved training models. Often, a team of data scientists works within these phases concurrently on the same shared datasets. Multiple concurrent workloads of data processing, experimentation, and full-scale training layer the demands of multiple access patterns on the storage tier. In other words, storage cannot merely satisfy large file reads, but must contend with a mix of large and small file reads and writes. Finally, with multiple data scientists exploring datasets and models, it can be critical to store data in its native format to provide flexibility for each user to transform, clean, and use the data in a unique way. The storage systems described above may provide a natural shared storage home for the dataset, with data protection redundancy (e.g., by using RAID 6) and the performance necessary to be a common access point for multiple developers and multiple experiments. Using the storage systems described above may avoid the need to carefully copy subsets of the data for local work, saving both engineering time and GPU-accelerated server use time. These copies become a constant and growing burden as the raw dataset and the desired transformations continually update and change.

[0166] The reader will understand that a fundamental reason why deep learning has seen a surge in success is the continued improvement of models with larger dataset sizes. In contrast, classical machine learning algorithms, such as logistic regression, stop improving in accuracy at smaller dataset sizes. As such, the separation of compute resources and storage resources may also allow independent scaling of each tier, avoiding many of the complexities inherent in managing both together. As the dataset size grows or new datasets are considered, a scale-out storage system must be able to expand easily. Similarly, if more concurrent training is required, additional GPUs or other compute resources can be added without concern for their internal storage. Furthermore, the storage systems described above may make building, operating, and growing an AI system easier due to: the random read bandwidth provided by the storage systems; the ability of the storage systems to randomly read small files (50 KB) at high rates (meaning that no extra effort is required to aggregate individual data points to make larger, storage-friendly files); the ability of the storage systems to scale capacity and performance as either the dataset grows or the throughput requirements grow; the ability of the storage systems to support files or objects; the ability of the storage systems to tune performance for large or small files (i.e., there is no need for the user to provision file systems); the ability of the storage systems to support non-disruptive upgrades of hardware and software even during production model training; and many other reasons.

[0167] Small-file performance of the storage tier may be critical because many types of inputs, including text, audio, or images, are natively stored as small files. If the storage tier does not handle small files well, an extra step is required to preprocess and group samples into larger files. Storage built on top of spinning disks that relies on SSDs as a caching tier may fall short of the performance needed. Because training with random input batches results in more accurate models, the entire dataset must be accessible at full performance. SSD caches provide high performance only for a small subset of the data and are ineffective at hiding the latency of spinning drives.

[0168] The reader will understand that the storage systems described above may be configured to support the storage of blockchains (among other types of data). Such a blockchain may be implemented as a continuously growing list of records, called blocks, which are linked and secured using cryptography. Each block in a blockchain may contain a hash pointer as a link to the previous block, a timestamp, transaction data, and so on. A blockchain may be designed to be resistant to modification of the data and can serve as an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way. This makes blockchains potentially suitable for the recording of events, medical records, and other records management activities, such as identity management, transaction processing, and others.
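By way of illustration only, the block structure described above (a hash pointer to the previous block, a timestamp, and transaction data) might be sketched as follows. SHA-256 over a JSON serialization is an assumption chosen for this sketch, and the function names are hypothetical; the sketch omits consensus, signatures, and distribution entirely.

```python
import hashlib
import json
import time


def block_hash(block):
    """Hash a block's full contents (hypothetical serialization: sorted-key JSON)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions):
    """Append a block that links to the previous block through a hash pointer."""
    block = {
        "prev_hash": block_hash(chain[-1]) if chain else None,
        "timestamp": time.time(),
        "transactions": transactions,
    }
    chain.append(block)
    return block


def verify_chain(chain):
    """Check that every block's hash pointer matches the block before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )
```

Altering the transaction data of any block other than the last changes that block's hash, so the hash pointer stored in the following block no longer matches and `verify_chain` returns `False`; this is the sense in which such a list is resistant to modification.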

[0169] The reader will further understand that, in some embodiments, the storage systems described above may be paired with other resources to support the applications described above. For example, one infrastructure could include primary compute in the form of servers and workstations that specialize in using general-purpose computing on graphics processing units ("GPGPU") to accelerate deep learning applications and that are interconnected into a computation engine to train parameters for deep neural networks. Each system may have Ethernet external connectivity, InfiniBand external connectivity, some other form of external connectivity, or some combination thereof. In such an example, the GPUs can be grouped for a single large training run or used independently to train multiple models. The infrastructure could also include a storage system, such as those described above, to provide, for example, a scale-out all-flash file or object store through which data can be accessed via high-performance protocols such as NFS, S3, and so on. The infrastructure can also include, for example, redundant top-of-rack Ethernet switches connected to storage and compute via ports in MLAG port channels for redundancy. The infrastructure could also include additional compute in the form of whitebox servers, optionally with GPUs, for data ingestion, preprocessing, and model debugging. The reader will understand that additional infrastructures are also possible.

[0170] The reader will understand that the systems described above may be better suited for the applications described above than other systems, which may include, for example, a distributed direct-attached storage (DDAS) solution deployed in server nodes. Such DDAS solutions may be built for handling large, less sequential accesses but may be less able to handle small, random accesses. The reader will further understand that the storage systems described above may be used to provide a platform for the applications described above that is preferable to the use of cloud-based resources, because the storage systems may be included in an on-site or in-house infrastructure that is more secure, more locally and internally managed, more robust in feature sets and performance, or otherwise preferable to the use of cloud-based resources as part of a platform to support those applications. For example, services built on platforms such as IBM's Watson may require a business enterprise to distribute individual user information, such as financial transaction information or identifiable patient records, to other institutions. As such, cloud-based offerings of AI as a service may be less desirable than internally managed and offered AI as a service that is supported by storage systems such as those described above, for a wide array of technical reasons as well as for various business reasons.

[0171] The reader will understand that the storage systems described above, either alone or in coordination with other computing machinery, may be configured to support other AI-related tools. For example, the storage systems may make use of tools such as ONNX or other open neural network exchange formats that make it easier to transfer models written in different AI frameworks. Likewise, the storage systems may be configured to support tools such as Amazon's Gluon that allow developers to prototype, build, and train deep learning models.

[0172] The reader will further understand that the storage systems described above may also be deployed as an edge solution. Such an edge solution may be in place to optimize cloud computing systems by performing data processing at the edge of the network, near the source of the data. Edge computing can push applications, data, and computing power (i.e., services) away from centralized points to the logical extremes of a network. Through the use of edge solutions such as the storage systems described above, computational tasks may be performed using the compute resources provided by such storage systems, data may be stored using the storage resources of the storage system, and cloud-based services may be accessed through the use of various resources of the storage system (including networking resources). By performing computational tasks on the edge solution, storing data on the edge solution, and generally making use of the edge solution, the consumption of expensive cloud-based resources may be avoided and, in fact, performance improvements may be experienced relative to a heavier reliance on cloud-based resources.

[0173] While many tasks may benefit from the use of an edge solution, some particular uses may be especially suited for deployment in such an environment. For example, devices such as drones, autonomous cars, and robots may require extremely rapid processing, so fast, in fact, that sending data up to a cloud environment and back to receive data processing support may simply be too slow. Likewise, machines such as locomotives and gas turbines that generate large amounts of information through the use of a wide array of data-generating sensors may benefit from the rapid data processing capabilities of an edge solution. As an additional example, some IoT devices, such as connected video cameras, may not be well suited for the use of cloud-based resources, because it may be impractical to send the data to the cloud simply because of the sheer volume of data involved, and not only from a privacy, security, or financial perspective. As such, many tasks that rely on data processing, storage, or communications may be better served by platforms that include edge solutions such as the storage systems described above.

[0174] Consider a specific example of inventory management in a warehouse, distribution center, or similar location. A large inventory, warehousing, shipping, order-fulfillment, manufacturing, or other operation has a large amount of inventory on inventory shelves, and high-resolution digital cameras that produce a firehose of large data. All of this data may be taken into an image processing system, which may reduce the amount of data to a firehose of small data. All of the small data may be stored on-premises in storage. The on-premises storage, at the edge of the facility, may be coupled to the cloud for external reports, real-time control, and cloud storage. Inventory management may be performed with the results of the image processing, so that inventory can be tracked on the shelves and restocked, moved, shipped, modified with new products, or have discontinued or obsolete products removed, and so on. The above scenario is a prime candidate for an embodiment of the configurable processing and storage systems described above. A combination of compute-only blades and offload blades suited for the image processing, perhaps with deep learning on offload FPGAs or offload custom blade(s), could take in the firehose of large data from all of the digital cameras and produce the firehose of small data. All of the small data could then be stored by storage nodes, operating with storage units in whichever combination of types of storage blades best handles the data flow. This is an example of storage and function acceleration and integration. Depending on external communication needs with the cloud and external processing in the cloud, and depending on the reliability of network connections and cloud resources, the system could be sized for storage and compute management with bursty workloads and variable connectivity reliability. Also, depending on other inventory management aspects, the system could be configured for scheduling and resource management in a hybrid edge/cloud environment.

[0175] The storage systems described above may also be optimized for use in big data analytics. Big data analytics may be generally described as the process of examining large and varied data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. Big data analytics applications enable data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional business intelligence (BI) and analytics programs. As part of that process, semi-structured and unstructured data, such as internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile-phone call-detail records, IoT sensor data, and other data, may be converted to a structured form. Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analyses powered by high-performance analytics systems.

[0176] The storage systems described above may also support (including implementing as a system interface) applications that perform tasks in response to human speech. For example, the storage systems may support the execution of intelligent personal assistant applications such as Amazon's Alexa, Apple Siri, Google Voice, Samsung Bixby, Microsoft Cortana, and others. While the examples in the previous sentence use voice as input, the storage systems described above may also support chatbots, talkbots, chatterbots, artificial conversational entities, or other applications that are configured to conduct a conversation via auditory or textual methods. Likewise, the storage system may actually execute such an application so that a user, such as a system administrator, can interact with the storage system via speech. Such applications are generally capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information such as news, although in embodiments in accordance with the present disclosure such applications may be used as interfaces to various system management operations.

[0177] The storage systems described above may also implement AI platforms for delivering on the vision of self-driving storage. Such AI platforms may be configured to deliver global predictive intelligence by collecting and analyzing large amounts of storage system telemetry data points to enable effortless management, analytics, and support. In fact, such storage systems may be capable of predicting both capacity and performance, as well as generating intelligent advice on workload deployment, interaction, and optimization. Such AI platforms may be configured to scan all incoming storage system telemetry data against a library of issue fingerprints to predict and resolve incidents in real time, before they impact customer environments, and to capture hundreds of performance-related variables that are used to forecast performance load.

[0178] For further explanation, FIG. 4 sets forth a block diagram illustrating a plurality of storage systems (402, 404, 406) that support a pod according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (402, 404, 406) depicted in FIG. 4 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems (402, 404, 406) depicted in FIG. 4 may include the same, fewer, or additional components as the storage systems described above.

[0179] In the example depicted in FIG. 4, each of the storage systems (402, 404, 406) is depicted as having at least one computer processor (408, 410, 412), computer memory (414, 416, 418), and computer storage (420, 422, 424). In some embodiments, the computer memory (414, 416, 418) and the computer storage (420, 422, 424) may be part of the same hardware devices, while in other embodiments they may be part of different hardware devices. The distinction between the computer memory (414, 416, 418) and the computer storage (420, 422, 424) in this particular example may be that the computer memory (414, 416, 418) is physically proximate to the computer processors (408, 410, 412) and may store computer program instructions that are executed by the computer processors (408, 410, 412), while the computer storage (420, 422, 424) is embodied as non-volatile storage for storing user data, metadata describing the user data, and so on. Referring to the example above in FIG. 1A, for example, the computer processors (408, 410, 412) and computer memory (414, 416, 418) for a particular storage system (402, 404, 406) may reside within one or more of the controllers (110A-110D), while the attached storage devices (171A-171F) may serve as the computer storage (420, 422, 424) within a particular storage system (402, 404, 406).

[0180] In the example depicted in FIG. 4, the depicted storage systems (402, 404, 406) may attach to one or more pods (430, 432) according to some embodiments of the present disclosure. Each of the pods (430, 432) depicted in FIG. 4 can include a dataset (426, 428). For example, a first pod (430) that three storage systems (402, 404, 406) have attached to includes a first dataset (426), while a second pod (432) that two storage systems (404, 406) have attached to includes a second dataset (428). In such an example, when a particular storage system attaches to a pod, the pod's dataset is copied to that storage system and then kept up to date as the dataset is modified. Storage systems can also be removed from a pod, with the result that the dataset is no longer kept up to date on the removed storage system. In the example depicted in FIG. 4, any storage system that is active for a pod (that is, an up-to-date, operating, non-faulted member of a non-faulted pod) can receive and process requests to modify or read the pod's dataset.

[0181] In the example depicted in FIG. 4, each pod (430, 432) may also include a set of managed objects and management operations, as well as a set of access operations to modify or read the dataset (426, 428) that is associated with the particular pod (430, 432). In such an example, the management operations may modify or query managed objects equivalently through any of the storage systems. Likewise, access operations to read or modify the dataset may operate equivalently through any of the storage systems. In such an example, each storage system stores a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by that storage system, yet operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management operations that query the pod and in subsequent access operations that read the dataset.
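
For further explanation, the behavior described in the two preceding paragraphs can be sketched in Python as follows. This is a simplified illustration, not the replication mechanism of the embodiments: the class name `Pod`, its methods, and the single `dataset` dictionary standing in for the pod's content are all assumptions made for the sketch, and fault handling is omitted.

```python
class Pod:
    """Toy model: every attached storage system holds its own full copy."""

    def __init__(self, dataset=None):
        self.dataset = dict(dataset or {})  # stand-in for the pod's current content
        self.copies = {}                    # attached system -> its own copy
        self.removed = {}                   # removed system -> copy as of removal

    def attach(self, system):
        # On attach, the pod's dataset is copied to the attaching system.
        self.removed.pop(system, None)
        self.copies[system] = dict(self.dataset)

    def remove(self, system):
        # A removed system keeps whatever it had; it is no longer updated.
        self.removed[system] = self.copies.pop(system)

    def write(self, via, key, value):
        # A modification received by any attached system is applied to
        # every attached copy before it is treated as complete.
        if via not in self.copies:
            raise ValueError(f"{via} is not attached to the pod")
        self.dataset[key] = value
        for copy in self.copies.values():
            copy[key] = value

    def read(self, via, key):
        if via not in self.copies:
            raise ValueError(f"{via} is not attached to the pod")
        return self.copies[via].get(key)
```

For example, after `pod.write("402", "vol1", "v1")` a read of `"vol1"` through `"406"` returns `"v1"`, and a system removed from the pod retains only the content it held when it was removed.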

[0182] The reader will understand that pods may implement more capabilities than just a clustered, synchronously replicated dataset. For example, pods can be used to implement tenants, whereby datasets are in some way securely isolated from each other. Pods can also be used to implement virtual arrays or virtual storage systems, where each pod is presented as a unique storage entity on a network (e.g., a storage area network or an Internet Protocol network) with separate addresses. In the case of a multi-storage-system pod implementing a virtual storage system, all physical storage systems associated with the pod may present themselves in some way as the same storage system (e.g., as if the multiple physical storage systems were no different from multiple network ports into a single storage system).

[0183] The reader will understand that pods may also be units of administration, representing a collection of volumes, file systems, object/analytic stores, snapshots, and other administrative entities, where making administrative changes on any one storage system (e.g., name changes, property changes, or managing exports or permissions for some part of the pod's dataset) is automatically reflected on all active storage systems associated with the pod. In addition, pods may also be units of data collection and data analysis, where performance and capacity metrics are presented in ways that aggregate across all active storage systems for the pod, or that break out data collection and analysis separately for each pod, or perhaps that present each attached storage system's contribution to the incoming content and performance for each pod.

[0184] One model for pod membership may be defined as a list of storage systems, together with a subset of that list whose storage systems are considered to be in-sync for the pod. A storage system may be considered to be in-sync for a pod if it is at least within a recovery of having identical idle content for the last written copy of the dataset associated with the pod. Idle content is the content after any in-progress modifications have completed with no processing of new modifications. This is sometimes referred to as 'crash recoverable' consistency. Recovery of a pod carries out the process of reconciling differences in applying concurrent updates to the in-sync storage systems in the pod. Recovery can resolve any inconsistencies between storage systems in the completion of concurrent modifications that had been requested of various members of the pod but that were not signaled to any requestor as having completed successfully. A storage system that is listed as a pod member but that is not listed as in-sync for the pod can be described as 'detached' from the pod. A storage system that is listed as a pod member, is in-sync for the pod, and is currently available for actively serving data for the pod is 'online' for the pod.
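
For further explanation, the membership model above can be sketched in Python as a member list plus an in-sync subset. The class name `PodMembership`, the method `status`, and the label 'in-sync but unavailable' are hypothetical names chosen for this sketch; only 'detached' and 'online' are terms used in the description.

```python
from dataclasses import dataclass, field


@dataclass
class PodMembership:
    members: set = field(default_factory=set)  # every storage system listed for the pod
    in_sync: set = field(default_factory=set)  # subset of members considered in-sync

    def __post_init__(self):
        # The in-sync list is defined as a subset of the member list.
        if not self.in_sync <= self.members:
            raise ValueError("in-sync systems must also be pod members")

    def status(self, system, available):
        """Classify one storage system; `available` means it can serve data now."""
        if system not in self.members:
            return "not a member"
        if system not in self.in_sync:
            return "detached"
        return "online" if available else "in-sync but unavailable"
```

With `members={"402", "404", "406"}` and `in_sync={"402", "404"}`, storage system `"406"` is classified as detached regardless of whether it is currently reachable.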

[0185] Each storage system member of a pod may have its own copy of the membership, including which storage systems it last knew were in-sync and which storage systems it last knew made up the entire set of pod members. To be online for a pod, a storage system must consider itself to be in-sync for the pod and must be communicating with all other storage systems it considers to be in-sync for the pod. If a storage system cannot be certain that it is in-sync and communicating with all other storage systems that are in-sync, then it must stop processing new incoming requests for the pod (or must complete them with an error or exception) until it can be certain that it is in-sync and communicating with all other storage systems that are in-sync. A first storage system may conclude that a second, paired storage system should be detached, which allows the first storage system to continue because it is then in-sync with all storage systems remaining in its list. The second storage system, however, must be prevented from concluding the opposite, namely that the first storage system should be detached while the second storage system continues operating. That would result in a 'split brain' condition that can lead to irreconcilable datasets, dataset corruption, or application corruption, among other dangers.
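
For further explanation, the online condition described above can be sketched in Python as a single check over a storage system's local view. The function name `may_serve_requests` and its parameters are hypothetical; the sketch only expresses the stated rule and does not show how a system safely decides to detach a peer.

```python
def may_serve_requests(self_id, in_sync_view, reachable):
    """Return True if this storage system may process new requests for the pod.

    self_id      -- identifier of the storage system making the decision
    in_sync_view -- systems this system's local copy lists as in-sync
    reachable    -- other systems it is currently communicating with
    """
    # The system must consider itself in-sync for the pod ...
    if self_id not in in_sync_view:
        return False
    # ... and must be communicating with every other system it
    # considers in-sync.
    others = set(in_sync_view) - {self_id}
    return others <= set(reachable)
```

If system A has safely detached system B, then A's view is `{"A"}` and A may continue alone, while B, whose view still lists A as in-sync but which cannot reach A, must stop serving requests. Keeping B from reaching the opposite conclusion is what avoids the split-brain condition.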

[0186] Situations in which a storage system must determine how to proceed when it is not communicating with paired storage systems can arise while the storage system is running normally and then notices lost communications, while it is recovering from some previous fault, while it is rebooting or resuming from a temporary power loss or a recovered communication outage, while it is switching operations from one set of storage system controllers to another for whatever reason, or during or after any combination of these or other kinds of events. In fact, any time a storage system that is associated with a pod cannot communicate with all known non-detached members, the storage system can wait briefly until communications can be established, go offline and continue waiting, or determine through some means that it is safe to detach the non-communicating storage system, without risk of incurring a split brain because the non-communicating storage system reaches the opposite conclusion, and then continue. If a safe detach can happen quickly enough, the storage system can remain online for the pod with little more than a short delay and with no resulting application outages for applications that can issue requests to the remaining online storage systems.

[0187] One example of this situation is when a storage system may know that it is out of date. That can happen, for example, when a first storage system is first added to a pod that is already associated with one or more storage systems, or when a first storage system reconnects to another storage system and finds that the other storage system had already marked the first storage system as detached. In this case, the first storage system simply waits until it connects to some other set of storage systems that are in-sync for the pod.

[0188] This model demands some degree of consideration for how storage systems are added to or removed from pods or from the in-sync pod member list. Because each storage system has its own copy of the list, because two independent storage systems cannot update their local copies at exactly the same time, and because the local copy is all that is available on a reboot or in various fault scenarios, care must be taken to ensure that transient inconsistencies do not cause problems. For example, suppose one storage system is in-sync for a pod and a second storage system is added. If the second storage system is the first to be updated to list both storage systems as in-sync, and there is then a fault and a restart of both storage systems, the second might start up and wait to connect to the first storage system, while the first might be unaware that it should or could wait for the second storage system. If the second storage system then responds to its inability to connect with the first storage system by going through a process to detach it, it might succeed in completing a process that the first storage system is unaware of, resulting in a split brain. As such, it may be necessary to ensure that storage systems will not disagree inappropriately on whether they might opt to go through a detach process if they are not communicating.

［0189］ストレージシステムが、それらが通信していない場合に、それらがデタッチプロセスを経ることを選ぶ可能性があるかどうかに関して不適切に同意しないことがないことを確実にする１つの方法は、ポッドのための同期メンバーリストに新しいストレージシステムを加えるとき、新しいストレージシステムが最初に、それがデタッチされたメンバーであること（及びおそらく、それが同期中のメンバーとして加えられていること）を記憶することを確実にすることである。次いで、既存の同期中のストレージシステムは、新しいストレージシステムがその同じ事実をローカルに記憶する前に、新しいストレージシステムが同期中のポッドメンバーであることをローカルに記憶できる。新しいストレージシステムがその同期ステータスを記憶する前に、リブート又はネットワーク停止状態の集合がある場合、次いで元のストレージシステムは、通信がないために新しいストレージシステムをデタッチしてよいが、新しいストレージシステムは待機する。通信しているストレージシステムをポッドから削除するために、この変更の逆のバージョンが必要とされる可能性がある。最初に、削除中のストレージシステムは、それがもはや同期していないことを記憶し、次いで残るストレージシステムは、削除中のストレージシステムがもはや同期していないことを記憶し、次いですべてのストレージシステムが、そのポッドメンバーシップリストから、削除中のストレージシステムを削除する。実施態様によっては、中間の持続デタッチ状態は必要ではない場合がある。メンバーシップリストのローカルコピーで注意が必要とされるかどうかは、互いをモニタするために、又はそのメンバーシップを確証するためにモデルストレージシステムに依存する場合がある。コンセンサスモデルが両方に使用される場合、又は外部システム（若しくは外部の分散システムあるいはクラスタ化されたシステム）が、ポッドメンバーシップを記憶し、確証するために使用される場合、次いでローカルに記憶されるメンバーシップリストの不一致は問題にならない場合がある。 [0189] One way to ensure that storage systems do not inappropriately disagree regarding whether they may choose to undergo the detach process if they are not communicating is to ensure that when adding a new storage system to the in-sync member list for a pod, the new storage system first remembers that it is a detached member (and possibly that it has been added as an in-sync member). Existing in-sync storage systems can then locally remember that the new storage system is an in-sync pod member before the new storage system remembers that same fact locally. If there is a reboot or network outage set up before the new storage system remembers its in-sync status, then the original storage system may detach the new storage system due to lack of communication, while the new storage system waits. To remove a communicating storage system from a pod, the reverse version of this change may be required. 
First, the storage system being removed stores that it is no longer in-sync; then the remaining storage systems store that the storage system being removed is no longer in-sync; then all storage systems delete the storage system being removed from their pod membership lists. Depending on the implementation, an intermediate persisted detached state may not be necessary. Whether care is required with local copies of membership lists may depend on the model that storage systems use to monitor each other or to validate their membership. If a consensus model is used for both, or if an external system (or an external distributed or clustered system) is used to store and validate pod membership, then inconsistencies in locally stored membership lists may not matter.
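The ordering described in paragraph [0189] for adding a member can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `State`, `StorageSystem`, `add_member`, and `may_detach_peers` are hypothetical, persistent storage is modeled as an in-memory dictionary, and a reboot or outage is simulated by stopping after a given step.

```python
from enum import Enum


class State(Enum):
    DETACHED = "detached"
    IN_SYNC = "in-sync"


class StorageSystem:
    """Holds one system's local (notionally persisted) view of pod membership."""

    def __init__(self, name):
        self.name = name
        # Stand-in for durable storage; a real system would persist this.
        self.members = {name: State.IN_SYNC}

    def record(self, member, state):
        self.members[member] = state

    def may_detach_peers(self):
        # Only a system that records itself as in-sync may start a detach.
        return self.members.get(self.name) is State.IN_SYNC


def add_member(existing, new, fail_after_step=None):
    """Add `new` to the pod in the order given in paragraph [0189].

    `fail_after_step` (1 or 2) simulates a reboot or outage after that step.
    """
    # Step 1: the new system first records itself as a detached member.
    new.members = {new.name: State.DETACHED}
    for system in existing:
        new.record(system.name, State.IN_SYNC)
    if fail_after_step == 1:
        return
    # Step 2: the existing in-sync systems record the new system as in-sync.
    for system in existing:
        system.record(new.name, State.IN_SYNC)
    if fail_after_step == 2:
        return
    # Step 3: only now does the new system record itself as in-sync.
    new.record(new.name, State.IN_SYNC)
```

If the sequence is interrupted after step 1 or step 2, only the original system regards itself as entitled to detach a peer, while the new system still records itself as detached and therefore waits, which is the property the paragraph relies on to avoid a split brain.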

[0190] When communication fails, when one or more storage systems in a pod fail, or when a storage system starts up (or fails over to a secondary controller) and cannot communicate with its paired storage systems for the pod, and one or more storage systems decide to detach one or more paired storage systems, some algorithm or mechanism must be used to decide that it is safe to do so and to follow through on the detach. One means of resolving detaches is to use a majority (or quorum) model for membership. With three storage systems, as long as two are communicating, they can agree to detach a third storage system that is not communicating, but that third storage system cannot by itself choose to detach either of the other two. Confusion can arise when storage system communication is inconsistent. For example, storage system A might be communicating with storage system B but not with C, while storage system B might be communicating with both A and C. In that case, either A and B could detach C, or B and C could detach A, but more communication between pod members may be needed to work this out.
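As a rough illustration of the majority model in paragraph [0190], the following Python sketch checks whether one storage system, together with the peers it can currently reach, forms a majority that excludes the system to be detached. The function name `can_detach` and the `reachable_from` connectivity map are assumptions introduced for this example only.

```python
def can_detach(members, reachable_from, initiator, target):
    """Return True if `initiator` sees a majority that excludes `target`.

    members: set of pod member names.
    reachable_from: dict mapping each member to the set of members it can
        currently communicate with (a hypothetical connectivity snapshot).
    """
    # The initiator plus every pod member it can reach forms its group.
    group = {initiator} | (reachable_from.get(initiator, set()) & members)
    if target in group:
        return False  # still communicating with the target
    # A strict majority of the full membership is required.
    return len(group) > len(members) // 2
```

With members A, B, and C, where A reaches only B, B reaches both A and C, and C reaches only B, this local check succeeds both for A detaching C and for C detaching A. That is the inconsistent-communication case noted in the paragraph, and it shows why a purely local check is not sufficient and further communication between pod members may be needed.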

[0191] Care must be taken in a quorum membership model when adding and removing storage systems. For example, if a fourth storage system is added, then a "majority" of storage systems is at that point three. The transition from three storage systems (with two required for a majority) to a pod that includes a fourth storage system (with three required for a majority) may require something similar to the model described above for carefully adding a storage system to the in-sync list. For example, the fourth storage system might start in an attaching state, not yet attached, in which it would never instigate a vote over quorum. Once it is in that state, the original three pod members could each be updated to be aware of the fourth member and of the new requirement for a three-storage-system majority to detach the fourth. Removing a storage system from a pod might similarly move that storage system to a locally stored "detached" state before the other pod members are updated. A variant of this scheme is to use a distributed consensus mechanism such as PAXOS or RAFT to implement any membership changes or to process detach requests.
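The arithmetic in paragraph [0191] can be made concrete with a short Python sketch. The helper names `quorum_size` and `may_instigate_vote` and the string state labels are hypothetical and serve only to illustrate the transition from three to four members.

```python
ATTACHING = "attaching"
ATTACHED = "attached"


def quorum_size(member_count):
    """Smallest number of members that forms a strict majority."""
    return member_count // 2 + 1


def may_instigate_vote(states, member):
    """A member that is still attaching never starts a quorum vote."""
    return states.get(member) == ATTACHED


# Three attached members, plus a fourth that is still attaching.
states = {"A": ATTACHED, "B": ATTACHED, "C": ATTACHED, "D": ATTACHING}
```

Here `quorum_size(3)` is 2 and `quorum_size(4)` is 3, and the fourth member cannot start a vote until its state is changed to attached.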

[0192] Another means of managing membership transitions is to use an external system, outside of the storage systems themselves, to handle pod membership. To come online for a pod, a storage system must first contact the external pod membership system to verify that it is in-sync for the pod. Any storage system that is online for a pod should then remain in communication with the pod membership system and should wait or go offline if it loses communication. An external pod membership manager could be implemented as a highly available cluster using various cluster tools, such as Oracle RAC, Linux HA, VERITAS Cluster Server, IBM's HACMP, or others. An external pod membership manager could also use distributed configuration tools such as Etcd or Zookeeper, or a reliable distributed database such as Amazon's DynamoDB.
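The interaction with an external pod membership manager described in paragraph [0192] might look roughly like the Python sketch below. The `MembershipService` class is an in-memory stand-in, not the API of Etcd, Zookeeper, DynamoDB, or any cluster tool, and `next_state` is a hypothetical helper.

```python
class MembershipService:
    """In-memory stand-in for an external pod membership manager."""

    def __init__(self, in_sync_by_pod):
        self.in_sync_by_pod = in_sync_by_pod  # pod name -> set of system names
        self.reachable = True

    def is_in_sync(self, pod, system):
        if not self.reachable:
            raise ConnectionError("membership service unreachable")
        return system in self.in_sync_by_pod.get(pod, set())


def next_state(service, pod, system):
    """Decide whether `system` may be online for `pod`.

    The system is online only if the external service confirms that it is
    in-sync; if the service cannot be reached, the system waits.
    """
    try:
        in_sync = service.is_in_sync(pod, system)
    except ConnectionError:
        return "waiting"
    return "online" if in_sync else "offline"
```

In this sketch a storage system would call `next_state` before coming online and again whenever its connection to the service is in doubt.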

[0193] In the example shown in FIG. 4, in accordance with some embodiments of the present disclosure, the illustrated storage systems (402, 404, 406) may also receive a request to read a portion of the dataset (426, 428) and process that request locally. The reader will appreciate that, because the dataset (426, 428) should be consistent across all storage systems (402, 404, 406) in the pod, a request to modify the dataset (426, 428) (for example, a write operation) requires coordination among the storage systems (402, 404, 406) in the pod, whereas responding to a request to read a portion of the dataset (426, 428) does not require similar coordination. Thus, a particular storage system that receives a read request may service the request locally by reading the portion of the dataset (426, 428) that is stored within its own storage devices, with no synchronous communication with other storage systems in the pod. Read requests received by one storage system for a replicated dataset in a replicated cluster are expected to avoid any communication in the vast majority of cases, at least when received by a storage system that is running nominally within a cluster that is also running nominally.
Such reads should typically be successfully processed by simply reading from the local copy of the clustered data, with no additional interaction with other storage systems in the cluster required.
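The contrast drawn in paragraph [0193] between coordinated writes and local reads can be sketched as follows. This is a deliberately simplified Python model, assuming a hypothetical `PodMember` class in which a write is pushed to every peer before it completes and a read touches only the local copy; it omits failure handling and ordering.

```python
class PodMember:
    """Toy model of one storage system holding a copy of the dataset."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}          # local copy of the dataset: lba -> data
        self.peers = []           # other in-sync members of the pod
        self.messages_sent = 0    # counts peer communication only

    def read(self, lba):
        # Reads are served from the local copy with no peer communication.
        return self.blocks.get(lba)

    def write(self, lba, data):
        # Writes are applied on every peer before completing locally.
        for peer in self.peers:
            self.messages_sent += 1
            peer.blocks[lba] = data
        self.blocks[lba] = data
```

After a write through one member, a read through any other member returns the same data without that member sending any message.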

[0194] The reader will appreciate that storage systems may take steps to ensure read consistency, so that a read request returns the same result regardless of which storage system processes it. For example, the clustered dataset content that results from any set of updates received by any set of storage systems in the cluster should be consistent across the cluster, at least whenever updates are idle (all previous modifying operations have been indicated as complete, and no new update requests have been received or processed in any way). More specifically, the instances of a clustered dataset across a set of storage systems can differ only as a result of updates that have not yet completed. This means, for example, that any two write requests that overlap in their volume block range, or any combination of a write request and an overlapping snapshot, compare-and-write, or virtual block range copy, must yield a consistent result on all copies of the dataset. Two operations should not yield a result as if they happened in one order on one storage system and in a different order on another storage system in the replicated cluster.

[0195] Furthermore, read requests can be made time-order consistent. For example, suppose one read request is received on a replicated cluster and completed, that read is followed by another read request to an overlapping address range, and one or both reads overlap in any way, in time and in volume address range, with a modification request received by the replicated cluster (whether any of the reads or the modification are received by the same storage system or by different storage systems in the replicated cluster). If the first read reflects the result of the update, then the second read should also reflect the result of that update, rather than possibly returning data that preceded the update. If the first read does not reflect the update, then the second read may either reflect the update or not. This ensures that "time" for a data segment cannot go backward between the two read requests.
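The time-order property in paragraph [0195] can be expressed as a simple check. The Python sketch below assumes, purely for illustration, that every update to a data segment carries an increasing integer version and that each completed read reports the version it observed; `reads_are_time_ordered` is a hypothetical name.

```python
def reads_are_time_ordered(observed_versions):
    """Check that successive reads of one segment never go backward.

    observed_versions: versions seen by reads of the same address range,
        listed in the order in which the reads completed and were issued
        (each read is issued after the previous one completed).
    """
    return all(
        earlier <= later
        for earlier, later in zip(observed_versions, observed_versions[1:])
    )
```

A sequence such as `[0, 1, 1]` satisfies the property, whereas `[1, 0]` does not, because the second read returns data that preceded an update the first read had already reflected.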

[0196] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also detect an interruption in data communication with one or more of the other storage systems and determine whether the particular storage system should remain in the pod. An interruption in data communication with one or more of the other storage systems may occur for a variety of reasons, for example because one of the storage systems has failed, because a network interconnect has failed, or for some other reason. An important aspect of synchronously replicated clustering is ensuring that fault handling never results in unrecoverable inconsistencies or in any inconsistency in responses. For example, if the network fails between two storage systems, at most one of the storage systems can continue processing newly incoming I/O requests for the pod. If one storage system continues processing, the other storage system cannot process any new requests to completion, including read requests.

[0197] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also determine whether a particular storage system should remain in the pod in response to detecting an interruption in data communication with one or more of the other storage systems. As described above, to be "online" as part of a pod, a storage system must consider itself to be in-sync for the pod and must be communicating with all other storage systems that it considers in-sync for the pod. If a storage system cannot be certain that it is in-sync and communicating with all other storage systems that are in-sync, it stops processing new incoming requests to access the dataset (426, 428).
Thus, a storage system may determine whether it should remain online as part of the pod, for example, by determining (via one or more test messages) whether it can communicate with all other storage systems that it considers in-sync for the pod; by determining whether all other storage systems that it considers in-sync for the pod also consider it to be attached to the pod; through a combination of both steps, in which the particular storage system must confirm both that it can communicate with all other storage systems that it considers in-sync for the pod and that all of those storage systems also consider it to be attached to the pod; or through some other mechanism.
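The combined check described in paragraph [0197] can be sketched in Python as follows. The function `should_remain_online` and its parameters are hypothetical; in particular, `can_reach` stands in for whatever test messages an implementation uses, and `peer_views` stands in for each peer's report of which systems it considers attached.

```python
def should_remain_online(me, my_in_sync_view, can_reach, peer_views):
    """Return True only if both conditions from paragraph [0197] hold.

    me: name of this storage system.
    my_in_sync_view: set of systems this system considers in-sync for the pod.
    can_reach: callable taking a peer name and returning True if a test
        message to that peer succeeds (hypothetical probe).
    peer_views: dict mapping each peer to the set of systems that the peer
        considers attached to the pod.
    """
    if me not in my_in_sync_view:
        return False  # must consider itself in-sync
    for peer in my_in_sync_view - {me}:
        if not can_reach(peer):
            return False  # must communicate with every in-sync peer
        if me not in peer_views.get(peer, set()):
            return False  # every in-sync peer must consider it attached
    return True
```

A storage system for which this check fails would stop processing new incoming requests for the dataset rather than risk serving inconsistent responses.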

[0198] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also keep the dataset on the particular storage system accessible for management and dataset operations in response to determining that the particular storage system should remain in the pod. The storage system may keep the dataset (426, 428) on the particular storage system accessible for management and dataset operations, for example, by accepting and processing requests to access the version of the dataset (426, 428) stored on the storage system, by accepting and processing management operations associated with the dataset (426, 428) that are issued by a host or an authorized administrator, by accepting and processing management operations associated with the dataset (426, 428) that are issued by one of the other storage systems, or in some other way.

[0199] In the example shown in FIG. 4, however, the illustrated storage systems (402, 404, 406) may make the dataset on the particular storage system inaccessible for management and dataset operations in response to determining that the particular storage system should not remain in the pod. The storage system may make the dataset (426, 428) on the particular storage system inaccessible for management and dataset operations, for example, by rejecting requests to access the version of the dataset (426, 428) stored on the storage system, by rejecting management operations associated with the dataset (426, 428) that are issued by a host or other authorized administrator, by rejecting management operations associated with the dataset (426, 428) that are issued by one of the other storage systems in the pod, or in some other way.

[0200] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also detect that the interruption in data communication with one or more of the other storage systems has been repaired and make the dataset on the particular storage system accessible for management and dataset operations. The storage system may detect that the interruption has been repaired, for example, by receiving a message from one or more of the other storage systems. In response to detecting that the interruption has been repaired, the storage system may make the dataset (426, 428) on the particular storage system accessible for management and dataset operations once the previously detached storage system has been resynchronized with the storage systems that remained attached to the pod.

[0201] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also go offline from the pod such that the particular storage system no longer allows management and dataset operations. A storage system may go offline from the pod in this way for a variety of reasons. For example, the illustrated storage systems (402, 404, 406) may go offline from the pod because of some fault in the storage system itself, because an update or some other maintenance is occurring on the storage system, because of a communication fault, or for many other reasons. In such an example, the illustrated storage systems (402, 404, 406) may subsequently update the dataset on the particular storage system to include all updates made to the dataset since the particular storage system went offline, and may come back online with the pod such that the particular storage system again allows management and dataset operations, as described in more detail in the resynchronization sections included below.

[0202] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also identify a target storage system for asynchronously receiving the dataset, where the target storage system is not one of the plurality of storage systems across which the dataset is synchronously replicated. Such a target storage system may represent, for example, a backup storage system, some storage system that makes use of the synchronously replicated dataset, and so on. In fact, synchronous replication can be used to distribute copies of a dataset closer to some rack of servers for better local read performance. One such case is smaller top-of-rack storage systems that are symmetrically replicated to larger storage systems centrally located in the data center or campus, where those larger storage systems are more carefully managed for reliability or are connected to external networks for asynchronous replication or backup services.

[0203] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also identify a portion of the dataset that is not being asynchronously replicated to the target storage system by any of the other storage systems, and asynchronously replicate that portion to the target storage system, such that the two or more storage systems collectively replicate the entire dataset to the target storage system. In this way, the work of asynchronously replicating a particular dataset may be split among the members of a pod, so that each storage system in the pod is responsible for asynchronously replicating only a subset of the dataset to the target storage system.
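One simple way to divide the asynchronous replication work described in paragraph [0203] is a deterministic round-robin assignment, sketched below in Python. The disclosure does not prescribe how the portions are chosen; the function `assign_extents` and the notion of numbered extents are assumptions made for this example.

```python
def assign_extents(extent_ids, members):
    """Split extents among pod members so that each extent has one owner.

    extent_ids: iterable of identifiers for portions of the dataset.
    members: iterable of pod member names.
    Returns a dict mapping each member to the extents it replicates
    asynchronously to the target storage system.
    """
    ordered_members = sorted(members)
    if not ordered_members:
        raise ValueError("a pod needs at least one member")
    assignment = {member: [] for member in ordered_members}
    # Sorting makes every member compute the same assignment independently.
    for index, extent in enumerate(sorted(extent_ids)):
        owner = ordered_members[index % len(ordered_members)]
        assignment[owner].append(extent)
    return assignment
```

Because the assignment depends only on the sorted inputs, each member can compute it locally, no extent is replicated by two members, and the members collectively cover the entire dataset.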

[0204] In the example shown in FIG. 4, the illustrated storage systems (402, 404, 406) may also detach from the pod, such that the particular storage system that detaches from the pod is no longer included in the set of storage systems across which the dataset is synchronously replicated. For example, if storage system (404) in FIG. 4 detached from the pod (430) shown in FIG. 4, the pod (430) would include only storage systems (402, 406) as the storage systems across which the dataset (426) included in the pod (430) would be synchronously replicated. In such an example, detaching a storage system from the pod could also include removing the dataset from the particular storage system that detached. Continuing with the example in which storage system (404) detached from the pod (430), the dataset (426) included in the pod (430) could be deleted or otherwise removed from storage system (404).

[0205] The reader will appreciate that the pod model enables a number of unique management capabilities that can further be supported. The pod model itself also introduces some issues that an implementation can address. For example, when a storage system is offline for a pod but is otherwise running, such as because an interconnect failed and another storage system for the pod won out in mediation, there may still be a desire or need to access the offline pod's dataset on the offline storage system. One solution may be simply to enable the pod in some detached mode and allow the dataset to be accessed. However, that solution can be dangerous, and it can make the pod's metadata and data much more difficult to reconcile when the storage systems regain communication. Furthermore, there could still be a separate path for hosts to access the offline storage system as well as the storage systems that are still online.
In that case, a host might issue I/O to both storage systems even though they are no longer being kept in sync, because the host sees target ports reporting volumes with the same identifiers, and the host I/O drivers presume that they are seeing additional paths to the same volume. This can result in fairly damaging data corruption, because reads and writes issued to the two storage systems are no longer consistent even though the host presumes that they are. As a variant of this case, in a clustered application, such as a shared-storage clustered database, the clustered application running on one host might be reading from or writing to one storage system while the same clustered application running on another host is reading from or writing to the "detached" storage system, yet the two instances of the clustered application communicate with each other on the presumption that the datasets they each see are entirely consistent for completed writes. Because the datasets are not consistent, that presumption is violated, and the application's dataset (for example, the database) can quickly end up corrupted.

[0206] One way to solve both of these problems is to allow an offline pod, or perhaps a snapshot of an offline pod, to be copied to a new pod with new volumes whose identities are sufficiently new that host I/O drivers and clustered applications will not confuse the copied volumes with the still-online volumes on another storage system. Because each pod maintains a complete copy of the dataset, one that is crash consistent though perhaps slightly different from the copy of the pod dataset on another storage system, and because each pod has an independent copy of all the data and metadata needed to operate on the pod content, making a virtual copy of some or all of the volumes or snapshots in a pod to new volumes in a new pod is straightforward. In a logical extent graph implementation, for example, all that is needed is to define new volumes in a new pod that reference the logical extent graphs from the copied pod associated with the pod's volumes or snapshots, with those logical extent graphs marked as copy-on-write. The new volumes should be treated as new volumes, similar to how volume snapshots copied to a new volume might be implemented.
The volumes may have the same administrative name, though within a new pod namespace. However, they should have different underlying identifiers, and different logical unit identifiers, from the original volumes.
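The virtual copy described in paragraph [0206] can be sketched in Python under several assumptions: the `Volume` dataclass, the `copy_pod` function, and the use of random UUIDs as underlying and logical unit identifiers are all hypothetical, and a logical extent graph is represented by a plain dictionary that the copy references rather than duplicates.

```python
import uuid
from dataclasses import dataclass, field


def _new_id():
    return uuid.uuid4().hex


@dataclass
class Volume:
    name: str                      # administrative name within its pod
    extent_graph: dict             # stand-in for a logical extent graph
    copy_on_write: bool = False
    volume_id: str = field(default_factory=_new_id)  # underlying identifier
    lun_id: str = field(default_factory=_new_id)     # logical unit identifier


def copy_pod(volumes):
    """Return new volumes for a new pod that share the source extent graphs."""
    copies = []
    for source in volumes:
        # The shared graph must not be modified in place by either side.
        source.copy_on_write = True
        copies.append(
            Volume(
                name=source.name,                  # same administrative name
                extent_graph=source.extent_graph,  # reference, not a data copy
                copy_on_write=True,
            )
        )
    return copies
```

Each copied volume keeps its administrative name but receives fresh identifiers, so a host I/O driver would not mistake it for another path to the original volume.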

[0207] In some cases, it may be possible to use virtual network isolation techniques (for example, by creating a virtual LAN in the case of IP networks, or a virtual SAN in the case of Fibre Channel networks) such that the volumes presented to some interfaces can be assured to be inaccessible from host network interfaces or host SCSI initiator ports that might also see the original volumes. In such cases, it may be safe to give the copies of the volumes the same SCSI or other storage identifiers as the original volumes. This could be used, for example, where applications expect to see a particular set of storage identifiers in order to function without an undue burden of reconfiguration.

[0208] Some of the techniques described herein could also be used outside of an active fault situation to test readiness for handling faults. Readiness testing (sometimes called a "fire drill") is commonly required for disaster recovery configurations, where frequent and repeated testing is considered necessary to ensure that most or all aspects of a disaster recovery plan are correct and account for any recent changes to applications, datasets, or equipment. Readiness testing should be non-disruptive to current production operations, including replication. In many cases the real operations cannot actually be invoked on the active configuration, but a good way to get close is to use storage operations to make copies of production datasets, and then perhaps couple that with the use of virtual networking, to create an isolated environment containing all the data believed necessary for the important applications that must be brought up successfully in a disaster.
Having such a copy of a synchronously replicated (or even asynchronously replicated) data set available at the site (or collection of sites) expected to perform disaster recovery readiness testing procedures and then start critical applications on that data set to ensure it starts and functions is an excellent tool, as it helps ensure that critical portions of the application data set were not left out of the disaster recovery plan. Where necessary and practical, this could be combined with a virtual isolation network, perhaps coupled with an isolated collection of physical or virtual machines, to approximate as closely as possible a real-world disaster recovery takeover situation. Virtually copying a pod (or collection of pods) to another pod as a point-in-time image of the pod dataset not only allows isolation to a single site (or several sites) separate from the original pod, but also immediately creates an isolated dataset containing all the copied elements and that can be operated essentially the same as the original pod. Furthermore, these are fast operations, and they can be easily torn down and tested repeatedly as often as desired.

[0209] Several enhancements could be made to move further toward a complete disaster recovery test. For example, in conjunction with isolated networks, SCSI logical unit identities or other types of identities could be copied to the target pod so that test servers, virtual machines, and applications see the same identities. Furthermore, the administrative environment of the servers could be configured so that requests arriving from a particular set of virtual networks are answered as requests and operations against the original pod name, so that scripts do not need test variants that use alternate "test" versions of object names. A further enhancement can be used where the host-side server infrastructure that will take over in a disaster takeover is available for use during testing. This includes cases where a disaster recovery data center is fully equipped with alternate server infrastructure that is generally not used until a disaster requires it. It also includes cases where that infrastructure might be used for non-critical operations (e.g., running analytics on production data, or simply supporting application development or other functions that may be important but can be halted if the infrastructure is needed for more critical functions). Specifically, host definitions and configurations, and the server infrastructure that uses them, can be set up as they will be for an actual disaster recovery takeover event and tested as part of disaster recovery takeover testing, with the tested volumes connected to these host definitions from the virtual pod copy used to provide a snapshot of the dataset. From the perspective of the storage systems involved, the host definitions and configurations used for testing, and the volume-to-host connection configurations used during testing, can then be reused when an actual disaster takeover event is triggered, greatly reducing the configuration differences between the test configuration and the real configuration that will be used in a disaster recovery takeover.

[0210] In some cases, it may make sense to move volumes out of a first pod and into a new second pod that contains only those volumes. The pod membership and the high availability and recovery characteristics can then be adjusted separately, and administration of the two resulting pod datasets can be isolated from each other. An operation that can be performed in one direction should also be possible in the other direction. At some point, it may make sense to take two pods and merge them into one, so that the volumes in each of the original two pods now track each other for storage system membership and for high availability and recovery characteristics and events. Both operations can be accomplished safely, and with reasonably minimal or no disruption to running applications, by relying on the characteristics proposed in an earlier section for changing the mediation or quorum properties of a pod. With mediation, for example, the mediator for a pod can be changed using a sequence consisting of a step in which each storage system in the pod is changed to depend on both a first mediator and a second mediator, followed by a step in which each is changed to depend only on the second mediator. If a fault occurs in the middle of the sequence, some storage systems may depend on both the first mediator and the second mediator, but in no case will recovery and fault handling leave some storage systems depending only on the first mediator and other storage systems depending only on the second mediator. Quorum can be handled similarly, by temporarily depending on winning under both a first quorum model and a second quorum model in order to proceed to recovery. This may produce a very short period in which the availability of the pod in the face of faults depends on additional resources, reducing potential availability, but the period is very short and the reduction in availability is often very small. With mediation, if the change in mediator parameters is merely a change in the key used for mediation and the mediation service stays the same, then the potential reduction in availability is smaller still, because the pod then depends on two calls to the same service instead of one call to that service, rather than on separate calls to two separate services.
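
The two-step mediator change described above can be illustrated with a minimal Python sketch. The `StorageSystem` class, the `change_mediator` function, and the `fail_after` parameter are hypothetical names introduced only for illustration; they are not part of the disclosure, and real fault handling would be considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSystem:
    name: str
    # Mediators this system must win before taking over after a fault.
    mediators: set = field(default_factory=set)


def change_mediator(systems, old, new, fail_after=None):
    """Move every system from `old` to `new` in two phases.

    `fail_after` (hypothetical) stops the sequence after that many
    per-system steps, imitating a fault in the middle of the change.
    """
    steps = 0
    # Phase 1: every system is changed to depend on BOTH mediators.
    for s in systems:
        if fail_after is not None and steps >= fail_after:
            return
        s.mediators = {old, new}
        steps += 1
    # Phase 2: only after phase 1 completed everywhere, drop the old one.
    for s in systems:
        if fail_after is not None and steps >= fail_after:
            return
        s.mediators = {new}
        steps += 1


def split_dependency(systems, old, new):
    """True if one system relies only on `old` and another only on `new`."""
    only_old = any(s.mediators == {old} for s in systems)
    only_new = any(s.mediators == {new} for s in systems)
    return only_old and only_new


if __name__ == "__main__":
    for n in range(5):
        pod = [StorageSystem("A", {"m1"}), StorageSystem("B", {"m1"})]
        change_mediator(pod, "m1", "m2", fail_after=n)
        print(n, [sorted(s.mediators) for s in pod], split_dependency(pod, "m1", "m2"))
```

Wherever the sequence is interrupted in this sketch, every pair of systems still shares at least one mediator, which is the property the paragraph relies on.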

[0211] Readers will note that changing the quorum model can be quite complex. An additional step may be required in which the storage systems participate in the second quorum model but do not depend on winning in that second quorum model, followed by the step in which they also depend on the second quorum model. This may be necessary to account for the fact that if only one system has processed the change to depend on the new quorum model, it will never win quorum, because there will never be a majority. With this model in place for changing the high availability parameters (mediation relationship, quorum model, takeover preferences), a safe procedure can be created for the operations that split a pod into two or join two pods into one. This may require adding one other capability: linking a second pod to a first pod for high availability, so that, if the two pods have compatible high availability parameters, the second pod linked to the first pod can depend on the first pod for determining and driving detach-related processing and operations, offline and in-sync states, and recovery and resynchronization actions.
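
The need for the intermediate "participate but do not depend" step can be shown with a small Python sketch. It assumes a deliberately simplified quorum rule (a model is won when more than half of the pod's members are reachable voters in it); the `Role` enum and the `wins` and `can_proceed` functions are hypothetical names, not terms from the disclosure.

```python
from enum import Enum


class Role(Enum):
    NONE = 0         # not involved in the quorum model
    PARTICIPATE = 1  # votes in the model but does not need to win it
    DEPEND = 2       # votes in the model and must win it to proceed


def wins(model, roles, reachable, total):
    """Simplified rule: a majority of `total` members are reachable voters."""
    voters = [s for s in reachable
              if roles[s].get(model, Role.NONE) != Role.NONE]
    return len(voters) > total // 2


def can_proceed(system, roles, reachable, total):
    """A system proceeds only if it wins every model it depends on."""
    needed = [m for m, r in roles[system].items() if r == Role.DEPEND]
    return all(wins(m, roles, reachable, total) for m in needed)


if __name__ == "__main__":
    D, P = Role.DEPEND, Role.PARTICIPATE
    everyone = ["A", "B", "C"]
    # A switches to depend on q2 before anyone else votes in q2.
    direct = {"A": {"q1": D, "q2": D}, "B": {"q1": D}, "C": {"q1": D}}
    # B and C first participate in q2, then A starts depending on it.
    staged = {"A": {"q1": D, "q2": D},
              "B": {"q1": D, "q2": P},
              "C": {"q1": D, "q2": P}}
    print(can_proceed("A", direct, everyone, 3))
    print(can_proceed("A", staged, everyone, 3))
```

In the `direct` configuration, system A can never win the second model even with every peer reachable, which is the situation the intermediate step avoids.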

[0212] To split one pod into two, which is an operation that moves some volumes into a newly created pod, a distributed operation may be formed that can be described as follows: form a second pod into which a set of volumes previously in the first pod will be moved; copy the high availability parameters from the first pod to the second pod to ensure that they are compatible for linking; and link the second pod to the first pod for high availability. This operation may be encoded as messages and should be implemented by each storage system in the pod in such a way that the storage system ensures that the operation either happens completely on that storage system or does not happen at all if processing is interrupted by a fault. Once all in-sync storage systems for the two pods have processed this operation, the storage systems can process a subsequent operation that changes the second pod so that it is no longer linked to the first pod. As with other changes to the high availability characteristics of a pod, this involves first having each in-sync storage system depend on both the previous model (in which high availability is linked to the first pod) and the new model (in which the second pod has its own, now independent, high availability). In the case of mediation or quorum, this means that a storage system that has processed this change will first depend on mediation or quorum being achieved as appropriate for the first pod, and will additionally depend on a new, separate mediation (e.g., a new mediation key) or quorum being achieved for the second pod, before the second pod can proceed following a fault that required mediation or a test for quorum. As with the earlier description of changing quorum models, an intermediate step may set the storage systems to participate in quorum for the second pod before the step in which the storage systems both participate in and depend on quorum for the second pod. Once all in-sync storage systems have processed the change to depend on the new parameters for mediation or quorum for both the first pod and the second pod, the split is complete.

[0213] Joining a second pod into a first pod operates essentially in reverse. First, the second pod must be adjusted to be compatible with the first pod, by having an identical list of storage systems and by having a compatible high availability model. This may involve some set of steps, such as the steps described elsewhere herein for adding or removing storage systems or for changing mediator and quorum models. Depending on the implementation, it may be necessary only to reach an identical list of storage systems. Joining proceeds by processing, on each in-sync storage system, an operation that links the second pod to the first pod for high availability. Each storage system that processes that operation then depends on the first pod for high availability and then on the second pod for high availability. Once all in-sync storage systems for the second pod have processed that operation, the storage systems each process a subsequent operation to eliminate the link between the second pod and the first pod, migrate the volumes from the second pod into the first pod, and delete the second pod. Host or application dataset access can be preserved throughout these operations, as long as the implementation allows host or application dataset modification or read operations to be directed properly to a volume by identity, and as long as identity is preserved as appropriate for the storage protocol or storage model (e.g., in the case of SCSI, as long as the logical unit identifiers for volumes and the use of target ports for accessing volumes are preserved).

[0214] Migrating a volume between pods may present issues. If the pods have an identical set of in-sync member storage systems, the migration may be straightforward: temporarily suspend operations on the volumes being migrated, switch control over operations on those volumes to the controlling software and structures for the new pod, and then resume operations. Provided that networks and ports migrate properly between the pods, this allows a seamless migration with continuous uptime for applications, apart from the very brief suspension of operations. Depending on the implementation, suspending operations may not be necessary at all, or may be so internal to the system that the suspension has no impact. Copying volumes between pods with different in-sync membership sets is more of a problem. If the target pod of the copy has a subset of the in-sync members of the source pod, this is not much of an issue: a member storage system can be dropped safely enough without further work. However, if the target pod adds in-sync member storage systems beyond those of the source pod, the added storage systems must be synchronized to include the volume's contents before they can be used. Until they are synchronized, the copied volumes remain distinctly different from volumes that are already synchronized, in that fault handling differs and request handling on the member storage systems that are not yet synchronized either will not work, must be forwarded, or will be slower because reads must traverse an interconnect. In addition, the internal implementation must handle some volumes being in sync and ready for fault handling while others are not in sync.

[0215] There are other problems relating to the reliability of the operation in the face of faults. Coordinating the migration of volumes between multi-storage-system pods is a distributed operation. If the pod is the unit of fault handling and recovery, and if mediation, quorum, or some other means is used to avoid split-brain situations, then when volumes switch from one pod, with its particular set of state, configuration, and relationships for fault handling, recovery, mediation, and quorum, to another pod, the storage systems in a pod must be careful to coordinate the changes related to that handling for every volume. Operations cannot be distributed atomically between storage systems, but must be staged in some way. Mediation and quorum models essentially give pods the tools for implementing distributed transactional atomicity, but this may not extend to inter-pod operations without additional implementation work.

[0216] Consider even a simple migration of a volume from a first pod to a second pod, where the two pods share the same first and second storage systems. At some point, the storage systems coordinate to establish that the volume is now in the second pod and no longer in the first pod. If there is no inherent mechanism for transactional atomicity across the storage systems for the two pods, a naive implementation could leave the volume in the first pod on the first storage system and in the second pod on the second storage system at the time of a network fault that triggers fault handling to detach storage systems from the two pods. If the pods determine separately which storage system succeeds in detaching the other, the result could be that the same storage system detaches the other storage system for both pods, in which case the result of the volume migration recovery should be consistent, or that a different storage system detaches the other for each of the two pods. If the first storage system detaches the second storage system for the first pod, and the second storage system detaches the first storage system for the second pod, then recovery might leave the volume recovered into the first pod on the first storage system and into the second pod on the second storage system, with the volume then running and exported to hosts and storage applications on both storage systems. If instead the second storage system detaches the first storage system for the first pod, and the first storage system detaches the second storage system for the second pod, then recovery might result in the volume being discarded from the second pod by the first storage system and from the first pod by the second storage system, so that the volume disappears entirely. If the pods between which a volume is being migrated are on different sets of storage systems, matters can become even more complicated.
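
The four detach outcomes described above can be enumerated with a short Python sketch. It assumes a simplified model in which, at the moment of the fault, each storage system holds its own view of which pod contains the volume, and each pod independently picks one surviving system; the names `view`, `winners`, and `recover` are hypothetical and chosen only for illustration.

```python
from itertools import product


def recover(view, winners):
    """Return the (pod, system) pairs on which the volume is still served.

    view:    system -> pod that the system believes holds the volume
    winners: pod -> system that survived the detach for that pod
    """
    return sorted((pod, system) for pod, system in winners.items()
                  if view[system] == pod)


if __name__ == "__main__":
    # Mid-migration fault: S1 still places the volume in pod1,
    # S2 has already moved it to pod2.
    view = {"S1": "pod1", "S2": "pod2"}
    for w1, w2 in product(["S1", "S2"], repeat=2):
        result = recover(view, {"pod1": w1, "pod2": w2})
        print(f"pod1 winner={w1}, pod2 winner={w2}: {result}")
```

In this sketch the two "same winner" cases leave exactly one copy of the volume, while the mixed cases leave either two independently running copies or none.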

[0217] A solution to these problems may be to use an intermediate pod together with the techniques described above for splitting and joining pods. This intermediate pod must never be presented as a visible managed object associated with the storage systems. In this model, the volumes to be moved from a first pod to a second pod are first split from the first pod into a new intermediate pod using the split operation described above. The storage system membership of the intermediate pod can then be adjusted to match the storage system membership of the second pod, by adding storage systems to the intermediate pod or removing storage systems from it as needed. The intermediate pod can then be joined to the second pod.
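
The split, adjust, and join sequence above can be outlined in Python. This is only a single-process sketch under stated assumptions: the `Pod` class and the `split`, `join`, and `migrate` functions are hypothetical, membership adjustment is reduced to a set assignment (standing in for adding or removing storage systems and resynchronizing them), and the per-storage-system atomicity and high availability linking described earlier are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    members: set = field(default_factory=set)  # storage system names
    volumes: set = field(default_factory=set)
    visible: bool = True  # whether the pod is shown as a managed object


def split(source, volumes, name, visible=True):
    """Move `volumes` out of `source` into a new pod with the same members."""
    volumes = set(volumes)
    missing = volumes - source.volumes
    if missing:
        raise ValueError(f"volumes not in {source.name}: {sorted(missing)}")
    new_pod = Pod(name, set(source.members), volumes, visible)
    source.volumes -= volumes
    return new_pod


def join(second, first):
    """Merge `second` into `first`; memberships must already match."""
    if second.members != first.members:
        raise ValueError("membership must match before joining")
    first.volumes |= second.volumes
    second.volumes.clear()


def migrate(volumes, source, target):
    """Move volumes between pods through a hidden intermediate pod."""
    temp = split(source, volumes, name="_intermediate", visible=False)
    temp.members = set(target.members)  # stands in for add/remove + resync
    join(temp, target)


if __name__ == "__main__":
    p1 = Pod("pod1", {"A", "B"}, {"v1", "v2"})
    p2 = Pod("pod2", {"B", "C"}, {"v3"})
    migrate({"v1"}, p1, p2)
    print(sorted(p1.volumes), sorted(p2.volumes))
```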

[0218] For further explanation, FIG. 5 sets forth a flowchart illustrating steps that may be performed by storage systems (402, 404, 406) that support a pod according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (402, 404, 406) shown in FIG. 5 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, FIG. 4, or any combination thereof. In fact, the storage systems (402, 404, 406) shown in FIG. 5 may include the same components as the storage systems described above, fewer components, or additional components.

[0219] In the example method shown in FIG. 5, a storage system (402) may attach (508) to a pod. The model for pod membership may include a list of storage systems and a subset of that list whose storage systems are presumed to be in sync for the pod. A storage system is in sync for a pod if it is at least within a recovery of having identical idle content for the last written copy of the dataset associated with the pod. Idle content is the content after any in-progress modifications have completed, with no new modifications being processed. This is sometimes called "crash recoverable" consistency. A storage system that is listed as a pod member but is not listed as in sync for the pod may be described as "detached" from the pod. A storage system that is listed as a pod member, is in sync for the pod, and is currently available to actively serve data for the pod is "online" for the pod.
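
The membership model above (member list, in-sync subset, and the derived "detached" and "online" states) can be expressed as a small Python sketch. The `PodMembership` class, its `available` field, and the status strings other than "detached" and "online" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PodMembership:
    members: set   # every storage system listed for the pod
    in_sync: set   # subset of members presumed in sync for the pod
    # Hypothetical: members currently able to actively serve pod data.
    available: set = field(default_factory=set)

    def status(self, system):
        if system not in self.members:
            return "not a member"
        if system not in self.in_sync:
            return "detached"
        if system in self.available:
            return "online"
        return "in sync but not online"


if __name__ == "__main__":
    pod = PodMembership(members={"A", "B", "C"},
                        in_sync={"A", "B"},
                        available={"A"})
    for name in ["A", "B", "C", "D"]:
        print(name, pod.status(name))
```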

[0220] In the example method shown in FIG. 5, the storage system (402) may attach (508) to the pod, for example, by synchronizing its locally stored version of the dataset (426) with an up-to-date version of the dataset (426) stored on the other storage systems (404, 406) in the pod that are online, as that term is described above. In such an example, in order for the storage system (402) to attach (508) to the pod, the pod definition stored locally within each of the storage systems (402, 404, 406) in the pod may need to be updated. In such an example, each storage system member of the pod may have its own copy of the membership, including which storage systems it last knew were in sync and which storage systems it last knew made up the entire set of pod members.

[0221] In the example method shown in FIG. 5, the storage system (402) may also receive (510) a request to read a portion of the dataset (426), and the storage system (402) may process (512) the request to read the portion of the dataset (426) locally. Readers will appreciate that a request to modify the dataset (426) (e.g., a write operation) requires coordination among the storage systems (402, 404, 406) in the pod, because the dataset (426) should be consistent across all storage systems (402, 404, 406) in the pod, whereas responding to a request to read a portion of the dataset (426) does not require similar coordination among the storage systems (402, 404, 406). Thus, the particular storage system (402) that receives a read request may service the read request locally by reading the portion of the dataset (426) stored within the storage devices of the storage system (402), with no synchronous communication with the other storage systems (404, 406) in the pod. Read requests received by one storage system for a replicated dataset in a replicated cluster are expected to avoid any communication in the vast majority of cases, at least when received by a storage system that is running within a cluster that is itself running nominally. Such reads should normally be processed simply by reading from the local copy of the clustered dataset, with no further interaction required with other storage systems in the cluster.

[0222] Readers will appreciate that the storage systems may take steps to ensure read consistency, so that a read request returns the same result regardless of which storage system processes it. For example, the resulting clustered dataset content for any set of updates received by any set of storage systems in the cluster should be consistent across the cluster, at least at any time when updates are idle (all previous modifying operations have been indicated as complete, and no new update request has been received or processed in any way). That is, the instances of a clustered dataset across a set of storage systems may differ only as a result of updates that have not yet completed. This means, for example, that any two write requests that overlap in their volume block range, or any combination of a write request and an overlapping snapshot, compare-and-write, or virtual block range copy, must yield a consistent result on all copies of the dataset. Two operations cannot yield a result as if they occurred in one order on one storage system and in a different order on another storage system in the replicated cluster.

[0223] Furthermore, read requests may be time-order consistent. For example, suppose one read request is received by a replicated cluster and completes, that read is then followed by another read request to an overlapping address range that is received by the replicated cluster, and one or both reads overlap in any way, in time and in volume address range, with a modification request received by the replicated cluster (regardless of whether any of the reads or the modification are received by the same storage system or by different storage systems in the replicated cluster). In that case, if the first read reflects the result of the update, the second read should also reflect the result of that update, rather than possibly returning data that preceded the update. If the first read does not reflect the update, the second read may either reflect the update or not. This ensures that, between two read requests, "time" for a data segment never rolls backward.
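
The time-order rule above can be stated as a simple check. The sketch below assumes, purely for illustration, that each read result can be tagged with a monotonically increasing version number of the data segment it returned; neither the version numbers nor the function name `time_order_consistent` appear in the disclosure.

```python
def time_order_consistent(observed_versions):
    """Check that successive reads of a segment never go back in time.

    observed_versions: versions returned by reads of overlapping ranges,
    listed so that each read was issued after the previous one completed.
    """
    return all(earlier <= later
               for earlier, later in zip(observed_versions,
                                         observed_versions[1:]))


if __name__ == "__main__":
    # First read misses the update (version 1), second sees it (version 2).
    print(time_order_consistent([1, 2]))
    # First read sees the update, second returns older data: not allowed.
    print(time_order_consistent([2, 1]))
```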

[0224] In the example method shown in FIG. 5, the storage system (402) may also detect (514) a disruption in data communications with one or more of the other storage systems (404, 406). A disruption in data communications with one or more of the other storage systems (404, 406) may occur for a variety of reasons. For example, it may occur because one of the storage systems (402, 404, 406) has failed, because a network interconnect has failed, or for some other reason. An important aspect of synchronous replicated clustering is ensuring that fault handling never results in unrecoverable inconsistencies or in inconsistent responses. For example, if the network fails between two storage systems, at most one of the storage systems may continue to process newly incoming I/O requests for the pod. And if one storage system continues processing, the other storage system cannot process any new requests to completion, including read requests.

[0225] In the example method shown in FIG. 5, the storage system (402) may also determine (516) whether the particular storage system (402) should remain online as part of the pod. As described above, to be 'online' as part of a pod, a storage system must consider itself to be in-sync for the pod and must be communicating with all other storage systems that it considers to be in-sync for the pod. If a storage system cannot be certain that it is in-sync and communicating with all other storage systems that are in-sync, then it may stop processing new incoming requests to access the dataset (426). The storage system (402) may therefore determine (516) whether it should remain online as part of the pod in any of several ways: by determining whether it can communicate (e.g., via one or more test messages) with all other storage systems (404, 406) that it considers in-sync for the pod; by determining whether all other storage systems (404, 406) that it considers in-sync for the pod also consider the storage system (402) to be attached to the pod; through a combination of both steps, in which the particular storage system (402) must confirm both that it can communicate with all other storage systems (404, 406) that it considers in-sync for the pod and that all of those storage systems also consider the storage system (402) to be attached to the pod; or through some other mechanism.
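For further explanation, the combined check described above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name, the parameter names, and the use of string identifiers for storage systems are hypothetical, and how reachability and attachment confirmations are actually gathered (e.g., via test messages) is outside the sketch.

```python
from typing import AbstractSet


def should_remain_online(
    considers_self_in_sync: bool,
    in_sync_peers: AbstractSet[str],
    reachable_peers: AbstractSet[str],
    peers_confirming_attachment: AbstractSet[str],
) -> bool:
    """Decide whether a storage system should stay online for a pod.

    All parameters are hypothetical inputs: the peers this system considers
    in-sync, the peers it can currently reach, and the peers that report
    that they still consider this system attached to the pod.
    """
    if not considers_self_in_sync:
        return False
    # Every peer considered in-sync must be reachable ...
    can_reach_all = in_sync_peers <= reachable_peers
    # ... and must also consider this system attached to the pod.
    all_confirm = in_sync_peers <= peers_confirming_attachment
    return can_reach_all and all_confirm
```

For example, a system that considers peers "404" and "406" in-sync but can reach only "404" would go offline for the pod under this sketch.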

[0226] In the example method shown in FIG. 5, in response to affirmatively (518) determining that the particular storage system (402) should remain online as part of the pod, the storage system (402) may keep (522) the dataset (426) on the particular storage system (402) accessible for management and dataset operations. The storage system (402) may keep (522) the dataset (426) accessible, for example, by accepting and processing requests to access the version of the dataset (426) stored on the storage system (402), by accepting and processing management operations associated with the dataset (426) that are issued by a host or an authorized administrator, by accepting and processing management operations associated with the dataset (426) that are issued by one of the other storage systems (404, 406) in the pod, or in some other way.

[0227] In the example method shown in FIG. 5, in response to determining that the particular storage system should not (520) remain online as part of the pod, the storage system (402) may also make (524) the dataset (426) on the particular storage system (402) inaccessible for management and dataset operations. The storage system (402) may make (524) the dataset (426) inaccessible, for example, by rejecting requests to access the version of the dataset (426) stored on the storage system (402), by rejecting management operations associated with the dataset (426) that are issued by a host or other authorized administrator, by rejecting management operations associated with the dataset (426) that are issued by one of the other storage systems (404, 406) in the pod, or in some other way.

[0228] In the example method shown in FIG. 5, the storage system (402) may also detect (526) that the interruption in data communication with one or more of the other storage systems (404, 406) has been repaired, for example, by receiving a message from one or more of the other storage systems (404, 406). In response to detecting (526) that the interruption has been repaired, the storage system (402) may make (528) the dataset (426) on the particular storage system (402) accessible for management and dataset operations.
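For further explanation, the accessibility transitions described for FIG. 5 (interruption detected, online decision made, interruption repaired) can be sketched as a small state holder. The class `PodMember`, its method names, and the use of `PermissionError` to model a rejected request are assumptions made for illustration; the sketch also omits any resynchronization that an actual system might perform before serving requests again.

```python
from enum import Enum


class Access(Enum):
    ACCESSIBLE = "accessible"
    INACCESSIBLE = "inaccessible"


class PodMember:
    """Hypothetical model of one storage system's view of a pod dataset."""

    def __init__(self) -> None:
        self.access = Access.ACCESSIBLE

    def on_interruption(self, should_stay_online: bool) -> None:
        # Keep the dataset accessible only if the online decision is positive.
        self.access = (
            Access.ACCESSIBLE if should_stay_online else Access.INACCESSIBLE
        )

    def on_repair(self) -> None:
        # Once the interruption is repaired, make the dataset accessible again.
        self.access = Access.ACCESSIBLE

    def handle_request(self, request: str) -> str:
        if self.access is Access.INACCESSIBLE:
            raise PermissionError(f"dataset inaccessible, rejecting {request}")
        return f"processed {request}"
```

A member that decides it should not remain online rejects both dataset and management requests until `on_repair` is called.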

[0229] The reader will appreciate that the example shown in FIG. 5 describes an embodiment in which the various actions are depicted as occurring in a particular order, although no ordering is required. Furthermore, other embodiments may exist in which the storage system (402) carries out only a subset of the described actions. For example, the storage system (402) may perform the steps of detecting (514) an interruption in data communication with one or more of the other storage systems (404, 406), determining (516) whether the particular storage system (402) should remain in the pod, keeping (522) the dataset (426) on the particular storage system (402) accessible for management and dataset operations, or making (524) the dataset (426) on the particular storage system (402) inaccessible for management and dataset operations, without first receiving (510) a request to read a portion of the dataset (426) and processing (512) that request locally. Likewise, the storage system (402) may detect (526) that an interruption in data communication with one or more of the other storage systems (404, 406) has been repaired and make (528) the dataset (426) on the particular storage system (402) accessible for management and dataset operations, without first receiving (510) a request to read a portion of the dataset (426) and processing (512) that request locally. In fact, none of the steps described herein is explicitly required in all embodiments as a prerequisite for performing other steps described herein.

[0230] For further explanation, FIG. 6 sets forth a flowchart illustrating steps that may be performed by storage systems (402, 404, 406) that support a pod according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (402, 404, 406) shown in FIG. 6 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, FIG. 4, or any combination thereof. In fact, the storage systems (402, 404, 406) shown in FIG. 6 may include the same, fewer, or additional components as compared to the storage systems described above.

[0231] In the example method shown in FIG. 6, two or more of the storage systems (402, 404) may each identify (608) a target storage system (618) for asynchronously receiving the dataset (426). The target storage system (618) may be embodied, for example, as a backup storage system located in a different data center than either of the storage systems (402, 404) that are members of a particular pod, as cloud storage provided by a cloud services provider, or in many other ways. The reader will appreciate that the target storage system (618) is not one of the plurality of storage systems (402, 404) across which the dataset (426) is synchronously replicated, and as such, the target storage system (618) initially does not include an up-to-date local copy of the dataset (426).

[0232] In the example method shown in FIG. 6, two or more of the storage systems (402, 404) may each also identify (610) a portion of the dataset (426) that is not being asynchronously replicated to the target storage system (618) by any of the other storage systems that are members of the pod that includes the dataset (426). In such an example, the storage systems (402, 404) may each asynchronously replicate (612), to the target storage system (618), the portion of the dataset (426) that is not being asynchronously replicated to the target storage system by any of the other storage systems. Consider an example in which a first storage system (402) is responsible for asynchronously replicating a first portion (e.g., a first half of the address space) of the dataset (426) to the target storage system (618). In such an example, the second storage system (404) would be responsible for asynchronously replicating a second portion (e.g., a second half of the address space) of the dataset (426) to the target storage system (618), so that the two or more storage systems (402, 404) collectively replicate the entire dataset (426) to the target storage system (618).
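For further explanation, the division of the address space among pod members can be sketched as follows. The function name `split_address_space`, the block-granular addressing, and the contiguous, near-equal partitioning are assumptions chosen for illustration; the description above requires only that the members collectively replicate the entire dataset.

```python
from typing import Dict, List


def split_address_space(total_blocks: int, members: List[str]) -> Dict[str, range]:
    """Assign each pod member a disjoint, contiguous share of block addresses.

    Together the returned ranges cover every block exactly once, so the
    members collectively replicate the whole dataset to the target.
    """
    if not members:
        raise ValueError("at least one pod member is required")
    base, extra = divmod(total_blocks, len(members))
    assignment: Dict[str, range] = {}
    start = 0
    for index, member in enumerate(members):
        # The first `extra` members take one additional block each.
        size = base + (1 if index < extra else 0)
        assignment[member] = range(start, start + size)
        start += size
    return assignment
```

With two members and an even block count, the first member receives the first half of the address space and the second member receives the second half, matching the example above.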

[0233] The reader will appreciate that, through the use of pods as described above, the replication relationship between two storage systems may be switched from a relationship in which data is asynchronously replicated to a relationship in which data is synchronously replicated. For example, if storage system A is configured to asynchronously replicate a dataset to storage system B, creating a pod that includes the dataset, storage system A as a member, and storage system B as a member can switch the relationship from asynchronous to synchronous replication. Similarly, through the use of pods, the replication relationship between two storage systems may be switched from synchronous to asynchronous replication. For example, once a pod has been created that includes the dataset, storage system A as a member, and storage system B as a member, merely unstretching the pod (to remove storage system A as a member or to remove storage system B as a member) can immediately switch the relationship from one in which data is synchronously replicated between the storage systems to one in which data is asynchronously replicated. In this way, storage systems may switch back and forth, as needed, between asynchronous replication and synchronous replication.

[0234] This switching may be facilitated by an implementation that relies on similar techniques for both synchronous and asynchronous replication. For example, if resynchronization of a synchronously replicated dataset relies on the same mechanism as, or a mechanism compatible with, the one used for asynchronous replication, then switching to asynchronous replication is conceptually identical to dropping the in-sync state and leaving the relationship in a state similar to a "perpetual recovery" mode. Likewise, switching from asynchronous replication to synchronous replication may operate conceptually by "catching up" and becoming in-sync, just as is done when a resynchronization completes and the switching system becomes an in-sync pod member.

[0235] Alternatively, or in addition, if both synchronous and asynchronous replication rely on similar or identical common metadata, on a common model for representing and identifying logical extents or stored block identities, or on a common model for representing content-addressable stored blocks, then these aspects of commonality may be leveraged to dramatically reduce the content that may need to be transferred when switching to and from synchronous and asynchronous replication. Further, if a dataset is asynchronously replicated from storage system A to storage system B, and storage system B further asynchronously replicates that dataset to storage system C, then a common metadata model, common logical extent or block identities, or a common representation of content-addressable stored blocks can dramatically reduce the data transfers needed to enable synchronous replication between storage system A and storage system C.
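For further explanation, the following Python sketch illustrates how a shared content-addressable block identity can reduce transfers: only blocks whose identity is unknown to the target need to be sent. The use of SHA-256 as the block identity and the names `block_id` and `blocks_to_transfer` are assumptions for illustration; the description above does not specify a particular identity scheme.

```python
import hashlib
from typing import AbstractSet, Iterable, List


def block_id(data: bytes) -> str:
    """Hypothetical content-derived identity for a stored block."""
    return hashlib.sha256(data).hexdigest()


def blocks_to_transfer(
    source_blocks: Iterable[bytes], target_ids: AbstractSet[str]
) -> List[bytes]:
    """Return only the source blocks that the target does not already hold."""
    return [block for block in source_blocks if block_id(block) not in target_ids]
```

If storage system C already holds most blocks because it received them from storage system B, then storage system A would need to send only the remainder before the two systems could become in-sync under this sketch.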

[0236] The reader will further appreciate that, through the use of pods as described above, replication techniques may be used to perform tasks other than replicating data. In fact, because a pod may include a set of managed objects, tasks such as migrating a virtual machine may be carried out using pods and the replication techniques described herein. For example, if virtual machine A is executing on storage system A, then by creating a pod that includes virtual machine A as a managed object, storage system A as a member, and storage system B as a member, virtual machine A and any associated images and definitions may be migrated to storage system B, at which time the pod could simply be destroyed, membership could be updated, or other actions could be taken as necessary.

[0237] For further explanation, FIG. 7 sets forth a flowchart illustrating an example method of establishing a synchronous replication relationship between two or more storage systems (714, 724, 728) according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (714, 724, 728) shown in FIG. 7 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems (714, 724, 728) shown in FIG. 7 may include the same, fewer, or additional components as compared to the storage systems described above.

[0238] The example method shown in FIG. 7 includes identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) is synchronously replicated. The dataset (712) shown in FIG. 7 may be embodied, for example, as the contents of a particular volume, as the contents of a particular shard of a volume, or as any other collection of one or more data elements. The dataset (712) may be synchronized across the plurality of storage systems (714, 724, 728) such that each storage system (714, 724, 728) retains a local copy of the dataset (712). In the examples described herein, such a dataset (712) is synchronously replicated across the storage systems (714, 724, 728) in such a way that the dataset (712) can be accessed through any of the storage systems (714, 724, 728) with performance characteristics such that no one storage system in the cluster operates substantially more optimally than any other storage system in the cluster, at least as long as the cluster and the particular storage system being accessed are running nominally. In such systems, modifications to the dataset (712) should be made to the copy of the dataset that resides on each storage system (714, 724, 728) in such a way that accessing the dataset (712) on any of the storage systems (714, 724, 728) yields consistent results. For example, a write request issued to the dataset must be serviced on all of the storage systems (714, 724, 728) or on none of them. Likewise, some groups of operations (e.g., two write operations directed to the same location within the dataset) must be executed in the same order on all of the storage systems (714, 724, 728) so that the copies of the dataset that reside on each storage system (714, 724, 728) are ultimately identical. Modifications to the dataset (712) need not be made at exactly the same time, but some actions (e.g., issuing an acknowledgment of a write request directed to the dataset, or enabling read access to a location within the dataset that is targeted by a write request that has not yet completed on all of the storage systems) may be delayed until the copy of the dataset (712) on each storage system (714, 724, 728) has been modified.
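For further explanation, the all-or-none and same-order requirements can be sketched with an in-memory model. The `Replica` class, the `online` flag, and the `replicated_write` function are hypothetical names; an actual implementation would need to handle failures that occur partway through applying a write, which this sketch deliberately ignores.

```python
from typing import Dict, List


class Replica:
    """Hypothetical in-memory stand-in for one storage system's dataset copy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.blocks: Dict[int, bytes] = {}

    def apply(self, address: int, data: bytes) -> None:
        self.blocks[address] = data


def replicated_write(replicas: List[Replica], address: int, data: bytes) -> bool:
    """Apply a write on every replica or on none of them.

    Writes pass through this function one at a time, so every replica sees
    them in the same order. The return value models the acknowledgment,
    which is given only after every copy has been modified.
    """
    if not all(replica.online for replica in replicas):
        return False  # serviced by none of the storage systems
    for replica in replicas:
        replica.apply(address, data)
    return True
```

Two writes to the same address issued through `replicated_write` leave every replica holding the second value, so the copies end up identical.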

[0239] In the example method shown in FIG. 7, identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) is synchronously replicated may be carried out, for example, by examining a pod definition or similar data structure that associates the dataset (712) with the one or more storage systems (714, 724, 728) that nominally store that dataset (712). A 'pod', as the term is used here and throughout the remainder of the present application, may be embodied as a management entity that represents a dataset, a set of managed objects and management operations, a set of access operations to modify or read the dataset, and a plurality of storage systems. Such management operations may modify or query managed objects equivalently through any of the storage systems, and access operations to read or modify the dataset operate equivalently through any of the storage systems. Each storage system may store a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by the storage system, and operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management operations that query the pod and in subsequent access operations that read the dataset. Additional details regarding a 'pod' may be found in previously filed provisional patent application No. 62/518,071, which is incorporated herein by reference. In such an example, the pod definition may include at least an identification of the dataset (712) and of the set of storage systems (714, 724, 728) across which the dataset (712) is synchronously replicated. Such a pod may encapsulate some of a number of (perhaps optional) properties, including symmetric access, flexible addition and removal of replicas, high-availability data consistency, uniform user administration across storage systems in their relationship to the dataset, managed host access, application clustering, and so on. Storage systems can be added to a pod, resulting in the pod's dataset (712) being copied to that storage system and then kept up to date as the dataset (712) is modified. Storage systems can also be removed from a pod, resulting in the dataset (712) no longer being kept up to date on the removed storage system. In such examples, the pod definition or similar data structure may be updated as storage systems are added to and removed from a particular pod.
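For further explanation, a pod definition of the minimal kind described above (an identification of the dataset plus the set of storage systems across which it is synchronously replicated) can be sketched as a small data structure. The class name `PodDefinition` and its fields and methods are hypothetical; an actual pod definition may carry many additional properties.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PodDefinition:
    """Hypothetical minimal pod definition: a dataset and its member systems."""

    dataset_id: str
    storage_systems: Set[str] = field(default_factory=set)

    def add_storage_system(self, system_id: str) -> None:
        # Adding a member implies the dataset is copied there and kept current.
        self.storage_systems.add(system_id)

    def remove_storage_system(self, system_id: str) -> None:
        # A removed member's copy is no longer kept up to date.
        self.storage_systems.discard(system_id)

    def replicas_for(self, dataset_id: str) -> Set[str]:
        """Identify the systems across which a dataset is replicated."""
        if dataset_id != self.dataset_id:
            return set()
        return set(self.storage_systems)
```

Examining such a structure is one way the identifying step (702) could look up the storage systems for a dataset.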

[0240] The example method shown in FIG. 7 also includes configuring (704) one or more data communication links (716, 718, 720) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712). In the example method shown in FIG. 7, the storage systems (714, 724, 728) in a pod must communicate with each other both for high-bandwidth data transfer and for cluster, status, and administrative communication. These distinct types of communication could run over the same data communication links (716, 718, 720) or, in an alternative embodiment, over separate data communication links (716, 718, 720). In a cluster of dual-controller storage systems, both controllers in each storage system should have the nominal ability to communicate with both controllers of any paired storage system (i.e., any other storage system in the pod).

[0241] In a primary/secondary controller design, all cluster communication for active replication may run between primary controllers until a fault occurs. In such systems, some communication may occur between a primary controller and a secondary controller, or between secondary controllers on distinct storage systems, in order to verify that the data communication links between such entities are operational. In other cases, virtual network addresses might be used to limit the configuration needed for inter-datacenter network links, or to simplify the design of the clustered aspects of the storage system. In an active/active controller design, cluster communication might run from all active controllers of one storage system to some or all active controllers in any paired storage system, or it might be filtered through a common switch, or it might use virtual network addresses to simplify configuration, or it might use some combination of these. In a scale-out design, two or more common network switches may be used such that all scale-out storage controllers within the storage system connect to the network switches in order to handle data traffic. The switches might or might not use techniques to limit the number of exposed network addresses, so that paired storage systems do not need to be configured with the network addresses of all storage controllers.

[0242] In the example method shown in FIG. 7, configuring (704) one or more data communication links (716, 718, 720) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712) may be carried out, for example, by configuring the storage systems (714, 724, 728) to communicate via defined ports over a data communication network, by configuring the storage systems (714, 724, 728) to communicate over a point-to-point data communication link between two of the storage systems (714, 724, 728), or in a variety of other ways. If secure communication is required, some form of key exchange may be needed, or communication could be done or bootstrapped through some service such as SSH (Secure Shell), SSL, or some other service or protocol built around public keys or Diffie-Hellman key exchange, or a reasonable alternative. Secure communication could also be mediated through some vendor-provided cloud service tied in some way to customer identities. Alternatively, a service configured to run on customer facilities, such as in a virtual machine or a container, could be used to mediate the key exchanges necessary for secure communication between the replicating storage systems (714, 724, 728). The reader will appreciate that a pod that includes more than two storage systems may need communication links between most or all of the individual storage systems. In the example shown in FIG. 7, three data communication links (716, 718, 720) are illustrated, although additional data communication links may exist in other embodiments.
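For further explanation, the number of pairwise links needed when every storage system in a pod is linked to every other can be sketched as follows. The function name `full_mesh_links` is hypothetical, and the sketch assumes a full mesh; as noted above, a pod may need links between only most of its storage systems.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def full_mesh_links(systems: Sequence[str]) -> List[Tuple[str, str]]:
    """List every unordered pair of storage systems that needs a link."""
    return list(combinations(systems, 2))
```

For the three storage systems (714, 724, 728), this yields three pairs, consistent with the three data communication links illustrated; in general, n storage systems yield n(n-1)/2 pairs.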

[0243] The reader will appreciate that communication between the storage systems (714, 724, 728) across which the dataset (712) is synchronously replicated serves several purposes. One purpose, for example, is to deliver data from one storage system (714, 724, 728) to another storage system (714, 724, 728) as part of I/O processing. For example, processing a write commonly requires delivering the write content and some description of the write to any paired storage systems for the pod. Another purpose served by data communication between the storage systems (714, 724, 728) may be to communicate configuration changes and analytics data in order to handle creating, extending, deleting, or renaming volumes, files, object buckets, and so on. Another purpose may be to carry out the communication involved in detecting and handling storage system and interconnect faults. This type of communication may be time-critical and may need to be prioritized to ensure that it does not get stuck behind a long network queue delay when a large burst of write traffic is suddenly dumped on the datacenter interconnect.
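For further explanation, one way to keep time-critical fault-detection messages from waiting behind a burst of write traffic is a priority queue on the sending side. The class `LinkQueue`, its three priority levels, and the message strings are assumptions for illustration; the description above does not prescribe a particular prioritization mechanism.

```python
import heapq
from itertools import count
from typing import List, Tuple


class LinkQueue:
    """Hypothetical outbound queue that sends urgent messages first."""

    FAILURE_DETECTION = 0  # most urgent
    CONFIGURATION = 1
    BULK_DATA = 2  # least urgent

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._sequence = count()  # preserves FIFO order within a priority

    def send(self, priority: int, message: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), message))

    def next_message(self) -> str:
        return heapq.heappop(self._heap)[2]
```

A heartbeat queued after many bulk writes is still dequeued first, while the bulk writes keep their original relative order.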

[0244] The reader will further understand that different types of communication may use the same connections or different connections, in various combinations, and may use the same networks or different networks. Furthermore, some communications may be encrypted and secured while other communications may not be encrypted. In some cases, the data communication links could be used to forward I/O requests from one storage system to another (either directly as the requests themselves or as logical descriptions of the operations that the I/O requests represent). This could be used, for example, in cases where one storage system has up-to-date, in-sync content for a pod and another storage system does not currently have up-to-date, in-sync content for the pod. In such cases, as long as the data communication links are running, requests can be forwarded from the storage system that is not up-to-date and in-sync to the storage system that is up-to-date and in-sync.
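
By way of illustration only, the forwarding decision described in paragraph [0244] might be sketched as follows. The function name `choose_target`, the use of strings to identify storage systems, and the `link_up` callback are assumptions introduced for this sketch and do not appear in the disclosure.

```python
from typing import Callable, Iterable, Optional


def choose_target(
    receiver: str,
    in_sync: Iterable[str],
    link_up: Callable[[str, str], bool],
) -> Optional[str]:
    """Pick the storage system that should process an I/O request.

    receiver: system that received the request from the host.
    in_sync:  systems holding up-to-date, in-sync content for the pod.
    link_up:  hypothetical probe returning True if the data communication
              link between two systems is currently running.
    """
    candidates = sorted(set(in_sync))
    if receiver in candidates:
        return receiver  # the receiver can process the request itself
    for peer in candidates:
        if link_up(receiver, peer):
            return peer  # forward the request over a running link
    return None  # nowhere to forward; the caller must defer or fail
```

A caller would forward the request (or a logical description of it) to the returned system, and would fall back to some other handling when `None` is returned.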

[0245] The example method shown in FIG. 7 also includes exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728). In the example method shown in FIG. 6, the timing information (710, 722, 726) for a particular storage system (714, 724, 728) may be embodied, for example, as the value of a clock within the storage system (714, 724, 728). In an alternative embodiment, the timing information (710, 722, 726) for a particular storage system (714, 724, 728) may be embodied as a value that serves as a proxy for a clock value. The value that serves as a proxy for a clock value may be included in a token that is exchanged between the storage systems. Such a proxy value may be embodied, for example, as a sequence number that a particular storage system (714, 724, 728) or storage controller can internally record as having been sent at a particular time.
In such an example, if the token (e.g., the sequence number) is received back, the associated clock value may be found and utilized as the basis for determining whether a valid lease is still in place. In the example method shown in FIG. 6, exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728) may be carried out, for example, by each storage system (714, 724, 728) sending timing information to each other storage system (714, 724, 728) in the pod periodically, on demand, within a predetermined amount of time after a lease is established, within a predetermined amount of time before a lease is set to expire, as part of an attempt to initiate or re-establish a synchronous replication relationship, or in some other way.
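
By way of illustration only, the use of a sequence number as a proxy for a clock value might be sketched as follows. The class name `TokenLeaseTracker`, its method names, and the choice to treat the lease as valid up to and including its expiry value are assumptions made for this sketch.

```python
import time


class TokenLeaseTracker:
    """Records when each sequence-number token was sent, so that a token
    received back can be mapped to the local clock value it stands for."""

    def __init__(self, lease_interval_ms, clock=None):
        self.lease_interval_ms = lease_interval_ms
        # Local monotonic clock in milliseconds; injectable for testing.
        self._clock = clock or (lambda: time.monotonic() * 1000.0)
        self._next_seq = 0
        self._sent_at = {}  # sequence number -> local clock value at send
        self.lease_expires_at = None

    def issue_token(self):
        """Create a token and remember the clock value it is a proxy for."""
        seq = self._next_seq
        self._next_seq += 1
        self._sent_at[seq] = self._clock()
        return seq

    def token_returned(self, seq):
        """Handle a token received back from a paired storage system."""
        sent_at = self._sent_at.pop(seq, None)
        if sent_at is None:
            return False  # unknown token, or one that was already used
        candidate = sent_at + self.lease_interval_ms
        if self.lease_expires_at is None or candidate > self.lease_expires_at:
            self.lease_expires_at = candidate
        return True

    def lease_valid(self):
        """Assumption: the lease is valid up to and including its expiry."""
        return (self.lease_expires_at is not None
                and self._clock() <= self.lease_expires_at)
```

Because only the sender interprets the token, no clock value needs to cross the link; the paired storage system simply returns the most recent sequence number it has seen.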

[0246] The example method shown in FIG. 7 also includes establishing (708), in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), a synchronous replication lease, the synchronous replication lease identifying a period of time during which the synchronous replication relationship is valid. In the example method shown in FIG. 7, a synchronous replication relationship is formed as a set of storage systems (714, 724, 728) that replicate some dataset (712) between these largely independent stores, where each storage system (714, 724, 728) has its own copy and its own separate internal management of the relevant data structures for defining storage objects, for mapping objects to physical storage, for deduplication, for defining the mapping of content to snapshots, and so on. A synchronous replication relationship can be specific to a particular dataset, such that a particular storage system (714, 724, 728) may be associated with more than one synchronous replication relationship, where each synchronous replication relationship is differentiated by the dataset being described and may further consist of a different set of additional member storage systems.

[0247] In the example method shown in FIG. 7, a synchronous replication lease may be established (708) in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728) in a variety of different ways. In one embodiment, the storage systems may establish (708) a synchronous replication lease by utilizing the timing information (710, 722, 726) for each of the plurality of storage systems (714, 724, 728) to coordinate clocks. In such an example, once the clocks are coordinated for each of the storage systems (714, 724, 728), the storage systems may establish (708) a synchronous replication lease that extends for a predetermined period of time beyond the coordinated clock value. For example, if the clock for each storage system (714, 724, 728) is coordinated to a value of X, the storage systems (714, 724, 728) may each be configured to establish a synchronous replication lease that is valid until X + 2 seconds.

[0248] In an alternative embodiment, the need to coordinate clocks between the storage systems (714, 724, 728) may be avoided while still achieving a timing guarantee. In such an embodiment, a storage controller within each storage system (714, 724, 728) may have a local monotonically increasing clock. A synchronous replication lease may be established (708) between storage controllers (such as a primary controller in one storage system communicating with a primary controller in a paired storage system) by each controller sending its clock value to the other storage controller along with the last clock value it received from the other storage controller. When a particular controller receives its own clock value back from another controller, it adds some agreed-upon lease interval to that received clock value and uses the result to establish (708) its local synchronous replication lease. In this way, the synchronous replication lease may be calculated in dependence upon a value of the local clock that was received back from another storage system.

[0249] Consider an example in which a storage controller of a first storage system (714) is communicating with a storage controller of a second storage system (724). In such an example, assume that the value of the monotonically increasing clock for the storage controller of the first storage system (714) is 1000 milliseconds. Further assume that the storage controller of the first storage system (714) sends a message to the storage controller of the second storage system (724) indicating that its clock value was 1000 milliseconds at the time the message was generated.
In such an example, assume that 500 milliseconds after the storage controller of the first storage system (714) sends a message to the storage controller of the second storage system (724) indicating that its clock value was 1000 milliseconds when the message was generated, the storage controller of the first storage system (714) receives a message from the storage controller of the second storage system (724) indicating that 1) the value of the monotonically increasing clock of the storage controller of the second storage system (724) was 5000 milliseconds when the message was generated, and 2) the last value of the monotonically increasing clock of the storage controller of the first storage system (714) received by the second storage system (724) was 1000 milliseconds. In such an example, if the agreed-upon lease interval is 2000 milliseconds, the first storage system (714) establishes (708) a synchronous replication lease that is valid until the monotonically increasing clock for the storage controller of the first storage system (714) reaches a value of 3000 milliseconds. If the storage controller of the first storage system (714) does not receive a message from the storage controller of the second storage system (724) containing an updated value of the monotonically increasing clock for the storage controller of the first storage system (714) by the time the monotonically increasing clock for the storage controller of the first storage system (714) reaches a value of 3000 milliseconds, the first storage system (714) may treat the synchronous replication lease as expired and take various actions, which are described in more detail below. The reader will understand that the storage controllers in the remaining storage systems (724, 728) in the pod will react similarly and perform similar tracking and updating of the synchronous replication lease. 
Essentially, the receiving controller can be assured that the network and the paired controller were running somewhere during that time interval, and that the paired controller received the message that it sent somewhere during that time interval. Without any coordination of clocks, the receiving controller cannot know exactly where in that time interval the network and the paired controller were running, and cannot really know whether there were queuing delays in sending its clock value or in receiving its clock value back.
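
By way of illustration only, the uncoordinated clock exchange of paragraphs [0248] and [0249] might be sketched as follows. The names `ClockMessage` and `LeaseEndpoint` are hypothetical, clock values are passed in explicitly rather than read from a real clock, and treating the lease as valid up to and including its expiry value is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClockMessage:
    sender_clock_ms: int                 # sender's clock when generated
    echoed_peer_clock_ms: Optional[int]  # last recipient clock the sender saw


class LeaseEndpoint:
    """One storage controller's view of the lease with a paired controller."""

    def __init__(self, lease_interval_ms: int):
        self.lease_interval_ms = lease_interval_ms
        self.last_peer_clock_ms: Optional[int] = None
        self.lease_expires_at_ms: Optional[int] = None

    def make_message(self, local_clock_ms: int) -> ClockMessage:
        """Send our clock value along with the last value received."""
        return ClockMessage(local_clock_ms, self.last_peer_clock_ms)

    def on_message(self, msg: ClockMessage) -> None:
        self.last_peer_clock_ms = msg.sender_clock_ms
        if msg.echoed_peer_clock_ms is not None:
            # Our own clock value came back: lease = echoed value + interval.
            candidate = msg.echoed_peer_clock_ms + self.lease_interval_ms
            if (self.lease_expires_at_ms is None
                    or candidate > self.lease_expires_at_ms):
                self.lease_expires_at_ms = candidate

    def lease_valid(self, local_clock_ms: int) -> bool:
        return (self.lease_expires_at_ms is not None
                and local_clock_ms <= self.lease_expires_at_ms)


# The numbers from paragraph [0249]: lease interval of 2000 milliseconds.
first, second = LeaseEndpoint(2000), LeaseEndpoint(2000)
second.on_message(first.make_message(1000))  # first's clock reads 1000
first.on_message(second.make_message(5000))  # reply generated at 5000
```

After this exchange, `first.lease_expires_at_ms` is 3000 on the first controller's own clock, regardless of the second controller's unrelated clock value of 5000. The second controller holds no lease yet, because it has not received its own clock value back.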

[0250] In a pod consisting of two storage systems, each with a simple primary controller, where the primary controllers are exchanging clocks as part of their cluster communication, each primary controller can use an activity lease to put a bound on when it will no longer know for certain that the paired controller was running. At the point where a primary controller becomes uncertain (when the controller's activity lease for the connection has expired), it can send a message indicating that it is uncertain and that a properly synchronized connection must be re-established before the activity lease can be resumed. If the network is working in one direction but not properly in the other, these messages may be received while responses to them are not. Receipt of such a message may be the first indication to a running paired controller that the connection is not running normally, because its own activity lease may not yet have expired, owing to a different combination of lost messages and queuing delays. As a result, a controller that receives such a message should consider its own activity lease to be expired as well, and it should start sending messages of its own in an attempt to coordinate synchronizing the connection and resuming the activity lease. Until that happens and a new set of clock exchanges succeeds, neither controller can consider its activity lease to be valid.

[0251] In this model, a controller can wait for the lease interval after it started sending re-establishment messages and, if it has not received a response, it can be assured that either the paired controller is down or the paired controller's own lease for the connection has expired. To handle a small amount of clock drift, the controller may wait slightly longer than the lease interval (this wait being the re-establishment lease). When a controller receives a re-establishment message, it could consider the re-establishment lease to be expired immediately rather than waiting (because it knows that the sending controller's activity lease has expired), but it will often make sense to attempt further messaging before giving up, in case the message loss was a temporary condition caused, for example, by a congested network switch.
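
By way of illustration only, the activity lease and re-establishment behavior of paragraphs [0250] and [0251] might be sketched as the following small state holder. The class name `ActivityLease`, the message name `REESTABLISH`, and the `drift_margin_ms` parameter are assumptions introduced for this sketch; clock values are passed in explicitly.

```python
from typing import List, Optional

REESTABLISH = "REESTABLISH"  # hypothetical re-establishment message


class ActivityLease:
    def __init__(self, lease_interval_ms: int, drift_margin_ms: int = 0):
        self.lease_interval_ms = lease_interval_ms
        self.drift_margin_ms = drift_margin_ms  # allowance for clock drift
        self.expires_at_ms: Optional[int] = None
        self.reestablish_started_ms: Optional[int] = None

    def renew(self, expires_at_ms: int) -> None:
        """A successful clock exchange resumes the activity lease."""
        self.expires_at_ms = expires_at_ms
        self.reestablish_started_ms = None

    def is_valid(self, now_ms: int) -> bool:
        return (self.reestablish_started_ms is None
                and self.expires_at_ms is not None
                and now_ms <= self.expires_at_ms)

    def _start_reestablish(self, now_ms: int) -> List[str]:
        if self.reestablish_started_ms is None:
            self.reestablish_started_ms = now_ms
        return [REESTABLISH]

    def tick(self, now_ms: int) -> List[str]:
        """Messages to send now; re-establishment begins once uncertain."""
        if self.is_valid(now_ms):
            return []
        return self._start_reestablish(now_ms)

    def on_reestablish(self, now_ms: int) -> List[str]:
        """The peer is uncertain, so treat our own lease as expired too."""
        self.expires_at_ms = None
        return self._start_reestablish(now_ms)

    def peer_cannot_hold_lease(self, now_ms: int) -> bool:
        """True once the lease interval plus drift margin has elapsed since
        re-establishment messages were first sent without a renewal."""
        if self.reestablish_started_ms is None:
            return False
        waited = now_ms - self.reestablish_started_ms
        return waited >= self.lease_interval_ms + self.drift_margin_ms
```

In this sketch a renewal clears the re-establishment state, which corresponds to a new set of clock exchanges succeeding; retrying messages before giving up is left to the caller.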

[0252] In an alternative embodiment, in addition to establishing a synchronous replication lease, a cluster membership lease may also be established upon receipt of a clock value from a paired storage system or upon receipt back of a clock exchanged with a paired storage system. In such an example, each storage system may have its own synchronous replication lease and its own cluster membership lease with every paired storage system. The expiration of a synchronous replication lease with any pair may result in paused processing. Cluster membership, however, cannot be recalculated until the cluster membership lease has expired with all pairs.
As such, the duration of the cluster membership lease should be set, based on the interaction of messages and clock values, to ensure that the cluster membership lease with a pair will not expire until after the pair's synchronous replication link for that link has expired. The reader will understand that a cluster membership lease can be established by each storage system in a pod and may be associated with the communication link between any two storage systems that are members of the pod. Furthermore, the cluster membership lease may extend past the expiration of the synchronous replication lease for a period at least as long as the period for expiration of the synchronous replication lease. The cluster membership lease may be extended upon receipt of a clock value received from a paired storage system as part of a clock exchange, where the cluster membership lease period measured from the current clock value may be at least as long as the period established for the last synchronous replication lease extension based on the exchanged clock values. In additional embodiments, additional cluster membership information may be exchanged over the connection, including when a session is first negotiated. The reader will understand that in embodiments that utilize a cluster membership lease, each storage system (or storage controller) may have its own value for the cluster membership lease. Given that expiration of the cluster lease allows new membership to be established, such as through a mediator race, and that expiration of the synchronous replication lease forces the processing of new requests to pause, such a lease should not expire until all synchronous replication leases across all pod members are guaranteed to have expired. In such an example, the pause must be guaranteed to be in place everywhere before cluster membership actions can be taken.
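
By way of illustration only, the relationship between the two leases in paragraph [0252] might be sketched as follows. The function names and the specific formula (taking the later of two candidate expiry values) are assumptions of this sketch; it illustrates the ordering constraint and is not a proof that the guarantee holds across storage systems with unrelated clocks.

```python
def extend_leases(echoed_clock_ms, now_ms, sync_interval_ms):
    """Compute both lease expiries after our clock value is received back.

    echoed_clock_ms:  our own clock value, echoed by the paired system.
    now_ms:           our current clock value (never before the echo).
    sync_interval_ms: agreed synchronous replication lease interval.
    """
    sync_expiry = echoed_clock_ms + sync_interval_ms
    membership_expiry = max(
        # Outlast the synchronous replication lease by at least its period.
        sync_expiry + sync_interval_ms,
        # Run at least one lease period from the current clock value.
        now_ms + sync_interval_ms,
    )
    return sync_expiry, membership_expiry


def must_pause_io(now_ms, sync_expiries):
    """Expiry of the synchronous replication lease with ANY pair pauses I/O."""
    return any(now_ms > expiry for expiry in sync_expiries.values())


def may_recalculate_membership(now_ms, membership_expiries):
    """Membership may be recalculated only once the cluster membership
    lease has expired with ALL pairs."""
    return all(now_ms > expiry for expiry in membership_expiries.values())
```

With an echoed clock value of 1000, a current clock value of 1500, and an interval of 2000 milliseconds, the sketch yields a synchronous replication lease expiring at 3000 and a cluster membership lease expiring at 5000, so the pause in processing always begins before membership can change.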

[0253] The reader will understand that although only one of the storage systems (714) is depicted as identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) will be synchronously replicated, configuring (704) one or more data communication links (716, 718, 720) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712), exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), and establishing (708), in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), a synchronous replication lease, the remaining storage systems (724, 728) may also carry out such steps. In fact, because establishing a synchronous replication relationship between two or more storage systems (714, 724, 728) may require collaboration and interaction between two or more storage systems (714, 724, 728), all three storage systems (714, 724, 728) may carry out one or more of the steps described above at the same time.

[0254] For further explanation, FIG. 8 sets forth a flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems (714, 724, 728) according to some embodiments of the present disclosure. The example method shown in FIG. 8 is similar to the example method shown in FIG. 46, as the example method shown in FIG. 8 also includes identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) will be synchronously replicated, configuring (704) one or more data communication links (716, 718, 720) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712), exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), and establishing, in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), a synchronous replication lease, the synchronous replication lease identifying a period of time during which the synchronous replication relationship is valid.

[0255] In the example method shown in FIG. 8, establishing (708) a synchronous replication lease in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728) may include coordinating (802) clocks between the plurality of storage systems (714, 724, 728). In the example method shown in FIG. 8, coordinating (802) clocks between the plurality of storage systems (714, 724, 728) may be carried out, for example, through the exchange of one or more messages sent between the storage systems (714, 724, 728). The one or more messages sent between the storage systems (714, 724, 728) may include information such as, for example, the clock value of the storage system whose clock value will be used by all other storage systems, an instruction for all storage systems to set their clock values to a predetermined value, a confirmation message from a storage system that it has updated its clock value, and so on. In such an example, the storage systems (714, 724, 728) may be configured such that the clock value for a particular storage system (e.g., a leader storage system) should be used by all other storage systems, such that the clock value from whichever of the storage systems meets some particular criterion (e.g., the highest clock value) should be used by all other storage systems, and so on.
In such examples, some predetermined amount of time may be added to a clock value received from another storage system to account for transmission time associated with the exchange of messages.
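
By way of illustration only, the clock coordination of paragraph [0255], together with a lease that extends a fixed interval beyond the coordinated value, might be sketched as follows. The function names, the dictionary of reported clock values, and the `transit_allowance_ms` parameter are assumptions introduced for this sketch.

```python
def coordinated_clock_ms(reported_ms, transit_allowance_ms=0, leader=None):
    """Choose the clock value that every storage system will adopt.

    reported_ms:          dict of storage system id -> reported clock value.
    transit_allowance_ms: predetermined amount added to account for the
                          transmission time of the message exchange.
    leader:               if given, that system's clock value is used;
                          otherwise the highest reported value is used.
    """
    if leader is not None:
        base = reported_ms[leader]
    else:
        base = max(reported_ms.values())
    return base + transit_allowance_ms


def lease_expiry_ms(coordinated_ms, lease_interval_ms):
    """The lease extends a predetermined interval past the coordinated value."""
    return coordinated_ms + lease_interval_ms
```

For example, with reported values of 4990, 5000, and 4975 milliseconds and no leader, every storage system adopts 5000, and a lease interval of 2000 milliseconds gives a lease that runs until 7000.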

[0256] In the example method shown in FIG. 8, establishing (708) a synchronous replication lease in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728) may include exchanging (804) uncoordinated clocks between the plurality of storage systems (714, 724, 728). Exchanging (804) uncoordinated clocks between the plurality of storage systems (714, 724, 728) may be carried out, for example, by a storage controller in each storage system (714, 724, 728) exchanging values of a local monotonically increasing clock, as described in greater detail above. In such an example, each storage system (714, 724, 728) may utilize an agreed-upon synchronous replication lease interval and the messaging received from other storage systems (714, 724, 728) to establish (708) a synchronous replication lease.

[0257] The example method shown in FIG. 8 also includes delaying (806) the processing of I/O requests received after the synchronous replication lease has expired. I/O requests received by any of the storage systems after the synchronous replication lease has expired may be delayed, for example, for a predetermined amount of time that is sufficient for attempting to re-establish the synchronous replication relationship, until a new synchronous replication lease has been established, and so on. In such an example, a storage system may delay (806) the processing of I/O requests by failing them with some type of "busy" or temporary failure indication, or in some other way.
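
By way of illustration only, delaying I/O requests after lease expiry, as described in paragraph [0257], might be sketched as follows. The class name `IoGate`, the outcome strings, and the `grace_ms` parameter (standing in for the predetermined amount of time) are assumptions introduced for this sketch; actually applying a request is left to the caller.

```python
from collections import deque

APPLIED, DEFERRED, BUSY = "applied", "deferred", "busy"


def lease_valid(now_ms, lease_expires_at_ms):
    return lease_expires_at_ms is not None and now_ms <= lease_expires_at_ms


class IoGate:
    """Defers requests that arrive while no valid lease is in place."""

    def __init__(self, grace_ms):
        self.grace_ms = grace_ms  # how long a request may wait for a lease
        self._deferred = deque()  # (arrival time, request), oldest first

    def submit(self, request, now_ms, lease_expires_at_ms):
        # Process immediately only if the lease is valid and nothing
        # older is still waiting, so that request order is preserved.
        if lease_valid(now_ms, lease_expires_at_ms) and not self._deferred:
            return APPLIED
        self._deferred.append((now_ms, request))
        return DEFERRED

    def drain(self, now_ms, lease_expires_at_ms):
        """Release deferred requests; returns (request, outcome) pairs."""
        results = []
        if lease_valid(now_ms, lease_expires_at_ms):
            while self._deferred:
                _, request = self._deferred.popleft()
                results.append((request, APPLIED))
            return results
        # No lease yet: fail requests that have waited past the grace period.
        while self._deferred and now_ms - self._deferred[0][0] >= self.grace_ms:
            _, request = self._deferred.popleft()
            results.append((request, BUSY))
        return results
```

A host that receives the busy outcome would be expected to retry, by which time a new synchronous replication lease may have been established.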

[0258] For further explanation, FIG. 9 sets forth a flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems (714, 724, 728) according to some embodiments of the present disclosure. The example method shown in FIG. 9 is similar to the example method shown in FIG. 46, as the example method shown in FIG. 9 also includes identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) will be synchronously replicated, configuring (704) one or more data communication links (716a, 716b, 718a, 718b, 720a, 720b) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712), exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), and establishing (708), in dependence upon the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), a synchronous replication lease, the synchronous replication lease identifying a period of time during which the synchronous replication relationship is valid.

[0259] In the example method shown in FIG. 9, configuring (704) one or more data communication links (716a, 716b, 718a, 718b, 720a, 720b) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712) may include configuring (902), for each of a plurality of data communication types, a data communication link (716a, 716b, 718a, 718b, 720a, 720b) between each of the plurality of storage systems (714, 724, 728) to be used for synchronously replicating the dataset (712). In the example method shown in FIG. 9, each storage system may be configured to generate a plurality of data communication types that the storage system sends to other storage systems in the pod.
For example, a storage system may generate a first type of data communication that includes data that is part of an I/O operation (e.g., data written to the storage system as part of a write request issued by a host), may be configured to generate a second type of data communication that includes configuration changes (e.g., information generated in response to creating, expanding, deleting, or renaming a volume), and may be configured to generate a third type of data communication that includes communications related to detecting and handling storage system and interconnect failures. In such examples, the data communication type may be determined, for example, based on which software module initiated the message, based on which hardware component initiated the message, based on the type of event that initiated the message, and in other manners. In the example method shown in FIG. 9, configuring (902) a data communication link (716a, 716b, 718a, 718b, 720a, 720b) between each of the plurality of storage systems (714, 724, 728) for each of the plurality of data communication types may be performed, for example, by configuring the storage systems to use a separate interconnect for each of the plurality of data communication types, by configuring the storage systems to use a separate network for each of the plurality of data communication types, or in other ways.
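
By way of illustration only, configuring a separate data communication link for each data communication type, as described in paragraph [0259], might be sketched as follows. The `CommType` enumeration, the link identifier format, and the function names are assumptions introduced for this sketch.

```python
from enum import Enum
from itertools import combinations


class CommType(Enum):
    IO_DATA = "io"                # data that is part of I/O processing
    CONFIG_CHANGE = "config"      # creating, extending, deleting, renaming
    FAILURE_HANDLING = "failure"  # detecting and handling failures


def configure_links(systems, comm_types=tuple(CommType)):
    """Return one hypothetical link id per pair of systems and per type."""
    links = {}
    for a, b in combinations(sorted(systems), 2):
        for kind in comm_types:
            links[(a, b, kind)] = f"{a}<->{b}/{kind.value}"
    return links


def link_for(links, src, dst, kind):
    """Look up the link that carries messages of a given type."""
    a, b = sorted((src, dst))
    return links[(a, b, kind)]
```

With three storage systems and two communication types, this sketch produces six links, two per pair of storage systems.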

[0260] The example method shown in FIG. 9 also includes detecting (904) that a synchronous replication lease has expired. In the example method shown in FIG. 9, detecting (904) that the synchronous replication lease has expired may be carried out, for example, by a particular storage system comparing its current clock value to the period during which the lease was valid. Consider an example in which the storage systems (714, 724, 728) are configured to coordinate their clocks by setting the value of the clock within each storage system (714, 724, 728) to 5000 milliseconds, and each storage system (714, 724, 728) establishes (708) a synchronous replication lease that extends for a lease interval of 2000 milliseconds beyond that clock value, such that the synchronous replication lease for each storage system (714, 724, 728) expires when the clock within that storage system (714, 724, 728) reaches a value greater than 7000 milliseconds. In such an example, detecting (904) that the synchronous replication lease has expired may be carried out by determining that the clock within a particular storage system (714, 724, 728) has reached a value of 7001 milliseconds or greater.
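
The arithmetic in this example can be checked with a minimal sketch. The `Lease` class and its field names are assumptions made for illustration; only the 5000 millisecond clock value and the 2000 millisecond lease interval come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    start_ms: int      # coordinated clock value when the lease was established
    interval_ms: int   # agreed lease interval

    @property
    def expires_after_ms(self) -> int:
        """Last clock value at which the lease is still valid."""
        return self.start_ms + self.interval_ms

    def is_expired(self, now_ms: int) -> bool:
        """The lease has expired once the local clock exceeds the valid period."""
        return now_ms > self.expires_after_ms


# Values from the example: clock set to 5000 ms, lease interval of 2000 ms.
lease = Lease(start_ms=5000, interval_ms=2000)
```

With these values the lease is still valid at a clock value of 7000 milliseconds and is detected as expired at 7001 milliseconds.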

[0261] The reader will understand that the occurrence of other events may also cause each storage system (714, 724, 728) to immediately treat the synchronous replication lease as expired. For example, a storage system (714, 724, 728) may immediately treat the synchronous replication lease as expired upon detecting a communication failure between the storage system (714, 724, 728) and another storage system (714, 724, 728) in the pod; a storage system (714, 724, 728) may immediately treat the synchronous replication lease as expired upon receiving a lease reestablishment message from another storage system (714, 724, 728) in the pod; a storage system (714, 724, 728) may immediately treat the synchronous replication lease as expired upon detecting that another storage system (714, 724, 728) in the pod has failed; and so on. In such an example, the occurrence of any of the events described in the preceding sentence may cause the storage system to detect (904) that the synchronous replication lease has expired.
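
As a hedged sketch, the event-driven expiry described above can be combined with the clock comparison in a single predicate. The `Event` names are hypothetical labels for the three events listed in the text, plus one ordinary event added only for contrast.

```python
from enum import Enum, auto
from typing import Iterable


class Event(Enum):
    COMM_FAILURE = auto()            # communication failure with a pod member
    REESTABLISH_MSG_RECEIVED = auto()  # lease reestablishment message received
    PEER_FAILED = auto()             # another storage system in the pod failed
    HEARTBEAT = auto()               # hypothetical ordinary event, no effect


# Events that cause the lease to be treated as expired immediately.
FORCE_EXPIRY = frozenset(
    {Event.COMM_FAILURE, Event.REESTABLISH_MSG_RECEIVED, Event.PEER_FAILED}
)


def lease_expired(now_ms: int, lease_end_ms: int, events: Iterable[Event]) -> bool:
    """Expired if the clock passed the lease end or a forcing event occurred."""
    if now_ms > lease_end_ms:
        return True
    return any(event in FORCE_EXPIRY for event in events)
```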

[0262] The example method shown in FIG. 9 also includes reestablishing (906) the synchronous replication relationship. In the example method shown in FIG. 9, reestablishing (906) the synchronous replication relationship may be carried out, for example, through the use of one or more reestablishment messages. Such reestablishment messages may include, for example, an identification of the pod for which the synchronous replication relationship is to be reestablished, information needed to configure one or more data communication links, updated timing information, and so on. In this manner, the storage systems (714, 724, 728) may reestablish (906) the synchronous replication relationship in substantially the same manner in which the synchronous replication relationship was originally created, including, but not limited to, each storage system performing one or more of: identifying (702), for a dataset (712), a plurality of storage systems (714, 724, 728) across which the dataset (712) is to be synchronously replicated; configuring (704) one or more data communication links (716a, 716b, 718a, 718b, 720a, 720b) between each of the plurality of storage systems (714, 724, 728) used to synchronously replicate the dataset (712); exchanging (706), between the plurality of storage systems (714, 724, 728), timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728); and establishing (708), in accordance with the timing information (710, 722, 726) for at least one of the plurality of storage systems (714, 724, 728), a synchronous replication lease, the synchronous replication lease identifying a period of time during which the synchronous replication relationship is valid.

[0263] In the example method shown in FIG. 9, expiration of a synchronous replication lease may be followed by some set of events, followed by a reestablishment message, followed by a new activity event or some other action. Data communications, configuration communications, or other communications may be in transit while the synchronous replication lease expires and is reestablished. In fact, a communication may not be received until, for example, after a new synchronous replication lease has been established. In such a case, the communication may have been sent based on one understanding of the state of the pod, cluster, or network link, and may now be received by a storage system (714, 724, 728) that has a different understanding of one aspect or another of that state. In general, therefore, there should be some means of ensuring that a received communication is discarded if it was sent before some set of cluster or link state changes. One way to ensure this is to establish a session identifier (e.g., a number) that is associated with establishing or reestablishing a link with a working synchronous replication lease that is being extended. After cluster communications are reestablished, the link gets a new session identifier. This identifier may be included with data messages, configuration messages, or other communication messages. Any message received with the wrong session identifier is discarded or results in an error response indicating a session identifier mismatch.
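
The session identifier check can be sketched as follows. This is a minimal illustration, assuming the identifier is a number drawn from a shared counter; the `Link` and `Message` classes and the response strings are hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    session_id: int  # identifier of the session in which the message was sent
    payload: str


class Link:
    """A link that is assigned a fresh session identifier on each (re)establishment."""

    _ids = itertools.count(1)  # assumption: identifiers are increasing numbers

    def __init__(self) -> None:
        self.session_id = next(Link._ids)

    def reestablish(self) -> None:
        """Called after cluster communications are reestablished."""
        self.session_id = next(Link._ids)

    def receive(self, msg: Message) -> str:
        """Reject any message that was sent under a different session."""
        if msg.session_id != self.session_id:
            return "error: session identifier mismatch"
        return f"accepted: {msg.payload}"
```

A message sent before the link was reestablished carries the old identifier, so it is rejected even if it arrives after the new lease is in place.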

[0264] The reader will understand that how a storage system (714, 724, 728) responds to the reestablishment of a synchronous replication lease may vary based on the different embodiments that the storage systems and the pod may take. In the simple case of primary controllers with two storage systems, a new request to perform an operation (a read, a write, a file operation, an object operation, a management operation, etc.) on a storage system that is received after the receiving controller's synchronous replication lease has expired may have its processing postponed, may be dropped, or may be failed with some kind of "retry later" error code. As a result, a running primary storage controller may be assured that its paired storage controller is not processing new requests if the paired storage controller's synchronous replication lease has expired, which the running controller can be certain of once its own reestablish lease has expired. After the reestablish lease has expired, it is safe for the controller to consider its paired controller offline and then to consider corrective actions, including continuing storage processing without the paired controller. Exactly what those actions might be may vary based on a wide variety of considerations and implementation details.
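
A minimal sketch of the request handling described above, assuming the "retry later" option is chosen over postponing or dropping. The function names and return strings are hypothetical.

```python
def handle_request(op: str, now_ms: int, lease_end_ms: int) -> str:
    """Fail new requests with a retry-later code once the local lease has expired."""
    if now_ms > lease_end_ms:
        return "RETRY_LATER"
    return f"OK:{op}"


def may_treat_peer_offline(now_ms: int, reestablish_lease_end_ms: int) -> bool:
    """Considering the paired controller offline is safe only after the
    local reestablish lease has expired."""
    return now_ms > reestablish_lease_end_ms
```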

[0265] In the case of storage systems with primary and secondary controllers, the primary controller still running on one storage system may attempt to connect to the former secondary controller of the paired storage system, on the assumption that the former secondary controller of the paired storage system may have taken over. Alternatively, the primary controller still running on one storage system may wait for some particular amount of time, perhaps the maximum secondary takeover time. If the secondary controller connects and establishes a new connection with a new synchronous replication lease within a reasonable time, the pod may then recover itself to a consistent state (described below) and then continue normally. If the paired secondary controller does not connect quickly enough, the still-running primary controller may take further action, such as attempting to determine whether it should consider the paired storage system to have failed and then continue operating without the paired storage system. The primary controller may instead keep an active, leased connection to the secondary controller on the paired storage system in the pod. In that case, expiration of the primary-to-primary reestablish lease may instead result in the surviving primary using that connection to query for secondary takeover, rather than having to establish that connection first. It is also possible for two primary storage controllers to be running while the network between them is not functioning, even though the network is functioning between one or the other primary controller and the paired secondary controller. In that case, high availability monitoring within the storage system may not detect, on its own, a condition that would trigger a failover from the primary controller to the secondary controller. Responses to that condition include triggering a failover from primary to secondary anyway, simply to resume synchronous replication; routing communication traffic from the primary through the secondary; or operating exactly as if communication had failed completely between the two storage systems, resulting in the same fault handling as if that had occurred.

[0266] When multiple controllers are active for a pod (including in both dual active-active controller storage systems and scale-out storage systems), leases may still be maintained by individual controllers' cluster communications with any or all controllers of the paired storage system. In this case, an expired synchronous replication lease may lead to a pause of new request processing for the pod across the entire storage system. The lease model may be extended with an exchange of clocks and paired clock responses between all active controllers within a storage system, with a further exchange of those clocks with any paired controllers in the paired storage system. If there is an operating path over which a particular local controller's clock is exchanged with any paired controller, then the controller can use that path for an independent synchronous replication lease and possibly for an independent reestablish lease. In this case, local controllers within a storage system may additionally exchange clocks between each other for local leases between each other as well. This may already be incorporated into the local storage system's high availability and monitoring mechanisms, but any timing related to the storage system's high availability mechanisms should be taken into account in the durations of the activity and reestablish leases, or in any further delay between reestablish lease expiration and the actions taken to handle an interconnect fault.

[0267] Alternatively, storage system-to-storage system cluster communications or lease protocols alone may be assigned to one primary controller at a time within an individual multi-controller or scale-out storage system, at least for a particular pod. This service may migrate from controller to controller as a result of faults or, perhaps, as a result of load imbalances. Alternatively, cluster communications or lease protocols may run on a subset of controllers (e.g., two) in order to limit clock exchanges or the complexity of analyzing fault scenarios. Each local controller may need to exchange clocks with the controllers that handle the storage system-to-storage system leases, and the time to respond after a lease expires may need to be adjusted accordingly, to account for potential cascading delays before the individual controllers can be certain that they have paused processing. Connections that are not currently relied on for leases related to processing pauses may still be monitored for alerting purposes.

[0268] The example method shown in FIG. 9 also includes attempting (908) to take over I/O processing for the dataset. In the example method shown in FIG. 9, attempting (908) to take over I/O processing for the dataset (712) may be carried out, for example, by a storage system (714, 724, 728) racing to a mediator. If a particular storage system (714, 724, 728) successfully takes over I/O processing for the dataset (712), all accesses of the dataset (712) are serviced by that particular storage system (714, 724, 728) until a synchronous replication relationship can be reestablished, and any changes to the dataset (712) that occurred after the previous synchronous replication relationship expired may then be transferred to and persisted on the other storage systems (714, 724, 728). In such an example, the attempt (908) to take over I/O processing for the dataset (712) may occur only after some period of time has elapsed after the synchronous replication lease expired. For example, attempts to determine how to proceed after a link failure (including one or more of the storage systems attempting to take over I/O processing for the dataset) may not begin until some period of time after the synchronous replication lease has expired, for example a period at least as long as the maximum lease time resulting from a clock exchange.
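
One way to picture the delayed race to a mediator is the sketch below. The mediator's interface is not described in this passage, so the `Mediator` class here is purely an assumption: a first-claim-wins arbiter. The grace period is taken to equal the maximum lease time, as the example in the text suggests.

```python
import threading
from typing import Optional


class Mediator:
    """Hypothetical arbiter: the first storage system to claim wins."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._winner: Optional[str] = None

    def try_claim(self, system_id: str) -> bool:
        with self._lock:
            if self._winner is None:
                self._winner = system_id
            return self._winner == system_id


def attempt_takeover(mediator: Mediator, system_id: str, now_ms: int,
                     lease_end_ms: int, max_lease_ms: int) -> bool:
    """Race to the mediator only after waiting at least the maximum lease
    time beyond lease expiry, so no peer can still hold a valid lease."""
    if now_ms < lease_end_ms + max_lease_ms:
        return False  # too early to attempt a takeover
    return mediator.try_claim(system_id)
```

The winner of the race would service all accesses to the dataset until the synchronous replication relationship is reestablished; the loser would stop servicing it.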

[0269] The reader will understand that although, in many of the examples shown above, only one of the storage systems (714) is shown as performing the steps described above, establishing a synchronous replication relationship between two or more storage systems may require cooperation and interaction between those storage systems, and therefore, in practice, all of the storage systems (714, 724, 728) in a pod (or in a pod being formed) may perform one or more of the steps described above at the same time.

[0270] For further explanation, FIG. 10 sets forth a flowchart illustrating an additional example method for establishing a synchronous replication relationship between two or more storage systems (1024, 1046) according to some embodiments of the present disclosure. Although the example method shown in FIG. 10 illustrates an embodiment in which the dataset (1022) is synchronously replicated across only two storage systems (1024, 1046), the example shown in FIG. 10 may be extended to embodiments in which the dataset (1022) is synchronously replicated across additional storage systems that may perform steps similar to those performed by the two illustrated storage systems (1024, 1046).

[0271] The example method shown in FIG. 10 includes configuring (1002), by a storage system (1024), one or more data communication links (1052) between the storage system (1024) and a second storage system (1046). In the example method shown in FIG. 10, the storage system (1024) may configure (1002) the one or more data communication links (1052) between the storage system (1024) and the second storage system (1046), for example, by identifying a defined port over a data communication network that is used to exchange data communications with the second storage system (1046), by identifying a point-to-point data communication link that is used to exchange data communications with the second storage system (1046), or in a variety of other ways. If secure communication is required, some form of key exchange may be needed, or communication could be performed or bootstrapped through some service such as SSH (Secure Shell), SSL, or some other service or protocol built around public keys or Diffie-Hellman key exchange, or a reasonable alternative. Secure communication could also be mediated through some vendor-provided cloud service that is tied in some way to customer identity. Alternatively, a service configured to run at the customer's premises, for example running in a virtual machine or a virtual container, could be used to mediate the key exchange necessary for secure communication between the replicating storage systems (1024, 1046). The reader will understand that a pod that includes more than two storage systems may need communication links between most or all of the individual storage systems. In the example method shown in FIG. 10, the second storage system (1046) may similarly configure (1026) one or more data communication links (1052) between the storage system (1024) and the second storage system (1046).

[0272] The example method shown in FIG. 10 also includes sending (1004), from the storage system (1024) to the second storage system (1046), timing information (1048) for the storage system (1024). The timing information (1048) for the storage system (1024) may be embodied, for example, as the value of a clock within the storage system (1024), as a representation of a clock value (e.g., a sequence number that the storage system (1024) can first record), as the most recently received value of a clock within the second storage system (1046), and so on. In the example method shown in FIG. 10, the storage system (1024) may send (1004) the timing information (1048) for the storage system (1024) to the second storage system (1046), for example, via one or more messages sent from the storage system (1024) to the second storage system (1046) over the data communication link (1052) between the two storage systems (1024, 1046). In the example shown in FIG. 10, the second storage system (1046) may similarly send (1030), from the second storage system (1046) to the storage system (1024), timing information (1050) for the second storage system (1046).

[0273] In the example method shown in FIG. 10, sending (1004), from the storage system (1024) to the second storage system (1046), timing information (1048) for the storage system (1024) may include sending (1006) the value of a clock of the storage system (1024). In the example shown in FIG. 10, the storage system (1024) may send (1006) the value of its clock to the second storage system (1046) as part of an effort to coordinate clocks between the storage systems (1024, 1046). In such an example, the storage system (1024) may include a local monotonically increasing clock whose value is sent (1006) via one or more messages sent to the second storage system (1046) over the data communication link (1052) between the two storage systems (1024, 1046). In the example method shown in FIG. 10, sending (1030), from the second storage system (1046) to the storage system (1024), timing information (1050) for the second storage system (1046) may similarly include sending (1032) the value of a clock of the second storage system (1046).

[0274] In the example method shown in FIG. 10, sending (1004), from the storage system (1024) to the second storage system (1046), timing information (1048) for the storage system (1024) may also include sending (1008) the most recently received value of the clock of the second storage system (1046). In the example method shown in FIG. 10, sending (1008) the most recently received value of the clock of the second storage system (1046) may be carried out, for example, as part of an effort to eliminate the need to coordinate clocks between the storage systems (1024, 1046) while still achieving a timing guarantee. In such an embodiment, each storage system (1024, 1046) may have a local monotonically increasing clock. A synchronous replication lease may be established between the storage systems (1024, 1046) by each storage system (1024, 1046) sending its clock value to the other storage system (1024, 1046) along with the last clock value that it received from the other storage system (1024, 1046). When a particular storage system (1024, 1046) receives its clock value back from the other storage system (1024, 1046), the particular storage system (1024, 1046) may add some agreed-upon lease interval to the received clock value and use the result to establish its synchronous replication lease. In the example method shown in FIG. 10, sending (1030), from the second storage system (1046) to the storage system (1024), timing information (1050) for the second storage system (1046) may similarly include sending (1034) the most recently received value of the clock of the storage system (1024).
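
The exchange described above can be sketched as follows, under the assumption that a storage system establishes its lease when its own clock value is echoed back by its peer. The `Peer` class, the message fields, and the starting clock values are hypothetical; the point of the sketch is that each lease is expressed in the lease holder's own local clock, so the two clocks never need to agree.

```python
from typing import Optional


class Peer:
    def __init__(self, clock_ms: int, lease_interval_ms: int) -> None:
        self.clock_ms = clock_ms                  # local monotonic clock
        self.lease_interval_ms = lease_interval_ms
        self.last_peer_clock: Optional[int] = None  # last value received from peer
        self.lease_end_ms: Optional[int] = None     # in this system's own clock

    def make_message(self) -> dict:
        """Send own clock plus the last clock value received from the peer."""
        return {"clock": self.clock_ms, "echo": self.last_peer_clock}

    def on_message(self, msg: dict) -> None:
        self.last_peer_clock = msg["clock"]
        if msg["echo"] is not None:
            # Own clock value came back: lease = echoed value + agreed interval.
            self.lease_end_ms = msg["echo"] + self.lease_interval_ms


# Two systems with unrelated local clocks and an agreed 2000 ms interval.
a = Peer(clock_ms=5000, lease_interval_ms=2000)
b = Peer(clock_ms=90000, lease_interval_ms=2000)
b.on_message(a.make_message())  # b learns a's clock; nothing echoed yet
a.on_message(b.make_message())  # a sees its 5000 echoed back
b.on_message(a.make_message())  # b sees its 90000 echoed back
```

Because the echoed value is never newer than the holder's current clock, a lease derived from it errs on the side of expiring early rather than late.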

[0275] The example method shown in FIG. 10 also includes receiving (1010), by the storage system (1024) from the second storage system (1046), timing information (1050) for the second storage system (1046). In the example method shown in FIG. 10, the storage system (1024) may receive (1010) the timing information (1050) for the second storage system (1046) from the second storage system (1046) via one or more messages sent from the second storage system (1046) over the data communication link (1052) between the two storage arrays (1024, 1046). In the example shown in FIG. 10, the second storage system (1046) may similarly receive (1028), from the storage system (1024), timing information for the storage system (1024).

[0276] The example method shown in FIG. 10 also includes setting (1012) a clock value in the storage system (1024) in accordance with the timing information (1050) for the second storage system (1046). In the example shown in FIG. 10, setting (1012) the clock value in the storage system (1024) in accordance with the timing information (1050) for the second storage system (1046) may be carried out, for example, as part of an effort to coordinate clocks between the two storage systems (1024, 1046). In such an example, the two storage systems (1024, 1046) may be configured, for example, to set their respective clock values to a value that is some predetermined amount higher than the highest clock value between the storage systems (1024, 1046), to set their respective clock values to a value equal to the highest clock value between the pair of storage systems (1024, 1046), to set their respective clock values to a value generated by applying some function to the respective clock values of each storage system (1024, 1046), or in some other way. In the example method shown in FIG. 10, the second storage system (1046) may similarly set (1036) a clock value in the second storage system (1046) in accordance with the timing information (1048) for the storage system (1024).
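
Two of the coordination rules listed above (the highest clock value, and the highest clock value plus a predetermined amount) can be sketched in one function. The function name, the `bump_ms` parameter, and the sample values are assumptions for illustration.

```python
def coordinate(local_ms: int, peer_ms: int, bump_ms: int = 0) -> int:
    """Return the coordinated clock value: the highest of the two clocks,
    optionally raised by a predetermined amount."""
    return max(local_ms, peer_ms) + bump_ms
```

Because the rule is symmetric in its two clock arguments, both storage systems arrive at the same value after exchanging clocks, for example `coordinate(4200, 5000)` and `coordinate(5000, 4200)`.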

[0277] The example method shown in FIG. 10 also includes establishing (1014) a synchronous replication lease. In the example method shown in FIG. 10, establishing (1014) a synchronous replication lease may be carried out, for example, by establishing a synchronous replication lease that extends for some predetermined lease interval beyond a coordinated clock value between the two storage systems (1024, 1046), by establishing a synchronous replication lease that extends for some predetermined lease interval beyond an uncoordinated clock value associated with one of the storage systems (1024, 1046), or in some other way. In the example method shown in FIG. 10, the second storage system (1046) may similarly establish (1038) a synchronous replication lease.

[0278] The example method illustrated in FIG. 10 also includes detecting (1016), by the storage system (1024), that the synchronous replication lease has expired. In the example method illustrated in FIG. 10, detecting (1016) that the synchronous replication lease has expired may be carried out, for example, by the storage system (1024) comparing a current clock value to the period during which the lease was valid. Consider an example in which the storage systems (1024, 1046) coordinate the clock within each storage system (1024, 1046) to a value of 5000 milliseconds, and each storage system (1024, 1046) is configured to establish (1038) a synchronous replication lease that extends for a lease interval of 2000 milliseconds beyond that clock value, such that the synchronous replication lease for each storage system (1024, 1046) expires when the clock within that particular storage system (1024, 1046) reaches a value that exceeds 7000 milliseconds. In such an example, detecting (1016) that the synchronous replication lease has expired may be carried out by determining that the clock within the storage system (1024) has reached a value of 7001 milliseconds or higher. In the example method illustrated in FIG. 10, the second storage system (1046) may similarly detect (1040) that the synchronous replication lease has expired.
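
For illustration only, the lease arithmetic in the example above can be sketched as follows: a lease established at a coordinated clock value of 5000 milliseconds with a 2000 millisecond lease interval remains valid through 5000 + 2000 = 7000 milliseconds and is expired at 7001 milliseconds. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """A synchronous replication lease measured on a local clock (ms)."""
    start_ms: int      # coordinated clock value when the lease was established
    interval_ms: int   # predetermined lease interval

    @property
    def valid_through_ms(self) -> int:
        # Last clock value at which the lease is still considered valid.
        return self.start_ms + self.interval_ms

    def has_expired(self, now_ms: int) -> bool:
        # The lease expires once the clock exceeds start + interval.
        return now_ms > self.valid_through_ms


lease = Lease(start_ms=5000, interval_ms=2000)
```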

[0279] The example method illustrated in FIG. 10 also includes attempting (1020), by the storage system (1024), to take over I/O processing for the dataset (1022). In the example method illustrated in FIG. 10, attempting (1020) to take over I/O processing for the dataset (1022) may be carried out, for example, by the storage system (1024) racing to a mediator. If the storage system (1024) successfully takes over I/O processing for the dataset (1022), all accesses of the dataset (1022) may be serviced by the storage system (1024) until the synchronous replication relationship is reestablished, and any changes to the dataset (1022) that occurred after the previous synchronous replication relationship expired may be transferred to and persisted on the second storage system (1046). In the example method illustrated in FIG. 10, the second storage system (1046) may similarly attempt (1044) to take over I/O processing for the dataset (1022).
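
For illustration only, one way to picture two storage systems racing to a mediator is a first-claim-wins arbiter. The disclosure does not specify the mediator's interface here, so the `Mediator` class, its `try_claim` method, and the identifiers below are assumptions made for this sketch.

```python
import threading


class Mediator:
    """Hypothetical arbiter: the first system to claim a dataset wins."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._winner = {}  # dataset id -> id of the system that won the race

    def try_claim(self, dataset_id: str, system_id: int) -> bool:
        # Atomically record the first claimant; later claimants lose.
        with self._lock:
            winner = self._winner.setdefault(dataset_id, system_id)
            return winner == system_id


mediator = Mediator()
first = mediator.try_claim("dataset-1022", 1024)   # storage system (1024) arrives first
second = mediator.try_claim("dataset-1022", 1046)  # second storage system (1046) arrives later
```

In this sketch the winner keeps servicing I/O for the dataset and the loser stops, which avoids both systems accepting writes independently.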

[0280] The example method illustrated in FIG. 10 also includes attempting (1018), by the storage system (1024), to reestablish the synchronous replication relationship. In the example method illustrated in FIG. 10, attempting (1018) to reestablish the synchronous replication relationship may be carried out, for example, through the use of one or more reestablishment messages. Such reestablishment messages may include, for example, an identification of the pod for which the synchronous replication relationship is to be reestablished, information needed to configure one or more data communication links, updated timing information, and so on. In this way, the storage system (1024) may reestablish the synchronous replication relationship in much the same way that the synchronous replication relationship was initially created. In the example illustrated in FIG. 10, the second storage system (1046) may similarly attempt (1042) to reestablish the synchronous replication relationship.

[0281] For further explanation, FIG. 11 sets forth a flowchart illustrating an example method for servicing I/O operations directed to a dataset (1142) that is synchronized across a plurality of storage systems (1138, 1140) according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (1138, 1140) illustrated in FIG. 11 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems illustrated in FIG. 11 may include the same, fewer, or additional components as compared to the storage systems described above.

[0282] The dataset (1142) illustrated in FIG. 11 may be embodied, for example, as the contents of a particular volume, as the contents of a particular share of a volume, or as any other collection of one or more data elements. The dataset (1142) may be synchronized across multiple storage systems (1138, 1140) such that each storage system (1138, 1140) retains a local copy of the dataset (1142). In the examples described herein, such a dataset (1142) is synchronously replicated across the storage systems (1138, 1140) in such a way that the dataset (1142) can be accessed through any of the storage systems (1138, 1140) with performance characteristics such that no one storage system in the cluster operates substantially more optimally than any other storage system in the cluster, at least as long as the cluster and the particular storage system being accessed are running nominally. In such systems, modifications to the dataset (1142) should be made to the copy of the dataset that resides on each storage system (1138, 1140) in such a way that accessing the dataset (1142) on any storage system (1138, 1140) yields consistent results. For example, a write request issued to the dataset must be serviced on all storage systems (1138, 1140), or on none of the storage systems (1138, 1140) that were running nominally at the beginning of the write and that remained running nominally through completion of the write. Likewise, some groups of operations (e.g., two write operations directed to the same location within the dataset) must be executed in the same order on all storage systems (1138, 1140), or other steps must be taken as described in more detail below, such that the dataset is ultimately identical on all storage systems (1138, 1140). Modifications to the dataset (1142) need not be made at exactly the same time, but some actions (e.g., issuing an acknowledgment of a write request directed to the dataset, or enabling read access to a location within the dataset that is targeted by a write request that has not yet completed on both storage systems) may be delayed until the copy of the dataset on each storage system (1138, 1140) has been modified.

[0283] In the example method illustrated in FIG. 11, the designation of one storage system (1140) as the "leader" and another storage system (1138) as the "follower" may refer to the respective relationship of each storage system for purposes of synchronously replicating a particular dataset across the storage systems. In such an example, and as described in more detail below, the leader storage system (1140) may be responsible for performing some processing of an incoming I/O operation and passing the resulting information along to the follower storage system (1138), or for performing other tasks that are not required of the follower storage system (1138). The leader storage system (1140) may be responsible for performing tasks that are not required of the follower storage system (1138) for all incoming I/O operations or, alternatively, the leader-follower relationship may be specific to only a subset of the I/O operations received by either storage system. For example, a leader-follower relationship may be specific to I/O operations directed to a first volume, a first group of volumes, a first group of logical addresses, a first group of physical addresses, or some other logical or physical delineator. In this manner, a first storage system may act as the leader storage system for I/O operations directed to a first set of volumes (or other delineators), while a second storage system may act as the leader storage system for I/O operations directed to a second set of volumes (or other delineators). The example method illustrated in FIG. 11 depicts an embodiment in which synchronizing the multiple storage systems (1138, 1140) occurs in response to receipt, by the leader storage system (1140), of a request (1104) to modify the dataset (1142), although synchronizing the multiple storage systems (1138, 1140) may also be carried out in response to receipt, by the follower storage system (1138), of a request (1104) to modify the dataset (1142), as described in more detail below.
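
For illustration only, a leader-follower relationship that is specific to a delineator such as a volume could be sketched as a lookup table. The volume names, system names, and the `role_of` helper are hypothetical.

```python
# Hypothetical assignment: each volume (delineator) has exactly one leader.
LEADER_BY_VOLUME = {
    "volume-1": "first-storage-system",
    "volume-2": "second-storage-system",
}


def role_of(system: str, volume: str) -> str:
    """Return the role a storage system plays for I/O directed to a volume."""
    return "leader" if LEADER_BY_VOLUME[volume] == system else "follower"
```

Under this assignment the first storage system leads for `volume-1` and follows for `volume-2`, so leadership work is spread across both systems.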

[0284] The example method illustrated in FIG. 11 includes receiving (1106), by the leader storage system (1140), a request (1104) to modify the dataset (1142). The request (1104) to modify the dataset (1142) may be embodied, for example, as a request to write data to a location within the storage system (1140) that contains data included in the dataset (1142), as a request to write data to a volume that contains data included in the dataset (1142), as a request to take a snapshot of the dataset (1142), as a virtual range copy, as an UNMAP operation that essentially represents the deletion of some portion of the data in the dataset (1142), as a modifying transformation of the dataset (1142) (rather than a change to a portion of the data within the dataset), or as some other operation that results in a change to some portion of the data included in the dataset (1142). In the example method illustrated in FIG. 11, the request (1104) to modify the dataset (1142) is issued by a host (1102), which may be embodied, for example, as an application executing on a virtual machine, as an application executing on a computing device that is connected to the storage system (1140), or as some other entity configured to access the storage system (1140).

[0285] The example method illustrated in FIG. 11 also includes generating (1108), by the leader storage system (1140), information (1110) describing the modification to the dataset (1142). The leader storage system (1140) may generate (1108) the information (1110) describing the modification to the dataset (1142), for example, by determining the ordering of the request relative to any other operations that are in progress, by determining the proper outcome of overlapping modifications (e.g., the proper outcome of two requests that modify the same storage location), by calculating any distributed state changes, such as changes to common elements of metadata, across all members of the pod (e.g., all storage systems across which the dataset is synchronously replicated), and so on. The information (1110) describing the modification to the dataset (1142) may be embodied, for example, as system-level information that is used to describe an I/O operation to be performed by a storage system. The leader storage system (1140) may generate (1108) the information (1110) describing the modification to the dataset (1142) by processing the request (1104) to modify the dataset (1142) just enough to understand what should happen in order to service the request (1104). For example, the leader storage system (1140) may determine whether some ordering of the execution of the request (1104) to modify the dataset (1142) relative to other requests to modify the dataset (1142) is required, or whether some other steps must be taken, as described in more detail below, to produce an equivalent result on each storage system (1138, 1140).

[0286] Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a request to copy blocks from a first address range in the dataset (1142) to a second address range in the dataset (1142). In such an example, assume that three other write operations (Write A, Write B, Write C) are directed to the first address range in the dataset (1142). In such an example, if the leader storage system (1140) services Write A and Write B (but not Write C) before copying the blocks from the first address range to the second address range, the follower storage system (1138) must also service Write A and Write B (but not Write C) before copying the blocks from the first address range to the second address range in order to yield consistent results. As such, when the leader storage system (1140) generates (1108) the information (1110) describing the modification to the dataset (1142), in this example, the leader storage system (1140) could generate information (e.g., sequence numbers for Write A and Write B) that identifies other operations that must be completed before the follower storage system (1138) can process the request (1104) to modify the dataset (1142).
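
For illustration only, the copy example above can be sketched with sequence numbers that name the operations a request must follow. The `ModificationInfo` type, its field names, and the specific sequence numbers are assumptions made for this sketch; the disclosure says only that such information could include sequence numbers for Write A and Write B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModificationInfo:
    """Hypothetical description of a modification, as sent to a follower."""
    seq: int                 # sequence number assigned by the leader
    must_follow: frozenset   # sequence numbers that must complete first


def ready_to_process(info: ModificationInfo, completed: set) -> bool:
    """A follower may process a request once all prerequisites are complete."""
    return info.must_follow <= completed


WRITE_A, WRITE_B, WRITE_C, COPY = 1, 2, 3, 4

# The copy must come after Write A and Write B ...
copy_info = ModificationInfo(seq=COPY, must_follow=frozenset({WRITE_A, WRITE_B}))
# ... and Write C is ordered after the copy, so the copy never sees its data.
write_c_info = ModificationInfo(seq=WRITE_C, must_follow=frozenset({COPY}))
```

Expressing Write C as dependent on the copy is one possible way to keep it out of the copied blocks on both storage systems; other encodings of the same ordering are equally plausible.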

[0287] Consider an additional example in which two requests (e.g., Write A and Write B) are directed to overlapping portions of the dataset (1142). In such an example, if the leader storage system (1140) services Write A and subsequently services Write B, while the follower storage system (1138) services Write B and subsequently services Write A, the dataset (1142) would not be consistent across both storage systems (1138, 1140). As such, when the leader storage system (1140) generates (1108) the information (1110) describing the modification to the dataset (1142), in this example, the leader storage system (1140) could generate information (e.g., sequence numbers for Write A and Write B) that identifies the order in which the requests should be executed. Alternatively, rather than generating information (1110) describing the modification to the dataset (1142) that requires intermediate behavior from each storage system (1138, 1140), the leader storage system (1140) may generate (1108) information (1110) describing the modification to the dataset (1142) that includes information identifying the proper outcome of the two requests. For example, if Write B logically follows Write A (and overlaps with Write A), the end result must be that the dataset (1142) includes the parts of Write B that overlap with Write A, rather than the parts of Write A that overlap with Write B. Such an outcome could be facilitated by merging the results in memory and writing the result of such a merge to the dataset (1142), rather than strictly requiring that a particular storage system (1138, 1140) execute Write A and then subsequently execute Write B. Readers will appreciate that more subtle cases relate to snapshots and virtual address range copies.
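
For illustration only, merging two overlapping writes in memory so that the logically later write wins in the overlapping region could be sketched as follows. The byte-level representation, the `merge_writes` helper, and the sample offsets are assumptions made for this sketch.

```python
def merge_writes(writes):
    """Merge writes given in logical order; later writes win on overlap.

    writes: iterable of (offset, data) pairs, where data is a bytes object.
    Returns a dict mapping each written byte offset to its final value.
    """
    merged = {}
    for offset, data in writes:
        for i, byte in enumerate(data):
            merged[offset + i] = byte  # a later write overwrites an earlier one
    return merged


write_a = (0, b"AAAA")   # covers offsets 0..3
write_b = (2, b"BBBB")   # covers offsets 2..5, logically follows Write A

merged = merge_writes([write_a, write_b])
result = bytes(merged[i] for i in sorted(merged))
```

Each storage system can then write the merged result once, rather than being required to apply Write A and then Write B as separate steps.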

[0288] Readers will further appreciate that the correct result of any operation must be committed to the point of being recoverable before the operation can be acknowledged. However, multiple operations can in some cases be committed together, and operations can in some cases be partially committed provided that recovery would ensure correctness. For example, a snapshot could be committed locally with a recorded dependency on an expected Write A and Write B, even though A or B might not themselves have been committed. In that state the snapshot cannot be acknowledged, and recovery might end up backing out the snapshot if the missing I/O cannot be recovered from another array. Also, if Write B overlaps with Write A, the leader storage system may "order" B to be after A, but A could actually be discarded, and the operation to write A would then simply wait for B. Writes A, B, C, and D, coupled with a snapshot between A, B and C, D, could commit and/or acknowledge some or all parts together, as long as recovery cannot result in a snapshot inconsistency across arrays and as long as acknowledgment does not complete a later operation before an earlier operation has been persisted to the point that it is guaranteed to be recoverable.

[0289] The example method illustrated in FIG. 11 also includes sending (1112), from the leader storage system (1140) to the follower storage system (1138), the information (1110) describing the modification to the dataset (1142). Sending (1112) the information (1110) describing the modification to the dataset (1142) from the leader storage system (1140) to the follower storage system (1138) may be carried out, for example, by the leader storage system (1140) sending one or more messages to the follower storage system (1138). The leader storage system (1140) may also send, in the same messages or in one or more different messages, an I/O payload (1114) for the request (1104) to modify the dataset (1142). The I/O payload (1114) may be embodied, for example, as data that is to be written to storage within the follower storage system (1138) when the request (1104) to modify the dataset (1142) is embodied as a request to write data to the dataset (1142). In such an example, because the request (1104) to modify the dataset (1142) was received (1106) by the leader storage system (1140), the follower storage system (1138) has not yet received the I/O payload (1114) associated with the request (1104). In the example illustrated in FIG. 11, the information (1110) describing the modification to the dataset (1142) and the I/O payload (1114) associated with the request (1104) to modify the dataset (1142) may be sent (1112) from the leader storage system (1140) to the follower storage system (1138) via one or more data communications networks that couple the leader storage system (1140) to the follower storage system (1138), via one or more dedicated data communications links (e.g., a first link for sending the I/O payload and a second link for sending the information describing the modification to the dataset) that couple the leader storage system (1140) to the follower storage system (1138), or via some other mechanism.

[0290] The example method illustrated in FIG. 11 also includes receiving (1116), by the follower storage system (1138), the information (1110) describing the modification to the dataset (1142). The follower storage system (1138) may receive (1116) the information (1110) describing the modification to the dataset (1142) and the I/O payload (1114) from the leader storage system (1140), for example, via one or more messages that are sent from the leader storage system (1140) to the follower storage system (1138). The one or more messages may be sent from the leader storage system (1140) to the follower storage system (1138) via one or more dedicated data communications links between the two storage systems (1138, 1140), by the leader storage system (1140) writing the messages to a predetermined memory location (e.g., the location of a queue) on the follower storage system (1138) using RDMA or a similar mechanism, or in other ways.

[0291] In one embodiment, the follower storage system (1138) may receive (1116) the information (1110) describing the modification to the dataset (1142) and the I/O payload (1114) from the leader storage system (1140) through the use of SCSI requests (writes from sender to receiver, or reads from receiver to sender) as a communication mechanism. In such an embodiment, a SCSI write request is used to encode the information that is intended to be sent (including whatever data and metadata), and may be delivered to a special pseudo-device, over a specially configured SCSI network, or through any other agreed-upon addressing mechanism. Alternatively, the model can issue a set of open SCSI read requests from the receiver to the sender, also using special devices, specially configured SCSI networks, or other agreed-upon mechanisms. Encoded information, including data and metadata, is then delivered to the receiver as a response to one or more of these open SCSI requests. Such a model can be implemented over Fibre Channel SCSI networks, which are often deployed as the "dark fiber" storage network infrastructure between data centers. Such a model also allows the same network lines to be used for host-to-remote-array multipathing and for bulk array-to-array communications.

[0292] The example method illustrated in FIG. 11 also includes processing (1118), by the follower storage system (1138), the request (1104) to modify the dataset (1142). In the example method illustrated in FIG. 11, the follower storage system (1138) may process (1118) the request (1104) to modify the dataset (1142) by modifying the contents of one or more storage devices (e.g., an NVRAM device, an SSD, an HDD) included in the follower storage system (1138) in accordance with the information (1110) describing the modification to the dataset (1142) as well as the I/O payload (1114) received from the leader storage system (1140). Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a write operation directed to a volume included in the dataset (1142), and the information (1110) describing the modification to the dataset (1142) indicates that the write operation can be executed only after a previously issued write operation has been processed. In such an example, processing (1118) the request (1104) to modify the dataset (1142) may be carried out by the follower storage system (1138) first verifying that the previously issued write operation has been processed on the follower storage system (1138), and subsequently writing the I/O payload (1114) associated with the write operation to one or more storage devices included in the follower storage system (1138). In such an example, the request (1104) to modify the dataset (1142) may be considered to have been completed and successfully processed, for example, when the I/O payload (1114) has been committed to persistent storage within the follower storage system (1138).

[0293] The example method illustrated in FIG. 11 also includes acknowledging (1120), by the follower storage system (1138) to the leader storage system (1140), completion of the request (1104) to modify the dataset (1142). In the example method illustrated in FIG. 11, acknowledging (1120) completion of the request (1104) to modify the dataset (1142) may be carried out by the follower storage system (1138) sending an acknowledgment (1122) message to the leader storage system (1140). Such a message may include, for example, information identifying the particular request (1104) to modify the dataset (1142) that was completed, as well as any additional information useful in acknowledging (1120) the completion of the request (1104) by the follower storage system (1138). In the example method illustrated in FIG. 11, acknowledging (1120) completion of the request (1104) to modify the dataset (1142) to the leader storage system (1140) is illustrated by the follower storage system (1138) issuing an acknowledgment (1122) message to the leader storage system (1140).
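
For illustration only, the follower-side behavior described above (verify that prerequisite operations have been processed, commit the I/O payload, then acknowledge completion) could be sketched as follows. The `Follower` class, its in-memory dictionary standing in for persistent storage, and the message shape are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Hypothetical follower; a dict stands in for persistent storage."""
    storage: dict = field(default_factory=dict)
    completed: set = field(default_factory=set)

    def process(self, seq, must_follow, offset, payload):
        """Apply one modification and return an acknowledgment, or None.

        seq: sequence number identifying the request
        must_follow: sequence numbers that must already be processed
        """
        if not set(must_follow) <= self.completed:
            return None  # prerequisites not yet processed; retry later
        self.storage[offset] = payload  # "commit" the I/O payload
        self.completed.add(seq)
        return {"ack": seq}  # identifies the request that was completed


follower = Follower()
too_early = follower.process(seq=2, must_follow={1}, offset=0, payload=b"new")
ack_1 = follower.process(seq=1, must_follow=set(), offset=0, payload=b"old")
ack_2 = follower.process(seq=2, must_follow={1}, offset=0, payload=b"new")
```

Returning the acknowledgment only after the payload has been stored mirrors the requirement that completion be acknowledged once the modification has been committed.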

[0294] The example method shown in FIG. 11 also includes processing (1124), by the leader storage system (1140), the request (1104) to modify the dataset (1142). In the example method shown in FIG. 11, the leader storage system (1140) may process (1124) the request (1104) to modify the dataset (1142) by modifying the contents of one or more storage devices (e.g., an NVRAM device, an SSD, an HDD) included in the leader storage system (1140) in accordance with the information (1110) describing the modification to the dataset (1142) as well as the I/O payload (1114) that was received as part of the request (1104) to modify the dataset (1142). Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a write operation directed to a volume included in the dataset (1142), and the information (1110) describing the modification to the dataset (1142) indicates that the write operation can be executed only after a previously issued write operation has been processed. In such an example, processing (1124) the request (1104) to modify the dataset (1142) may be carried out by the leader storage system (1140) first verifying that the previously issued write operation has been processed by the leader storage system (1140), and then writing the I/O payload (1114) associated with the write operation to one or more storage devices included in the leader storage system (1140). In such an example, the request (1104) to modify the dataset (1142) may be considered completed and successfully processed, for example, when the I/O payload (1114) has been committed to persistent storage in the leader storage system (1140).
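
For further illustration, the ordered-write example above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of any embodiment: the names `WriteRequest`, `StorageSystem`, `depends_on`, and `persistent` are hypothetical, and persistent storage is modeled as an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WriteRequest:
    # Hypothetical shape of a request (1104) together with its describing information (1110).
    op_id: int
    payload: bytes
    depends_on: Optional[int] = None  # id of a previously issued write, if any


@dataclass
class StorageSystem:
    # In-memory stand-in for a leader or follower; "persistent" is only a dict here.
    persistent: dict = field(default_factory=dict)

    def process(self, req: WriteRequest) -> bool:
        """Return True once the payload is committed, False if the request must wait."""
        if req.depends_on is not None and req.depends_on not in self.persistent:
            return False  # the earlier write has not been processed yet
        self.persistent[req.op_id] = req.payload  # "commit" the I/O payload
        return True


leader = StorageSystem()
first = WriteRequest(op_id=1, payload=b"A")
second = WriteRequest(op_id=2, payload=b"B", depends_on=1)

blocked = leader.process(second)    # dependency not yet satisfied
leader.process(first)
completed = leader.process(second)  # now allowed to commit
```

The same `process` logic would apply unchanged on a follower storage system, since each system checks the dependency against its own committed state.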

[0295] The example method shown in FIG. 11 also includes receiving (1126), from the follower storage system (1138), an indication that the follower storage system (1138) has processed the request (1104) to modify the dataset (1136). In this example, the indication that the follower storage system (1138) has processed the request (1104) to modify the dataset (1136) is embodied as the acknowledgment (1122) message sent from the follower storage system (1138) to the leader storage system (1140). The reader will understand that although many of the steps described above are shown and described as occurring in a particular order, no particular order is actually required. In fact, because the follower storage system (1138) and the leader storage system (1140) are independent storage systems, each storage system may perform some of the steps described above in parallel. For example, the follower storage system (1138) may receive (1116) the information (1110) describing the modification to the dataset (1142), process (1118) the request (1104) to modify the dataset (1142), or acknowledge (1120) completion of the request (1104) to modify the dataset (1142) before the leader storage system (1140) has processed (1124) the request (1104) to modify the dataset (1142). Alternatively, the leader storage system (1140) may have processed (1124) the request (1104) to modify the dataset (1142) before the follower storage system (1138) has received (1116) the information (1110) describing the modification to the dataset (1142), processed (1118) the request (1104) to modify the dataset (1142), or acknowledged (1120) completion of the request (1104) to modify the dataset (1142).
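
The independence of the two storage systems described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: each storage system is modeled as a thread in a single process, and the names `process_on`, `committed`, `leader-1140`, and `follower-1138` are hypothetical labels chosen here, not elements of the embodiments.

```python
import threading


def process_on(system_name: str, committed: dict, lock: threading.Lock) -> None:
    # Stand-in for step 1118 (follower) or step 1124 (leader): commit the payload locally.
    with lock:
        committed[system_name] = True


committed: dict = {}
lock = threading.Lock()
workers = [
    threading.Thread(target=process_on, args=(name, committed, lock))
    for name in ("leader-1140", "follower-1138")
]
for w in workers:
    w.start()  # both systems begin work; neither waits for the other
for w in workers:
    w.join()   # only the final acknowledgment to the host needs both to have finished

all_done = all(committed.get(n) for n in ("leader-1140", "follower-1138"))
```

Either thread may finish first, which mirrors the point that no particular order between leader-side and follower-side processing is required.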

[0296] The example method shown in FIG. 11 also includes acknowledging (1134), by the leader storage system (1140), completion of the request (1104) to modify the dataset (1142). In the example method shown in FIG. 11, acknowledging (1134) completion of the request (1104) to modify the dataset (1142) may be carried out through the use of one or more acknowledgment (1136) messages sent from the leader storage system (1140) to the host (1102), or via some other appropriate mechanism. In the example method shown in FIG. 11, the leader storage system (1140) may determine (1128) whether the request (1104) to modify the dataset (1142) has been processed (1118) by the follower storage system (1138) before acknowledging (1134) completion of the request (1104) to modify the dataset (1142). The leader storage system (1140) may determine (1128) whether the request (1104) to modify the dataset (1142) has been processed (1118) by the follower storage system (1138), for example, by determining whether the leader storage system (1140) has received, from the follower storage system (1138), an acknowledgment message or other message indicating that the request (1104) to modify the dataset (1142) has been processed (1118) by the follower storage system (1138). In such an example, if the leader storage system (1140) affirmatively (1130) determines that the request (1104) to modify the dataset (1142) has been processed (1118) by the follower storage system (1138) and has also been processed (1118) by the leader storage system (1138), the leader storage system (1140) may proceed by acknowledging (1134) completion of the request (1104) to modify the dataset (1142) to the host (1102) that initiated the request (1104). If, however, the leader storage system (1140) determines that the request (1104) to modify the dataset (1142) has not (1132) yet been processed (1118) by the follower storage system (1138), or has not been processed (1124) by the leader storage system (1138), the leader storage system (1140) may not yet acknowledge (1134) completion of the request (1104) to the host (1102) that initiated the request (1104), because the leader storage system (1140) may acknowledge (1134) completion of the request (1104) to modify the dataset (1142) only when the request (1104) has been successfully processed on all storage systems (1138, 1140) across which the dataset (1142) is synchronously replicated.
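
The determination described above amounts to a simple gate in front of the acknowledgment to the host, which may be sketched as follows. The sketch is illustrative only and rests on assumptions: the class `PendingRequest` and its fields are hypothetical bookkeeping invented for this example, and a follower is identified by a plain string.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRequest:
    # Hypothetical per-request bookkeeping kept by the leader storage system.
    followers: frozenset
    leader_done: bool = False
    acked_by: set = field(default_factory=set)

    def record_leader_done(self) -> None:
        self.leader_done = True          # step 1124 finished locally

    def record_follower_ack(self, follower: str) -> None:
        if follower in self.followers:   # ignore acknowledgments from unknown systems
            self.acked_by.add(follower)  # acknowledgment (1122) received

    def may_ack_host(self) -> bool:
        # Determination 1128: affirmative (1130) only if every system has processed the request.
        return self.leader_done and self.acked_by == set(self.followers)


req = PendingRequest(followers=frozenset({"follower-1138"}))
req.record_leader_done()
before = req.may_ack_host()   # follower has not acknowledged yet (1132)
req.record_follower_ack("follower-1138")
after = req.may_ack_host()    # the acknowledgment (1134) to the host may now be sent
```

Because the gate compares the set of acknowledgments against the full set of followers, the same check extends without change to configurations with more than one follower.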

[0297] The reader will appreciate that, in the example method shown in FIG. 11, sending (1112) the information (1110) describing the modification to the dataset (1142) from the leader storage system (1140) to the follower storage system (1138), and acknowledging (1120), by the follower storage system (1138) to the leader storage system (1140), completion of the request (1104) to modify the dataset (1142) may be carried out using single round-trip messaging. Single round-trip messaging may be used, for example, through the use of Fibre Channel as the data interconnect. Typically, the SCSI protocol is used with Fibre Channel. Such interconnects are commonly set up between data centers because some older replication technologies may be built to replicate data essentially as SCSI transactions over Fibre Channel networks. In addition, Fibre Channel SCSI infrastructure has historically had less overhead and lower latency than networks based on Ethernet and TCP/IP. Further, when data centers are internally connected to block storage arrays using Fibre Channel, the Fibre Channel networks may be stretched to other data centers so that hosts in one data center can switch to accessing storage arrays in a remote data center when local storage arrays fail.

[0298] SCSI could be used as a general communication mechanism, even though it is normally designed for use with block storage protocols for storing and retrieving data in block-oriented volumes (or for tape). For example, SCSI READ or SCSI WRITE could be used to deliver or retrieve message data between storage controllers in paired storage systems. A typical implementation of SCSI WRITE requires two message round trips: the SCSI initiator sends a SCSI CDB describing the SCSI WRITE operation, the SCSI target receives that CDB, and the SCSI target sends a "ready to receive" message to the SCSI initiator. The SCSI initiator then sends the data to the SCSI target, and when the SCSI WRITE is complete the SCSI target responds to the SCSI initiator with a successful completion. A SCSI READ request, on the other hand, requires only one round trip: the SCSI initiator sends a SCSI CDB describing the SCSI READ operation, and the SCSI target receives that CDB and responds with the data and then a successful completion. As a result, over distance, a SCSI READ incurs half of the distance-related latency of a SCSI WRITE. For this reason, it may be faster for a data communication receiver to use SCSI READ requests to receive messages than for a sender of messages to use SCSI WRITE requests to send data. Using SCSI READ simply requires that the message sender operate as a SCSI target and that the message receiver operate as a SCSI initiator. A message receiver may send some number of SCSI CDB READ requests to any message sender, and the message sender would respond to one of the outstanding CDB READ requests when message data is available. Because SCSI subsystems may time out if a READ request is outstanding for too long (e.g., 10 seconds), READ requests should be responded to within a few seconds even if there is no message to send.
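
The round-trip comparison and the pending-READ arrangement described above may be illustrated with the following sketch. It does not use any real SCSI library. The class `MessageSender`, its methods, the 5-second reply deadline, and the figure of roughly 5 microseconds of one-way propagation delay per kilometer of optical fiber are all assumptions made for illustration.

```python
from collections import deque

US_PER_KM_ONE_WAY = 5.0  # assumed one-way propagation delay in optical fiber, per km


def distance_latency_us(distance_km: float, round_trips: int) -> float:
    """Distance-related latency only; ignores processing and serialization time."""
    return round_trips * 2 * distance_km * US_PER_KM_ONE_WAY


write_us = distance_latency_us(100, round_trips=2)  # CDB / ready-to-receive, then data / status
read_us = distance_latency_us(100, round_trips=1)   # CDB, then data plus status


class MessageSender:
    """Acts as the SCSI target: answers READ requests posted in advance by the receiver."""

    def __init__(self, read_timeout_s: float = 5.0):
        self.pending_reads = deque()          # arrival times of outstanding READ requests
        self.read_timeout_s = read_timeout_s  # reply well before the subsystem would time out

    def post_read(self, now: float) -> None:
        self.pending_reads.append(now)

    def send(self, message: bytes) -> bool:
        """Answer the oldest pending READ with the message, if one is outstanding."""
        if not self.pending_reads:
            return False
        self.pending_reads.popleft()
        return True

    def expire(self, now: float) -> int:
        """Answer READs that have waited too long with an empty reply; return how many."""
        expired = 0
        while self.pending_reads and now - self.pending_reads[0] >= self.read_timeout_s:
            self.pending_reads.popleft()
            expired += 1
        return expired


sender = MessageSender()
sender.post_read(now=0.0)
sender.post_read(now=1.0)
delivered = sender.send(b"metadata update")  # consumes the READ posted at t=0
expired = sender.expire(now=6.0)             # the READ posted at t=1 gets an empty reply
```

Under these assumptions, a 100 km link gives about 2000 microseconds of distance-related latency for the two-round-trip WRITE and about 1000 microseconds for the one-round-trip READ, which is the factor of two noted above.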

[0299] SCSI tape requests, as described in the SCSI Stream Commands standard from the T10 Technical Committee of the InterNational Committee on Information Technology Standards, support variable response data, which may be more flexible for returning variable-sized message data. The SCSI standard also supports an immediate mode for SCSI WRITE requests, which could allow for single round-trip SCSI WRITE commands. The reader will understand that many of the embodiments described below also utilize single round-trip messaging.

[0300] For further explanation, FIG. 12 sets forth a flowchart illustrating an additional example method for servicing I/O operations directed to a dataset (1142) that is synchronized across a plurality of storage systems (1138, 1140, 1150) according to some embodiments of the present disclosure. Although shown in less detail, the storage systems (1138, 1140, 1150) shown in FIG. 11 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems shown in FIG. 11 may include the same, fewer, or additional components as compared to the storage systems described above. The example method shown in FIG. 12 is similar to the example method shown in FIG. 11, as the example method shown in FIG. 12 also includes: receiving (1106), by the leader storage system (1140), a request (1104) to modify the dataset (1142); generating (1108), by the storage system (1140), information (1110) describing the modification to the dataset (1142); sending (1112), from the leader storage system (1140) to the follower storage system (1138), the information (1110) describing the modification to the dataset (1142); receiving (1116), by the follower storage system (1138), the information (1110) describing the modification to the dataset (1142); processing (1118), by the follower storage system (1138), the request (1104) to modify the dataset (1142); acknowledging (1120), by the follower storage system (1138) to the leader storage system (1140), completion of the request (1104) to modify the dataset (1142); processing (1124), by the leader storage system (1140), the request (1104) to modify the dataset (1142); and acknowledging (1134), by the leader storage system (1140), completion of the request (1104) to modify the dataset (1142).

[0301] The example method shown in FIG. 12 differs from the example method shown in FIG. 11, however, in that it illustrates an embodiment in which the dataset (1142) is synchronously replicated across three storage systems, where one of the storage systems is the leader storage system (1140) and the remaining storage systems are follower storage systems (1138, 1150). In such an example, the additional follower storage system (1150) carries out many of the same steps as the follower storage system (1138) shown in FIG. 11, as the additional follower storage system (1150) can receive (1142), from the leader storage system (1140), the information (1110) describing the modification to the dataset (1142); process (1142) the request (1104) to modify the dataset (1142) in accordance with the information (1110) describing the modification to the dataset (1142); and acknowledge (1146) completion of the request (1104) to modify the dataset (1142) to the leader storage system (1140), for example through the use of an acknowledgment (1148) message or other appropriate mechanism.

[0302] In the example method shown in FIG. 12, the information (1110) describing the modification to the dataset (1142) may include ordering information (1152) for the request (1104) to modify the dataset (1142). In the example method shown in FIG. 12, the ordering information (1152) for the request (1104) to modify the dataset (1142) may represent a description of the relationships between operations (e.g., requests to modify the dataset) and common metadata updates, as described by the leader storage system (1140), as a set of interdependencies between separate requests to modify the dataset and possibly between requests to modify the dataset and various metadata changes. These interdependencies may be described as a set of precursors on which one request to modify the dataset depends in some way, as predicates that must be true for that request to modify the dataset to complete.

[0303] A queue predicate is one example of a predicate that must be true for a request to modify a dataset to complete. A queue predicate may stipulate that a particular request to modify a dataset may not complete until a previous request to modify the dataset has completed. Queue predicates can be used, for example, for overlapping write-type operations. In such an example, the leader storage system (1140) can declare that a second write-type operation logically follows a first such operation, so that the second write-type operation cannot complete until the first write-type operation completes. Depending on the implementation, the second write-type operation may not even be made persistent until it is ensured that the first such write-type operation is persistent (the two operations can be made persistent together). A queue predicate can also be used for snapshot operations and virtual block range copy operations by declaring that a known set of incomplete precursor operations (e.g., a set of write-type operations) must each complete before the snapshot can complete; as additional operations are identified as following the snapshot (before the snapshot completes), each of those operations may be predicated on the snapshot operation itself completing. This predicate could also indicate that those following operations apply to the post-snapshot image of a volume rather than being included in the snapshot.
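
As a further illustration, queue predicates can be modeled as sets of precursor identifiers that must be complete before an operation may complete. The following Python sketch assumes a simplified, single-process model. The names `Operation`, `precursors`, and `completion_order` are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    op_id: str
    precursors: set = field(default_factory=set)  # queue predicate: ids that must complete first


def completion_order(ops: list) -> list:
    """Complete operations, in arrival order, as soon as their queue predicates are satisfied."""
    done, order, waiting = set(), [], list(ops)
    while waiting:
        ready = [op for op in waiting if op.precursors <= done]
        if not ready:
            raise ValueError("unsatisfiable queue predicates")
        for op in ready:
            done.add(op.op_id)
            order.append(op.op_id)
        waiting = [op for op in waiting if op.op_id not in done]
    return order


ops = [
    Operation("snapshot", precursors={"w1", "w2"}),  # snapshot waits for in-flight writes
    Operation("w3", precursors={"snapshot"}),        # later write follows the snapshot
    Operation("w1"),
    Operation("w2", precursors={"w1"}),              # overlapping write follows w1
]
order = completion_order(ops)
```

In this example the snapshot completes only after both precursor writes, and the write identified as following the snapshot completes last, regardless of the order in which the operations arrived.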

[0304] An alternative predicate that could be used for snapshots is to assign an identifier to every snapshot and to associate all modification operations that can be included in a particular snapshot with that identifier. The snapshot can then complete when all of the included modification operations complete. This can be done with a count predicate. Each storage system across which the dataset is synchronously replicated can maintain its own count of operations associated with the time since the last snapshot or some other infrequent operation (or, for embodiments that implement multiple leader storage systems, with those operations organized by a particular leader storage system, a count can be established by that leader storage system for the part of the dataset that it controls). The snapshot operation itself may then include a count predicate that depends on that number of operations being received and made persistent before the snapshot can itself be made persistent or signaled as completed. Modification operations that should follow the snapshot (before the snapshot completes) may either be delayed, given a queue predicate that depends on the snapshot, or the snapshot identity may be used as an indication that the modification operation should be excluded from the snapshot. A virtual block range copy (SCSI EXTENDED COPY or a similar operation) may use queue predicates, or it may use count predicates and a snapshot or similar identifier. With count predicates and a snapshot or virtual copy identifier, each virtual block range copy might establish a new virtual snapshot or virtual copy identifier, even if the copy operation covers only two small regions of one or two volumes. In the example described above, the request (1104) to modify the dataset (1142) may include a request to take a snapshot of the dataset (1142), and the ordering information (1152) for the request (1104) to modify the dataset (1142) may therefore include an identification of one or more other requests to modify the dataset that must be completed before the snapshot of the dataset (1142) is taken.
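
A count predicate of the kind described above may be sketched as follows. This is a hedged illustration only: the class `SnapshotCounter`, the identifier strings, and the assumption that the expected count travels with the snapshot operation are choices made for this example.

```python
class SnapshotCounter:
    """Per-system bookkeeping for one snapshot identifier (hypothetical structure)."""

    def __init__(self, snapshot_id: str, expected_count: int):
        self.snapshot_id = snapshot_id
        self.expected_count = expected_count  # count predicate carried by the snapshot operation
        self.persisted = 0

    def on_persisted(self, op_snapshot_id: str) -> None:
        # Only operations tagged with this snapshot's identifier are counted.
        if op_snapshot_id == self.snapshot_id:
            self.persisted += 1

    def snapshot_may_complete(self) -> bool:
        return self.persisted >= self.expected_count


counter = SnapshotCounter("snap-7", expected_count=2)
counter.on_persisted("snap-7")
counter.on_persisted("snap-8")          # tagged for a later snapshot; not counted here
early = counter.snapshot_may_complete()
counter.on_persisted("snap-7")
ready = counter.snapshot_may_complete()
```

Tagging each modification operation with a snapshot identifier lets operations that follow the snapshot proceed without being queued behind it, at the cost of each system keeping a counter per identifier.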

[0305] In the example method shown in FIG. 12, the information (1110) describing the modification to the dataset (1142) may include common metadata information (1154) associated with the request (1104) to modify the dataset (1142). The common metadata information (1154) associated with the request (1104) to modify the dataset (1142) may be used to ensure common metadata associated with the dataset (1142) in the storage systems (1138, 1140, 1150) across which the dataset (1142) is synchronously replicated. Common metadata in this context may be embodied, for example, as any data other than the content stored into the dataset (1142) by one or more requests (e.g., one or more write requests issued by a host). Common metadata may include data that the synchronous replication implementation keeps consistent in some way across the storage systems (1138, 1140, 1150) across which the dataset (1142) is synchronously replicated, particularly where that common metadata relates to how the stored content is managed, recovered, resynchronized, snapshotted, or synchronously replicated. The reader will understand that two or more modification operations may depend on the same common metadata, where ordering of the modification operations themselves is unnecessary but consistent application of the common metadata once, rather than twice, is necessary. One way to handle multiple dependencies on common metadata is to define the metadata in a separate operation that is instantiated and described from the leader storage system. Two modification operations that depend on that common metadata may then be given a queue predicate that depends on that operation. Another way to handle multiple dependencies on common metadata is to associate the common metadata with the first of the two operations and to make the second operation dependent on the first. A variant makes the second operation dependent only on the common metadata aspect of the first operation, so that only that part of the first operation must be made persistent before the second operation can be processed. Yet another way to handle multiple dependencies on common metadata is to include the common metadata in every operation description that depends on it. This works well when applying the common metadata is idempotent, for example when it amounts to simply attaching an identifier to the common metadata. If that identifier has already been processed, it can be ignored. In some cases, identifiers might be associated with parts of the common metadata.
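
The idempotent approach described last, in which every dependent operation description carries the common metadata together with an identifier, can be illustrated as follows. The sketch assumes a simplified in-memory store. `CommonMetadataStore`, `update_id`, and the example key and value are hypothetical names introduced for illustration.

```python
class CommonMetadataStore:
    """In-memory stand-in for the common metadata held by one storage system."""

    def __init__(self):
        self.applied_ids = set()  # identifiers of updates that were already processed
        self.metadata = {}

    def apply(self, update_id: str, key: str, value: str) -> bool:
        """Apply an update once; return False if this identifier was already processed."""
        if update_id in self.applied_ids:
            return False          # duplicate delivery: safe to ignore
        self.metadata[key] = value
        self.applied_ids.add(update_id)
        return True


store = CommonMetadataStore()
# Two modification operations carry the same common metadata update in their descriptions.
first_applied = store.apply("md-42", "volume-name", "vol1")
second_applied = store.apply("md-42", "volume-name", "vol1")
```

Because the second application is recognized by its identifier and skipped, the two modification operations need no ordering between themselves, yet the metadata is applied exactly once.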

[0306] In the example method shown in FIG. 12, receiving (1126) an indication that the follower storage system has processed the request (1104) to modify the dataset (1142) may include receiving (1156), from each of the follower storage systems (1138, 1150), an indication that the follower storage system (1138, 1150) has processed the request (1104) to modify the dataset (1142). In this example, the indication that each follower storage system (1138, 1150) has processed the request (1104) to modify the dataset (1136) is embodied as a distinct acknowledgment (1122, 1148) message sent from each follower storage system (1138, 1150) to the leader storage system (1140). The reader will understand that although many of the steps described above are shown and described as occurring in a particular order, no particular order is actually required. In fact, because the follower storage systems (1138, 1150) and the leader storage system (1140) are independent storage systems, each storage system may perform some of the steps described above in parallel. For example, one or more of the follower storage systems (1138, 1150) may receive (1116, 1142) the information (1110) describing the modification to the dataset (1142), process (1118, 1144) the request (1104) to modify the dataset (1142), or acknowledge (1120, 1146) completion of the request (1104) to modify the dataset (1142) before the leader storage system (1140) has processed (1124) the request (1104) to modify the dataset (1142). Alternatively, the leader storage system (1140) may have processed (1124) the request (1104) to modify the dataset (1142) before one or more of the follower storage systems (1138, 1150) have received (1116, 1142) the information (1110) describing the modification to the dataset (1142), processed (1118, 1144) the request (1104) to modify the dataset (1142), or acknowledged (1120, 1146) completion of the request (1104) to modify the dataset (1142).

[0307] The example method shown in FIG. 12 also includes determining (1158), by the leader storage system (1140), whether the request (1104) to modify the dataset (1142) has been processed (1118, 1144) by each of the follower storage systems (1138, 1150) before acknowledging (1134) completion of the request (1104) to modify the dataset (1142). The leader storage system (1140) may determine (1158) whether the request (1104) to modify the dataset (1142) has been processed (1118, 1144) by each of the follower storage systems (1138, 1150), for example, by determining whether the leader storage system (1140) has received, from each of the follower storage systems (1138, 1150), an acknowledgment message or other message indicating that the request (1104) to modify the dataset (1142) has been processed (1118, 1144) by that follower storage system. In such an example, if the leader storage system (1140) affirmatively (1162) determines that the request (1104) to modify the dataset (1142) has been processed (1118, 1144) by each of the follower storage systems (1138, 1150) and has also been processed (1124) by the leader storage system (1140), the leader storage system (1140) may proceed by acknowledging (1134) completion of the request (1104) to modify the dataset (1142) to the host (1102) that initiated the request (1104). If, however, the leader storage system (1140) determines that the request (1104) to modify the dataset (1142) has not (1160) been processed (1118, 1144) by at least one of the follower storage systems (1138, 1150), or has not been processed (1124) by the leader storage system (1140), the leader storage system (1140) may not yet acknowledge (1134) completion of the request (1104) to the host (1102) that initiated it. This is because the leader storage system (1140) may acknowledge (1134) completion to the host (1102) only once the request (1104) to modify the dataset (1142) has been successfully processed on all storage systems (1138, 1140, 1150) across which the dataset (1142) is synchronously replicated.
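The acknowledgment condition described for the leader can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `LeaderAckGate`, its method names, and the string identifiers for the storage systems are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderAckGate:
    """Hypothetical helper: tracks which systems have processed one request."""
    followers: frozenset                      # IDs of all follower storage systems
    leader_processed: bool = False
    acked_followers: set = field(default_factory=set)

    def record_leader_processed(self) -> None:
        self.leader_processed = True

    def record_follower_ack(self, follower_id: str) -> None:
        if follower_id not in self.followers:
            raise ValueError(f"unknown follower: {follower_id}")
        self.acked_followers.add(follower_id)

    def may_acknowledge_host(self) -> bool:
        # Acknowledge only when the leader and every follower have processed.
        return self.leader_processed and self.acked_followers >= self.followers


gate = LeaderAckGate(followers=frozenset({"follower-1138", "follower-1150"}))
gate.record_leader_processed()
gate.record_follower_ack("follower-1138")
print(gate.may_acknowledge_host())  # False: follower-1150 has not acknowledged
gate.record_follower_ack("follower-1150")
print(gate.may_acknowledge_host())  # True
```

Because the gate only records facts and tests them together, the leader's own processing and the follower acknowledgments can arrive in any order.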

[0308] The reader will appreciate that although the example method shown in FIG. 12 illustrates an embodiment in which the dataset (1142) is synchronously replicated across three storage systems, one of which is the leader storage system (1140) and the remainder of which are follower storage systems (1138, 1150), other embodiments may include additional storage systems. In such other embodiments, the additional follower storage systems may operate in the same manner as the follower storage systems (1138, 1150) shown in FIG. 12.

[0309] For further explanation, FIG. 13 sets forth a flowchart illustrating an example method for servicing I/O operations directed to a dataset (1142) that is synchronized across multiple storage systems (1138, 1140) in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage systems (1138, 1140) shown in FIG. 13 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems shown in FIG. 13 may include the same components as the storage systems described above, fewer components, or additional components.

[0310] The example method shown in FIG. 13 includes receiving (1302), by a follower storage system (1138), a request (1104) to modify a dataset (1142). The request (1104) to modify the dataset (1142) may be embodied, for example, as a request to write data to a location within the storage system (1138) that contains data included in the dataset (1142), as a request to write data to a volume that contains data included in the dataset (1142), or as some other operation that results in a change to some portion of the data included in the dataset (1142). In the example method shown in FIG. 13, the request (1104) to modify the dataset (1142) is issued by a host (1102), which may be embodied, for example, as an application executing on a virtual machine, as an application executing on a computing device that is connected to the storage system (1138), or as some other entity configured to access the storage system (1138).

[0311] The example method shown in FIG. 13 also includes sending (1304), from the follower storage system (1138) to the leader storage system (1140), a logical description (1306) of the request (1104) to modify the dataset (1142). In the example method shown in FIG. 13, the logical description (1306) of the request (1104) to modify the dataset (1142) may be formatted in a way that is understood by the leader storage system (1140) and may contain information describing the type of operation requested in the request (1104) to modify the dataset (1142) (e.g., a read-type operation, a snapshot-type operation), information describing the location where the I/O payload is placed, information describing the size of the I/O payload, or some other information. In an alternative embodiment, the follower storage system (1138) may simply forward some portion of (including all of) the request (1104) to modify the dataset (1142) to the leader storage system (1140).
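As an illustration of what such a logical description might carry, the following Python sketch models it as a small record that is serialized before being sent to the leader. The field names, the JSON encoding, and the `describe_request` helper are assumptions made for this example; the disclosure does not prescribe a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LogicalDescription:
    """Hypothetical shape of the logical description (1306)."""
    op_type: str           # type of operation requested
    payload_location: str  # where the follower has placed the I/O payload
    payload_size: int      # size of the I/O payload in bytes


def describe_request(op_type: str, buffer_name: str, payload: bytes) -> LogicalDescription:
    # The payload itself stays on the follower; only its description is sent.
    return LogicalDescription(op_type, buffer_name, len(payload))


def encode(description: LogicalDescription) -> bytes:
    # JSON is an arbitrary choice for this sketch.
    return json.dumps(asdict(description), sort_keys=True).encode("utf-8")


desc = describe_request("write", "follower-buffer-7", b"\x00" * 4096)
message = encode(desc)
print(json.loads(message)["payload_size"])  # 4096
```

Keeping the description separate from the payload mirrors the point that the leader needs to know what is being requested, not necessarily to hold the data at this stage.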

[0312] The example method shown in FIG. 13 also includes generating (1308), by the leader storage system (1140), information (1310) describing the modification to the dataset (1142). The leader storage system (1140) may generate (1308) the information (1310) describing the modification to the dataset (1142), for example, by determining the ordering of the request relative to any other operations that are in progress, by computing any distributed state changes (for example, to common elements of metadata) across all members of a pod (e.g., all storage systems across which the dataset is synchronously replicated), and so on. The information (1310) describing the modification to the dataset (1142) may be embodied, for example, as system-level information that is used to describe an I/O operation to be performed by a storage system. The leader storage system (1140) may generate (1308) the information (1310) describing the modification to the dataset (1142) by processing the request (1104) to modify the dataset (1142) just enough to determine what must happen in order to service the request (1104). For example, the leader storage system (1140) may determine whether some ordering of the execution of the request (1104) to modify the dataset (1142) relative to other requests to modify the dataset (1142) is required in order to produce an equivalent result on each storage system (1138, 1140).

[0313] Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a request to copy blocks from a first address range in the dataset (1142) to a second address range in the dataset (1142). In such an example, assume that three other write operations (write A, write B, write C) are directed to the first address range in the dataset (1142). In such an example, if the leader storage system (1140) services write A and write B (but does not service write C) before copying the blocks from the first address range to the second address range, the follower storage system (1138) must also order write A and write B (but not write C) before copying the blocks from the first address range to the second address range in order to produce a consistent result. Thus, when the leader storage system (1140) generates (1308) the information (1310) describing the modification to the dataset (1142), in this example the leader storage system (1140) could generate information (e.g., sequence numbers for write A and write B) that identifies the other operations that must be completed before the follower storage system (1138) can process the request (1104) to modify the dataset (1142).
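The copy example can be sketched as a dependency check on the follower. In the following Python illustration, the leader attaches the sequence numbers of prerequisite operations to each piece of modification information, and the follower defers an operation until all of its prerequisites are complete. The names `ModificationInfo`, `FollowerApplier`, and `depends_on` are hypothetical, and modeling write C as depending on the copy is an assumption used to show that write C lands after the copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModificationInfo:
    seq: int                             # sequence number assigned by the leader
    op: str                              # human-readable operation name
    depends_on: frozenset = frozenset()  # sequence numbers that must finish first


class FollowerApplier:
    """Applies operations only after their prerequisites have completed."""

    def __init__(self) -> None:
        self.completed: set = set()
        self.pending: list = []
        self.applied_order: list = []

    def submit(self, info: ModificationInfo) -> None:
        self.pending.append(info)
        self._drain()

    def _drain(self) -> None:
        progressed = True
        while progressed:
            progressed = False
            for info in list(self.pending):
                if info.depends_on <= self.completed:
                    self.pending.remove(info)
                    self.completed.add(info.seq)
                    self.applied_order.append(info.op)
                    progressed = True


applier = FollowerApplier()
# Messages arrive out of order; the copy must follow writes A and B only.
applier.submit(ModificationInfo(4, "copy", frozenset({1, 2})))
applier.submit(ModificationInfo(1, "write A"))
applier.submit(ModificationInfo(3, "write C", frozenset({4})))
applier.submit(ModificationInfo(2, "write B"))
print(applier.applied_order)  # ['write A', 'write B', 'copy', 'write C']
```

Whatever order the messages arrive in, the follower reproduces the leader's choice: the copy sees the effects of writes A and B and not of write C.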

[0314] The reader will further appreciate that the correct outcome of any operation must be committed to the point of being recoverable before the operation can be acknowledged. However, multiple operations may be committed together, and operations may be partially committed if recovery will ensure correctness. For example, a snapshot could be committed locally with a recorded dependency on expected writes A and B, even though A or B themselves have not yet been committed. If the snapshot cannot be acknowledged and the missing I/O cannot be recovered from another array, recovery may end up backing out the snapshot. Also, if write B overlaps write A, the leader storage system may "order" B to be after A, but A could actually be discarded, in which case the operation to write A would simply wait for B. Writes A, B, C, and D, coupled with a snapshot between A, B and C, D, could be committed and/or acknowledged in part or all together, as long as recovery cannot result in a snapshot inconsistency across arrays and as long as acknowledgment does not complete a later operation before an earlier operation has been persisted to the point that it is guaranteed to be recoverable.
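One simple way to satisfy the last condition is to release acknowledgments strictly in sequence order, which the following Python sketch illustrates. This is a deliberately conservative policy chosen for the example: the passage only requires that a later operation not be completed ahead of an earlier operation that is not yet recoverable, and it allows more flexible schemes. The class `InOrderAcker` and its sequence numbering are hypothetical.

```python
class InOrderAcker:
    """Releases acknowledgments only for a contiguous prefix of persisted operations."""

    def __init__(self, first_seq: int = 1) -> None:
        self.next_to_ack = first_seq
        self.persisted: set = set()

    def mark_persisted(self, seq: int) -> list:
        """Record that operation `seq` is recoverable; return the acks now releasable."""
        self.persisted.add(seq)
        released = []
        while self.next_to_ack in self.persisted:
            self.persisted.remove(self.next_to_ack)
            released.append(self.next_to_ack)
            self.next_to_ack += 1
        return released


acker = InOrderAcker()
print(acker.mark_persisted(2))  # []      operation 1 is not yet recoverable
print(acker.mark_persisted(1))  # [1, 2]  both can now be acknowledged together
print(acker.mark_persisted(3))  # [3]
```

Operations may persist in any order, and several may be acknowledged together, but no acknowledgment overtakes an earlier operation that could still be lost.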

[0315] The example method shown in FIG. 13 also includes sending (1312), from the leader storage system (1140) to the follower storage system (1138), the information (1310) describing the modification to the dataset (1142). Sending (1312) the information (1310) describing the modification to the dataset (1142) from the leader storage system (1140) to the follower storage system (1138) may be carried out, for example, by the leader storage system (1140) sending one or more messages to the follower storage system (1138). However, given that the follower storage system (1138) was the original recipient of the request (1104) to modify the dataset (1142), the leader storage system (1140) may not need to send the I/O payload of the request (1104) to modify the dataset (1142). Instead, the follower storage system (1138) may extract the I/O payload from the request (1104) to modify the dataset (1142), the follower storage system (1138) may receive the I/O payload as part of one or more other messages associated with the request (1104) to modify the dataset (1142), or the follower storage system (1138) may otherwise have access to the I/O payload because the host has stored it in a known location (e.g., a buffer in the follower storage system (1138) that is accessed via RDMA or RDMA-like access) or made it available in some other way.
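The following Python sketch shows one way the follower could pair the leader's payload-free modification information with the I/O payload it already holds. It is a minimal illustration under assumptions: the `request_id` key, the dictionary message format, and the `FollowerPayloadCache` class are hypothetical and are not taken from the disclosure.

```python
class FollowerPayloadCache:
    """Holds I/O payloads received from the host until the leader's information arrives."""

    def __init__(self) -> None:
        self._payloads: dict = {}

    def stash(self, request_id: str, payload: bytes) -> None:
        # Called when the follower first receives the request from the host.
        self._payloads[request_id] = payload

    def combine(self, info: dict) -> dict:
        # The leader's message carries metadata only; the payload is local.
        payload = self._payloads.pop(info["request_id"])
        if len(payload) != info["payload_size"]:
            raise ValueError("payload size does not match the leader's description")
        return {**info, "payload": payload}


cache = FollowerPayloadCache()
cache.stash("req-1", b"new block contents")
leader_info = {"request_id": "req-1", "seq": 17, "payload_size": 18}
operation = cache.combine(leader_info)
print(operation["seq"], len(operation["payload"]))  # 17 18
```

The benefit of this split is that the payload crosses the inter-system link at most once, from the follower to the leader, rather than being echoed back.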

[0316] The example method shown in FIG. 13 also includes processing (1318), by the leader storage system (1140), the request (1104) to modify the dataset (1142). In the example method shown in FIG. 13, the leader storage system (1140) may process (1318) the request (1104) to modify the dataset (1142), for example, by modifying the contents of one or more storage devices (e.g., NVRAM devices, SSDs, HDDs) included in the leader storage system (1140) in accordance with the information (1110) describing the modification to the dataset (1142) as well as the I/O payload received from the follower storage system (1138). Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a write operation directed to a volume included in the dataset (1142), and the information (1110) describing the modification to the dataset (1142) indicates that the write operation can be executed only after a previously issued write operation has been processed. In such an example, processing (1318) the request (1104) to modify the dataset (1142) may be carried out by the leader storage system (1140) first verifying that the previously issued write operation has been processed on the leader storage system (1140) and subsequently writing the I/O payload associated with the write operation to one or more storage devices included in the leader storage system (1140). In such an example, the request (1104) to modify the dataset (1142) may be considered to have been completed and successfully processed, for example, when the I/O payload has been committed to persistent storage within the leader storage system (1140).

[0317] The example method shown in FIG. 13 also includes acknowledging (1320), by the leader storage system (1140) to the follower storage system (1138), completion of the request (1104) to modify the dataset (1142). In the example method shown in FIG. 13, the leader storage system (1140) may acknowledge (1320) completion of the request (1104) to modify the dataset (1142), for example, through the use of one or more acknowledgment (1322) messages sent from the leader storage system (1140) to the follower storage system (1138), or via some other appropriate mechanism.

[0318] The example method shown in FIG. 13 also includes receiving (1314), from the leader storage system (1140), the information (1310) describing the modification to the dataset (1142). The follower storage system (1138) may receive (1314) the information (1110) describing the modification to the dataset (1142) from the leader storage system (1140), for example, via one or more messages sent from the leader storage system (1140) to the follower storage system (1138). The one or more messages may be sent from the leader storage system (1140) to the follower storage system (1138) via one or more dedicated data communication links between the two storage systems (1138, 1140), by the leader storage system (1140) writing the messages to a predetermined memory location (e.g., the location of a queue) on the follower storage system (1138) using RDMA or a similar mechanism, or in other ways. The reader will appreciate, however, that in the example method shown in FIG. 13 the leader storage system (1140) need not send the I/O payload associated with the request (1104) to modify the dataset (1142) to the follower storage system (1138). The follower storage system (1138) can extract the I/O payload from the request (1104) to modify the dataset (1142) that it received, can extract the I/O payload from one or more other messages received from the host (1102), or can otherwise obtain the I/O payload by virtue of having been the target of the request (1104) to modify the dataset (1142) issued by the host (1102).

[0319] In one embodiment, the follower storage system (1138) may receive (1314) the information (1110) describing the modification to the dataset (1142) from the leader storage system (1140) through the use of SCSI requests (writes from sender to receiver, or reads from receiver to sender) as the communication mechanism. In such an embodiment, a SCSI write request is used to encode the information that is intended to be sent (including whatever data and metadata), and the request may be delivered to a special pseudo-device, over a specially configured SCSI network, or through any other agreed-upon addressing mechanism. Alternatively, the model can issue a set of open SCSI read requests from the receiver to the sender, again using a special device, a specially configured SCSI network, or some other agreed-upon mechanism. The encoded information, including data and metadata, is then delivered to the receiver as the response to one or more of these open SCSI requests. Such a model can be implemented over Fibre Channel SCSI networks, which are often deployed as "dark fibre" storage network infrastructure between data centers. Such a model also allows the same network links to be used for host-to-remote-array multipathing and for bulk array-to-array communication.
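The "open read request" variant can be pictured as the receiver keeping a pool of outstanding reads that the sender completes whenever it has something to deliver. The Python sketch below imitates that pattern with in-memory queues only; it does not use or model any real SCSI library or command set, and the class `OpenReadChannel` and its methods are hypothetical.

```python
from collections import deque


class OpenReadChannel:
    """In-memory imitation of delivering messages as responses to open reads."""

    def __init__(self) -> None:
        self._open_reads = deque()  # completion callbacks posted by the receiver
        self._backlog = deque()     # messages waiting for an open read

    def post_read(self, on_complete) -> None:
        # Receiver side: issue a read that stays open until data is available.
        if self._backlog:
            on_complete(self._backlog.popleft())
        else:
            self._open_reads.append(on_complete)

    def send(self, message: bytes) -> None:
        # Sender side: answer the oldest open read, or queue the message.
        if self._open_reads:
            self._open_reads.popleft()(message)
        else:
            self._backlog.append(message)


received = []
channel = OpenReadChannel()
channel.post_read(received.append)
channel.post_read(received.append)
channel.send(b"info-1")
channel.send(b"info-2")
channel.send(b"info-3")             # no open read left, so it waits
channel.post_read(received.append)  # a new open read drains the backlog
print(received)  # [b'info-1', b'info-2', b'info-3']
```

The design point this illustrates is that the receiver initiates every transfer, so the sender never needs to address the receiver directly.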

[0320] The example method shown in FIG. 13 also includes processing (1316), by the follower storage system (1138), the request (1104) to modify the dataset (1142). In the example method shown in FIG. 13, the follower storage system (1138) may process (1316) the request (1104) to modify the dataset (1142) by modifying the contents of one or more storage devices (e.g., NVRAM devices, SSDs, HDDs) included in the follower storage system (1138) in accordance with the information (1110) describing the modification to the dataset (1142). Consider an example in which the request (1104) to modify the dataset (1142) is embodied as a write operation directed to a volume included in the dataset (1142), and the information (1110) describing the modification to the dataset (1142) indicates that the write operation can be executed only after a previously issued write operation has been processed. In such an example, processing (1316) the request (1104) to modify the dataset (1142) may be carried out by the follower storage system (1138) first verifying that the previously issued write operation has been processed on the follower storage system (1138) and subsequently writing the I/O payload associated with the write operation to one or more storage devices included in the follower storage system (1138). In such an example, the request (1104) to modify the dataset (1142) may be considered to have been completed and successfully processed, for example, when the I/O payload associated with the request (1104) has been committed to persistent storage within the follower storage system (1138).

[0321] The example method shown in FIG. 13 also includes receiving (1324), from the leader storage system (1140), an indication that the leader storage system (1140) has processed the request (1104) to modify the dataset (1142). In this example, the indication that the leader storage system (1140) has processed the request (1104) to modify the dataset (1142) is embodied as the acknowledgment (1322) message sent from the leader storage system (1140) to the follower storage system (1138). The reader will appreciate that although many of the steps described above are depicted and described as occurring in a particular order, no particular order is actually required. In fact, because the follower storage system (1138) and the leader storage system (1140) are independent storage systems, each storage system may perform some of the steps described above in parallel. For example, the follower storage system (1138) may receive (1324) the indication that the leader storage system (1140) has processed the request (1104) to modify the dataset (1142) before the follower storage system (1138) has processed (1316) the request (1104) to modify the dataset (1142). Likewise, the follower storage system (1138) may receive (1324) the indication that the leader storage system (1140) has processed the request (1104) to modify the dataset (1142) before the follower storage system (1138) has received (1314) the information (1110) describing the modification to the dataset (1142) from the leader storage system (1140).

[0322] The example method shown in FIG. 13 also includes acknowledging (1326), by the follower storage system (1138), completion of the request (1104) to modify the dataset (1142). Acknowledging (1326) completion of the request (1104) to modify the dataset (1142) may be carried out, for example, by the follower storage system (1138) issuing an acknowledgment (1328) message to the host (1102) that issued the request (1104) to modify the dataset (1142). In the example method shown in FIG. 13, the follower storage system (1138) may determine whether the request (1104) to modify the dataset (1142) has been processed (1318) by the leader storage system (1140) before acknowledging (1326) completion of the request (1104) to modify the dataset (1142). The follower storage system (1138) may make this determination, for example, by determining whether the follower storage system (1138) has received, from the leader storage system (1140), an acknowledgment message or other message indicating that the request (1104) to modify the dataset (1142) has been processed (1318) by the leader storage system (1140). In such an example, if the follower storage system (1138) affirmatively determines that the request (1104) to modify the dataset (1142) has been processed (1318) by the leader storage system (1140), and the follower storage system (1138) has also processed (1316) the request (1104) to modify the dataset (1142), the follower storage system (1138) may proceed by acknowledging (1326) completion of the request (1104) to modify the dataset (1142) to the host (1102) that initiated the request (1104). If, however, the follower storage system (1138) determines that the request (1104) to modify the dataset (1142) has not been processed (1318) by the leader storage system (1140), or the follower storage system (1138) has not yet processed (1316) the request (1104) to modify the dataset (1142), the follower storage system (1138) may not yet acknowledge (1326) completion of the request (1104) to the host (1102) that initiated it. This is because the follower storage system (1138) may acknowledge completion to the host (1102) only once the request (1104) to modify the dataset (1142) has been successfully processed on all storage systems (1138, 1140) across which the dataset (1142) is synchronously replicated.
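Because the follower's own processing and the leader's acknowledgment may arrive in either order, the follower's decision can be written as a check that runs after each event. The Python sketch below is an illustration under assumptions; the class `FollowerCompletion` and its method names are hypothetical.

```python
class FollowerCompletion:
    """Decides when the follower may acknowledge the host for one request."""

    def __init__(self) -> None:
        self.local_processed = False
        self.leader_acked = False
        self.host_acked = False

    def on_local_processed(self) -> bool:
        self.local_processed = True
        return self._maybe_ack_host()

    def on_leader_ack(self) -> bool:
        self.leader_acked = True
        return self._maybe_ack_host()

    def _maybe_ack_host(self) -> bool:
        # Returns True exactly once, when both conditions first hold.
        if self.local_processed and self.leader_acked and not self.host_acked:
            self.host_acked = True
            return True
        return False


# The leader's acknowledgment arrives before the follower finishes locally.
state = FollowerCompletion()
print(state.on_leader_ack())       # False: follower has not processed yet
print(state.on_local_processed())  # True: host may now be acknowledged
```

The same object yields the same result if the two events are delivered in the opposite order, which reflects the point that no particular ordering of these steps is required.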

[0323] For further explanation, FIG. 14 sets forth a flowchart illustrating an example method for servicing I/O operations directed to a dataset (1142) that is synchronized across multiple storage systems (1138, 1140, 1334) in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage systems (1138, 1140, 1334) shown in FIG. 14 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems shown in FIG. 14 may include the same components as the storage systems described above, fewer components, or additional components.

[0324] The example method shown in FIG. 14 may be similar to the example method shown in FIG. 13, as the example method shown in FIG. 14 also includes: receiving (1302), by the follower storage system (1138), a request (1104) to modify the dataset (1142); sending (1304), from the follower storage system (1138) to the leader storage system (1140), a logical description (1306) of the request (1104) to modify the dataset (1142); generating (1308), by the leader storage system (1140), information (1310) describing the modification to the dataset (1142); processing (1318), by the leader storage system (1140), the request (1104) to modify the dataset (1142); acknowledging (1320), by the leader storage system (1140) to the follower storage system (1138), completion of the request (1104) to modify the dataset (1142); receiving (1314), from the leader storage system (1140), the information (1310) describing the modification to the dataset (1142); processing (1316), by the follower storage system (1138), the request (1104) to modify the dataset (1142); receiving (1324), from the leader storage system (1140), an indication that the leader storage system (1140) has processed the request (1104) to modify the dataset (1142); and acknowledging (1326), by the follower storage system (1138), completion of the request (1104) to modify the dataset (1142).

[0325] The example method depicted in FIG. 14 differs from the example method depicted in FIG. 13, however, because it depicts an embodiment in which the dataset (1142) is synchronously replicated across three storage systems, one of which is the leader storage system (1140) and the remainder of which are follower storage systems (1138, 1334). In such an example, the additional follower storage system (1334) carries out many of the same steps as the follower storage system (1138) depicted in FIG. 13, as the additional follower storage system (1334) can receive (1142), from the leader storage system (1140), information (1110) describing the modification to the dataset (1142), and can process (1142) the request (1104) to modify the dataset (1142) in accordance with that information (1110).

[0326] In the example method depicted in FIG. 14, the leader storage system (1140) can send (1338) the information (1110) describing the modification to the dataset (1142) to all of the follower storage systems (1138, 1334). The additional follower storage system (1334) can also acknowledge (1330) completion of the request (1104) to modify the dataset (1142) to the follower storage system (1138) that received (1302) the request (1104). It may do so, for example, through the use of one or more acknowledgment (1332) messages sent from the additional follower storage system (1334) to the follower storage system (1138) that received (1302) the request (1104), or via some other appropriate mechanism.

[0327] In the example depicted in FIG. 14, the follower storage system (1138) that received (1302) the request (1104) to modify the dataset (1142) may also receive (1336) an indication that all other follower storage systems (1334) have processed the request (1104). In this example, that indication is embodied as an acknowledgment (1332) message sent from the other follower storage system (1334) to the follower storage system (1138) that received (1302) the request (1104). Readers will appreciate that although many of the steps described above are depicted and described as occurring in a particular order, no particular order is actually required. In fact, because the follower storage systems (1138, 1334) and the leader storage system (1140) are independent storage systems, each storage system may be performing some of the steps described above in parallel. For example, the follower storage system (1138) may receive (1324) the indication that the leader storage system (1140) has processed the request (1104) before the follower storage system (1138) has itself processed (1316) the request (1104). Furthermore, the follower storage system (1138) may receive (1336) the indication that all other follower storage systems (1334) have processed the request (1104) before it receives (1324) the indication that the leader storage system (1140) has processed the request (1104). Alternatively, the follower storage system (1138) may receive (1336) the indication that all other follower storage systems (1334) have processed the request (1104) before it has itself processed (1316) the request (1104). Similarly, the follower storage system (1138) may receive (1324) the indication that the leader storage system (1140) has processed the request (1104) before it receives (1314), from the leader storage system (1140), the information (1110) describing the modification to the dataset (1142). It may also receive (1336) the indication that all other follower storage systems (1334) have processed the request (1104) before it receives (1314) that information (1110) from the leader storage system (1140).

[0328] Although not expressly depicted in FIG. 14, the follower storage system (1138) may determine, before acknowledging (1328) completion of the request (1104) to modify the dataset (1142), whether the request (1104) has been processed (1318) by the leader storage system (1140) and also processed (1144) by all other follower storage systems (1334). The follower storage system (1138) may make this determination, for example, by determining whether it has received acknowledgment messages from the leader storage system (1140) and from all other follower storage systems (1334) indicating that the request (1104) has been processed (1318, 1144) by each of those storage systems (1140, 1334). In such an example, if the follower storage system (1138) affirmatively determines that the request (1104) has been processed by the leader storage system (1140), by all other follower storage systems (1334), and by the follower storage system (1138) itself, the follower storage system (1138) may proceed by acknowledging (1326) completion of the request (1104) to the host (1102) that initiated the request (1104). If, however, the follower storage system (1138) determines that the request (1104) has not been processed by at least one of the leader storage system (1140), the other follower storage systems (1334), or the follower storage system (1138) itself, the follower storage system (1138) may not yet acknowledge (1326) completion of the request (1104) to the host (1102). This is because the follower storage system (1138) may acknowledge (1326) completion to the host (1102) only once the request (1104) has been successfully processed on every storage system (1138, 1140, 1334) across which the dataset (1142) is synchronously replicated.
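As an illustration of the acknowledgment rule described in paragraph [0328], the following is a minimal Python sketch and not the implementation of the disclosed storage systems. The class `CompletionTracker`, its method names, and the system identifiers are hypothetical names introduced only for this example. The sketch assumes that the follower that received the request tracks, per request, which in-sync storage systems have reported processing it, and acknowledges the host only when all of them have.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionTracker:
    """Hypothetical per-request record kept by the follower that received the request."""

    in_sync_systems: frozenset                      # every system replicating the dataset
    processed_by: set = field(default_factory=set)  # systems known to have processed it

    def record_processed(self, system_id: str) -> None:
        """Record local processing, or an acknowledgment from another system."""
        if system_id not in self.in_sync_systems:
            raise ValueError(f"{system_id} is not in the in-sync set")
        self.processed_by.add(system_id)

    def may_acknowledge_host(self) -> bool:
        """True only once every in-sync system has processed the request."""
        return self.processed_by == set(self.in_sync_systems)
```

Because the check is a plain set comparison, acknowledgments may be recorded in any order, which is consistent with the statement in paragraph [0327] that no particular ordering of the steps is required.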

[0329] Although not expressly depicted in FIG. 14, in some embodiments the follower storage system (1138) that received (1302) the request (1104) to modify the dataset (1142) may, in an effort to unlock any concurrent overlapping reads running on one or more of the storage systems (1138, 1140, 1334), send messages back to the leader storage system (1140) and to the other follower storage systems (1334) to signal that the modifying operation has completed everywhere. Alternatively, the follower storage system (1138) that received the request (1104) could send that message to the leader storage system (1140), and the leader storage system (1140) could then send messages that propagate the completion and unblock reads everywhere.
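One way the completion signal of paragraph [0329] could be used is sketched below in Python. This is an illustration under stated assumptions only: the class `OverlapGate`, its methods, and the use of half-open byte ranges are hypothetical and are not taken from the disclosure. The sketch assumes that each storage system holds reads that overlap a write still in flight and releases them when it is told that the write has completed everywhere.

```python
class OverlapGate:
    """Hypothetical per-system record of writes that are still in flight."""

    def __init__(self):
        # request_id -> (start, end) half-open byte range being modified
        self._in_flight = {}

    def begin_write(self, request_id, start, length):
        """Note that a write covering [start, start + length) has begun."""
        self._in_flight[request_id] = (start, start + length)

    def complete_everywhere(self, request_id):
        """Handle the message signaling that the write finished on all systems."""
        self._in_flight.pop(request_id, None)

    def read_is_blocked(self, start, length):
        """A read is held while it overlaps any in-flight write range."""
        end = start + length
        return any(s < end and start < e for s, e in self._in_flight.values())
```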

[0330] Readers will appreciate that although the example method depicted in FIG. 14 depicts an embodiment in which the dataset (1142) is synchronously replicated across three storage systems, one of which is the leader storage system (1140) and the remainder of which are follower storage systems (1138, 1334), other embodiments may include additional storage systems. In such other embodiments, the additional follower storage systems may operate in the same manner as the other follower storage system (1334) depicted in FIG. 14.

[0331] Readers will also appreciate that although only the example depicted in FIG. 12 expressly depicts an embodiment in which the information (1310) describing the modification to the dataset (1142) includes ordering information (1152) for the request (1104) to modify the dataset (1142), common metadata information (1154) associated with the request (1104), and an I/O payload (1114) associated with the request (1104), the information (1310) describing the modification to the dataset (1142) may include all, or a subset, of such information in the examples depicted in the remaining figures. Further, in embodiments in which the request (1104) to modify the dataset (1142) includes a request to take a snapshot of the dataset (1142), the information (1310) describing the modification to the dataset (1142) may, in each of the figures described above, also include an identification of one or more other requests to modify the dataset (1142) that are to be included in the contents of the snapshot.

[0332] Readers will appreciate that several situations can be accommodated because the information (1310) describing the modification to the dataset (1142) identifies one or more other requests to modify the dataset (1142) that are included in the contents of the snapshot, rather than identifying one or more other requests to modify the dataset (1142) that must be completed before the snapshot is taken. One situation is where an atomic operation can perform the snapshot and complete the last few writes in the same atomic update, meaning that the last few writes do not complete "before" the snapshot. Another situation is where a write could actually complete after the snapshot point is taken, provided that the write is included when it completes and that the snapshot itself is not considered complete until all included writes have been completed by all in-sync storage systems. Finally, writes that had not been indicated to the requester as completed before the snapshot was received might be either included in or omitted from the snapshot as a result of recovery actions. Essentially, recovery can rewrite the detailed history of received operations as long as the result is consistent and does not violate any guarantees related to operations that were signaled as completed.
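The second situation in paragraph [0332] can be sketched as follows. This Python fragment is illustrative only; `SnapshotDescription`, `snapshot_is_complete`, and the request and system identifiers are hypothetical names. It assumes that the snapshot description names the write requests whose effects belong in the snapshot, and that the snapshot counts as complete only once every in-sync storage system has completed all of those writes, even if some of them finish after the snapshot point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotDescription:
    """Hypothetical description of a snapshot request."""

    snapshot_id: str
    included_request_ids: frozenset  # writes whose effects are part of the snapshot


def snapshot_is_complete(description, completed_by_system):
    """Return True once every in-sync system has completed every included write.

    completed_by_system maps a system id to the set of request ids that
    system has completed so far.
    """
    return all(
        description.included_request_ids <= completed
        for completed in completed_by_system.values()
    )
```

Identifying the included writes, rather than listing prerequisites, lets the same check pass regardless of whether a given write finished before or after the snapshot point.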

[0333] For further explanation, FIG. 15 sets forth a flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. Although the example method depicted in FIG. 15 depicts an embodiment in which a dataset (1512) is synchronously replicated across only two storage systems (1514, 1524), the example depicted in FIG. 15 can be extended to embodiments in which the dataset (1512) is synchronously replicated across additional storage systems.

[0334] In the following examples, mediation among a set of storage systems (1514, 1524) for a pod allows the storage systems to resolve lost communication with a paired system, where communication may be lost due to a communication failure or some other kind of system failure. As described below, approaches to this problem may include the use of quorums, the use of an external control system that dictates which of the storage systems should continue processing I/O operations directed to a pod dataset, and racing for a resource such as a mediator. An advantage of mediation, however, is that it is simpler than quorum protocols and works well with a two-storage-system configuration for synchronously replicated storage systems, which is a common configuration. Further, mediation may be more robust and easier to configure than external control systems and many other types of resources that may be raced for.

[0335] As depicted in FIG. 15, a plurality of storage systems (1514, 1524) that are synchronously replicating a dataset (1512) may be in communication with a mediation service (1500) over a network (1554). The mediation service (1500) may resolve which storage system continues to service the dataset in the event of a communication fault between the storage systems, in the event of a storage system going offline, or due to some other triggering event. Mediation is advantageous because, if the storage systems are unable to communicate with each other, they may be unable to maintain a synchronously replicated dataset, and any received request to modify the dataset would be unserviceable, since servicing it would cause the dataset to become unsynchronized. In this example, mediation for the storage systems that are synchronously replicating the dataset may be provided by a mediation service (1500) that is external to the storage systems (1514, 1524). Although only two storage systems (1514, 1524) are depicted in this example, in general some other number of two or more storage systems may be part of an in-sync list of systems that are synchronously replicating a dataset. Specifically, if a first storage system (1514) detects a triggering event, such as loss of a communication link (1516) to a second storage system (1524), the first storage system (1514) may contact the external mediation service (1500) to determine whether it can safely take over the task of removing the non-communicating storage system from the in-sync list, which specifies the storage systems that are synchronized with respect to replicating the dataset. In other cases, the first storage system (1514) may contact the external mediation service (1500) and determine that it, the first storage system (1514), may have been removed from the in-sync list by the second storage system. In these examples, the storage systems (1514, 1524) need not be in continuous communication with the external mediation service (1500), because under normal conditions they do not need any information from the mediation service (1500) in order to operate normally and to maintain synchronous replication of the dataset (1512). In other words, in this example the mediation service (1500) does not have an active role in managing the membership of the in-sync list, and it may not even be aware of the normal operation of the storage systems (1514, 1524) in the in-sync list. Instead, the mediation service (1500) may simply provide persistent information that is used by the storage systems (1514, 1524) to determine membership in the in-sync list, or to determine whether a storage system can act to detach another storage system.

[0336] In some examples, the mediation service (1500) may be contacted by one or more of the storage systems (1514, 1524) in response to a triggering event, such as a communication link failure that prevents the storage systems (1514, 1524) from communicating with each other. Each storage system (1514, 1524) may nevertheless be able to communicate with the mediation service (1500) over a communication channel that is different from the communication channel used between the storage systems (1514, 1524). Consequently, while the storage systems (1514, 1524) may be unable to communicate with each other, each of them may still be in communication with the mediation service (1500), and the storage systems (1514, 1524) may use the mediation service (1500) to resolve which storage system may proceed to service data storage requests. Further, a storage system that wins mediation from the mediation service (1500) may detach another storage system and update the in-sync list that indicates the storage systems that may continue to synchronously replicate the dataset (1512). In some examples, the mediation service (1500) may handle various types of requests, such as a request to set a membership list that includes the requester storage system and excludes another storage system. In this example, the request completes successfully if the mediation service (1500) currently lists the requester as a member, and the request fails if the mediation service (1500) does not currently list the requester as a member. In this way, if two storage systems (1514, 1524) each make a request at approximately the same time, and each request serves to exclude the other storage system, then the first request received may succeed, with the mediation service setting the membership list to exclude the other storage system according to the first request, and the second request received may fail because the membership list has been set to exclude its requester. Mutually exclusive access to the shared resource that stores the membership list serves to ensure that only a single system at a time is allowed to set the membership list.
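The membership-list request described in paragraph [0336] can be sketched in Python as follows. This is a minimal in-process model under stated assumptions, not the disclosed mediation service: the class `MembershipMediator`, its method names, and the system names "A" and "B" are hypothetical, and a `threading.Lock` stands in for whatever mechanism provides mutually exclusive access to the shared resource.

```python
import threading


class MembershipMediator:
    """Hypothetical stand-in for the shared resource holding the membership list."""

    def __init__(self, members):
        self._members = frozenset(members)
        self._lock = threading.Lock()  # mutually exclusive access to the list

    def set_membership(self, requester, new_members):
        """Set a list that includes the requester; succeed only for a current member."""
        new_members = frozenset(new_members)
        if requester not in new_members:
            raise ValueError("the requested list must include the requester")
        with self._lock:
            if requester not in self._members:
                return False  # requester was already excluded by an earlier request
            self._members = new_members
            return True

    def members(self):
        with self._lock:
            return self._members
```

With two systems racing to exclude each other, whichever request is handled first succeeds and the later one fails, because by then its requester is no longer listed as a member.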

[0337] In another example, mediation may be based on a partition identifier, where a value may be defined to indicate a pod membership partition identifier in order to assert that membership has partitioned off, or removed, some set of storage systems from a pod. A "pod," as the term is used here and throughout the remainder of the present application, may be embodied as a management entity that represents a dataset, a set of managed objects and management operations, a set of access operations to modify or read the dataset, and a plurality of storage systems. Such management operations may modify or query managed objects equivalently through any of the storage systems, and access operations to read or modify the dataset operate equivalently through any of the storage systems. Each storage system may store a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by that storage system, and operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management operations to query the pod or in subsequent access operations to read the dataset. Additional details regarding a "pod" may be found in previously filed provisional patent application No. 62/518,071, which is incorporated herein by reference.

[0338] The partition identifier may be local information stored on a given storage system, in addition to the pod membership list stored by that storage system. Systems that are in proper communication with each other and are in sync may have the same partition identifier, and when a storage system is added to a pod, the current partition identifier may be copied along with the pod data contents. In this example, when one set of storage systems is not communicating with another set of storage systems, one storage system from each set may come up with a new and unique partition identifier and attempt to set it in a shared resource maintained by the mediation service (1500), using a particular operation that succeeds for the storage system that first acquires a lock on the shared resource. Another storage system, having failed to acquire the lock on the shared resource, fails in its attempt to perform the particular operation. In one implementation, an atomic compare-and-set operation may be used, in which the last partition identifier value stored by the mediation service (1500) is provided by a storage system in order to have permission to change the partition identifier to a new value. In this example, the compare-and-set operation may succeed for a storage system that is aware of the current partition identifier value, and the storage system that first sets the partition identifier value is one that was aware of the current partition identifier value. Further, a conditional store or PUT operation, which may be available in web service protocols, may serve to set the partition identifier value as described in this example. In other cases, such as in a SCSI environment, a compare-and-write operation may be used. In still other cases, the mediation service (1500) may perform the compare-and-set operation by receiving a request from a storage system, where the request indicates both an old partition identifier value and a new partition identifier value, and where the mediation service (1500) changes the stored partition identifier to the new partition identifier if and only if the currently stored value equals the old partition identifier.
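The compare-and-set behavior described in paragraph [0338] can be sketched as follows. This Python fragment is a minimal in-process model offered under stated assumptions, not the disclosed service or any particular web service or SCSI API: `PartitionIdStore`, `compare_and_set`, `current`, and `new_partition_id` are hypothetical names, a `threading.Lock` stands in for the atomicity the mediation service would provide, and a random UUID is only one possible way to obtain a new, unique identifier.

```python
import threading
import uuid


class PartitionIdStore:
    """Hypothetical shared resource holding the current partition identifier."""

    def __init__(self, initial_id):
        self._value = initial_id
        self._lock = threading.Lock()  # makes compare-and-set atomic in this model

    def compare_and_set(self, expected_old, new_value):
        """Store new_value if and only if the stored value equals expected_old."""
        with self._lock:
            if self._value != expected_old:
                return False
            self._value = new_value
            return True

    def current(self):
        with self._lock:
            return self._value


def new_partition_id():
    """Generate a new identifier that is unique for practical purposes."""
    return uuid.uuid4().hex
```

If two partitioned storage systems both present the same old identifier, only the first call changes the stored value; the second call finds that the stored value no longer matches and fails, which tells that system it lost mediation.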

[0339] In this way, mediation based on a partition identifier may be used to persist information that storage systems may use to determine whether a given storage system is included within a partitioned set of consistent pod members. In some cases, a partition identifier may change only in the case of a spontaneous detach caused by a fault in either a storage system or a network interconnect. In these examples, a storage system that takes itself offline for a pod in a controlled way may communicate with the other storage systems to remove itself as an in-sync pod member, and thus does not require the formation of a new mediated partition identifier. Further, a storage system that removes itself as a member of an in-sync pod may then add itself back as an in-sync pod member in a controlled way that does not require a new mediated partition identifier. In addition, new storage systems may be added to an in-sync pod as long as they are communicating with the in-sync pod members, and they may add themselves in a controlled way that does not require a new mediated partition identifier.

[0340] Consequently, one advantage of the mediated partition identifier mechanism is that the mediation service (1500) may be needed only when there is a failure or other triggering event to which at least one set of storage systems reacts by attempting to remove one or more non-communicating storage systems from the in-sync pod membership list, while the non-communicating storage systems may attempt to do the same in reverse. Another advantage is that the mediation service (1500) does not have to be absolutely reliable, and its occasional unavailability may have little impact on the availability of the overall storage service provided by the in-sync pod members. For example, if two synchronously replicated storage systems each fail once a year, then when the first of the two storage systems fails, the second storage system should successfully mediate to remove the first, unless the mediation service (1500) is unavailable at that precise moment. In short, if the mediation service (1500) is up and available at least 99% of the time, the probability that it is unavailable when needed is very low. In this example, there is at most a 1 in 100 chance (1% or less) that the mediation service (1500) is unavailable at the critical time, which can reduce a once-a-year outage to a once-a-century outage. Nevertheless, to reduce the likelihood that the mediation service (1500) is unavailable, the mediation service (1500) may be periodically monitored so that an administrator is alerted if it is generally unavailable, and the mediation service (1500) may in turn monitor the storage systems to generate alerts if a particular storage system becomes unavailable.
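
The arithmetic behind the once-a-century figure may be made explicit with a short sketch. It assumes, as a simplification not stated in the source, that mediator unavailability is independent of storage system failures and that one failure per year requires mediation; the function name is hypothetical.

```python
def expected_years_between_outages(mediations_per_year, mediator_availability):
    """Expected interval between pod outages caused by an unreachable mediator.

    Assumes mediator downtime is independent of storage system failures.
    """
    p_unavailable = 1.0 - mediator_availability
    outages_per_year = mediations_per_year * p_unavailable
    return 1.0 / outages_per_year


# One failure per year needing mediation, mediator available 99% of the time.
years = expected_years_between_outages(mediations_per_year=1.0,
                                       mediator_availability=0.99)
```

With these inputs the result is approximately 100 years, matching the paragraph's estimate.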

[0341] In another example, as an alternative to using a partition identifier associated with the in-sync members for a pod, the mediation service (1500) may provide one-time mediation race targets. Specifically, a mediation race target may be established each time the in-sync member storage systems for a pod need to allow for the possibility that one storage system may be detached by the others. For example, an agreed-upon key in a table of mediation values may be set to a new value only once; to win mediation, a storage system sets the agreed-upon key to a unique value that no other competing storage system would use. Before the mediation race, the agreed-upon key may not exist, or, if it does exist, it may be set to some agreed-upon precursor value such as UNSET or a null value. In this example, an operation to set the key to a particular value succeeds if the key does not exist, if the key is in the UNSET state, or if the key is already set to that same value; otherwise, the operation fails. Once a set of storage systems wins mediation, the remaining set of storage systems may define a new key for future mediation. In this example, a storage system may record the value it will use before it enters the mediation race, so that if it fails and then recovers or reboots before learning whether it won, it can use the same value again. If two or more storage systems are in communication and are racing together against some other set of non-communicating storage systems, this value may be shared among the communicating storage systems so that any one of them can continue the mediation race and, perhaps after some additional sequence of failures, take part in a second mediation race. For example, for correctness it may be necessary to race for, or to verify, the first mediation race target before racing to set a unique value for the second mediation race target. In particular, this sequence may be necessary until the second mediation race target has been reliably distributed to all storage systems that share the first mediation race target, and all of those storage systems have been made aware that it has been reliably distributed. At that point, there may be no continuing need to race for the first mediation target before racing for the second.
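
The set rule for a one-time mediation race target can be illustrated with a minimal sketch. The `RaceTable` class, the key name, and the value strings are hypothetical; a missing key and the UNSET precursor are both modeled as `None`.

```python
UNSET = None  # assumed precursor value; a missing key is treated the same way


class RaceTable:
    """Hypothetical table of mediation values keyed by race target."""

    def __init__(self):
        self._values = {}

    def try_set(self, key, value):
        """Succeed if the key is absent, UNSET, or already equal to value."""
        current = self._values.get(key, UNSET)
        if current is UNSET or current == value:
            self._values[key] = value
            return True
        return False


table = RaceTable()
a_wins = table.try_set("pod-1/race-1", "system-A:nonce-17")   # first setter wins
b_wins = table.try_set("pod-1/race-1", "system-B:nonce-42")   # different value loses
a_retry = table.try_set("pod-1/race-1", "system-A:nonce-17")  # same value succeeds again
```

Because repeating the same value succeeds, a storage system that recorded its value before racing can safely retry after a reboot, which is the reason the paragraph gives for recording the value in advance.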

[0342] In some examples, the mediation service (1500) may be managed on computer systems provided by an organization other than the organization or owner of the storage systems being mediated. For example, if a vendor sells two storage systems to a customer, the vendor may host the mediator on servers in vendor-owned or vendor-managed data centers, or the vendor may contract with a cloud service provider to host the service. The vendor may also ensure that the mediation service is sufficiently reliable and is in a failure zone distinct from those of the customer. In one case, without excluding other cloud service providers, the mediation service may be hosted on Amazon Web Services™ and implemented with DynamoDB for reliable database services, where DynamoDB may provide support for conditional-store primitives as web API database updates. In some cases, the mediation service may be implemented to operate across multiple cloud service provider regions or failure zones to further improve reliability. One advantage of having a vendor provide the mediation service is that the service is simple to configure. Furthermore, during creation of a pod, a storage system may obtain a cryptographic token from the mediation service and store it alongside the partition identifier and the pod membership list; the cryptographic token may be used to securely communicate the unique mediation service information for the pod.

[0343] In some cases, the mediation service (1500) may be unavailable at the time a storage system attempts to mediate, and the following method provides a process for recovering, at least eventually, from such a service outage. For example, if a first set of storage systems attempts to detach a second set of storage systems through the mediation service but cannot communicate with the mediation service (1500), then the first set of storage systems cannot complete the detach operation and cannot continue serving the pod. In some cases, if the two sets of storage systems manage to reconnect with each other so that all in-sync storage systems are communicating again, but the mediation service (1500) is still unavailable, the two sets of storage systems may synchronize and resume serving the pod. However, in this example, one or more requests may already have been sent to the mediation service (1500) to change the partition identifier, or to change whatever other properties are associated with mediation, and none of the storage systems can be sure whether a given request was received and processed with its acknowledgment lost, or was never received at all. As a result, if there is a set of failing storage systems or network interconnects, the storage systems may not be sure which value to assert for the partition identifier if and when the mediation service (1500) comes back online. In such a situation, it is preferable for service of the pod to resume either when all in-sync storage systems come back online and resume communicating, or when the in-sync storage systems can reconnect to the mediation service (1500). In one embodiment, when all in-sync storage systems reconnect, they all exchange the known partition identifier values that may have been sent to the mediation service (1500). For example, if two storage systems each attempted to change the partition identifier value, with one storage system attempting to change it to, say, 1749137481890 and the other attempting to change it to, say, 87927401839, and the last value known to have been acknowledged by the mediation service (1500) was 79223402936, then the mediation service (1500) may currently be storing any of these three partition identifier values. As a result, any future attempt to change the mediation partition identifier to a new value may supply any or all of these three partition identifiers in an attempt to gain the authority to make the change. Furthermore, a fourth attempt to change the partition identifier value may also encounter a failure, producing a fourth value that may need to be remembered by any storage system that later attempts yet another mediation. In addition, if any storage system successfully changes the partition identifier value at the mediation service (1500), that storage system may purge the older partition identifier values from any in-sync storage systems and from any storage systems that become in-sync in the future.
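
The recovery step of supplying every possibly-stored identifier may be sketched as follows, using the three example values from the paragraph. The dictionary-based mediator, the function names, and the new identifier 5550001 are hypothetical and chosen only for illustration.

```python
def compare_and_set(mediator, old_id, new_id):
    """Change the stored identifier only if it currently equals old_id."""
    if mediator["partition_id"] != old_id:
        return False
    mediator["partition_id"] = new_id
    return True


def mediate_with_candidates(mediator, candidate_ids, new_id):
    """Try each identifier the mediator might hold as the 'old' value."""
    for old_id in candidate_ids:
        if compare_and_set(mediator, old_id, new_id):
            return True
    return False


# Last acknowledged value plus two values whose acknowledgments were lost.
candidates = [79223402936, 1749137481890, 87927401839]
# Assume one unacknowledged request did reach the mediator.
mediator = {"partition_id": 87927401839}
won = mediate_with_candidates(mediator, candidates, new_id=5550001)
lost = mediate_with_candidates(mediator, candidates, new_id=5550002)
```

The first caller wins whichever of the three values is stored; a later caller that still holds only the old candidates is rejected, which is why the winner may purge the older values from the in-sync storage systems.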

[0344] In another example, the mediation service (1500) may mediate based on a unique key arranged for each potential future race. In such a case, the in-sync storage systems may agree to use a new key. Given that a new key cannot be set atomically on all storage systems at the same time, all storage systems should retain their old keys, and the values each storage system attempted to set in any previous mediation attempt, until all in-sync storage systems have received and recorded the new key. In this example, any earlier unraced keys and any earlier key/value mediation attempts may be circulated among all in-sync storage systems for the pod and recorded on each such storage system, together with the new key to use for future mediation attempts. For each previous unraced key (not including the new key), this exchange may also select a single agreed-upon value that all systems may use when racing for that key. After all in-sync systems for the pod have received and recorded all of these mediation keys and values (and the new agreed-upon key for any future race), the storage systems in the pod may then agree to discard the older keys in favor of the single new key. Note that two or more storage systems may have attempted to set the same mediation key to different values, and all such values may be recorded. If there is a failure during the process of exchanging or receiving all of these mediation keys and key/value pairs for past mediation attempts, then some storage systems may not have received and recorded the new mediation keys and values, while others may have. If the mediation service (1500) becomes available before all in-sync storage systems for the pod can reconnect with each other, then a subset of the storage systems for the pod may attempt to use the mediation service (1500) to detach another storage system from the pod. To win mediation, a storage system may attempt to set all recorded keys to their recorded values and, if that works, then attempt to set the new key to a unique value. If more than one value was recorded for the same key, that step succeeds if setting any one of those values succeeds. If the first step (setting the previous keys) fails, or the second step (setting the new key to a new unique value) fails, then the storage systems participating in that mediation attempt may go offline, retaining the value they attempted to set for the new key. If both steps succeed, then the communicating storage systems may detach the non-communicating storage systems and continue serving the pod. As an alternative to exchanging all past keys and values, a storage system may record only the keys and values that it tries itself, with no exchange of keys and values from the other storage systems for the pod. Then, if an in-sync storage system reconnects with the other in-sync storage systems for the pod (none of which succeeded in interacting with the mediation service), the in-sync storage systems may exchange one new mediation key and then exchange acknowledgments that they have all received and recorded the agreed-upon new key. If a failure prevents the exchange of acknowledgments, then a future mediation attempt (against a now-available mediation service) by a storage system that never received the new key may attempt to reassert its previous keys and values. A storage system that received the new key, but did not receive an indication that all storage systems for the pod had received it, may assert both its previous mediation keys and a value for the new key, asserting the previous keys first and then the new key. That future mediation attempt may still fail, after which the storage system may again reconnect to the other in-sync storage systems and may again incompletely exchange a new key, which adds yet another key. As keys accumulate over time through a series of incomplete exchanges of new keys, a future mediation attempt by a storage system may reassert each of its keys, together with any values it previously asserted for those keys, in the order in which they were recorded, until either it has successfully asserted values for all keys or it encounters a failure to assert a key, at which point it stops asserting keys and goes offline.
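
The two-step procedure described above (reassert every recorded key in order, then claim the new key) might be sketched as follows. `KeyValueMediator`, `attempt_mediation`, and the key and value strings are hypothetical names; the sketch assumes the same set-if-absent-or-equal rule as the one-time race target described earlier.

```python
class KeyValueMediator:
    """Hypothetical mediator holding one value per mediation key."""

    def __init__(self):
        self.values = {}

    def try_set(self, key, value):
        """Succeed if the key is absent or already holds this value."""
        if key not in self.values or self.values[key] == value:
            self.values[key] = value
            return True
        return False


def attempt_mediation(mediator, recorded, new_key, new_value):
    """recorded: list of (key, [values]) in the order the keys were recorded.

    Returns True if mediation is won; False means the caller goes offline.
    """
    for key, values in recorded:
        # A key with several recorded values passes if any one of them sets.
        if not any(mediator.try_set(key, v) for v in values):
            return False  # stop asserting keys at the first failure
    return mediator.try_set(new_key, new_value)


mediator = KeyValueMediator()
mediator.try_set("k1", "B-1")  # system B already won the earlier race for k1
a_result = attempt_mediation(mediator, [("k1", ["A-1"])], "k2", "A-2")
c_result = attempt_mediation(mediator, [("k1", ["A-1", "B-1"])], "k2", "C-2")
```

Here system A knows only its own value for `k1` and must go offline, while a system that recorded both values for `k1` proceeds to win the new key `k2`.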

[0345] In another example, a new mediation service may be configured when the current mediation service is unavailable. For example, if all in-sync storage systems for a pod are communicating with each other but not with the current mediation service, then the pod may be configured with a new mediation service. This is similar to the previous algorithm for selecting a new key or new mediation values, except that the new key is additionally configured to use the new mediation service rather than being merely another key associated with the same service. Furthermore, if there is a failure during this operation, then, as in the previous algorithm, some systems may race for the older keys, so systems that know both the old keys and the new key with the new mediation service may race for the new key on the new mediation service. If the previous mediation service is permanently unavailable, then all in-sync storage systems should eventually reconnect with each other and complete the exchange of the new mediation service, and of any keys and values associated with it, before pod service can safely resume.

[0346] In another example, a model for resolving failures may be to implement preference rules that favor one storage system over another. In this example, if the preferred storage system is running, it remains running and detaches any storage systems with which it is not communicating. Further, any other system that is not in verified communication with the preferred system takes itself offline. In this example, when a non-preferred storage system eventually reconnects with the preferred storage system, then if the preferred storage system had not yet detached the reconnecting storage system, the two storage systems may recover and resume from the state in which both are in-sync, whereas if the preferred storage system had detached the reconnecting storage system, the reconnecting storage system must first be resynchronized to bring it in-sync for the pod before it can resume serving the pod. Having a preferred storage system may not be useful for providing high availability, but it may be useful for other uses of synchronous replication, particularly asymmetric synchronous replication. Consider, for example, mirroring from a large central storage system in a data center or campus to a smaller (perhaps less managed) storage system running closer to the application servers, such as in a top-of-rack configuration. In this case, it may be beneficial to always favor the larger, more centrally managed storage system in the event of a network failure or a failure of the top-of-rack storage system, while stopping service for the pod altogether if the centrally managed storage system fails. Such a top-of-rack storage system might be used only to improve read performance or to reduce load on the data center storage network, but if asynchronous replication or other data management services run only on the centrally managed system, it may be preferable to reroute traffic to the central storage system, or to stop serving and call technical support, rather than to allow the top-of-rack storage system to continue alone. Furthermore, preference rules may be more complex: there may be two or more such “preferred” storage systems coupled together, perhaps along with some number of additional storage systems that rely on the preferred or required storage systems. In this example, the pod is online if all of the preferred or required storage systems are running, and is down if any of them are not. This is similar to a quorum model in which the quorum size equals the number of voting members, but it is simpler to implement than a generalized quorum model that allows fewer than all voting members.
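
The preference rules above reduce to two simple checks, sketched here under stated assumptions. The function names and system labels are hypothetical, and "running" and "reachable" are modeled as plain sets of system names.

```python
def pod_is_online(required_systems, running_systems):
    """The pod is online only if every preferred/required system is running."""
    return set(required_systems).issubset(running_systems)


def non_preferred_may_serve(preferred_system, reachable_systems):
    """A non-preferred system serves only while it can reach the preferred one."""
    return preferred_system in reachable_systems


required = {"central-1", "central-2"}
all_up = pod_is_online(required, {"central-1", "central-2", "rack-7"})
one_down = pod_is_online(required, {"central-1", "rack-7"})
rack_isolated = non_preferred_may_serve("central-1", {"rack-8"})
```

The subset test is the degenerate quorum the paragraph mentions: the quorum size equals the number of voting members, so no vote counting is needed.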

[0347] In another example, a combination of mechanisms may be used, which may be useful when a pod is stretched across more than two storage systems. In one example, preference rules may be combined with mediation. In the top-of-rack example, the larger central storage system in a data center or campus might itself be synchronously replicated to a large storage system in a second location. In that case, the top-of-rack storage systems may never resume alone and may give preference to either of the larger central storage systems in the two locations. The two larger storage systems in that case might be configured to mediate between each other, and any smaller storage system that can connect to whichever of the two larger storage systems remains online may continue serving its pod, while any smaller storage system that cannot connect to either of the two larger storage systems (or that can connect only to one that is offline for the pod) may stop serving the pod. Furthermore, a preference model may also be combined with a quorum-based model. For example, three large storage systems in three locations might use a quorum model among themselves, with smaller satellite or top-of-rack storage systems lacking any votes and working only if they can connect to one of the larger in-sync storage systems that is online.

[0348] In another example of combining mechanisms, mediation may be combined with a quorum model. For example, there may be three storage systems that normally vote among themselves to ensure that two storage systems can safely detach a third that is not communicating, while one storage system can never detach the other two by itself. However, after two storage systems have successfully detached the third, the configuration is now down to two storage systems that agree they are in-sync and agree that the third storage system is detached. In that case, the two remaining storage systems may agree to use mediation (such as with a cloud service) to handle a further storage system or network failure. This combination of mediation and quorum may be extended further. For example, in a pod stretched across four storage systems, any three can detach a fourth, but if two in-sync storage systems are communicating with each other but not with two other storage systems that they both currently consider to be in-sync, then they may use mediation to safely detach the other two. Even in a five-storage-system pod configuration, if four storage systems vote to detach a fifth, then the remaining four can use mediation if they are split into two equal halves, and once the pod is down to two storage systems, they can use mediation to resolve a successive failure. Going from five to three might then use quorum among the three, allowing a drop to two, with the two remaining storage systems again using mediation if there is a further failure. This general multi-mode quorum and mediation mechanism can handle a number of additional situations that neither quorum among symmetric storage systems nor mediation by itself can handle. This combination may increase the number of cases in which faulty or occasionally unreachable mediators (or, in the case of cloud mediators, mediators that customers may not entirely trust) can be used reliably. Further, this combination better handles the case of a three-storage-system pod, where mediation alone might result in a first storage system successfully detaching the second and third storage systems on a network failure that affects only the first storage system. This combination may also better handle a sequence of failures that affect one storage system at a time, as described in the three-to-two-to-one example. These combinations work because being in-sync and the detach operation result in specific states. In other words, the system is stateful: moving from detached to in-sync is a process, and each stage in a sequence of quorum/mediator relationships ensures that, at every point, all online/in-sync storage systems agree on the current persistent state for the pod. This is unlike some other clustering models, in which simply having a majority of cluster nodes communicating again is expected to be enough to resume operation. A preference model can still be added, however, with satellite or top-of-rack storage systems never participating in either mediation or quorum, and serving the pod only if they can connect to an online storage system that does participate in mediation or quorum.
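
One way to read the combined quorum and mediation rules is as a choice of mechanism based on how many of the currently in-sync storage systems are still communicating. The sketch below is a simplification under that assumption (a strict majority uses quorum, an exact half uses mediation); the function name and return labels are hypothetical.

```python
def detach_strategy(in_sync_count, communicating_count):
    """Pick how a communicating group may detach the systems it cannot reach.

    in_sync_count is the persistently agreed number of in-sync members;
    communicating_count is the size of the group that can still talk,
    including the caller.
    """
    if not 1 <= communicating_count <= in_sync_count:
        raise ValueError("communicating_count must be in 1..in_sync_count")
    if communicating_count == in_sync_count:
        return "none"        # nothing to detach
    if 2 * communicating_count > in_sync_count:
        return "quorum"      # strict majority may detach the rest by vote
    if 2 * communicating_count == in_sync_count:
        return "mediation"   # even split must race at the mediator
    return "offline"         # a minority can never detach the majority


# Three-to-two-to-one sequence of single failures.
steps = [detach_strategy(3, 2), detach_strategy(2, 1)]
```

Because the in-sync membership is durable state that shrinks after each successful detach, the same function is applied again to the smaller membership on the next failure.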

[0349] In some examples, the mediation service (1500), or an external pod membership manager, may be located in failure zones different from the failure zones of the synchronously replicated storage systems (1514, 1524). For example, with a two-storage-system pod (1501), if the two storage systems (1514, 1524) are separated into distinct failure zones by, for example, physical location (one in a city and the other on the city's outskirts, or one in a data center connected to one power grid or Internet access point and the other in another data center connected to a different power grid or Internet access point), then it is generally preferable for the mediation service to be in some failure zone other than those of the two storage systems. As one example, the mediation service (1500) might be in a different part of the city's extended metropolitan area, or connected to a different power grid or Internet access point. However, synchronously replicated storage systems may also be within the same data center to provide better storage reliability, in which case network, power, and cooling zones may be taken into account.

[0350] The example method shown in FIG. 15 includes requesting (1502), by a first storage system (1514) in response to detecting a trigger event, mediation from a mediation service (1500). In this example, the trigger event may be a communication failure on the data communications link (1516) between the first storage system (1514) and the second storage system (1524), and the failure may be detected based on a hardware fault, based on a failure to acknowledge receipt of a transmission, based on failed retry efforts, or in some other way. In other cases, the trigger event may be the expiration of a synchronous replication lease, and requesting mediation may be part of attempting to coordinate synchronizing the connection and resuming an active lease. Such a lease may initially be established, in a variety of ways, according to timing information for at least one of the multiple storage systems. For example, the storage systems may establish a synchronous replication lease by each using timing information to coordinate or exchange clocks. In such an example, once the clocks have been coordinated for each of the storage systems, the storage systems may establish a synchronous replication lease that extends for a predetermined period beyond the coordinated or exchanged clock value. For example, if the clocks for each storage system are coordinated at time X, the storage systems may each be configured to establish a synchronous replication lease that is valid until X+2 seconds. Additional description of coordinating or exchanging clocks may be found in U.S. Provisional Patent Application No. 62/518,071, which is incorporated herein by reference in its entirety.
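The lease arithmetic described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `ReplicationLease`, the helper `needs_mediation`, and the two-second duration (taken from the X+2 example) are assumptions introduced here.

```python
from dataclasses import dataclass

LEASE_DURATION_S = 2.0  # assumption: mirrors the "X + 2 seconds" example


@dataclass(frozen=True)
class ReplicationLease:
    """Hypothetical lease anchored at a coordinated clock value X."""

    coordinated_at: float  # clock value X agreed between the storage systems
    duration: float = LEASE_DURATION_S

    @property
    def expires_at(self) -> float:
        return self.coordinated_at + self.duration

    def is_valid(self, now: float) -> bool:
        # The lease holds only while the local clock has not reached X + duration.
        return now < self.expires_at

    def renew(self, new_coordinated_at: float) -> "ReplicationLease":
        # Renewal re-anchors the lease at a newly coordinated clock value.
        return ReplicationLease(new_coordinated_at, self.duration)


def needs_mediation(lease: ReplicationLease, now: float) -> bool:
    """Lease expiration is one possible trigger event for requesting mediation."""
    return not lease.is_valid(now)
```

In this sketch a storage system would check `needs_mediation` before servicing requests and, on expiry, either renew the lease with its peers or fall back to the mediation request.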

[0351] Furthermore, requesting (1502) mediation from the mediation service (1500) by the first storage system (1514) in response to detecting the trigger event may be implemented by a controller of the first storage system (1514) detecting the trigger event and sending a request (1560) to the mediation service (1500) over the network (1554). In some examples, the mediation service (1500) may be a third-party service that provides, to multiple computer systems, mutually exclusive access to a resource, such as a particular database entry for storing a value. For example, the mediation service (1500) may be provided by a database service offered by a cloud service provider, by a host computer that issues requests to modify a database, or by some third-party service that provides mutually exclusive access to a resource, where the resource may be storage, a state machine, or some other type of resource capable of indicating a particular modification based on a request from a particular client. In this example, after sending the request (1560) for mediation, the first storage system (1514) waits (1503A) for an indication from the mediation service (1500) of a positive mediation result (1503B), or of a negative mediation result or lack of response (1503C). If the first storage system (1514) receives a negative mediation result or no mediation result (1503C), and a threshold amount of time to wait has not been exceeded, the first storage system (1514) may continue to wait (1506). However, if the waiting time exceeds the threshold amount, the first storage system (1514) may proceed (1506) by determining that another computer system has won mediation and taking itself offline. In some examples, as described above, the request for mediation may be received by the mediation service (1500) as an atomic compare-and-set operation that attempts to set the value of a shared resource (1552), which may also be the target of a compare-and-set operation received from another of the storage systems maintaining the pod (1501); the storage system that successfully sets the shared resource (1552) wins mediation.
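The compare-and-set race and the bounded wait described above can be sketched as follows. This is a hedged illustration under stated assumptions: `SharedResource` is an in-memory stand-in for the shared resource (1552), and `request_mediation`, `mediate`, and the `"serve"`/`"offline"` return values are hypothetical names, not part of any real mediation service API.

```python
import threading
import time
from typing import Callable, Optional


class SharedResource:
    """In-memory stand-in for the shared resource (1552); hypothetical."""

    def __init__(self) -> None:
        self._value: Optional[str] = None
        self._lock = threading.Lock()

    def compare_and_set(self, expected: Optional[str], new: str) -> bool:
        # Atomically set the value only if it still equals `expected`.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def get(self) -> Optional[str]:
        with self._lock:
            return self._value


def request_mediation(resource: SharedResource, system_id: str) -> bool:
    """Return True if this storage system set the resource first (won mediation)."""
    return resource.compare_and_set(None, system_id)


def mediate(attempt: Callable[[], bool], threshold_s: float, poll_s: float = 0.1,
            clock: Callable[[], float] = time.monotonic,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry until a positive result arrives or the wait threshold is exceeded."""
    deadline = clock() + threshold_s
    while True:
        if attempt():
            return "serve"    # positive result: keep servicing the dataset
        if clock() >= deadline:
            return "offline"  # assume another system won; take this one offline
        sleep(poll_s)
```

A real deployment would replace `SharedResource` with a conditional write against an external service; the point of the sketch is only that exactly one contender can move the value away from its initial state.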

[0352] The example of FIG. 15 also includes the second storage system (1524) requesting (1510) mediation from the mediation service (1500) in response to detecting the trigger event. Requesting (1510) mediation from the mediation service (1500) in response to detecting the trigger event may be implemented similarly to requesting (1502) mediation by the first storage system (1514) in response to the trigger event. However, in this example, in response to sending the request (1562) to the mediation service, the second storage system (1524) may receive, in contrast to the mediation success of the first storage system (1514), a failure message or some other indication that the request (1562) for mediation was not successful.

[0353] The example method of FIG. 15 continues, in the event that an indication (1564) of a positive mediation result is received by the first computer system (1514), with the first computer system (1514), instead of the second storage system (1524), processing (1504), in response to the indication (1564) of the positive mediation result from the mediation service (1500), data storage requests directed to the dataset (1512) that is synchronously replicated across the first storage system (1514) and the second storage system (1524). Synchronous replication of the dataset (1512) implementing the pod (1501), in addition to receiving and processing data storage requests directed to the dataset (1512), may be implemented as described with reference to FIGS. 8A and 8B of U.S. Provisional Patent Application Nos. 62/470,172 and 62/518,071, which are incorporated herein by reference in their entireties. In this example, as described above with reference to FIG. 15, in response to the indication (1564) of a positive mediation result, the first storage system (1514) may be considered the storage system that wins mediation, and the first storage system (1514) may detach the storage system with which communication was lost. However, in other examples, mediation may be implemented according to any of the other described methods of mediation, or combinations of methods of mediation.

[0354] In some examples, defining a preference for which storage system wins mediation among the multiple storage systems synchronously replicating the dataset (1512) may be implemented by specifying a delay value for each of the multiple storage systems. For example, if the first storage system (1514) is designated as the preferred storage system, then the first storage system (1514) may be assigned a delay value of zero (0) to apply before making a request for mediation from the mediation service. For a non-preferred storage system, however, the delay value may be assigned to be greater than zero, such as three seconds or some other value that would generally result in the preferred storage system winning mediation when the only problem is a loss of communication between the synchronously replicated storage systems.
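The effect of per-system delay values can be illustrated with a small deterministic model. This is a sketch under assumptions: the function `mediation_winner`, the system names, and the rule that the earliest request wins (with ties broken by name) are introduced here for illustration only.

```python
from typing import Dict, Optional, Set


def mediation_winner(delays: Dict[str, float], able_to_request: Set[str]) -> Optional[str]:
    """Return the system whose mediation request would arrive first.

    `delays` maps each storage system to the delay it applies before
    requesting mediation; `able_to_request` holds the systems that are still
    running and can reach the mediation service. Ties are broken by name,
    which is an assumption of this sketch.
    """
    candidates = {s: d for s, d in delays.items() if s in able_to_request}
    if not candidates:
        return None
    return min(candidates, key=lambda s: (candidates[s], s))


# Preferred system gets delay 0; the non-preferred system waits 3 seconds.
DELAYS = {"system-1514": 0.0, "system-1524": 3.0}
```

With both systems running, the zero-delay system wins; if the preferred system has actually failed, the non-preferred system still wins after its delay elapses.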

[0355] For further explanation, FIG. 16 sets forth a flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 16 is similar to the example method shown in FIG. 15 in that it also includes requesting (1502), by a first storage system (1514) in response to detecting a trigger event, mediation from a mediation service (1500), and processing (1504), by the first computer system (1514) instead of the second storage system (1524) and in response to an indication (1564) of a positive mediation result from the mediation service (1500), data storage requests directed to the dataset (1512) that is synchronously replicated across the first storage system (1514) and the second storage system (1524).

[0356] However, the example method shown in FIG. 16 further includes, in response to the indication (1564) of a positive mediation result, detaching (1602) the second storage system (1524) from the plurality of storage systems (1514, 1524) synchronously replicating the dataset (1512). Detaching (1602) another storage system may be implemented at the storage system that receives the indication of a positive mediation result from the mediation service (1500) by removing the storage system (1524) that is no longer communicating from the in-sync list of storage systems replicating the dataset (1512). Because of that removal, the storage system (1514) that wins mediation does not attempt to synchronize the detached storage system for subsequently received requests to modify the dataset. In this example there are two storage systems (1514, 1524), but other examples may involve other quantities of storage systems.
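The detach step can be sketched as a simple in-sync list update. This is an illustrative model only: the `Pod` class, its attribute names, and the use of reference numerals as system identifiers are assumptions made here.

```python
from typing import Iterable, List


class Pod:
    """Hypothetical view of a pod's membership as seen by one storage system."""

    def __init__(self, members: Iterable[str]) -> None:
        self.in_sync: List[str] = list(members)
        self.detached: List[str] = []

    def detach(self, system_id: str) -> None:
        # Remove a non-communicating member from the in-sync list.
        if system_id in self.in_sync:
            self.in_sync.remove(system_id)
            self.detached.append(system_id)

    def replication_targets(self, local_id: str) -> List[str]:
        # Later modifying requests are forwarded only to in-sync peers,
        # so a detached system is no longer a synchronization target.
        return [s for s in self.in_sync if s != local_id]
```

After the mediation winner calls `detach`, subsequent writes complete without waiting on the detached peer.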

[0357] For further explanation, FIG. 17 sets forth a flowchart illustrating an example method for mediating between storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 17 is similar to the example method shown in FIG. 15 in that it also includes requesting (1502), by a first storage system (1514) in response to detecting a trigger event, mediation from a mediation service (1500), where in this example the trigger event is a communication failure.

[0358] However, the example method shown in FIG. 17 differs from the example method shown in FIG. 15 in that it does not include any activity or action performed by the second storage system (1524). This difference accommodates the case in which one storage system among the multiple storage systems fails or otherwise becomes unresponsive, and allows one or more other storage systems to request mediation from the mediation service (1500) in order to continue servicing data storage requests directed to the synchronously replicated dataset (1512).

[0359] The example method shown in FIG. 17 includes detecting (1702) a communication failure between the first storage system (1514) and the second storage system (1524), where the first storage system (1514) and the second storage system (1524) are among the storage systems that synchronously replicate the dataset (1512). Detecting (1702) the communication failure may be implemented as described above with reference to FIG. 15.

[0360] The example method shown in FIG. 17 also includes, in response to the indication (1564) of a positive mediation result, detaching (1704) the second storage system (1524) from the plurality of storage systems (1514, 1524) that synchronously replicate the dataset (1512). Detaching (1704) the second storage system (1524) may be implemented similarly to detaching (1602) the second storage system (1524) as described with reference to FIG. 16.

[0361] For further explanation, FIG. 18 sets forth a flowchart illustrating an example method for recovery of storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. Although the example method shown in FIG. 18 depicts an embodiment in which the dataset (1812) is synchronously replicated across only three storage systems (1814, 1824, 1828), each of which may be independently coupled to the others via one or more data communications links (1816, 1818, 1820), the example shown in FIG. 18 can be extended to embodiments in which the dataset (1812) is synchronously replicated across additional storage systems.

[0362] The multiple storage systems (1814, 1824, 1828) synchronously replicating the dataset (1812) may communicate with one another during normal operation in order to receive and process requests (1804) from a host (1802) computing device. However, in some instances, one or more of the storage systems (1814, 1824, 1828) may fail, restart, upgrade, or otherwise become unavailable. Recovery in this situation is the process of making the in-sync pod member storage systems consistent after a failure or some other service outage has interrupted at least one of the in-sync storage systems and possibly caused it to lose the context of in-flight operations. A "pod," as the term is used here and throughout the remainder of this application, may be embodied as a management entity that represents a dataset, a set of managed objects and management operations, a set of access operations to modify or read the dataset, and a plurality of storage systems. Such management operations may modify or query managed objects equivalently through any of the storage systems, and access operations to read or modify the dataset operate equivalently through any of the storage systems. Each storage system may store a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by the storage system, and operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management objects to query the pod or subsequent access operations to read the dataset. Additional details regarding a "pod" may be found in previously filed U.S. Provisional Patent Application No. 62/518,071, which is incorporated herein by reference. In this example, only three storage systems (1814, 1824, 1828) are shown, but in general any number of storage systems may be part of the in-sync list synchronously replicating the dataset (1812).

[0363] If any one or more storage systems that are members of a pod are interrupted, then any remaining storage systems, or any storage systems that resume operation earlier, may either detach them (so that they are no longer in sync) or wait for them and participate in recovery actions to ensure consistency before proceeding. If the outage is short enough and recovery is quick enough, then operating systems and applications that are external to the storage systems, or that run on a storage system that has not faulted in a way that brings down the applications themselves, may experience a temporary delay in storage operation processing but may not experience a service outage. SCSI and other storage protocols support retries, including retries to alternate target storage interfaces, when operations are lost because of a temporary storage controller or interface target controller outage; SCSI in particular supports a BUSY status that requests initiator retries, which could be used while the storage controllers participate in recovery.

[0364] In general, one goal of recovery is to handle any inconsistencies arising from an unexpected disruption of in-progress distributed operations, and to resolve those inconsistencies by making the in-sync pod member storage systems sufficiently identical. At that point, providing pod services can safely resume. Sufficiently identical includes at least the content stored in the pod and, in other cases, may include the state of persistent reservations. Sufficiently identical may also include ensuring that snapshots are either consistent (that is, correct with respect to completed, concurrent, or more recently received modifying operations) or consistently deleted. Depending on the implementation, there may be other metadata that should be made consistent. If there is metadata used to track or optimize the transfer of content from a replication source to an asynchronous or snapshot-based replication target, then that metadata may need to be made consistent to allow the replication source to switch smoothly from one member storage system of the pod to another. The existence and properties of volumes may also need to be recovered, and possibly definitions related to applications or initiating host systems as well. Many of these properties may be recovered using standard database transaction recovery techniques, depending on how they are implemented.

[0365] In some examples, beyond ensuring that management metadata is sufficiently identical across the storage systems that implement modifying operations on the content of block-based storage, recovery must ensure that those modifications are consistently applied or discarded across the pod with proper regard for block storage semantics (ordering, concurrency, consistency, and atomicity of operations such as COMPARE AND WRITE and XDWRITEREAD). Fundamentally, this implementation relies on being able to determine during recovery which operations may have been applied to at least one in-sync storage system for the pod but may not have been applied to all other in-sync storage systems for the pod, and then either applying them everywhere or undoing them everywhere. Either action yields consistency, and there is no inherent reason why the choice must be the same for every operation. Undoing may be permitted if at least one in-sync storage system for the pod did not apply the operation. In general, it is often simpler to reason about applying all updates that are found on any in-sync storage system for the pod but not on all of them, rather than undoing some or all updates that are on one or more in-sync storage systems for the pod. To do this efficiently, knowing what was applied on some systems but possibly not on others generally requires the storage systems to record something beyond the raw data (otherwise, all of the data might have to be compared, which would take an inordinate amount of time). Described below are additional details of implementations for recording such information that may enable storage system recovery.
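The "apply everywhere" strategy can be sketched as a set computation over per-system operation records. This is a minimal illustration under assumptions: each system is modeled as a dictionary from a hypothetical operation identifier to its payload, and the same identifier is assumed to carry the same payload on every system.

```python
from typing import Dict

OperationLog = Dict[int, bytes]  # hypothetical: operation id -> recorded payload


def missing_operations(logs: Dict[str, OperationLog]) -> Dict[str, OperationLog]:
    """For each in-sync system, list operations seen elsewhere but not locally."""
    union: OperationLog = {}
    for ops in logs.values():
        union.update(ops)
    return {
        system: {op: data for op, data in union.items() if op not in ops}
        for system, ops in logs.items()
    }


def apply_everywhere(logs: Dict[str, OperationLog]) -> None:
    """Bring every system's record up to the union of all recorded operations."""
    for system, missing in missing_operations(logs).items():
        logs[system].update(missing)
```

Because only recorded operations are compared, recovery work scales with the number of in-flight operations rather than with the size of the dataset.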

[0366] Two examples of persistently tracking information to ensure consistency are (1) identifying that the content of volumes might differ across the in-sync storage systems for a pod, and (2) identifying a collection of operations that might not have been applied universally across all in-sync systems for the pod. The first example is the traditional model for mirroring: keep a tracking map of the logical regions that are being written (often as a list, or as a bitmap covering the volume's logical space at some granularity), and use that map during recovery to note which regions might differ between one copy and another. The tracking map is written to some or all mirrors (or is written separately) before or while the volume data is written, so that the recovered tracking map is guaranteed to cover any volume regions that were in flux at the time of the failure. Recovery in this first variant generally consists of copying content from one copy to another so that the copies are the same.
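The tracking-map approach from the first example can be sketched with a region bitmap. This is an in-memory illustration only: `DirtyRegionMap`, `resync`, and the region size are assumptions, and a real system would persist the map before or during the data write, as the paragraph above requires.

```python
from typing import List, Tuple


class DirtyRegionMap:
    """Hypothetical bitmap covering a volume's logical space at a fixed granularity."""

    def __init__(self, volume_size: int, region_size: int) -> None:
        self.region_size = region_size
        self.bits: List[bool] = [False] * ((volume_size + region_size - 1) // region_size)

    def mark_write(self, offset: int, length: int) -> None:
        # Mark every region touched by the write as possibly divergent.
        if length <= 0:
            return
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for index in range(first, last + 1):
            self.bits[index] = True

    def dirty_ranges(self) -> List[Tuple[int, int]]:
        # Return (offset, length) for each region that may differ between copies.
        return [(i * self.region_size, self.region_size)
                for i, dirty in enumerate(self.bits) if dirty]


def resync(source: bytearray, target: bytearray, dirty: DirtyRegionMap) -> None:
    """Copy only the possibly divergent regions from one copy to the other."""
    for offset, length in dirty.dirty_ranges():
        target[offset:offset + length] = source[offset:offset + length]
```

The granularity trades map size against the amount of data copied during recovery: coarser regions mean a smaller map but more copying.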

[0367] The second example of persistent tracking, which is based on tracking operations, may be useful in storage systems that support synchronously replicating virtual copies of large volume ranges within and between volumes in a pod, because that case can be more difficult or expensive to track simply as potential differences in volume content between the synchronously replicated storage systems (but see the section describing tracking and recovery in content-addressable storage systems). Simple content tracking may also work less well in storage systems where synchronous replication must track more complex information, such as a content tracking graph with extents and larger-granularity identifiers that drive a form of asynchronous replication, and where the asynchronous replication source may be migrated or faulted over from one in-sync storage system in the pod to another. When operations are tracked instead of content, recovery involves identifying the operations that may not have completed everywhere. Once such operations are identified, any ordering consistency issues should be resolved, just as they would be during normal runtime, using techniques such as leader-defined ordering, predicates, or interlock exceptions. Interlock exceptions are described below. As for predicates, the relationships between operations and common metadata updates may be described as a set of interdependencies between separate modifying operations; these interdependencies may be expressed as a set of precursors on which an operation depends in some way, and the set of precursors may be viewed as predicates that must be true for the operation to complete. To continue with this example, once the operations have been identified, they may then be reapplied. The information recorded for an operation should include any metadata changes that must be consistent across the pod member storage systems, and this recorded information may then be copied and applied. Further, predicates that are used to propagate restrictions on concurrency between leaders and followers may not need to be preserved if those predicates drive the order in which the storage systems persist information, because the persisted information then implies the various plausible outcomes.

[0368] As described more thoroughly in U.S. Provisional Patent Application Nos. 62/470,172 and 62/518,071, which are incorporated herein by reference in their entireties, a set of in-sync storage systems may implement a symmetric I/O model to provide data consistency. In the symmetric I/O model, multiple storage systems may maintain a dataset within a pod, and a member storage system that receives an I/O operation may process the I/O operation locally concurrently with the processing of the I/O operation on all of the other storage systems in the pod, where the receiving storage system may initiate the processing of the I/O operation on the other storage systems. However, in some cases, multiple storage systems may receive independent I/O operations that write to overlapping memory regions. For example, if a first write arrives at a first storage system, the first storage system may begin persisting the first write locally while also sending it to a second storage system. At about the same time, a second write to a volume region that overlaps the first write may be received at the second storage system, which begins persisting the second write locally while also sending it to the first storage system. In this situation, at some point, the first storage system, the second storage system, or both may notice that there is a concurrent overlap. Further, in this situation, the first write cannot complete on the first storage system until the second storage system has persisted the first write and responded with a success indication and the first storage system has itself successfully persisted the first write; the second storage system is in a similar situation with respect to the second write. Because both storage systems have access to both the first write and the second write, either storage system may detect the concurrent overlap, and when a storage system detects the concurrent overlap, it may trigger an exception, referred to herein as an "interlock exception." When the situation is extended to additional storage systems, one solution has the two, or possibly more, storage systems involved in the interlock exception reach agreement on which write operation takes precedence.
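Detection of a concurrent overlap can be sketched as an interval check against the writes a storage system currently has in flight. This is an illustrative model under assumptions: the `Write` record, the `InFlightTable`, and the choice to raise only for writes that originated on different systems are introduced here and are not taken from the referenced applications. How the involved systems then agree on precedence is outside the sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Write:
    write_id: str
    origin: str   # storage system that first received the write
    offset: int   # starting volume address
    length: int


class InterlockException(Exception):
    """Raised when two in-flight writes from different systems overlap."""


def overlaps(a: Write, b: Write) -> bool:
    # Half-open ranges [offset, offset + length) intersect.
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


class InFlightTable:
    """Hypothetical per-system table of writes not yet completed everywhere."""

    def __init__(self) -> None:
        self._writes: Dict[str, Write] = {}

    def begin(self, write: Write) -> None:
        for other in self._writes.values():
            if other.origin != write.origin and overlaps(write, other):
                raise InterlockException(f"{write.write_id} overlaps {other.write_id}")
        self._writes[write.write_id] = write

    def complete(self, write_id: str) -> None:
        self._writes.pop(write_id, None)
```

Either system can run the same check, since each sees both its locally received write and the write forwarded by its peer.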

[0369] In another example, as in the case of overlapping write requests, write-type requests (e.g., WRITE requests, WRITE SAME requests, UNMAP requests, or combinations thereof) that overlapped in time and in volume address range at the time of the event that interrupted replication and ultimately led to recovery may have completed inconsistently between in-sync storage systems. How this situation is handled may depend on the implementation of the I/O path during normal operation. This example considers a first write and a second write that overlapped in time, each of which was received by one storage system or another for the pod before either was signaled as completed. The example extends readily to more than two writes, by considering them two at a time in sequence, and to more than two storage systems, by considering that the first and second writes may have completed on more than one storage system and that first, second, and third (or further) writes may have completed inconsistently across three or more storage systems. In a symmetric I/O-based storage system implementation based on interlock exceptions, the first write alone may have completed on one storage system while the second of the two overlapping writes alone may have completed on a second storage system. This case can be detected by noting that the ranges of the two writes overlap and that each storage system lacks the other overlapping write. If the two writes overlap completely (one completely covers the other), one of the two writes may simply be copied to the other storage system and applied to replace that storage system's content for that volume address range. If the writes overlap only partially, the overlapping content may be copied from one storage system to the other and applied, while the non-overlapping portions are copied between the storage systems so that the content is made uniform and up to date on both. In a leader-based system with predicates, or some other means for the leader to declare that one write precedes another, the storage systems performing the writes may persist one over the other, or may persist the two together. Alternatively, an implementation may persist the two writes separately and out of order, with the ordering predicates used only to control completion signaling. If an implementation allows out-of-order write processing, the preceding example explains how consistency can be recovered. If the storage systems enforce ordering of persistence during normal operation, recovery may see only the first write on the first storage system but both the first and second writes on the second storage system. In that case, the second write can be copied from the second storage system to the first storage system as part of recovery.
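
The partial-overlap case can be sketched as follows, under the assumption that each volume is modeled as a bytearray, that write 1 landed only on the first system and write 2 only on the second, and that the two systems have already agreed which side supplies the overlapping content. The function name reconcile and its parameters are hypothetical.

```python
def reconcile(vol_a, vol_b, range_1, range_2, overlap_source="b"):
    """Make two volume images identical after inconsistent overlapping writes.

    range_1 is the half-open (start, end) range written only on vol_a;
    range_2 is the half-open (start, end) range written only on vol_b.
    overlap_source selects which image supplies the overlapping content.
    """
    lo = max(range_1[0], range_2[0])
    hi = min(range_1[1], range_2[1])

    # Non-overlapping part of write 1 is copied from A to B.
    for addr in range(range_1[0], range_1[1]):
        if not (lo <= addr < hi):
            vol_b[addr] = vol_a[addr]

    # Non-overlapping part of write 2 is copied from B to A.
    for addr in range(range_2[0], range_2[1]):
        if not (lo <= addr < hi):
            vol_a[addr] = vol_b[addr]

    # Overlapping part is copied from the agreed source to the other image.
    if lo < hi:
        if overlap_source == "b":
            vol_a[lo:hi] = vol_b[lo:hi]
        else:
            vol_b[lo:hi] = vol_a[lo:hi]
```

When one write completely covers the other, the same routine degenerates to copying the chosen write's content over the shared range plus whatever the covering write touched outside it.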

[0370] In another example, snapshots may also be recovered. In some cases, such as a snapshot concurrent with modifications where the leader decided that some of the modifications should be included in the snapshot and others should not, the recorded information may include whether a particular write should be included in the snapshot. In that model, it may not be necessary to ensure that everything the leader decided to include in the snapshot ends up included in the snapshot after recovery. If one in-sync storage system for the pod recorded the existence of the snapshot, and no in-sync storage system for the pod recorded a write that was ordered for inclusion in the snapshot, then uniformly applying the snapshot without that write still yields snapshot content that is fully consistent across all in-sync storage systems for the pod. Such a discrepancy should arise only for a concurrent write and snapshot that were never signaled as completed, so no inclusion guarantee is owed. That is, the leader's assignment and ordering of predicates may be needed only for runtime consistency rather than for recovery-order consistency. If recovery identifies a write for inclusion in a snapshot but does not locate the write, the snapshot operation itself may safely ignore the snapshot, depending on the implementation. The same argument applies to virtual copies of volume address ranges made through SCSI EXTENDED COPY and similar operations: the leader defines which writes to the source address range logically precede the copy and which writes to the target address range logically precede or follow the address range copy. During recovery, however, the reasoning is the same as for snapshots: recovery of a write concurrent with a volume range copy may miss either the concurrent write or the volume range copy, provided the result is consistent across the in-sync storage systems for the pod, does not roll back a modification that completed everywhere, and does not reverse a modification that a dataset consumer may have read.

[0371] Continuing with this example of snapshot recovery, if any storage system applied the write for a COMPARE AND WRITE, the comparison must have succeeded on one in-sync storage system for the pod, and runtime consistency implies that the comparison should have succeeded on all in-sync storage systems for the pod. Therefore, if any such storage system applied the write, the write may be copied to, and applied on, any other in-sync storage system for the pod that had not applied it before recovery. Further, recovery of an XDWRITEREAD or XPWRITE request (or a similar arithmetic transformation between existing data and new data) may operate either by delivering the result of the transformation after reading that result from one storage system, or by delivering the transforming write together with its data to the other storage systems, provided it can be ensured that any ordered data preceding the transforming write is consistent across the in-sync storage systems for the pod and it can be reliably determined which of those storage systems had not yet applied the transforming write.

[0372] As another example, recovery of metadata may be implemented. In this case, recovery should also produce a consistent recovery of metadata across the in-sync storage systems for the pod, since that metadata is expected to be consistent across the pod. Provided this metadata is included with the operations, it can be applied together with the content updates those operations describe. How it is merged with existing metadata depends on the metadata and on the implementation. Longer-term change tracking information for driving asynchronous replication can often be merged quite simply, as nearby or otherwise related modifications are identified.

[0373] As another example, recording recent activity for operation tracking may be implemented in a variety of ways to identify the operations that were in progress on the in-sync storage systems in a pod at the time of the failure or other type of service interruption that led to recovery. For example, one model records recovery information for modifications on each in-sync storage system in the pod, either atomically with any modification (which can work well when updates are staged through a fast journaling device) or by recording information about the operations that will be in progress before they occur. The recorded recovery information may include a logical operation identifier, for example one based on the original request or on some identifier assigned by the leader as part of describing the operation, together with whatever level of operation description recovery needs in order to operate. The information a storage system records for a write that is to be included in the content of a concurrent snapshot should indicate that the write is included in the snapshot as well as in the content of the volume to which the write applies. In some storage system implementations, the content of a snapshot is automatically included in the content of the volume unless it is replaced by specific overlapping content in a newer snapshot or by specific overlapping content later written to the live volume. Two concurrent write-type requests (e.g., WRITE, WRITE SAME, or UNMAP requests, or combinations thereof) that overlap in time and in volume address may be explicitly ordered by the leader, such that the leader ensures the first write is persisted on all in-sync storage systems for the pod before the second write can be persisted by any in-sync storage system for the pod. This is a simple way to ensure that inconsistencies cannot occur, and it may be acceptable because concurrent overlapping writes to a volume are very rare. In that case, if any recovering storage system has a record of the second write, the first write must have completed everywhere and so should not require recovery. Alternatively, the leader may describe a predicate requiring the storage systems to order the first write before the second write. The storage systems may then perform both writes together, so that the two are guaranteed either both to persist or both to fail to persist. In other cases, a storage system may persist the first write and then, once the persistence of the first write is assured, the second write. COMPARE AND WRITE, XDWRITEREAD, and XPWRITE requests should be ordered such that the precursor content is identical on all storage systems at the time each performs the operation. Alternatively, one storage system may calculate the result and deliver the request to all storage systems as an ordinary write-type request. Further, with regard to making these operations recoverable, tracking which operations have completed everywhere may allow their recency to be discounted, so that recorded information that would otherwise trigger operation recovery analysis for completed operations can be either discarded or efficiently ignored by recovery.

[0374] In another example, discarding of completed operations may be implemented. One approach to clearing recorded information is to clear it across all storage systems once an operation is known to have been processed on all in-sync storage systems for the pod. This can be implemented by having the storage system that received the request and signaled completion send a message to all storage systems for the pod after completion has been signaled, allowing each storage system to discard its recorded information for that operation. Recovery then involves querying all in-sync storage systems for the pod that are involved in the recovery for all recorded operations that have not been discarded. Alternatively, these messages could be batched so that they are sent periodically (e.g., every 50 ms) or after some number of operations (e.g., every 10 to 100). Batching may substantially reduce message traffic at the cost of somewhat longer recovery times, because more operations that have in fact completed are reported as potentially incomplete. Further, in a leader-based implementation (by way of example), the leader could be made aware of which operations have completed and could send out the clear messages.
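
The trade-off between message traffic and recovery work can be illustrated with a small sketch. It assumes count-based batching only (the periodic, timer-driven variant is not modeled), and the names OperationLog, ClearBatcher, and operations_to_recover are hypothetical.

```python
class OperationLog:
    """Per-storage-system record of operations not yet known complete everywhere."""

    def __init__(self):
        self.recorded = set()

    def record(self, op_id):
        self.recorded.add(op_id)

    def clear(self, op_ids):
        self.recorded.difference_update(op_ids)


class ClearBatcher:
    """Collects completed operation ids and clears them in batches."""

    def __init__(self, logs, batch_size=10):
        self.logs = logs
        self.batch_size = batch_size
        self.pending = []
        self.messages_sent = 0

    def completed(self, op_id):
        """Note that an operation completed everywhere; flush when the batch is full."""
        self.pending.append(op_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send one clear message per storage system covering the whole batch."""
        if not self.pending:
            return
        for log in self.logs:
            log.clear(self.pending)
            self.messages_sent += 1
        self.pending = []


def operations_to_recover(logs):
    """Union of all operations still recorded on any storage system."""
    return set().union(*(log.recorded for log in logs))
```

An operation that has completed but whose clear message is still waiting in a batch shows up in operations_to_recover, which is the extra recovery work the paragraph above refers to.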

[0375] In another example, a sliding window may be implemented. Such an example may work well in an implementation based on a leader and followers, where the leader may assign sequence numbers to operations or to collections of operations. In this way, in response to determining that all operations up to some sequence number have completed, the leader may send all in-sync storage systems a message indicating that all operations up to that sequence number have completed. The sequence number may also be an arbitrary number, such that when all operations associated with a given number have completed, a message is sent indicating that all of those operations have completed. With a sequence-number-based model, recovery would query for all operations on any in-sync storage system that are associated with a sequence number greater than the last completed sequence number. In a symmetric implementation with no leader, each storage system that receives requests for the pod could define its own sliding window and sliding window identity space. In that case, recovery may include querying for all operations on any in-sync storage system that are associated with any sliding window identity space and whose sliding window identity follows the last completed identity in that space, where an identity counts as completed only if the operations for all preceding identities have also completed.
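
A minimal sketch of the leader-based variant follows. It assumes consecutive integer sequence numbers starting at 1 and a single leader; SlidingWindowLeader, watermark, and recovery_candidates are hypothetical names introduced only for illustration.

```python
class SlidingWindowLeader:
    """Assigns sequence numbers and tracks the highest fully completed prefix."""

    def __init__(self):
        self.next_seq = 1
        self.completed = set()  # completed numbers above the watermark
        self.watermark = 0      # all operations <= watermark have completed

    def assign(self):
        """Give the next operation (or collection of operations) a number."""
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def complete(self, seq):
        """Record a completion and advance the watermark over any closed gap."""
        self.completed.add(seq)
        while self.watermark + 1 in self.completed:
            self.watermark += 1
            self.completed.remove(self.watermark)
        return self.watermark


def recovery_candidates(recorded_seqs, watermark):
    """Operations recovery must examine: those numbered above the watermark."""
    return sorted(s for s in recorded_seqs if s > watermark)
```

The watermark returned by complete is the value the leader would announce to the in-sync storage systems; out-of-order completions are held back until the gap below them closes.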

[0376] In another example, checkpoints may be implemented. In a checkpoint model, the leader may insert a special operation that depends on the completion of a uniform set of precursor operations and on which all subsequent operations then depend. Each storage system may then persist the checkpoint in response to all precursor operations having been persisted or completed. A subsequent checkpoint may be started some time after the previous checkpoint has been signaled as persisted on all in-sync storage systems for the pod. In this way, a subsequent checkpoint is not started until some time after all precursor operations have been persisted across the pod; otherwise, the previous checkpoint would not have completed. In this model, recovery may include querying for all operations on all in-sync storage systems that follow the second-to-last checkpoint. This could be accomplished by identifying the second-to-last checkpoint known to any in-sync storage system for the pod, or by asking each storage system to report all operations since its second-to-last persisted checkpoint. Alternatively, recovery may include searching for the last checkpoint known to have completed on all in-sync storage systems and querying for all operations that follow it on any in-sync storage system; if a checkpoint completed on all in-sync storage systems, then all operations before that checkpoint were definitely persisted everywhere.
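
The alternative of searching for the last checkpoint completed on every in-sync storage system might be sketched as below. Each system's log is assumed to be an ordered list of ("op", id) and ("ckpt", number) tuples; that representation and the function names are assumptions made for illustration.

```python
def last_common_checkpoint(logs):
    """Highest checkpoint number persisted on every storage system, or None."""
    per_system = [{entry[1] for entry in log if entry[0] == "ckpt"}
                  for log in logs]
    common = set.intersection(*per_system)
    return max(common, default=None)


def ops_after(log, ckpt):
    """Operation ids that follow the given checkpoint in one system's log."""
    start = 0 if ckpt is None else log.index(("ckpt", ckpt)) + 1
    return [entry[1] for entry in log[start:] if entry[0] == "op"]


def operations_to_examine(logs):
    """Checkpoint used as the recovery baseline and the operations after it."""
    ckpt = last_common_checkpoint(logs)
    ops = set()
    for log in logs:
        ops.update(ops_after(log, ckpt))
    return ckpt, ops
```

Operations recorded before the common checkpoint are skipped because, by construction of the model, they were persisted everywhere before that checkpoint could complete.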

[0377] In another example, recovery of pods based on replicated directed acyclic graphs of logical extents may be implemented. Before describing such an implementation, however, storage systems that use directed acyclic graphs of logical extents are first described.

[0378] A storage system may be implemented based on directed acyclic graphs comprising logical extents. In this model, logical extents may be categorized into two types: leaf logical extents, which reference some amount of stored data in some way, and composite logical extents, which reference other leaf or composite logical extents.

[0379] A leaf extent can reference data in a variety of ways. It may point directly to a single range of stored data (e.g., 64 kilobytes of data), or it may be a collection of references to stored data (e.g., a 1 megabyte "range" of content that maps some number of virtual blocks associated with the range to physically stored blocks). In the latter case, these blocks may be referenced using some identity, and some blocks within the range of the extent may not be mapped to anything. Also in the latter case, these block references need not be unique, allowing multiple mappings from virtual blocks within some number of logical extents, within and across some number of volumes, to map to the same physically stored blocks. Instead of stored block references, a logical extent could encode simple patterns. For example, a block that is a string of identical bytes could simply be encoded as a repeated pattern of identical bytes.

[0380] A composite logical extent may be a logical range of content with some virtual size, which comprises a plurality of maps, each of which maps a subrange of the composite logical extent's logical range of content to an underlying leaf or composite logical extent. Translating a request related to the content of a composite logical extent then involves taking the content range for the request within the context of the composite logical extent, determining which underlying leaf or composite logical extents that request maps to, and translating the request to apply to the appropriate ranges of content within those underlying leaf or composite logical extents.
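
The request translation just described can be sketched as a recursive walk. The classes Leaf and Composite, the children layout as (start offset, child extent) pairs, and the function resolve are hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    ident: str  # identity used to name the extent
    size: int   # size of the leaf's logical range


@dataclass
class Composite:
    ident: str
    # Each entry maps a subrange of this extent, starting at `start`,
    # to an underlying leaf or composite extent: (start, child).
    children: list = field(default_factory=list)


def size_of(ext):
    """Virtual size of an extent (end of its last mapped subrange)."""
    if isinstance(ext, Leaf):
        return ext.size
    return max((start + size_of(child) for start, child in ext.children),
               default=0)


def resolve(ext, offset, length):
    """Translate [offset, offset + length) into (leaf ident, leaf offset, length)."""
    if isinstance(ext, Leaf):
        return [(ext.ident, offset, length)]
    pieces = []
    for start, child in ext.children:
        lo = max(offset, start)
        hi = min(offset + length, start + size_of(child))
        if lo < hi:  # the request touches this child's subrange
            pieces.extend(resolve(child, lo - start, hi - lo))
    return pieces
```

A request that spans several underlying extents comes back as one piece per leaf, each expressed in that leaf's own address space; unmapped subranges simply produce no piece.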

[0381] Volumes, or files or other types of storage objects, can be described as composite logical extents. Thus, these presented storage objects (referred to simply as volumes in most of this description) can be organized using this extent model.

[0382] Depending on the implementation, leaf or composite logical extents could be referenced from a plurality of other composite logical extents, effectively allowing inexpensive duplication of larger collections of content within and across volumes. Thus, logical extents can be arranged essentially within an acyclic graph of references, each path ending in leaf logical extents. This can be used to make copies of volumes, to make snapshots of volumes, or as part of supporting virtual range copies within and between volumes as part of EXTENDED COPY or similar types of operations.

[0383] An implementation may provide each logical extent with an identity that can be used to name it. This simplifies referencing, since the references within a composite logical extent become a list of logical extent identities together with the logical subrange corresponding to each such identity. Within a logical extent, each stored data block reference may likewise be based on some identity used to name it.

[0384] To support these duplicated uses of extents, a further capability can be added: copy-on-write logical extents. When a modifying operation affects a copy-on-write leaf or composite logical extent, the logical extent is copied, with the copy being a new reference and possibly having a new identity (depending on the implementation). The copy retains all references or identities related to the underlying leaf or composite logical extents, together with whatever modifications result from the modifying operation. For example, a WRITE, WRITE SAME, XDWRITEREAD, XPWRITE, or COMPARE AND WRITE request may store new blocks in the storage system (or use deduplication techniques to identify existing stored blocks) and modify the corresponding leaf logical extents to reference or store identities for the new set of blocks, possibly replacing the references and stored identities for a previous set of blocks. Alternatively, an UNMAP request may modify a leaf logical extent to remove one or more block references. In both types of cases, a leaf logical extent is modified. If the leaf logical extent is copy-on-write, a new leaf logical extent is created, formed by copying the unaffected block references from the old extent and then replacing or removing block references based on the modifying operation.

[0385] The composite logical extent that was used to locate the leaf logical extent may then be modified to store the new leaf logical extent reference or identity associated with the copied and modified leaf logical extent as a replacement for the previous leaf logical extent. If that composite logical extent is copy-on-write, a new composite logical extent is created as a new reference or with a new identity, any unaffected references or identities to its underlying logical extents are copied to that new composite logical extent, and the previous leaf logical extent reference or identity is replaced with the new leaf logical extent reference or identity.

[0386] This process continues further backward from referenced extent to referencing composite extent, along the search path through the acyclic graph that was used to process the modifying operation, with every copy-on-write logical extent on that path being copied, modified, and replaced.

[0387] These copied leaf and composite logical extents may then drop the characteristic of being copy-on-write, so that further modifications do not result in further copies. For example, the first time some underlying logical extent within a copy-on-write "parent" composite extent is modified, that underlying logical extent is copied and modified, and the copy has a new identity that is then written into a copied and replaced instance of the parent composite logical extent. The second time some other underlying logical extent is copied and modified, however, the new identity of that other copy is written into the parent composite logical extent, which can then be modified in place, with no further copy and replace needed on behalf of references to the parent composite logical extent.
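
The copy, modify, and replace sequence of the preceding paragraphs can be sketched as path copying. For brevity this sketch walks the search path from the root downward rather than backward from the leaf, which is assumed to yield the same resulting graph; the Extent class, the slot-based children map, and the functions writable, write, and mark_cow are all hypothetical.

```python
import itertools

_ids = itertools.count(1)  # source of new extent identities


class Extent:
    """A leaf (uses `blocks`) or composite (uses `children`) logical extent."""

    def __init__(self, children=None, blocks=None, cow=False):
        self.ident = next(_ids)
        self.children = dict(children or {})  # slot -> Extent
        self.blocks = dict(blocks or {})      # block index -> data
        self.cow = cow


def writable(ext):
    """Return ext if it may be modified in place, else a non-COW copy.

    The copy gets a new identity and keeps every existing reference;
    shared children stay copy-on-write until they are modified themselves.
    """
    if not ext.cow:
        return ext
    return Extent(children=ext.children, blocks=ext.blocks, cow=False)


def write(root, path, block, data):
    """Write `data` at `block` in the leaf reached through `path`.

    Every copy-on-write extent on the path is copied and the reference
    to it replaced. Returns the (possibly new) root extent.
    """
    root = writable(root)
    node = root
    for slot in path:
        child = writable(node.children[slot])
        node.children[slot] = child  # replace the reference in the parent
        node = child
    node.blocks[block] = data
    return root


def mark_cow(ext):
    """Mark an extent and everything reachable from it copy-on-write."""
    ext.cow = True
    for child in ext.children.values():
        mark_cow(child)
```

Holding on to the original root after mark_cow behaves like a read-only view of the earlier content: the first write copies the root and the affected leaf, while a later write to a different leaf copies only that leaf and updates the already-copied root in place.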

[0388] A modifying operation on a new region of a volume or of a composite logical extent for which there is no current leaf logical extent may create a new leaf logical extent to store the results of the modification. If that new logical extent is to be referenced from an existing copy-on-write composite logical extent, that existing copy-on-write composite logical extent is modified to reference the new logical extent, resulting in another sequence of copy, modify, and replace operations similar to the sequence for modifying an existing leaf logical extent.

［0389］親複合論理エクステントが（実施態様に基づいて）新しい修正動作のために作成するための新しいリーフ論理エクステントを含む関連付けられたアドレス範囲をカバーするほど十分に大きく成長できない場合、次いで親複合論理エクステントは、さらに再び新しい参照又は新しいアイデンティティである単一の「孫」複合論理エクステントから次いで参照される２つ以上の新しい複合論理エクステントにコピーされてよい。その孫論理エクステントがそれ自体、コピーオンライトである別の複合論理エクステントを通して見つけられる場合、次いでその別の複合論理エクステントは、上述の段落で説明されるのと類似した方法でコピーされ、修正され、置換される。このコピーオンライトモデルは、論理エクステントのこれらの有向非巡回グラフに基づいて、ストレージシステム実施態様の中でスナップショット、ボリュームコピー、及び仮想ボリュームアドレス範囲コピーを実装することの一部として使用できる。それ以外の場合書込み可能なボリュームの読取り専用コピーとしてスナップショットを作成するために、ボリュームと関連付けられた論理エクステントのグラフは、マークされたコピーオンライトであり、元の複合論理エクステントに対する参照はスナップショットによって保持される。ボリュームに対する修正動作は、次いで必要に応じて論理エクステントコピーを作成し、それらの修正動作の結果及び元のコンテンツを保持するスナップショットを記憶するボリュームを生じさせる。ボリュームコピーは、元のボリュームとコピーされたボリュームの両方ともコンテンツを修正し、独自のコピーされた論理エクステントグラフ及びサブグラフを生じさせることを除き、類似している。 [0389] If a parent composite logical extent cannot grow large enough (based on the implementation) to cover the associated address range, including the new leaf logical extent to create for the new modification operation, then the parent composite logical extent may be copied to two or more new composite logical extents that are in turn referenced from a single "grandchild" composite logical extent that is itself a new reference or new identity. If that grandchild logical extent is found through another composite logical extent that is itself copy-on-write, then that other composite logical extent is copied, modified, and replaced in a manner similar to that described in the preceding paragraph. This copy-on-write model can be used as part of implementing snapshots, volume copies, and virtual volume address range copies in storage system implementations based on these directed acyclic graphs of logical extents. To create a snapshot as a read-only copy of an otherwise writable volume, the graph of logical extents associated with the volume is marked copy-on-write, and a reference to the original composite logical extent is maintained by the snapshot. 
Modification operations on the volume then create logical extent copies as needed, resulting in a volume that stores the results of those modification operations and a snapshot that preserves the original contents. Volume copies are similar, except that both the original and copied volumes modify content, resulting in their own copied logical extent graphs and subgraphs.
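The snapshot behavior described above can be illustrated with a minimal sketch. It assumes a two-level graph (one composite extent referencing fixed-size leaf extents); the names `Leaf`, `Composite`, `snapshot`, `write`, `read`, and `LEAF_SPAN` are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

LEAF_SPAN = 4  # blocks covered by one leaf extent (hypothetical value)


@dataclass
class Leaf:
    """Leaf logical extent: block offset within the leaf -> block content."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    cow: bool = False  # True while the extent may be shared with a snapshot


@dataclass
class Composite:
    """Composite logical extent: leaf start address -> referenced leaf."""
    children: Dict[int, Leaf] = field(default_factory=dict)
    cow: bool = False


def snapshot(root: Composite) -> Composite:
    """Mark the volume's graph copy-on-write; the snapshot keeps the old root."""
    root.cow = True
    for leaf in root.children.values():
        leaf.cow = True
    return root


def write(root: Composite, block: int, data: bytes) -> Composite:
    """Apply a write and return the root the volume should use afterwards."""
    if root.cow:
        # Copy the composite extent; the copy still references the same leaves.
        root = Composite(children=dict(root.children))
    start = (block // LEAF_SPAN) * LEAF_SPAN
    leaf = root.children.get(start)
    if leaf is None:
        leaf = Leaf()  # new region: create a new leaf extent
    elif leaf.cow:
        leaf = Leaf(blocks=dict(leaf.blocks))  # copy before modifying
    leaf.blocks[block - start] = data
    root.children[start] = leaf  # replace the reference in the (copied) parent
    return root


def read(root: Composite, block: int) -> Optional[bytes]:
    start = (block // LEAF_SPAN) * LEAF_SPAN
    leaf = root.children.get(start)
    return None if leaf is None else leaf.blocks.get(block - start)
```

After `snap = snapshot(vol)` and `vol = write(vol, 0, b"new")`, the volume and the snapshot have different roots, while any leaf that was not written is still shared by both.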

［0390］仮想ボリュームアドレス範囲コピーは、（ブロックリファレンスに対する変更がコピーオンライトリーフ論理エクステントを修正しない限り、それ自体、コピーオンライト技術を使用することを伴わない）リーフ論理エクステントの中及び間でブロックリファレンスをコピーすることによって動作する場合がある。代わりに、仮想ボリュームアドアレス範囲コピーは、リーフ論理エクステント又は複合論理エクステントに対する参照を重複させる場合があり、これは、より大きなアドレス範囲のボリュームアドレス範囲コピーにはうまく機能する。さらに、これは、グラフが単にリファレンスツリーよりむしろ参照の有向非巡回グラフになることを可能にする。重複された論理エクステントリファレンスと関連付けられたコピーオンライト技術は、仮想アドレス範囲コピーのソース又はターゲットに対する修正動作が、ボリュームアドレス範囲コピー動作の直後に同じ論理エクステントを共用するターゲット又はソースに影響を及ぼすことなくそれらの修正を記憶するために新しい論理エクステントの作成を生じさせることを確実にするために使用できる。 [0390] Virtual volume address range copies may operate by copying block references within and between leaf logical extents (which does not itself involve using copy-on-write techniques unless changes to the block references modify the copy-on-write leaf logical extents). Alternatively, virtual volume address range copies may duplicate references to leaf logical extents or composite logical extents, which works well for volume address range copies of larger address ranges. Furthermore, this allows the graph to be a directed acyclic graph of references rather than simply a reference tree. Copy-on-write techniques associated with duplicated logical extent references can be used to ensure that modification operations on the source or target of a virtual address range copy result in the creation of new logical extents to store those modifications without affecting the target or source that share the same logical extent immediately after the volume address range copy operation.
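As an illustration of the second approach (duplicating references to leaf logical extents rather than copying blocks), the following sketch models each volume as a single-level mapping from leaf start address to a leaf object. The model, the leaf-aligned copy restriction, and all names (`Leaf`, `range_copy`, `write`) are simplifying assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Leaf:
    """Leaf logical extent: block offset within the leaf -> block content."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    cow: bool = False  # True once the leaf may be referenced more than once


def range_copy(src, dst, src_start, dst_start, n_leaves, span):
    """Copy n_leaves leaf-aligned extents by duplicating references only."""
    for i in range(n_leaves):
        leaf = src.get(src_start + i * span)
        if leaf is None:
            dst.pop(dst_start + i * span, None)  # source hole: target hole
            continue
        leaf.cow = True  # now shared, so later writes must copy first
        dst[dst_start + i * span] = leaf


def write(volume, block, data, span):
    """Write one block, copying a shared leaf before modifying it."""
    start = (block // span) * span
    leaf = volume.get(start)
    if leaf is None:
        leaf = Leaf()
    elif leaf.cow:
        # Conservative: the flag is never cleared, so a leaf that is no
        # longer shared is still copied once more than strictly necessary.
        leaf = Leaf(blocks=dict(leaf.blocks))
    leaf.blocks[block - start] = data
    volume[start] = leaf
```

Because the same leaf object is reachable from two places after `range_copy`, the references form a directed acyclic graph rather than a tree, and a write to either the source or the target leaves the other side unchanged.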

［0391］また、ポッドのための入力／出力動作も、論理エクステントの有向非巡回グラフを複製することに基づいて実施されてよい。例えば、ポッドの中の各ストレージシステムは、論理エクステントのプライベートグラフを実装することができ、これによりポッドのための１つのストレージシステムに関するグラフは、ポッドのための任意の第２のストレージシステム上のグラフに特定の関係を有さない。しかしながら、ポッド内のストレージシステム間でグラフを同期させることには価値がある。これは、再同期のため、及び例えば非同期複製又はリモートストレージシステムに対するスナップショットベースの複製等の特徴を調整するために有用である場合がある。さらに、それは、スナップショットの配布を処理するため及びコピー関係処理のためのなんらかのオーバヘッドを削減するためにも有用である場合がある。係るモデルでは、ポッドのためのすべての同期中のストレージシステム全体で同期中のポッドのコンテンツを保つことは、ポッドのためのすべての同期中のストレージシステム全体ですべてのボリュームのためにリーフ論理エクステント及び複合論理エクステントのグラフを同期中に保つこと、及びすべての論理エクステントのコンテンツが同期中であることを確実にすることと、基本的に同じである。同期中であるためには、リーフ論理エクステント及び複合論理エクステントを一致させることは同じアイデンティティを有するべきである、又はマッピング可能なアイデンティティを有するべきであるかのどちらかである。マッピングは、中間マッピングテーブルのなんらかの集合を伴う場合もあれば、なんらかの他のタイプのアイデンティティ変換を伴う場合もあるであろう。一部の場合では、リーフ論理エクステントによってマッピングされたブロックのアイデンティティも同期中に保つことができるであろう。 [0391] Input/output operations for a pod may also be performed based on replicating a directed acyclic graph of logical extents. For example, each storage system in a pod may implement a private graph of logical extents, such that the graph on one storage system for the pod has no particular relationship to the graph on any second storage system for the pod. However, there is value in synchronizing the graphs between storage systems within a pod. This may be useful for resynchronization and for coordinating features such as asynchronous replication or snapshot-based replication to remote storage systems. Furthermore, it may also be useful for handling snapshot distribution and reducing some overhead for copy relationship processing. In such a model, keeping the contents of a pod in sync across all synchronizing storage systems for the pod is essentially the same as keeping the graphs of leaf logical extents and composite logical extents in sync for all volumes across all synchronizing storage systems for the pod and ensuring that the contents of all logical extents are in sync. 
To be in sync, matching leaf logical extents and composite logical extents should either have the same identity or should have mappable identities. Mapping might involve some set of intermediate mapping tables, or it might involve some other type of identity translation. In some cases, the identities of blocks mapped by leaf logical extents could also be kept in sync.

［0392］ポッドごとに単一リーダーを有するリーダー及びフォロワーに基づいたポッド実施態様では、リーダーは、論理エクステントグラフに対する任意の変更を決定することを任される場合がある。新しいリーフ論理エクステント又は複合論理エクステントが作成される場合、それはアイデンティティを与えられる場合がある。既存のリーフ論理エクステント又は複合論理エクステントが修正を有する新しい論理エクステントを形成するためにコピーされる場合、新しい論理エクステントは、修正のなんらかの集合を有する以前の論理エクステントのコピーとして記述される場合がある。既存の論理エクステントが分割される場合、分割は、結果として生じる新しいアイデンティティとともに記述される場合がある。論理エクステントが、なんらかの追加の複合論理エクステントから基本的な論理エクステントとして参照される場合、その参照は、その基本的な論理エクステントを参照するための複合論理エクステントに対する変更として記述される場合がある。 [0392] In a leader-and-follower based pod implementation with a single leader per pod, the leader may be responsible for determining any changes to the logical extent graph. When a new leaf logical extent or composite logical extent is created, it may be given an identity. When an existing leaf logical extent or composite logical extent is copied to form a new logical extent with modifications, the new logical extent may be described as a copy of the previous logical extent with some set of modifications. When an existing logical extent is split, the split may be described along with the resulting new identity. When a logical extent is referenced as a base logical extent from some additional composite logical extent, the reference may be described as a change to the composite logical extent to reference that base logical extent.

［0393］ポッドでの修正動作は、このようにして（コンテンツを拡張するために新しい論理エクステントが作成される場合、又はスナップショット、ボリュームコピー、及びボリュームアドレス範囲コピーに関係するコピーオンライト状態を処理するために、論理エクステントがコピーされ、修正され、置換される場合）論理エクステントグラフに対する修正の記述を配布すること、及びリーフ論理エクステントのコンテンツに対する修正のための記述及びコンテンツを配布することを含む。有向非巡回グラフの形のメタデータを使用することからくる追加の利点は、上述されたように、物理ストレージ内の記憶されたデータを修正するＩ／Ｏ動作が、物理ストレージに記憶されたデータに対応するメタデータの修正によりユーザレベルで―物理ストレージに記憶されたデータを修正することなく―効果を与えられ得る点である。物理ストレージがソリッドステートドライブであってよいストレージシステムの開示された実施形態では、フラッシュメモリに対する修正に付随する磨耗は、フラッシュメモリの読取り、消去、又は書込みを通しての代わりに、Ｉ／Ｏ動作によってターゲットにされるデータを表すメタデータの修正により効果を与えられるＩ／Ｏ動作のために回避又は削減され得る。さらに、仮想化ストレージシステムでは、上述されたメタデータは、仮想アドレス、つまり論理アドレスと物理アドレス、つまり実アドレスとの関係性を処理するために使用されてよい－言い換えると、記憶されたデータのメタデータ表現は、それがフラッシュメモリ上の摩耗を削減する又は最小限に抑える点でフラッシュフレンドリ（ｆｌａｓｈ－ｆｒｉｅｎｄｌｙ）と見なされてよい仮想化ストレージシステムを可能にする。 [0393] Modification operations in a pod thus include distributing descriptions of modifications to the logical extent graph (when new logical extents are created to extend content, or when logical extents are copied, modified, or replaced to handle copy-on-write conditions related to snapshots, volume copies, and volume address range copies), and distributing descriptions and contents for modifications to the content of leaf logical extents. An additional advantage of using metadata in the form of a directed acyclic graph, as described above, is that I/O operations that modify stored data in physical storage can be effected at the user level—without modifying the data stored in physical storage—by modifying the metadata corresponding to the data stored in physical storage. In disclosed embodiments of a storage system in which the physical storage may be a solid-state drive, wear associated with modifications to flash memory can be avoided or reduced due to I/O operations being effected by modifying metadata representing the data targeted by the I/O operation, instead of through a read, erase, or write to flash memory. 
Furthermore, in a virtualized storage system, the metadata described above may be used to handle the relationship between virtual (logical) addresses and physical (real) addresses. In other words, the metadata representation of stored data enables a virtualized storage system that may be considered flash-friendly in that it reduces or minimizes wear on flash memory.

［0394］リーダーストレージシステムは、ポッドデータセットのそのローカルコピー及びローカルストレージシステムのメタデータとの関連でこれらの記述を実装するために、独自のローカル動作を実行してよい。さらに、同期中のフォロワーは、ポッドデータセットのそれらの別々のローカルコピー及びその別々のローカルストレージシステムのメタデータとの関連でこれらの記述を実装するために、独自の別々のローカル動作を実行する。リーダーとフォロワーの両方が完了すると、結果は互換性のあるリーフ論理エクステントコンテンツを有する論理エクステントの互換性のあるグラフである。論理エクステントのこれらのグラフは、次いで上記の例に説明されたように一種の「共通メタデータ」になる。この共通メタデータは、修正動作と必須共通メタデータとの間の依存関係として記述される場合がある。グラフへの変換は、後続の修正動作との待ち行列プレディケート関係を有する別々の動作として記述できる。代わりに、各修正動作は、ポッド全体で完了するとまだ知られていない特定の同じグラフ変換に頼る各修正動作は、それが頼る任意のグラフ変換の部分を含む場合がある。既に存在している「新しい」リーフ論理エクステント又は複合論理エクステントを識別する動作記述を処理することは、その部分はすでになんらかの早期の動作の処理で処理されていたので、新しい論理エクステントを作成することを回避する場合があり、代わりにリーフ論理エクステント又は複合論理エクステントのコンテンツを変更する動作処理部分しか実装できない。変換が互いに互換性があることを確実にすることはリーダーの役割である。例えば、ポッドを受け取る２つの書込みで開始できる。第１の書込みは、複合論理エクステントＡを複合論理エクステントＢとして形成されたコピーで置換し、リーフ論理エクステントＣをリーフ論理エクステントＤとしてのコピーで及び第２の書込みのためのコンテンツを記憶するための修正で置換し、さらにリーフ論理エクステントＤを複合論理エクステントＢに書き込む。一方で、第２の書込みは同じコピー、及び複合論理エクステントＡの複合論理エクステントＢとの置換を暗示するが、異なるリーフ論理エクステントＥをコピーし、第２の書込みのコンテンツを記憶するために修正される論理エクステントＦで置換し、さらに論理エクステントＦを論理コンテンツＢに書き込む。その場合には、第１の書込みのための記述は、ＡのＢとの置換及びＣのＤとの置換、並びにＤの複合論理エクステントＢへの書込み、及び第１の書込みのコンテンツのリーフエクステントＢへの書込みを含む場合があり、第２の書込みの記述は、リーフエクステントＦに書き込まれる第２の書込みのコンテンツとともに、ＡのＢとの置換及びＥのＦとの置換、並びにＦの複合論理エクステントＢへの書込みを含む場合がある。リーダー又は任意のフォロワーは、次いで任意の順序で第１の書込み又は第２の書込みを処理する場合があり、最終結果は、ＢがＡをコピーし、置換し、ＤがＣをコピーし、置換し、ＦがＥをコピーし、置換し、Ｄ及びＦが複合論理エクステントＢに書き込まれることである。Ｂを形成するためのＡの第２のコピーは、Ｂがすでに存在していることを認識することによって回避できる。このようにして、リーダーは、ポッドのための同期中のストレージシステム全体で論理エクステントグラフのために互換性のある共通メタデータを維持することを確実にする場合がある。 [0394] The leader storage system may perform its own local operations to implement these descriptions in the context of its local copy of the pod dataset and the metadata of its local storage system. Additionally, each synchronizing follower performs its own separate local operations to implement these descriptions in the context of their separate local copies of the pod dataset and the metadata of their separate local storage systems. 
When both the leader and the followers are complete, the result is compatible graphs of logical extents with compatible leaf logical extent contents. These graphs of logical extents then become a kind of "common metadata," as described in the examples above. This common metadata may be described as dependencies between modification operations and the common metadata they require. Transformations to the graph may be described as separate operations with a queue predicate relationship to subsequent modification operations. Alternatively, each modification operation that relies on a particular graph transformation not yet known to have completed across the pod may include the parts of any graph transformation it relies on. Processing an operation description that identifies a "new" leaf or composite logical extent that already exists can avoid creating the new logical extent, because that part was already handled while processing some earlier operation; only the parts of the operation that change the content of leaf or composite logical extents then need to be implemented. It is the leader's role to ensure that the transformations are compatible with one another. For example, consider two writes received by the pod. The first write replaces composite logical extent A with a copy formed as composite logical extent B, replaces leaf logical extent C with a copy formed as leaf logical extent D that is modified to store the content of the first write, and writes leaf logical extent D into composite logical extent B. Meanwhile, the second write implies the same copy and replacement of composite logical extent A with composite logical extent B, but copies a different leaf logical extent E and replaces it with leaf logical extent F, which is modified to store the content of the second write, and writes logical extent F into composite logical extent B.
In that case, the description for the first write may include the replacement of A with B and of C with D, the writing of D into composite logical extent B, and the writing of the first write's content into leaf extent D; the description for the second write may include the replacement of A with B and of E with F, the writing of F into composite logical extent B, and the second write's content, which is written into leaf extent F. The leader or any follower may then process the first write and the second write in either order, and the end result is that B copies and replaces A, D copies and replaces C, F copies and replaces E, and D and F are written into composite logical extent B. A second copy of A to form B is avoided by recognizing that B already exists. In this way, the leader can ensure that the pod maintains compatible common metadata for the logical extent graph across the in-sync storage systems for the pod.
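The two-write example can be sketched as follows. The encoding of an operation description as a list of steps, the step names (`copy_composite`, `copy_leaf`, `link`, `store`), and the dictionary layout of the graph are assumptions made for illustration. The point shown is only that each description carries the graph transformations it depends on, and that a step creating an extent which already exists is skipped, so either processing order yields the same graph.

```python
def new_graph():
    """Initial state: composite extent A references leaf extents C and E."""
    return {
        "root": "A",
        "composites": {"A": ["C", "E"]},
        "leaves": {"C": b"c-old", "E": b"e-old"},
    }


# Each description carries the graph transformations it relies on.
WRITE_1 = [("copy_composite", "A", "B"), ("copy_leaf", "C", "D"),
           ("link", "B", "C", "D"), ("store", "D", b"first")]
WRITE_2 = [("copy_composite", "A", "B"), ("copy_leaf", "E", "F"),
           ("link", "B", "E", "F"), ("store", "F", b"second")]


def apply(graph, description):
    """Apply one operation description; steps that already happened are no-ops."""
    for step in description:
        op = step[0]
        if op == "copy_composite":
            _, old, new = step
            if new not in graph["composites"]:  # B may already exist
                graph["composites"][new] = list(graph["composites"][old])
            if graph["root"] == old:
                graph["root"] = new  # the copy replaces the original
        elif op == "copy_leaf":
            _, old, new = step
            graph["leaves"].setdefault(new, graph["leaves"][old])
        elif op == "link":
            _, comp, old, new = step
            kids = graph["composites"][comp]
            graph["composites"][comp] = [new if k == old else k for k in kids]
        elif op == "store":
            _, leaf, data = step
            graph["leaves"][leaf] = data  # content of the write
```

A leader and a follower that process `WRITE_1` and `WRITE_2` in opposite orders both end with B referencing D and F, while A, C, and E are left untouched for any snapshot that still references them.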

［0395］論理エクステントの有向非巡回グラフを使用するストレージシステムの実施態様を所与として、論理エクステントの複製された有向非巡回グラフに基づいたポッドの回復が実施されてよい。具体的には、この例では、ポッドでの回復は、複製されたエクステントグラフに基づいてよく、次いでリーフ論理エクステントのコンテンツを回復することだけではなく、これらのグラフの一貫性を回復することも伴う。回復の本実施態様では、動作は、ポッドのためのすべてのストレージシステム全体で完了したと知られていないすべてのリーフ論理エクステントコンテンツ修正だけではなく、ポッドのためのすべての同期中のストレージシステムで完了したと知られていないグラフ変換に対しても問合せすることを含んでよい。係る問い合わせることは、なんらかの調整されたチェックポイント以降の動作に基づいているであろう、又は単に完了したと知られていない動作であり、各ストレージシステムは、完了したとしてまだ信号で知らせていない通常動作中の動作のリストを保つ。この例では、グラフ変換は簡単である。グラフ変換は新しい事柄を作成し、旧い事柄を新しい事柄にコピーし、旧い事柄を２つ以上の分割された新しい事柄にコピーする場合もあれば、それらは他のエクステントへのその参照を修正するために複合エクステントを修正する。任意の論理エクステントを作成又は置換する任意の同期中のストレージシステムで見つけられた任意の記憶されている動作記述は、その論理エクステントをまだ有していない任意の他のストレージシステムでコピーされ、実行される場合がある。関与されたリーフ論理エクステント又は複合論理エクステントが適切に回復されている限り、リーフ論理エクステント又は複合論理エクステントに対する修正を説明する動作は、それらの修正を、まだそれらを適用していなかった任意の同期中のストレージシステムに適用できる。 [0395] Given an implementation of a storage system that uses a directed acyclic graph of logical extents, recovery of a pod may be performed based on the replicated directed acyclic graph of logical extents. Specifically, in this example, recovery at a pod may be based on the replicated extent graph and then involve not only recovering the contents of leaf logical extents but also restoring consistency to these graphs. In this implementation of recovery, operations may include querying not only for all leaf logical extent content modifications that are not known to be completed across all storage systems for the pod, but also for graph transformations that are not known to be completed across all synchronizing storage systems for the pod. Such queries may be based on operations since some coordinated checkpoint, or simply operations that are not known to be completed; each storage system maintains a list of operations during normal operation that have not yet signaled as completed. In this example, the graph transformations are simple. 
Graph transformations create new things, copy old things to new things, copy old things into two or more split new things, or modify composite extents to change their references to other extents. Any stored operation description found on any in-sync storage system that creates or replaces a logical extent may be copied and executed on any other storage system that does not already have that logical extent. As long as the leaf or composite logical extents involved have been properly recovered, operations that describe modifications to leaf or composite logical extents can apply those modifications to any in-sync storage system that has not yet applied them.

［0396］さらにこの例では、ポッドの回復は、以下を含んでよい。
・ポッドのためのすべての同期中のストレージシステムで完了したと知られていなかったリーフ論理エクステント及び複合論理エクステントの作成、ならびにもしあれば、それらのプレカーソルのリーフ論理エクステント及び複合論理エクステントについてすべての同期中のストレージシステムを問い合わせることと、
・ポッドのためのすべての同期中のストレージシステムで完了したと知られていなかったリーフ論理エクステントに対する修正動作についてすべての同期中のストレージシステムを問い合わせることと、
・既存のリーフ論理エクステント及び複合論理エクステントに対する新しい参照として論理アドレス範囲コピー動作について問い合わせることと、
・リーフ論理エクステントに対する、完了したと知られていない修正を識別することであって、そのリーフ論理エクステントが、やはり回復を必要とする場合がある代替リーフ論理エクステントのためのソースである―これにより、すでにそれをコピーしていなかった任意の同期中のストレージシステムでリーフ論理エクステントコピーが回復される前に、すべての同期中のストレージシステムに対するそのリーフ論理エクステントに対して修正できるように―識別することと、
・すべてのリーフ論理エクステント及び複合論理エクステントのコピー動作を完了することと、
・新しい論理エクステントリファレンスに名前を付けること、リーフ論理エクステントコンテンツを更新すること、又は論理エクステントリファレンスを削除することを含んだ、リーフ論理エクステント及び複合論理エクステントにすべての追加更新を適用することと、
・すべての必要なアクションが完了したと判断することであって、その時点で回復の追加の態様が進むことができる、判断することと
を含んでよい。 [0396] Further in this example, pod recovery may include:
querying all in-sync storage systems for creations of leaf logical extents and composite logical extents, and for their precursor leaf logical extents and composite logical extents, if any, that are not known to have completed on all in-sync storage systems for the pod;
querying all in-sync storage systems for modification operations on leaf logical extents that are not known to have completed on all in-sync storage systems for the pod;
querying for logical address range copy operations as new references to existing leaf logical extents and composite logical extents;
identifying modifications, not known to have completed, to a leaf logical extent that is the source for a replacement leaf logical extent that may also need recovery, so that the modifications can be applied to that leaf logical extent on all in-sync storage systems before the leaf logical extent copy is recovered on any in-sync storage system that had not already copied it;
completing all leaf logical extent and composite logical extent copy operations;
applying all further updates to leaf logical extents and composite logical extents, including naming new logical extent references, updating leaf logical extent content, or removing logical extent references; and
determining that all necessary actions have been completed, at which point further aspects of recovery can proceed.

［0397］別の例では、論理エクステントグラフを使用する代替策として、ストレージは複製されたコンテンツアドレス指定可能ストアに基づいて実装されてよい。コンテンツアドレス指定可能ストアでは、データのブロックごとに（例えば、５１２バイト、４０９６バイト、８１９２バイトごとに、又は１６３８４バイトごとにも）、（フィンガープリントと呼ばれる場合もある）一意のハッシュ値が、ブロックコンテンツに基づいて計算され、これによりボリューム又はボリュームのエクステント範囲は、特定のハッシュ値を有するブロックに対する参照のリストとして記述できる。同じハッシュ値を有するブロックに対する参照に基づいた同期複製されたストレージシステム実施態様では、複製は、第１のストレージシステムがブロックを受け取り、それらのブロックのフィンガープリントを計算し、それらのフィンガープリントのためのブロックリファレンスを識別し、参照されたブロックに対するボリュームブロックのマッピングに対する更新として１つ又は複数の追加のストレージシステムに対する変更を送達することを伴う場合があるであろう。ブロックが第１のストレージシステムによってすでに記憶されていることがわかる場合、そのストレージシステムは（参照が同じハッシュ値を使用するため、又は参照のための識別子が同一であるか、容易にマッピングできるかのどちらかであるため）追加のストレージシステムのそれぞれで参照に名前を付けるためにその参照を使用できる。代わりに、ブロックが、第１のストレージシステムによって見つけられない場合、次いで第１のストレージシステムのコンテンツは、そのブロックコンテンツと関連付けられたハッシュ値又はアイデンティティとともに、動作記述の一部として他のストレージシステムに送達されてよい。さらに、各同期中のストレージシステムのボリューム記述は、次いで新しいブロックリファレンスで更新される。係るストアでの回復は、次いでボリュームのための最近更新されたブロックリファレンスを比較することを含んでよい。ブロックリファレンスがポッドのための異なる同期中のストレージシステム間で異なる場合、次いで各参照の１つのバージョンは、それらを一貫性があるものにするために他のストレージシステムにコピーされる場合がある。１つのシステム上のブロックリファレンスが存在しない場合、次いでそれはその参照のためのブロックを記憶するなんらかのストレージシステムからコピーされる。仮想コピー動作は、仮想コピー動作を実施することの一部として参照をコピーすることによって係るブロック又はハッシュリファレンスストアでサポートされる場合がある。 [0397] In another example, as an alternative to using a logical extent graph, storage may be implemented based on a replicated content-addressable store. In a content-addressable store, for each block of data (e.g., every 512 bytes, 4096 bytes, 8192 bytes, or even every 16384 bytes), a unique hash value (sometimes called a fingerprint) is calculated based on the block contents, allowing a volume or volume extent range to be described as a list of references to blocks with particular hash values. 
In a synchronously replicated storage system implementation based on references to blocks with the same hash value, replication might involve a first storage system receiving blocks, calculating fingerprints for those blocks, identifying block references for those fingerprints, and delivering changes to one or more additional storage systems as updates to the mapping of volume blocks to referenced blocks. If a block is found to be already stored by the first storage system, that storage system can use its reference to name the reference on each of the additional storage systems (either because the references use the same hash value, or because the identifiers for the references are identical or can be readily mapped). Alternatively, if the block is not found by the first storage system, the block content from the first storage system may be delivered to the other storage systems as part of the operation description, along with the hash value or identity associated with that block content. The volume description on each in-sync storage system is then updated with the new block references. Recovery in such a store may then include comparing recently updated block references for a volume. If block references differ between different in-sync storage systems for a pod, one version of each reference may be copied to the other storage systems to make them consistent. If the block for a reference does not exist on one system, it is copied from some storage system that stores the block for that reference. Virtual copy operations may be supported in such a block or hash reference store by copying the references as part of performing the virtual copy operation.
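The replication and recovery flow for a content-addressable store can be sketched as below. This is an illustrative model only: SHA-256 is assumed as the fingerprint function, the `Store` class and the function names are hypothetical, and recovery arbitrarily treats the first system's references as the version to copy, which the description above leaves open.

```python
import hashlib


class Store:
    """One storage system's view: block store plus a volume description."""
    def __init__(self):
        self.blocks = {}  # fingerprint -> block content
        self.volume = {}  # volume block index -> fingerprint


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def apply_description(store, desc, source):
    """Apply an operation description on an additional storage system."""
    fp = desc["fp"]
    if fp not in store.blocks:
        content = desc["content"]
        if content is None:
            content = source.blocks[fp]  # assumed fallback: fetch by reference
        store.blocks[fp] = content
    store.volume[desc["index"]] = fp


def replicated_write(first, others, index, data):
    """First system fingerprints the block and delivers a description."""
    fp = fingerprint(data)
    already_stored = fp in first.blocks
    first.blocks.setdefault(fp, data)
    first.volume[index] = fp
    # Send only the reference when the block is already known, else the content.
    desc = {"index": index, "fp": fp,
            "content": None if already_stored else data}
    for other in others:
        apply_description(other, desc, source=first)
    return desc


def recover(systems, recent_indexes):
    """Make recently updated block references consistent across systems."""
    ref = systems[0]  # assumption: the first system's version is copied
    for other in systems[1:]:
        for i in recent_indexes:
            fp = ref.volume.get(i)
            if other.volume.get(i) == fp:
                continue
            if fp is None:
                other.volume.pop(i, None)
                continue
            if fp not in other.blocks:
                other.blocks[fp] = ref.blocks[fp]  # copy the missing block
            other.volume[i] = fp
```

Writing the same 512-byte block to two volume indexes sends the content once and a bare reference the second time; `recover` then repairs a reference that reached only one system before a failure.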

［0398］システム回復のための特定の実施態様に関して、図１８に示される例の方法は、データセット（１８１２）を同期複製する複数のストレージシステム（１８１４、１８２４、１８２８）の中の少なくとも１つのストレージシステムによって、データセット（１８１２）を修正する要求（１８０４）を受け取ること（１８４２）を含む。データセット（１８１２）を修正する要求（１８０４）を受け取ること（１８４２）は、データセット（１８１２）を修正する要求（１８０４）を受け取ること（１８０６）と同様に実施されてよい。 [0398] For a particular embodiment for system recovery, the example method shown in FIG. 18 includes receiving (1842) a request (1804) to modify dataset (1812) by at least one storage system among a plurality of storage systems (1814, 1824, 1828) that synchronously replicate dataset (1812). Receiving (1842) the request (1804) to modify dataset (1812) may be implemented similarly to receiving (1806) the request (1804) to modify dataset (1812).

［0399］また、図１８に示される例の方法は、データセット（１８１２）を修正する要求（１８０４）が、データセット（１８１２）を同期複製する複数のストレージシステム（１８１４、１８２４、１８２８）のすべてのストレージシステムで適用されたかどうかを示す回復情報（１８５２）を生成すること（１８４４）も含む。データセット（１８１２）を修正する要求（１８０４）が、データセット（１８１２）を同期複製する複数のストレージシステム（１８１４、１８２４、１８２８）のすべてのストレージシステムで適用されたかどうかを示す回復情報（１８５２）を生成すること（１８４４）は、異なる追跡に基づいた回復、重複書込みの回復を含んだ動作追跡に基づいた回復、スナップショットの回復、メタデータ及び共通メタデータの回復、完了した動作を捨てること、スライドウィンドウを使用すること、及びチェックポイントを使用することを含んだ動作追跡のための最近の活動を記録することに基づいた回復、論理エクステントの複製された有向非巡回グラフに基づいたポッドの回復、及び複製されたコンテンツアドレス指定可能なストアでの追跡及び回復を含んだ、上述されたさまざまな技術を使用し、実装されてよい。要するに、回復情報を生成するために多様な技術が使用されてよく、回復情報は、複数のストレージシステム（１８１４、１８２４、１８２８）の中のどのストレージシステム、データセット（１８１２）を修正する要求（１８０４）を示す。 [0399] The example method shown in FIG. 18 also includes generating (1844) recovery information (1852) indicating whether the request (1804) to modify the dataset (1812) has been applied at all storage systems of the multiple storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812). Generating (1844) recovery information (1852) indicating whether the request (1804) to modify the dataset (1812) has been applied at all storage systems (1814, 1824, 1828) of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812) may be implemented using various techniques described above, including differential tracking-based recovery, operation tracking-based recovery including duplicate write recovery, snapshot recovery, metadata and common metadata recovery, recovery based on recording recent activity for operation tracking including discarding completed operations, using a sliding window, and using checkpoints, pod recovery based on a replicated directed acyclic graph of logical extents, and tracking and recovery in a replicated content-addressable store. 
In short, various techniques may be used to generate the recovery information, and the recovery information indicates which storage systems among the plurality of storage systems (1814, 1824, 1828) have or have not applied the request (1804) to modify the dataset (1812).

[0400] The example method shown in FIG. 18 also includes applying (1846), in response to a system failure, a recovery action in accordance with the recovery information (1852) indicating whether the request to modify has been applied at all storage systems of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812). The recovery action may be performed by applying the request (1804) to modify the dataset (1812) at every storage system that has not applied the request (1804). The recovery information (1852) may include tracking information indicating which storage systems among the plurality of storage systems (1814, 1824, 1828) have or have not applied one or more requests to modify the synchronously replicated dataset (1812), including the most recently received request (1804). In other cases, however, the recovery action may be performed by canceling or undoing the application of the request (1804) to modify the dataset (1812) on the set of storage systems that have completed or partially completed application of the request (1804).
Generally, the default recovery action may be to identify each storage system that did not successfully complete the request (1804) and to apply the request (1804) there, along with any other outstanding requests to modify the dataset (1812). Other implementations of the recovery action are described above with respect to recovery based on difference tracking, recovery based on operation tracking including recovery of duplicate writes, snapshot recovery, recovery of metadata and common metadata, recovery based on recording recent activity for operation tracking (including discarding completed operations, using a sliding window, and using checkpoints), recovery of pods based on a replicated directed acyclic graph of logical extents, and tracking and recovery in a replicated content-addressable store.

[0401] For further explanation, FIG. 19 sets forth a flowchart illustrating an example method of recovery for storage systems that synchronously replicate a dataset, in accordance with some embodiments of the present disclosure. The example method shown in FIG. 19 is similar to the example method shown in FIG. 18 in that it also includes: receiving (1842), by at least one storage system among a plurality of storage systems (1814, 1824, 1828) that synchronously replicate a dataset (1812), a request (1804) to modify the dataset (1812); generating (1844) recovery information (1852) indicating whether the request (1804) to modify the dataset (1812) has been applied at all storage systems of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812); and applying (1846), in response to a system failure, a recovery action in accordance with the recovery information (1852) indicating whether the request to modify has been applied at all storage systems of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812).

[0402] However, the example method shown in FIG. 19 further specifies that generating (1844) the recovery information includes querying (1902) other storage systems of the plurality of storage systems for operations that have been confirmed as processed and determining (1904) a set of storage systems for which operations have not been confirmed as completed, and that applying (1846) the recovery action includes completing (1906), on the set of storage systems, the operations that have not been confirmed as completed.

［0403］完了された又は処理されたと確認された動作について複数のストレージシステムの他のストレージシステムを問い合わせること（１９０２）は、ポッド内のボリュームの中及び間のボリューム範囲の仮想コピーを同期複製することをサポートするストレージシステムのための動作追跡に関して上述されたように実施されてよい。具体的には、動作がポッドのためのすべての同期中のストレージシステムで完了されたと確認された後に、すべてのストレージシステム全体で完了した動作を捨てることに関して上述されたように、要求を受け取り、完了を信号で知らせたストレージシステムに、完了が信号で知らされた後にポッドのためのすべてのストレージシステムにメッセージを送信させ、各ストレージシステムがそれらを捨てることを可能にすることによって実施されてよい。回復は、次いで回復に関与するポッドのためのすべての同期中のストレージシステム全体で捨てられなかったすべての記録された動作について問い合わせることを伴う。 [0403] Querying other storage systems of the plurality of storage systems for operations that have been confirmed as completed or processed (1902) may be performed as described above with respect to operation tracking for a storage system that supports synchronous replication of virtual copies of volume ranges within and between volumes within a pod. Specifically, as described above with respect to discarding completed operations across all synchronizing storage systems for a pod after the operations have been confirmed as completed across all synchronizing storage systems for the pod, this may be performed by having the storage system that received the request and signaled completion send messages to all storage systems for the pod after completion has been signaled, allowing each storage system to discard them. Recovery then involves querying for all recorded operations that have not been discarded across all synchronizing storage systems for the pod involved in the recovery.

［0404］動作が完了したと確認されていないストレージシステムの集合を決定すること（１９０４）は、他のストレージシステムを問い合わせること（１９０２）からの結果に基づいて実施されてよく、ストレージシステムの集合は、問い合わせること（１９０２）が捨てられていなかった動作のリストを含んだ１つ以上のストレージシステムによってポピュレートされる。 [0404] Determining (1904) the set of storage systems for which operations have not been confirmed as completed may be performed based on results from querying (1902) other storage systems, the set of storage systems being populated by one or more storage systems for which querying (1902) included a list of operations that had not been discarded.

[0405] Completing (1906), on the set of storage systems, the operations that have not been confirmed as completed may be performed by reissuing the operations to the set of storage systems and, for each uncompleted operation, sending information describing the modification to the dataset (1812) according to the corresponding request and completing the steps described herein.
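The three steps of FIG. 19 (querying (1902), determining (1904), and completing (1906)) can be sketched as follows. The `StorageSystem` class, its fields, and the operation format are hypothetical. The sketch also assumes that operations still recorded at recovery time do not overlap, or can safely be reapplied in identifier order; handling of overlapping writes is not modeled here.

```python
class StorageSystem:
    """Hypothetical in-memory stand-in for one in-sync storage system."""
    def __init__(self, name):
        self.name = name
        self.dataset = {}     # block index -> data
        self.applied = set()  # ids of operations applied locally
        self.pending = {}     # op id -> op; recorded and not yet discarded

    def record(self, op):
        self.pending[op["id"]] = op

    def apply(self, op):
        if op["id"] not in self.applied:  # reapplying is a no-op
            self.dataset[op["index"]] = op["data"]
            self.applied.add(op["id"])

    def discard(self, op_id):
        self.pending.pop(op_id, None)


def query_pending(systems):
    """Step 1902: collect every recorded operation that was never discarded."""
    ops = {}
    for system in systems:
        ops.update(system.pending)
    return ops


def systems_needing(op, systems):
    """Step 1904: systems on which the operation is not known to be applied."""
    return [s for s in systems if op["id"] not in s.applied]


def recover(systems):
    """Step 1906: reissue each unconfirmed operation, then discard its record."""
    ops = query_pending(systems)
    for op_id in sorted(ops):
        for system in systems_needing(ops[op_id], systems):
            system.apply(ops[op_id])
        for system in systems:
            system.discard(op_id)
```

If a failure occurs after one system has applied an operation and before the others have, `recover` brings every system to the same dataset contents and leaves no pending records behind.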

[0406] For further explanation, FIG. 20 sets forth a flowchart illustrating an example method of recovery for storage systems that synchronously replicate a dataset, in accordance with some embodiments of the present disclosure. The example method shown in FIG. 20 is similar to the example method shown in FIG. 18 in that it also includes: receiving (1842), by at least one storage system among a plurality of storage systems (1814, 1824, 1828) that synchronously replicate a dataset (1812), a request (1804) to modify the dataset (1812); generating (1844) recovery information (1852) indicating whether the request (1804) to modify the dataset (1812) has been applied at all storage systems of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812); and applying (1846), in response to a system failure, a recovery action in accordance with the recovery information (1852) indicating whether the request to modify has been applied at all storage systems of the plurality of storage systems (1814, 1824, 1828) that synchronously replicate the dataset (1812).

［0407］しかしながら、図２０に示される例の方法は、回復情報を生成すること（１８４４）が、複数のストレージシステムでデータセット（１８１２）を修正する要求（２００４）を適用することに向かう進行を追跡することによって、回復情報（１８５２）を生成すること（２００２）を含むことをさらに指定する。 [0407] However, the example method shown in FIG. 20 further specifies that generating recovery information (1844) includes generating recovery information (1852) (2002) by tracking progress toward applying requests (2004) to modify datasets (1812) at multiple storage systems.

［0408］複数のストレージシステムでデータセット（１８１２）を修正する要求（１８０４）を適用することに向かう進捗を追跡することによって、回復情報（１８５２）を生成すること（２００２）は、処理又は完了されたと確認される動作を決定するためにチェックポインティングを使用することによって、上述されたように実施されてよい。このようにして、生成された（２００２）回復情報（１８５２）は、どのストレージシステムが、データセット（１８１２）を修正する要求（１８０４）を処理又は完了していないのかを示してよい。 [0408] Generating (2002) recovery information (1852) by tracking progress toward applying request (1804) to modify dataset (1812) at multiple storage systems may be performed as described above by using checkpointing to determine operations that are confirmed as processed or completed. In this manner, generated (2002) recovery information (1852) may indicate which storage systems have not processed or completed request (1804) to modify dataset (1812).
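For purposes of illustration only, the progress tracking described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the class `RecoveryInfo`, its method names, and the system labels "A", "B", and "C" are hypothetical, and checkpointing is modeled simply as discarding entries for requests confirmed on every storage system.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryInfo:
    """Hypothetical record of which storage systems confirmed each request."""
    systems: frozenset                           # systems replicating the dataset
    acked: dict = field(default_factory=dict)    # request id -> set of systems

    def record_request(self, request_id):
        self.acked.setdefault(request_id, set())

    def record_applied(self, request_id, system):
        self.acked.setdefault(request_id, set()).add(system)

    def pending_systems(self, request_id):
        """Systems not yet confirmed as having applied the request."""
        return set(self.systems) - self.acked.get(request_id, set())

    def applied_everywhere(self, request_id):
        return not self.pending_systems(request_id)

    def checkpoint(self):
        """Drop requests confirmed on all systems; they need no recovery."""
        done = [r for r in self.acked if self.applied_everywhere(r)]
        for r in done:
            del self.acked[r]
        return done


info = RecoveryInfo(systems=frozenset({"A", "B", "C"}))
info.record_request(1804)
info.record_applied(1804, "A")
info.record_applied(1804, "B")
```

After a failure, `info.pending_systems(1804)` would name the storage systems to which the request may need to be reissued, or on which it may need to be undone.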

［0409］データセット（１８１２）を修正する要求（１８０４）を適用すること（２００４）は、上述されたように実施されてよい要求（１８０４）を再発行するための１つ以上のストレージシステムを識別するために回復情報（１８５２）を使用することと、データセット（１８１２）を修正する要求（１８０４）のために、要求（１８０４）に従ってデータセットに対する修正を記述する情報を送信することと、上述されたステップを完了することによって実施されてよい。 [0409] Applying (2004) the request (1804) to modify the dataset (1812) may be performed by using the recovery information (1852) to identify one or more storage systems for reissuing the request (1804), which may be performed as described above, sending information describing the modifications to the dataset in accordance with the request (1804) for the request (1804) to modify the dataset (1812), and completing the steps described above.

［0410］データセット（１８１２）を修正する要求を適用しなかったストレージシステムでデータセット（１８１２）を修正する要求（１８０４）をアンドゥする（２００６）ことは、要求（１８０４）が、要求（１８０４）が処理又は完了された１つ以上のストレージシステムを識別するために回復情報（１８５２）を使用することによって実施されてよい。さらに、要求をアンドゥすること（２００６）は、要求（１８０４）が完了したストレージシステムごとに、データセット（１８１２）を修正する各要求に対応する変更のログを各ストレージシステムで維持することに依存する場合があり、データセット（１８１２）を修正する各要求は、識別子とさらに関連付けられてよい。また、ログは、要求識別子ごとに、メタデータ表現のバージョンを関連付けてもよい。要求識別子を適用する前に、データセットの状態を表す有向非巡回グラフを含むメタデータ表現のバージョンを関連付けてもよい。一部の例では、係るバージョニング情報は、スナップショットに対応してよい。上述されたように、データセットの可視化表現を所与として、及びデータセットを修正する対応する要求によって上書きされたデータに加えて、特定の要求に対応するデータセットのメタデータ表現に対する相違だけが記憶されることを所与として、ログのストレージ用件は最小限に抑えられるべきである。このようにして、ストレージシステムのコントローラは、ログを使用し、要求（１８０４）の適用前の事前情報にデータセットの応対を復元し、要求（１８０４）の適用前の事前状態にメタデータ表現の現在の状態を定義してよい。 [0410] Undoing (2006) a request (1804) to modify dataset (1812) at a storage system that did not apply the request to modify dataset (1812) may be performed by using recovery information (1852) to identify one or more storage systems on which the request (1804) was processed or completed. Furthermore, undoing (2006) the request may rely on maintaining, for each storage system on which the request (1804) was completed, a log of changes corresponding to each request to modify dataset (1812), where each request to modify dataset (1812) may be further associated with an identifier. The log may also associate, for each request identifier, a version of a metadata representation. The version of the metadata representation may include a directed acyclic graph representing the state of the dataset prior to applying the request identifier. In some examples, such versioning information may correspond to a snapshot. As described above, given the visualization representation of the dataset, and given that only differences to the metadata representation of the dataset corresponding to a particular request are stored in addition to data overwritten by the corresponding request to modify the dataset, the storage requirements of the log should be minimized. 
In this manner, the storage system's controller may use the log to restore the dataset to its state prior to application of request (1804) and to set the current state of the metadata representation back to its state prior to application of request (1804).
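As a hedged illustration of the undo log described above, the following sketch stores, per request identifier, only the values a request overwrote and uses them to roll the dataset back. It assumes a dataset modeled as a plain dictionary of block number to bytes rather than the directed acyclic graph metadata representation of the disclosure; `UndoLog` and its methods are hypothetical names.

```python
class UndoLog:
    """Hypothetical per-system log mapping a request id to overwritten data."""

    def __init__(self):
        self.entries = {}  # request id -> {block: prior value, or None if absent}

    def apply(self, dataset, request_id, writes):
        # Save only the values about to be overwritten, then apply the writes.
        self.entries[request_id] = {b: dataset.get(b) for b in writes}
        dataset.update(writes)

    def undo(self, dataset, request_id):
        # Restore every block touched by the request to its prior value.
        for block, prior in self.entries.pop(request_id).items():
            if prior is None:
                dataset.pop(block, None)   # block did not exist before
            else:
                dataset[block] = prior


dataset = {0: b"aa", 1: b"bb"}
log = UndoLog()
log.apply(dataset, 1804, {1: b"XX", 2: b"YY"})
log.undo(dataset, 1804)
```

Because only the overwritten values are kept, the size of the log grows with the amount of data each request modifies rather than with the size of the dataset.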

［0411］追加の説明のために、図２１は、本開示のいくつかの実施形態に従ってデータセットを同期複製するストレージシステムの再同期の例の方法を示すフローチャートを説明する。あまり詳細には示されないが、図２１に示されるストレージシステム（２１１４、２１２４、２１２８）は、図１Ａから図１Ｄ、図２Ａから図２Ｇ、図３Ａから図３Ｂ、又はその任意の組合せに関して上述されたストレージシステムに類似してよい。実際には、図２１に示されるストレージシステム（２１１４、２１２４、２１２８）は、上述されたストレージシステムと同じ構成要素、より少ない構成要素、追加の構成要素を含んでよい。 [0411] For additional explanation, FIG. 21 sets forth a flowchart illustrating an example method of resynchronization of storage systems that synchronously replicate datasets in accordance with some embodiments of the present disclosure. While not shown in great detail, the storage systems (2114, 2124, 2128) illustrated in FIG. 21 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems (2114, 2124, 2128) illustrated in FIG. 21 may include the same components as the storage systems described above, fewer components, or additional components.

［０４１２］図２１に示される例のストレージシステム構成は、データセット（２１１２）、及びデータセット（２１１２）が同期複製されてよい、複数のストレージシステム（２１１４、２１２４、２１２８）を含む。図２１に示されるデータセット（２１１２）は、例えば、特定のボリュームのコンテンツとして、ボリュームの特定のシャードのコンテンツとして、又は１つ以上のデータ要素の任意の他の集合体として実施されてよい。データセット（２１１２）は、各ストレージシステム（２１１４、２１２４、２１２８）がデータセット（２１１２）のローカルコピーを保持するように、複数のストレージシステム（２１１４、２１２４、２１２８）全体で同期されてよい。本明細書に説明される例では、係るデータセット（２１１２）は、少なくともアクセスされているクラスタ及び特定のストレージシステムが名目上実行している限り、クラスタ内の任意の一方のストレージシステムが、クラスタ内の任意の他方のストレージシステムよりも実質的により最適に動作しないような性能特性を有するストレージシステム（２１１４、２１２４、２１２８）のいずれかを通してデータセット（２１１２）にアクセスできるように、ストレージシステム（２１１４、２１２４、２１２８）全体で同期複製される。係るシステムでは、データセット（２１１２）に対する修正は、ストレージシステム（２１１４、２１２４、２１２８）のいずれかでデータセット（２１１２）にアクセスすることが一貫した結果を生じさせるように、各ストレージシステム（２１１４、２１２４、２１２８）に常駐するデータセットのコピーに対して加えられるべきである。例えば、データセットに発行される書込み要求は、すべてのストレージシステム（２１１４、２１２４、２１２８）でサービスを提供されなければならない、又はストレージシステム（２１１４、２１２４、２１２８）のいずれでもサービスを提供されてはならない。 [0412] The example storage system configuration shown in FIG. 21 includes a dataset (2112) and multiple storage systems (2114, 2124, 2128) across which the dataset (2112) may be synchronously replicated. The dataset (2112) shown in FIG. 21 may be embodied, for example, as the contents of a particular volume, as the contents of a particular shard of a volume, or as any other collection of one or more data elements. The dataset (2112) may be synchronized across the multiple storage systems (2114, 2124, 2128) such that each storage system (2114, 2124, 2128) maintains a local copy of the dataset (2112). In the example described herein, such a dataset (2112) is synchronously replicated across the storage systems (2114, 2124, 2128) such that the dataset (2112) can be accessed through any of the storage systems (2114, 2124, 2128), with performance characteristics such that no one storage system in the cluster operates substantially more optimally than any other storage system in the cluster, at least as long as the cluster and the particular storage system being accessed are running nominally. 
In such a system, modifications to dataset (2112) should be made to copies of the dataset resident on each storage system (2114, 2124, 2128) so that accessing dataset (2112) on any of storage systems (2114, 2124, 2128) produces consistent results. For example, a write request issued to a dataset must be serviced by all storage systems (2114, 2124, 2128) or by none of the storage systems (2114, 2124, 2128).

［0413］さらに、データセット（２１１２）の場合、データセット（２１１２）がその全体で同期複製される複数のストレージシステム（２１１４、２１２４、２１２８）は、例えばデータセット（２１１２）を、そのデータセット（２１１２）を名目上記憶する１つ以上のストレージシステム（２１１４、２１２４、２１２８）と関連付けるポッド定義又は類似したデータ構造を調べることによって実施されてよい。係る例では、ポッド定義は、データセット（２１１２）の少なくとも１つの識別、及びデータセット（２１１２）がその全体で同期複製されるストレージシステム（２１１４、２１２４、２１２８）の集合を含んでよい。係るポッドは、対称アクセス、レプリカの柔軟性がある追加／削除、高可用性データ一貫性、データセットに対する関係におけるストレージシステム全体での一様なユーザ管理、管理ホストアクセス、アプリケーションクラスタ化等を含んだいくつかの（おそらく任意選択の）データ一貫性の一部をカプセル化してよい。ストレージシステムは、ポッドに加えることができ、ポッドのデータセット（２１１２）は、そのストレージシステムにコピーされ、次いでデータセット（２１１２）が修正されるにつれ、最新に保たれる。また、ストレージシステムは、ポッドから削除することもでき、データセット（２１１２）は、もはや削除されたストレージシステムで最新に保たれない。係る例では、ポッド定義又は類似するデータ構造は、ストレージシステムが特定のポッドに加えられ、特定のポッドから削除されるにつれ更新されてよい。 [0413] Additionally, for a dataset (2112), the multiple storage systems (2114, 2124, 2128) across which the dataset (2112) is synchronously replicated may be implemented, for example, by examining a pod definition or similar data structure that associates the dataset (2112) with one or more storage systems (2114, 2124, 2128) that nominally store the dataset (2112). In such an example, the pod definition may include at least one identification of the dataset (2112) and a collection of storage systems (2114, 2124, 2128) across which the dataset (2112) is synchronously replicated. Such a pod may encapsulate several (possibly optional) pieces of data consistency, including symmetric access, flexible addition/removal of replicas, highly available data consistency, uniform user management across storage systems in relation to the dataset, administrative host access, application clustering, etc. A storage system can be added to a pod, and the pod's dataset (2112) is copied to that storage system and then kept up to date as the dataset (2112) is modified. A storage system can also be removed from a pod, and the dataset (2112) is no longer kept up to date with the removed storage system. 
In such an example, the pod definition or similar data structure may be updated as storage systems are added to and removed from particular pods.
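By way of a non-limiting sketch, a pod definition of the kind described above might be modeled as a record that associates a dataset identifier with the set of storage systems that nominally store it. The names `PodDefinition`, `replicas_for`, and the identifier strings are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class PodDefinition:
    """Hypothetical pod definition: one dataset and its member systems."""
    dataset_id: str
    members: set = field(default_factory=set)

    def add_system(self, system_id):
        # The dataset would then be copied to the system and kept up to date.
        self.members.add(system_id)

    def remove_system(self, system_id):
        # The dataset is no longer kept up to date on the removed system.
        self.members.discard(system_id)


def replicas_for(pods, dataset_id):
    """Return the storage systems that nominally store the dataset."""
    for pod in pods:
        if pod.dataset_id == dataset_id:
            return set(pod.members)
    return set()


pod = PodDefinition("dataset-2112", {"sys-2114", "sys-2124"})
pod.add_system("sys-2128")
pod.remove_system("sys-2124")
```

Examining the definition with `replicas_for([pod], "dataset-2112")` then yields the current membership after the addition and removal.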

［0414］また、図２１に示される例のストレージシステムは、データセット（２１１２）を同期複製するために使用される複数のストレージシステム（２１１４、２１２４、２１２８）のそれぞれの間に１つ以上のデータ通信リンク（２１１６、２１１８、２１２０）も含む。図２１に示される例の方法では、ポッド内のストレージシステム（２１１４、２１２４、２１２８）は、ともに高帯域幅データ転送、並びにクラスタ通信、ステータス通信、及び管理通信のために互いと通信する。これらの別個のタイプの通信は、同じデータ通信リンク（２１１６、２１１８、２１２０）を介している、又は代替実施形態では、これらの別個のタイプの通信は、別々のデータ通信リンク（２１１６、２１１８、２１２０）を介するであろう。 [0414] The example storage system shown in FIG. 21 also includes one or more data communication links (2116, 2118, 2120) between each of the multiple storage systems (2114, 2124, 2128) used to synchronously replicate the data set (2112). In the example method shown in FIG. 21, the storage systems (2114, 2124, 2128) within a pod communicate with each other for both high-bandwidth data transfers, as well as cluster communication, status communication, and management communication. These separate types of communication may be over the same data communication links (2116, 2118, 2120), or in alternative embodiments, these separate types of communication may be over separate data communication links (2116, 2118, 2120).

［0415］データセットを同期複製するストレージシステムを実装するための追加の詳細は、その全体として参照により含まれる米国仮特許第６２／４７０，１７２号及び大６２／５１８，０７１号の中で見つけられてよい。 [0415] Additional details for implementing a storage system that synchronously replicates data sets may be found in U.S. Provisional Patent Nos. 62/470,172 and 62/518,071, which are incorporated by reference in their entireties.

［0416］図２１に示されるように、複数のストレージシステム（２１１４、２１２４、２１２８）は、データセット（２１１２）を同期複製しており、ホストコンピューティングデバイスからＩ／Ｏ要求を受け取り、処理するための通常の動作中互いと通信してよい。しかしながら、一部の例では、ストレージシステム（２１１４、２１２４、２１２８）の１つ以上は、故障する、再起動する、アップグレードする、又はそれ以外の場合利用できない場合があり、その結果、１つ以上のストレージシステム（２１１４、２１２４、２１２８）は同期外れになる場合がある。通常の動作を再開するために、同期中のストレージシステム及び同期していないストレージシステムは、回復動作及び再同期を経る－回復は、その全体として本明細書に参照により含まれる出願参照番号第１５／６９６，４１８号により詳細に説明され、再同期は以下に説明される。 [0416] As shown in FIG. 21, multiple storage systems (2114, 2124, 2128) synchronously replicate a dataset (2112) and may communicate with each other during normal operation to receive and process I/O requests from host computing devices. However, in some instances, one or more of the storage systems (2114, 2124, 2128) may fail, reboot, upgrade, or otherwise become unavailable, resulting in one or more of the storage systems (2114, 2124, 2128) becoming out of sync. To resume normal operation, the in-sync and out-of-sync storage systems undergo recovery operations and resynchronization—recovery is described in more detail in Application Serial No. 15/696,418, which is incorporated herein by reference in its entirety, and resynchronization is described below.

［0417］ポッドに加えられるストレージシステムの初期同期－又は、ポッドからデタッチされたストレージシステムの後続の再同期－は、そのストレージシステムがポッドサービスを提供する際のアクティブな使用のためにオンラインにされる前に、ポッドのための同期中のストレージシステムから初期化されていない、つまり同期していないストレージシステムにすべてのコンテンツ、又はすべての紛失したコンテンツをコピーすることを含む。係る初期同期は、ポッドの拡張部として導入されるストレージシステムごとに実行されてよい。 [0417] Initial synchronization of a storage system being added to a pod—or subsequent resynchronization of a storage system detached from a pod—involves copying all content, or all missing content, from a synchronizing storage system for the pod to an uninitialized, or unsynchronized, storage system before that storage system is brought online for active use in providing pod services. Such an initial synchronization may be performed for each storage system being introduced as an extension of the pod.

［0418］ポッドに加えられたストレージシステムに対するコンテンツの初期同期と、ポッドのための同期中のストレージシステムに比してイベントのなんらかの集合を通して同期しなくなったストレージシステムを再同期することとの相違は、概念上きわめて類似している。例えば再同期の場合、同期中のポッドメンバーストレージシステムと同期していないポッドメンバーとの間で異なる場合があるすべてのブロックは、同期していないポッドメンバーがポッドのための同期中のポッドメンバーストレージシステムとしてオンラインに戻ることができる前に最新にされる。初期の同期では、これはすべてのブロックを更新することを含んでよく、結果的に、それはすべてのブロックが異なる場合がある再同期に概念上類似している。言い換えると、初期同期は、任意のボリュームが初期状態から修正される前、又は任意のボリュームが作成される又はポッドに加えられる前に、ポッドの始まりでデタッチされたストレージシステムを再度アタッチすることに同等と見なされてよい。 [0418] The difference between an initial synchronization of content to a storage system added to a pod and resynchronizing a storage system that has become out of sync through some set of events with the in-sync storage system for the pod is conceptually very similar. For example, in the case of a resynchronization, all blocks that may differ between the in-sync pod member storage system and the out-of-sync pod member are brought up to date before the out-of-sync pod member can come back online as the in-sync pod member storage system for the pod. In an initial synchronization, this may involve updating all blocks, and as a result, it is conceptually similar to a resynchronization where all blocks may differ. In other words, an initial synchronization may be considered equivalent to reattaching a storage system that was detached at the beginning of the pod, before any volumes are modified from their initial state or before any volumes are created or added to the pod.

［0419］概して、再同期は、デタッチされたポッドをポッドが同期しており、オンラインに戻ることができる点まで戻すために少なくとも２つの事柄、つまり（ａ）ポッドがデタッチされ、同期中のポッドメンバーによって保持されていなかった頃に、デタッチされたポッドで持続されていた任意の変更を取り消すこと、上書きすること、又はそれ以外の場合置換すること、及び（ｂ）ポッドのためのコンテンツ及び共通メタデータと一致するためにアタッチするストレージシステムを更新することを達成する。オンラインに戻されるためには、ストレージシステムの再アタッチは、同期複製を再有効化すること、対称的な同期複製を再有効化すること、及び再アタッチされたストレージシステムでポッドのための動作の受取り及び処理を再有効化することを含んでよい。ポッドのための動作は、読取り、データ修正動作、又は管理動作を含んでよい。 [0419] Generally, resynchronization accomplishes at least two things to return a detached pod to a point where the pod is in sync and can be brought back online: (a) undoing, overwriting, or otherwise replacing any changes persisted in the detached pod when it was detached and that were not being held by an in-sync pod member, and (b) updating the attaching storage system to match the content and common metadata for the pod. To be brought back online, reattaching a storage system may include reenabling synchronous replication, reenabling symmetric synchronous replication, and reenabling the receiving and processing of operations for the pod at the reattached storage system. Operations for the pod may include reads, data modification operations, or management operations.

［0420］ストレージシステムをデタッチするプロセスでは、なんらかの数の動作がポッドのために進行中である可能性がある。さらに、それらの動作のいくつかは、デタッチされたストレージシステムだけで持続した可能性があり、他の動作は、デタッチが処理された直後には同期中のままであったストレージシステムだけで持続した可能性があり、他の動作は、デタッチされたストレージシステムと同期中のままであったストレージシステムの両方で持続していた可能性がある。この例では、ポッドのための同期中の状態は、デタッチされたストレージシステムだけで持続された動作を記録していないため、ストレージシステムのデタッチ以降のポッドのための同期中のコンテンツ及び共通メタデータに対する任意の更新は、それらの更新を含まないであろう。これが－更新をアンドゥすることによって明示的に、又は再同期プロセスの一部としてそのコンテンツを上書きすることによって暗示的に、のどちらかで－これらの更新が取り消されるべきであろう理由である。同期中のストレージシステム自体では、これらは、デタッチされたストレージシステムの再アタッチを開始する前に説明される２つのリスト、つまり（ａ）デタッチ時に同期中未決動作リストと呼ばれることがあり、進行中であり、再アタッチするストレージシステムがポッドからデタッチされ、ポッドからのデタッチの後の任意の持続時間、同期中にとどまった動作のリスト、及び（ｂ）再アタッチするストレージシステムがポッドからデタッチされたときの時間帯の間のコンテンツ又は共通メタデータに対する変更のリストがある場合がある。さらに、ポッド及びストレージシステムの実施態様に応じて、同期中のストレージシステムと関連付けられた２つのリストは、単一のリスト、つまり再アタッチするストレージシステム上にないと知られていないコンテンツによって表されてよい。複数のストレージシステムがデタッチされ、特にそれらのストレージシステムが異なるポッドでは、各デタッチ以降の変更を追跡することは、別々のリストを生じさせる―及びそれらのリストがどのようにして記述されるのかは、あるポッドの実施態様から別のポッドの実施態様で大幅に異なる場合がある。一部の場合では、デタッチの時点から変更を追跡し、それらの変更をアタッチするストレージシステムにコピーすることを超えた追加の問題は、再同期中に受け取られる新しい修正動作が、アタッチするストレージシステムに適用されることを確実にすることである。概念上、この問題は、データをコピーする動作、及びポッドによって受け取られる修正動作の処理が、結果が、アタッチの最後で、及びアタッチするストレージシステムがポッドのために同期中であると見なす前に正しく最新となるように、マージされてよいことを確実にすることとして記述されてよい。 [0420] In the process of detaching a storage system, any number of operations may have been in progress for the pod. Furthermore, some of those operations may have persisted only on the detached storage system, others may have persisted only on the storage system that remained in sync immediately after the detach was processed, and other operations may have persisted on both the detached storage system and the storage system that remained in sync. In this example, because the in-sync state for the pod does not record the operations that were persisted only on the detached storage system, any updates to the in-sync content and common metadata for the pod since the storage system detachment will not include those updates. 
This is why these updates would need to be undone, either explicitly by undoing the updates or implicitly by overwriting the content as part of the resynchronization process. On the in-sync storage systems themselves, there may be two lists to account for before initiating reattachment of the detached storage system: (a) a list of operations that were in progress at the time of detachment and that persisted on the storage systems that remained in sync for any duration after the reattaching storage system was detached from the pod, sometimes referred to as the in-sync pending operations list at detach, and (b) a list of changes to content or common metadata made during the period when the reattaching storage system was detached from the pod. Furthermore, depending on the implementation of the pod and storage system, the two lists associated with the in-sync storage systems may be represented by a single list, i.e., content that is not known to be on the reattaching storage system. In pods with multiple detached storage systems, and especially where those storage systems differ, tracking changes since each detachment results in separate lists, and how those lists are described may vary significantly from one pod implementation to another. In some cases, an additional issue beyond tracking changes from the point of detachment and copying those changes to the attaching storage system is ensuring that new modification operations received during resynchronization are applied to the attaching storage system. Conceptually, this issue can be described as ensuring that the operations of copying data and the processing of modification operations received by the pod can be merged so that the result is correctly up to date at the end of the attach and before the attaching storage system is considered in sync for the pod.

［0421］単一の変更されたコンテンツ再同期に関して、再同期のための１つのモデルは、同期中のストレージシステムとアタッチするストレージシステムとの間で異なる場合があるブロックの完全なリストーデタッチされたブロックのリストを生成すること、及び修正動作がフォロワーストレージシステムに対して起こるにつれ、任意の修正動作の複製を開始することである。異なる場合があるブロックの完全なリストは、同期中のストレージシステムからのデタッチ時の同期中の未決動作からのもの、アタッチするストレージシステムからのデタッチの時点での未決動作、及びデタッチ以来変化したと知られていたブロックを含んでよい。修正動作は、説明されるように、それらの修正コンテンツを記憶する場合があり、再同期は、デタッチされたブロックのリストからブロックの範囲の場所を突き止め、同期中のストレージシステムからアタッチするストレージシステムへ、セクションでこれらのブロックをコピーすることにより進んでよい。この例では、特定のセクションをコピーしている間に、コピーされているセクションと重複する、入信する修正動作は、コピーの間阻止される場合もあれば、セクションがコピーされた後にそれらの修正動作を適用するための準備がなされてよい。この解決策は、例えばＥＸＴＥＮＤＥＤ＿ＣＯＰＹ動作の仮想化された実施態様等、仮想ブロック範囲コピー動作にとって問題を生じさせる場合がある。さらに、コピーのためのソース範囲は、まだ再同期されていない場合があり、しかもターゲット範囲はすでに再同期されている場合がある。つまり、仮想ブロック範囲コピー動作の簡単な実施態様は、データが、仮想ブロック範囲コピー動作が受け取られるときに既知ではないため（実施態様に応じて）正しいデータをターゲット範囲にコピーできない、又は再同期動作自体が、ターゲット範囲が、それがその最終的な形式で絶対に再同期されなかったとき、ターゲット範囲が正しく同期されたと推定した可能性があるため、ターゲット範囲を正しく再同期できない場合がある。しかしながら、この問題にはいくつかの解決策がある。１つの解決策は、再同期中に仮想ブロック範囲コピー動作を許可しないことである。これは、仮想ブロック範囲コピー動作―クライアントがファイルコピー動作及びバーチャルマシンクローン又は以降動作を操作することを含んだ―の共通使用が、通常、読取り要求及び書込み要求のシーケンスを通してそれ自体直接的にコンテンツをコピーすることによって仮想ブロック範囲コピーに応答するため、多くの場合に機能し得る。別の解決策は、任意の仮想アドレス範囲コピー動作のターゲットアドレス範囲を上書きし、次いでソースデータが利用可能になるとき上書きを説明しながら、コピー動作を実行する修正動作ではなく、不完全な仮想範囲コピー動作を記憶にとどめることである。再同期のターゲットが、コピーのためのソースデータが正しくないことを知らない場合があることを所与として、すべての係る動作は、完全なコピーが完了するまで延期されなければならない可能性がある。再同期のターゲットが、どの領域がまだコピーされていないのかを認識させられる、又はいつ再同期がボリュームの特定の領域を処理することを完了したのかを認識し得る場合、最適化が起こり得る。 [0421] With regard to single changed content resynchronization, one model for resynchronization is to generate a complete list of blocks that may differ between the synchronizing storage system and the attaching storage system—a list of detached blocks—and begin replicating any modification operations as they occur to the follower storage system. 
The complete list of blocks that may differ may include blocks from the in-sync pending operations at the time of detachment on the in-sync storage systems, blocks from operations pending on the attaching storage system at the time of detachment, and blocks known to have changed since detachment. The modification operations may store their modified content, as described, and resynchronization may proceed by locating ranges of blocks from the list of detached blocks and copying these blocks in sections from the in-sync storage system to the attaching storage system. In this example, while a particular section is being copied, incoming modification operations that overlap the section being copied may be blocked for the duration of the copy, or provisions may be made to apply those modification operations after the section has been copied. This approach can create problems for virtual block range copy operations, such as virtualized implementations of the EXTENDED_COPY operation. In particular, the source range for the copy may not yet be resynchronized while the target range may already be resynchronized. That is, a simple implementation of a virtual block range copy operation may (depending on the implementation) fail to copy the correct data to the target range because the correct source data is not yet known when the virtual block range copy operation is received, or the resynchronization operation itself may fail to resynchronize the target range correctly because it may have assumed that the target range was already correctly synchronized when it had never been resynchronized in its final form. However, there are several solutions to this problem. One solution is to not allow virtual block range copy operations during resynchronization. 
This can work in many cases because common users of virtual block range copy operations, including clients performing file copy operations and virtual machine clone or migration operations, typically respond to a virtual block range copy that is not accepted by copying the contents themselves directly through a sequence of read and write requests. Another solution is to remember the incomplete virtual range copy operations, rather than the modification operations that overwrite the target address range of any virtual address range copy operation, and then to perform the copy operations, while accounting for the overwrites, when the source data becomes available. Given that the resynchronization target may not know that the source data for the copy is incorrect, all such operations may have to be postponed until the full copy is complete. Optimizations are possible if the resynchronization target can be made aware of which regions have not yet been copied, or can determine when the resynchronization has finished processing a particular region of the volume.
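The section-by-section copy described above, in which modification operations overlapping the section being copied are applied only after that section has been copied, can be sketched as follows. This is an illustrative model under stated assumptions: storage systems are dictionaries of block number to data, `incoming` is a hypothetical map from a section index to the writes that arrive while that section is being copied, and virtual block range copy operations are not modeled.

```python
def resync_sections(source, target, changed_blocks, section_size, incoming):
    """Copy changed blocks from source to target one section at a time."""
    sections = {}
    for block in sorted(changed_blocks):
        sections.setdefault(block // section_size, []).append(block)

    for index in sorted(sections):
        low, high = index * section_size, (index + 1) * section_size
        # Read the section from the in-sync system before new writes land.
        staged = {b: source[b] for b in sections[index]}
        deferred = {}
        for block, data in incoming.get(index, {}).items():
            source[block] = data           # the in-sync system applies the write
            if low <= block < high:
                deferred[block] = data     # overlaps the section: hold it back
            else:
                target[block] = data       # no overlap: replicate immediately
        target.update(staged)              # finish copying the section
        target.update(deferred)            # then apply the held-back writes
    return target


source = {0: "a", 1: "b", 4: "e", 5: "f"}
target = {0: "a", 1: "old", 4: "old", 5: "f"}
incoming = {0: {1: "B", 5: "F"}}  # writes arriving while section 0 is copied
resync_sections(source, target, {1, 4}, 4, incoming)
```

In this run the write to block 1 arrives while its section is in flight; deferring it prevents the staged, older copy of block 1 from overwriting the newer data.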

［０４２２］ストレージシステムを再同期することの別の態様は、更新されたブロック追跡である場合がある。例えば、ストレージシステムがデタッチされている間に修正されるすべての個々のブロックのリストを保つ（次いで、それらを個別に再同期する）ことは、延長された機能停止期間が多数のブロックを生じさせる場合があるため、一部の場合では実際的ではない場合がある－いくつかのストレージシステムは、不連続ブロックの大きい集合体を非常に効率的に読み取ることができない。結果的に、一部の場合では、例えばボリュームの１ＭＢ範囲等、追跡範囲を開始することが追跡されたメタデータの量を削減するためにはより実際的である場合がある。このコース気質の追跡は、より短期の動作追跡に遅れて更新される場合があり、数分ダウンしているのか、数時間ダウンしているのか、数日ダウンしているのか、又は数週間ダウンしているのかに関わりなく、任意の時代遅れのストレージシステムの再同期を処理するために必要とされる限り、保護されてよい。機械的なスピニングストレージとは対照的に、ソリッドステートストレージを用いると、ボリュームの、又はボリュームの集合体の、又は完全なポッドのどの個々のブロックなのかを追跡することは、変化したそれらの個々のブロックだけを再同期することがきわめて実際的であるように、きわめて実際的である場合がある。概して、無作為な読取り及び書込みペナルティはほとんどなく、マルチレベルマップから読み取ることに対するペナルティは少ししかなく、結果的に短期間（例えば、１００ミリ秒から１０秒の範囲内、又は数百の動作ごとに又は数千の動作ごとに）にわたる動作としてのきめの細かい活動を、すべての修正されたブロックに名前を付けるきめの細かいマップにマージすることは相対的に容易である。さらに、最近の活動のリストは、最近ジャーナルデバイス（高書込み帯域幅及び高上書き速度をサポートすることを目的とした多様な特色のＮＶＲＡＭ等の高速書込みストレージ）に記録されているコンテンツ修正をカバーするリストであってよいが、それらの修正についてのメタデータはおそらく実際のコンテンツよりも長い期間ジャーナルに保たれる。この例では、すべての活動のマージされたリストは、各ビットがブロック又はブロックの小さいグループを表すビットマップである場合もあれば、マージされたリストは、ボリューム単位で例えばＢツリー等のツリー構造に編成されたブロック番号のリスト又はブロック範囲のリストである場合もある。近傍のブロック番号は、あるブロック番号から別のブロック番号への差異として記憶されてよいため、ブロック番号の係るリストは容易に圧縮されてよい。 [0422] Another aspect of resynchronizing a storage system may be updated block tracking. For example, keeping a list of all individual blocks modified while a storage system is detached (and then resynchronizing them individually) may not be practical in some cases because an extended outage may result in a very large number of blocks, and some storage systems cannot read large collections of non-contiguous blocks very efficiently. Consequently, in some cases, it may be more practical to begin tracking regions, such as 1 MB ranges of a volume, to reduce the amount of tracked metadata. This coarse-grained tracking may be updated lagging behind shorter-term operation tracking and may be preserved for as long as needed to handle resynchronization of any out-of-date storage system, regardless of whether it is down for minutes, hours, days, or weeks. 
With solid-state storage, as opposed to mechanical spinning storage, it may be quite practical to track which individual blocks of a volume, of a collection of volumes, or of a complete pod have been modified, and equally practical to resynchronize only those individual blocks that have changed. In general, there is little penalty for random reads and writes and only a small penalty for reading from a multilevel map, so it is relatively easy to merge fine-grained activity, recorded as operations over short periods (e.g., in the range of 100 milliseconds to 10 seconds, or every few hundred or every few thousand operations), into a fine-grained map that names all modified blocks. Furthermore, the list of recent activity may be a list covering content modifications that have recently been recorded to a journal device (fast-write storage, such as various flavors of NVRAM, intended to support high write bandwidth and high overwrite rates), although metadata about those modifications is likely kept in the journal for a longer period than the actual content. In this example, the merged list of all activity may be a bitmap where each bit represents a block or a small group of blocks, or the merged list may be a list of block numbers or block ranges organized per volume in a tree structure, such as a B-tree. Because nearby block numbers may be stored as the difference from one block number to the next, such a list of block numbers may be easily compressed.
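Two of the tracking ideas in this paragraph lend themselves to a short sketch: collapsing modified byte offsets into coarse tracking regions, and storing a sorted list of block numbers as differences between neighbors. The function names and the 1 MB default are assumptions chosen to mirror the example in the text, not details taken from the disclosure.

```python
REGION_SIZE = 1 << 20  # 1 MB tracking regions (assumed granularity)


def dirty_regions(write_offsets, region_size=REGION_SIZE):
    """Coarse-grained tracking: collapse byte offsets into region indexes."""
    return sorted({offset // region_size for offset in write_offsets})


def delta_encode(block_numbers):
    """Store sorted block numbers as the first value plus successive gaps."""
    ordered = sorted(set(block_numbers))
    return [b - a for a, b in zip([0] + ordered, ordered)]


def delta_decode(deltas):
    """Rebuild the sorted block numbers from the stored gaps."""
    blocks, current = [], 0
    for d in deltas:
        current += d
        blocks.append(current)
    return blocks


regions = dirty_regions([0, 4096, 1 << 20, (3 << 20) + 5])
encoded = delta_encode([1003, 1000, 1001, 1010])
```

Here four writes collapse into three regions, and the four block numbers are stored as one base value followed by small gaps, which is what makes such lists compress well.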

［0423］また、ストレージシステムを再同期させることは、シーケンス番号を追跡することによるブロック追跡を含んでもよい。例えば、いくつかのストレージシステムは、通常動作中―すべての修正について―それぞれのシーケンス番号をそれぞれの修正と関連付ける。係る場合には、デタッチの頃にデタッチされたアレイに複製されなかった可能性がある任意のコンテンツを含んだ、デタッチ以来修正されたすべてのコンテンツを見つけるためには、ポッドからデタッチされたストレージシステムと同期したと知られている最後のシーケンス番号が、ポッドのための同期中のストレージシステムを問い合わせるために必要とされるすべてである。 [0423] Resynchronizing storage systems may also include block tracking by tracking sequence numbers. For example, some storage systems associate a respective sequence number with each modification during normal operation—for every modification. In such cases, the last sequence number known to be synchronized with the detached storage system from the pod is all that is needed to query the in-sync storage system for the pod to find all content modified since detachment, including any content that may not have been replicated to the detached array around the time of detachment.
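The sequence-number query described above can be illustrated with a minimal sketch. It assumes, hypothetically, that an in-sync storage system keeps a journal of (sequence number, block) pairs; the function name `modified_since` and the sample values are illustrative only.

```python
def modified_since(journal, last_synced_seq):
    """Return blocks touched by modifications newer than last_synced_seq.

    `journal` is an iterable of (sequence_number, block) pairs kept by an
    in-sync storage system.
    """
    return sorted({block for seq, block in journal if seq > last_synced_seq})


journal = [(101, 7), (102, 9), (103, 7), (104, 12)]
to_copy = modified_since(journal, 102)
```

Only the last sequence number known to be synchronized with the detached storage system (102 in this sketch) is needed to find everything that must be copied.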

［0424］また、ストレージシステムを再同期させることは、スナップショットとして変更を追跡することを含んでもよい。例えば、スナップショットは、過去のなんらかのとき以来の変更を追跡するために使用されてよく、ストレージシステムは、完了したと知られていないコンテンツを除外することによって、デタッチのときにスナップショットを製造してよい。代わりに、スナップショットは、規則正しく、又はなんらかの周期性をもって作成されてよく、スナップショット作成の時点は、どのスナップショットがデタッチされたストレージシステムを再同期させるための基礎となり得るのかを判断するために、デタッチの時点に比較されてよい。変形形態として、デタッチの前にポッド全体で作成されたスナップショットは、ポッドのための同期中のストレージシステムとデタッチされたストレージシステムの両方に存在するべきであり、再同期のために多様な方法で使用されてよい。例えば、再アタッチされているストレージシステムのコンテンツは、デタッチに先行するその最後に同期されたスナップショットに戻され、ポッド内の現在同期中のコンテンツと一致するためにその時点から前に進められてよい。概して、スナップショットは、以前のスナップショットに対する差異を示す又は現在のコンテンツに対する差異を示す。スナップショットのこれらの特徴を使用し、再アタッチするストレージシステムにコンテンツを再同期させることは、再アタッチの時点と前回の完全なデタッチ前の同期スナップショットの時点との差異を複製することを含んでよい。一部の場合では、再同期は、フォールバックとしてスナップショットベースのモデルを使用してよい。例えば、（約数分の機能停止期間等の）短い機能停止期間は、きめの細かい追跡、又はストレージシステムがデタッチしたとき以来に発生した記録再生動作を通して処理されてよく、より長い機能停止期間は、数分ごとに撮影されるスナップショットに戻ることによって処理されてよい－閾値分数は、デフォルト値である場合もあれば、ユーザ又は管理者により指定される場合もある。相対的に不定期に起こるスナップショットは、低い長期オーバヘッドを有する場合があるが、再起動されるためにより多くのデータを生成する場合があるため、係る構成は、実際的であってよい。例えば－一部の場合では、デタッチの５分前に撮影されたスナップショットは、最大５分間分のコンテンツ修正を転送するのに対して－１０秒の機能停止期間は、記録された動作をリプレイすることによって処理されてよく、再同期は１０秒以下で起こり得る。他の場合では、機能停止期間後の再同期は、例えば短期マップの蓄積サイズに対する制限によって等、蓄積された変更に基づく場合がある。 [0424] Resynchronizing the storage system may also include tracking changes as snapshots. For example, snapshots may be used to track changes since some time in the past, and the storage system may create a snapshot at the time of detachment by filtering out content that is not known to be complete. Alternatively, snapshots may be created regularly or with some periodicity, and the time of snapshot creation may be compared to the time of detachment to determine which snapshots may serve as a basis for resynchronizing the detached storage system. As a variation, snapshots created for the entire pod before detachment should exist on both the synchronizing storage system for the pod and the detached storage system and may be used in various ways for resynchronization. 
For example, the contents of the storage system being reattached may be reverted to its last synchronized snapshot prior to detachment and rolled forward from that point to match the currently in-sync content in the pod. Generally, snapshots show differences relative to a previous snapshot or relative to the current content. Using these snapshot features, resynchronizing content to a reattaching storage system may involve replicating the differences between the time of reattachment and the time of the last fully synchronized snapshot before detachment. In some cases, resynchronization may use a snapshot-based model as a fallback. For example, short outages (such as outages on the order of a few minutes) may be handled through fine-grained tracking or by recording and replaying the operations that have occurred since the storage system detached, while longer outages may be handled by reverting to snapshots taken every few minutes; the threshold number of minutes may be a default value or may be specified by a user or administrator. Such a configuration may be practical because snapshots taken relatively infrequently may have low long-term overhead but may generate more data to be resynchronized. For example, a 10-second outage may be handled by replaying the recorded operations, so that resynchronization may occur in 10 seconds or less, whereas in some cases reverting to a snapshot taken five minutes before detachment would transfer up to five minutes' worth of content modifications. In other cases, the choice of resynchronization method after an outage may be based on accumulated changes, for example on a limit on the accumulated size of the short-term map.
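The fallback policy described in this paragraph can be sketched as a simple decision function. The function name, the 300-second threshold, and the limit of 10,000 tracked changes are hypothetical defaults chosen for illustration; the text only states that a threshold may be a default or may be specified by a user or administrator.

```python
def choose_resync_strategy(outage_seconds, tracked_changes,
                           threshold_seconds=300, max_tracked=10_000):
    """Pick replay of recorded operations or the snapshot-based fallback."""
    # Short outages with a small short-term map: replay recorded operations.
    if outage_seconds <= threshold_seconds and tracked_changes <= max_tracked:
        return "replay"
    # Otherwise fall back to the last snapshot synchronized before detach.
    return "snapshot"


short_outage = choose_resync_strategy(10, 50)
long_outage = choose_resync_strategy(3600, 50)
map_overflow = choose_resync_strategy(10, 50_000)
```

The third call shows the case mentioned at the end of the paragraph, in which the accumulated size of the short-term map, rather than elapsed time, triggers the fallback.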

［０４２５］一部の場合では、再同期は同期複製に基づく場合がある。例えば、上述されたスナップショットをベースにした再同期モデルは、別の再同期モデルもサポートしてよい。つまり、非同期複製又は周期的な複製をサポートするストレージシステムは、再同期中にコンテンツを複製するためにスナップショット機構を使用してよい。非同期複製モデル又は周期的な複製モデルは、短い機能停止期間の間に時代遅れのデータを潜在的にコピーしてよく、周期的な複製モデルはスナップショット又はチェックポイントの差分表現に基づいてよく、差分表現は機能停止期間を自動的に処理する。非同期複製に関しては、上記説明と同様に延長された機能停止期間のためのバックアップとしてのスナップショット又はチェックポイントに対する依存がある場合があり、結果として実施態様を結合する、又は再同期のために係る利用可能な非同期複製実施態様又は周期複製実施態様を活用することが実際的である場合がある。ただし、１つの問題は、非同期複製モデル又は周期複製モデルが複製ターゲットを最後まで最新に又は完全に同期中にさせるように構成され得ない点である。結果として、係る再同期実施態様を用いると、新しい飛行中動作は、飛行中動作が適用される場合があり、これによりアタッチするストレージシステムに対するすべての修正はポッドに対して最新であるように追跡され得る。 [0425] In some cases, resynchronization may be based on synchronous replication. For example, the snapshot-based resynchronization model described above may also support other resynchronization models. That is, a storage system that supports asynchronous or periodic replication may use a snapshot mechanism to replicate content during resynchronization. The asynchronous or periodic replication model may potentially copy out-of-date data during short outages, and the periodic replication model may be based on differential representations of snapshots or checkpoints, which handle outages automatically. With asynchronous replication, there may be a reliance on snapshots or checkpoints as a backup for extended outages, as described above. As a result, it may be practical to combine implementations or to leverage such available asynchronous or periodic replication implementations for resynchronization. However, one issue is that asynchronous or periodic replication models may not be configured to bring the replication target all the way up to date or fully in sync. As a result, with such a resynchronization implementation, new in-flight operations may still need to be tracked so that they can be applied, so that all modifications to the attaching storage system are up to date for the pod.

[0426] In some cases, resynchronization may be implemented as a multi-stage resynchronization. For example, in a first stage, content up to some point may be replicated from an in-sync storage system for the pod to the attaching storage system for the pod. In this example, a second snapshot may be taken during the attach, and the difference between a first snapshot, which was the last snapshot known to be synchronized before the detach, and the second snapshot may be replicated to the attaching storage system. Such a mechanism may bring the attaching storage system closer to being in sync than it was before the attach, but it may still not be up to date. Thus, a third snapshot may be created, and the difference between the third snapshot and the second snapshot may be determined and then replicated to the attaching storage system. This third snapshot, and the determined difference, may make up part of the difference between the content replicated up to the second snapshot and the current content. Additional snapshots may be taken and replicated to get within a few seconds of being up to date. At that point, modification operations may be paused until the last snapshot has been replicated, thereby bringing the attaching storage system up to date for the pod. In other cases, it may be possible, after replicating one or more snapshots, to switch to a mode in which modification operations received after the final resynchronization snapshot are processed so that they can be merged with the replicated snapshot content. Such an implementation may include having the attaching storage system track those modification operations and apply them after snapshot replication is complete, or after snapshot replication is known to have synchronized the particular volume regions affected by a particular modification operation. This implementation may have additional overhead, because tracking all operations until the underlying content is known to have been copied may result in a large number of tracked operations. An alternative is to consider the content involved in recently received operations, such as writes that depend on particular common metadata or extended copy operations from one block range to another, and to request that resynchronization prioritize processing of that content or common metadata. In this way, any received operations tied to content known to have been copied by such a process may then release their tracking structures much more quickly.
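As a non-limiting sketch of the multi-stage approach, the Python below replicates successive snapshot differences until a stage is small and then replicates a final difference while modifications are paused. Snapshots are modeled as plain dictionaries from block address to content, and the names snapshot_diff, apply_diff, multi_stage_resync, writes_during_stage, and small_enough are hypothetical stand-ins rather than elements of this disclosure.

```python
def snapshot_diff(older: dict, newer: dict) -> dict:
    """Return blocks that differ in `newer`; None marks a block removed since `older`."""
    diff = {addr: data for addr, data in newer.items() if older.get(addr) != data}
    diff.update({addr: None for addr in older if addr not in newer})
    return diff


def apply_diff(target: dict, diff: dict) -> None:
    """Apply a snapshot difference to the attaching system's copy."""
    for addr, data in diff.items():
        if data is None:
            target.pop(addr, None)
        else:
            target[addr] = data


def multi_stage_resync(live, target, last_synced, writes_during_stage, small_enough=1):
    """Replicate successive snapshot differences, then finish with writes paused.

    `writes_during_stage` is a hypothetical stand-in for host writes that reach
    the in-sync system while each stage is being replicated.
    """
    previous = dict(last_synced)          # first snapshot: last known in-sync point
    incoming = iter(writes_during_stage)
    stages = 0
    while True:
        current = dict(live)              # take the next snapshot
        diff = snapshot_diff(previous, current)
        live.update(next(incoming, {}))   # writes continue while the stage replicates
        apply_diff(target, diff)
        previous = current
        stages += 1
        if len(diff) <= small_enough:     # close enough to pause briefly
            break
    # Modification operations are assumed paused here, so `live` is stable.
    apply_diff(target, snapshot_diff(previous, dict(live)))
    return stages
```

In this sketch each stage tends to shrink because it only covers writes that arrived during the previous stage. The loop terminates for any finite sequence of write batches, and a real system would presumably also bound the number of stages.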

[0427] In some cases, resynchronization may be implemented using a directed acyclic graph of logical extents. As described above, replicated storage systems may be based on a directed acyclic graph of logical extents. In such storage systems, the resynchronization process may be expected to replicate the logical extent graph, including all leaf logical extent content, from an in-sync storage system for the pod to the attaching storage system for the pod, and to ensure that the graph is synchronized and stays synchronized before the attaching storage system is enabled as an in-sync pod member. Resynchronization in this model may proceed by having the target storage system for the attach retrieve the top-level extent identity for each volume (or for each file or object in a file- or object-based storage system). Any logical extent identity already known to the attach target may be considered up to date, while any unknown composite logical extent may be retrieved and then decomposed into its underlying leaf or composite logical extents, each of which is either already known or unknown to the attach target. Further, for any unknown leaf logical extent, either its content may be retrieved, or the identities of its stored blocks may be retrieved to determine whether the blocks are already stored by the target storage system, with unrecognized blocks then being retrieved from an in-sync storage system. However, such an approach may not always result in resynchronization, because some number of extents around the time of the detach may have the same identity but different content, since only operations that marked a logical extent read-only may form a new logical extent as a result of a modification operation. Furthermore, in-progress modification operations may have completed differently on different storage systems during the failure that led to the detach, and if those modification operations were to non-read-only logical extents, then those logical extents may have the same identity but different content on the two storage systems. Several solutions are possible. When one set of storage systems for a pod detaches another storage system, that set may mark the leaf and composite logical extents associated with in-progress modification operations and associate those extents with any future reattach operation involving the detached storage system. Similarly, the reattaching storage system for the pod may identify the leaf and composite logical extents that it knew to be associated with in-progress operations. The result is two sets of logical extents whose content (for leaf extents) or references (for composite logical extents) need to be transferred, in addition to any unknown leaf or composite logical extents. Alternatively, coordinated snapshots may be taken periodically within the replicated pod, and the target of a reattach operation may ensure that logical extents created after the last coordinated snapshot are discarded or ignored during resynchronization. As yet another alternative, during the period in which a storage system is detached from the pod, the remaining in-sync storage systems may produce a snapshot that represents the content from all completed operations and replay all potentially in-progress operations so that they apply to pod content that follows the snapshot, with the result that any content not already replicated to the detached storage system is given new logical extent identities that the detached storage system could never have received.
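The traversal described above might be sketched, purely as an illustration, as follows. The graph is modeled as a dictionary from extent identity to a ("leaf", content) or ("composite", child identities) pair, and suspect_ids stands in for the extents that either side marked as associated with in-progress operations at the time of the detach. All names are hypothetical.

```python
def resync_extent_graph(source, target, top_level_ids, suspect_ids=frozenset()):
    """Copy to `target` each reachable extent that it lacks or cannot trust.

    Extents whose identity the target already knows are treated as up to date
    and their subtrees are not descended, unless the identity is in
    `suspect_ids`, in which case the extent is transferred again.
    """
    transferred = []
    visited = set()
    # Suspect extents are queued directly because they may sit below a
    # composite extent that the target already knows and therefore skips.
    stack = list(top_level_ids) + [e for e in suspect_ids if e in source]
    while stack:
        ext_id = stack.pop()
        if ext_id in visited:
            continue
        visited.add(ext_id)
        if ext_id in target and ext_id not in suspect_ids:
            continue
        kind, payload = source[ext_id]
        target[ext_id] = (kind, payload)
        transferred.append(ext_id)
        if kind == "composite":
            stack.extend(payload)   # decompose into underlying extents
    return transferred
```

The sketch transfers whole leaf content. A variant could instead retrieve the block identities of an unknown leaf extent and fetch only the blocks the target does not already store.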

[0428] Another issue that resynchronization implementations may face is getting extent graph-based synchronous replication fully in sync and running live. For example, resynchronization may proceed by first transferring a more recent snapshot, such as a snapshot created at the start of the attach, by having the target storage system retrieve it from an in-sync storage system as described above, with the target incrementally requesting the leaf and composite logical extents that it does not have. This process may include accounting for operations that were in progress at the time of the detach, and at the end of this process, the content up to that more recent snapshot is synchronized between the in-sync storage systems for the pod and the attaching storage system. This process may be repeated with another snapshot, and possibly with additional snapshots, to bring the target storage system closer to the in-sync storage systems. At some point, however, live data may also need to be transferred. To do this, replication of live modification operations to the attaching storage system may be enabled after the last resynchronization snapshot, so that all modification operations not included in the snapshot are delivered to the attaching storage system. This implementation gives rise to operations that describe modifications to leaf and composite logical extents included in the snapshot, and these descriptions may include creating new leaf and composite logical extents (with specified content) or replacing existing leaf and composite logical extents with modified copies of those extents that have new identities. Where the description of an operation creates new logical extents or replaces logical extents already known to the attaching storage system, the operation may be processed normally, as if the attaching storage system were in sync. Where the description of an operation includes replacing at least one logical extent not already known to the attaching storage system, the operation may be made durable to allow it to complete, but full processing of the operation may be delayed until the logical extent being replaced has been received. Further, to reduce the overhead associated with operations waiting on such logical extent content transfers, the attaching storage system may prioritize those logical extents so that they are retrieved earlier than other logical extents. In this example, depending on how efficiently the storage system can handle operations that are waiting on such existing logical extents, there may be no reason to transfer any series of snapshot images before enabling live operations. Instead, a resynchronization snapshot describing the state as of the detach (or as of some point before the detach) could be transferred, with operations processed as described above while the snapshot is transferred from an in-sync storage system to the attaching storage system, also as described above.
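One way an attaching storage system might handle live operations that replace extents it has not yet received is sketched below, for illustration only. The class AttachingSystem, the operation fields "replaces", "new_id", and "patch", and the modeling of extent content as a dictionary are assumptions made for this sketch. Replaced extents are kept rather than deleted, on the assumption that an earlier snapshot may still reference them.

```python
from collections import deque


class AttachingSystem:
    def __init__(self, known_extents):
        self.extents = dict(known_extents)  # extent id -> {offset: data}
        self.durable_log = []               # operations persisted so they can complete
        self.deferred = {}                  # missing extent id -> operations waiting on it
        self.priority_fetch = deque()       # extents to retrieve ahead of the others

    def receive_operation(self, op):
        """Persist `op`; apply it now if its base extent is known, else defer it."""
        self.durable_log.append(op)
        base_id = op["replaces"]
        if base_id is None or base_id in self.extents:
            self._apply(op)
            return "applied"
        self.deferred.setdefault(base_id, []).append(op)
        if base_id not in self.priority_fetch:
            self.priority_fetch.append(base_id)
        return "deferred"

    def receive_extent(self, ext_id, content):
        """Store an extent transferred by resynchronization and release waiters."""
        self.extents[ext_id] = dict(content)
        self._extent_available(ext_id)

    def _apply(self, op):
        base = {} if op["replaces"] is None else self.extents[op["replaces"]]
        # The modified copy gets a new identity; the base extent is left in place.
        self.extents[op["new_id"]] = {**base, **op["patch"]}
        self._extent_available(op["new_id"])

    def _extent_available(self, ext_id):
        if ext_id in self.priority_fetch:
            self.priority_fetch.remove(ext_id)
        for op in self.deferred.pop(ext_id, []):
            self._apply(op)
```

The priority_fetch queue corresponds to retrieving awaited extents earlier than others, which keeps the set of deferred operations small.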

[0429] In some cases, a problem facing resynchronization implementations is preserving block references during resynchronization. For example, in a synchronously replicated storage system, a particular written block, or a particular set of blocks associated with an operation, may be given an identity that is included in the operation description for the write of that block or set of blocks. In this example, a new write that replaces some or all of that block or set of blocks may supply a new identity for the block or set of blocks. This new identity may be constructed from a secure hash of the block content (for example, using SHA-256 or some other mechanism with a suitably small probability of different blocks producing the same hash value), or the new identity may simply identify the write itself in a unique way, regardless of whether two writes contained identical block content. For example, the new identity may be a sequence number or a timestamp. Further, if the new identity for a block or set of blocks is shared in the distributed description of the write operation and stored in some map on each storage system as part of writing the block or set of blocks, then a logical extent may describe its content in terms of these block or block-set identities. In such an implementation, resynchronization of a leaf extent may reference blocks or sets of blocks already stored on the attaching storage system rather than transferring them from an in-sync storage system. This implementation may reduce the total data transferred during resynchronization. For example, data that had already been written to the attaching storage system around the time of the detach, but that was not included in the resynchronization snapshot, may have been stored with its identity and may not need to be transferred again, because the identity of that block or set of blocks is already known and stored. Further, if some number of virtual extended copy operations caused block references to be copied between two leaf logical extents during the period in which the storage system was detached, then the identities of the blocks or sets of blocks may be used to ensure that the virtually copied blocks are not transferred twice.
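The two kinds of block identity described above, and their use in avoiding repeated transfers, might be sketched as follows. The functions content_identity, write_identity, and resync_leaf_extent are hypothetical names, and block stores are modeled as plain dictionaries keyed by identity.

```python
import hashlib
import itertools

_sequence = itertools.count(1)


def content_identity(block: bytes) -> str:
    """Identity derived from a secure hash of the block content."""
    return hashlib.sha256(block).hexdigest()


def write_identity() -> str:
    """Identity that names the write itself, here a simple sequence number."""
    return f"seq-{next(_sequence)}"


def resync_leaf_extent(leaf_block_ids, source_blocks, target_blocks):
    """Transfer only the blocks of a leaf extent that the target does not hold.

    Blocks whose identity is already stored on the target are referenced in
    place, so a block referenced twice (for example after a virtual copy) is
    sent at most once. Returns the identities actually transferred.
    """
    sent = []
    for block_id in leaf_block_ids:
        if block_id not in target_blocks:
            target_blocks[block_id] = source_blocks[block_id]
            sent.append(block_id)
    return sent
```

With hash-based identities, identical content written twice shares one identity. With sequence-based identities, each write is distinct even when the content matches.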

[0430] In some cases, a resynchronization implementation may use a content-addressable store in which stored blocks have unique identities that may be based on a secure hash of the block content. In this example, resynchronization may proceed by transferring to the attaching storage system a list of all block identities related to the pod on an in-sync storage system, along with a mapping of those block identities to the volumes (or files or objects) in the pod. In this case, the attach operation, which may be integrated with the processing of live operations that change the mapping from volumes to content, may proceed by transferring from an in-sync storage system for the pod those blocks that the attaching storage system does not recognize. Further, if some earlier version of the mapping from pod content to block identities is known from before the storage system detached from the pod, then the differences between that earlier version and the current version may be transferred instead of the entire mapping.
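As an illustrative sketch under stated assumptions, the Python below transfers only the changed portion of a volume-to-block-identity mapping and then fetches the blocks the attaching system does not recognize. The mapping is modeled as a dictionary keyed by (volume, offset), and the names block_id, mapping_delta, and attach_with_content_store are hypothetical.

```python
import hashlib


def block_id(content: bytes) -> str:
    """Content address of a block, here a SHA-256 digest."""
    return hashlib.sha256(content).hexdigest()


def mapping_delta(earlier: dict, current: dict) -> dict:
    """Entries whose block identity changed since `earlier`; None means unmapped."""
    delta = {key: ident for key, ident in current.items() if earlier.get(key) != ident}
    delta.update({key: None for key in earlier if key not in current})
    return delta


def attach_with_content_store(delta, source_store, target_store, target_mapping):
    """Apply a mapping delta on the attaching system, fetching unknown blocks only."""
    fetched = []
    for key, ident in delta.items():
        if ident is None:
            target_mapping.pop(key, None)
            continue
        if ident not in target_store:
            target_store[ident] = source_store[ident]   # transfer from in-sync system
            fetched.append(ident)
        target_mapping[key] = ident
    return fetched
```

When no earlier version of the mapping is known, passing an empty dictionary as `earlier` yields the entire current mapping, which corresponds to transferring the full list.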

[0431] Continuing with the example method shown in FIG. 21, the example method includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128). Identifying (2102) the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128) may be implemented using a variety of techniques, as described in detail above. One example technique includes one or more controllers using block tracking to keep a list of all individual blocks that are modified while any given storage system, such as the storage system (2124), is detached, as described in detail above. Other example techniques include using a combination of block tracking and snapshots, or block tracking by sequence number, among other techniques, as described in detail above.

[0432] In this example, block tracking is used to generate a list of all individual blocks that are modified on the in-sync storage systems. Block tracking begins when the storage system (2124) is detected as having detached and continues until the storage system (2124) is reattached, which is when the next step, the synchronizing (2104) described below, begins. Further, a "detached" storage system may be considered to be a storage system that is listed as a pod member but is not listed as in sync for the pod, where a storage system listed as a pod member is in sync for the pod if it is online, or currently available for actively serving data for the pod. In this example, each storage system member of the pod may have its own copy of a membership list indicating the member storage systems of the pod, where the membership list includes which storage systems are currently known to be in sync and which storage systems are included in the complete set of pod members. Generally, to be online for a pod, the membership list of a given storage system indicates that the given storage system is in sync for the pod, and the given storage system is able to communicate with all other storage systems that its membership list indicates as in sync for the pod. If a storage system cannot communicate with the other storage systems that its membership list indicates as in sync, then the storage system stops processing incoming requests that modify the dataset (or completes the requests with an error or exception) until it can verify that it is in sync again. A particular storage system may determine that a suspect storage system should be detached, which allows the particular storage system to continue operating on the basis of being in sync with the storage systems that its membership list indicates as in sync. Further, in this situation, to avoid a "split-brain" condition in which multiple isolated storage systems are processing I/O requests, the suspect storage system is prevented from continuing processing by the particular storage system, and the suspect storage system is prevented from requesting a mediation service to determine which storage systems continue processing I/O requests directed to the pod and which storage systems stop processing I/O requests directed to the pod. Additional details of such a mediation process are described in Application Reference No. 15/703,559, which is incorporated herein by reference in its entirety.
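The membership rules above might be sketched as follows, for illustration only. MembershipList, is_online, handle_modifying_request, and PodUnavailable are hypothetical names, and reachability is modeled as a simple set of peer identifiers supplied by the caller.

```python
from dataclasses import dataclass, field


class PodUnavailable(Exception):
    """Raised when a member cannot confirm that it is in sync for the pod."""


@dataclass
class MembershipList:
    members: set = field(default_factory=set)   # complete set of pod members
    in_sync: set = field(default_factory=set)   # members currently known to be in sync

    def detached(self) -> set:
        """Members listed for the pod but not listed as in sync."""
        return self.members - self.in_sync


def is_online(self_id, membership: MembershipList, reachable: set) -> bool:
    """Online means in sync and able to reach every other in-sync member."""
    if self_id not in membership.in_sync:
        return False
    return (membership.in_sync - {self_id}) <= reachable


def handle_modifying_request(self_id, membership, reachable) -> str:
    """Process a request that modifies the dataset only while online for the pod."""
    if not is_online(self_id, membership, reachable):
        raise PodUnavailable(f"{self_id} cannot verify that it is in sync")
    return "applied"
```

The sketch completes a request with an exception when the member is not online. Stalling the request until the member can verify that it is in sync again would be the other behavior the text allows.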

[0433] The example method of FIG. 21 also includes synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128). Synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) may be implemented using multiple techniques, as described above. One example technique for resynchronization includes, given the list of modified blocks generated while the storage system (2124) was detached, replicating on the reattached storage system (2124) any modification operations corresponding to the list of modified blocks. In this example, replicating the modification operations may be implemented similarly to how a follower storage system implements I/O operations provided by a leader storage system when the follower and leader storage systems are in sync, as described above with respect to processing I/O operations. As described above, resynchronization may also include modification operations and may store modified content, and resynchronization may be implemented by locating ranges of blocks from the list of detached blocks and copying those blocks in sections from an in-sync storage system to the attaching storage system. Other example techniques for synchronizing (2104) are described in more detail above.
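A minimal sketch of block tracking followed by range-based copying is given below, assuming that blocks are addressed by integer block numbers and that datasets can be modeled as dictionaries. The names BlockTracker, to_ranges, and copy_ranges are hypothetical and do not appear in this disclosure.

```python
def to_ranges(blocks):
    """Coalesce block numbers into inclusive (start, end) ranges."""
    ranges = []
    for block_no in sorted(blocks):
        if ranges and block_no == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], block_no)
        else:
            ranges.append((block_no, block_no))
    return ranges


class BlockTracker:
    """Tracks blocks modified on an in-sync system while a peer is detached."""

    def __init__(self):
        self.tracking = False
        self.modified = set()

    def on_detach(self):
        self.tracking = True
        self.modified.clear()

    def record_write(self, block_no):
        if self.tracking:
            self.modified.add(block_no)

    def on_reattach(self):
        """Stop tracking and return the ranges that must be copied."""
        self.tracking = False
        return to_ranges(self.modified)


def copy_ranges(ranges, source, target):
    """Copy each range, section by section, from the in-sync copy to the target."""
    for start, end in ranges:
        for block_no in range(start, end + 1):
            target[block_no] = source[block_no]
```

Coalescing into ranges lets adjacent blocks be copied as one section rather than as individual transfers.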

[0434] The example method shown in FIG. 21 also includes re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128). Re-establishing (2106) the synchronous replication relationship between the out-of-sync dataset (2113) and the in-sync dataset (2112) may be implemented similarly to initially establishing a synchronous replication relationship between storage systems when a pod is first created, as described with respect to FIGS. 4 through 7 of Application Reference No. 15/713,153, which is incorporated herein by reference in its entirety. That process includes: identifying (2102), for a dataset, a plurality of storage systems across which the dataset is to be synchronously replicated; configuring (2104) one or more data communication links between each of the plurality of storage systems to be used for synchronously replicating the dataset; exchanging (2106), between the plurality of storage systems, timing information for at least one of the plurality of storage systems; and establishing (2108) a synchronous replication lease according to the timing information for at least one of the plurality of storage systems, where the synchronous replication lease identifies a period of time during which the synchronous replication relationship is valid.

[0435] For further explanation, FIG. 22 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 22 is similar to the example method shown in FIG. 21 in that it also includes: identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0436] However, the example method shown in FIG. 22 further includes identifying (2202) differences (2252) between metadata associated with the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and metadata associated with the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128). Identifying (2202) the differences (2252) between the metadata associated with the out-of-sync dataset (2113) and the metadata associated with the in-sync dataset (2112) may be implemented using a variety of techniques, as described in more detail above. As in one example described above, in addition to tracking the modification operations that occur during the period in which the resynchronizing storage system was detached, a storage system may also track metadata describing those tracked modification operations and store the tracked metadata in a log or journal device.

[0437] The example method shown in FIG. 22 further includes synchronizing (2204) the metadata associated with the out-of-sync dataset (2113) with the metadata associated with the in-sync dataset (2112), according to the differences (2252) between the metadata associated with the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the metadata associated with the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128), which may be implemented using a variety of techniques. One example technique includes the one or more in-sync storage systems (2114, 2128) generating and maintaining a list of changes to metadata that have occurred since the storage system (2124) was detached. The list of changes to metadata may be used to update the metadata representation of the pod data on the out-of-sync storage system (2124) in order to bring the out-of-sync storage system (2124) in sync, where being in sync includes having a compatible graph for representing "common metadata". Metadata and common metadata are further described in Application Reference No. 15/66,418, which is incorporated herein by reference in its entirety. As described above, some operations may persist on the detached storage system (2124) and result in modifications to the metadata associated with the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124). These modifications to the metadata may be described in a dedicated list, or in the same list of operations described above with respect to identifying (2102) differences between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0438] For further explanation, FIG. 23 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 23 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 23 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0439] However, the example method shown in FIG. 23 further specifies that identifying (2102) differences between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128) includes identifying (2302) one or more blocks (2352) within the out-of-sync dataset (2113) whose content differs from the content stored in the in-sync dataset (2112), which may be implemented using a variety of techniques. One example technique, as described in more detail above, includes generating a list of blocks (2352) that may differ between the in-sync storage systems (2114, 2128) and the attaching storage system (2124), where one of the in-sync storage systems may begin generating the list of blocks in response to detaching a particular storage system or detecting that a particular storage system has detached. Detaching a storage system is described in more detail in Application Reference No. 15/696,418, which is incorporated herein by reference in its entirety.

[0440] The example method shown in FIG. 23 further specifies that synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128) includes modifying (2304) one or more blocks within the out-of-sync dataset (2113) to match the one or more blocks (2352) within the in-sync dataset (2112). As described in more detail above, modifying (2304) the one or more blocks may be implemented by locating ranges of memory addresses within the one or more blocks (2352) from the list of detached blocks and copying the content of those ranges of memory addresses from one of the in-sync storage systems (2114, 2128) to the attaching storage system (2124).
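The block-level copy described above can be pictured with a minimal Python sketch. It is illustrative only and is not drawn from the disclosure: the function name `resync_blocks`, the constant `BLOCK_SIZE`, and the in-memory `bytearray` copies are hypothetical stand-ins for the attaching storage system, an in-sync storage system, and the list of blocks that may differ.

```python
from typing import Iterable

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for illustration


def resync_blocks(in_sync: bytearray, out_of_sync: bytearray,
                  differing_blocks: Iterable[int]) -> int:
    """Copy each listed block from the in-sync copy to the out-of-sync copy.

    Returns the number of bytes copied.
    """
    copied = 0
    for block in sorted(set(differing_blocks)):
        # Locate the address range covered by this block.
        start = block * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, len(in_sync))
        # Overwrite the stale range with the in-sync content.
        out_of_sync[start:end] = in_sync[start:end]
        copied += end - start
    return copied


# Blocks 1 and 3 were modified while the second copy was detached.
in_sync_copy = bytearray(b"AAAABBBBCCCCDDDD")
stale_copy = bytearray(b"AAAAxxxxCCCCyyyy")
bytes_copied = resync_blocks(in_sync_copy, stale_copy, [1, 3])
```

Only the listed blocks are transferred, so the cost of resynchronization in this sketch scales with the amount of change during the detached period rather than with the size of the dataset.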

[0441] For further explanation, FIG. 24 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 24 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 24 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0442] However, the example method shown in FIG. 24 further specifies that identifying (2102) differences between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128) includes identifying (2402), by at least one of the one or more in-sync storage systems (2114, 2128), one or more modifications (2452) to the dataset (2112) that have occurred since the out-of-sync storage system was detached. Identifying (2402) these modifications (2452) may be implemented by several techniques. As one example technique, one or more of the in-sync storage systems (2114, 2128) may implement one of the block tracking techniques described above. Other example techniques, as described in detail above, include a combination of block tracking and snapshots, or block tracking by sequence number, among others.

[0443] For further explanation, FIG. 25 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 25 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 25 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0444] However, the example method shown in FIG. 25 further specifies that identifying (2102) differences between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128) includes identifying (2502), by at least one of the one or more in-sync storage systems (2114, 2128), one or more modifications (2552) to the dataset (2112) that were pending when the out-of-sync storage system (2124) was detached. Identifying (2502) these pending modifications (2552) may be implemented using several techniques, as described in more detail above.

[0445] For further explanation, FIG. 26 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 26 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 26 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0446] However, the example method shown in FIG. 26 further includes detecting (2602) that the out-of-sync storage system (2124) has detached from the synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128), which may be implemented using a variety of techniques, as described in more detail above. The example method shown in FIG. 26 also includes tracking (2604) modifications (2652) to the dataset (2112) that have occurred since the out-of-sync storage system (2124) was detached, as described in more detail above.
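As a rough illustration of detecting a detach and then tracking subsequent modifications, the following Python sketch records which blocks are written after a peer detaches. The class `DetachTracker` and its method names are hypothetical and assume a simple per-block tracking scheme; the disclosure leaves the technique open.

```python
class DetachTracker:
    """Hypothetical tracker kept by an in-sync storage system."""

    def __init__(self):
        self.detached = False
        self.modified_blocks = set()

    def on_detach(self):
        # Called when the peer is detached or is detected to have detached.
        self.detached = True
        self.modified_blocks.clear()

    def on_write(self, block):
        # Writes only need tracking while the peer is detached.
        if self.detached:
            self.modified_blocks.add(block)

    def on_reattach(self):
        # Hand back the blocks that may differ, then stop tracking.
        blocks = sorted(self.modified_blocks)
        self.detached = False
        self.modified_blocks.clear()
        return blocks


tracker = DetachTracker()
tracker.on_write(7)      # peer still in-sync: nothing is recorded
tracker.on_detach()
tracker.on_write(3)
tracker.on_write(1)
tracker.on_write(3)      # repeated writes to a block are recorded once
to_resync = tracker.on_reattach()
```

A set is used here so that repeated writes to the same block produce a single entry in the list handed to resynchronization.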

[0447] For further explanation, FIG. 27 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 27 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 27 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0448] However, the example method shown in FIG. 27 further specifies that synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128) includes replicating (2702) one or more snapshots of the one or more in-sync storage systems (2114, 2128) to the out-of-sync storage system (2124), which may be implemented using a variety of techniques, as described in more detail above.

[0449] For further explanation, FIG. 28 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 28 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 28 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0450] However, the example method shown in FIG. 28 further specifies that synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128) includes identifying (2802) one or more modifications (2852) to the dataset (2113) that were persisted only on the out-of-sync storage system (2124), and undoing (2804) the one or more modifications (2852) to the dataset (2113) that were persisted only on the out-of-sync storage system (2124), each of which may be implemented using a variety of techniques, as described in more detail above.
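One way to picture identifying and undoing modifications that were persisted only on the out-of-sync storage system is sketched below in Python. It assumes, purely for illustration, that the out-of-sync system keeps a log recording the prior content of each block it modified and that operations carry identifiers comparable across systems. The names `Op` and `undo_local_only` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Op:
    op_id: int   # identifier shared across storage systems (assumed)
    block: int   # block the operation modified
    old: bytes   # content of the block before the operation
    new: bytes   # content of the block after the operation


def undo_local_only(blocks: Dict[int, bytes], local_log: List[Op],
                    in_sync_ids: Set[int]) -> List[int]:
    """Roll back operations that the in-sync systems never persisted.

    Returns the identifiers of the operations that were undone.
    """
    local_only = [op for op in local_log if op.op_id not in in_sync_ids]
    # Undo in reverse order so stacked writes to one block unwind correctly.
    for op in reversed(local_only):
        blocks[op.block] = op.old
    return [op.op_id for op in local_only]


log = [
    Op(1, 0, b"old0", b"new0"),    # persisted on every storage system
    Op(2, 1, b"old1", b"new1"),    # persisted only on the detached system
    Op(3, 1, b"new1", b"newer1"),  # persisted only on the detached system
]
state = {0: b"new0", 1: b"newer1"}
undone = undo_local_only(state, log, in_sync_ids={1})
```

An alternative under the same assumptions would be to add the affected blocks to the list of blocks to copy from an in-sync storage system, which avoids keeping prior content in the log.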

[0451] For further explanation, FIG. 29 sets forth a flowchart illustrating an additional example method of resynchronization for storage systems that synchronously replicate a dataset according to some embodiments of the present disclosure. The example method shown in FIG. 29 is similar to the example method shown in FIG. 21, as the example method shown in FIG. 29 also includes identifying (2102) differences (2152) between an out-of-sync dataset (2113) stored on an out-of-sync storage system (2124) and an in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128); synchronizing (2104) the out-of-sync dataset (2113) with the in-sync dataset (2112) according to the differences (2152) between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the one or more in-sync storage systems (2114, 2128); and re-establishing (2106) a synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on the one or more in-sync storage systems (2114, 2128).

[0452] However, the example method shown in FIG. 29 further specifies that re-establishing (2106) the synchronous replication relationship between the out-of-sync dataset (2113) stored on the out-of-sync storage system (2124) and the in-sync dataset (2112) stored on one or more in-sync storage systems (2114, 2128) includes enabling (2902), for the out-of-sync storage system (2124), I/O processing for the dataset (2113), which may be implemented using a variety of techniques, as described in more detail above.

[0453] For further explanation, FIG. 30 sets forth a flowchart illustrating an example method for managing connectivity to synchronously replicated storage systems (3014, 3024, 3028) according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (3014, 3024, 3028) shown in FIG. 30 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems shown in FIG. 30 may include the same components as the storage systems described above, fewer components, or additional components.

[0454] The example method shown in FIG. 30 includes identifying (3002) a plurality of storage systems (3014, 3024, 3028) across which a dataset (3012) is synchronously replicated. The dataset (3012) shown in FIG. 30 may be embodied, for example, as the contents of a particular volume, as the contents of a particular shard of a volume, or as any other collection of one or more data elements. The dataset (3012) may be synchronized across the plurality of storage systems (3014, 3024, 3028) such that each storage system (3014, 3024, 3028) retains a local copy of the dataset (3012). In the examples described herein, such a dataset (3012) is synchronously replicated across the storage systems (3014, 3024, 3028) in such a way that the dataset (3012) can be accessed through any of the storage systems (3014, 3024, 3028) with performance characteristics such that no one storage system in the cluster operates substantially more optimally than any other storage system in the cluster, at least as long as the cluster and the particular storage system being accessed are running nominally. In such systems, modifications to the dataset (3012) should be made to the copy of the dataset that resides on each storage system (3014, 3024, 3028) in such a way that accessing the dataset (3012) on any of the storage systems (3014, 3024, 3028) yields consistent results. For example, a write request issued to the dataset must be executed on all of the storage systems (3014, 3024, 3028) or on none of them. Likewise, some groups of operations (e.g., two write operations directed to the same location within the dataset) must be executed in the same order, or as if they were executed in the same order, on all of the storage systems (3014, 3024, 3028), so that the copies of the dataset residing on each storage system (3014, 3024, 3028) are ultimately identical. Modifications to the dataset (3012) need not be made at exactly the same time, but some actions (e.g., issuing an acknowledgment of a write request directed to the dataset, or enabling read access to a location within the dataset that is targeted by a write request that has not yet completed on all of the storage systems) may be delayed until the copy of the dataset (3012) on each storage system (3014, 3024, 3028) has been modified.

[0455] In contrast to the handling of write requests (or other requests that modify a dataset that is synchronously replicated across a plurality of storage systems), other types of requests may be serviced locally by the storage system that receives the request, without adding distributed messaging latency to such operations. For example, read requests, query requests, or other requests that do not result in a modification of the dataset (3012) can generally be handled locally by the storage system that receives the request. For example, if a host issues a read request to a first storage system (3014) within a cluster of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated, an implementation can ensure that no inline messaging between the storage systems (3014, 3024, 3028) is required to complete the read request, yielding read latency that is frequently identical to that of a local, non-replicated storage system. In some examples, such operations (e.g., read requests) may be blocked within an implementation by a conflicting write request (i.e., a request to write data to a portion of the dataset that overlaps the portion of the dataset to be read in response to the read request) or by some other form of conflicting modification operation that has not yet completed on all of the storage systems (3014, 3024, 3028). Blocking may be necessary, for example, to preserve ordering requirements for multiple read requests that overlap in time with one or more concurrent modification requests. Such blocking can be used to ensure that a first read on one storage system that is concurrent with a write or other modification operation on the same or another storage system in the pod, where the first read is followed by a second read on another storage system in the pod that also overlaps the same write or other modification operation, will never return the result of the modification operation for the first read while returning content from before the modification operation for the second read. Blocking overlapping read requests for in-flight modification operations that a storage system has learned about and that have not yet been processed everywhere in the pod can ensure that this reverse time order for read operations does not occur, by delaying a read request that might return results from an overlapping modification operation until it is assured that all other overlapping read requests will also return results from that overlapping modification operation.
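The read-blocking behavior described above can be sketched in Python as a small gate that a storage system consults before servicing a read locally. The names `Range` and `ReadGate` are hypothetical, and the sketch assumes that the storage system is notified when a modification starts and when it has completed on every storage system in the pod.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Range:
    start: int
    length: int

    def overlaps(self, other: "Range") -> bool:
        # Half-open interval test: [start, start + length)
        return (self.start < other.start + other.length
                and other.start < self.start + self.length)


class ReadGate:
    """Tracks in-flight writes and reports whether a read must wait."""

    def __init__(self) -> None:
        self._in_flight: Dict[int, Range] = {}

    def write_started(self, write_id: int, rng: Range) -> None:
        self._in_flight[write_id] = rng

    def write_completed_everywhere(self, write_id: int) -> None:
        # Only called once every storage system in the pod has applied it.
        self._in_flight.pop(write_id, None)

    def must_block(self, read_rng: Range) -> bool:
        return any(w.overlaps(read_rng) for w in self._in_flight.values())


gate = ReadGate()
gate.write_started(1, Range(100, 50))              # covers bytes 100..149
blocked_overlap = gate.must_block(Range(120, 10))  # overlaps the write
blocked_disjoint = gate.must_block(Range(0, 100))  # ends just before it
gate.write_completed_everywhere(1)
blocked_after = gate.must_block(Range(120, 10))
```

Reads that do not overlap any in-flight modification proceed without inter-system messaging, which matches the local-read latency goal stated above.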

[0456] In the example method shown in FIG. 30, identifying (3002), for the dataset (3012), the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated may be carried out, for example, by examining a pod definition or similar data structure that associates the dataset (3012) with the one or more storage systems (3014, 3024, 3028) that nominally store the dataset (3012). A "pod," as the term is used here and throughout the remainder of the present application, may be embodied as a management entity that represents a dataset, a set of managed objects and management operations, a set of access operations to modify or read the dataset, and a plurality of storage systems. Such management operations may modify or query managed objects equivalently through any of the storage systems, and access operations to read or modify the dataset operate equivalently through any of the storage systems. Each storage system may store a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by the storage system, and operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management operations to query the pod or subsequent access operations to read the dataset. Additional details regarding a "pod" may be found in previously filed provisional patent application No. 62/518,071, which is incorporated herein by reference. A storage system may be added to a pod, with the result that the pod's dataset (3012) is copied to that storage system and then kept up to date as the dataset (3012) is modified. A storage system may also be removed from a pod, with the result that the dataset (3012) is no longer kept up to date on the removed storage system. In such examples, the pod definition or similar data structure may be updated as storage systems are added to and removed from a particular pod.
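A pod definition of the kind referred to above might, under simplifying assumptions, be modeled as follows. The `Pod` class, its fields, and the `systems_for_dataset` helper are hypothetical; the string identifiers merely reuse the reference numerals from FIG. 30 as labels.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Pod:
    """Hypothetical pod definition: one dataset and its member systems."""
    name: str
    dataset_id: str
    members: Set[str] = field(default_factory=set)

    def add_system(self, system_id: str) -> None:
        # In a real system the dataset would also be copied to the new member.
        self.members.add(system_id)

    def remove_system(self, system_id: str) -> None:
        # The removed member's copy is no longer kept up to date.
        self.members.discard(system_id)


def systems_for_dataset(pods: Iterable[Pod], dataset_id: str) -> List[str]:
    """Identify the storage systems across which a dataset is replicated."""
    for pod in pods:
        if pod.dataset_id == dataset_id:
            return sorted(pod.members)
    return []


pod = Pod(name="pod1", dataset_id="3012")
for system in ("3014", "3024", "3028"):
    pod.add_system(system)
before = systems_for_dataset([pod], "3012")
pod.remove_system("3024")
after = systems_for_dataset([pod], "3012")
```

Keeping membership in the pod definition means the lookup reflects additions and removals as soon as the definition is updated.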

[0457] The example method shown in FIG. 30 also includes identifying (3004) a host (3032) that can issue I/O operations directed to the dataset (3012). The host (3032) shown in FIG. 30 may be embodied, for example, as an application server executing externally to the storage systems (3014, 3024, 3028), or as any other device that issues access requests (e.g., reads, writes) to the storage systems (3014, 3024, 3028) via one or more data communication paths. Identifying (3004) a particular host (3032) that can issue I/O operations directed to the dataset (3012) may be carried out, for example, by one or more of the storage systems (3014, 3024, 3028) maintaining a list or other data structure that identifies each host from which the storage systems (3014, 3024, 3028) have received an I/O operation directed to the dataset (3012), by examining a list or other data structure that identifies each host that has the permissions needed to access the dataset (3012), or in some other way.

[0458] The example method shown in FIG. 30 also includes identifying (3006) a plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated. Each data communication path (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) may represent a relationship between a host initiator port and a storage system target port, a relationship between a network interface on the host and a network interface on a storage system, and so on. In such an example, there may be several host initiator ports and several storage system target ports, and a storage system may also include several storage controllers, each of which may host multiple target ports. Target ports or network interfaces on separate storage systems should generally be distinct from one another, even within the same pod. Target ports may be managed using target port groups, which are groups of ports associated with a storage system volume that share a common state with respect to active/optimized, active/non-optimized, standby, and offline. A target port group may be associated with each storage controller of an individual storage system rather than with the storage system as a whole. In fact, target port groups may be entirely arbitrary, including being associated with a subset of the target ports within a single storage controller. A storage system could also use host initiator information when constructing or advertising target port groups. However, the storage system must provide this information consistently to each host initiator so that there is no confusion in the multipathing driver stack. In the example method shown in FIG. 30, identifying (3006) the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated may be carried out, for example, through the use of the SCSI Asymmetric Logical Unit Access ("ALUA") mechanisms described in greater detail in the following paragraphs, through the use of some other network discovery tool, or in some other way.

[0459] The example method shown in FIG. 30 also includes identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated. The storage systems shown in FIG. 30 may identify (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032), the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated, and the storage communication endpoints associated with the storage systems. In the example method shown in FIG. 30, identifying (3008) one or more optimal paths may include identifying a single optimal path or identifying multiple optimal paths. For example, any path that meets various performance thresholds may be identified, a predetermined number of the most optimal paths (e.g., those paths that perform best relative to the other available paths) may be identified, a predetermined percentage of the most optimal paths may be identified, or some other subset of the more optimal paths (such as the paths between the host and a particular storage system) may be identified. Readers will appreciate that, because the storage systems (3014, 3024, 3028) may be located at some distance from one another, because the storage systems (3014, 3024, 3028) may be located on separate storage networks or separate parts of a storage network, or for some other reason, there may be a performance advantage associated with the host (3032) issuing I/O operations to one storage system rather than another. For example, there may be a performance advantage associated with the host (3032) issuing I/O operations to a storage system that is physically located in the same data center or campus as the host (3032), relative to the host (3032) issuing I/O operations to a storage system that is physically located in a distant data center or campus. For reliability, it may be beneficial for the host (3032) to retain connectivity to all of the storage systems (3014, 3024, 3028), but for performance it may be preferable for the host (3032) to access the dataset (3012) through a particular storage system. Readers will appreciate that, because different hosts may access the dataset (3012), the one or more optimal paths for one host to access the dataset (3012) may differ from the one or more optimal paths for another host to access the dataset (3012). In some embodiments, two storage systems may be sufficiently similar that paths to both storage systems can be regarded as optimal. For example, if two storage systems are in the same data center or campus, with plentiful networking between the host and those two storage systems, while a third storage system is far enough away that it should not be used except as a fallback, then all paths between the host and the two sufficiently similar storage systems may be candidates for being identified (3008) as optimal paths.
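
The three selection policies mentioned above (a performance threshold, a predetermined number of paths, and a predetermined percentage of paths) can be sketched as follows. This is a minimal sketch that assumes measured latency is the only performance metric; the function names and the sample latencies are hypothetical and are not taken from the application.

```python
import math


def paths_meeting_threshold(latency_ms: dict[str, float], max_ms: float) -> list[str]:
    """Every path whose latency satisfies the threshold."""
    return sorted(p for p, ms in latency_ms.items() if ms <= max_ms)


def best_n_paths(latency_ms: dict[str, float], n: int) -> list[str]:
    """A predetermined number of the best-performing paths."""
    return sorted(latency_ms, key=lambda p: (latency_ms[p], p))[:n]


def best_fraction_of_paths(latency_ms: dict[str, float], fraction: float) -> list[str]:
    """A predetermined percentage of the best-performing paths (at least one)."""
    count = max(1, math.ceil(len(latency_ms) * fraction))
    return best_n_paths(latency_ms, count)


# Made-up measurements for the three paths of FIG. 30.
measured = {"path-3022": 0.4, "path-3026": 0.6, "path-3030": 12.0}
print(paths_meeting_threshold(measured, 1.0))
print(best_n_paths(measured, 1))
print(best_fraction_of_paths(measured, 0.5))
```

With these sample values the threshold policy keeps both nearby paths, which corresponds to the case where two storage systems are similar enough that paths to both are treated as optimal.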

[0460] In the example method shown in FIG. 30, identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated may be carried out, for example, through the use of the SCSI ALUA mechanisms. The SCSI ALUA mechanisms are described in the SCSI SPC-4 and SBC-3 technical standards as a set of commands and inquiries that describe support for asymmetric access to a storage system volume (also known as a "logical unit" in SCSI) through multiple target ports. In such an embodiment, a volume (whose content may represent a dataset that is synchronously replicated across multiple storage systems) can report a unique ID to the host (3032) through multiple SCSI target ports associated with one or more target port groups, so that the host (3032), through one or more SCSI host ports, can configure its I/O drivers to access the volume through some or all combinations of its host ports and the advertised target ports. The unique volume ID may be used by the host (3032) I/O drivers to recognize all combinations of SCSI logical unit number, host port, and target port that access the same volume. The host I/O drivers can then issue SCSI commands down some, any, or all of those combinations (paths) to modify the state and content of the identified volume. As a result of a fault, the host may reissue requests down an alternate path, and performance concerns may lead the host (3032) to make heavy use of multiple paths to gain the benefit of improved host-to-storage-system bandwidth through the use of multiple ports and multiple network interconnects.

[0461] The ALUA specification for SCSI can be used to describe multiple target port groups that can access a volume, each of which can be assigned a state. A target port group may represent one or more SCSI target ports on a storage system. In a multi-controller storage system, a target port group might represent all SCSI target ports on one controller; with symmetrically accessible, synchronously replicated storage systems, a target port group might represent all SCSI target ports on an individual storage system; or target ports may be grouped in some other way. The state that can be associated with a target port group indicates whether the port group should be preferred for issuing I/O (active/optimized), should not be preferred for issuing I/O (active/non-optimized), is on standby (I/O cannot be issued until the state changes back to active/optimized or active/non-optimized), or is offline, for example because the target ports are not responding. The SCSI specification allows the definition of target port groups and the ALUA target port group state assignments to be specific to each requesting host (or even to each requesting host port) as well as to each volume, so that, for each volume, a storage system can present a unique set of target port groups and target port group state assignments to each host or host port that can access that volume.
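
One way to picture state assignments that are specific to each volume and each host is a table keyed by volume, host, and target port group. The sketch below is illustrative only: `AluaState` models just the four states named in the paragraph above, and `AluaStateTable` and its methods are hypothetical names, not part of any SCSI library.

```python
from enum import Enum


class AluaState(Enum):
    ACTIVE_OPTIMIZED = "active/optimized"
    ACTIVE_NON_OPTIMIZED = "active/non-optimized"
    STANDBY = "standby"
    OFFLINE = "offline"


class AluaStateTable:
    """Hypothetical table of state assignments keyed by (volume, host, group)."""

    def __init__(self) -> None:
        self._states: dict[tuple[str, str, str], AluaState] = {}

    def assign(self, volume: str, host: str, group: str, state: AluaState) -> None:
        self._states[(volume, host, group)] = state

    def report(self, volume: str, host: str) -> dict[str, AluaState]:
        """Target port groups and states presented to one host for one volume."""
        return {
            g: s
            for (v, h, g), s in self._states.items()
            if v == volume and h == host
        }


table = AluaStateTable()
# The same target port group can carry a different state for each host.
table.assign("vol1", "hostA", "tpg-sys-3014", AluaState.ACTIVE_OPTIMIZED)
table.assign("vol1", "hostB", "tpg-sys-3014", AluaState.ACTIVE_NON_OPTIMIZED)
print(table.report("vol1", "hostA"))
print(table.report("vol1", "hostB"))
```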

[0462] With symmetrically accessible, synchronously replicated storage systems, all of the storage systems in a pod can present the same volumes to hosts as if, from the host's point of view, the storage systems in the pod were a single storage system presenting the same volumes through SCSI target ports on some or all of the storage systems for the pod. These mechanisms can then provide all of the capabilities desired for directing and redirecting access to the volumes in the pod. For a host that gets better performance to a particular storage system for the pod (e.g., because of the host's network or geographic proximity to that storage system), the ALUA target port group state assignment for that storage system to that host's host ports can be indicated as active/optimized. For other hosts that get lower performance to that particular storage system for the pod, the ALUA target port group state assignment for the storage system to those other hosts' host ports can be indicated as active/non-optimized. In this way, members of the target port groups determined to be active/optimized may be identified (3008) to the host as the optimal path(s).

[0463] When a new storage system is added to a pod, a new target port group may be added, for each volume in the pod, to the host ports accessing that volume, and the target port group may be assigned a state appropriate to the host/storage system proximity for the new storage system. After some number of SAN-level events, the host can recognize the new ports for each volume and configure its drivers to use the new paths appropriately. The storage system, on behalf of the pod, can monitor for host accesses (e.g., by waiting for REPORT LUNS and INQUIRY commands) to determine that the host is now properly configured to use the SCSI target ports on the newly added storage system. When a storage system is removed from a pod, the other storage systems remaining in the pod can stop reporting, to any host port, any target ports or target port groups of the removed storage system for the pod's volumes. Further, the removed storage system can stop listing the pod's volumes in response to any REPORT LUNS request, and can begin reporting that the volumes do not exist in response to commands directed to the pod's volumes. If a volume is moved into or out of a pod, so that the volume becomes associated with an expanded or reduced set of storage systems, the same actions that would apply when adding a storage system to a pod or removing one from it can be applied to the individual volume. As for failure handling, host I/O drivers access their volumes through target ports in target port groups assigned as active/optimized whenever any such paths are available and functioning properly, but can switch to active/non-optimized paths if no active/optimized path is available and functioning properly.
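
The failure-handling preference of a host I/O driver, which uses active/optimized paths while any are working and otherwise falls back to active/non-optimized paths, might look like the following sketch. The function `choose_paths` and the path names are hypothetical; real multipathing drivers apply considerably more policy than this.

```python
def choose_paths(path_states: dict[str, str], healthy: set[str]) -> list[str]:
    """Prefer working active/optimized paths; else use active/non-optimized ones."""
    for wanted in ("active/optimized", "active/non-optimized"):
        usable = sorted(
            p for p, state in path_states.items() if state == wanted and p in healthy
        )
        if usable:
            return usable
    return []  # standby and offline paths cannot carry I/O


states = {
    "path-3022": "active/optimized",
    "path-3026": "active/non-optimized",
    "path-3030": "standby",
}
print(choose_paths(states, {"path-3022", "path-3026", "path-3030"}))
print(choose_paths(states, {"path-3026", "path-3030"}))  # optimized path has failed
```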

[0464] In the example method shown in FIG. 30, identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated may also be carried out in an automated way, for example by using timing or network information to determine that the host's paths to particular interfaces or storage systems in a pod have lower latency, better throughput, or less switching infrastructure than the host's paths to other particular interfaces or storage systems in the pod. In such an example, identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated can therefore include identifying one or more data communication paths that exhibit the relatively lowest latency between the host and one of the storage systems, identifying one or more data communication paths that exhibit a latency between the host and one of the storage systems that is below a predetermined threshold, and so on.

[0465] For example, in an IP-based network, the ping and traceroute commands (or direct use of their underlying ICMP Echo requests) may be used to determine the latency and the network routes between known host network interfaces and the storage systems in a pod. The traceroute facility, or direct use of ICMP Echo requests with gradually increasing TTL fields (which limit the number of network hops before a router sends an ICMP Time Exceeded response), can be used to determine whether there are particular network hops with higher latency. In that case, host-interface-to-storage-interface routes that include a high-latency hop may be avoided in favor of host-interface-to-storage-interface routes without high-latency hops. Alternatively, if some network route has fewer hops and lower latency than the others, the storage system with that network interface may be preferred.
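
To illustrate how traceroute-style measurements could feed such a decision, the sketch below works on already collected round-trip times, one per TTL value, and sends no network traffic itself. The function names, the 5 ms limit, and the sample measurements are assumptions made for this sketch only.

```python
def high_latency_hops(rtt_by_ttl_ms: list[float], limit_ms: float) -> list[int]:
    """Return 1-based hop numbers whose added round-trip time exceeds limit_ms.

    rtt_by_ttl_ms[i] is the round-trip time observed with TTL = i + 1, as a
    traceroute-style probe would report it.
    """
    flagged = []
    previous = 0.0
    for hop, rtt in enumerate(rtt_by_ttl_ms, start=1):
        if rtt - previous > limit_ms:
            flagged.append(hop)
        previous = rtt
    return flagged


def preferred_route(routes: dict[str, list[float]], limit_ms: float) -> str:
    """Prefer no high-latency hops, then fewer hops, then lower total latency."""
    def rank(name: str):
        rtts = routes[name]
        total = rtts[-1] if rtts else 0.0
        return (len(high_latency_hops(rtts, limit_ms)), len(rtts), total, name)

    return min(routes, key=rank)


# Made-up measurements from one host interface to two storage systems.
routes = {
    "host-if0 -> sys-3014": [0.2, 0.5],
    "host-if0 -> sys-3028": [0.2, 0.6, 9.8, 10.1],
}
print(high_latency_hops(routes["host-if0 -> sys-3028"], 5.0))
print(preferred_route(routes, 5.0))
```

Measured round-trip times are noisy in practice, so a real implementation would likely average several probes per TTL before comparing hops.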

[0466] In Fibre Channel based networks, the HBA API specification and plug-in libraries supported by the Storage Networking Industry Association, from its Fibre Channel Working Group, can be used to map out FC storage networks. The ELS Echo function of the Fibre Channel protocol can also be used to detect network latency. As with the IP networks described above, this can be used to identify host-port-to-target-port routes that have lower latency and fewer network hops than other host port and target port combinations, which in turn can be used to determine which storage systems in a pod are closer to, or better connected to, one host or another, in order to configure the storage systems in the pod as active/optimized versus active/non-optimized for each host.
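
A storage system could turn such per-host measurements into ALUA state assignments in many ways. The sketch below uses one assumed policy: every storage system whose measured latency is within a tolerance factor of the best one is marked active/optimized, and the rest are marked active/non-optimized. The function name, the tolerance factor of 1.5, and the sample latencies are hypothetical.

```python
def alua_states_for_host(
    latency_ms: dict[str, float], tolerance: float = 1.5
) -> dict[str, str]:
    """Mark systems within `tolerance` times the best latency as active/optimized."""
    if not latency_ms:
        return {}
    best = min(latency_ms.values())
    return {
        system: "active/optimized" if ms <= best * tolerance else "active/non-optimized"
        for system, ms in latency_ms.items()
    }


# Made-up latencies from one host to each storage system in the pod.
print(alua_states_for_host({"sys-3014": 0.4, "sys-3024": 0.5, "sys-3028": 11.0}))
```

Running the same function with another host's measurements can yield different assignments for the same storage systems, which is the per-host behavior described above.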

[0467] The example method shown in FIG. 30 also includes indicating (3010), to the host (3032), an identification of the one or more optimal paths. In the example method shown in FIG. 30, the storage systems (3014, 3024, 3028) may indicate (3010) the identification of the optimal paths to the host (3032), for example, through one or more messages exchanged between the storage systems (3014, 3024, 3028) and the host (3032). Such messages may be exchanged using many of the mechanisms described above, and may identify an optimal path through the use of a port identifier, a network interface identifier, or some other identifier. For example, indicating (3010), to the host (3032), an identification of the one or more optimal paths may be carried out by indicating (434), to the host (3032), one or more active/optimized paths via the SCSI ALUA mechanisms.

[0468] Readers will appreciate that the storage systems (3014, 3024, 3028) described herein may make use of host definitions to define a host (3032) as a named set of ports or network interfaces, and that those host definitions may include additional information or characteristics such as the operating system, the application type, or a workload classification. Host definitions may be given names for administrative convenience and may be represented as first-class objects in a storage system user interface, or they may be grouped together in various ways, for example to list all of the hosts associated with a particular application, user, or host-based database or file system cluster. These host definitions may serve as convenient management objects with which a storage system can associate information about host locations or about host-to-storage-system preferences for a pod. It may be convenient for a storage system to manage one host definition per host rather than one host definition per pod, because the initiator ports and network interfaces associated with a host are likely the same for all pods. This may not hold if pods are used as a strong form of virtual appliance in which each pod is securely isolated from other pods, but for any use or implementation that stops short of such secure pod isolation, it may be convenient and easier to set up.

[0469] If a pod serves a dataset or storage object to the same host from multiple storage systems in the pod, and if the ALUA state and the host-to-storage-system preferences for that host must be managed in a coordinated fashion across all of the storage systems for the pod, then the host definition may need to be coordinated or synchronized across the pod. Unlike most other managed objects for a pod, however, a host may be a storage system object rather than a pod object (because network interfaces and SCSI target ports are often storage system objects). As a result, host objects may not be easy to synchronize among pod members, because the definitions may conflict.

[0470] Further, a host may be interconnected to one storage system for a pod through one set of host-side initiators and network interfaces, and to another storage system for the pod through a different set of host-side initiators and network interfaces. There may be some overlap between the two sets, or there may be no overlap at all. In some cases, there may be host information that a storage system can use to determine that interfaces represent the same host. For example, the interfaces may use the same iSCSI IQN, or a host-side driver may supply host information to the storage systems to indicate that various initiators or network interfaces represent the same host. In other cases, no such information may exist. In the absence of discoverable information, the parameters of a host definition may instead be supplied to the storage system by a user, or through some API or other interface, to associate a host name with a set of network endpoints, iSCSI IQNs, initiator ports, and so on.

[0471] If a portion of the dataset associated with a pod is exported to a particular host through a host definition (that is, it is served from the SCSI targets and network endpoints of the pod's current storage systems to a host that the host definition identifies through a list of network endpoints, iSCSI IQNs, or initiator ports), then when an additional storage system is added to the pod, the host definitions on the added storage system can be examined. If the added storage system has neither a host with the same host object name nor a host with an overlapping list of host network endpoints, iSCSI IQNs, or initiator ports, then the host definition can be copied to the added storage system. If a host definition with the same name and the same configuration of host network endpoints, iSCSI IQNs, and initiator ports exists on the added storage system, then the host definition from the original pod member storage systems and the host definition on the added storage system can be linked and coordinated from then on. If a host with the same name but a different configuration of host network endpoints, iSCSI IQNs, or initiator ports exists on the added storage system, then qualified versions of the host object can be exchanged between the storage systems for the pod, with the differing versions named with a storage system qualifier. For example, storage system A may present its host definition as A:H1, while storage system B may present its host definition as B:H1. The same may be done for host definitions that differ in name but have some overlap in host network endpoints, iSCSI IQNs, or initiator ports. In that case, the host definitions may not be copied between storage systems but may instead remain local to each storage system, yielding, for example, host definition A:H1 listing host initiators X and Y and host definition B:H2 listing host initiators Y and X. Further, operations may be provided to synchronize these host definitions. If two host definitions have the same name, or have overlapping sets of host network endpoints, iSCSI IQNs, or initiator ports, then the user may be offered a simple interface for unifying them under a common name, with the host network endpoints, iSCSI IQNs, or initiator ports exchanged and then linked together. If the only conflict between two definitions is that some host definitions include host interfaces that are not listed in the other, while at least one host interface matches, then those definitions could be merged and linked automatically rather than waiting for a user to make such a request.
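
The copy, link, and qualify outcomes described above can be summarized in a short decision routine. This is a minimal sketch under stated assumptions: `HostDefinition` and `reconcile` are hypothetical names, interfaces are modeled as plain strings, and the optional automatic merge is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostDefinition:
    name: str
    interfaces: frozenset[str]  # network endpoints, iSCSI IQNs, or initiator ports


def reconcile(
    existing: HostDefinition,
    added: list[HostDefinition],
    local_system: str,
    added_system: str,
) -> list[str]:
    """Decide what to do with `existing` when `added_system` joins the pod."""
    related = [
        h for h in added
        if h.name == existing.name or h.interfaces & existing.interfaces
    ]
    if not related:
        # No matching name and no overlapping interfaces: copy the definition.
        return [f"copy {existing.name}"]
    actions = []
    for other in related:
        if other.name == existing.name and other.interfaces == existing.interfaces:
            # Same name and same configuration: link and coordinate.
            actions.append(f"link {existing.name}")
        else:
            # Conflict: present qualified versions such as A:H1 and B:H1.
            actions.append(
                f"qualify {local_system}:{existing.name} / {added_system}:{other.name}"
            )
    return actions


h1 = HostDefinition("H1", frozenset({"X", "Y"}))
print(reconcile(h1, [HostDefinition("H1", frozenset({"X", "Y"}))], "A", "B"))
print(reconcile(h1, [HostDefinition("H1", frozenset({"Z"}))], "A", "B"))
```

A fuller version could add a fourth outcome that merges automatically when the definitions differ only in interfaces missing from one side and at least one interface matches.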

［0472］ポッド内のストレージシステムが、一連の条件（障害、停止等）の結果として別のストレージシステムをデタッチする場合、次いでデタッチされたストレージシステムは、それがポッドのためにオフラインであるが、それ以外の場合はまだ実行中である場合、そのホスト定義に変更を加える場合がある。また、ポッドのためにオンラインに留まるストレージシステムは、そのホスト定義に対しても変更を加える場合がある。結果は一致しないホスト定義である場合がある。デタッチされたストレージシステムが後にポッドに再接続される場合、次いでホスト定義は一致しなくなる場合がある。その時点で、ポッドは、各ストレージシステムでの別々の定義を区別するために、そのストレージシステム名接頭辞とともにホスト定義を報告することを再開してよい。 [0472] If a storage system in a pod detaches another storage system as a result of a set of conditions (failure, outage, etc.), then the detached storage system may make changes to its host definitions if it is offline for the pod but otherwise still running. Also, a storage system that remains online for the pod may make changes to its host definitions. The result may be inconsistent host definitions. If the detached storage system is later reattached to the pod, then the host definitions may no longer match. At that point, the pod may resume reporting host definitions with its storage system name prefix to distinguish between the separate definitions on each storage system.

［0473］ホスト定義の別の態様は、ホスト定義が、どのストレージシステムのターゲットポートがアクティブ／最適化済みのＡＬＵＡステータスを生じさせるべきなのか、及びどれがアクティブ／非最適化のＡＬＵＡステータスを生じさせるべきなのかの観点で、何がＡＬＵＡ情報のために返されるのかを構成することの一部として、場所又はストレージシステムの優先度を定義してよい点である。また、この状態は、ストレージシステム間で調整され、リンクされる必要がある場合もある。そうだとしたら、それは調整を要する別の態様である場合がある。また、ポッド内のストレージシステムのためにホスト定義を調整するときに検出される衝突、つまり任意の設定の欠如は、ユーザに場所又はストレージシステムの優先度を設定するのを促す機会も提示してよい。 [0473] Another aspect of the host definition is that it may define location or storage system priorities as part of configuring what is returned for ALUA information in terms of which storage system target ports should result in an active/optimized ALUA status and which should result in an active/non-optimized ALUA status. This state may also need to be reconciled and linked between storage systems. If so, that may be another aspect that requires reconciliation. Conflicts, or the lack of any settings, detected when reconciling host definitions for storage systems in a pod may also present an opportunity to prompt the user to set location or storage system priorities.

［0474］読者は、ホスト定義が、実際には、ポッドレベルのオブジェクトよりむしろストレージシステムレベルのオブジェクトである場合があるので、同じホスト定義が、単一のストレージシステムを越えて広がらないポッドに対してだけではなく、複数のストレージシステム間で広がったポッドに対しても使用できることに留意する。純粋にローカルなポッド（つまり別のポッドよりストレージシステムの異なる集合に広がるあるポッド）との関連でのホストの使用は、ホスト（又はホストのリスト）がどのように見られるのかを改変できるであろう。例えば、ローカルポッドとの関連では、ストレージシステムによってホストを限定することは、意味をなさない場合があり、ローカルストレージシステムに対する経路を有さないホストを一覧することも意味をなさない場合がある。この例は、異なるメンバーストレージシステムを有するポッドに拡張できるであろう。例えば、ホスト定義は、（ストレージシステム限定を有用にする）１つのポッドのために、対にされたストレージシステムと衝突する場合がある。一方、ホスト定義は、（ストレージシステム限定を潜在的に不必要にする）異なるポッドのために、異なる対にされたストレージシステムと衝突しない場合がある。類似する問題は、あるポッドが追加のストレージシステムのある集合に広げられ、別のポッドが追加のストレージシステムの別の集合に広げられるときにホストの使用で発生する場合がある。その場合、特定のポッドの問題に関連性のあるストレージシステムのためのターゲットインタフェースだけ、並びにポッドのための関連性のあるストレージシステム上のターゲットインタフェースに対して可視であるホストエンドポイント、ｉＳＣＳＩＩＱＮ、及び開始プログラムポートだけ。 [0474] The reader should note that because a host definition may actually be a storage-system-level object rather than a pod-level object, the same host definition can be used not only for pods that do not span more than a single storage system, but also for pods that span multiple storage systems. The use of host in the context of a purely local pod (i.e., one pod that spans a different set of storage systems than another pod) could alter how the host (or list of hosts) is viewed. For example, in the context of a local pod, it may not make sense to qualify hosts by storage system, and it may not make sense to list hosts that do not have a path to the local storage system. This example could be extended to pods with different member storage systems. For example, a host definition may conflict with a paired storage system for one pod (making storage system qualification useful). On the other hand, a host definition may not conflict with a different paired storage system for a different pod (making storage system qualification potentially unnecessary). 
Similar issues can arise with host usage when one pod is stretched to one set of additional storage systems and another pod is stretched to a different set of additional storage systems. In that case, only the target interfaces of the storage systems relevant to a particular pod matter, along with only those host endpoints, iSCSI IQNs, and initiator ports that are visible to the target interfaces on the storage systems relevant to that pod.

［0475］図３０に示されるストレージシステム（３０１４、３０２４、３０２８）の１つだけしか、上述されたステップを実行するとして明示的に示されていないが、読者は、ストレージシステム（３０１４、３０２４、３０２８）のそれぞれが、上述されたステップを実行してよいことを理解する。実際には、ストレージシステム（３０１４、３０２４、３０２８）のそれぞれは、ほぼ同時に上述されたステップを実行してよく、これにより最適経路の識別は調整された取組みとなる。例えば、各ストレージシステム（３０１４、３０２４、３０２８）は、それ自体とホストとの間のすべてのデータ通信経路を個々に識別し、それ自体とホストとの間の各データ通信と関連付けられた多様な性能測定基準を収集し、１つ以上の最適経路を識別しようと努力して係る情報を他のストレージシステムと共用してよい。 [0475] While only one of the storage systems (3014, 3024, 3028) shown in FIG. 30 is explicitly shown as performing the steps described above, the reader will understand that each of the storage systems (3014, 3024, 3028) may perform the steps described above. In practice, each of the storage systems (3014, 3024, 3028) may perform the steps described above substantially simultaneously, making the identification of the optimal path a coordinated effort. For example, each storage system (3014, 3024, 3028) may individually identify all data communication paths between itself and the host, collect various performance metrics associated with each data communication between itself and the host, and share such information with the other storage systems in an effort to identify one or more optimal paths.

［0476］追加の説明のために、図３１は、本開示のいくつかの実施形態に従って同期複製されたストレージシステム（３０１４、３０２４、３０２８）への接続性を管理するための追加の例の方法を示すフローチャートを説明する。あまり詳細には示されないが、図３１に示されるストレージシステム（３０１４、３０２４、３０２８）は、図１Ａから図１Ｄ、図２Ａから図２Ｇ、図３Ａから図３Ｂ、又はその任意の組合せに関して上述されたストレージシステムに類似してよい。実際には、図３１に示されるストレージシステムは、上述されたストレージシステムと同じ構成要素、より少ない構成要素、追加の構成要素を含んでよい。 [0476] For further explanation, FIG. 31 sets forth a flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems (3014, 3024, 3028) in accordance with some embodiments of the present disclosure. While not shown in great detail, the storage systems (3014, 3024, 3028) illustrated in FIG. 31 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage system illustrated in FIG. 31 may include the same components, fewer components, or additional components as the storage systems described above.

[0477] The example method shown in FIG. 31 is similar to the example method shown in FIG. 30, as it also includes identifying (3002) a plurality of storage systems (3014, 3024, 3028) across which a dataset (3012) is synchronously replicated; identifying (3004) a host (3032) that can issue I/O operations directed to the dataset (3012); identifying (3006) a plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated; identifying (3008), from among the plurality of data communication paths (3022, 3026, 3030), one or more optimal paths; and indicating (3010), to the host (3032), an identification of the one or more optimal paths.

[0478] The example method shown in FIG. 31 also includes detecting (3102) that the dataset (3012) is synchronously replicated across an updated set of storage systems. In the example method shown in FIG. 31, the set of storage systems across which the dataset (3012) is synchronously replicated may change for a variety of reasons. The set may change, for example, because one or more properly functioning storage systems are added to or removed from a pod. The set may also change, for example, because one or more storage systems become unreachable or unavailable and are detached from the pod in response. In the example shown in FIG. 31, detecting (3102) that the dataset (3012) is synchronously replicated across an updated set of storage systems may be carried out, for example, by detecting a change to a pod definition, by detecting that a storage system has become unreachable or otherwise unavailable, or in some other manner.

[0479] The example method shown in FIG. 31 depicts an embodiment in which the dataset (3012) is synchronously replicated across an updated set of storage systems. In this example, storage system (3024) was described with respect to FIG. 30 as being initially identified (3002) as one of the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated. The example shown in FIG. 31, however, depicts an embodiment in which storage system (3024) has become unreachable or otherwise unavailable. This is indicated by the use of dotted lines (representing the absence of an active connection) for the data communication links (3016, 3018, 3020) involving the other storage systems (3014, 3028) and for the data communication path (3030) that can be used to couple storage system (3024) to the host (3032) for data communication. Thus, the updated set of storage systems across which the dataset (3012) is synchronously replicated includes only two storage systems (3014, 3028), whereas the initial set of storage systems across which the dataset (3012) was synchronously replicated included all of the storage systems shown (3014, 3024, 3028).

［0480］また、図３１に示される例の方法は、ホスト（３０３２）とストレージシステム（３０１４、３０２８）の更新された集合との間の複数のデータ通信経路を識別すること（３１０４）も含む。図３１に示される例の方法では、ホスト（３０３２）とストレージシステム（３０１４、３０２８）の更新された集合との間の複数のデータ通信経路を識別すること（３１０４）は、例えば、より詳細に上述されるＳＣＳＩＡＬＵＡ機構の使用により、他のなんらかのネットワーク発見ツールの使用により、又はなんらかの他の方法で実施されてよい。 [0480] The example method shown in FIG. 31 also includes identifying (3104) multiple data communication paths between the host (3032) and the updated set of storage systems (3014, 3028). In the example method shown in FIG. 31, identifying (3104) multiple data communication paths between the host (3032) and the updated set of storage systems (3014, 3028) may be performed, for example, by using the SCSI ALUA mechanism described in more detail above, by using some other network discovery tool, or in some other manner.

[0481] The example method shown in FIG. 31 also includes identifying (3106) an updated set of optimal paths from among the plurality of data communication paths between the host (3032) and the updated set of storage systems. In the example method shown in FIG. 31, identifying (3106) the updated set of optimal paths may include identifying a single optimal path or identifying multiple optimal paths.
For example, any path that meets various performance thresholds may be identified, a predetermined number of the best paths (e.g., those paths that outperform the other available paths) may be identified, a predetermined percentage of the best paths may be identified, or a subset of the more optimal paths (such as the paths between the host and a particular storage system) may be identified. The reader will understand that because the storage systems (3014, 3024, 3028) may be located some distance from one another, because the storage systems (3014, 3024, 3028) may be located on separate storage networks or on separate portions of a storage network, or for some other reason, there may be a performance advantage to the host (3032) issuing I/O operations to one storage system rather than another. For example, there may be a performance advantage to the host (3032) issuing I/O operations to a storage system that is physically located in the same data center or campus as the host (3032), compared with issuing I/O operations to a storage system that is physically located in a distant data center or campus. For reliability, it may be beneficial for the host (3032) to retain connectivity to all of the storage systems (3014, 3024, 3028), but for performance it may be preferable for the host (3032) to access the dataset (3012) through a particular storage system. The reader will understand that because different hosts may access the dataset (3012), the one or more optimal paths for one host to access the dataset (3012) may differ from the one or more optimal paths for another host to access the dataset (3012). In the example method shown in FIG. 31, identifying (3106) the updated set of optimal paths from among the plurality of data communication paths between the host (3032) and the updated set of storage systems may be carried out, for example, through the use of the SCSI ALUA mechanisms described above, in an automated way by using timing or network information to determine that host paths to particular interfaces or storage systems within the pod have lower latency, better throughput, or less switching infrastructure than host paths to other particular interfaces or storage systems within the pod, or in some other way. The reader will understand that although several of the preceding paragraphs refer to a "set", such a set may contain a single member, and no particular limitation is placed on how such a set is represented.
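The selection rules mentioned in this paragraph (a performance threshold, or a fixed number of the best paths) can be illustrated with a short sketch. Everything named here is a hypothetical stand-in: `Path`, `optimal_paths`, and `alua_states` are illustrative names, latency is used as the only metric, and the mapping to ALUA states simply labels the chosen paths active/optimized and the rest active/non-optimized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    target_port: str      # hypothetical identifier for a target port
    storage_system: str   # hypothetical identifier for the storage system
    latency_ms: float     # measured host-to-port latency (assumed metric)


def optimal_paths(paths, max_latency_ms=None, top_n=None):
    """Return the best paths by latency, filtered by threshold and/or count."""
    ranked = sorted(paths, key=lambda p: p.latency_ms)
    if max_latency_ms is not None:
        ranked = [p for p in ranked if p.latency_ms <= max_latency_ms]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


def alua_states(paths, optimal):
    """Label each target port with the ALUA state it would report."""
    chosen = set(optimal)
    return {
        p.target_port: "active/optimized" if p in chosen else "active/non-optimized"
        for p in paths
    }
```

The result may be a set with a single member, consistent with the remark above that a "set" of optimal paths is not limited to any particular size or representation.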

[0482] The reader will understand that the performance advantage described above can have several sources. For writes, a host that writes to a more distant storage system needs long-distance bandwidth for the network between the host and the distant storage system, in addition to the long-distance bandwidth that the storage-system-to-storage-system replication itself needs in any case. This either consumes host-to-storage bandwidth that would otherwise be unnecessary, or it sends write content in both directions when traffic in only one direction should have been needed. Furthermore, for writes, if long-distance latency is high, that latency is incurred four or six times: one or three phases of a two- or four-phase write from the host to the distant storage system, plus delivery of the write content from the distant storage system to the local storage system, plus delivery of a completion or similar indication from the local storage system to the distant storage system, plus the final completion sent from the distant storage system to the host as the last part of the two- or four-phase write request.
In contrast, for writes to the local storage system, long-distance latency is incurred only twice: once for delivery of the write contents from the local storage system to the distant storage system, and once for delivery of a completion or similar indication from the distant storage system to the local storage system. For reads, a host requesting a read from the local storage system often does not consume any long-distance bandwidth and typically does not incur a long-distance latency penalty.
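The counts in this paragraph can be checked with a small worked calculation. The sketch below only restates the arithmetic from the text; the function names and the 5 ms one-way latency are illustrative assumptions, and each long-distance message is treated as costing one one-way latency.

```python
def remote_write_crossings(write_phases: int) -> int:
    """Long-distance crossings when the host writes to the distant system."""
    if write_phases not in (2, 4):
        raise ValueError("the text describes two- or four-phase writes only")
    host_to_distant = write_phases - 1  # one or three phases before the final completion
    replication = 2                     # content to the local system, completion back
    final_completion = 1                # distant system to host, last phase of the write
    return host_to_distant + replication + final_completion


LOCAL_WRITE_CROSSINGS = 2  # content out to the distant system, completion back
LOCAL_READ_CROSSINGS = 0   # a local read typically crosses no long-distance link


def added_latency_ms(crossings: int, one_way_ms: float) -> float:
    """Long-distance latency added to one operation."""
    return crossings * one_way_ms


if __name__ == "__main__":
    one_way_ms = 5.0  # assumed one-way long-distance latency
    for phases in (2, 4):
        n = remote_write_crossings(phases)
        print(f"{phases}-phase remote write: {n} crossings, {added_latency_ms(n, one_way_ms)} ms")
    print(f"local write: {LOCAL_WRITE_CROSSINGS} crossings, "
          f"{added_latency_ms(LOCAL_WRITE_CROSSINGS, one_way_ms)} ms")
```

Under these assumptions a write sent to the distant storage system pays two to three times the long-distance latency of the same write sent to the local storage system.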

[0483] The example method shown in FIG. 31 also includes indicating (3108), to the host (3032), an identification of the updated optimal paths. In the example method shown in FIG. 31, the storage systems (3014, 3024, 3028) may indicate (3108) the identification of the updated optimal paths to the host (3032), for example, through one or more messages exchanged between the storage systems (3014, 3024, 3028) and the host (3032). Such messages may be exchanged using many of the mechanisms described above and may identify the optimal paths through the use of a port identifier, a network interface identifier, or some other identifier.
The reader will understand that, in some embodiments, part of the process by which the storage systems (3014, 3024, 3028) indicate (3108) the identification of the updated optimal paths to the host (3032) may include piggybacking such information onto a response to a command issued by the host (3032). For example, one of the storage systems (3014, 3024, 3028) may raise (3110) a SCSI unit attention to the host. A SCSI unit attention is a mechanism that allows a device (e.g., a storage system) to inform the host-side SCSI driver that the device's operational state or fabric state has changed. Stated differently, by raising a unit attention, the storage system may indicate to the host that the host should query the storage system about a change in state, and through that query the host can notice that the target port group states have changed to indicate a different set of active/optimized and active/non-optimized target port groups. In such an example, the target (e.g., the storage system) internally raises a "unit attention" that is returned to the host (3032) the next time a response to a command is sent to the host (3032), which tells the host-side SCSI driver to request the updated ALUA state before clearing the unit attention. This mechanism may allow a storage system to get a host to update its ALUA state as desired, but it depends on some kind of future SCSI request being issued to some target port that is not offline. Because the SCSI protocol takes the form of commands issued by the host (3032) and responses returned by the target (e.g., the storage system), the SCSI "unit attention" mechanism is what gives a target a way to return unsolicited updates to the host (3032), and the information for updating the optimal paths may therefore need to be piggybacked on this mechanism in a slightly roundabout way.
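The piggybacking behavior can be illustrated with a toy model in which the target can only speak when answering a command. This is a deliberately simplified sketch and not the SCSI specification's exact rules: the `Target` and `Host` classes and their methods are hypothetical, commands are plain strings, and the pending unit attention is cleared only once the host has fetched the new ALUA state.

```python
RTPG = "REPORT TARGET PORT GROUPS"


class Target:
    """Toy SCSI target that can only speak when responding to a command."""

    def __init__(self, alua_state):
        self.alua_state = dict(alua_state)
        self._pending = set()  # hosts with a unit attention still pending

    def change_alua(self, new_state, hosts):
        self.alua_state = dict(new_state)
        self._pending.update(hosts)  # raise a unit attention for each host

    def handle(self, host, command):
        if command == RTPG:
            self._pending.discard(host)  # host has fetched the new state
            return "GOOD", dict(self.alua_state)
        if host in self._pending:
            return "CHECK CONDITION", "UNIT ATTENTION"
        return "GOOD", None


class Host:
    """Toy host-side driver that refreshes ALUA state on a unit attention."""

    def __init__(self, name, target):
        self.name, self.target = name, target
        self.alua_view = target.handle(name, RTPG)[1]

    def issue(self, command):
        status, payload = self.target.handle(self.name, command)
        if status == "CHECK CONDITION" and payload == "UNIT ATTENTION":
            # Refresh ALUA state, which clears the unit attention, then retry.
            self.alua_view = self.target.handle(self.name, RTPG)[1]
            status, payload = self.target.handle(self.name, command)
        return status
```

Note that after `change_alua` the host's view stays stale until it next issues a command, which mirrors the dependence on a future SCSI request described above.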

［0484］追加の説明のために、図３２は、本開示のいくつかの実施形態に従って同期複製されたストレージシステム（３０１４、３０２４、３０２８）への接続性を管理するための追加の例の方法を示すフローチャートを説明する。あまり詳細には示されないが、図３２に示されるストレージシステム（３０１４、３０２４、３０２８、３２１０）は、図１Ａから図１Ｄ、図２Ａから図２Ｇ、図３Ａから図３Ｂ、又はその任意の組合せに関して上述されたストレージシステムに類似してよい。実際には、図３２に示されるストレージシステムは、上述されたストレージシステムと同じ構成要素、より少ない構成要素、追加の構成要素を含んでよい。 [0484] For further explanation, FIG. 32 sets forth a flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems (3014, 3024, 3028) in accordance with some embodiments of the present disclosure. While not shown in great detail, the storage systems (3014, 3024, 3028, 3210) illustrated in FIG. 32 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage system illustrated in FIG. 32 may include the same components, fewer components, or additional components as the storage systems described above.

[0485] The example method shown in FIG. 32 is similar to the example methods shown in FIG. 30 and FIG. 31, as it also includes identifying (3002) a plurality of storage systems (3014, 3024, 3028) across which a dataset (3012) is synchronously replicated; identifying (3004) a host (3032) that can issue I/O operations directed to the dataset (3012); identifying (3006) a plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated; identifying (3008), from among the plurality of data communication paths (3022, 3026, 3030), one or more optimal paths; indicating (3010), to the host (3032), an identification of the one or more optimal paths; detecting (3102) that the dataset (3012) is synchronously replicated across an updated set of storage systems; identifying (3104) a plurality of data communication paths between the host (3032) and the updated set of storage systems (3014, 3028); identifying (3106) an updated set of optimal paths from among the plurality of data communication paths between the host (3032) and the updated set of storage systems; and indicating (3108), to the host (3032), an identification of the updated optimal paths.

[0486] In the example method shown in FIG. 32, detecting (3102) that the dataset (3012) is synchronously replicated across an updated set of storage systems may include detecting (3202) that a storage system (3024) has detached from the original set of storage systems across which the dataset (3012) was synchronously replicated. In the example method shown in FIG. 32, storage system (3024) may be considered "detached" when it no longer participates in the synchronous replication of the dataset (3012) across the plurality of storage systems. A particular storage system may detach, for example, because of a hardware failure within the storage system, because of a network failure that prevents the storage system from engaging in data communication, because of a loss of power to the storage system, because of a software crash on the storage system, or for a variety of other reasons. In the example method shown in FIG. 32, detecting (3202) that storage system (3024) has detached from the original set of storage systems across which the dataset (3012) was synchronously replicated may be carried out, for example, by determining that the storage system has become unavailable or otherwise unreachable.
In the example method shown in FIG. 32, one of the storage systems (3024) is shown as detached due to a networking failure that prevents the storage system from engaging in data communications because all data communication links (3016, 3018, 3020, 3208) and data communication paths (3030) used by storage system (3024) are represented by dotted lines to indicate that the data communication links (3016, 3018, 3020, 3208) and data communication paths (3030) used by storage system (3024) are not operational.

[0487] In the example method shown in FIG. 32, detecting (3102) that the dataset (3012) is synchronously replicated across an updated set of storage systems may also include detecting (3204) that a storage system (3210) that was not included in the original set of storage systems across which the dataset (3012) was synchronously replicated has attached to the set of storage systems across which the dataset (3012) is synchronously replicated. In the example method shown in FIG. 32, storage system (3210) may be considered "attached" when it participates in the synchronous replication of the dataset (3012) across the plurality of storage systems. A particular storage system may attach, for example, because the storage system is added to a pod, because the storage system recovers from a hardware failure within the storage system, because the storage system recovers from a network failure, because the storage system recovers from a loss of power, because the storage system recovers from a software crash, or for a variety of other reasons. In the example method shown in FIG. 32, one storage system (3210), which was not included in any of the preceding figures, is shown as attached to the set of storage systems across which the dataset (3012) is synchronously replicated, and storage system (3210) is coupled for data communication with the host (3032) and other storage systems (3024) via one or more data communication links (3206) and data communication paths (3212). The reader will understand that although no data communication link is shown between some of the storage systems (3028, 3210), such data communication links may actually exist and are omitted here only for convenience of illustration.
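The detach and attach detection described in the two preceding paragraphs amounts to comparing the original and updated membership sets. The sketch below is illustrative only: `membership_changes` is a hypothetical helper, and the reference numerals of FIG. 32 are reused as plain identifiers for the storage systems.

```python
def membership_changes(original, updated):
    """Compare two membership sets and report which systems left or joined."""
    original, updated = set(original), set(updated)
    return {
        "detached": original - updated,  # in the original set, no longer replicating
        "attached": updated - original,  # newly participating in replication
    }


if __name__ == "__main__":
    # FIG. 32 scenario: 3024 detaches and 3210 attaches.
    changes = membership_changes({3014, 3024, 3028}, {3014, 3028, 3210})
    print(sorted(changes["detached"]), sorted(changes["attached"]))
```

How the updated membership set is obtained (for example, from a change to the pod definition or from reachability checks) is outside this sketch.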

[0488] The example method shown in FIG. 32 may also include monitoring (3206) host access to a storage system (3210) that was not included in the original set of storage systems across which the dataset (3012) was synchronously replicated. As described above, when a new storage system is added to a pod, a new target port group may be added, for each volume in the pod, to the host ports that access that volume, and the target port group may be assigned a state appropriate to the host/storage system proximity for the new storage system. After some number of SAN-level events, the host may recognize the new ports for each volume and configure its drivers to use the new paths appropriately. The storage system, on behalf of the pod, may monitor for host access (e.g., by waiting for REPORT LUNS and INQUIRY commands) to determine that the host is now properly configured to use the SCSI target ports on the newly added storage system. In such an example, before any action is taken on the pod that depends on the host being ready to issue commands to the members of a target port group, the host may be monitored for assurance that it is ready to issue commands to the newly added member of that target port group. This may be useful, for example, when coordinating the removal of a pod member. In such an example, if one or more hosts known to be using the storage system being removed have not yet been observed using the recently added storage system, and the member being removed is the last remaining storage system known to be working for one or more of those hosts, then it may be beneficial to issue an alert before allowing the operation to proceed (or to prevent the operation outright).
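For further illustration, the removal check described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the class `PodAccessMonitor`, its methods, and the identifiers `host-a`, `array-1`, and so on are hypothetical names introduced here, and the sketch assumes that seeing a REPORT LUNS or INQUIRY command from a host is sufficient evidence that the host can use a given storage system.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Commands whose arrival suggests the host has discovered the new target ports.
DISCOVERY_COMMANDS = {"REPORT LUNS", "INQUIRY"}


@dataclass
class PodAccessMonitor:
    """Hypothetical record of which hosts were seen on which storage systems."""

    # storage system id -> hosts observed issuing discovery commands to it
    seen: Dict[str, Set[str]] = field(default_factory=dict)

    def record_command(self, host: str, system: str, command: str) -> None:
        """Note a host command; only discovery commands count as evidence."""
        if command in DISCOVERY_COMMANDS:
            self.seen.setdefault(system, set()).add(host)

    def hosts_stranded_by_removal(self, system: str) -> Set[str]:
        """Hosts seen on `system` but not yet seen on any other pod member."""
        elsewhere: Set[str] = set()
        for other, hosts in self.seen.items():
            if other != system:
                elsewhere |= hosts
        return self.seen.get(system, set()) - elsewhere


monitor = PodAccessMonitor()
monitor.record_command("host-a", "array-1", "INQUIRY")
monitor.record_command("host-b", "array-1", "REPORT LUNS")
monitor.record_command("host-a", "array-2", "REPORT LUNS")  # host-a found the new array

stranded = monitor.hosts_stranded_by_removal("array-1")
if stranded:
    print("alert before removing array-1; hosts not yet seen elsewhere:", sorted(stranded))
```

A non-empty result corresponds to the case in which an alert may be issued, or the removal prevented, because the member being removed is the last one known to be working for some host.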

[0489] For further explanation, FIG. 33 sets forth a flowchart illustrating an additional example method for managing connectivity to synchronously replicated storage systems (3014, 3024, 3028) according to some embodiments of the present disclosure. While not shown in great detail, the storage systems (3014, 3024, 3028, 3210) shown in FIG. 33 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems shown in FIG. 33 may include the same components as the storage systems described above, fewer components, or additional components.

[0490] The example method shown in FIG. 33 is similar to the example methods shown in FIGS. 30, 31, and 32, as it may also include identifying (3002) a plurality of storage systems (3014, 3024, 3028) across which a dataset (3012) is synchronously replicated; identifying (3004) a host (3032) that can issue I/O operations directed to the dataset (3012); identifying (3006) a plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated; identifying (3008), from among the plurality of data communication paths (3022, 3026, 3030), one or more optimal paths; and indicating (3010) to the host (3032) the identification of the one or more optimal paths.

[0491] The example method shown in FIG. 33 also includes detecting (3302) a change to one or more of the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated. Detecting (3302) such a change may be carried out, for example, by detecting that a particular data communication path is no longer operational, by determining that performance (e.g., bandwidth, throughput) across a particular data communication path has changed by more than a predetermined threshold amount, by determining that fewer or additional hops have been introduced into a particular data communication path, and so on. The reader will understand that a change to the plurality of data communication paths (3022, 3026, 3030) may affect which particular data communication path is identified as the optimal path, and therefore the storage systems (3014, 3024, 3028) may need to repeat the step of identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) and the step of indicating (3010) to the host (3032) the identification of the one or more optimal paths.
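For further illustration, the three kinds of path change named above (a path becoming non-operational, a performance change beyond a threshold, and a change in hop count) might be detected as in the following sketch. The names `PathSample`, `path_changed`, and `optimal_paths` are hypothetical, the 20% default threshold is an arbitrary placeholder, and ranking paths by fewest hops and then by highest throughput is only one assumed notion of "optimal"; the disclosure does not prescribe a ranking rule here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PathSample:
    """One observation of a host-to-storage-system path (hypothetical fields)."""
    operational: bool
    throughput_mbps: float
    hops: int


def path_changed(old: PathSample, new: PathSample, threshold: float = 0.2) -> bool:
    """True if operability or hop count changed, or if throughput moved by
    more than `threshold` as a fraction of the old throughput."""
    if old.operational != new.operational or old.hops != new.hops:
        return True
    if old.throughput_mbps == 0:
        return new.throughput_mbps != 0
    relative = abs(new.throughput_mbps - old.throughput_mbps) / old.throughput_mbps
    return relative > threshold


def optimal_paths(paths: Dict[str, PathSample]) -> List[str]:
    """Operational paths tied for the best rank: fewest hops, then highest throughput."""
    usable = {pid: p for pid, p in paths.items() if p.operational}
    if not usable:
        return []

    def rank(p: PathSample) -> Tuple[int, float]:
        return (p.hops, -p.throughput_mbps)

    best = min(rank(p) for p in usable.values())
    return sorted(pid for pid, p in usable.items() if rank(p) == best)
```

When `path_changed` reports a change for any path, the caller would recompute `optimal_paths` and indicate the result to the host again, mirroring the repetition of steps (3008) and (3010).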

[0492] The example method shown in FIG. 33 also includes detecting (3304) a change to the host (3032). A change to the host (3032) may occur, for example, as a result of a software or hardware upgrade to the host (3032), a loss of power to the host (3032), a hardware or software failure on the host (3032), the host (3032) being moved, a new host being used to support the execution of some application that issues I/O operations directed to the dataset (3012), or for a variety of other reasons. The reader will understand that a change to the host (3032) may affect which particular data communication path is identified as the optimal path, and therefore the storage systems (3014, 3024, 3028) may need to repeat the step of identifying (3008) one or more optimal paths from among the plurality of data communication paths (3022, 3026, 3030) between the host (3032) and the plurality of storage systems (3014, 3024, 3028) across which the dataset (3012) is synchronously replicated, and the step of indicating (3010) to the host (3032) the identification of the one or more optimal paths.

[0493] For further explanation, FIG. 34 sets forth a flowchart illustrating an additional example method for managing connectivity to data synchronously replicated across storage systems (3424, 3426, 3428) in accordance with embodiments of the present disclosure. While not shown in great detail, the storage systems (3424, 3426, 3428) shown in FIG. 34 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems shown in FIG. 34 may include the same components as the storage systems described above, fewer components, or additional components. Furthermore, each of the storage systems (3424, 3426, 3428) shown in FIG. 34 may be connected to the others via one or more data communication links (3420, 3422) and to the host (3402) via one or more data communication paths (3410, 3412, 3414).

[0494] The example method shown in FIG. 34 includes receiving (3404) an I/O operation (3416) directed to a dataset (3418) that is synchronously replicated across a plurality of storage systems (3424, 3426, 3428). In the example method shown in FIG. 34, the host (3402) receives (3404) the I/O operation (3416), for example, from an application running on the host, as a result of some user interaction with the host (3402), or in various other ways. The I/O operation (3416) directed to the dataset (3418) may be embodied, for example, as a request to write data to the dataset (3418), as a request to read data from the dataset (3418), as a request to copy data in the dataset (3418) and store the copy elsewhere, as a request to take a snapshot of the data in the dataset (3418), and so on.

[0495] The example method shown in FIG. 34 also includes identifying (3406) a particular storage system (3426) among the plurality of storage systems (3424, 3426, 3428) as a preferred storage system for receiving the I/O operation (3416). In the example method shown in FIG. 34, the host (3402) may identify (3406) the preferred storage system, for example, by tracking the response times that the host (3402) has previously experienced when issuing I/O operations to each of the storage systems (3424, 3426, 3428) (or by otherwise having access to information describing those response times) and selecting the storage system (3426) that exhibited the fastest response time. The reader will understand that the host (3402) may track, or have access to, information describing other metrics (e.g., reliability metrics, availability metrics, throughput metrics) that may be used alone or in combination to identify (3406) a particular storage system (3426) as the preferred storage system for receiving the I/O operation (3416). Alternatively, the host (3402) may be configured to receive the identification of the preferred storage system from a system administrator as a configuration parameter, from the storage array itself, or in some other manner, such that identifying (3406) a particular storage system (3426) as the preferred storage system may be carried out simply by examining a configuration parameter or other configuration information stored within the host (3402).
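For further illustration, the two ways of identifying the preferred storage system described above (fastest tracked response time, or a stored configuration parameter) might be sketched as follows. The function `preferred_system` and its parameters are hypothetical names, the use of a simple arithmetic mean is an assumption, and the sample values are invented for the example rather than taken from any measurement.

```python
from typing import Dict, List, Optional


def preferred_system(
    response_times_ms: Dict[str, List[float]],
    configured: Optional[str] = None,
) -> str:
    """Return the preferred storage system identifier.

    If a configuration parameter names a preferred system, use it as-is.
    Otherwise pick the system with the lowest mean of previously observed
    response times (systems with no samples are ignored).
    """
    if configured is not None:
        return configured
    averages = {
        system: sum(samples) / len(samples)
        for system, samples in response_times_ms.items()
        if samples
    }
    if not averages:
        raise ValueError("no response-time samples and no configured preference")
    return min(averages, key=averages.get)


# Invented sample history, keyed by the reference numerals used in FIG. 34.
history = {"3424": [2.0, 2.4], "3426": [0.8, 1.0], "3428": [1.5]}
choice = preferred_system(history)
```

Other metrics mentioned above, such as reliability or throughput, could be folded in by replacing the mean with a combined score; that combination is left out of the sketch.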

[0496] The example method shown in FIG. 34 also includes, after identifying the preferred storage system (3426), issuing (3408) to the preferred storage system (3426) one or more I/O operations (3416) directed to the dataset (3418). In the example method shown in FIG. 34, the host (3402) may issue (3408) the one or more I/O operations (3416) directed to the dataset (3418) to the preferred storage system (3426), for example, via one or more messages exchanged between the host (3402) and the preferred storage system (3426) over a data communication path (812) between the host (3402) and the preferred storage system (3426).

[0497] For further explanation, FIG. 35 sets forth a flowchart illustrating an additional example method for managing connectivity to data synchronously replicated across storage systems (3424, 3426, 3428) in accordance with embodiments of the present disclosure. While not shown in great detail, the storage systems (3424, 3426, 3428) shown in FIG. 35 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems shown in FIG. 35 may include the same components as the storage systems described above, fewer components, or additional components.

[0498] The example method shown in FIG. 35 is similar to the example method shown in FIG. 34 in that it also includes receiving (3404) an I/O operation (3416) directed to a dataset (3418) that is synchronously replicated across a plurality of storage systems (3424, 3426, 3428); identifying (3406) a particular storage system (3426) among the plurality of storage systems (3424, 3426, 3428) as a preferred storage system for receiving the I/O operation (3416); and, after identifying the preferred storage system (3426), issuing (3408) to the preferred storage system (3426) one or more I/O operations (3416) directed to the dataset (3418).

[0499] The example method shown in FIG. 35 also includes determining (3502) respective response times for multiple storage systems of the plurality of storage systems (3424, 3426, 3428). In the example method shown in FIG. 35, the host (3402) determines (3502) the respective response times, for example, by determining the amount of time required by each of the storage systems (3424, 3426, 3428) to service similar I/O operations, by tracking the average amount of time required by each of the storage systems (3424, 3426, 3428) to service similar I/O operations, and so on. In such an example, the host (3402) may track such information through the use of one or more internal clocks, by examining timestamps attached to one or more messages, or in some other manner.
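For further illustration, determining an average response time with an internal clock might look like the following sketch. The function `measure_response_time` and the callable `issue_io` are hypothetical; `issue_io` stands in for whatever mechanism the host uses to send a comparable I/O operation to one storage system and wait for completion, and the `time.sleep` call in the usage lines merely simulates a slower system.

```python
import time
from typing import Callable


def measure_response_time(issue_io: Callable[[], None], samples: int = 5) -> float:
    """Average wall-clock seconds taken by `issue_io` over `samples` calls.

    Uses the host's monotonic high-resolution clock, so the result is not
    affected by adjustments to the system time.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        issue_io()  # assumed to block until the storage system responds
        total += time.perf_counter() - start
    return total / samples


# Simulated systems: one responds at once, one takes about 10 ms per operation.
fast = measure_response_time(lambda: None)
slow = measure_response_time(lambda: time.sleep(0.01))
```

Timestamps carried in messages, the other option mentioned above, would replace the two `perf_counter` readings with the send and receive times recorded in the exchange.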

[0500] In the example method shown in FIG. 35, identifying (3406) a particular storage system (3426) among the plurality of storage systems (3424, 3426, 3428) as a preferred storage system for receiving the I/O operation (3416) may include identifying (3504) the particular storage system as the preferred storage system in accordance with the respective response times for multiple storage systems of the plurality of storage systems (3424, 3426, 3428). In the example method shown in FIG. 35, the host (3402) may identify (3504) the preferred storage system in accordance with the respective response times, for example, by selecting the storage system associated with the fastest response time as the preferred storage system, by selecting as the preferred storage system any storage system whose response time meets a predetermined quality-of-service threshold, or in some other manner.

[0501] The example method shown in FIG. 35 also includes detecting (3506) a change in response time for at least one of the storage systems (3424, 3426, 3428). In the example method shown in FIG. 35, the host (3402) detects (3506) a change in response time for at least one of the storage systems (3424, 3426, 3428), for example, by determining, as a result of running additional tests against each of the storage systems, that an average response time has deviated by more than a predetermined threshold amount, by detecting some disruption to the ability to exchange messages over a particular data communication link, or in some other manner.

[0502] The example method shown in FIG. 35 also includes selecting (3508), in accordance with the change in response time, a different storage system of the plurality of storage systems (3424, 3426, 3428) as the preferred storage system. In the example method shown in FIG. 35, the host (3402) selects (3508) a different storage system of the plurality of storage systems (3424, 3426, 3428) as the preferred storage system for receiving the I/O operation (3416), for example, by selecting the storage system associated with the fastest updated response time as the preferred storage system, by selecting as the preferred storage system any storage system whose updated response time meets a predetermined quality-of-service threshold, or in some other manner.
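For further illustration, detecting a response-time change beyond a threshold and then selecting a different preferred storage system might be sketched as follows. The class `PreferredSelector` and its methods are hypothetical, the absolute threshold in milliseconds is an assumed form of the "predetermined threshold amount", and the sketch uses the fastest-response-time rule only; a quality-of-service threshold rule would replace the `min` selection.

```python
from typing import Dict, Optional


class PreferredSelector:
    """Tracks baseline response times and reselects on significant change."""

    def __init__(self, change_threshold_ms: float) -> None:
        self.change_threshold_ms = change_threshold_ms
        self.baseline: Dict[str, float] = {}
        self.preferred: Optional[str] = None

    def set_baseline(self, times_ms: Dict[str, float]) -> str:
        """Record initial response times and pick the fastest system."""
        self.baseline = dict(times_ms)
        self.preferred = min(self.baseline, key=self.baseline.get)
        return self.preferred

    def update(self, system: str, new_time_ms: float) -> str:
        """Apply a re-measured time; reselect only if it deviates from the
        baseline by more than the threshold."""
        old = self.baseline[system]
        if abs(new_time_ms - old) > self.change_threshold_ms:
            self.baseline[system] = new_time_ms
            self.preferred = min(self.baseline, key=self.baseline.get)
        return self.preferred


selector = PreferredSelector(change_threshold_ms=0.5)
selector.set_baseline({"3424": 2.0, "3426": 1.0, "3428": 1.5})
selector.update("3426", 1.2)  # within threshold: preference unchanged
selector.update("3426", 3.0)  # beyond threshold: a different system is selected
```

Ignoring small deviations keeps the host from switching preferred systems on ordinary measurement noise, which is one plausible reason for using a threshold at all.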

[0503] For further explanation, FIG. 36 sets forth a flowchart illustrating an additional example method for managing connectivity to data synchronously replicated across storage systems (3424, 3426, 3428) in accordance with embodiments of the present disclosure. While not shown in great detail, the storage systems (3424, 3426, 3428) shown in FIG. 36 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems shown in FIG. 36 may include the same components as the storage systems described above, fewer components, or additional components.

[0504] The example method shown in FIG. 36 is similar to the example method shown in FIG. 34 in that it also includes receiving (3404) an I/O operation (3416) directed to a dataset (3418) that is synchronously replicated across a plurality of storage systems (3424, 3426, 3428); identifying (3406) a particular storage system (3426) among the plurality of storage systems (3424, 3426, 3428) as a preferred storage system for receiving the I/O operation (3416); and, after identifying the preferred storage system (3426), issuing (3408) to the preferred storage system (3426) one or more I/O operations (3416) directed to the dataset (3418).

[0505] The example method shown in FIG. 36 also includes receiving (3602), from one of the storage systems (3428), an identification (3610) of the preferred storage system. In the example method shown in FIG. 36, the host (3402) may receive (3602) the identification (3610) of the preferred storage system from one of the storage systems (3428) via one or more messages exchanged over a data communication path (3414) between the storage system (3428) and the host (3402). The host (3402) may retain the identification (3610) of the preferred storage system, for example, as a configuration setting stored within the host. As such, identifying (3406) a particular storage system (3426) among the plurality of storage systems (3424, 3426, 3428) as the preferred storage system for receiving the I/O operation (3416) may be carried out in accordance with the configuration setting, although in other embodiments the configuration setting may be set in a different manner (e.g., by a system administrator, by another software module running on the host).

[0506] The example method shown in FIG. 36 also includes detecting (3604) that the host (3402) has moved from one location to another. In the example method shown in FIG. 36, the host (3402) may detect (3604) that it has moved from one location to another, for example, by determining that the host (3402) has been connected to a new data communication interconnect, by detecting that the host (3402) has been installed in a different position within a rack or in a new rack, or in some other manner. Location relative to a rack, within a data center, or based on the network topology of a data center or campus may be an aspect of "locality" that can affect performance between a host and a particular storage system for a pod. For a single storage system that spans racks or is connected to multiple networks, locality may also apply to individual storage network adapters on the individual storage systems in the pod.

[0507] The example method shown in FIG. 36 also includes identifying (3608) a different storage system of the plurality of storage systems as the preferred storage system for receiving I/O operations directed to the dataset (3418). In the example method shown in FIG. 36, the host (3402) may identify (3608) a different storage system as the preferred storage system, for example, by re-measuring the response time associated with each of the storage systems and selecting the storage system that exhibits the fastest response time. In the example method shown in FIG. 36, identifying (3608) a different storage system of the plurality of storage systems as the preferred storage system for receiving I/O operations directed to the dataset (3418) may be carried out in response to detecting that the host (3402) has moved.

[0508] The example method shown in FIG. 36 also includes detecting (3606) a configuration change to the host (3402). In the example method shown in FIG. 36, the host (3402) may detect (3606) a configuration change to the host (3402), for example, by detecting that a different version of some software has been installed on the host (3402), by detecting that some hardware component in the host (3402) has been changed or added, and so on. In the example method shown in FIG. 36, identifying (3608) a different storage system of the plurality of storage systems as the preferred storage system for receiving I/O operations directed to the dataset (3418) may alternatively be carried out in response to detecting the configuration change to the host (3402).
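For further illustration, the two triggers described for FIG. 36 (the host having moved, or the host's configuration having changed) might drive reselection as in the following sketch. The `HostState` fields, the function names, and the `remeasure` callback are all hypothetical; in particular, how a host learns its rack or interconnect is outside the sketch and simply assumed to be available.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class HostState:
    """Hypothetical snapshot of host locality and configuration."""
    rack: str
    interconnect: str
    software_version: str
    hardware: FrozenSet[str]


def needs_reselection(previous: HostState, current: HostState) -> Optional[str]:
    """Return "moved", "reconfigured", or None if nothing relevant changed."""
    if (previous.rack, previous.interconnect) != (current.rack, current.interconnect):
        return "moved"
    if (previous.software_version != current.software_version
            or previous.hardware != current.hardware):
        return "reconfigured"
    return None


def reselect(
    previous: HostState,
    current: HostState,
    remeasure: Callable[[], Dict[str, float]],
    preferred: str,
) -> str:
    """Keep the current preference unless a trigger fires; on a trigger,
    re-measure response times and pick the fastest storage system."""
    if needs_reselection(previous, current) is None:
        return preferred
    times = remeasure()
    return min(times, key=times.get)
```

Re-measurement is invoked only when a trigger fires, so a host that has neither moved nor been reconfigured keeps its existing preference without issuing any probe I/O.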

[0509] For further explanation, FIG. 37 sets forth a flowchart illustrating an example method for automated storage system configuration for a mediation service according to some embodiments of the present disclosure. While not shown in great detail, the storage systems (3700A-3700N) shown in FIG. 37 may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems (3700A-3700N) shown in FIG. 37 may include the same components as the storage systems described above, fewer components, or additional components.

[0510] In the following examples, automated storage system configuration for a mediation service may include determining whether a given storage system (3700A) among a collection of storage systems (3700A-3700N) is configured to request mediation from a mediation target or service. If the given storage system (3700A) is not configured to request mediation, the given storage system (3700A) is configured to request or obtain a mediation service handle from a pre-configured location before becoming operational. Such a determination of whether a storage system is configured to request mediation may occur when the storage system is first brought online or started up. The storage system may be configured, prior to shipping, to request configuration information from a designated configuration service, such as the configuration service (3751), and the configuration service may operate on an independent computer system or within a third-party computing environment that provides the configuration service.

[0511] Additionally, in some examples, determining whether a given storage system (3700A) among the storage systems (3700A-3700N) is configured to request mediation from the mediation service is performed in response to initiating synchronous replication of a dataset among the storage systems (3700A-3700N). In this example, as a storage system is added to the collection of storage systems that synchronously replicate a dataset or pod, a storage system that is currently a member of the pod may automatically forward, to the storage system being added, one or more handles for or related to the mediation service and the mediation race targets. In this manner, the first storage system receives a mediation target from the configuration service, and each storage system added to the pod receives that mediation target from a pod member, such that all member storage systems are configured to request mediation from the same mediation target.
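For further illustration, the handle distribution described above (the first storage system obtains a handle from the configuration service, and each later member receives the same handle from an existing pod member) might be sketched as follows. The classes `MediationHandle`, `StorageSystem`, and `Pod`, the `fetch_from_config_service` callback, and the placeholder address and token are all hypothetical; the token here is an ordinary string, not a real cryptographic token.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class MediationHandle:
    """Hypothetical handle: a contact address plus an opaque token."""
    contact_address: str
    token: str


@dataclass
class StorageSystem:
    name: str
    handle: Optional[MediationHandle] = None

    def ensure_configured(
        self, fetch_from_config_service: Callable[[], MediationHandle]
    ) -> None:
        """If not yet configured for mediation, obtain a handle first."""
        if self.handle is None:
            self.handle = fetch_from_config_service()


@dataclass
class Pod:
    members: List[StorageSystem] = field(default_factory=list)

    def add(
        self,
        system: StorageSystem,
        fetch_from_config_service: Callable[[], MediationHandle],
    ) -> None:
        """Add a member so that every member ends up with the same handle."""
        if self.members:
            # An existing member forwards its handle to the new member.
            system.handle = self.members[0].handle
        else:
            # The first member obtains its handle from the configuration service.
            system.ensure_configured(fetch_from_config_service)
        self.members.append(system)


def fetch() -> MediationHandle:
    # Stand-in for a request to the pre-configured configuration service.
    return MediationHandle("mediator.example.invalid", "placeholder-token")


pod = Pod()
for name in ("3700A", "3700B", "3700C"):
    pod.add(StorageSystem(name), fetch)
```

Because later members copy the handle from a current member rather than contacting the configuration service themselves, no administrator action is needed for the pod to agree on a single mediation target.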

[0512] As described above, a mediation service handle may be requested when a pod is created, when a pod is expanded, or when a pod is first expanded in such a way that a mediation service may be needed in the event of a future communication failure between storage systems within the pod. The mediation service handle may be a contact address on a wide area network and a cryptographically secure token that can be used to manage a pool of keys for handling the mediation needs of a storage system cluster or pod. Alternatively, a first storage system in a pod may determine a secure handle to use with a well-known mediation service in a first or subsequent mediation race, where the handle is determined privately by that first storage system without any specific or required interaction with the mediation service, and the handle is then communicated to the other storage systems that are already in the pod or that are later added to it.

[0513] The process of engaging a mediation service in response to an error, such as a communication failure between storage systems, is described in more detail in Application Serial No. 15/703,559, which is incorporated herein by reference in its entirety. In that process, a storage system may be configured to store a handle that specifies a contact address reachable over a wide area network, together with a cryptographically secure token that can be used to manage a pool of keys for mediation. Application Serial No. 15/703,559 also describes the use of various quorum protocols to determine which storage systems, among a set of storage systems replicating a dataset, should continue to service I/O requests directed to the dataset.

[0514] However, while Application Serial No. 15/703,559 describes implementations of mediation and quorum protocols, the focus of the present disclosure is automatic storage system configuration for a mediation service. In other words, a storage system may be set up before shipping to contact a configuration service and request a handle for the mediation service, so that the storage system configures itself to respond to communication failures through mediation, and the storage system automatically transfers the mediation handle as more storage systems are added to the pod. A user, such as an administrator, need not take any action for the storage system to be configured to perform mediation.

[0515] As shown in FIG. 37, multiple storage systems (3700A-3700N) that are synchronously replicating a dataset (3752) may communicate with each other and with a mediation service (3701) over one or more networks (not shown). The mediation service (3701) may determine which storage system continues to service the dataset in the event of a communication failure between storage systems, in the event that a storage system goes offline, or due to some other triggering event. Further, in this example, in response to a storage system going offline, the configuration service (3751) may be automatically reachable by a storage system, through use of a preconfigured contact address, to request one or more handles for the mediation service (3701). In general, any number of storage systems may be part of the in-sync list of storage systems that synchronously replicate the dataset (3752).

[0516] The example method shown in FIG. 37 includes determining (3702) that a particular storage system (3700A) among the storage systems (3700A-3700N) is not configured to request mediation from a mediation target for mediation between the storage systems that synchronously replicate the dataset (3752). Determining (3702) that the particular storage system (3700A) is not configured to request mediation from a mediation target may be implemented by the storage system (3700A) including, within a controller, a startup process that configures how communication failures, or interruptions related to communication failures, are handled. For example, in a startup process of the controller (462), the mediation handler (3762) may check whether a mediation handle has been determined, received, or requested, for instance by reading a status flag or condition code. If the mediation handler (3762) detects that a mediation handle has already been configured, this portion of the startup process is complete. However, if the mediation handler (3762) detects that no mediation handle has been configured, the mediation handler (3762) may proceed to determine such a handle or to request a mediation handle from the configuration service (3751). The contact address for the mediation service (3701) may be a system setting or system variable defined by the manufacturer before the storage system is shipped.

[0517] The example method shown in FIG. 37 also includes requesting (3704), by the particular storage system (3700A) from the configuration service (3751), configuration information indicating one or more service handles for the mediation service (3701). Requesting (3704) this configuration information may be implemented by the mediation handler (3762) accessing stored contact information for the configuration service (3751) and sending a request (3754) for one or more service handles for the mediation service (3701).

[0518] The example method shown in FIG. 37 also includes configuring (3706), in accordance with the one or more service handles received from the configuration service (3751), the mediation handler to communicate with the mediation service in response to detecting a communication failure with one of the storage systems (3700B-3700N). Configuring (3706) the mediation handler in this way may be implemented by the mediation handler (3762) defining the service handle or contact address of the mediation service (3701) to be used when responding to a communication failure, where the service handle (3756) may be specified in a response message from the configuration service (3751) that answers the request (3754).
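By way of a non-limiting illustration, the determining (3702), requesting (3704), and configuring (3706) steps might be sketched as follows. The class name, the `request_handles` callable, and the example addresses are hypothetical and are introduced only for this sketch; the disclosure does not prescribe any particular programming interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MediationHandler:
    """Hypothetical stand-in for the mediation handler (3762)."""

    # Preconfigured contact address of the configuration service (assumed field).
    config_service_address: str
    service_handles: List[str] = field(default_factory=list)

    @property
    def is_configured(self) -> bool:
        # Plays the role of the status flag read during startup (step 3702).
        return bool(self.service_handles)

    def ensure_configured(self, request_handles: Callable[[str], List[str]]) -> bool:
        """Run at controller startup. Returns True if a request had to be sent."""
        if self.is_configured:
            return False  # this portion of the startup process is complete
        # Step 3704: ask the configuration service for service handles.
        handles = request_handles(self.config_service_address)
        if not handles:
            raise RuntimeError("configuration service returned no handles")
        # Step 3706: record the handles for use on a communication failure.
        self.service_handles = list(handles)
        return True

    def mediation_target(self) -> Optional[str]:
        # The handle that would be contacted when mediation is needed.
        return self.service_handles[0] if self.service_handles else None
```

In this sketch the request is issued at most once; a second startup pass finds the handles already stored and returns without contacting the configuration service.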

[0519] For further explanation, FIG. 38 sets forth a flowchart illustrating an example method for automatic storage system configuration for a mediation service according to some embodiments of the present disclosure. The example method shown in FIG. 38 is similar to the example method shown in FIG. 37 in that the example method shown in FIG. 38 also includes: determining (3702) that a particular storage system (3700A) among the storage systems (3700A-3700N) is not configured to request mediation from a mediation target for mediation between the storage systems that synchronously replicate the dataset (3752); requesting (3704), by the particular storage system (3700A) from the configuration service (3751), configuration information indicating one or more service handles for the mediation service (3701); and configuring (3706), in accordance with the one or more service handles received from the configuration service (3751), the mediation handler to communicate with the mediation service in response to detecting a communication failure with one of the storage systems (3700B-3700N). However, the example method shown in FIG. 38 further includes, in response to adding a storage system to the storage systems (3700A-3700N) that synchronously replicate the dataset (3752), automatically transferring (3802) one or more handles for the mediation service to the storage system being added. Processes for adding a storage system to a pod, such that the added storage system can become a member of the in-sync list, are described herein. Given those processes, transferring (3802) one or more handles for the mediation service to the storage system being added may be implemented by extending the described process to send, to the system being added, a command indicating that the service handle to be used by its mediation handler is defined to be the service handle sent in the command.
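As a minimal sketch of the transferring (3802) step, and assuming an in-memory model in which a pod is simply a list of member objects, an existing member's handles might be copied to a newly added member as shown below. The `StorageSystem` and `Pod` classes are hypothetical illustrations, not structures defined by this disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageSystem:
    name: str
    # Service handles used by this system's mediation handler (assumed field).
    mediation_handles: List[str] = field(default_factory=list)


@dataclass
class Pod:
    members: List[StorageSystem] = field(default_factory=list)

    def add_member(self, new_system: StorageSystem) -> None:
        """Add a system and forward the pod's mediation handles to it (3802)."""
        # Any current member that already holds handles can act as the sender.
        donor = next((m for m in self.members if m.mediation_handles), None)
        if donor is not None:
            # Equivalent to a command saying "use the handle carried here".
            new_system.mediation_handles = list(donor.mediation_handles)
        self.members.append(new_system)
```

Because every added member copies the handles of an existing member, all members end up configured with the same mediation target without administrator action.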

[0520] For further explanation, FIG. 39 sets forth a flowchart illustrating an example method for automatic storage system configuration for a mediation service according to some embodiments of the present disclosure. The example method shown in FIG. 39 is similar to the example method shown in FIG. 37 in that the example method shown in FIG. 39 also includes: determining (3702) that a particular storage system (3700A) among the storage systems (3700A-3700N) is not configured to request mediation from a mediation target for mediation between the storage systems that synchronously replicate the dataset (3752); requesting (3704), by the particular storage system (3700A) from the configuration service (3751), configuration information indicating one or more service handles for the mediation service (3701); and configuring (3706), in accordance with the one or more service handles received from the configuration service (3751), the mediation handler to communicate with the mediation service in response to detecting a communication failure with one of the storage systems (3700B-3700N).

[0521] However, the example method shown in FIG. 39 further includes providing (3902), by the particular storage system (3700A) to the mediation service (3701), a secure or randomized key (3952) for mediation among the storage systems (3700A-3700N). Providing (3902) the secure or randomized key (3952) to the mediation service (3701) may be implemented as described in Application Serial No. 15/703,559, which is incorporated herein by reference in its entirety.

[0522] For further explanation, FIG. 40 sets forth a flowchart illustrating an example method for automatic storage system configuration for a mediation service according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (3700A-3700N) shown in FIG. 40 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3B, or any combination thereof. In fact, the storage systems (3700A-3700N) shown in FIG. 40 may include the same, fewer, or additional components as compared with the storage systems described above.

[0523] In this example implementation of automatic storage system configuration, instead of a storage system being preconfigured, before it is shipped or installed, to request or obtain a mediation service handle from a fixed, preconfigured location such as a configuration service, the storage system may be preconfigured with fixed contact information for the mediation service. On creating a pod for a dataset, a given storage system may configure use of the mediation service for mediating between storage systems in the pod by generating an instance of a key and providing the key to the mediation service. Further, on creating a pod for a dataset, the given storage system may also configure the other storage systems in the pod, for example as each storage system is added to the pod, by providing each of them with the same generated key, which each can present to the mediation service. On receiving the generated key, the other storage systems may use it to request mediation if they determine that mediation is an appropriate response to a system failure.

[0524] In this example, the mediation service may respond to a simple request to lock the use of a particular key. The mediation service is not configured with the particular key before receiving it, and the mediation service is not configured to receive the particular key from any particular storage system, pod, or customer. Once a particular key has been locked at the mediation service by a given storage system, no other storage system can lock that key at the mediation service.

[0525] In general, a key may be generated or named using a cryptographically secure scheme in order to prevent exploits or hacking attempts in which an attacker predicts the value of a key and locks it in advance. Further, the set of storage systems in a cluster or pod may use a particular key for deciding the next race for the cluster or pod, and may exchange a newly generated key for use after each mediation attempt. In another example, after a storage system is added to a cluster or pod, one aspect of pod configuration may include providing the storage system being added with the current key to use for the next mediation request. For example, if some event, such as a storage device failure or a network failure, triggers mediation, then one or more storage systems may request a lock on the current key from the mediation service. This request for a lock on the current key can succeed for at most one requester. If the response to the request is lost, the lock may be requested again (or, in some implementations, queried), and the subsequent request can succeed if the mediation service identifies it as coming from the same requester.
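The lock semantics described above (a key is unknown to the service until first presented, at most one requester succeeds, and a repeated request from the same requester succeeds again) might be modeled as in the following sketch. It assumes a single-process, in-memory service and a caller-supplied requester identifier; both are simplifications made for illustration only.

```python
import threading
from typing import Dict


class MediationService:
    """Toy in-memory model of a lock-per-key mediation service."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}   # key -> identity of the winner
        self._mutex = threading.Lock()      # serializes competing requests

    def lock(self, key: str, requester_id: str) -> bool:
        """Return True only for the first requester of a key, or its retries."""
        with self._mutex:
            # setdefault stores the requester only if the key is not yet locked,
            # and returns whichever identity currently owns the key.
            owner = self._owners.setdefault(key, requester_id)
            return owner == requester_id
```

A storage system whose response was lost can therefore safely retry: the retry succeeds for the original winner and fails for any competitor.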

[0526] In some examples, after a particular mediation key has been used once, any storage systems that remain active in the cluster or pod may exchange a new key for use in future mediation attempts. This exchange of a new key may also be performed when the previous mediation was never confirmed by receipt of a successful response from the mediation service, provided that the storage systems that were members at the time of the failure resume communicating with each other.
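A minimal sketch of single-use keys, assuming that key values are drawn from the operating system's cryptographically secure random source through Python's standard `secrets` module, is given below. The `PodMediationState` class and its method names are hypothetical.

```python
import secrets
from typing import Iterable, Set


def new_mediation_key() -> str:
    # 32 random bytes from a CSPRNG, hex encoded (64 characters), so that an
    # attacker cannot feasibly predict the value and lock it in advance.
    return secrets.token_hex(32)


class PodMediationState:
    """Tracks the members of a pod and the key for the next mediation race."""

    def __init__(self, members: Iterable[str]) -> None:
        self.members: Set[str] = set(members)
        self.current_key: str = new_mediation_key()

    def after_mediation_attempt(self, still_active: Iterable[str]) -> str:
        """Retire the used key and agree on a fresh one among the survivors."""
        self.members &= set(still_active)
        self.current_key = new_mediation_key()
        return self.current_key
```

Replacing the key after every attempt, whether or not the attempt was confirmed, keeps a stale lock at the mediation service from deciding a later race.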

[0527] In an alternative implementation, a customer may explicitly configure a mediation service, such as a service running in a virtual machine managed by the customer. In yet another implementation, there may be an intermediate step in configuring a cluster or pod, whereby a storage system contacts a mediation configuration service at one address in order to determine the service location address, or other contact information, that the cluster or pod will use for mediation. In this implementation, the intermediate step may be part of the core configuration process for the cluster or pod, the service location address or other contact information is exchanged among the storage systems in the cluster or pod, and the service location address or other contact information is sent to a new storage system when the new storage system is added to the cluster or pod. Such an implementation may be useful where a vendor arranges a particular mediation service instance for a particular customer. Alternatively, such an implementation may be used to explicitly locate a mediation service based on geography, or to match a customer's location to cloud service availability or reliability zones.

[0528] In some examples, the mediation service may be configured to be contacted by storage systems without customer interaction. The mediation service operates by taking a lock for a particular instance of mediation within a cluster or pod and responding with success only if no conflicting request has been made for the same lock. A storage system in the cluster or pod uses that success as part of ensuring that it can safely resume service for the cluster or pod after detecting a failure that has isolated it from at least one other storage system. The mediation service may be a cloud-based service, such as a service that provides multiple front-end web servers that receive requests from one or more connecting clients, where the front-end web servers may be advertised under a particular DNS hostname that maps to some number of IP addresses, which may be further virtualized behind multiple network switches. In some examples, multiple hostnames may be provided, and the front-end servers may be configured to spread, or distribute, mediation requests across multiple back-end servers. For example, given a particular mediation key associated with a particular lock request from a storage system, the key may be hashed to one of multiple back-end databases that implement a lock for each received key. Further, because multiple storage systems may use the same key when racing to mediate for the same pod, a storage system may contact any of the front-end web servers, provided that they are associated with the same overall cloud-based mediation service, and the key hashes to the same back-end database that implements the lock. In this example, the back-end database may be implemented as a distributed transactional database with appropriate guarantees (e.g., DynamoDB™, among others), as a highly available database server on shared storage, as a synchronously replicated database server with appropriate mechanisms for high availability and data redundancy, using an object storage model with suitably guaranteed conditional store primitives, or using any of a variety of other technologies.
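The routing of a mediation key to a back-end database might be sketched as follows. The sketch assumes SHA-256 as the hash function and models each back end as an in-memory dictionary with a put-if-absent operation standing in for a conditional store primitive; an actual deployment could use any of the database technologies listed above, and none of the class or function names below come from this disclosure.

```python
import hashlib
from typing import Dict, List


class Backend:
    """Stand-in for one back-end database that implements per-key locks."""

    def __init__(self) -> None:
        self.locks: Dict[str, str] = {}

    def put_if_absent(self, key: str, requester_id: str) -> str:
        # Conditional store: write only if the key is absent, return the owner.
        return self.locks.setdefault(key, requester_id)


def backend_index(key: str, backend_count: int) -> int:
    # Deterministic mapping from key to back end, identical on every front end.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % backend_count


class FrontEnd:
    """Stand-in for one front-end web server of the mediation service."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends

    def lock(self, key: str, requester_id: str) -> bool:
        backend = self.backends[backend_index(key, len(self.backends))]
        return backend.put_if_absent(key, requester_id) == requester_id
```

Because the mapping depends only on the key, two storage systems that reach different front ends with the same key are still arbitrated by the same back end.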

[0529] Further, in some examples, the mediation service may be implemented to ensure that each key, whether a key for a cluster, a pod, a customer, or another type of domain, is securely isolated from other keys, clusters, pods, customers, or other types of domains, depending on the implementation. This can ensure, for example, an effective multi-tenancy model for the mediation service.

[0530] The example method shown in FIG. 40 includes determining (4002), by a particular storage system (3700A) among the storage systems (3700A-3700N), to configure one or more of the storage systems to request mediation from the mediation service (3701) for mediation between the storage systems (3700A-3700N) that synchronously replicate the dataset (3752). Determining (4002) to configure one or more of the storage systems in this way may be implemented using different techniques. In one example, determining (4002) to configure a storage system may be implemented based on an expansion of the cluster or pod to include one or more additional storage systems: when new storage systems are added to the cluster or pod, each new storage system is given the key for mediation that each existing storage system is configured to use for mediation. In another example, determining (4002) to configure a storage system may be implemented by maintaining, for each storage system that is a member of the pod that synchronously replicates the dataset (3752), metadata describing whether the given storage system has been provided with an instance of the key, where the initial state of the metadata may indicate that the storage system does not have a key. For example, if there is no current agreement on a key, the particular storage system (3700A) may generate an instance of a key as described above, using cryptographic techniques that produce a key whose prediction would be computationally infeasible. Further, the particular storage system (3700A) may determine, during startup or periodically, that no key has been generated, that is, that there is no current agreement on a key value, and may generate an instance of a key in response. Further, in some examples, the mediation service may be part of the replicated content, including metadata, of the synchronized dataset (3752) among the storage systems that are in sync with each other. A quorum policy may then be used among the storage systems as part of an overall set of algorithms for determining which storage systems remain in sync with each other. In this way, the storage systems (3700A-3700N) may use the same instance of the key when the mediation service is used.

[0531] The example method shown in FIG. 40 also includes providing (4004), to one or more of the storage systems (3700A-3700N), an instance of a key (4052) for requesting mediation from the mediation service (3701). Providing (4004) the instance of the key (4052) may be implemented by transmitting the instance of the key from the particular storage system (3700A), using one or more network ports and across one or more communication networks, to each of two or more storage systems determined not to be configured to request mediation from the mediation service. In some cases, providing (4004) the instance of the key to one or more storage systems (3700B-3700N) may be performed in response to the particular storage system (3700A) determining (4002) that one or more of the storage systems are not configured to request mediation from the mediation service (3701) for mediation between the storage systems (3700A-3700N) that synchronously replicate the dataset (3752).
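One way to picture the metadata-driven variant of determining (4002) and providing (4004) is the sketch below. It assumes that the per-member metadata is a simple boolean and that transmission is abstracted as a `send` callable supplied by the caller; the `KeyDistributor` class is hypothetical and is not a structure named in this disclosure.

```python
import secrets
from typing import Callable, Dict, List, Optional


class KeyDistributor:
    """Run by one storage system to hand the pod's key to the other members."""

    def __init__(self, members: List[str]) -> None:
        # Initial metadata state: no member has been given an instance of a key.
        self.has_key: Dict[str, bool] = {m: False for m in members}
        self.current_key: Optional[str] = None

    def add_member(self, member: str) -> None:
        # A newly added system starts out without the key.
        self.has_key.setdefault(member, False)

    def members_needing_key(self) -> List[str]:
        # Step 4002: decide which systems still have to be configured.
        return [m for m, provided in self.has_key.items() if not provided]

    def distribute(self, send: Callable[[str, str], None]) -> List[str]:
        """Step 4004: send the same key to every member that lacks it."""
        if self.current_key is None:
            # No agreed key yet, so generate an unpredictable one.
            self.current_key = secrets.token_urlsafe(32)
        sent = []
        for member in self.members_needing_key():
            send(member, self.current_key)
            self.has_key[member] = True
            sent.append(member)
        return sent
```

Calling `distribute` again after the pod grows sends the existing key only to the newcomers, so all members continue to hold the same instance.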

[0532] In some implementations, under normal circumstances, a pod may start with a mediation key, and the mediation key is sent along as the pod is expanded to include additional storage systems. A particular storage system, which in some cases may be considered a "leader," sends the new mediation key required for subsequent mediation races in response to a key having been used once. In other words, in contrast to the implementations described above, in some implementations, instead of explicitly determining whether a storage system is configured to use a mediation key, a storage system may have access to a mediation key as part of, and based on, creation of the pod; in response to a storage system being added to the pod, the mediation key is distributed to the new storage system as part of synchronizing data and metadata across the storage systems in the pod. Further, as described above, in this implementation a new key may be generated and distributed to the storage systems in the pod in response to a mediation key being used or an attempt to use it. In this way, the storage systems in a pod may be configured to use the same mediation key, which is generated in response to creation of the pod, expansion of the pod to include additional storage systems, or use or attempted use of a mediation key. As described above, configuring a storage system to use a mediation key may include designating, as the current key usable by the mediation handler, the mediation key that is synchronized across the storage systems of the pod.

[0533] The example method shown in FIG. 40 also includes providing (4006) an instance of the key (4052) to the intermediary service (3701), where the intermediary service provides intermediation to a given storage system that provides the instance of the key (4052). Providing (4006) the instance of the key (4052) to the intermediary service (3701) may be implemented by transmitting the instance of the key from the particular storage system (3700A) to the intermediary service (3701) using one or more network ports and across one or more communication networks. In some cases, providing (4006) the instance of the key (4052) to the intermediary service (3701) may be performed in response to determining (4002), by a particular storage system (3700A) among the storage systems (3700A-3700N), that one or more of the storage systems is not configured to request intermediation from the intermediary service (3701) for intermediation between the storage systems (3700A-3700N) that synchronously replicate the dataset (3752).

[0534] The example method shown in FIG. 40 also includes configuring (708) the intermediary handler (3762) to provide the instance of the key to the intermediary service in response to detecting a communication failure with at least one of the storage systems (3700B-3700N). Configuring (4006) the intermediary handler (3762) in this way may be implemented by defining the current key to be the instance of the generated key that is provided to the intermediary service (3701) when the intermediary handler (3762) responds to a communication failure.
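As a rough illustration of the handler behavior described above, the sketch below has each handler present its current key when it detects a communication failure. The rule that the service grants a given key only to the first system that presents it is an assumption made for this example, as are the class and method names; the real intermediary service is an external component whose interface is not specified here.

```python
class IntermediaryService:
    """Toy stand-in for the external service (assumed first-presenter-wins rule)."""

    def __init__(self):
        self._winners = {}  # key -> name of the system that presented it first

    def request(self, key, system):
        winner = self._winners.setdefault(key, system)
        return winner == system


class IntermediaryHandler:
    def __init__(self, system, service, current_key):
        self.system = system
        self.service = service
        # The current key is defined to be the key synchronized across the pod.
        self.current_key = current_key

    def on_communication_failure(self):
        # Present the current key; True means this system won intermediation.
        return self.service.request(self.current_key, self.system)
```

For example, if two handlers hold the same key and both detect the failure, only the one that reaches the service first receives `True`; the other receives `False` and would be expected to stop serving the pod.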

[0535] For further explanation, FIG. 41 illustrates metadata representations that may be implemented as structured collections of metadata objects that together may represent a logical volume of stored data, or a portion of a logical volume, in accordance with some embodiments of the present disclosure. Metadata representations (4150, 4154, and 4160) may be stored within the storage system (4106), and one or more metadata representations may be generated and maintained for each of multiple storage objects, such as volumes or portions of volumes, stored within the storage system (4106).

[0536] While other types of structured collections of metadata objects are possible, in this example a metadata representation may be structured as a directed acyclic graph (DAG) of nodes, where the DAG may be structured and balanced according to a variety of methods in order to maintain efficient access to any given node. For example, the DAG for a metadata representation may be defined as a type of B-tree and balanced accordingly in response to changes to the structure of the metadata representation, where such changes may occur in response to, or in addition to, changes to the underlying data represented by the metadata representation. While this example has only two levels for simplicity, in other examples a metadata representation may span multiple levels and include hundreds or thousands of nodes, with each node including any number of links to other nodes.

[0537] Further, in this example, the leaves of a metadata representation may include pointers to the data stored for the volume or portion of the volume, and a logical address, that is, a volume and an offset, may be used to identify the metadata representation and navigate through it to reach one or more leaf nodes that reference the stored data corresponding to the logical address. For example, a volume (4152) may be represented by a metadata representation (4150) that includes metadata object nodes (4152, 4152A-4152N), where the leaf nodes (4152A-4152N) include pointers to respective data objects (4153A-4153N, 4157). A data object may be a unit of data of any size within the storage system (4106). For example, the data objects (4153A-4153N, 4157) may each be a logical extent, and a logical extent may be of some specified size, such as 1 MB or 4 MB, or of some other size.
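The navigation from a logical address to stored data can be illustrated with a minimal two-level sketch. The fixed extent size of 4 bytes, the dictionary of volumes, and all names below are assumptions chosen to keep the example small; they are not part of the disclosure.

```python
EXTENT_SIZE = 4  # bytes per data object; deliberately tiny for the example


class MetadataRepresentation:
    """Two levels: a root holding one leaf pointer per fixed-size extent."""

    def __init__(self, leaf_pointers):
        self.leaf_pointers = list(leaf_pointers)  # each entry points to a bytes object

    def locate(self, offset):
        # Map a volume offset to (data object, offset within that object).
        index, within = divmod(offset, EXTENT_SIZE)
        return self.leaf_pointers[index], within

    def read(self, offset, length):
        out = bytearray()
        for position in range(offset, offset + length):
            data_object, within = self.locate(position)
            out.append(data_object[within])
        return bytes(out)


# The volume name selects the metadata representation; the offset selects the leaf.
volumes = {"vol": MetadataRepresentation([b"abcd", b"efgh"])}


def read(volume_name, offset, length):
    return volumes[volume_name].read(offset, length)
```

A read that spans two data objects, such as `read("vol", 2, 4)`, visits both leaf pointers in turn.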

[0538] In this example, a snapshot (4156) may be created as a snapshot of a stored object, in this case the volume (4152). At the time the snapshot (4156) is created, the metadata representation (4154) for the snapshot (4156) includes all of the metadata objects of the metadata representation (4150) for the volume (4152). Further, in response to the creation of the snapshot (4156), the metadata representation (4154) may be designated as read-only. However, the volume (4152) that shares the metadata representation may continue to be modified. At the moment the snapshot is created, the metadata representations for the volume (4152) and the snapshot (4156) are identical; as modifications are made to the data corresponding to the volume (4152), the two metadata representations diverge and become different.

[0539] For example, given a metadata representation (4150) representing a volume (4152) and a metadata representation (4154) representing a snapshot (4156), the storage system (4106) may receive an I/O operation that writes to data ultimately stored in a particular data object (4153B), where the data object (4153B) is pointed to by a leaf node pointer (4152B) that is part of both metadata representations (4150, 4154). In response to the write operation, the read-only data objects (4153A-4153N) referenced by the metadata representation (4154) remain unchanged, and the pointer (4152B) may also remain unchanged. However, the metadata representation (4150) representing the current volume (4152) is modified to include a new data object to hold the data written by the write operation, and the modified metadata representation is shown as metadata representation (4160). Further, the write operation may be directed to only a portion of the data object (4153B), and consequently the new data object (4157) may include a copy of the previous contents of the data object (4153B) in addition to the payload of the write operation.

[0540] In this example, as part of processing the write operation, the metadata representation (4160) for the volume (4152) is modified to remove the existing metadata object pointer (4152B) and to include a new metadata object pointer (4158), where the new metadata object pointer (4158) is configured to point to a new data object (4157) that stores the data written by the write operation. Further, the metadata representation (4160) for the volume (4152) continues to include all of the metadata objects included in the previous metadata representation (4150), with the exception of the metadata object pointer (4152B) that referenced the target data object. In the snapshot's metadata representation, the metadata object pointer (4152B) continues to reference the read-only data object (4153B) that would otherwise have been overwritten.
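The snapshot and partial-write behavior described in the preceding paragraphs can be sketched as below. It is a simplified illustration under stated assumptions: data objects are modeled as immutable Python `bytes`, leaf pointers as list entries, and writes are limited to a single extent; the `Volume` class and its methods are hypothetical names.

```python
class Volume:
    def __init__(self, data_objects, extent_size):
        self.extent_size = extent_size
        # Leaf pointers; each refers to an immutable (read-only) data object.
        self.pointers = list(data_objects)

    def snapshot(self):
        # Only the pointers are copied; no data object is duplicated.
        return tuple(self.pointers)

    def write(self, offset, payload):
        index, start = divmod(offset, self.extent_size)
        if start + len(payload) > self.extent_size:
            raise ValueError("this sketch only handles writes within one extent")
        old = self.pointers[index]
        # The new data object holds the previous contents overlaid with the payload.
        new = old[:start] + payload + old[start + len(payload):]
        # Repoint the volume's leaf; the old object stays untouched for the snapshot.
        self.pointers[index] = new
```

After a snapshot, a write to part of one extent creates exactly one new data object; every other pointer is still shared between the volume and the snapshot.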

[0541] In this way, using metadata representations, a volume or a portion of a volume may be considered to be snapshotted, or considered to be copied, by creating metadata objects and without actually duplicating data objects. Duplication of a data object may be deferred until a write operation is directed at one of the read-only data objects referenced by the metadata representation.

[0542] In other words, an advantage of using a metadata representation to represent a volume is that a snapshot or copy of the volume may be created and become accessible in constant-order time, specifically within the time it takes to create a metadata object for the snapshot or copy and to create a reference from that snapshot or copy metadata object to the existing metadata representation of the volume being snapshotted or copied.

[0543] As an example use, a virtualized copy-by-reference may make use of a metadata representation in a manner similar to the use of a metadata representation in creating a snapshot of a volume, except that the metadata representation for a virtualized copy-by-reference may often correspond to a portion of the metadata representation for an entire volume. An example implementation of virtualized copy-by-reference may be within the context of a virtualized storage system, where multiple block ranges within and between volumes may reference a unified copy of the stored data. In such a virtualized storage system, the metadata described above may be used to handle the relationship between virtual, or logical, addresses and physical, or real, addresses. In other words, the metadata representation of stored data enables a virtualized storage system that may be considered flash-friendly in that it reduces or minimizes wear on flash memory.

[0544] In some examples, logical extents may be combined in a variety of ways, including as simple collections or as logically related address ranges within some larger logical extent that is formed as a set of logical extent references. These larger combinations may also be given logical extent identities of various kinds and may be further combined into still larger logical extents or collections. Copy-on-write status may apply to various layers, and in various ways, depending on the implementation. For example, copy-on-write status applied to one logical collection of extents may result in a copied collection retaining references to unchanged logical extents, with copied-on-write logical extents being created (by copying references to any unchanged stored data blocks as needed) when only part of the copy-on-write logical collection is changed.

[0545] Deduplication, volume snapshots, or block range snapshots may be implemented in this model through combinations of referencing stored data blocks, referencing logical extents, or marking logical extents (or identified collections of logical extents) as copy-on-write.

[0546] Further, with flash storage systems, stored data blocks may be organized and grouped together in a variety of ways. Because collections are written out into pages that are part of larger erase blocks, the eventual garbage collection of deleted or replaced stored data blocks may involve moving the content stored in some number of pages elsewhere so that an entire erase block can be erased and prepared for reuse. This process of selecting physical flash pages, eventually migrating and garbage collecting them, and then erasing flash erase blocks for reuse may or may not be coordinated, driven, or performed by the aspect of the storage system that also handles logical extents, deduplication, compression, snapshots, virtual copying, or other storage system functions. A coordinated or driven process for selecting pages, migrating pages, garbage collecting, and erasing erase blocks may further take into account various characteristics of the flash memory device cells, pages, and erase blocks, such as the number of uses, aging predictions, and adjustments to voltage levels or numbers of retries needed in the past to recover stored data. Such a process may also take into account analysis and predictions across all flash memory devices within the storage system.
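The relocate-then-erase step described above might look like the following sketch. Pages are modeled as list entries and the set of still-referenced pages is passed in by the caller; the function name and these data structures are assumptions for illustration, and a real system would also weigh wear and aging characteristics when choosing which erase block to collect.

```python
def collect_erase_block(erase_block, live_pages, destination):
    """Move live pages out of an erase block, then erase the whole block.

    erase_block: list of page contents (None marks an empty page)
    live_pages:  set of page indexes that are still referenced
    destination: list (another erase block) that receives the live pages
    Returns a mapping from old page index to new index in the destination.
    """
    moved = {}
    for index in sorted(live_pages):
        moved[index] = len(destination)
        destination.append(erase_block[index])
    # Flash can only be erased a whole erase block at a time.
    erase_block[:] = [None] * len(erase_block)
    return moved
```

The returned mapping is what the metadata layer would use to repoint block references at the relocated pages.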

[0547] To continue with this example, in which a storage system may be implemented based on a directed acyclic graph comprising logical extents, logical extents can be categorized into two types: leaf logical extents, which reference some amount of stored data in some way, and composite logical extents, which reference other leaf or composite logical extents.

[0548] A leaf extent can reference data in a variety of ways. It may point directly to a single range of stored data (e.g., 64 kilobytes of data), or it may be a collection of references to stored data (e.g., a 1 megabyte "range" of content that maps some number of virtual blocks associated with the range to physically stored blocks). In the latter case, these blocks may be referenced using some identity, and some blocks within the range of the extent may not be mapped to anything. Also in the latter case, these block references need not be unique, which allows multiple mappings from virtual blocks within some number of logical extents, within and across some number of volumes, to map to the same physically stored blocks. Instead of stored block references, a logical extent could encode simple patterns. For example, a block that is a string of identical bytes could simply be encoded as a repeated pattern of identical bytes.
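A leaf extent of the "collection of references" kind could be modeled as in the sketch below. The 8-byte block size, the use of strings as block identities, the `Pattern` type, and the dictionary standing in for physical storage are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

BLOCK_SIZE = 8  # bytes per virtual block; tiny for the example


@dataclass(frozen=True)
class Pattern:
    byte: int  # the block is this single byte repeated BLOCK_SIZE times


# A reference is either the identity of a stored block or an encoded pattern.
BlockRef = Union[str, Pattern]


class LeafExtent:
    def __init__(self, refs: Dict[int, BlockRef]) -> None:
        # virtual block index -> reference; a missing index is unmapped
        self.refs = refs

    def read_block(self, index: int, store: Dict[str, bytes]) -> Optional[bytes]:
        ref = self.refs.get(index)
        if ref is None:
            return None  # nothing is mapped at this virtual block
        if isinstance(ref, Pattern):
            return bytes([ref.byte]) * BLOCK_SIZE  # no stored block is needed
        return store[ref]
```

Because references need not be unique, two virtual blocks (in the same extent or in different ones) can name the same stored block identity.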

[0549] A composite logical extent may be a logical range of content with some virtual size, comprising a plurality of maps that each map from a subrange of the composite logical extent's logical range of content to an underlying leaf or composite logical extent. Translating a request related to the content of a composite logical extent then involves taking the content range of the request within the context of the composite logical extent, determining which underlying leaf or composite logical extents that request maps to, and translating the request so that it applies to the appropriate range of content within those underlying leaf or composite logical extents.
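The request translation described above can be sketched recursively. In this illustration each map is a pair of a starting offset and a child extent, and the children are assumed to be contiguous and sorted by offset; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    name: str
    size: int


@dataclass
class Composite:
    # Each map entry: (start offset within this composite, child extent).
    maps: List[Tuple[int, Union["Leaf", "Composite"]]]

    @property
    def size(self) -> int:
        start, child = self.maps[-1]
        return start + child.size


def translate(extent, offset, length):
    """Return [(leaf name, offset within leaf, length)] covering the request."""
    if isinstance(extent, Leaf):
        return [(extent.name, offset, length)]
    pieces = []
    for start, child in extent.maps:
        lo = max(offset, start)
        hi = min(offset + length, start + child.size)
        if lo < hi:
            # Re-express the overlapping part relative to the child extent.
            pieces += translate(child, lo - start, hi - lo)
    return pieces
```

A request that crosses subrange boundaries is split into one piece per underlying leaf extent, each expressed in that leaf's own offsets.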

[0550] Volumes, or files or other types of storage objects, can be described as composite logical extents. Thus, these presented storage objects can be organized using this extent model.

[0551] Depending on the implementation, leaf or composite logical extents may be referenced from a plurality of other composite logical extents, effectively allowing inexpensive duplication of larger collections of content within and across volumes. Thus, logical extents may be arranged essentially within an acyclic graph of references, each path ending in a leaf logical extent. This can be used to make copies of volumes, to make snapshots of volumes, or as part of supporting virtual range copies within and between volumes as part of an EXTENDED COPY or similar type of operation.

[0552] An implementation may provide each logical extent with an identity that can be used to name it. This simplifies referencing, since the references within a composite logical extent become lists comprising logical extent identities and the logical subrange corresponding to each such identity. Within a logical extent, each stored data block reference may also be based on some identity used to name it.

[0553] To support these duplicated uses of extents, a further capability can be added: copy-on-write logical extents. When a modification operation affects a copy-on-write leaf or composite logical extent, the logical extent is copied, with the copy being a new reference and possibly having a new identity (depending on the implementation). The copy retains all references or identities related to underlying leaf or composite logical extents, apart from whatever modifications result from the modification operation. For example, a WRITE, WRITE SAME, XDWRITEREAD, XPWRITE, or COMPARE AND WRITE request may store new blocks in the storage system (or use deduplication techniques to identify existing stored blocks), resulting in the corresponding leaf logical extents being modified to reference or store identities for a new set of blocks, possibly replacing the references and stored identities for a previous set of blocks. Alternatively, an UNMAP request may modify a leaf logical extent to remove one or more block references. In both types of cases, a leaf logical extent is modified. If the leaf logical extent is copy-on-write, then a new leaf logical extent is created, formed by copying the unaffected block references from the old extent and then replacing or removing block references based on the modification operation.

[0554] The composite logical extent that was used to locate the leaf logical extent may then be modified to store the new leaf logical extent reference or identity associated with the copied and modified leaf logical extent, as a replacement for the previous leaf logical extent. If that composite logical extent is copy-on-write, then a new composite logical extent is created as a new reference or with a new identity, any unaffected references or identities for its underlying logical extents are copied to the new composite logical extent, and the previous leaf logical extent reference or identity is replaced with the new leaf logical extent reference or identity.

[0555] This process continues further backward, from referenced extent to referencing composite extent, along the search path through the acyclic graph that was used to process the modification operation, until all copy-on-write logical extents on that path have been copied, modified, and replaced.

[0556] These copied leaf and composite logical extents may then drop the characteristic of being copy-on-write, so that further modifications do not result in additional copies. For example, the first time some underlying logical extent within a copy-on-write "parent" composite extent is modified, that underlying logical extent is copied and modified, and the copy has a new identity that is then written into a copied and replaced instance of the parent composite logical extent. However, the second time some other underlying logical extent is copied and modified, and the new identity of that other underlying logical extent copy is written into the parent composite logical extent, the parent can be modified in place, with no further copy and replacement needed on behalf of references to the parent composite logical extent.
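The copy, modify, and replace sequence of the preceding paragraphs can be sketched as follows. This is a simplified model under stated assumptions: identities are sequential integers, a snapshot is taken by marking every extent in the graph copy-on-write, and the path is walked from the root downward, which produces the same copies as working backward from the leaf. The `Extent` class and the function names are hypothetical.

```python
import itertools

_ids = itertools.count(1)


class Extent:
    def __init__(self, children=None, blocks=None, cow=False):
        self.id = next(_ids)      # identity used to name the extent
        self.children = children  # list of Extent for a composite, else None
        self.blocks = blocks      # list of block references for a leaf, else None
        self.cow = cow

    def copy(self):
        # The copy has a new identity and is not itself copy-on-write.
        # Child extents are shared by reference and keep their own flags.
        return Extent(
            children=list(self.children) if self.children is not None else None,
            blocks=list(self.blocks) if self.blocks is not None else None,
        )


def mark_cow(extent):
    """Mark an extent and everything below it copy-on-write (e.g. for a snapshot)."""
    extent.cow = True
    for child in extent.children or []:
        mark_cow(child)


def write_block(root, path, block_index, ref):
    """Set one block reference in the leaf reached via `path` (child indexes).

    Returns the root to use afterwards: `root` itself, or its copy if
    `root` was copy-on-write.
    """
    node = root.copy() if root.cow else root
    new_root = node
    for i in path:
        child = node.children[i]
        if child.cow:
            child = child.copy()
            node.children[i] = child  # replace the reference in the writable parent
        node = child
    node.blocks[block_index] = ref
    return new_root
```

In this sketch the first write after `mark_cow` copies both the leaf and its parent, while a later write through the same parent to a different leaf copies only that leaf and updates the already-copied parent in place.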

[0557] A modification operation to a new region of a volume or of a composite logical extent, for which there is no current leaf logical extent, may create a new leaf logical extent to store the results of the modification. If that new logical extent is to be referenced from an existing copy-on-write composite logical extent, then that existing composite logical extent is modified to reference the new logical extent, resulting in another sequence of copy, modify, and replace operations similar to the sequence for modifying an existing leaf logical extent.

[0558] If a parent composite logical extent cannot be grown large enough (depending on the implementation) to cover the associated address range that includes the new leaf logical extents to be created for a new modification operation, then the parent composite logical extent may be copied into two or more new composite logical extents, which are then referenced from a single "grandparent" composite logical extent that is, once again, a new reference or a new identity. If that grandparent logical extent is itself found through another composite logical extent that is copy-on-write, then that other composite logical extent is copied, modified, and replaced in a manner similar to that described in the preceding paragraphs. This copy-on-write model can be used as part of implementing snapshots, volume copies, and virtual volume address range copies within a storage system implementation based on these directed acyclic graphs of logical extents. To make a snapshot as a read-only copy of an otherwise writable volume, the graph of logical extents associated with the volume is marked copy-on-write, and a reference to the original composite logical extents is retained by the snapshot. Modification operations on the volume then make logical extent copies as needed, resulting in a volume that stores the results of those modification operations and a snapshot that retains the original content. Volume copies are similar, except that both the original volume and the copied volume can modify content, each resulting in its own copied logical extent graphs and subgraphs.

[0559] Virtual volume address range copies may operate by copying block references within and between leaf logical extents (which does not itself involve copy-on-write techniques unless the changes to block references modify copy-on-write leaf logical extents). Alternatively, virtual volume address range copies may duplicate references to leaf or composite logical extents, which works well for volume address range copies of larger address ranges. Further, this allows the graph to become a directed acyclic graph of references rather than merely a tree of references. Copy-on-write techniques associated with duplicated logical extent references can be used to ensure that modification operations on the source or target of a virtual address range copy result in the creation of new logical extents to store those modifications, without affecting the target or source that shares the same logical extent immediately after the volume address range copy operation.
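The first variant above, copying block references between leaf extents, can be sketched in a few lines. Leaf extents are modeled here as dictionaries from virtual block index to a block identity; that representation and the function name are assumptions for illustration only.

```python
def range_copy(src_refs, src_start, dst_refs, dst_start, count):
    """Copy `count` block references from one leaf extent to another.

    Only references are copied; the stored blocks themselves are shared.
    The source references are read first so that overlapping ranges within
    the same extent are handled correctly.
    """
    refs = [src_refs.get(src_start + i) for i in range(count)]
    for i, ref in enumerate(refs):
        if ref is None:
            dst_refs.pop(dst_start + i, None)  # unmapped source -> unmapped target
        else:
            dst_refs[dst_start + i] = ref
```

A later write to either range would install a new block reference in that extent only, so the other range continues to see the shared block.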

[0560] Input/output operations for pods may also be implemented based on replicating directed acyclic graphs of logical extents. For example, each storage system within a pod could implement its own private graph of logical extents, such that the graph on one storage system for a pod has no particular relationship to the graph on any second storage system for the pod. However, there is value in synchronizing the graphs between the storage systems in a pod. This can be useful for resynchronization and for coordinating features such as asynchronous replication or snapshot-based replication to remote storage systems. It may also be useful for reducing some of the overhead of handling the distribution of snapshots and of copy-related processing. In such a model, keeping the content of a pod in sync across all in-sync storage systems for the pod is essentially the same as keeping the graphs of leaf and composite logical extents in sync for all volumes across all in-sync storage systems for the pod, and ensuring that the content of all logical extents is in sync. To be in sync, matching leaf and composite logical extents should either have the same identity or have mappable identities. Mapping could involve some set of intermediate mapping tables or some other type of identity translation. In some cases, the identities of blocks mapped by leaf logical extents could also be kept in sync.

［0561］ポッドごとに単一リーダーを有するリーダー及びフォロワーに基づいたポッド実施態様では、リーダーは、論理エクステントグラフに対する任意の変更を決定することを任される場合がある。新しいリーフ論理エクステント又は複合論理エクステントが作成される場合、それはアイデンティティを与えられる場合がある。既存のリーフ論理エクステント又は複合論理エクステントが修正を有する新しい論理エクステントを形成するためにコピーされる場合、新しい論理エクステントは、修正のなんらかの集合を有する以前の論理エクステントのコピーとして記述される場合がある。既存の論理エクステントが分割される場合、分割は、結果として生じる新しいアイデンティティとともに記述される場合がある。論理エクステントが、なんらかの追加の複合論理エクステントから基本的な論理エクステントとして参照される場合、その参照は、その基本的な論理エクステントを参照するための複合論理エクステントに対する変更として記述される場合がある。 [0561] In a leader-and-follower based pod implementation with a single leader per pod, the leader may be responsible for determining any changes to the logical extent graph. When a new leaf logical extent or composite logical extent is created, it may be given an identity. When an existing leaf logical extent or composite logical extent is copied to form a new logical extent with modifications, the new logical extent may be described as a copy of the previous logical extent with some set of modifications. When an existing logical extent is split, the split may be described along with the resulting new identity. When a logical extent is referenced as a base logical extent from some additional composite logical extent, the reference may be described as a change to the composite logical extent to reference that base logical extent.

［0562］ポッドでの修正動作は、このようにして（コンテンツを拡張するために新しい論理エクステントが作成される場合、又はスナップショット、ボリュームコピー、及びボリュームアドレス範囲コピーに関係するコピーオンライト状態を処理するために、論理エクステントがコピーされ、修正され、置換される場合）論理エクステントグラフに対する修正の記述を配布すること、及びリーフ論理エクステントのコンテンツに対する修正のための記述及びコンテンツを配布することを含む。有向非巡回グラフの形のメタデータを使用することからくる追加の利点は、上述されたように、物理ストレージ内の記憶されたデータを修正するＩ／Ｏ動作が、物理ストレージに記憶されたデータに対応するメタデータの修正によりユーザレベルで―物理ストレージに記憶されたデータを修正することなく―効果を与えられ得る点である。物理ストレージがソリッドステートドライブであってよいストレージシステムの開示された実施形態では、フラッシュメモリに対する修正に付随する磨耗は、フラッシュメモリの読取り、消去、又は書込みを通しての代わりに、Ｉ／Ｏ動作によってターゲットにされるデータを表すメタデータの修正により効果を与えられるＩ／Ｏ動作のために回避又は削減され得る。さらに、仮想化ストレージシステムでは、上述されたメタデータは、仮想アドレス、つまり論理アドレスと物理アドレス、つまり実アドレスとの関係性を処理するために使用されてよい－言い換えると、記憶されたデータのメタデータ表現は、それがフラッシュメモリ上の摩耗を削減する又は最小限に抑える点でフラッシュフレンドリと見なされてよい仮想化ストレージシステムを可能にする。 [0562] Modification operations in a pod thus include distributing descriptions of modifications to the logical extent graph (when new logical extents are created to extend content, or when logical extents are copied, modified, or replaced to handle copy-on-write conditions related to snapshots, volume copies, and volume address range copies), and distributing descriptions and contents for modifications to the content of leaf logical extents. An additional advantage of using metadata in the form of a directed acyclic graph, as described above, is that I/O operations that modify stored data in physical storage can be effected at the user level—without modifying the data stored in physical storage—by modifying the metadata corresponding to the data stored in physical storage. In disclosed embodiments of a storage system in which the physical storage may be a solid-state drive, wear associated with modifications to flash memory can be avoided or reduced due to I/O operations being effected by modifying metadata representing the data targeted by the I/O operation, instead of through a read, erase, or write to flash memory. 
Furthermore, in a virtualized storage system, the metadata described above may be used to handle the relationship between virtual addresses, i.e., logical addresses, and physical addresses, i.e., real addresses - in other words, the metadata representation of stored data enables a virtualized storage system that may be considered flash-friendly in that it reduces or minimizes wear on flash memory.

［0563］リーダーストレージシステムは、ポッドデータセットのそのローカルコピー及びローカルストレージシステムのメタデータとの関連でこれらの記述を実装するために、独自のローカル動作を実行してよい。さらに、同期中のフォロワーは、ポッドデータセットのそれらの別々のローカルコピー及びその別々のローカルストレージシステムのメタデータとの関連でこれらの記述を実装するために、独自の別々のローカル動作を実行する。リーダー動作とフォロワー動作の両方が完了すると、結果は互換性のあるリーフ論理エクステントコンテンツを有する論理エクステントの互換性のあるグラフである。論理エクステントのこれらのグラフは、次いで上記の例に説明されたように一種の「共通メタデータ」になる。この共通メタデータは、修正動作と必須共通メタデータとの間の依存関係として記述される場合がある。グラフへの変換は、１つ以上の他の動作とともに、例えば依存関係等の関係を記述してよいプレディケートの集合又はより多くのプレディケートの中の別々の動作として記述できる。言い換えると、動作間の相互依存性は、１つの動作がなんらかの方法で依存するプレカーソルの集合として記述されてよく、プレカーソルの集合は、動作が完了するために真でなければならないプレディケートと見なされてよい。プレディケートのより完全な説明は、その全体として参照により本明細書に含まれる出願参照番号第１５／６９６，４１８号の中で見つけられてよい。代わりに、ポッド全体で完了するとまだ知られていない特定の同じグラフ変換に頼る各修正動作は、それが頼る任意のグラフ変換の部分を含む場合がある。既に存在している「新しい」リーフ論理エクステント又は複合論理エクステントを識別する動作記述を処理することは、その部分はすでになんらかの早期の動作の処理で処理されていたので、新しい論理エクステントを作成することを回避することができ、代わりにリーフ論理エクステント又は複合論理エクステントのコンテンツを変更する動作処理の部分しか実装できない。変換が互いに互換性があることを確実にすることはリーダーの役割である。例えば、ポッドを受け取る２つの書込みで開始できる。第１の書込みは、複合論理エクステントＡを複合論理エクステントＢとして形成されたコピーで置換し、リーフ論理エクステントＣをリーフ論理エクステントＤとしてのコピーで及び第２の書込みのためのコンテンツを記憶するための修正で置換し、さらにリーフ論理エクステントＤを複合論理エクステントＢに書き込む。一方で、第２の書込みは、複合論理エクステントＡの同じコピー及び複合論理エクステントＢとの置換を暗示するが、異なるリーフ論理エクステントＥをコピーし、第２の書込みのコンテンツを記憶するために修正される論理エクステントＦで置換し、さらに論理エクステントＦを論理コンテンツＢに書き込む。その場合には、第１の書込みの記述は、ＡのＢとの置換及びＣのＤとの置換、並びにＤの複合論理エクステントＢへの書込み、及び第１の書込みのコンテンツのリーフエクステントＢへの書込みを含む場合があり、第２の書込みの記述は、リーフエクステントＦに書き込まれる第２の書込みのコンテンツとともに、ＡのＢとの置換及びＥのＦとの置換、並びにＦの複合論理エクステントＢへの書込みを含む場合がある。リーダー又は任意のフォロワーは、次いで任意の順序で第１の書込み又は第２の書込みを処理でき、最終結果は、ＢがＡをコピーし、置換し、ＤがＣをコピーし、置換し、ＦがＥをコピーし、置換し、Ｄ及びＦが複合論理エクステントＢに書き込まれることである。Ｂを形成するためのＡの第２のコピーは、Ｂがすでに存在していることを認識することによって回避できる。このようにして、リーダーは、ポッドのための同期中のストレージシステム全体で論理エクステントグラフのために互換性のある共通メタデータを維持することを確実にする場合がある。 [0563] The leader storage system may perform its own local operations to implement these descriptions in the context of its local copy of the pod dataset and the metadata of its local storage system. 
Additionally, in-sync followers perform their own separate local operations to implement these descriptions in the context of their separate local copies of the pod dataset and the metadata of their separate local storage systems. When both the leader and follower operations are complete, the result is compatible graphs of logical extents with compatible leaf logical extent content. These graphs of logical extents then become a kind of "common metadata," as described in the examples above. This common metadata may be described as dependencies between modifying operations and required common metadata. A transformation to a graph can be described as a separate operation within a set of one or more predicates that may describe a relationship, such as a dependency, with one or more other operations. In other words, interdependencies between operations may be described as a set of precursors on which one operation depends in some way, where the set of precursors may be considered predicates that must be true for the operation to complete. A more complete description of predicates may be found in Application Serial No. 15/696,418, which is incorporated herein by reference in its entirety. Alternatively, each modifying operation that relies on a particular graph transformation that is not yet known to be complete across the pod can include the parts of any graph transformation that it relies on. Processing an operation description that identifies a "new" leaf or composite logical extent that already exists can avoid creating the new logical extent, because that part was already handled in the processing of some earlier operation, and can instead implement only the parts of the operation processing that change the content of leaf or composite logical extents. It is the role of the leader to ensure that the transformations are compatible with each other. For example, consider two writes that arrive for a pod.
The first write replaces composite logical extent A with a copy formed as composite logical extent B, replaces leaf logical extent C with a copy formed as leaf logical extent D that is modified to store the content of the first write, and further writes leaf logical extent D into composite logical extent B. Meanwhile, the second write implies the same copy and replacement of composite logical extent A with composite logical extent B, but copies and replaces a different leaf logical extent E with leaf logical extent F, which is modified to store the content of the second write, and further writes logical extent F into composite logical extent B. In that case, the description of the first write may include the replacement of A with B, the replacement of C with D, the writing of D into composite logical extent B, and the writing of the content of the first write into leaf extent D; and the description of the second write may include the replacement of A with B, the replacement of E with F, and the writing of F into composite logical extent B, along with the content of the second write, which is written to leaf extent F. The leader or any follower can then process the first write and the second write in either order, and the end result is that B copies and replaces A, D copies and replaces C, F copies and replaces E, and D and F are written into composite logical extent B. A second copy of A to form B can be avoided by recognizing that B already exists. In this way, the leader may ensure that the pod maintains compatible common metadata for the logical extent graph across the in-sync storage systems for the pod.
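For further explanation, the two-write example above can be sketched in Python. This is a minimal illustration only, not the disclosed implementation: the `ExtentGraph` class, the dictionary layout of an operation description, and the sample contents are all assumptions introduced here. The sketch shows that when each description carries the graph transformations it relies on, and steps that were already performed are skipped, the two writes can be processed in either order with the same result.

```python
from dataclasses import dataclass, field


@dataclass
class ExtentGraph:
    """Hypothetical local copy of a pod's logical extent graph on one storage system."""
    composites: dict = field(default_factory=dict)  # composite identity -> set of child identities
    leaves: dict = field(default_factory=dict)      # leaf identity -> content
    replaced: dict = field(default_factory=dict)    # old identity -> identity that replaced it

    def apply(self, op):
        """Apply one operation description, skipping steps an earlier operation already did."""
        for old, new in op["composite_copies"]:
            if new not in self.composites:          # B already exists: do not copy A again
                self.composites[new] = set(self.composites[old])
                self.replaced[old] = new
        for old, new in op["leaf_copies"]:
            if new not in self.leaves:
                self.leaves[new] = self.leaves[old]
                self.replaced[old] = new
        for composite, old_leaf, new_leaf in op["rewires"]:
            children = self.composites[composite]
            children.discard(old_leaf)
            children.add(new_leaf)
        for leaf, content in op["leaf_writes"]:
            self.leaves[leaf] = content


def initial_graph():
    # Composite extent A references leaf extents C and E (assumed starting state).
    return ExtentGraph(composites={"A": {"C", "E"}},
                       leaves={"C": b"old-c", "E": b"old-e"})


first_write = {"composite_copies": [("A", "B")], "leaf_copies": [("C", "D")],
               "rewires": [("B", "C", "D")], "leaf_writes": [("D", b"first")]}
second_write = {"composite_copies": [("A", "B")], "leaf_copies": [("E", "F")],
                "rewires": [("B", "E", "F")], "leaf_writes": [("F", b"second")]}

leader = initial_graph()
for op in (first_write, second_write):
    leader.apply(op)

follower = initial_graph()
for op in (second_write, first_write):   # opposite order on the follower
    follower.apply(op)
```

In this sketch the leader and the follower end with equal graphs, with composite extent B referencing leaf extents D and F.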

［0564］論理エクステントの有向非巡回グラフを使用するストレージシステムの実施態様を所与として、論理エクステントの複製された有向非巡回グラフに基づいたポッドの回復が実施されてよい。具体的には、この例では、ポッドでの回復は、複製されたエクステントグラフに基づいてよく、次いでリーフ論理エクステントのコンテンツを回復することだけではなく、これらのグラフの一貫性を回復することも伴う。回復の本実施態様では、動作は、ポッドのためのすべてのストレージシステム全体で完了したと知られていないすべてのリーフ論理エクステントコンテンツ修正だけではなく、ポッドのためのすべての同期中のストレージシステムで完了したと知られていないグラフ変換に対しても問合せすることを含んでよい。係る問い合わせることは、なんらかの調整されたチェックポイント以降の動作に基づいているであろう、又は単に完了したと知られていない動作であり、各ストレージシステムは、完了したとしてまだ信号で知らせていない通常動作中の動作のリストを保つ。この例では、グラフ変換は簡単である。グラフ変換は新しい事柄を作成し、旧い事柄を新しい事柄にコピーし、旧い事柄を２つ以上の分割された新しい事柄にコピーする場合もあれば、それらは他のエクステントへのその参照を修正するために複合エクステントを修正する。任意の論理エクステントを作成又は置換する任意の同期中のストレージシステムで見つけられた任意の記憶されている動作記述は、その論理エクステントをまだ有していない任意の他のストレージシステムでコピーされ、実行される場合がある。関与されたリーフ論理エクステント又は複合論理エクステントが適切に回復されている限り、リーフ論理エクステント又は複合論理エクステントに対する修正を説明する動作は、それらの修正を、まだそれらを適用していなかった任意の同期中のストレージシステムに適用できる。 [0564] Given an implementation of a storage system that uses a directed acyclic graph of logical extents, recovery of a pod may be performed based on the replicated directed acyclic graph of logical extents. Specifically, in this example, recovery at a pod may be based on the replicated extent graph and then involve not only recovering the contents of leaf logical extents but also restoring consistency to these graphs. In this implementation of recovery, operations may include querying not only all leaf logical extent content modifications that are not known to be completed across all storage systems for the pod, but also graph transformations that are not known to be completed across all synchronizing storage systems for the pod. Such queries may be based on operations since some coordinated checkpoint, or simply operations that are not known to be completed; each storage system maintains a list of operations during normal operation that have not yet signaled as completed. In this example, the graph transformations are simple. 
A graph transformation may create new things, copy old things to new things, or copy old things into two or more split new things, or it may modify a composite extent to change its references to other extents. Any stored operation description found on any in-sync storage system that creates or replaces a logical extent can be copied to, and performed on, any other storage system that does not yet have that logical extent. Operations that describe modifications to leaf or composite logical extents can apply those modifications to any in-sync storage system that has not yet applied them, as long as the involved leaf or composite logical extents have been recovered properly.
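For further explanation, the recovery described above can be sketched in Python. This is a simplified illustration under stated assumptions: the dictionary layout of a storage system (`extents`, `applied`, `pending`), the `copy` and `modify` operation kinds, and the use of leader-assigned numeric ids as a valid processing order are all hypothetical and are not taken from the disclosure.

```python
def apply_operation(system, op):
    """Apply one stored operation description unless this system already applied it."""
    if op["id"] in system["applied"]:
        return
    if op["kind"] == "copy":
        # Create the new extent only if this system does not have it yet.
        system["extents"].setdefault(op["extent"], system["extents"][op["source"]])
    elif op["kind"] == "modify":
        system["extents"][op["extent"]] = op["content"]
    system["applied"].add(op["id"])


def recover(systems):
    """Bring every in-sync system up to date with operations not known to be complete."""
    # Query each system for the operations it has not yet signaled as completed.
    pending = {}
    for system in systems:
        pending.update(system["pending"])
    # Assumption: ascending ids create or replace an extent before modifying it.
    for op_id in sorted(pending):
        for system in systems:
            apply_operation(system, pending[op_id])
    for system in systems:
        system["pending"].clear()


copy_op = {"id": 1, "kind": "copy", "source": "C", "extent": "D"}
modify_op = {"id": 2, "kind": "modify", "extent": "D", "content": b"new"}

# System X applied both operations before the fault; system Y applied neither.
system_x = {"extents": {"C": b"old", "D": b"new"}, "applied": {1, 2},
            "pending": {1: copy_op, 2: modify_op}}
system_y = {"extents": {"C": b"old"}, "applied": set(), "pending": {}}

recover([system_x, system_y])
```

After `recover` runs, system Y has created extent D from C and applied the modification, so both systems hold the same extents.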

［0565］別の例では、論理エクステントグラフを使用する代替策として、ストレージは複製されたコンテンツアドレス指定可能ストアに基づいて実装されてよい。コンテンツアドレス指定可能ストアでは、データのブロックごとに（例えば、５１２バイト、４０９６バイト、８１９２バイトごとに、又は１６３８４バイトごとにも）、（フィンガープリントと呼ばれる場合もある）一意のハッシュ値が、ブロックコンテンツに基づいて計算され、これによりボリューム又はボリュームのエクステント範囲は、特定のハッシュ値を有するブロックに対する参照のリストとして記述できる。同じハッシュ値を有するブロックに対する参照に基づいた同期複製されたストレージシステム実施態様では、複製は、第１のストレージシステムがブロックを受け取り、それらのブロックのフィンガープリントを計算し、それらのフィンガープリントのためのブロックリファレンスを識別し、参照されたブロックに対するボリュームブロックのマッピングに対する更新として１つ又は複数の追加のストレージシステムに対する変更を送達することを伴う場合があるであろう。ブロックが第１のストレージシステムによってすでに記憶されていることがわかる場合、そのストレージシステムは（参照が同じハッシュ値を使用するため、又は参照のための識別子が同一であるか、容易にマッピングできるかのどちらかであるため）追加のストレージシステムのそれぞれで参照に名前を付けるためにその参照を使用できる。代わりに、ブロックが、第１のストレージシステムによって見つけられない場合、次いで第１のストレージシステムのコンテンツは、そのブロックコンテンツと関連付けられたハッシュ値又はアイデンティティとともに、動作記述の一部として他のストレージシステムに送達されてよい。さらに、各同期中のストレージシステムのボリューム記述は、次いで新しいブロックリファレンスで更新される。係るストアでの回復は、次いでボリュームのための最近更新されたブロックリファレンスを比較することを含んでよい。ブロックリファレンスがポッドのための異なる同期中のストレージシステム間で異なる場合、次いで各参照の１つのバージョンは、それらを一貫性があるものにするために他のストレージシステムにコピーされる場合がある。１つのシステム上のブロックリファレンスが存在しない場合、次いでそれはその参照のためのブロックを記憶するなんらかのストレージシステムからコピーされる。仮想コピー動作は、仮想コピー動作を実施することの一部として参照をコピーすることによって係るブロック又はハッシュリファレンスストアでサポートされる場合がある。 [0565] In another example, as an alternative to using a logical extent graph, storage may be implemented based on a replicated content-addressable store. In a content-addressable store, for each block of data (e.g., every 512 bytes, 4096 bytes, 8192 bytes, or even every 16384 bytes), a unique hash value (sometimes called a fingerprint) is calculated based on the block contents, allowing a volume or volume extent range to be described as a list of references to blocks with particular hash values. 
In a synchronously replicated storage system implementation based on references to blocks with the same hash value, replication might involve a first storage system receiving blocks, calculating fingerprints for those blocks, identifying block references for those fingerprints, and delivering the changes to one or more additional storage systems as updates to the mapping of volume blocks to referenced blocks. If a block is found to be already stored by the first storage system, that storage system can use its reference to name the reference on each of the additional storage systems (either because the reference uses the same hash value, or because the identifier for the reference is identical or can be mapped readily). Alternatively, if the block is not found by the first storage system, then the block content held by the first storage system may be delivered to the other storage systems as part of the operation description, along with the hash value or identity associated with that block content. The volume description on each in-sync storage system is then updated with the new block references. Recovery in such a store may then include comparing recently updated block references for a volume. If block references differ between different in-sync storage systems for a pod, then one version of each reference can be copied to the other storage systems to make them consistent. If a block reference does not exist on one system, then it is copied from some storage system that does store a block for that reference. Virtual copy operations can be supported in such a block or hash reference store by copying the references as part of implementing the virtual copy operation.
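For further explanation, replication over a content-addressable store can be sketched in Python as follows. This is a minimal sketch under stated assumptions: SHA-256 stands in for the unspecified hash function, and the `CasSystem` class, the `replicate_write` function, and the layout of the update are hypothetical names introduced for illustration. The sketch also assumes the additional systems are in sync, so a block known to the first system is known to them as well.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Hash of the block content (SHA-256 is an assumption, not from the disclosure)."""
    return hashlib.sha256(block).hexdigest()


class CasSystem:
    """Hypothetical storage system holding blocks by fingerprint plus a volume mapping."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block content
        self.volume = {}   # volume block index -> fingerprint

    def apply(self, update):
        for fp, content in update["new_blocks"].items():
            self.blocks.setdefault(fp, content)
        self.volume.update(update["mapping"])


def replicate_write(first, others, index, block):
    """Receive a block on the first system and deliver the change to the others."""
    fp = fingerprint(block)
    update = {"mapping": {index: fp}, "new_blocks": {}}
    if fp not in first.blocks:
        # Block not found: ship the content together with its hash value.
        update["new_blocks"][fp] = block
    for system in [first, *others]:
        system.apply(update)
    return update


first, second = CasSystem(), CasSystem()
block = b"x" * 4096
update_1 = replicate_write(first, [second], 0, block)
update_2 = replicate_write(first, [second], 7, block)  # same content, different address
```

In this sketch the second write delivers only a mapping update, because the block is already stored under the same fingerprint.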

［0566］追加の説明のために、図４２Ａは、本開示のいくつかの実施形態に従ってデータセットを同期複製するストレージシステムの間でメタデータを同期するための例の方法を示すフローチャートを説明する。あまり詳細には示されないが、図４２Ａに示されるストレージシステム（４２００Ａ）は、図１Ａから図１Ｄ、図２Ａから図２Ｇ、図３Ａから図３Ｃ、又はその任意の組合せに関して上述されたストレージシステムに類似してよい。実際には、図４２Ａに示されるストレージシステム（４２００Ａ）は、上述されたストレージシステムと同じ構成要素、より少ない構成要素、追加の構成要素を含んでよい。 [0566] For further explanation, FIG. 42A sets forth a flowchart illustrating an example method for synchronizing metadata between storage systems that synchronously replicate datasets in accordance with some embodiments of the present disclosure. While not shown in great detail, the storage system (4200A) shown in FIG. 42A may be similar to the storage systems described above with respect to FIGS. 1A-1D, 2A-2G, 3A-3C, or any combination thereof. In practice, the storage system (4200A) shown in FIG. 42A may include the same components as the storage systems described above, fewer components, or additional components.

[0567] As described above, metadata may be synchronized among storage systems that are synchronously replicating a dataset. Such metadata may be referred to as common metadata, or shared metadata, that is stored by a storage system on behalf of a pod and that relates to the mapping of segments of content stored within the pod to virtual addresses within storage objects within the pod. Information related to those mappings is synchronized among the member storage systems for the pod to ensure correct behavior, or better performance, for storage operations related to the pod. In some examples, a storage object may implement a volume or a snapshot. The synchronized metadata may include: (a) information for keeping volume content mappings synchronized between the storage systems in the pod; (b) tracking data for recovery checkpoints or for in-progress write operations; and (c) information related to the delivery of data and mapping information to a remote storage system for asynchronous or periodic replication.

［0568］ポッド内のストレージシステムの間でコンテンツマッピングを同期させたままにするための情報は、スナップショットの効率的な作成を可能にしてよく、このことは同様にしてその後続の更新、スナップショットのコピーを可能にする、又はスナップショット削除は、ポッドメンバーストレージシステム全体で効率的且つ一貫して実行されてよい。 [0568] Information for keeping content mappings synchronized between storage systems within a pod may enable efficient creation of snapshots, which in turn may enable subsequent updates, snapshot copies, or snapshot deletions to be performed efficiently and consistently across pod member storage systems.

[0569] Tracking data for recovery checkpoints or for in-progress write operations may enable efficient crash recovery and efficient detection of content or volume mappings that may have been partially or completely applied on individual storage systems for a pod, but that may not have been completely applied on other storage systems for the pod.

［0570］非同期複製又周期複製のためのリモートストレージシステムへのデータ及びマッピング情報の送達に関係する情報は、ポッドのための２つ以上のメンバーストレージシステムが、非同期複製又は周期複製を駆動するために使用されるメタデータをマッピングし、識別する際の不一致に対処するための最小の懸念で、複製されたポッドコンテンツのためのサーバとしての機能を果たすことを可能にしてよい。 [0570] Information related to the delivery of data and mapping information to a remote storage system for asynchronous or periodic replication may enable two or more member storage systems for a pod to act as servers for replicated pod content with minimal concern for addressing discrepancies in mapping and identifying metadata used to drive asynchronous or periodic replication.

[0571] In some examples, shared metadata may include descriptions of, or indications of, a named grouping, or identifiers for, one or more volumes or one or more storage objects that are a subset of the entire synchronously replicated dataset for a pod. Such a grouping of volumes or storage objects of a dataset may be referred to as a consistency group. A consistency group may be defined to specify a subset of volumes or storage objects of the dataset to be used for consistent snapshots, asynchronous replication, or periodic replication. In some examples, a consistency group may be calculated dynamically, for example by including all volumes connected to a particular set of hosts or host network ports, or connected to a particular set of applications, virtual machines, or containers, where the applications, virtual machines, or containers may operate on external server systems or on one or more of the storage systems that are members of the pod. In other examples, a consistency group may be defined according to user selections of a type of data or set of data, or according to specifications of a consistency group similar to the dynamic calculation, where a user may specify, for example through a command or management console, that a particular, or named, consistency group be created to include all volumes connected to a particular set of hosts or host network ports, or to include data for a particular set of applications, virtual machines, or containers.
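For further explanation, a dynamically calculated consistency group of the kind described above can be sketched in Python. The function name `compute_consistency_group`, the `connections` mapping, and the host and volume names are hypothetical and serve only as an illustration.

```python
def compute_consistency_group(connections, hosts):
    """Return the volumes connected to at least one host in the given set.

    connections: dict mapping a volume name to the set of hosts it is connected to.
    hosts: the particular set of hosts that defines the consistency group.
    """
    return sorted(volume for volume, connected in connections.items()
                  if connected & hosts)


connections = {
    "vol1": {"hostA"},
    "vol2": {"hostB"},
    "vol3": {"hostA", "hostC"},
}
group = compute_consistency_group(connections, {"hostA"})
```

Because membership is recomputed from the current connections, connecting or disconnecting a volume changes the group without any explicit edit to the group definition.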

［0572］一貫性グループを使用する例では、一貫性グループの第１の一貫性グループスナップショットは、第１のデータセットスナップショット時の一貫性グループのメンバーであるすべてのボリューム又は他のストレージオブジェクトのためのスナップショットの第１の集合を含んでよく、同じ一貫性グループの第２の一貫性グループスナップショットは、第２のデータセットスナップショット時での一貫性グループのメンバーであるボリューム又は他のストレージオブジェクトのためのスナップショットの第２の集合を含む。他の例では、データセットのスナップショットは、非同期で１つ以上のターゲットストレージシステムに記憶されてよい。同様に、一貫性グループの非同期複製は、一貫性グループのメンバーボリューム及び他のストレージオブジェクトに対する動的変更を説明してよく、非同期複製リンクのソース又はターゲットのどちらかでの一貫性グループの一貫性グループスナップショットは、データセットスナップショットが関係するときの一貫性グループの関係でメンバーであるボリューム及び他のストレージオブジェクトを含む。非同期複製接続のターゲットの場合、データセットスナップショットが関係する時間は、それが受け取られ、ターゲットでの一貫性グループスナップショットのときのプロセスにあったときに送信者の動的データセットに依存する。例えば、非同期複製のターゲットが、例えば２０００動作後方にあり、それらの動作のいくつかが一貫性グループメンバーの変更であり、係る変更の第１の集合がソースにとって２０００を超える動作前であり、変更の第２の集合が最後の２０００の範囲内である場合、次いでターゲットでのその時点での一貫性グループスナップショットは、メンバー変更の第１の集合を説明し、変更の第２の集合を説明しない。非同期複製のターゲットの他の使用は、それらの使用のためのボリューム又は他のストレージオブジェクト（及びそのコンテンツ）を決定する際に一貫性グループのためのデータセットの時点の性質を同様に説明してよい。例えば、非同期複製が２０００動作後方である同じ場合では、災害復旧フェイルオーバのためのターゲットの使用は、それらがソースで２０００動作前であったときのボリューム及び他のストレージオブジェクト（及びそのコンテンツ）を含むデータセットから開始する可能性がある。この説明では、ソースで同時に起こる動作（例えば、書込み、ストレージオブジェクトの作成又は削除、ボリューム又は他のストレージオブジェクト又は一貫性データからの他のデータの包含又は除外に影響を及ぼすプロパティに対する変更、又は進行中であり、同じ時点で完了したとして信号で知らされていなかった他の動作）は、単一の明確に定義された順序付けを有さない可能性があるため、動作のカウントはソースで同時に起こる動作の任意の可能にされた順序付けに基づいたなんらかの妥当な順序付けを表す必要があるにすぎない。 [0572] In examples using consistency groups, a first consistency group snapshot of a consistency group may include a first set of snapshots for all volumes or other storage objects that are members of the consistency group at the time of the first dataset snapshot, and a second consistency group snapshot of the same consistency group may include a second set of snapshots for volumes or other storage objects that are members of the consistency group at the time of the second dataset snapshot. In other examples, dataset snapshots may be stored asynchronously on one or more target storage systems. 
Similarly, asynchronous replication of a consistency group may account for dynamic changes to member volumes and other storage objects of a consistency group, and a consistency group snapshot of a consistency group at either the source or target of an asynchronous replication link includes the volumes and other storage objects that are members in the consistency group relationship at the time the dataset snapshot pertains. In the case of a target of an asynchronous replication connection, the time to which a dataset snapshot pertains depends on the sender's dynamic dataset when it was received and was in process at the time of the consistency group snapshot at the target. For example, if the target of asynchronous replication is, say, 2000 operations behind, some of those operations being consistency group member changes, with a first set of such changes being more than 2000 operations ago to the source, and a second set of changes being within the last 2000, then a consistency group snapshot at that point in time at the target will account for the first set of member changes and not the second set of changes. Other uses of the target of asynchronous replication may similarly account for the point-in-time nature of the dataset for the consistency group in determining the volumes or other storage objects (and their contents) for their use. For example, in the same case where the asynchronous replication is 2000 operations behind, use of the target for disaster recovery failover could start with a dataset containing the volumes and other storage objects (and their contents) as they were at the source 2000 operations ago. 
For the purposes of this description, concurrent operations at the source (e.g., writes, storage object creations or deletions, changes to properties that affect the inclusion or exclusion of volumes, other storage objects, or other data from a consistency group, or other operations that were in progress and not signaled as completed at the same point in time) may not have a single well-defined ordering, so the count of operations only needs to represent some plausible ordering based on any allowed ordering of concurrent operations at the source.
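For further explanation, the effect of replication lag on consistency group membership can be sketched in Python. The operation log format, the function `membership_after`, and the small lag of two operations (in place of the 2000 operations used in the example above) are assumptions chosen to keep the illustration short.

```python
def membership_after(ops, count):
    """Replay the first `count` operations and return the resulting group members."""
    members = set()
    for kind, volume in ops[:count]:
        if kind == "add":
            members.add(volume)
        elif kind == "remove":
            members.discard(volume)
        # Other operations (e.g. writes) do not change membership.
    return members


# One plausible ordering of operations at the source.
source_log = [
    ("add", "vol1"),
    ("add", "vol2"),
    ("write", "vol1"),
    ("remove", "vol2"),   # within the lag window
    ("add", "vol3"),      # within the lag window
]
lag = 2  # the target is two operations behind the source

source_members = membership_after(source_log, len(source_log))
target_members = membership_after(source_log, len(source_log) - lag)
```

A consistency group snapshot taken at the target at this point reflects the membership changes older than the lag window and none of the changes inside it.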

[0573] As another example of using consistency groups, in the case of periodic replication based on replicating consistency group snapshots, each replicated consistency group snapshot would include the volumes and other storage objects that were members at the time each consistency group snapshot was formed on the source. Keeping consistency group membership consistent through common, or shared, metadata ensures that a fault, or any other event that may cause the source of replication or the system that forms a dataset snapshot to switch from one storage system in a pod to another, does not lose information needed to properly handle those consistency group snapshots or the consistency group replication. Further, this type of handling may allow multiple storage systems that are members of a pod to concurrently serve as source systems for asynchronous or periodic replication.

［0574］さらに、セグメントのストレージオブジェクトへのマッピングを記述する同期されたメタデータは、マッピング自体に限定されず、他のストレージシステム情報の中で、例えばシーケンス番号（又は記憶されたデータを識別するためのなんらかの他の値）、タイムスタンプ、ボリューム／スナップショット関係、チェックポイントアイデンティティ、階層を定義するツリー若しくはグラフ、又はマッピング関係の有向グラフ等の追加の情報を含んでよい。 [0574] Additionally, the synchronized metadata describing the mapping of segments to storage objects is not limited to the mapping itself, but may include additional information such as, for example, a sequence number (or some other value for identifying the stored data), a timestamp, a volume/snapshot relationship, a checkpoint identity, a tree or graph defining a hierarchy, or a directed graph of the mapping relationship, among other storage system information.

［0575］図４２Ａに示されるように、データセット（４２５８）を同期複製している複数のストレージシステム（４２００Ａ～４２００Ｎ）は、ポッドのための同期中リスト内のそれぞれの他のストレージシステム（４２００Ｂ～４２００Ｎ）と通信してよい－ストレージシステムは、実行するためのＩ／Ｏ動作を記述するメタデータ及び個々のストレージシステムに記憶されたデータセット（４２５８）のそれぞれのローカルメタデータ表現に加えられる更新を記述するメタデータを交換してよい。さらに、各ストレージシステム（４２００Ａ、４２００Ｂ、．．．４２００Ｎ）は、ストレージオブジェクト（４２５６、４２６０．．．４２６２）のそれぞれのバージョンを記憶してよい。 [0575] As shown in FIG. 42A, multiple storage systems (4200A-4200N) synchronously replicating a dataset (4258) may communicate with each other storage system (4200B-4200N) in the in-sync list for the pod—the storage systems may exchange metadata describing I/O operations to perform and updates to be made to each local metadata representation of the dataset (4258) stored on the individual storage system. Additionally, each storage system (4200A, 4200B, ... 4200N) may store a respective version of the storage object (4256, 4260...4262).

［0576］図４２Ａに示される例の方法は、ストレージシステム（４２００Ａ～４２００Ｎ）の第１のストレージシステム（４２００Ａ）で、データセット（４２５８）に向けられたＩ／Ｏ動作（４２５２）を受け取ること（４２０２）を含む。ストレージシステム（４２００Ａ～４２００Ｎ）の第１のストレージシステム（４２００Ａ）で、データセット（４２５８）に向けられたＩ／Ｏ動作（４２５２）を受け取ること（４２０２）は、例えばストレージエリアネットワーク（１５８）、インターネット、又はホストコンピュータ（４２５１）がその全体でストレージシステム（４２００Ａ）と通信してよい任意のコンピュータネットワーク等のネットワーク全体でパケット又はデータをトランスポートするための１つ以上の通信プロトコルを使用することによって実施されてよい。この例では、ストレージシステム（４２００Ａ）は、例えばＳＣＳＩポート等のネットワークポートで受け取られたＩ／Ｏ動作（４２５２）を受け取ってよく、Ｉ／Ｏ動作（４２５２）は、ポッド内のストレージシステム（４２００Ａ～４２００Ｎ）全体で同期複製されているデータセット（４２５８）の一部であるメモリ場所に向けられる書込みコマンドである。 [0576] The example method shown in FIG. 42A includes receiving (4202) at a first storage system (4200A) of storage systems (4200A-4200N) an I/O operation (4252) directed to a dataset (4258). Receiving (4202) at a first storage system (4200A) of storage systems (4200A-4200N) the I/O operation (4252) directed to the dataset (4258) may be performed using one or more communication protocols for transporting packets or data across a network, such as, for example, a storage area network (158), the Internet, or any computer network through which a host computer (4251) may communicate with storage system (4200A). In this example, storage system (4200A) may receive an I/O operation (4252) received at a network port, such as a SCSI port, where the I/O operation (4252) is a write command directed to a memory location that is part of a dataset (4258) that is synchronously replicated across storage systems (4200A-4200N) in the pod.

[0577] The example method depicted in FIG. 42A also includes determining (4204), in accordance with the I/O operation (4252), a metadata update (4254) describing a mapping of segments of content to virtual addresses within a storage object (4256), where the storage object (4256) includes the dataset (4258). Determining (4204) the metadata update (4254) may be implemented by determining or identifying information as described above with regard to the content of the metadata to be synchronized across the storage systems (4200A-4200N) of the pod. Information from the I/O operation (4252) may also be included in the metadata update (4254), such as a logical, or virtual, address, a payload size, and other information such as deduplication information describing how the I/O operation (4252) payload is included or incorporated with respect to data previously stored within the dataset (4258).

[0578] The example method depicted in FIG. 42A also includes synchronizing (4206) metadata on a second storage system (4200B) of the storage systems (4200A-4200N) by sending the metadata update (4254) to the second storage system (4200B) to update a metadata representation on the second storage system in accordance with the metadata update (4254). Synchronizing (4206) the metadata in this way may be implemented by using one or more network ports, across one or more communication networks (not shown), to send the metadata update (4254) to each of the other storage systems (4200B-4200N) in the pod, where each of the other storage systems (4200B-4200N) may receive the metadata update (4254) and use it to update its respective local metadata representation of the synchronized dataset (4258). After each storage system (4200B-4200N) has received and processed the metadata update (4254), the metadata corresponding to the synchronized dataset (4258) is synchronized across all of the systems.
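For further explanation, the receive, determine, and synchronize steps of FIG. 42A can be sketched in Python. The class `StorageSystem`, the functions `determine_metadata_update` and `handle_io`, the field names, and the segment naming scheme are hypothetical, and network transport is replaced by direct method calls. Applying the update on the first system as well is an assumption of this sketch.

```python
class StorageSystem:
    """Hypothetical pod member holding a local metadata representation."""

    def __init__(self, name):
        self.name = name
        # (storage object, virtual address) -> segment identifier
        self.metadata = {}

    def apply_metadata_update(self, update):
        key = (update["object"], update["virtual_address"])
        self.metadata[key] = update["segment"]


def determine_metadata_update(io_op):
    """Derive a metadata update from a write directed to the dataset."""
    return {
        "object": io_op["object"],
        "virtual_address": io_op["virtual_address"],
        "payload_size": len(io_op["payload"]),
        # Illustrative segment identifier; a real system would name stored content.
        "segment": f"seg:{io_op['object']}:{io_op['virtual_address']}",
    }


def handle_io(first, others, io_op):
    """Receive the I/O on the first system, then send the update to each other member."""
    update = determine_metadata_update(io_op)
    first.apply_metadata_update(update)
    for system in others:
        system.apply_metadata_update(update)   # stands in for a network send
    return update


system_a, system_b, system_n = (StorageSystem(n) for n in ("A", "B", "N"))
write = {"object": "volume-1", "virtual_address": 4096, "payload": b"\x00" * 512}
sent = handle_io(system_a, [system_b, system_n], write)
```

After `handle_io` returns, each member holds the same mapping for the written virtual address.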

[0579] For further explanation, FIG. 42B sets forth a flowchart illustrating an example method for synchronizing metadata among storage systems synchronously replicating a dataset according to some embodiments of the present disclosure. The example method depicted in FIG. 42B is similar to the example method depicted in FIG. 42A, as the example method depicted in FIG. 42B also includes: receiving (4202), at a first storage system (4200A) of the storage systems (4200A-4200N), an I/O operation (4252) directed to a dataset (4258); determining (4204), in accordance with the I/O operation (4252), a metadata update (4254) describing a mapping of segments of content to a virtual address within a storage object (4256), where the storage object (4256) includes the dataset (4258); and synchronizing (4206) metadata on a second storage system (4200B) of the storage systems (4200A-4200N) by sending the metadata update (4254) to the second storage system (4200B) so that the metadata representation on the second storage system is updated in accordance with the metadata update (4254).

[0580] However, the example method depicted in FIG. 42B further includes: applying (4288) the I/O operation (4252) to the dataset (4258) on the first storage system (4200A); in response to successfully applying the I/O operation (4252) on the first storage system (4200A), updating (4290), on the first storage system (4200A) and in accordance with the metadata update (4254), the version of the metadata that corresponded to the dataset (4258) prior to applying the I/O operation (4252); and determining (4292) predicate metadata describing an ordering of the I/O operation (4252) with respect to one or more other I/O operations.

[0581] Applying (4288) the I/O operation (4252) to the dataset (4258) on the first storage system (4200A) may be implemented by a controller of the storage system (4200A) as described above with reference to FIG. 1, which describes a controller carrying out write operations using one or more memory components of the storage system (4200A), such as NVRAM, and persistent storage, such as flash memory or any type of solid-state non-volatile memory.

[0582] Updating (4290), on the first storage system (4200A) and in accordance with the metadata update (4254), the version of the metadata that corresponded to the dataset (4258) prior to applying the I/O operation (4252) may be implemented by identifying a portion of the metadata representation for the corresponding storage object (4256), that is, the source volume, that stores the dataset (4258), where the portion may in some cases be the entire source volume. The portion of the metadata representation for the storage object (4256) may be identified by using the memory address data for the I/O operation (4252) to traverse the structured collection of metadata objects described above and find the nodes that correspond to the data objects at the memory addresses of the I/O operation (4252). Further, a metadata object root node may be created for the metadata representation that references one or more nodes within the metadata representation for the storage object (4256); the metadata object root node may specify the portion of one or more nodes within the metadata representation of the entire storage object (4256), or may specify some other indication for referencing only the portion of the metadata representation for the entire storage object (4256) that corresponds to the I/O operation (4252). In this way, the metadata representation of the dataset reflects the successful application of the I/O operation (4252).

[0583] Determining (4292) predicate metadata describing an ordering of the I/O operation (4252) with respect to one or more other I/O operations may be implemented by tracking each received I/O operation and determining whether any dependencies exist among the I/O operations. Once such I/O operations have been identified, any ordering consistency issues should be resolved just as they would be during normal run-time, using techniques such as leader-defined ordering, predicates, or interlock exceptions. Interlock exceptions are described in Application Serial No. 15/696,418, which is incorporated herein by reference in its entirety. With regard to predicates, the relationship between operations and common metadata updates may be described as a set of interdependencies between separate modifying operations. These interdependencies may be described as a set of precursors that one operation depends on in some way, where the set of precursors may be considered predicates that must be true for the operation to complete. Further, predicates may not need to be protected if they are used to propagate restrictions on concurrency between leaders and followers and if those predicates drive the order in which the storage systems persist information, because the persisted information then implies the various plausible outcomes.
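One way to picture predicates as a set of precursors is the following sketch. It assumes, purely for illustration, that a dependency exists whenever two writes overlap in address range; the disclosure does not limit dependencies to this case, and the names `Write`, `determine_predicates`, and `can_complete` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Write:
    op_id: int
    offset: int
    length: int


def overlaps(a: Write, b: Write) -> bool:
    """True when the two writes touch at least one common address."""
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


def determine_predicates(new_op: Write, in_flight: Iterable[Write]) -> FrozenSet[int]:
    """Return the ids of earlier operations the new operation depends on (its precursors)."""
    return frozenset(op.op_id for op in in_flight if overlaps(op, new_op))


def can_complete(predicates: FrozenSet[int], completed: Set[int]) -> bool:
    """An operation may complete only once every precursor has completed."""
    return predicates <= completed
```

A follower that receives the predicate set alongside an operation can apply independent operations concurrently and delay only those whose precursors are still outstanding.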

[0584] For further explanation, FIG. 43 sets forth a flowchart illustrating an example method for determining active membership among storage systems synchronously replicating a dataset according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (4300A-4300N) depicted in FIG. 43 may be similar to the storage systems described above with reference to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3C, or any combination thereof. In fact, the storage systems (4300A-4300N) depicted in FIG. 43 may include the same components as the storage systems described above, fewer components, or additional components.

[0585] In the following examples, determining how to recover from an error among storage systems synchronously replicating a dataset may include determining whether to engage a quorum protocol. As described below, determining active membership within a storage system cluster or pod may overcome an error, such as a communication fault or a storage device failure, by determining a subset of the storage systems that continues to service I/O operations directed to the dataset, while another subset of the storage systems does not continue to service I/O operations directed to the dataset. In this way, through either mediation or a quorum policy, one or more storage systems are determined to control the history of I/O operations that modify the synchronously replicated dataset as storage systems are added to or removed from the in-sync list of storage systems.

[0586] The process of engaging a mediation service in response to an error such as a communication fault between storage systems, where the storage systems may be configured to store a handle that indicates a contact address across a wide area network and a cryptographically secure token that can be used to manage a pool of keys for mediation, is described in greater detail in Application Serial No. 14/703,559, which is incorporated herein by reference in its entirety. Also described in Application Serial No. 15/703,559 is the use of various quorum protocols to determine which storage systems, among a set of storage systems replicating a dataset, should continue to service I/O requests directed to the dataset.

[0587] While Application Serial No. 14/703,559 describes implementations for mediation and quorum protocols, the focus of the present disclosure is an analysis for determining which technique to perform in order to determine active membership in a storage system cluster. For example, in some circumstances it is possible for one or more storage systems to engage in mediation against another one or more storage systems and win mediation, even though storage system performance for servicing I/O operations would have been better, according to one or more data storage metrics or performance metrics, had the other one or more storage systems won mediation. In other words, it is not necessarily beneficial for the one or more systems that win mediation to have won it. Described below are techniques for avoiding circumstances in which one or more storage systems that should not win mediation do win mediation. These techniques include assigning a greater number of votes to storage systems that are demonstrating higher performance, that are more closely connected to particular host systems, or that have other comparatively better metrics, relative to the number of votes assigned to storage systems whose metrics are worse.

[0588] In some examples, to determine active membership in a storage system cluster, the default technique for determining a set of one or more storage systems that continue to service I/O requests directed to a dataset may be to use a quorum protocol, unless the storage system (4300A) is able to determine, or prove, that use of the quorum protocol would be unable to establish a quorum for determining that set. In other words, in response to an error, such as a communication fault between storage devices in the storage system (4300A), a controller of the storage system (4300A) may determine whether a quorum can be established. If a quorum can be established under the particular quorum protocol, the quorum protocol is used to determine active, or in-sync, membership in the storage system cluster; otherwise, the storage system (4300A) may engage in mediation to determine active, or in-sync, membership in the storage system cluster. In a simple case, if two storage systems are members of the in-sync list, no quorum analysis is performed in response to a communication fault, because neither storage system can form a quorum on its own. In some examples, a given storage system, or a set of storage systems in communication with each other, may rely on a quorum policy to detach one or more storage systems with which it is not communicatively coupled, provided it can rule out the possibility that another set of storage systems could form a quorum, based on a comparison of the votes for each set of storage systems. In other words, if the one or more storage systems that are in communication with each other are able to form a quorum, the one or more storage systems that are not in communication may be detached using the quorum policy without resorting to mediation.

[0589] In some examples, a controller for the storage system (4300A) may determine whether a quorum can be established by determining the set of storage systems that are in communication with each other. This also serves to determine the set of storage systems with which they are not in communication, based on the storage system (4300A) referencing the in-sync list of storage systems (4300B-4300N) included in the complete set of storage systems (4300A-4300N) synchronously replicating the dataset. Further, the storage system (4300A) may reference a stored indication of the number of votes that each storage system, among the set of all storage systems or the last known set of in-sync storage systems (4300A-4300N), has with respect to a given quorum protocol. In this way, given a determined first set of storage systems that are in communication with each other, together with the respective votes corresponding to each storage system of the first set, and a determined second set of storage systems that are not in communication with the storage system (4300A) performing the analysis, together with the respective votes corresponding to each storage system of the second set, the storage system (4300A) may determine whether the storage systems of the first set have sufficient votes to establish a quorum and whether the storage systems of the second set could possibly have sufficient votes to establish a quorum.
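The vote comparison described above can be sketched as follows. The quorum rule used here, a strict majority of the votes held by the last known in-sync list, is an assumption made for illustration; the disclosure leaves the particular quorum protocol open. The function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def partition_votes(votes: Dict[str, int], in_sync: Set[str],
                    reachable: Set[str]) -> Tuple[int, int]:
    """Return (votes of the first set, votes of the second set).

    The first set holds in-sync systems that are in communication with the
    analyzing system (which counts itself as reachable); the second set holds
    the remaining in-sync systems.
    """
    first = in_sync & reachable
    second = in_sync - reachable
    first_votes = sum(votes.get(s, 0) for s in first)
    second_votes = sum(votes.get(s, 0) for s in second)
    return first_votes, second_votes


def can_form_quorum(set_votes: int, total_votes: int) -> bool:
    """Assumed rule: a quorum needs strictly more than half of all votes."""
    return 2 * set_votes > total_votes
```

For instance, with votes of 2, 1, 1, and 1 for four in-sync systems, a first set holding the 2-vote system plus one other has 3 of 5 votes and can form a quorum, while the second set, with 2 of 5, cannot.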

[0590] In other words, the storage system (4300A) that detects and responds to a communication fault may determine (1) whether the storage systems that are not in communication could possibly form a quorum, (2) whether the storage systems that remain in communication could possibly form a quorum, and/or (3) whether any storage system that is not in communication could possibly determine whether the storage systems that are in communication could form a quorum. These determinations depend at least on which set of storage systems remains in communication, which set of storage systems is not in communication, and the respective votes corresponding to each storage system within each set.

[0591] Further, by ensuring that another storage system or another set of storage systems would be unable to form a quorum, the storage system (4300A) ensures that, if it wins mediation together with the one or more other storage systems in communication with it, the other storage system or other set of storage systems would be able to produce a version of the synchronously replicated dataset such that there are no inconsistencies in the dataset when one or more of those storage systems come to be resynchronized with the storage systems that won mediation.

[0592] As one example, prior to some system fault, there may be an even number of votes associated with the storage systems that belong to the in-sync list as members of a pod. In this example, if a first set of in-sync pod member storage systems are in communication with each other, and the first set of storage systems accounts for exactly half of the votes for establishing a quorum, then any other set of storage systems that may have been in communication with each other, but not with the first set of storage systems, would be unable to hold more than half of the votes needed to establish a quorum. In this example, a storage system included in the first set may determine that neither the first set of storage systems nor the other set of storage systems could possibly form a quorum, and in response the storage system may initiate mediation.

[0593] As another example, the determination of whether a quorum can be established by the communicating or non-communicating systems may be made repeatedly in response to multiple respective fault events, and a different response, such as engaging in mediation or in quorum voting, may be carried out for each fault event. For example, prior to a fault such as a communication fault, a set of storage systems may be in-sync members of a pod. In this example, a particular storage system holding one vote for establishing a quorum may lose communication with the other members of the pod; in response, the other members of the pod, having sufficient votes to establish a quorum, remove the particular storage system by voting. After this quorum voting phase to remove the particular storage system is complete, the in-sync members of the pod are the previously in-sync systems minus the particular storage system, resulting in an in-sync member list of storage systems holding a total of four votes for establishing a quorum. Continuing this example, if a further fault leaves a storage system in a set of storage systems holding a total of two votes, as described above, the storage system may determine that a quorum is not possible and may initiate mediation.
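The two-stage example can be traced with a small sketch. It assumes a pod that starts with five systems holding one vote each (the starting count is an assumption consistent with the four votes remaining after one removal) and a strict-majority quorum rule; `respond_to_fault` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def respond_to_fault(in_sync_votes: Dict[str, int],
                     reachable: Set[str]) -> Tuple[str, Dict[str, int]]:
    """Return the response to one fault event and the resulting in-sync votes.

    Assumed rule: a quorum needs strictly more than half of the in-sync votes.
    """
    total = sum(in_sync_votes.values())
    mine = sum(v for s, v in in_sync_votes.items() if s in reachable)
    if 2 * mine > total:
        # Quorum voting removes the unreachable systems from the in-sync list.
        survivors = {s: v for s, v in in_sync_votes.items() if s in reachable}
        return "quorum", survivors
    if 2 * (total - mine) > total:
        return "other-side-may-have-quorum", in_sync_votes
    # Neither side can reach a majority, so mediation is initiated.
    return "mediate", in_sync_votes


pod = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}
# First fault: E loses contact; the remaining four hold 4 of 5 votes.
first_action, pod_after = respond_to_fault(pod, {"A", "B", "C", "D"})
# Second fault: the pod splits two and two; 2 of 4 votes is not a majority.
second_action, _ = respond_to_fault(pod_after, {"A", "B"})
```

Under these assumptions the first fault is resolved by quorum voting and the second by mediation, mirroring the different responses to successive fault events described above.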

[0594] In some examples, the storage systems that are in-sync members of a pod may be assigned different respective numbers of votes, including zero votes. For example, for a set of storage systems that are in-sync members of a pod, different vote distributions include (a) all storage systems having a single vote, (b) some storage systems having multiple votes and some having a single vote, (c) some storage systems having multiple votes and some having zero votes, or (d) each storage system having a different number of votes. In other words, in general, any given storage system that is an in-sync member of a pod may be assigned any number of votes, including zero. One example of a storage system with zero votes may arise during migration of a dataset from a source storage system that is a member of a pod to a target storage system that is not yet in the pod. Before the migration is complete, the source storage system controls its one or more votes and the target storage system controls no votes; after the migration is complete, the target storage system may be given control of one or more votes, and the source storage system may be updated so that it no longer controls or holds any votes.
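The vote handoff at the end of a migration might look like the sketch below. Transferring all of the source's votes to the target in a single step is only one possible policy and is assumed here for illustration; `transfer_votes` is a hypothetical helper.

```python
from typing import Dict


def transfer_votes(votes: Dict[str, int], source: str, target: str) -> Dict[str, int]:
    """Return a new vote table in which the target holds the source's votes.

    The input table is left unmodified so the pre-migration state stays available.
    """
    updated = dict(votes)
    updated[target] = updated.get(target, 0) + updated.get(source, 0)
    updated[source] = 0
    return updated


# During migration the target is present but holds zero votes.
before = {"source": 2, "peer": 1, "target": 0}
after = transfer_votes(before, "source", "target")
```

Because the total number of votes is unchanged, the threshold for a majority-style quorum stays the same before and after the handoff.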

[0595] In some examples, the in-sync list may be established among a set of storage systems as additional storage systems are added to, or detached from, a pod. Each storage system may maintain metadata indicating the members of the pod, and the in-sync list further indicates the status of each member of the pod. When a storage system is added to a pod, the added storage system may be given, by an existing storage system in the pod, a mediation handle for contacting the mediation service. Further, as changes are made to the pod, the in-sync list is updated to reflect the current membership of storage systems within the current pod. Additional discussion of pod definition and management can be found in Application Serial Nos. 62/470,172 and 62/518,071, which are incorporated herein by reference in their entirety. Further, as a pod is stretched, or expanded, to include multiple storage systems, the storage systems in the pod may be configured to request mediation from a particular mediation service. In this way, each storage system in the pod requests mediation from the same mediation service when mediation is determined to be the response to a given error. Configuration of storage systems for accessing a mediation service is described further herein.

[0596] As depicted in FIG. 43, the multiple storage systems (4300A-4300N) synchronously replicating a dataset (4352) may communicate with each other storage system and with a mediation service (4301) over one or more networks (not shown). The mediation service (4301) may determine which storage systems continue to service the dataset in the event of a communication fault between storage systems, in the event of a storage system going offline, or due to some other triggering event. In general, any number of storage systems may be part of the in-sync list synchronously replicating the dataset (4352).

[0597] The example method depicted in FIG. 43 includes detecting (4302), by a particular storage system (4300A) of the storage systems (4300A-4300N), a communication fault involving loss of communication with one or more of the storage systems (4300B-4300N), where the particular storage system (4300A) is configured to request mediation from a mediation target, such as the mediation service (4301). Detecting (4302) the communication fault may be implemented using several techniques, including a controller of the particular storage system (4300A) not receiving a communication from another storage system (4300B-4300N) over a communication link (4354) or channel within some period of time. In another example, detecting (4302) the communication fault may be implemented by the controller of the particular storage system (4300A) in accordance with a clock exchange protocol used to determine that a communication channel is not operating correctly, where clocks are described in greater detail in other sections herein. Other standard techniques for detecting (4302) a communication fault may also be implemented.
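The first detection technique, noticing that nothing has been received from a peer within some period of time, can be sketched as a simple timeout monitor. The class name `PeerMonitor`, the timeout parameter, and the injectable clock are assumptions for illustration and do not represent the clock exchange protocol mentioned above.

```python
import time
from typing import Callable, Dict, Iterable, List


class PeerMonitor:
    """Tracks when each peer was last heard from and flags silent peers."""

    def __init__(self, peers: Iterable[str], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock          # injectable so the logic can be tested
        self._timeout_s = timeout_s
        now = clock()
        self._last_heard: Dict[str, float] = {p: now for p in peers}

    def record_message(self, peer: str) -> None:
        """Call whenever any communication arrives from the peer."""
        self._last_heard[peer] = self._clock()

    def unreachable_peers(self) -> List[str]:
        """Peers whose silence exceeds the timeout, in sorted order."""
        now = self._clock()
        return sorted(p for p, t in self._last_heard.items()
                      if now - t > self._timeout_s)
```

A monotonic clock is used by default so that wall-clock adjustments cannot cause a peer to be reported as silent or live incorrectly.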

[0598] The example method of FIG. 43 also includes determining (4304) that at least one of the one or more storage systems (4300B-4300N) is configured to request mediation from the mediation target, such as the mediation service (4301), in response to the communication fault. Determining (4304) that at least one of the one or more storage systems (4300B-4300N) is so configured may be implemented by the storage system (4300A) performing the analysis described above. Based on a comparison of the votes corresponding to the one or more systems in communication with the storage system (4300A) and the votes corresponding to the one or more storage systems not in communication with the storage system (4300A), the storage system (4300A) can determine whether the one or more storage systems not in communication with it can form a quorum, and whether those storage systems could possibly be engaging in mediation with the mediation target.

[0599] The example method of FIG. 43 also includes, in response to determining (4304) that one or more of the storage systems (4300B-4300N) are configured to request mediation from the mediation target, determining (4306) whether to request mediation from the mediation target. Determining (4306) whether to request mediation from the mediation target may be implemented based on the analysis and on the determination (4304) that at least one of the one or more storage systems with which communication has been lost may be requesting mediation from the mediation target. If the storage system (4300A) determines that at least one of the one or more storage systems may be requesting mediation from the mediation target, the storage system (4300A) may also request mediation from the mediation target. Otherwise, the storage system (4300A) and any storage systems in communication with it may engage a quorum policy to detach at least one of the one or more storage systems, or they may determine that another set of storage systems may have a quorum and stop operating on their copy of the synchronously replicated dataset (those storage systems effectively go offline until communication can be established with multiple storage systems that may help determine the state of the in-sync and out-of-sync lists of storage systems).
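The branch structure of steps 4304 and 4306 can be summarized in a sketch. It assumes a strict-majority quorum rule and, as a conservative choice not specified by the disclosure, treats the case where no side has a quorum and no unreachable peer is configured for mediation as a reason to go offline. All names are hypothetical.

```python
from enum import Enum
from typing import Dict, Set


class Response(Enum):
    REQUEST_MEDIATION = "request mediation from the mediation target"
    DETACH_BY_QUORUM = "detach unreachable systems using the quorum policy"
    GO_OFFLINE = "stop operating on the replicated dataset"


def decide_response(votes: Dict[str, int], reachable: Set[str],
                    mediation_configured: Set[str]) -> Response:
    """Choose a response for the analyzing system and its reachable peers."""
    total = sum(votes.values())
    mine = sum(v for s, v in votes.items() if s in reachable)
    theirs = total - mine
    unreachable = set(votes) - reachable

    mine_has_quorum = 2 * mine > total
    theirs_has_quorum = 2 * theirs > total

    # Step 4304: could an unreachable peer be asking the mediator right now?
    peer_may_mediate = (not mine_has_quorum and not theirs_has_quorum
                        and bool(unreachable & mediation_configured))
    # Step 4306: pick the response.
    if peer_may_mediate:
        return Response.REQUEST_MEDIATION
    if mine_has_quorum:
        return Response.DETACH_BY_QUORUM
    # Either the other side may hold a quorum or nothing can be proven
    # (assumed conservative default).
    return Response.GO_OFFLINE
```

Keeping the quorum checks ahead of mediation reflects the earlier description that mediation is used only when a quorum cannot be established by either set.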

[0600] For further explanation, FIG. 44 sets forth a flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. The example method illustrated in FIG. 44 is similar to the example method illustrated in FIG. 43 in that it also includes: detecting (4302), by a particular storage system (4300A) of the storage systems (4300A-4300N), a communication failure with one or more of the storage systems (4300B-4300N), where the particular storage system (4300A) is configured to request mediation from a mediation target; determining (4304) that at least one of the one or more storage systems (4300B-4300N) is configured to request mediation from the mediation target, such as a mediation service (4301), in response to the communication failure; and, in response to determining (4304) that one or more of the storage systems (4300B-4300N) are configured to request mediation from the mediation target, determining (4306) whether to request mediation from the mediation target.

[0601] However, the example method illustrated in FIG. 44 further includes: determining (4402), by a first set of one or more storage systems, that there is consistent communication among the storage systems of the first set, where each of the storage systems corresponds to zero or more votes in a quorum protocol that determines whether the first set of one or more storage systems may detach a second set of one or more storage systems; determining (4404), by the first set of one or more storage systems, a lack of communication with the storage systems of the second set of one or more storage systems, where the first set of one or more storage systems cannot form a quorum; and determining (4406), by the first set of one or more storage systems, that the second set of one or more storage systems also cannot form a quorum.

[0602] Determining (4402), by the first set of one or more storage systems, that there is consistent communication among the storage systems of the first set may be performed by each storage system of the first set exchanging status messages with every other storage system of the first set. In another example, a given storage system may determine that the storage systems in consistent communication with one another are all of the storage systems on the in-sync list, except for the one or more storage systems for which a communication failure has been detected.
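As a minimal sketch of the status-message approach to determination (4402), assume each storage system records the set of peers from which it has recently received a status message. The function name `consistent_communication` and the `heard_from` mapping are hypothetical conveniences for this illustration only.

```python
def consistent_communication(first_set, heard_from):
    """Return True if every system in first_set has heard from every other one.

    heard_from maps a system name to the set of peers whose status
    messages it has received (an assumed bookkeeping structure).
    """
    return all(
        peer in heard_from.get(system, set())
        for system in first_set
        for peer in first_set
        if peer != system
    )
```

A single missing message in either direction is enough for the check to fail, which matches the requirement that each system exchange status messages with every other system of the first set.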

[0603] Determining (4404), by the first set of one or more storage systems, a lack of communication with the storage systems of the second set of one or more storage systems may be performed similarly to detecting (4302) a communication failure, as described above with respect to FIG. 43.

[0604] Determining (4406), by the first set of one or more storage systems, that the second set of one or more storage systems also cannot form a quorum may be performed as described above with respect to whether one set of storage systems can establish that another set of storage systems can form a quorum.

[0605] For further explanation, FIG. 45 sets forth a flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. The example method illustrated in FIG. 45 is similar to the example method illustrated in FIG. 44 in that it also includes: detecting (4302), by a particular storage system (4300A) of the storage systems (4300A-4300N), a communication failure with one or more of the storage systems (4300B-4300N), where the particular storage system (4300A) is configured to request mediation from a mediation target; determining (4304) that at least one of the one or more storage systems (4300B-4300N) is configured to request mediation from the mediation target, such as a mediation service (4301), in response to the communication failure; determining (4402), by a first set of one or more storage systems, that there is consistent communication among the storage systems of the first set; determining (4404), by the first set of one or more storage systems, a lack of communication with the storage systems of a second set of one or more storage systems; determining (4406), by the first set of one or more storage systems, that the second set of one or more storage systems would not have a sufficient number or votes to form a quorum; and, in response to determining (4304) that one or more of the storage systems (4300B-4300N) are configured to request mediation from the mediation target, determining (4306) whether to request mediation from the mediation target.

[0606] However, the example method illustrated in FIG. 45 further specifies that determining (4404), by the first set of one or more storage systems, a lack of communication with the storage systems of the second set of one or more storage systems further includes determining (4502) that the first set of one or more storage systems cannot form a quorum, in accordance with the first set including exactly half of the votes for the storage systems that are synchronously replicating the dataset.

[0607] Determining (4502) that the first set of one or more storage systems cannot form a quorum, in accordance with the first set including exactly half of the votes for the storage systems that are synchronously replicating the dataset, may be performed as described above with respect to whether one set of storage systems can establish that another set of storage systems can form a quorum: if the first set of one or more storage systems controls exactly half or more than half of the total quorum votes, then the remaining storage systems cannot establish a quorum for lack of votes.
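The vote arithmetic behind determinations (4404), (4406), and (4502) can be illustrated with a short sketch. It assumes, purely for illustration, that a quorum requires a strict majority of all votes; the function names and the `votes` mapping (system name to zero or more votes) are hypothetical.

```python
def can_form_quorum(votes, members):
    """Assume a quorum needs strictly more than half of all votes."""
    total = sum(votes.values())
    held = sum(votes[m] for m in members)
    return 2 * held > total


def neither_side_has_quorum(votes, first_set):
    """True when the first set and the remaining systems both lack a quorum."""
    second_set = set(votes) - set(first_set)
    return (not can_form_quorum(votes, first_set)
            and not can_form_quorum(votes, second_set))
```

Under this assumption, a first set holding exactly half of the votes cannot form a quorum, and neither can the remainder, which is the case in which mediation may be needed to break the tie.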

[0608] For further explanation, FIG. 46 sets forth a flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. The example method illustrated in FIG. 46 is similar to the example method illustrated in FIG. 44 in that it also includes: detecting (4302), by a particular storage system (4300A) of the storage systems (4300A-4300N), a communication failure with one or more of the storage systems (4300B-4300N), where the particular storage system (4300A) is configured to request mediation from a mediation target; determining (4304) that at least one of the one or more storage systems (4300B-4300N) is configured to request mediation from the mediation target, such as a mediation service (4301), in response to the communication failure; determining (4402), by a first set of one or more storage systems, that there is consistent communication among the storage systems of the first set; determining (4404), by the first set of one or more storage systems, a lack of communication with the storage systems of a second set of one or more storage systems; determining (4406), by the first set of one or more storage systems, that the second set of one or more storage systems would also be unable to form a quorum; and, in response to determining (4304) that one or more of the storage systems (4300B-4300N) are configured to request mediation from the mediation target, determining (4306) whether to request mediation from the mediation target.

[0609] However, the example method illustrated in FIG. 46 further specifies that determining (4406), by the first set of one or more storage systems, that the second set of one or more storage systems also cannot form a quorum further includes determining (4602) that the second set of one or more storage systems would be unable to form a quorum, in accordance with the second set including half of the votes for the storage systems that synchronously replicate the dataset.

[0610] Determining (4602) that the second set of one or more storage systems would be unable to form a quorum, in accordance with the second set including half of the votes for the storage systems that synchronously replicate the dataset, may be performed as described above with respect to whether one set of storage systems can establish that another set of storage systems can form a quorum.

[0611] For further explanation, FIG. 47 sets forth a flowchart illustrating an example method for determining active membership among storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage systems (4300A-4300N) illustrated in FIG. 47 may be similar to the storage systems described above with respect to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3C, or any combination thereof. In fact, the storage systems (4300A-4300N) illustrated in FIG. 47 may include the same, fewer, or additional components as compared with the storage systems described above.

[0612] The example method illustrated in FIG. 47 includes detecting (4702), by a first storage system of the storage systems, a communication failure with a second storage system of the storage systems. Detecting (4702), by the first storage system, a communication failure with the second storage system may be performed similarly to detecting (4302), by a particular storage system (4300A) of the storage systems (4300A-4300N), a communication failure with one or more of the storage systems (4300B-4300N), as described above with respect to FIG. 43.

[0613] The example method illustrated in FIG. 47 also includes generating (4704) a mediation analysis (4752) indicating that winning mediation would enable the storage systems in communication with the first storage system to continue servicing I/O requests directed to the dataset (4352) such that a performance criterion is better satisfied than if the second storage system were to continue servicing the I/O requests. Generating (4704) the mediation analysis (4752) may be performed by the storage system (4300A) comparing one or more performance characteristics of the first storage system, and of the one or more storage systems in communication with the first storage system, against one or more corresponding performance characteristics of the one or more storage systems not in communication with the first storage system, to determine which set of storage systems would process I/O requests more effectively, efficiently, or reliably. For example, a storage system may track multiple metrics indicating performance in processing I/O requests, and the metrics may be affected by processor speed, network latency, read latency, and write latency, among other factors.
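One hypothetical way to realize the comparison underlying the mediation analysis (4752) is sketched below. The `Metrics` fields, the latency-sum score, and the function names are assumptions of this sketch, not the disclosed method; the metrics for the unreachable set are assumed to be the last values observed before communication was lost, and both groups are assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    read_latency_ms: float
    write_latency_ms: float
    network_latency_ms: float


def score(group):
    """Mean of the summed latencies across a non-empty group; lower is better."""
    totals = [m.read_latency_ms + m.write_latency_ms + m.network_latency_ms
              for m in group]
    return sum(totals) / len(totals)


def reachable_side_preferred(reachable, unreachable):
    """True if the reachable set is expected to serve I/O at least as well."""
    return score(reachable) <= score(unreachable)
```

A real analysis could weight processor speed, reliability, or other factors differently; the simple latency sum here only stands in for whatever performance criterion an implementation adopts.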

[0614] In other examples, the mediation analysis may reflect a determination of whether a quorum can be established among the storage systems that are not in communication, where the mediation analysis indicates that the storage system (4300A) is to engage in mediation or in a quorum protocol, as described above with respect to FIG. 43.

[0615] The example method illustrated in FIG. 47 also includes requesting (4706) mediation from the mediation target (4301) in accordance with the mediation analysis. Requesting (4706) mediation from the mediation target (4301) in accordance with the mediation analysis may be performed as described above with respect to requesting mediation, and as further described in application Ser. No. 14/703,559.

[0616] For further explanation, FIG. 48 sets forth a flowchart illustrating an example method for synchronizing metadata among storage systems that synchronously replicate a dataset in accordance with some embodiments of the present disclosure. Although depicted in less detail, the storage system (4800A) illustrated in FIG. 48 may be similar to the storage systems described above with respect to FIGS. 1A-1D, FIGS. 2A-2G, FIGS. 3A-3C, or any combination thereof. In fact, the storage system (4800A) illustrated in FIG. 48 may include the same, fewer, or additional components as compared with the storage systems described above.

[0617] In these examples, pod membership may be defined using a list of storage systems, and a subset of that list may be presumed to be synchronized, or in sync, for the pod. In some cases, the subset may include every one of the storage systems for the pod, and the list may be considered metadata that is common to all of the storage systems and that is maintained consistently across the pod through the use of one or more consistency protocols applied in response to changes in pod membership. A "pod", as the term is used here and throughout the remainder of the present application, may be embodied as a management entity that represents a dataset, a set of managed objects and management operations, a set of access operations to modify or read the dataset, and a plurality of storage systems. The management operations may modify or query managed objects equivalently through any of the storage systems, and the access operations to read or modify the dataset operate equivalently through any of the storage systems. Each storage system may store a separate copy of the dataset as a proper subset of the datasets stored and advertised for use by the storage system, and operations to modify managed objects or the dataset that are performed and completed through any one storage system are reflected in subsequent management operations to query the pod or in subsequent access operations to read the dataset. Additional details regarding a "pod" may be found in previously filed provisional patent application Ser. No. 62/518,071, which is incorporated herein by reference.

[0618] A storage system may be considered in sync for a pod if it is at least within a recovery of having identical idle content for the last written copy of the dataset associated with the pod. Idle content is the content after any in-progress modifications have completed with no processing of new modifications. In some cases, this may be referred to as "crash-recoverable" consistency. Recovery of a pod may be considered the process of reconciling differences in applying concurrent updates to the in-sync storage systems in the pod. Recovery may resolve any inconsistencies between storage systems in the completion of concurrent modifications that had been requested of various members of the pod but that were never signaled to any requester as having completed successfully.

[0619] Given the use of a list of storage systems for a pod, a storage system that is listed as a pod member but is not listed as in sync for the pod may be considered detached from the pod. Conversely, a storage system that is listed as a pod member, and that is also listed as in sync and as currently available to actively serve data for the pod, may be considered online for the pod. Further, each storage system of a pod may have its own copy of the membership list, including which storage systems it last knew to be in sync and which storage systems it last knew to comprise the complete set of pod members.
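The membership states described above can be summarized in a small sketch. The `PodState` names and the three input sets (`members`, `in_sync`, `available`) are hypothetical representations of one storage system's local copy of the membership list, chosen only for illustration.

```python
from enum import Enum


class PodState(Enum):
    NOT_A_MEMBER = "not_a_member"
    DETACHED = "detached"   # listed as a member, not listed as in sync
    IN_SYNC = "in_sync"     # in sync, but not currently serving data
    ONLINE = "online"       # in sync and available to serve data


def classify(system, members, in_sync, available):
    """Classify one system from a local copy of the membership list."""
    if system not in members:
        return PodState.NOT_A_MEMBER
    if system not in in_sync:
        return PodState.DETACHED
    if system in available:
        return PodState.ONLINE
    return PodState.IN_SYNC
```

Because each storage system holds its own copy of the list, two systems could classify the same peer differently for a short time; the sketch only evaluates a single local view.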

[0620] In this example, to be online for a pod, the membership list for a given storage system indicates that the given storage system is in sync for the pod, and the given storage system is able to communicate with all other storage systems in the membership list that are indicated as in sync. If a storage system cannot establish that it is in sync and in communication with all other storage systems in the membership list that are indicated as in sync, then the storage system stops processing new incoming I/O commands or requests directed to the pod until it can establish that it is in sync and in communication with all of those storage systems. In some examples, instead of stopping the processing of new I/O commands or requests in that situation, the storage system completes the I/O commands or requests with an error or exception. An I/O command or request may be a SCSI request, among other types of requests that use different network protocols. As an example, a first storage system may determine, based on one or more criteria, that a second storage system in the membership list should be detached, and the result of the first storage system detaching the second storage system is that the first storage system continues to receive and process I/O commands, at least because the first storage system is now in sync with all of the storage systems that remain in the membership list after the second storage system has been removed from it. However, to avoid a "split-brain" situation, which could lead to irreconcilable datasets, dataset corruption, or application corruption, among other dangers, the second storage system must be prevented from detaching the first storage system in such a way that the second storage system, in addition to the first storage system, continues to receive and process I/O commands directed to the dataset for the pod. In other words, if two different storage systems in a pod each believe that they have successfully detached the other, then a split-brain situation may result.
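A minimal sketch of the online check and of the two I/O behaviors described above follows. The function names, the `fail_fast` flag, and the string results are hypothetical; `in_sync` is one system's local list of in-sync members and `reachable` is the set of peers it can currently communicate with.

```python
def is_online(me, in_sync, reachable):
    """Online only if in sync and able to reach every other in-sync member."""
    return me in in_sync and (set(in_sync) - {me}) <= set(reachable)


def handle_io(me, in_sync, reachable, fail_fast=False):
    """Decide what to do with a new I/O request directed to the pod."""
    if is_online(me, in_sync, reachable):
        return "process"
    # Not provably in sync with all in-sync peers: either hold the request
    # until the situation is resolved, or complete it with an error.
    return "error" if fail_fast else "hold"
```

Detaching an unreachable peer shrinks `in_sync`, after which the same check can succeed again; the sketch deliberately leaves out how a detach is authorized, since that is where split-brain must be prevented.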

[0621] The situation of determining how to proceed when not communicating with another storage system that the membership list indicates as in sync may arise while a storage system is operating normally and notices one or more lost communications; while a storage system is recovering from a previous failure; while a storage system is switching operation from one set of storage system controllers to another set for whatever reason; during startup of a storage system or when network interfaces are connected or enabled; or during or after any combination of these and other kinds of events. In other words, whenever a storage system associated with a pod cannot communicate with all known non-detached members of the membership list, the storage system may wait (for example, for some predetermined amount of time) until communication can be established, may go offline and possibly continue waiting, or may determine that it is safe to detach the non-communicating storage system without risk of incurring a split-brain situation. Further, if a safe detach happens quickly enough, the storage system may remain online for the pod without interruption, with little more than a short delay and few or no failed requests; alternatively, some requests may result in a "busy" or "try again" response that can be recovered through lower-level requester-side operation handling, with no adverse effect on applications or other higher-level operations.

[0622] In some situations, a given storage system in a pod may determine that it is out of date, or differently configured, with respect to the other storage systems in the pod after it is first added to a pod for which it is in sync; for example, the existing storage systems in the pod may be configured with software, firmware, hardware, or a combination of software, firmware, or hardware that is newer than, or different from, that of the given storage system. As another example, a given storage system may determine that it is out of date or differently configured in response to reconnecting to another storage system and determining that the other storage system has marked the given storage system as detached; in this case, the given storage system may wait until it connects to some other set of storage systems that are in sync for the pod.

[0623] In these examples, the manner in which storage systems are added to or removed from a pod, or from the in-sync membership list, may determine whether transient inconsistencies can be avoided. For example, transient inconsistencies may arise because each storage system has its own copy of the membership list; because two or more independent storage systems in a pod may update their respective membership lists at different times, or at least at times other than exactly the same time; and because a local copy of the membership list, possibly inconsistent with the other copies, may be all of the membership information that a given storage system has available. As an example, suppose a first storage system is in sync for a pod and a second storage system is added. If the second storage system is updated to list both the first and second storage systems as in sync in its membership list before the first storage system lists both as in sync in its own membership list, and a failure then occurs that causes both storage systems to restart, the second storage system may start up and wait to connect to the first storage system, while the first storage system may be unaware that it should or could wait for the second storage system. Continuing with this example, if the second storage system then responds to its inability to connect with the first storage system by going through a process to detach the first storage system, the second storage system may succeed in completing a process of which the first storage system is unaware, resulting in a split-brain situation.

[0624] As an example technique for preventing the situation described above, the storage systems in a pod may follow a policy under which individual storage systems never disagree about whether each might elect to go through a detach process when they are not communicating. One way to ensure this is that, when a new storage system is added to the in-sync membership list for a pod, the new storage system first stores an indication that it is a detached member. At that point, the existing in-sync storage systems may locally store an indication that the new storage system is an in-sync pod member before the new storage system locally stores that same fact. As a result, if a set of reboots or network faults or outages occurs before the new storage system stores its own in-sync status, the original storage systems (those that were in-sync members of the pod before the attempt to add the new storage system) may detach the new storage system for non-communication, but the new storage system will wait.

[0625] Continuing with this example, a reverse version of this membership change may be needed to remove a communicating storage system from a pod: first, the storage system being removed or detached locally stores an indication that it is not in sync; next, the storage systems that will remain in the pod store an indication that the storage system being removed is no longer in sync. At that point, both the storage systems remaining in the pod and the storage system being removed delete the removed storage system from their respective membership lists. Depending on the implementation, an intermediate persisted detached state may not be necessary in this example.
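
The ordering described in the two preceding paragraphs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `StorageSystem`, `Status`, `add_member`, `remove_member`, and the `steps_before_fault` parameter are hypothetical, and persistence is modeled as an in-memory dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DETACHED = "detached"
    IN_SYNC = "in-sync"


@dataclass
class StorageSystem:
    name: str
    # Local copy of the membership list: member name -> recorded status.
    members: dict = field(default_factory=dict)

    def may_initiate_detach(self) -> bool:
        # Only a system that has recorded itself as in-sync may detach peers.
        return self.members.get(self.name) is Status.IN_SYNC


def add_member(existing, new, steps_before_fault=3):
    """Run up to `steps_before_fault` of the three persisted steps, in order."""
    steps = [
        # 1. The new system first records that it is a detached member.
        lambda: new.members.update({new.name: Status.DETACHED}),
        # 2. The existing in-sync system records the new system as in-sync.
        lambda: existing.members.update({new.name: Status.IN_SYNC}),
        # 3. Only then does the new system record the in-sync membership.
        lambda: new.members.update(
            {existing.name: Status.IN_SYNC, new.name: Status.IN_SYNC}
        ),
    ]
    for step in steps[:steps_before_fault]:
        step()


def remove_member(remaining, leaving):
    """Reverse ordering for removing a communicating member."""
    # 1. The leaving system first records that it is not in sync.
    leaving.members[leaving.name] = Status.DETACHED
    # 2. The remaining system then records the same fact.
    remaining.members[leaving.name] = Status.DETACHED
    # 3. Both systems drop the removed system from their lists.
    leaving.members.pop(leaving.name, None)
    remaining.members.pop(leaving.name, None)
```

If a fault interrupts `add_member` after the first or second step, only the existing system considers itself entitled to detach a peer, so the two systems cannot both proceed independently.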

[0626] Further, whether care is required with local copies of membership lists may depend on the model that storage systems use to monitor each other or to validate their membership. For example, if a consensus model is used for both, or if an external system (or an external distributed or clustered system) is used to store and validate pod membership, then inconsistencies in locally stored membership lists may not matter.

[0627] Some example models for resolving spontaneous membership changes include the use of quorums, an external pod membership manager, or a race for a known resource. These example models may be used in response to a communication fault, to one or more storage systems in a pod failing, or to a storage system starting up (or failing over to a secondary controller) and being unable to communicate with paired storage systems in the pod. Given these events that may trigger a change in pod membership, different membership models may use different mechanisms to define how storage systems in a pod decide to detach one or more paired storage systems safely, and how to follow through when detaching them.

[0628] In some examples, multiple membership lists may be used in reaching consensus on a membership change. For example, for a given group of storage systems, each storage system may be on either an in-sync list or an out-of-sync list, and each storage system stores its own local copy of both lists. In this example, the group of storage systems may be {A, B}; initially the pod includes storage system A, and the pod is then stretched, i.e., expanded, from storage system A to storage system B. Stretching the pod is equivalent to expanding the storage system membership for the pod, and it may begin by ensuring that storage system A and storage system B are connected. Ensuring this connection may be a configuration step that precedes the stretch operation; however, mere connectivity between storage system A and storage system B does not stretch the pod, but rather makes stretching the pod possible. In this example, storage system A may receive a command (for example, from a management console for managing volumes, pods, and storage systems) indicating that the pod, i.e., specific volumes of the pod on storage system A, is to be stretched to storage system B. Given connectivity between storage system A and storage system B, the initial state may be described as follows: storage system A stores an in-sync list of {A} and an out-of-sync list of {B}, with an epoch identifier equal to n and a membership sequence equal to m, and storage system B stores empty lists for both the in-sync list and the out-of-sync list. In response to receiving the stretch command, storage system A may send storage system B a message indicating the pod identifier and a session identified by epoch identifier n, and storage system B responds by communicating back to storage system A. Further, a configuration-level heartbeat between storage systems A and B may distribute storage system A's in-sync and out-of-sync lists to storage system B; in response, storage system B may determine that it is not an in-sync member and initiate a resynchronization operation with storage system A, which synchronizes the pod across both storage systems. Further, in response to the resynchronization, storage system A may write an updated in-sync list of {A, B} to storage system B and then wait for storage system B to respond. At this point, storage system A is ready to begin communicating with storage system B for in-sync operation; however, storage system B does not engage in such communication until it has received the updated in-sync list {A, B} identifying it as an in-sync member of the pod. For example, storage system A may begin communicating by initiating a clock exchange operation with storage system B, but storage system B may not begin the clock exchange operation until it has received the pending in-sync list {A, B}. Clock exchange is described in more detail in application reference numbers 62/470,172 and 62/518,071, which are incorporated herein in their entireties.
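
As an illustration of the stretch sequence above, the following Python sketch tracks the in-sync and out-of-sync lists on two systems. It is a simplified model under assumptions: `PodView`, `StorageSystem`, `stretch`, and the `resynced` flag are hypothetical names, the resynchronization itself is reduced to a flag, and the epoch and membership sequence are carried over unchanged because the text does not say how they change during a stretch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PodView:
    in_sync: frozenset = frozenset()
    out_of_sync: frozenset = frozenset()
    epoch: int = 0
    sequence: int = 0


@dataclass
class StorageSystem:
    name: str
    view: PodView = field(default_factory=PodView)
    resynced: bool = False

    def on_heartbeat(self, peer_view: PodView) -> None:
        # A system that is not in the peer's in-sync list resynchronizes.
        if self.name not in peer_view.in_sync:
            self.resynced = True  # placeholder for the real resync operation

    def accepts_clock_exchange(self) -> bool:
        # Refuse in-sync traffic until the list naming this system arrives.
        return self.name in self.view.in_sync


def stretch(a: StorageSystem, b: StorageSystem) -> None:
    # The configuration-level heartbeat distributes A's lists to B.
    b.on_heartbeat(a.view)
    if not b.resynced:
        return
    updated = PodView(
        in_sync=a.view.in_sync | {b.name},
        out_of_sync=a.view.out_of_sync - {b.name},
        epoch=a.view.epoch,        # assumption: unchanged during a stretch
        sequence=a.view.sequence,  # assumption: unchanged during a stretch
    )
    b.view = updated  # A writes the updated list to B and waits for a reply
    a.view = updated  # assumption: A adopts the same list once B has replied
```

The check in `accepts_clock_exchange` mirrors the rule that storage system B does not take part in in-sync communication until it holds a list that names it as an in-sync member.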

[0629] Continuing with this example, to unstretch a pod, i.e., to remove a storage system from membership in the pod, the member storage systems may take the following steps. Suppose the pod membership is currently {A, B}, where storage system A and storage system B both have the same in-sync list of {A, B}, an out-of-sync list of { }, a current epoch of n, and a current membership sequence of m. In this situation, storage system A may receive a request to unstretch the pod to exclude storage system B. In response to the unstretch request, storage system A may send storage system B a message indicating a committed membership list with an in-sync list of {A, B} and an out-of-sync list of { }, and a pending membership list with an in-sync list of {A}, an out-of-sync list of { }, a current epoch of n, and a membership sequence of (m+1). In response to receiving the message from storage system A, storage system B applies the state information indicated in the message and responds to storage system A that the state change has been applied. In response to receiving this acknowledgment from storage system B, storage system A updates its local state information to indicate a committed membership list with an in-sync list of {A} and an out-of-sync list of { }, and a pending membership list with an in-sync list of {A}, an out-of-sync list of { }, and an epoch of (n+1); storage system A then stops communicating with storage system B. Storage system B detects the lost session but, because it has an in-sync list of {A}, requests re-establishment of the session from storage system A and receives a response indicating that storage system B is no longer a member of the pod.
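
The committed and pending lists exchanged during an unstretch can be modeled as below. This is a sketch under assumptions: `MembershipList`, `PodState`, `unstretch`, and `may_reestablish_session` are hypothetical names, message passing is collapsed into direct assignment, and the epoch and sequence of the newly committed list on storage system A are assumed to be those of the pending list it proposed, since the text does not state them.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MembershipList:
    in_sync: frozenset
    out_of_sync: frozenset
    epoch: int
    sequence: int


@dataclass
class PodState:
    committed: MembershipList
    pending: Optional[MembershipList] = None


def unstretch(a: PodState, b: PodState, leaving: str) -> None:
    # A proposes a pending list without the leaving system:
    # same epoch n, membership sequence m+1.
    pending = replace(
        a.committed,
        in_sync=a.committed.in_sync - {leaving},
        sequence=a.committed.sequence + 1,
    )
    # B applies the committed and pending lists from A's message, then
    # acknowledges (the acknowledgment is implicit in this sketch).
    b.committed, b.pending = a.committed, pending
    # On the acknowledgment, A commits the new membership and records a
    # pending list for epoch n+1.
    a.committed = pending
    a.pending = replace(pending, epoch=pending.epoch + 1)


def may_reestablish_session(owner: PodState, requester: str) -> bool:
    # Session requests from systems outside the committed in-sync list
    # are answered with "no longer a member".
    return requester in owner.committed.in_sync
```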

[0630] In examples that use quorum as the membership model, one technique for resolving detach operations is to use a majority, i.e., quorum, model for membership. For example, given three storage systems, as long as two are communicating, those two can agree to detach a third storage system that is not communicating. The third storage system, however, cannot by itself choose to detach either of the two communicating storage systems. In some cases, confusion can arise when communication among the storage systems in a pod is inconsistent. For example, for storage systems {A, B, C}, storage system B may be communicating with both storage system A and storage system C, while storage system A is communicating with storage system B but not with storage system C. In this situation, storage systems A and B could together detach storage system C, or storage systems B and C could together detach storage system A, but more communication among pod members may be needed to work out the membership.
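
The ambiguity in the {A, B, C} example can be made concrete with a small Python sketch. It is an illustration only: `detach_candidates` and `linked` are hypothetical helpers, communication is modeled as a set of undirected links, and only the detachment of a single system at a time is considered.

```python
from itertools import combinations


def linked(links, x, y):
    """True if systems x and y can currently communicate."""
    return frozenset((x, y)) in links


def detach_candidates(members, links):
    """Return the members that a communicating majority of the others could detach."""
    candidates = set()
    for target in members:
        voters = members - {target}
        has_majority = len(voters) > len(members) / 2
        voters_connected = all(
            linked(links, x, y) for x, y in combinations(sorted(voters), 2)
        )
        target_unreachable = any(not linked(links, v, target) for v in voters)
        if has_majority and voters_connected and target_unreachable:
            candidates.add(target)
    return candidates
```

With links A-B and B-C but no link A-C, the function returns both A and C as candidates, which is the situation where further communication among pod members is needed before a single outcome is chosen.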

[0631] Continuing with this example, a quorum policy, i.e., a quorum protocol, may resolve this situation when adding storage systems to a pod or removing them from it. For example, if a fourth storage system is added to the pod, a majority of storage systems becomes three. The transition from three storage systems, where two are required for a majority, to a pod with four storage systems, where three are required, may need something similar to the model described above for carefully adding a storage system to the in-sync list. For example, the fourth storage system, say storage system D, may start in an attaching state, not yet attached, in which it would never instigate a quorum vote. Given that storage system D is in the attaching state, storage systems A, B, and C may each be updated to be aware of storage system D and of the new requirement that three storage systems are needed to reach a majority decision to detach any particular storage system from the pod. Similarly, removing a given storage system from a pod may first transition that storage system to a detaching state before the other storage systems in the pod are updated. In some examples, a problem with the quorum model is that a common configuration is a pod with exactly two storage systems; in such cases, one solution is to add to the network a storage system that participates only in quorum votes for the pod and does not otherwise store the pod's dataset. Such a voting-only member would generally not instigate a round of quorum voting, but would participate only in votes instigated by storage systems in the pod that are configured as in-sync storage systems.
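
The majority arithmetic and the restricted roles described above can be sketched as follows. The `Role` enumeration and the two helper functions are hypothetical, and the sketch assumes that a system in the attaching state already counts toward the size of the pod once the existing members have been updated.

```python
from enum import Enum


class Role(Enum):
    ATTACHED = "attached"        # in-sync member that stores the dataset
    ATTACHING = "attaching"      # being added; counted but cannot start a vote
    VOTING_ONLY = "voting-only"  # votes but stores no data for the pod


def votes_required(pod: dict) -> int:
    # Strict majority of every system the pod members know about.
    return len(pod) // 2 + 1


def may_instigate_vote(pod: dict, name: str) -> bool:
    # Only attached, in-sync members may start a round of quorum voting.
    return pod.get(name) is Role.ATTACHED
```

For a two-system pod, adding one voting-only member gives three voters, so either data-bearing system together with the voting-only member forms a majority.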

[0632] In examples that use an external pod membership manager as the membership model, one technique is to use an external system, outside the storage systems themselves, to handle pod membership and manage membership transitions. For example, to become a member of a pod, a prospective storage system is configured to contact the pod membership system to request membership in the pod and to verify that the prospective storage system is in sync for the pod. In this model, any storage system that is online, i.e., in sync, for a pod should remain in communication with the pod membership system and should wait, i.e., go offline, if it loses communication with the pod membership system. In this example, the pod membership system may be implemented as a highly available cluster using any of a variety of cluster tools, such as Oracle™ RAC, Linux HA, VERITAS™ Cluster Server, IBM™ HACMP, or others. In other examples, the pod membership system may be implemented using a distributed configuration tool such as Etcd™ or Zookeeper™, or a reliable distributed database such as DynamoDB™ by Amazon. In still other examples, pod membership may be determined using a distributed consensus algorithm such as RAFT or PAXOS. Implementations based on concepts from RAFT may include a RAFT-based internal algorithm for membership, or a RAFT-inspired algorithm for log-style update consistency that may be used as part of an overall solution for determining valid, up-to-date membership and the current value of the up-to-date membership information.

[0633] In examples that use a race for a known resource, i.e., a racing protocol, as the membership model, a cluster manager for the pod may be implemented by resolving membership changes through a requirement to gain access to some resource that can be locked in some way to the exclusion of others, or to gain access to a majority of several such resources. For example, one technique is to use a resource reservation, such as a SCSI reservation or a SCSI persistent reservation, to obtain locks on one or more networked SCSI devices. In this example, if a storage system can lock a majority of a configured set of these networked devices, that storage system may detach other storage systems; otherwise, it may not. Further, to remain online, i.e., in sync, a storage system may need to frequently reassert or test its locks on the resources, or to be in communication with some other storage system that is asserting, reasserting, or testing those locks. Networked compute resources that can be asserted and tested in various ways may be used similarly.
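
A racing protocol over lockable resources can be sketched as below. The `LockableResource` class is a hypothetical in-memory stand-in for a networked device with an exclusive reservation; it does not model any real SCSI command set, and `wins_race` is an illustrative name.

```python
class LockableResource:
    """Stand-in for a networked device that supports an exclusive reservation."""

    def __init__(self, name):
        self.name = name
        self.holder = None

    def try_lock(self, owner: str) -> bool:
        # Succeeds if the resource is free or already held by `owner`,
        # which also models reasserting an existing lock.
        if self.holder in (None, owner):
            self.holder = owner
            return True
        return False


def wins_race(owner: str, resources) -> bool:
    """True if `owner` holds a majority of the configured resources."""
    held = sum(1 for resource in resources if resource.try_lock(owner))
    return held > len(resources) / 2
```

A storage system for which `wins_race` returns true may detach the others; calling it again later corresponds to reasserting or testing the locks in order to remain online.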

[0634] Continuing with this example, to ensure that an extended outage of all storage system members of a pod can be handled properly, while still allowing one storage system to resume as a pod member and detach the other storage system members, the network resources described above must have some persistent property that can be used to verify that no other storage system had previously detached the resuming storage system. If, however, a service provides only resource reservation without the ability to persistently store status information or other metadata, the resource reservation service may be used to gain access to some externally stored data, such as a third-party database or cloud storage, that is queried and written after a particular storage system gains access; the written data may record information that a detached storage system can query to determine that it has been detached.
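
The role of the persistent record can be shown with a short sketch. `ExternalStore` is a hypothetical in-memory stand-in for the externally stored data (for example, a database table or a cloud object), and `may_resume` is an illustrative name; the reservation itself is reduced to a boolean.

```python
class ExternalStore:
    """Stand-in for externally stored data that survives an outage."""

    def __init__(self):
        self._detached = set()

    def record_detach(self, pod: str, system: str) -> None:
        # Written by the system that won access and detached `system`.
        self._detached.add((pod, system))

    def was_detached(self, pod: str, system: str) -> bool:
        return (pod, system) in self._detached


def may_resume(store: ExternalStore, pod: str, system: str,
               holds_reservation: bool) -> bool:
    # Access to the store is gated by the reservation; a system resumes
    # only if no other system recorded detaching it earlier.
    return holds_reservation and not store.was_detached(pod, system)
```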

[0635] In some examples, the racing protocol may be implemented using a mediation service, which is a service that determines whether one storage system has the authority to detach another storage system from a pod. Example implementations of a mediation service are further described in Application Serial No. 15/703,559, which is incorporated herein in its entirety.

[0636] In another example, a combination of mechanisms may be used, which may be useful when a pod is stretched across three or more storage systems. In one example, preference rules may be combined with metadata. In a top-of-rack example, a larger central storage system in a data center or campus may itself be synchronously replicated to a large storage system at a second location. In that case, the top-of-rack storage systems may never resume alone and may give preference to either of the larger central storage systems at the two locations. The two larger storage systems may then be configured to mediate between each other; any smaller storage system that can connect to whichever of the two larger storage systems remains online may continue serving its pod, and any smaller storage system that cannot connect to either of the two large storage systems (or that can connect only to one that is offline for the pod) may stop serving the pod. Further, a preference model may be combined with a quorum-based model. For example, three large storage systems at three locations might use a quorum model among themselves, while smaller satellite or top-of-rack storage systems lack any votes and operate only if they can connect to one of the larger in-sync storage systems that is online.

[0637] In another example of combining mechanisms, mediation may be combined with a quorum model. For example, there may be three storage systems that normally vote among themselves to ensure that two storage systems can safely detach a third that is not communicating, while one storage system can never detach the other two by itself. However, after two storage systems have successfully detached the third, the configuration is down to two storage systems that agree that they are in sync and agree that the third storage system is detached. In that case, the two remaining storage systems may agree to use mediation (for example, with a cloud service) to handle a further storage system or network fault. This combination of mediation and quorum may be extended further. For example, in a pod stretched across four storage systems, any three can detach a fourth; but if two in-sync storage systems are communicating with each other and not with the two other storage systems that they both currently consider in sync, they may use mediation to safely detach the other two. Even in a pod configuration of five storage systems, if four storage systems vote to detach a fifth, the remaining four can use mediation if they split into two equal halves, and once the pod is down to two storage systems, those two can use mediation to resolve successive faults. A reduction from five to three might then use quorum among the three, allowing a drop to two, with the two remaining storage systems again using mediation if there are further failures. This general multi-mode quorum and mediation mechanism can handle a number of additional situations that neither quorum among symmetric storage systems nor mediation by itself can handle. The combination may increase the number of cases in which faulty or occasionally unreachable mediators can be used reliably (or, in the case of cloud mediators, mediators that customers may not entirely trust). Further, the combination better handles the case of a three-storage-system pod, where mediation alone might result in a first storage system successfully detaching a second and a third storage system on a network fault that affects only the first storage system. The combination may also better handle a sequence of faults affecting one storage system at a time, as described in the three-to-two-to-one example. These combinations work because being in sync and the detach operation result in specific states; in other words, the system is stateful, because moving from detached to in sync is a process, and each stage in a sequence of quorum and mediator relationships ensures that, at every point, all online, in-sync storage systems agree on the current persistent state for the pod. This differs from some other clustering models, in which simply having a majority of cluster nodes communicating again is expected to be enough to resume operation. A preference model may nevertheless still be added, with satellite or top-of-rack storage systems never participating in either mediation or quorum and serving pods only if they can connect to an online storage system that does participate in mediation or quorum.
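
One way to read the combined rule is: a strict majority of the in-sync systems decides by quorum, an exact half falls back to mediation, and a minority waits. The sketch below encodes that reading. It is an assumption-laden illustration, not the claimed mechanism: `resolve` is a hypothetical function, and the mediator is modeled as a callable that returns whether the calling group won mediation.

```python
from typing import Callable


def resolve(in_sync: frozenset, group: frozenset,
            mediator: Callable[[frozenset], bool]):
    """Decide what a communicating `group` of in-sync systems may do."""
    group = group & in_sync
    others = in_sync - group
    if not others:
        return "in-sync", frozenset()
    if 2 * len(group) > len(in_sync):
        # A strict majority can detach the rest by quorum alone.
        return "detach-by-quorum", others
    if 2 * len(group) == len(in_sync) and mediator(group):
        # An exact half (including one of two) needs to win mediation.
        return "detach-by-mediation", others
    # A minority, or a half that lost mediation, waits.
    return "wait", frozenset()
```

Because the in-sync set shrinks after each successful detach, repeated calls reproduce the three-to-two-to-one sequence: quorum is used while three systems are in sync, and mediation once only two remain.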

[0638] The example method shown in FIG. 48 includes determining (4802) that a membership event corresponds to a change in membership for a set of storage systems (4800A-4800B) synchronously replicating a dataset (4858). Determining (4802) that a membership event corresponds to such a change in membership may be implemented using different techniques. As one example, a storage system (4800A) may receive an I/O command indicating that a pod (4854) is to be stretched to include a new storage system (4800N), or that the pod (4854) is to be unstretched to exclude an existing storage system (4800N). As another example, the storage system (4800A) may detect that communication with a particular storage system (4800N) of the set of storage systems has been lost, or has become unreliable or inefficient beyond a specified threshold.

[0639] Receiving, at a storage system (4800A) of the set of storage systems (4800A-4800N), an I/O command for the pod (4854) or an I/O operation (4852) directed to the dataset (4858) may be implemented by using one or more communication protocols for transporting packets or data across a network, such as a storage area network (158), the Internet, or any computer network across which a host computer (4851) may communicate with the storage system (4800A). In some cases, receiving an I/O command for the pod (4854) or an I/O operation (4852) directed to the dataset (4858) may be implemented by using a communication interconnect (173) between the storage systems (4800A-4800N) of the pod (4854), or some other communication channel internal to the storage system (4800A), where the I/O command or operation is received from an application or process that is resident on or executing on the computing resources of the storage system. Further, resident or remote applications may use the storage systems (4800A-4800N) in implementing file systems, data objects, or databases that may provide functionality that depends on the storage systems (4800A-4800N) being in sync and online, and any of these protocols or applications may be a distributed implementation that operates on a synchronously replicated, symmetrically accessible underlying storage implementation. In this example, the storage system (4800A) may receive an I/O command or I/O operation (4852) at a network port, such as a SCSI port, where the I/O operation (4852) is a write command directed to a storage location that is part of the dataset (4858) being synchronously replicated across the storage systems (4800A-4800N) in the pod.

[0640] The example method shown in FIG. 48 also includes applying (4804), in accordance with the membership event, one or more membership protocols to determine a new set of storage systems for synchronously replicating the dataset (4858). Applying (4804) the one or more membership protocols may be implemented as described above using any one or more of a quorum protocol, an external pod membership manager protocol, or a racing protocol.
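Of the membership protocols named above, a quorum protocol is the simplest to sketch. The following Python fragment is a minimal illustration, assuming a strict-majority rule over the current members; the function name `new_member_set` and the majority rule itself are assumptions of this sketch rather than details taken from the disclosure.

```python
from typing import FrozenSet, Iterable, Optional


def new_member_set(
    current: FrozenSet[str], reachable: Iterable[str]
) -> Optional[FrozenSet[str]]:
    """Return the surviving members if they form a strict majority, else None.

    `current` is the set of storage systems replicating the dataset before the
    membership event; `reachable` is the subset that can still communicate.
    """
    survivors = current & frozenset(reachable)
    # Strict majority: more than half of the current members must survive.
    if 2 * len(survivors) > len(current):
        return survivors
    return None
```

In a two-member pod a single survivor is never a strict majority, so this rule alone cannot resolve that case; that is one situation in which an external pod membership manager or a racing protocol could be applied instead.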

[0641] The example method shown in FIG. 48 also includes, for one or more I/O operations (4852) directed to the dataset (4858), applying (4806) the one or more I/O operations (4852) to the dataset (4858) as synchronously replicated by the new set of storage systems. Applying (4806) the one or more I/O operations (4852) to the dataset (4858) synchronously replicated by the new set of storage systems may be implemented as described in Application Nos. 62/470,172 and 62/518,071, which are incorporated herein in their entirety and which describe receiving and processing I/O operations such that any changes to the dataset are synchronously replicated across all in-sync storage system members of the pod.
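As a simplified illustration of applying (4806) an I/O operation to the new set, the Python sketch below writes to every member of the new set before acknowledging. The `StorageSystem` class is a hypothetical in-memory stand-in, and `apply_write` is an assumed name; the incorporated applications describe the actual processing, which this sketch does not attempt to reproduce.

```python
from typing import Dict, Iterable


class StorageSystem:
    """Hypothetical in-memory stand-in for one pod member."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}

    def write(self, address: int, data: bytes) -> None:
        self.blocks[address] = data


def apply_write(new_set: Iterable[StorageSystem], address: int, data: bytes) -> bool:
    """Apply a write to every member of the new set, then acknowledge.

    Returning True models the acknowledgement to the host, which is given only
    after all members of the new set hold the data.
    """
    members = list(new_set)
    if not members:
        raise ValueError("the new set of storage systems is empty")
    for member in members:
        member.write(address, data)
    return True
```

A storage system that was detached by the membership protocols is simply absent from `new_set` and so receives no further writes.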

[0642] The reader will appreciate that the methods described above may be carried out by any combination of the storage systems described above. Furthermore, any of the storage systems described above may be paired with storage provided by a cloud service provider such as Amazon™ Web Services ("AWS"), Google™ Cloud Platform, Microsoft™ Azure, or others. In such an example, the members of a particular pod may therefore include one of the storage systems described above as well as a storage system that consists of storage provided by the cloud service provider. Likewise, the members of a particular pod may consist exclusively of logical representations of storage systems that consist of storage provided by the cloud service provider. For example, a first member of a pod may be a logical representation of a storage system consisting of storage in a first AWS availability zone, while a second member of the pod may be a logical representation of a storage system consisting of storage in a second AWS availability zone.

[0643] To facilitate the ability to synchronously replicate a dataset (or other managed objects such as virtual machines) to a storage system that consists of storage provided by a cloud service provider, and to perform all other functions described in this application, software modules that carry out various storage system functions may be executed on processing resources provided by the cloud service provider. Such software modules may execute, for example, on one or more virtual machines supported by the cloud service provider, such as block devices, Amazon™ Machine Image ("AMI") instances, and so on. Alternatively, such software modules may execute in a bare metal environment provided by the cloud service provider, such as an Amazon™ EC2 bare metal instance that has direct access to hardware. In such an embodiment, the Amazon™ EC2 bare metal instance may be paired with dense flash drives to effectively form a storage system. In either implementation, the software modules would ideally be collocated on cloud resources with other traditional data center services, such as virtualization software and services offered by VMware™ (for example, vSAN™). The reader will appreciate that many other implementations are possible and are within the scope of the present disclosure.

[0644] The reader will appreciate that, where a dataset or other managed object in a pod is retained on an on-premises storage system and the pod is stretched to include a storage system whose resources are provided by a cloud service provider, the dataset or other managed object may be transferred to that storage system as encrypted data. Such data may be encrypted by the on-premises storage system, so that the data stored on resources provided by the cloud service provider is encrypted while the cloud service provider does not hold the encryption key. In this way, data stored in the cloud may be more secure because the cloud has no access to the encryption key. Similarly, network encryption could be used when data is originally written to the on-premises storage system, and the encrypted data could then be transferred to the cloud such that the cloud continues to have no access to the encryption key.

[0645] Through the use of storage systems that consist of storage provided by a cloud service provider, disaster recovery may be offered as a service. In such an example, datasets, workloads, other managed objects, and so on may reside on an on-premises storage system and may be synchronously replicated to a storage system whose resources are provided by the cloud service provider. If a disaster strikes the on-premises storage system, the storage system whose resources are provided by the cloud service provider may take over processing of requests directed to the dataset, assist in migrating the dataset to another storage system, and so on. Likewise, the storage system whose resources are provided by the cloud service provider may serve as an on-demand secondary storage system that may be used during periods of heavy utilization or as otherwise needed. The reader will appreciate that user interfaces or similar mechanisms may be designed to initiate many of the functions described herein, such that enabling disaster recovery as a service may be as simple as performing a single mouse click.
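The takeover behavior described above can be sketched, under simplifying assumptions, as choosing the first healthy storage system from a preference-ordered list. The function name `choose_servicer`, the system labels, and the health-check callback are all hypothetical and serve only to illustrate the idea; they are not drawn from the disclosure.

```python
from typing import Callable, Sequence


def choose_servicer(systems: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy storage system in preference order.

    `systems` lists candidates from most to least preferred, for example the
    on-premises system first and the cloud-backed system second.
    """
    for system in systems:
        if is_healthy(system):
            return system
    raise RuntimeError("no healthy storage system is available")
```

With the on-premises system listed first, requests go to it while it is healthy and move to the cloud-backed system only when it is not.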

[0646] Through the use of storage systems that consist of storage provided by a cloud service provider, high availability may also be offered as a service. In such an example, datasets, workloads, or other managed objects that may reside on an on-premises storage system may be synchronously replicated to a storage system whose resources are provided by the cloud service provider. In such an example, because of dedicated network connectivity to the cloud such as AWS Direct Connect, sub-millisecond latency to AWS from a variety of locations can be achieved. Applications can therefore run in a stretched cluster mode without massive expenditures upfront, and high availability can be achieved without the need for multiple, distinctly located on-premises storage systems to be purchased, maintained, and so on. The reader will appreciate that user interfaces or similar mechanisms may be designed to initiate many of the functions described herein, such that an application may be extended into the cloud by performing a single mouse click.

[0647] Through the use of storage systems that consist of storage provided by a cloud service provider, system restores may also be offered as a service. In such an example, point-in-time copies of databases, managed objects, and other entities that may reside on an on-premises storage system may be synchronously replicated to a storage system whose resources are provided by the cloud service provider. In such an example, if a need arises to restore a storage system back to a particular point in time, the point-in-time copies of datasets and other managed objects held on the storage system whose resources are provided by the cloud service provider may be used to restore the storage system.

[0648] Through the use of storage systems that consist of storage provided by a cloud service provider, data that is stored on an on-premises storage system may be natively piped into the cloud for use by various cloud services. In such an example, the data, which is in its native format as stored on the on-premises storage system, may be cloned and converted into a format that is usable by various cloud services. For example, such data may be cloned and converted into a format used by Amazon™ Redshift so that data analytics queries can be run against the data. Likewise, such data may be cloned and converted into a format used by Amazon™ DynamoDB, Amazon™ Aurora, or some other cloud database service. Because such conversion occurs outside of the on-premises storage system, resources within the on-premises storage system may be preserved and retained for use in servicing I/O operations, while cloud resources that can be spun up as needed are used to perform the data conversion. This may be particularly valuable in embodiments where the on-premises storage system operates as the primary servicer of I/O operations and the storage system that consists of resources provided by the cloud service provider operates as more of a backup storage system. In fact, because managed objects can be synchronized across storage systems, in embodiments where an on-premises storage system was initially responsible for carrying out the steps required in an extract, transform, load ("ETL") pipeline, the components of such a pipeline may be exported to the cloud and run in a cloud environment. Through the use of such techniques, analytics as a service may also be offered, including using point-in-time copies (i.e., snapshots) of the dataset as inputs to analytics services.
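The clone-and-convert step described above can be illustrated with a short Python sketch that copies a set of records and serializes the copy as CSV, a format that many analytics services can load. The record layout, the column names, and the choice of CSV are assumptions made for this sketch; the disclosure does not specify the native format or the target format in this detail.

```python
import copy
import csv
import io
from typing import Dict, List


def clone_snapshot(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Take an independent point-in-time copy so later writes do not affect it."""
    return copy.deepcopy(records)


def to_csv(records: List[Dict[str, object]], columns: List[str]) -> str:
    """Convert cloned records to CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Because the conversion runs on the clone, it can be carried out on cloud resources while the original records continue to serve I/O operations.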

[0649] The reader will appreciate that applications can run on any of the storage systems described above and that, in some embodiments, such applications can run on a primary controller, a secondary controller, or even on both controllers at the same time. Examples of such applications may include applications that perform background batch database scans, applications that perform statistical analysis of runtime data, and so on.

[0650] Example embodiments are described largely in the context of a fully functional computer system. Readers of skill in the art will recognize, however, that the present disclosure may also be embodied in a computer program product disposed upon computer-readable storage media for use with any suitable data processing system. Such computer-readable storage media may be any storage medium for machine-readable information, including magnetic media, optical media, or other suitable media. Examples of such media include magnetic disks in hard drives or diskettes, compact discs for optical drives, magnetic tape, and others as will occur to those of skill in the art. Persons skilled in the art will immediately recognize that any computer system having suitable programming means is capable of executing the steps of the method as embodied in a computer program product. Persons skilled in the art will also recognize that, although some of the example embodiments described in this specification are oriented to software installed and executing on computer hardware, alternative embodiments implemented as firmware or as hardware are nevertheless well within the scope of the present disclosure.

[0651] Embodiments may include a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.

[0652] The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0653] Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

[0654] Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0655] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to some embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0656] These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

[0657] The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0658] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or that carry out combinations of special-purpose hardware and computer instructions.

[0659] The reader will appreciate that the steps described herein may be carried out in a variety of ways and that no particular ordering is required. It will be further understood from the foregoing description that modifications and changes may be made in various embodiments of the present disclosure without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present disclosure is limited only by the language of the following claims.

Claims

1. A method comprising:
in response to detecting, by one or more storage systems, a trigger event, requesting mediation from a mediation service external to the one or more storage systems, the mediation service configured to determine which storage systems will continue to service a dataset after the trigger event; and
processing data storage requests directed to the dataset by a first storage system that receives a positive mediation result from the mediation service, wherein storage systems that do not receive a positive mediation result from the mediation service do not process data storage requests directed to the dataset.

2. The method of claim 1, wherein the trigger event is a communication failure between the first storage system and at least one other storage system among the one or more storage systems.

3. The method of claim 1, wherein the trigger event is the expiration of a synchronous replication lease.

4. The method of claim 2, further comprising:
detaching, by the first storage system, the at least one other storage system of the one or more storage systems from the one or more storage systems that synchronously replicate the dataset in response to an indication of the positive mediation result from the mediation service.

5. The method of claim 1, wherein the request for mediation from each storage system is a request to establish a shared resource indicating an identifier for each requesting storage system.

6. The method of claim 4, wherein:
the first storage system waits a first time period before requesting the mediation; and
the at least one other storage system of the one or more storage systems waits a second time period before requesting the mediation.

7. The method of claim 6, wherein the first storage system is designated as a preferred storage system based on the first time period being less than the second time period.

8. The method of claim 1, wherein the mediation service is a database service provided by a cloud service provider.

9. The method of claim 1, wherein the mediation service is implemented on a host computer that issues requests to modify data to the first storage system and at least one other storage system among the one or more storage systems.

10. The method of claim 1, wherein the mediation service is implemented in a failure domain that is distinct from a failure domain of either the first storage system or at least one other storage system of the one or more storage systems.

11. The method of claim 1, wherein the mediation service is one of a plurality of mediation services.

12. An apparatus comprising:
a computer processor; and
a computer memory operatively coupled to the computer processor, the computer memory having disposed therein computer program instructions that, when executed by the computer processor, cause the apparatus to carry out the steps of:
in response to detecting, by one or more storage systems, a trigger event, requesting mediation from a mediation service external to the one or more storage systems, the mediation service configured to determine which storage systems will continue to service a dataset after the trigger event; and
processing data storage requests directed to the dataset by a first storage system of the one or more storage systems that receives a positive mediation result from the mediation service, wherein a storage system that does not receive a positive mediation result from the mediation service does not process data storage requests directed to the dataset.

13. The apparatus of claim 12, wherein the trigger event is a communication failure between the first storage system and at least one other storage system among the one or more storage systems.

14. The apparatus of claim 12, wherein the trigger event is the expiration of a synchronous replication lease.

15. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to carry out the step of:
detaching, by the first storage system, at least one other storage system of the one or more storage systems from the one or more storage systems that synchronously replicate the dataset in response to an indication of the positive mediation result from the mediation service.

16. The apparatus of claim 12, wherein the request for mediation from each storage system is a request to establish a shared resource indicating an identifier for each requesting storage system.

17. The apparatus of claim 12, wherein:
the first storage system waits a first time period before requesting the mediation; and
at least one other storage system of the one or more storage systems waits a second time period before requesting the mediation.

18. The apparatus of claim 17, wherein the first storage system is designated as a preferred storage system based on the first time period being less than the second time period.

19. The apparatus of claim 12, wherein the mediation service is a database service provided by a cloud service provider.

1. A method of mediating between a set of storage systems that synchronously replicate a dataset, the set of storage systems including a first storage system and a second storage system, the method comprising:
detecting, by the first storage system, a communication failure between the first storage system and the second storage system;
requesting, by the first storage system, mediation from a mediation service in response to detecting the communication failure; and
detaching, by the first storage system, the second storage system from the set of storage systems that synchronously replicate the dataset in response to an indication of a positive mediation result from the mediation service.