Hardware RAID Monitoring
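Hardware RAID is no longer unusual, even on inexpensive servers.
I have supported a number of such systems, and a surprising number of them were not being monitored at all.
So, for the cases I have handled in the past, I have put together a brief summary of how to set up and use monitoring.
The terminal output is for reference only. Some of it is masked, and in most cases it shows a healthy system.
How to judge an error or fault differs by tool and by the failed component, so refer to each tool's manual for details.
Personally, I think it is fine to decide based on the status alone and simply replace the part (in most cases the hardware will be under a maintenance contract anyway).
If the controller carries a battery for its cache, the cause is often just a worn-out battery.
Like disks, batteries are consumables, so ideally they are replaced on a schedule before they raise an alert.
Note that monitoring software and tools usually come bundled with the hardware, so use them if you have them.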
LSI Logic MegaRAID
- Identifying the RAID card
$ grep -i megaraid /var/log/dmesg
scsi0 : LSI Logic MegaRAID driver
A command-line tool called MegaCLI is publicly available, so build it and use it.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5082327
http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
# yum info megacli
...
Name : megacli
Arch : i386
Version: 2.00.11
Release: 2
Size : 1.7 M
Repo : installed
Summary: MegaCli is used to manage SAS RAID controllers
...
- Adapter information
# megacli -AdpAllinfo -aALL
Adapter #0
==============================================================================
Versions
================
Product Name : MegaRAID SAS 8708EM2
Serial No : ...
FW Package Build: ...
Mfg. Data
================
Mfg. Date : ...
Rework Date : ...
Revision No : ...
Battery FRU : ...
Image Versions In Flash:
================
FW Version : ...
BIOS Version : ...
WebBIOS Version : ...
Ctrl-R Version : ...
Preboot CLI Version: ...
Boot Block Version : ...
Pending Images In Flash
================
None
PCI Info
================
Vendor Id : ...
Device Id : ...
SubVendorId : ...
SubDeviceId : ...
Host Interface : PCIE
Number of Frontend Port: 0
Device Interface : PCIE
Number of Backend Port: 8
Port : Address
0 ...
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
HW Configuration
================
SAS Address : ...
BBU : Absent
Alarm : Present
NVRAM : Present
Serial Debugger : Present
Memory : Present
Flash : Present
Memory Size : 128MB
Settings
================
Current Time : ...
Predictive Fail Poll Interval : ...
Interrupt Throttle Active Count : 16
Interrupt Throttle Completion : 50us
Rebuild Rate : 30%
PR Rate : 30%
Resynch Rate : 30%
Check Consistency Rate : 30%
Reconstruction Rate : 30%
Cache Flush Interval : 4s
Max Drives to Spinup at One Time : 2
Delay Among Spinup Groups : 12s
Physical Drive Coercion Mode : Disabled
Cluster Mode : Disabled
Alarm : Disabled
Auto Rebuild : Enabled
Battery Warning : Disabled
Ecc Bucket Size : 15
Ecc Bucket Leak Rate : 1440 Minutes
Restore HotSpare on Insertion : Enabled
Expose Enclosure Devices : Disabled
Maintain PD Fail History : Disabled
Host Request Reordering : Enabled
Auto Detect BackPlane Enabled : SGPIO/i2c SEP
Load Balance Mode : Auto
Capabilities
================
RAID Level Supported : RAID0, RAID1, RAID10
Supported Drives : SAS, SATA
Allowed Mixing:
Mix In Enclosure Allowed
Status
================
ECC Bucket Count : 0
Limitations
================
Max Arms Per VD : 32
Max Spans Per VD : 8
Max Arrays : 128
Max Number of VDs : 64
Max Parallel Commands : 1008
Max SGE Count : 80
Max Data Transfer Size : 8192 sectors
Max Strips PerIO : 42
Min Stripe Size : 8kB
Max Stripe Size : 1024kB
Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 3
Disks : 2
Critical Disks : 0
Failed Disks : 0
Supported Adapter Operations
================
Rebuild Rate : Yes
CC Rate : Yes
BGI Rate : Yes
Reconstruct Rate : Yes
Patrol Read Rate : Yes
Alarm Control : Yes
Cluster Support : No
BBU : Yes
Spanning : Yes
Dedicated Hot Spare : Yes
Revertible Hot Spares : No
Foreign Config Import : Yes
Self Diagnostic : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares : Yes
Deny SCSI Passthrough : No
Deny SMP Passthrough : No
Deny STP Passthrough : No
Supported VD Operations
================
Read Policy : Yes
Write Policy : Yes
IO Policy : Yes
Access Policy : Yes
Disk Cache Policy : Yes
Reconstruction : Yes
Deny Locate : No
Deny CC : No
Supported PD Operations
================
Force Online : Yes
Force Offline : Yes
Force Rebuild : Yes
Deny Force Failed : No
Deny Force Good/Bad : No
Deny Missing Replace : No
Deny Clear : No
Deny Locate : No
Disable Copyback : No
Enable Copyback on SMART : No
Error Counters
================
Memory Correctable Errors : 0
Memory Uncorrectable Errors : 0
Cluster Information
================
Cluster Permitted : No
Cluster Active : No
Default Settings
================
Phy Polarity : 0
Phy PolaritySplit : 0
Background Rate : 30
Stripe Size : 64kB
Flush Time : 4 seconds
Write Policy : WB
Read Policy : None
Cache When BBU Bad : Disabled
Cached IO : No
SMART Mode : Mode 6
Alarm Disable : No
Coercion Mode : None
ZCR Config : Unknown
Dirty LED Shows Drive Activity : No
BIOS Continue on Error : Yes
Spin Down Mode : None
Allowed Device Type : SAS/SATA Mix
Allow Mix In Enclosure : Yes
Allow Mix In VD : No
Allow SATA In Cluster : No
Max Chained Enclosures : 3
Disable Ctrl-R : Yes
Enable Web BIOS : Yes
Direct PD Mapping : Yes
BIOS Enumerate VDs : Yes
Restore Hot Spare on Insertion : Yes
Expose Enclosure Devices : No
Maintain PD Fail History : No
Disable Puncturing : Yes
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled : No
LED Show Drive Activity : Yes
Cluster Disable : Yes
SAS Disable : No
Auto Detect BackPlane Enable : SGPIO/i2c SEP
Exit Code: 0x00
- Physical device information
# megacli -PDList -aALL
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ...
Foreign State: None
Enclosure Device ID: 252
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 1(path0)
Inquiry Data: SEAGATE ...
Foreign State: None
Exit Code: 0x00
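The physical device listing lends itself to a simple status-only check. The following is a minimal sketch, not an official method: the function name check_megaraid_pd is made up for this example, and it assumes that a healthy member drive reports a firmware state beginning with "Online" and that any non-zero media error or predictive failure count deserves attention. Hot spares and other legitimate states would need to be added for a real setup.

```shell
#!/bin/sh
# Sketch only: reads `megacli -PDList -aALL` output on stdin.
# Assumptions (not taken from the MegaCLI manual): a healthy member drive
# reports a firmware state starting with "Online", and any non-zero media
# error or predictive failure count is worth an alert.
check_megaraid_pd() {
    awk -F': *' '
        /^Slot Number:/ { slot = $2 }
        /^(Media Error Count|Predictive Failure Count):/ {
            if ($2 + 0 > 0) { printf "slot %s: %s = %s\n", slot, $1, $2; bad = 1 }
        }
        /^Firmware state:/ {
            if ($2 !~ /^Online/) { printf "slot %s: state %s\n", slot, $2; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (needs root and an installed megacli):
#   megacli -PDList -aALL | check_megaraid_pd || echo "RAID drive problem"
```

The function prints one line per anomaly and returns non-zero if it found any, so it can be dropped into cron or a monitoring agent that reacts to exit codes.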
- Logical device information
# megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:array0
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:...MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disabled
Exit Code: 0x00
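For the logical drive, the State line is usually all a monitor needs. Here is a rough sketch that treats anything other than "Optimal" as a problem. The function name check_megaraid_ld is hypothetical, and treating "Optimal" as the only healthy value is an assumption based on the healthy output above.

```shell
#!/bin/sh
# Sketch only: reads `megacli -LDInfo -Lall -aALL` output on stdin.
# Assumption: "Optimal" is the only healthy value of the State line.
check_megaraid_ld() {
    awk '
        /^Virtual (Disk|Drive):/ { vd = $3 }
        /^State *:/ {
            state = $0
            sub(/^State *: */, "", state)     # drop the "State:" label
            sub(/[ \t\r]+$/, "", state)       # drop trailing whitespace
            if (state != "Optimal") {
                printf "virtual disk %s: state %s\n", vd, state
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (needs root and an installed megacli):
#   megacli -LDInfo -Lall -aALL | check_megaraid_ld || echo "array degraded"
```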
This is vendor-specific, but NEC publishes tools for its Express5800 series.
Universal RAID Utility can be used on relatively recent Express5800 models.
http://support.express.nec.co.jp/dload/420842-A01/index.html
MegaMonitor can be used on Express5800 models that no longer appear in the product information.
(The RAID card firmware may be too old for megacli to work at all.)
http://www.express.nec.co.jp/linux/distributions/confirm/gam/megamgr.htm
LSI Logic Fusion-MPT
- Identifying the RAID card
$ grep -i mptbase /var/log/dmesg
mptbase: ioc0: Initiating bringup
A command-line tool called mpt-status is publicly available, so build it and use it.
http://sven.stormbind.net/mpt-status-rhel/
daemonize is also available in EPEL, so you can install it from there instead.
# yum --enablerepo=epel info daemonize
...
Name : daemonize
Arch : x86_64
Version : 1.7.3
Release : 1.el6
Size : 19 k
Repo : installed
Summary : Run a command as a Unix daemon
URL : http://www.clapper.org/software/daemonize/
License : BSD
Description : daemonize runs a command as a Unix daemon. As defined in W.
...
# yum info mpt-status
...
Name : mpt-status
Arch : x86_64
Version : 1.2.0
Release : 3.el6
Size : 31 k
Repo : installed
Summary : Get RAID status out of mpt (and other) HW RAID controllers
URL : http://www.drugphish.ch/~ratz/mpt-status/
License : GPLv2+
...
# chkconfig --list mpt-statusd
mpt-statusd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# service mpt-statusd start
Starting mpt-status monitor: mpt-statusd
[ OK ]
Normal state:
# mpt-status -s
log_id 0 OPTIMAL
phys_id 0 ONLINE
phys_id 1 ONLINE
Failure state:
# mpt-status -s
log_id 0 DEGRADED
phys_id 0 ONLINE
phys_id 1 FAILED
# mpt-status -v
ioc0 vol_id 0 type IM, 2 phy, 33 GB, state DEGRADED, flags ENABLED
ioc0 phy 0 scsi_id 0 IBM-ESXS MAP3367NC FN B109, 33 GB, state ONLINE, flags NONE
ioc0 phy 1 scsi_id 1 IBM-ESXS MAP3367NC FN B109, 33 GB, state FAILED, flags OUT_OF_SYNC
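The short form is easy to check mechanically. The sketch below simply compares each line against the values seen in the normal output above. The function name check_mpt_status is made up for this example, and treating OPTIMAL and ONLINE as the only healthy values is an assumption drawn from these two samples rather than from the tool's documentation.

```shell
#!/bin/sh
# Sketch only: reads `mpt-status -s` output on stdin.
# Assumption: OPTIMAL (logical) and ONLINE (physical) are the only healthy
# states; every other line for those ids is printed and counted as a fault.
check_mpt_status() {
    awk '
        $1 == "log_id"  && $3 != "OPTIMAL" { print; bad = 1 }
        $1 == "phys_id" && $3 != "ONLINE"  { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (needs root and the mptctl driver loaded):
#   mpt-status -s | check_mpt_status || echo "mpt RAID problem"
```

Fed the failure sample above, it would print the DEGRADED and FAILED lines and return a non-zero status.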
HP SmartArray
- Identifying the RAID card
$ head -1 /proc/driver/cciss/cciss0
cciss0: HP Smart Array P400i Controller
http://www8.hp.com/jp/ja/support-drivers.html
Search for your product on the support page above and download "HP Array Configuration Utility CLI for Linux" (the 64-bit version, depending on your OS).
http://www.datadisk.co.uk/html_docs/redhat/hpacucli.htm
# hpacucli ctrl all show config
Smart Array P400i in Slot 0 (Embedded) (sn: ... )
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (... GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, ... GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, ... GB, OK)
# hpacucli ctrl slot=0 array A show
Smart Array P400i in Slot 0 (Embedded)
Array: A
Interface Type: SAS
Unused Space: 0 MB
Status: OK
MultiDomain Status: OK
A status tool called cciss_vol_status is also available.
http://h50146.www5.hp.com/products/software/oe/linux/mainstream/support/download/cciss_vol_status/
# cciss_vol_status /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array P400i) RAID 1 Volume 0 status: OK.
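Both Smart Array tools put the status at the end of each line, so a status-only check can be sketched as follows. The function names are hypothetical, and the rule that only entries ending in ", OK)" or "status: OK." are healthy is an assumption based on the healthy output above. Entries such as spare drives may be worded differently and would need their own handling.

```shell
#!/bin/sh
# Sketch only: reads `hpacucli ctrl all show config` output on stdin.
# Assumption: a healthy logicaldrive/physicaldrive entry ends in ", OK)".
check_hpacucli_config() {
    awk '
        /^[ \t]*(logicaldrive|physicaldrive) / && !/, OK\)[ \t\r]*$/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Sketch only: reads `cciss_vol_status <device>` output on stdin.
# Assumption: a healthy volume line ends in "status: OK.".
check_cciss_vol_status() {
    awk '
        /status:/ && !/status: OK\.[ \t\r]*$/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (needs root and the tools installed):
#   hpacucli ctrl all show config | check_hpacucli_config || echo "Smart Array problem"
#   cciss_vol_status /dev/cciss/c0d0 | check_cciss_vol_status || echo "volume problem"
```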
IBM/Adaptec ServeRAID
- Identifying the RAID card
$ grep -i serveraid /var/log/dmesg
scsi0 : ServeRAID
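When the same monitoring setup has to cover a mixed fleet, the identification commands used in each section of this article could be folded into one helper. This is only a sketch: the function name detect_raid and the labels it prints are invented for this example, and the patterns are just the ones used in this article, so other driver versions may log different strings.

```shell
#!/bin/sh
# Sketch only: print a label for each controller family that is found.
# The labels are hypothetical names chosen for this example; the grep
# patterns and paths are the ones used in this article.
detect_raid() {
    dmesg_file=${1:-/var/log/dmesg}
    cciss_proc=${2:-/proc/driver/cciss/cciss0}
    found=1
    if [ -r "$dmesg_file" ]; then
        if grep -qi megaraid "$dmesg_file"; then echo megaraid; found=0; fi
        if grep -qi mptbase "$dmesg_file"; then echo fusion-mpt; found=0; fi
        if grep -qi serveraid "$dmesg_file"; then echo serveraid; found=0; fi
    fi
    if [ -r "$cciss_proc" ]; then echo smartarray; found=0; fi
    return "$found"
}

# Usage:
#   detect_raid || echo "no known hardware RAID controller detected"
```

Both paths are parameters so the function can be tried against a saved dmesg file from another machine.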
A command-line tool called arcconf is publicly available, so use that.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5073618&brandind=5000008
http://www.obvious.co.nz/aacraid/arcconf/
# arcconf -v
| ARCCONF | IBM uniform command line interface
| ARCCONF | Version 9.30 (B17006)
| ARCCONF | (C) Adaptec 2003-2007
| ARCCONF | All Rights Reserved
...
# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Okay
Channel description : SAS/SATA
Controller Model : IBM ServeRAID 8k
Controller Serial Number : ...
Physical Slot : 0
Installed memory : 256 MB
Copyback : Disabled
Data scrubbing : Enabled
Defunct disk drive count : 0
Logical drives/Offline/Critical : 1/0/0
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : ...
Firmware : ...
Driver : ...
Boot Flash : ...
--------------------------------------------------------
Controller Battery Information
--------------------------------------------------------
Status : Okay
Over temperature : No
Capacity remaining : 100 percent
Time remaining (at current draw) : ... days, ... hours, ... minutes
--------------------------------------------------------
Controller Vital Product Data
--------------------------------------------------------
VPD Assigned# : ...
EC Version# : ...
Controller FRU# : ...
Battery FRU# : ...
----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
Logical drive name : Drive 1
RAID level : 5
Status of logical drive : Okay
Size : ... MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery
Partitioned : Yes
Number of segments : 3
Stripe-unit size : 256 KB
Stripe order (Channel,Device) : 0,2 0,1 0,3
Defunct segments : No
Defunct stripes : No
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,1
Reported Location : Enclosure 0, Slot 1
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
PFA : No
Device #1
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,2
Reported Location : Enclosure 0, Slot 2
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
PFA : No
Device #2
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,3
Reported Location : Enclosure 0, Slot 0
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
PFA : No
Device #3
Device is an Enclosure services device
Reported Channel,Device : 2,0
Enclosure ID : 0
Type : SES2
Vendor : IBM
Model : SAS SES-2 DEVICE
Firmware : 1.10
Status of Enclosure services device
Temperature : Normal
Command completed successfully.
The output differs slightly depending on the command version, so take care when building it into your monitoring.
# arcconf -v
| UCLI | Adaptec by PMC uniform command line interface
| UCLI | Version 7.0 (B18786)
| UCLI | (C) Adaptec by PMC 2003-2011
| UCLI | All Rights Reserved
...
# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Optimal
Channel description : SAS/SATA
Controller Model : IBM ServeRAID 8k
Controller Serial Number : ...
Physical Slot : 0
Installed memory : 256 MB
Copyback : Disabled
Background consistency check : Enabled
Automatic Failover : Enabled
Stayawake period : Disabled
Spinup limit internal drives : 0
Spinup limit external drives : 0
Defunct disk drive count : 0
Logical devices/Failed/Degraded : 1/0/0
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : ...
Firmware : ...
Driver : ...
Boot Flash : ...
--------------------------------------------------------
Controller Battery Information
--------------------------------------------------------
Status : Optimal
Over temperature : No
Capacity remaining : 100 percent
Time remaining (at current draw) : ... days, ... hours, ... minutes
----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
Logical device name : Drive 1
RAID level : 5
Status of logical device : Optimal
Size : ... MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery/ZMM
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : Yes
Failed stripes : No
Power settings : Disabled
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Segment 0 : ...
Segment 1 : ...
Segment 2 : ...
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device(T:L) : 0,1(1:0)
Reported Location : Enclosure 0, Slot 1
Reported ESD(T:L) : 2,0(0:0)
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
S.M.A.R.T. : No
S.M.A.R.T. warnings : 0
Device #1
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device(T:L) : 0,2(2:0)
Reported Location : Enclosure 0, Slot 2
Reported ESD(T:L) : 2,0(0:0)
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
S.M.A.R.T. : No
S.M.A.R.T. warnings : 0
Device #2
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device(T:L) : 0,3(3:0)
Reported Location : Enclosure 0, Slot 0
Reported ESD(T:L) : 2,0(0:0)
Vendor : IBM-ESXS
Model : ...
Firmware : ...
Serial number : ...
World-wide name : ...
Size : ... MB
Write Cache : Disabled (write-through)
FRU : ...
S.M.A.R.T. : No
S.M.A.R.T. warnings : 0
Device #3
Device is an Enclosure services device
Reported Channel,Device(T:L) : 2,0(0:0)
Enclosure ID : 0
Type : SES2
Vendor : IBM
Model : SAS 4 DRIVE BP
Firmware : 1.10
Status of Enclosure services device
Command completed successfully.
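Comparing the two listings, the older build reports "Okay" and "Status of logical drive" where the newer one reports "Optimal" and "Status of logical device". A check that has to survive a tool upgrade can accept both spellings, as in the sketch below. The function name check_arcconf is made up, and the set of healthy values ("Okay", "Optimal", "Online") is an assumption taken only from these two samples. Other legitimate states such as hot spares would need to be added.

```shell
#!/bin/sh
# Sketch only: reads `arcconf getconfig 1` output on stdin and accepts the
# wording of both tool versions shown above.
# Assumption: "Okay"/"Optimal" and "Online" are the only healthy values.
check_arcconf() {
    awk -F' *: *' '
        {
            key = $1; sub(/^[ \t]+/, "", key)      # label without indentation
            val = $2; sub(/[ \t\r]+$/, "", val)    # value without trailing space
        }
        /^[ \t]*Device #[0-9]+/ { dev = key }      # remember the current device
        key == "Controller Status" || key ~ /^Status of logical (drive|device)$/ {
            if (val != "Okay" && val != "Optimal") { printf "%s: %s\n", key, val; bad = 1 }
        }
        key == "Defunct disk drive count" && val + 0 > 0 {
            printf "%s: %s\n", key, val; bad = 1
        }
        key == "State" && val != "Online" {
            printf "%s: State %s\n", dev, val; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (needs root and an installed arcconf):
#   arcconf getconfig 1 | check_arcconf || echo "ServeRAID problem"
```

Matching on the label text rather than on line positions keeps the check indifferent to the extra fields the newer version prints.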
Using S.M.A.R.T.
Depending on the RAID card, S.M.A.R.T. Monitoring Tools (smartmontools) can be used.
It is worth using alongside the monitoring described above.
http://sourceforge.net/projects/smartmontools/
http://sourceforge.net/apps/trac/smartmontools/wiki/Supported_RAID-Controllers
# yum info smartmontools
...
Name : smartmontools
Arch : x86_64
Epoch : 1
Version : 5.42
Release : 2.el6
Size : 1.3 M
Repo : installed
From repo : base
Summary : Tools for monitoring SMART capable hard disks
URL : http://smartmontools.sourceforge.net/
License : GPLv2+
...
Incidentally, up to CentOS 4 the tool is included in kernel-utils, but it may not work with RAID cards.
# smartctl -a -dcciss,1 /dev/cciss/c0d0p1
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.5.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: HP
Product: ...
Revision: ...
User Capacity: ... bytes [... GB]
Logical block size: ... bytes
Logical Unit id: ...
Serial number: ...
Device type: disk
Transport protocol: SAS
Local Time is: ...
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 30 C
Drive Trip Temperature: 70 C
Manufactured in week 01 of year 2010
Specified cycle count over device lifetime: ...
Accumulated start-stop cycles: 10
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = ...
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0
Non-medium error count: 0
No self-tests have been logged
Long (extended) Self Test duration: ... seconds [... minutes]
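For monitoring, the health status line and the grown defect list are the easiest fields to watch in this output. The sketch below is written for the SAS/SCSI-style report shown above, and the function name check_smart_scsi is invented for this example. It assumes "OK" is the only healthy value and that any grown defect is worth reporting, which may be stricter than you want. ATA drives print a differently worded health line, which this sketch does not handle.

```shell
#!/bin/sh
# Sketch only: reads `smartctl -a` output for a SAS/SCSI drive on stdin.
# Assumptions: "OK" is the only healthy health status, and any non-zero
# grown defect count should be reported.
check_smart_scsi() {
    awk -F': *' '
        { val = $2; sub(/[ \t\r]+$/, "", val) }
        /^SMART Health Status:/ {
            seen = 1
            if (val != "OK") { printf "health status: %s\n", val; bad = 1 }
        }
        /^Elements in grown defect list:/ {
            if (val + 0 > 0) { printf "grown defects: %s\n", val; bad = 1 }
        }
        END {
            if (!seen) { print "no SMART health status line found"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    '
}

# Usage (needs root; device and -d option as in the example above):
#   smartctl -a -dcciss,1 /dev/cciss/c0d0p1 | check_smart_scsi || echo "SMART problem"
```

Treating a missing health line as a failure means a drive that stops answering is reported rather than silently passing.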