
Quick-Take: ZFS and Early Disk Failure

September 17, 2010

Anyone who’s discussed storage with me knows that I “hate” desktop drives in storage arrays. When SAS disks are the standard, that’s typically a non-issue because the SAS world draws no real distinction between “desktop” and “server” disks. So you know I’m talking about the other “S” word – SATA. Here’s a tale of SATA woe that I’ve seen repeatedly cause problems for inexperienced ZFS’ers out there…

When volumes fail in ZFS, the “final” indicator is data corruption. Fortunately, ZFS checksums recognize corrupted data and can take action to correct and report the problem. But that’s like treating cancer only after you’ve experienced the symptoms. In fact, a failing disk will likely begin to “under-perform” well before actual “hard” errors show up as read, write or checksum errors in the ZFS pool. Depending on the reason for the “under-performance,” this can affect the performance of any controller, pool or enclosure that contains the disk.
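
For reference, a scrub is the way to make ZFS exercise that checksum machinery proactively instead of waiting for reads to stumble over bad blocks. A minimal sketch, where “tank” is a placeholder pool name:

    zpool scrub tank          # read and verify every allocated block in the pool
    zpool status -v tank      # shows scrub progress and anything it had to repair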

Wait – did he say enclosure? Sure. Just like a bad NIC chattering on a loaded network, a bad SATA device can occupy enough of the available service time of a controller or SAS bus (i.e. JBOD enclosure) to cause a noticeable performance drop in otherwise “unrelated” ZFS pools. Hence, detecting such events early is important. Here’s an example of an old WD SATA disk failing as viewed from the NexentaStor “Data Sets” GUI:

Disk Statistics showing failing drive

Something is wrong with device c5t84d0...

Device c5t84d0 is having some serious problems. Its busy time is 7x higher than its counterparts’, and its average service time is 14x higher. As a member of a RAIDz group, the entire group is being held back by this “under-performing” member. From this snapshot, it appears that NexentaStor is giving us good information about the disk in the “web GUI,” but that assumption would not be correct. In fact, the “web GUI” only reports “real time” data while the disk is under load. In the case of a lightly loaded zpool, the statistics may not be reported at all.

However, from the command shell, both historic and real-time per-device performance data are available. The output of “iostat -exn” shows, for each device, the count of all errors since the counters were last reset, along with its average I/O load:

Device statistics from 'iostat' show error and I/O history.
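
To watch just the suspect device in near real time, something along these lines works (the 5-second interval and the device filter are illustrative):

    iostat -exn 5 | egrep 'device|c5t84d0'
    # Watch the s/w, h/w, trn and tot error columns, plus asvc_t (average
    # service time) and %b (busy) relative to the disk's RAIDz peers.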

The output of iostat clearly shows this disk has serious hardware problems. It reports hardware errors as well as transport errors for the device recognized as ‘c5t84d0’, and the I/O statistics – chiefly read, write and average service time – implicate this disk as the performance problem for the associated RAIDz group. So, if the device is really failing, shouldn’t there be a log report of such an event? Yes, and here’s a snip from the message log showing the error:

SCSI error with ioc_status=0x8048 reported in /var/log/messages for the failing device.
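
Rather than scrolling the whole log, a quick filter on the device name or the ioc_status code narrows it down (log path and device name as in this example):

    egrep -i 'c5t84d0|ioc_status' /var/log/messages | tail -20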

However, in this case, the log is not “full” of messages of this sort. In fact, the error only showed up under the stress of an iozone benchmark (run from the NexentaStor ‘nmc’ console). I can (somewhat safely) conclude this to be a device failure rather than a model- or firmware-wide issue, since at least one other disk in this group is of the same make, model and firmware revision as the culprit and is not reporting errors. The interesting aspect of this “failure” is that it does not result in a read, write or checksum error for the associated zpool. Why? Because the device is only loosely coupled to the zpool as a constituent leaf device, and the clean pool also implies that the device errors were recovered by either the drive or the device driver (mapping around a bad/hard error).
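
Driver-level retries like these may also leave a trail in Solaris’ fault management framework even while the pool stays clean. Assuming the stock FMA tools are present on the appliance, a quick place to look:

    fmdump -e | tail -20      # recent error telemetry (ereports) logged by drivers
    fmadm faulty              # anything FMA has actually diagnosed as faulted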

Since these problems are being resolved at the device layer, the ZFS pool is “unaware” of the problem as you can see from the output of ‘zpool status’ for this volume:

zpool status output for pool with undetected failing device

Problems with disk device as yet undetected at the zpool layer.
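
The same check from the shell, with “tank” standing in for the real pool name, still comes back clean at this stage:

    zpool status -v tank      # per-vdev READ/WRITE/CKSUM counters, all zero here
    zpool status -x           # prints "all pools are healthy" until ZFS itself
                              # records an error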

This doesn’t mean that the “consumers” of the zpool’s resources are “unaware” of the problem: the disk error manifests itself in the zpool as higher delays, lower I/O throughput and, consequently, less pool bandwidth. In short, if the error is persistent under load, the drive has a correctable but catastrophic-to-performance problem and will need to be replaced. If, however, the error goes away, it is possible that the device driver has suitably corrected for the problem and the drive can stay in place.

SOLORI’s Take: How do we know if the drive needs to be replaced? Time will establish an error rate. In short, running the benchmark again and watching the error counters for the device will determine whether the problem persists. Eventually, the errors will either go away or they won’t. For me, I’m hoping that the disk fails so I have an excuse to replace the whole pool with a new set of SATA “eco/green” disks for more lab play. Stay tuned…
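
In practice that boils down to a before-and-after comparison around the benchmark run (device name and temp-file paths are illustrative):

    iostat -En c5t84d0 > /tmp/c5t84d0.before
    # ...re-run the iozone benchmark from nmc...
    iostat -En c5t84d0 > /tmp/c5t84d0.after
    diff /tmp/c5t84d0.before /tmp/c5t84d0.after
    # Growing "Hard Errors" or "Transport Errors" counts mean the problem persists.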

SOLORI’s Take: In all of its flavors – 1.5Gbps, 3Gbps and 6Gbps – I find SATA drives inferior to “similarly” spec’d SAS for just about everything. In my experience, the worst SAS drives I’ve ever used have been more reliable than most of the SATA drives I’ve used. That doesn’t mean there are “no” good SATA drives, but it does mean you really need to work within tighter boundaries when mixing vendors and models in SATA arrays. On top of that, the additional drive port and better typical sustained performance make SAS a clear winner over SATA (IMHO). The big exception to the rule is economy – especially where disk arrays are used for on-line backup – but that’s another discussion…

10 comments

  1. This is a very common problem for people using SATA drives (even the “enterprise” ones). We try to stay on top of it by pro-actively replacing disks when they start to show signs of failure or poor performance, but it’s still hard to determine the threshold. Like you said, if the workload is very low, our clients might not even notice it that much. And they often don’t, especially e-mail and web… the virtualization folks, on the other hand, won’t allow a single time-out to pass unnoticed 🙂



    • @giovanni – I actually considered taking the “anti-SATA rant” out of this Quick-Take, but instead moved it to the editorial comment at the end. I’ve always had a love-hate relationship with SATA because of its excellent density/cost ratio. Fortunately, SAS is catching SATA in cost and density, and with the better signaling, extra port and reliability, the few remaining dollars of difference are becoming harder and harder to justify… If you’re using SATA, I hope you’re running mirror groups and not RAIDz groups; and if RAIDz, I hope it’s RAIDz2 or better 🙂 These kinds of error modes are intrinsically worse on RAIDz types (see the layout sketch at the end of this comment).

      You raise a great point about the virtualization guys – especially since admins without cross-training in SAN and VM are apt to point fingers at the network or CPU when these conditions arise. Shared storage either lifts or sinks all boats in VM. It’s kind of a catch-22 for the guys “doing it on the cheap” who have chosen SATA over SAS on density/price and skimped by with the minimum number of RAID groups. Here’s where this type of “hidden” failure comes to bite them where it hurts 🙂 And faced with a resilver, their day is going to get much worse before it gets better…
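
      For what it’s worth, the difference in layout is just a matter of how the vdevs are declared at pool-creation time; device names below are placeholders:

        # Mirrored pairs: degrades most gracefully when a slow/flaky SATA disk appears.
        zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0
        # Double-parity alternative if capacity wins the argument.
        zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0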



  2. […] This post was mentioned on Twitter by Andy Leonard, Collin C MacMillan. Collin C MacMillan said: Quick post ZFS/NexentaStor and early drive failure/consequences: http://bit.ly/cVupy9 […]



  3. Curious if you tried using Nexenta’s AutoSMART plugin and whether it provided anything useful in the diagnosis.



  4. @Brad – although it’s of no help to CE users, it could be valuable to commercial users. After checking the plugins available to my test appliance, I could not find it in the available plugins list.

    After finding the instruction PDF for AutoSMART online, I can confirm that I’m running the required 3.0.3 (or better), so I’ve followed up with Nexenta in the interim. From the docs, it looks like this could plug some holes in early failure detection for both SAS/SCSI and SATA disks, since it supports both IE and SMART reporting.

    If I can get the plugin installed on my test mule, I’ll post a follow-up “take” on its operation…



  5. […] […]



  6. Very helpful article. I was also wondering if I may ask your opinion regarding high S/W errors and next to no H/W errors. We have had 7 failures occur in the same 3 drive bays, with noticeably high S/W errors:
    s/w    h/w  trn  tot    device   Raid Pool  Failed
    0      0    0    0      c2d0
    8088   8    0    8096   c0t72d0  CV_1       14jan, 5jan
    0      0    0    0      c0t73d0  CV_1
    0      0    0    0      c0t74d0  CV_1
    0      0    0    0      c0t75d0  CV_2
    0      0    0    0      c0t76d0  CV_2
    0      0    0    0      c0t77d0  CV_2
    0      0    0    0      c0t78d0  CV_3
    14184  0    0    14184  c0t79d0  CV_3       3Feb, 16jan, 17dec, 30nov
    0      8    0    8      c0t80d0  CV_3
    0      64   0    64     c0t81d0  CV_4
    0      0    0    0      c0t82d0  CV_4
    5472   17   0    5489   c0t83d0  CV_4       23nov

    Any help would be appreciated 🙂



    • Michael:

      Given the limited info and described background, software errors would normally point me towards disk firmware differences. Nexenta’s disk view makes it easy to check for those differences (a quick shell check follows below).

      Any other factors would require a bit more information about your pool makeup and controller topology. Also, by same slots, do you mean different disks in the same drive bays or always those specific disks?
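
      From the shell, something like this pairs each device with its vendor, product and firmware revision so mismatches stand out (illustrative, not the only way):

        iostat -En | egrep 'Soft Errors|Vendor'
        # Each device block reports Vendor, Product, Revision (firmware) and Serial No.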



  7. Could someone explain how to replace a failed RAIDz member in Nexenta 3.1.1.5?



    • On-line replacement is the preferred method. If your disk/controller supports hot swap (a lot of on-board controllers and non-RAID SATA controllers do not), it’s pretty straightforward.

      If the drive has faulted due to checksum errors, it can shorten the replace cycle if the old disk is still available during the replace. If your disk has completely failed, remove it, replace it and perform a lunsync. If you’re controller-mapped (i.e. SAS and not MPxIO) you may also need to on-line the drive. Next, schedule the new disk as a replacement for the failed one via NMC or NMV; then wait for the resilver to complete.

      Your disk topology, data bulk and raid type will dictate your rebuild time.
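
      From a root shell, the same flow looks roughly like this; pool and device names are placeholders, and NMC/NMV remains the supported path:

        zpool status -v tank          # identify the FAULTED/UNAVAIL member
        # physically swap the disk, run a lunsync from NMC, or rescan with:
        devfsadm -c disk
        zpool replace tank c0t79d0    # add the new device name as a second
                                      # argument if it differs from the old one
        zpool status tank             # watch the resilver progress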




