Quick-Take: ZFS and Early Disk FailureSeptember 17, 2010
Anyone who’s discussed storage with me knows that I “hate” desktop drives in storage arrays. When using SAS disks as a standard, that’s typically a non-issue because there’s not typically a distinction between “desktop” and “server” disks in the SAS world. Therefore, you know I’m talking about the other “S” word – SATA. Here’s a tale of SATA woe that I’ve seen repeatedly cause problems for inexperienced ZFS’ers out there…
When volumes fail in ZFS, the “final” indicator is data corruption. Fortunately, ZFS checksums recognize corrupted data and can take action to correct and report the problem. But that’s like treating cancer only after you’ve experienced the symptoms. In fact, the failing disk will likely begin to “under-perform” well before actual “hard” errors show-up as read, write or checksum errors in the ZFS pool. Depending on the reason for “under-performing” this can affect the performance of any controller, pool or enclosure that contains the disk.
Wait – did he say enclosure? Sure. Just like a bad NIC chattering on a loaded network, a bad SATA device can occupy enough of the available service time for a controller or SAS bus (i.e. JBOD enclosure) to make a noticeable performance drop in otherwise “unrelated” ZFS pools. Hence, detection of such events is an important thing. Here’s an example of an old WD SATA disk failing as viewed from the NexentaStor “Data Sets” GUI:
Device c5t84d0 is having some serious problems. Busy time is 7x higher than counterparts, and its average service time is 14x higher. As a member of a RAIDz group, the entire group is being held-back by this “under-performing” member. From this snapshot, it appears that NexentaStor is giving us some good information about the disk from the “web GUI” but this assumption would not be correct. In fact, the “web GUI” is only reporting “real time” data so long as the disk is under load. In the case of a lightly loaded zpool, the statistics may not even be reported.
However, from the command shell, historic and real-time access to per-device performance is available. The output of “iostat -exn” shows the count of all errors for devices since the last time counters were reset, and average I/O loads for each:
The output of iostat clearly shows this disk has serious hardware problems. It indicates hardware errors as well as transmission errors for the device recognized as ‘c5t84d0’ and the I/O statistics – chiefly read, write and average service time – implicate this disk as a performance problem for the associated RAIDz group. So, if the device is really failing, shouldn’t there be a log report of such an event? Yes, and here’s a snip from the message log showing the error:
However, in this case, the log is not “full” with messages of this sort. In fact, it only showed-up under the stress of an iozone benchmark (run from the NexentaStor ‘nmc’ console). I can (somewhat safely) conclude this to be a device failure since at least one other disk in this group is of the same make, model and firmware revision of the culprit. The interesting aspect about this “failure” is that it does not result in a read, write or checksum error for the associated zpool. Why? Because the device is only loosely coupled to the zpool as a constituent leaf device, and it also implies that the device errors were recoverable by either the drive or the device driver (mapping around a bad/hard error.)
Since these problems are being resolved at the device layer, the ZFS pool is “unaware” of the problem as you can see from the output of ‘zpool status’ for this volume:
This doesn’t mean that the “consumers” of the zpool’s resources are “unaware” of the problem, as the disk error has manifested itself in the zpool as higher delays, lower I/O through-put and subsequently less pool bandwidth. In short, if the error is persistent under load, the drive has a correctable but catastrophic (to performance) problem and will need to be replaced. If, however, the error goes away, it is possible that the device driver has suitably corrected for the problem and the drive can stay in place.
SOLORI’s Take: How do we know if the drive needs to be replaced? Time will establish an error rate. In short, running the benchmark again and watching the error counters for the device will determine if the problem persists. Eventually, the errors will either go away or they wont. For me, I’m hoping that the disk fails to give me an excuse to replace the whole pool with a new set of SATA “eco/green” disks for more lab play. Stay tuned…
SOLORI’s Take: In all of its flavors, 1.5Gbps, 3Gbps and 6Gbps, I find SATA drives inferior to “similarly” spec’d SAS for just about everything. In my experience, the worst SAS drives I’ve ever used have been more reliable than most of the SATA drives I’ve used. That doesn’t mean there are “no” good SATA drives, but it means that you really need to work within tighter boundaries when mixing vendors and models in SATA arrays. On top of that, the additional drive port and better typical sustained performance make SAS a clear winner over SATA (IMHO). The big exception to the rule is economy – especially where disk arrays are used for on-line backup – but that’s another discussion…