This is what you REALLY don't want to see when replacing a failed drive in a RAIDZ (RAID-5-like) array:
  pool: tank1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 28h8m, 100.00% done, 0h0m to go
        (note: 100% done doesn't mean done, I don't know who wrote the % done formula but it's useless)
config:

        NAME             STATE     READ WRITE CKSUM
        tank1            DEGRADED     0     0 6.32K
          raidz1-0       DEGRADED     0     0 12.8K
            c7t8d0       DEGRADED     0     0     0  too many errors
            c7t4d0       DEGRADED     0     0     0  too many errors
            c7t10d0      FAULTED     27   563     0  too many errors
            c7t11d0      DEGRADED     0     0     0  too many errors
            c7t12d0      DEGRADED     0     0     0  too many errors
            replacing-5  DEGRADED     0     0     0
              c7t13d0    FAULTED      4 9.60K     0  too many errors
              c7t3d0     ONLINE       0     0     0  384G resilvered

errors: Permanent errors have been detected in the following files:

        /tank1/sqlbackups/...masked..._200911240001.BAK
Mathematically speaking, this should be impossible: it was a simple drive replacement and resilver (a rebuild of the RAID).
For those who say RAID-5 provides sufficient parity and that recovery times aren't that bad, here's a good example of why you need RAID-6 or another RAID solution. This array had all drives working before c7t13d0 failed, and that drive was replaced immediately.
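To put a rough number on why single parity is thin during a rebuild, here is a back-of-the-envelope sketch in Python. It is not a diagnosis of what happened to this pool. It assumes an unrecoverable read error (URE) rate of 1 in 10^14 bits, a commonly quoted spec for consumer drives that I have not checked against these disks. It also assumes the resilver reads about 384G from each of the 5 surviving drives, treats G as 10^9 bytes, and treats errors as independent. The function name p_read_error is made up for this example.

```python
import math


def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one URE while reading bytes_read bytes.

    Assumes every bit fails independently with probability ure_per_bit.
    Computes 1 - (1 - p)^bits in a numerically stable way.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Assumed figures, loosely based on the status output above.
surviving_drives = 5
bytes_per_drive = 384e9  # "384G resilvered", treating G as 10^9 bytes

p = p_read_error(surviving_drives * bytes_per_drive)
print(f"Chance of at least one URE during the rebuild: {p:.1%}")
```

Under those assumptions the chance comes out at roughly 14%. With single parity and one drive already gone, a block hit by such an error has nothing left to reconstruct it from, whereas a second parity drive would still cover it.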
I would say the enclosure or controller has failed; however, sitting in front of the computer, I hear clicking like drives are going bad. Seems impossible.
Running zpool clear tank1 c7t10d0 (and the same for all the other devices) made the pool try to resilver again. I reduced the queue depth to lighten the load on the disks during this recovery. Unfortunately, the window for working on this server while it has very little load on it is almost over.
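For anyone who wants to script the per-device clear rather than type it six times, here is a minimal sketch in Python. The device list is my reading of which disks show "too many errors" in the status output above, so treat it as an assumption. The helper names clear_commands and run_all are made up for this example. It defaults to a dry run that only prints the commands.

```python
import subprocess


def clear_commands(pool, devices):
    """Build one 'zpool clear <pool> <device>' argument list per device."""
    return [["zpool", "clear", pool, dev] for dev in devices]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if zpool exits non-zero
            subprocess.run(argv, check=True)


# Assumed list: the devices flagged "too many errors" in the output above.
DEVICES = ["c7t8d0", "c7t4d0", "c7t10d0", "c7t11d0", "c7t12d0", "c7t13d0"]

commands = clear_commands("tank1", DEVICES)
run_all(commands)  # dry run: prints the commands without touching the pool
```

Pass dry_run=False to actually run the commands, which needs root on the machine that owns the pool. Running zpool clear tank1 with no device argument clears the error counts on every device in the pool in one go, which may be simpler if that is what you want.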
I'll update with another blog post once resolved.