Error recovery control

 
==Overview==
Modern [[hard drive]]s feature an ability to recover from some read/write errors by internally remapping [[Disk sector|sectors]] and performing other forms of self-test and recovery. This process can take several seconds or (under heavy usage) minutes, during which time the drive is unresponsive. Hardware RAID controllers and software RAID implementations are designed to recognise a drive that does not respond within a few seconds and mark it as unreliable, indicating that it should be withdrawn from use and the array rebuilt from [[Parity bit#Parity block|parity data]]. Rebuilding is a long process that degrades performance, and if more drives fail under the resulting additional workload, the result may be catastrophic.
 
If the drive itself is inherently reliable but has some bad sectors, then TLER and similar features prevent a disk from being unnecessarily marked as 'failed' by limiting the time spent on correcting detected errors before advising the array controller of a failed operation. The array controller can then handle the data recovery for the limited amount involved, rather than marking the entire drive as faulty.
 
==smartctl utility==
The {{Mono|smartctl}} utility (part of the smartmontools package) can be used<ref>[http://cgi.csc.liv.ac.uk/~greg/projects/erc/ Author's description of the original patch to smartctl that implemented that feature]</ref> on hard disk drives that fully implement the ATA-8<ref>[http://www.t13.org/documents/UploadedDocuments/docs2007/D1699r4a-ATA8-ACS.pdf AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS)]</ref> standard to control the TLER behavior by setting the SCT Error Recovery Control (scterc) parameter.
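
As an illustration, the following Python sketch builds a {{Mono|smartctl}} invocation that sets the scterc read and write limits. It relies on the {{Mono|-l scterc,READTIME,WRITETIME}} option of {{Mono|smartctl}}, which takes its values in tenths of a second. The helper names, the device path and the 7-second defaults are assumptions chosen for the example, not values taken from any standard.

```python
import subprocess


def scterc_command(device, read_seconds, write_seconds):
    """Build the smartctl argument list that sets SCT ERC limits.

    Hypothetical helper for illustration; smartctl expects the read and
    write limits in units of 100 ms (tenths of a second).
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    if read_ds < 0 or write_ds < 0:
        raise ValueError("timeouts must be non-negative")
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def set_scterc(device, read_seconds=7.0, write_seconds=7.0):
    """Run smartctl with the given limits (normally needs root privileges).

    The 7-second defaults are an assumption for this sketch.
    """
    cmd = scterc_command(device, read_seconds, write_seconds)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Calling {{Mono|scterc_command("/dev/sda", 7.0, 7.0)}} yields the arguments for {{Mono|smartctl -l scterc,70,70 /dev/sda}}. Whether the drive accepts the setting depends on its firmware.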
 
 
==Software RAID==
Linux [[mdadm]] simply holds and lets the drive complete its recovery. However, the default command timeout for the SCSI disk layer (/sys/block/sd?/device/timeout) is 30 seconds,<ref>{{cite web|url=https://github.com/torvalds/linux/blob/master/drivers/scsi/sd.h#LC11|title=linux/sd.h at master · torvalds/linux · GitHub|work=GitHub}}</ref> after which the kernel attempts to reset the drive and, if that fails, puts the drive offline.<ref>{{cite web|url=https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/scsi/scsi_eh.txt|title=kernel/git/torvalds/linux.git - Linux kernel source tree|work=kernel.org}}</ref>
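
The timeout mentioned above can be inspected or changed through the sysfs file. The following is a minimal Python sketch, assuming a Linux system and a disk name such as {{Mono|sda}}. The function names and the {{Mono|sysfs_root}} parameter are hypothetical conveniences for the example, and writing the file normally requires root privileges.

```python
from pathlib import Path


def timeout_path(disk, sysfs_root="/sys"):
    """Return the sysfs path of the SCSI command timeout for a disk, e.g. 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "timeout"


def read_command_timeout(disk, sysfs_root="/sys"):
    """Read the current command timeout in seconds."""
    return int(timeout_path(disk, sysfs_root).read_text().strip())


def write_command_timeout(disk, seconds, sysfs_root="/sys"):
    """Set the command timeout in seconds (needs write access to sysfs)."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(disk, sysfs_root).write_text(f"{int(seconds)}\n")
```

Raising the value gives a drive with no limit on its internal recovery more time to finish before the kernel intervenes. The change made this way is not persistent across reboots.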
 
==References==