Thank you, Steven.
I see your point. Implementing automated drive removal in the pool based on outlier detection can cause more harm than good when applied too aggressively. I would think that in many cases, human intervention based on experience and brainpower is the better option.
In our case, an SSD cache drive that suddenly shows 1000 times more latency could be considered failed. The pool can live without one cache disk and redistribute the load across the remaining cache disks. I'd much rather take the hit of an asymmetrical cache than have a 1000 ms drive as an active part of the pool. That might be an easy one to implement with automated removal, provided you limit it to 1 or 2 cache drives max.
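Just to illustrate the kind of guardrail I have in mind, here's a rough sketch in Python. The drive names, threshold and removal cap are all made up by me; this is not how S2D actually decides, just the shape of the logic:

    # Sketch only: hypothetical latency readings per cache drive (ms)
    # and a hard cap on how many drives automation may ever evict.
    from statistics import median

    MAX_AUTO_REMOVALS = 2    # beyond this, wake up a human instead
    OUTLIER_FACTOR = 1000    # e.g. 1000x the median latency is suspect

    def drives_to_retire(latencies_ms, already_removed):
        """latencies_ms: dict of drive id -> observed latency in ms."""
        baseline = median(latencies_ms.values())
        suspects = [d for d, lat in latencies_ms.items()
                    if lat > baseline * OUTLIER_FACTOR]
        budget = MAX_AUTO_REMOVALS - already_removed
        # Retire at most 'budget' drives automatically.
        return suspects[:max(budget, 0)]

    # Example: one drive at 1500 ms while its peers sit around 1 ms.
    print(drives_to_retire({"ssd0": 1.1, "ssd1": 0.9, "ssd2": 1500.0},
                           already_removed=0))

The point of the cap is that one or two clear outliers look like failing hardware, while anything beyond that smells like a systemic problem that automation should not try to fix on its own.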
But what if the cache drive still has queued write I/O? If that isn't handled properly, couldn't it mean data corruption?
Capacity drives hold data blocks, and removing them automatically kicks off repair jobs that take a long time and add to the I/O load on a system that may already be under stress due to the detected high-latency capacity disk.
It's good to hear you are considering this form of automation carefully. As admins, we need to be aware of it and know what the best course of action is. Sometimes that means taking action, and that can be difficult: balancing between waiting for the system to heal itself (avoiding bigger issues, but putting stress on SLAs) and intervening by removing hardware to get the train going again. Knowing how S2D pools work is vital for that.
Your last comment on disk warranty and history is also interesting. Let me share some more facts about the specific Intel SSD issue we have now discovered.
Dell released a firmware update for this drive on September 30th. Here's the link.
https://www.dell.com/support/home/nl-nl/drivers/driversdetails?driverid=2674v
The release notes state the drive can stop responding after a soft reboot following a Windows install. That's rather vague if you ask me. Could a monthly CU update count as a 'Windows install'? Because that is what happened in our case.
The drive might not be broken after all, but simply hitting this firmware bug, as we weren't running the September 30th release yet. We have filed a request with Dell for this situation, and they're having a look at the storage logs etc.
If we were to update the firmware and re-insert the drive, all could be well. But like you said, how do you track the history of such a disk for potential future failures? What if it's not related to the firmware after all, and 188 days later the same thing happens? I won't remember in 188 days what happened to this disk; I can hardly remember what happened last week.
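If I had to solve the remembering part myself, I'd probably keep a small incident log keyed on the drive's serial number. A quick sketch of what I mean (the file name, field names and serial are all made up):

    # Sketch only: append each disk incident with serial and firmware
    # revision, so a repeat failure months later can be matched to history.
    import json, datetime

    HISTORY_FILE = "disk-incidents.json"   # hypothetical location

    def record_incident(serial, firmware, note):
        try:
            with open(HISTORY_FILE) as f:
                history = json.load(f)
        except FileNotFoundError:
            history = []
        history.append({
            "serial": serial,
            "firmware": firmware,
            "date": datetime.date.today().isoformat(),
            "note": note,
        })
        with open(HISTORY_FILE, "w") as f:
            json.dump(history, f, indent=2)

    # Example entry for the Intel SSD in question (made-up serial).
    record_incident("PHYS1234567890", "pre-2674v",
                    "stopped responding after CU soft reboot")

Then in 188 days, a lookup on the serial would at least tell me this drive already misbehaved once, and on which firmware.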
Again, many thanks for getting back to me on this.
Greetings, Richard