Storage Spaces Direct single point of failure in 1 SSD

Richard Willkomm 1 Reputation point
2020-10-20T08:20:03.42+00:00

Last weekend we had a serious issue with one of our Windows Server 2019 Hyper-V HCI clusters.
It turns out a single SSD cache disk caused the entire cluster to basically grind to a halt. S2D is built with fault tolerance in mind: disks can fail, even multiple at once; nodes can fail; networks can fail. It should be able to take hits and keep running. Well, it turns out one hit is enough if it lands in the right place.

Here's what we have

  • 6-node Dell Windows Server 2019 HCI cluster running Hyper-V and S2D
  • Each node with a mix of SSD (cache) and HDD (capacity): 5 SSD + 15 HDD
  • 2x10Gbit dedicated RDMA network for S2D and 2x10Gbit dedicated for VMs and management

What happened?
During planned maintenance, while installing Windows updates, one of the nodes failed after its reboot. Or rather, one of the SSDs in this node failed. This SSD showed huge latency (1000ms) in Windows Admin Center from the moment the node was rebooted, which we only discovered in WAC after a while. This overloaded the entire storage layer of that node, and that in turn impacted the entire S2D pool and cluster.

The pool stayed online, including the virtual disks, but those also showed latency in the 500-750ms range, where they are usually below 1ms. A node reboot always triggers S2D repair jobs (expected behavior), but these had trouble finishing, again because of the huge latency.

Network issues were ruled out. Connectivity was fine and the 10G ports were at 20% usage at most, which is very low compared to normal.
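For anyone hitting something similar: the slow drive and the stuck repair jobs can both be spotted from PowerShell on any cluster node. A minimal sketch using the standard Storage module cmdlets (the columns shown are illustrative, not an official health check):

```powershell
# List physical disks with their reliability counters, sorted by worst read latency.
# ReadLatencyMax / WriteLatencyMax are the maximum latencies the drive has reported.
Get-PhysicalDisk |
    Get-StorageReliabilityCounter |
    Sort-Object ReadLatencyMax -Descending |
    Select-Object DeviceId, ReadLatencyMax, WriteLatencyMax, ReadErrorsTotal, Wear |
    Format-Table -AutoSize

# Check the S2D repair/rebalance jobs that run after a node reboot.
Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesTotal
```

In a situation like ours, one drive would stand out with latency orders of magnitude above its peers, while the repair jobs sit at the same percentage for hours.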

We first tried to retire the failed SSD with PowerShell, but that did not help. In the end we had to physically pull the SSD from the server to solve the issue. The high latency was gone immediately, VMs came back online, and the S2D jobs finished quickly.
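For reference, this is roughly the retire sequence we tried; a hedged sketch, assuming the failing drive can be identified by serial number (the serial below is a placeholder):

```powershell
# Identify the suspect drive (serial number is a placeholder - use your own).
$disk = Get-PhysicalDisk -SerialNumber "PHYS1234567890"

# Mark it retired so S2D stops allocating new data to it...
Set-PhysicalDisk -InputObject $disk -Usage Retired

# ...and kick off repair of the virtual disks so data is rebuilt elsewhere.
Get-VirtualDisk | Repair-VirtualDisk -AsJob
```

In our case the retired drive apparently kept answering I/O (very slowly), so only physically pulling it actually cleared the latency.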

How can a single SSD cause an entire cluster to fail?
An issue with a single disk I get, and I can see that same issue impacting the entire node too. But why does it impact the entire cluster? Is the only option here to retire the disk (which we did), or to retire (power off) the node, which can be done remotely?
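On the node-level option: instead of powering the node off, a gentler route is to drain it and put its storage into maintenance mode. A sketch using the standard failover-cluster and S2D cmdlets (the node name is a placeholder):

```powershell
# Drain cluster roles off the node and pause it.
Suspend-ClusterNode -Name "Node01" -Drain

# Put the node's entire storage fault domain into maintenance mode,
# so its disks (including the misbehaving SSD) stop serving the pool.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -Eq "Node01" |
    Enable-StorageMaintenanceMode
```

Afterwards, `Disable-StorageMaintenanceMode` and `Resume-ClusterNode` bring the node back and trigger the usual repair jobs.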

Anyone run into similar issues?
Thanks in advance


8 answers

  1. Richard Willkomm 1 Reputation point
    2020-11-11T11:26:12.083+00:00

    Thank you Steven.

    I see your point. Implementing automated drive removal from the pool based on outlier detection can cause more harm than good when done too aggressively. I would think that in many cases, human intervention based on experience and judgment is the better option.

    In our case, an SSD cache drive with a sudden 1000x latency increase could be considered failed. The pool can live without one cache disk and redistribute the load across the other cache disks. I'd much rather take the hit of an asymmetrical cache than have a 1000ms drive as an active part of the pool. That might be an easy case for automated removal, if you were to limit it to one or two cache drives at most.

    But what if the cache drive still has queued write I/O? If that's not handled properly, it could mean data corruption, right?

    Capacity drives hold data blocks, and removing them automatically means kicking off repair jobs, which take a long time and add I/O load to a system that is possibly already stressed by the very high-latency disk that was detected.

    It's good to hear you are considering this form of automation carefully. For us as admins, we need to be aware of it and know what the best course of action is. And sometimes that means taking action, which can be difficult: balancing between waiting for the system to heal itself (avoiding bigger issues, but putting stress on SLAs) and removing things to get the train going again. Knowing how S2D pools work is vital for that.

    Your last comment on disk warranty and history is also interesting. Let me share some more facts about the specific Intel SSD we have now discovered.

    Dell released a firmware update for this drive on September 30th. Here's the link.
    https://www.dell.com/support/home/nl-nl/drivers/driversdetails?driverid=2674v

    The release notes state the drive can stop responding after a soft reboot following a Windows install. This is rather vague if you ask me. Could a monthly CU update count as a 'Windows install'? Because that is what happened in our case.

    The drive might not be broken after all, but simply hitting this firmware bug, as we weren't running that September 30th release yet. We have filed a request with Dell for this situation and they're having a look at the storage logs etc.

    If we were to update the firmware and re-insert the drive, all could be well. But like you said, how do you track the history of such a disk for potential future failures? What if it's not related to the firmware after all, and 188 days later the same thing happens? I wouldn't remember in 188 days what happened to this disk; I can hardly remember what happened last week.

    Again, many thanks for getting back on this.

    Greetings, Richard


  2. Richard Willkomm 1 Reputation point
    2020-11-16T15:26:26.72+00:00

    I have a small update.

    Dell investigated the storage logs but found nothing. They also specifically state that the SSD itself does not store any logs during its lifetime.

    The S2D cluster that was impacted has been put into maintenance in the meantime. All VM workloads have been migrated off to other clusters, so I'm free to do with it as I want.
    I will first re-insert the 'failed' drive and see what happens. If it's really broken, it should pull down the cluster just like before. I can then also try the firmware update to see if that helps, if I can get it onto the drive at all.

    If it doesn't fail and take down the storage, that would be a problem too: either it's another issue, or it doesn't happen every time, or only in a specific situation.

    Keep you posted.


  3. Richard Willkomm 1 Reputation point
    2020-12-02T14:14:27+00:00

    Turned out the SSD was indeed faulty. It kept running at around 1000ms response time; the updates didn't matter. The drive has been replaced and the cluster is working again.

    Good lesson learned: one drive can cause a lot of issues. Best to remove it as quickly as possible, or take down the node that contains the drive. That way, the rest of the cluster is free to carry on.

    Grtz

