Pool Nodes Stuck in Unusable State

Angelina Souy 0 Reputation points
2025-05-28T14:11:59.25+00:00

Since this morning, my pool nodes have been stuck in the "Unusable" state, despite no recent changes on my end.

I don't see any error that would help me understand where the problem comes from.

I attempted to delete and recreate the pool, but the issue persists. The nodes take a long time to start and then transition back to the "Unusable" state.
I also tried recreating the pool without any start task or attached subnet, and it still fails.

Could you please advise?

Azure Batch

2 answers

  1. Dharani Reguri 1,250 Reputation points Microsoft External Staff Moderator
    2025-05-28T15:18:10.8+00:00

    Hi Angelina Souy,

    If a node is in the Unusable state but has no computeNodeError, it means Batch is unable to communicate with the VM. In this case, Batch always tries to recover the VM. However, Batch doesn't automatically attempt to recover VMs that failed to install application packages or containers, even if their state is Unusable.
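    A quick way to confirm whether your nodes carry any computeNodeError is to list them and print their errors. Below is a minimal sketch using the azure-batch Python data-plane SDK; the account name, key, endpoint, and pool ID are placeholders for your own values.

    ```python
    # Minimal sketch: list the nodes in a pool and print any computeNodeError
    # entries, assuming the azure-batch data-plane SDK with shared-key auth.
    # The account name, key, endpoint, and pool ID are placeholders.
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
    client = BatchServiceClient(
        credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
    )

    for node in client.compute_node.list("mypool"):
        print(f"{node.id}: {node.state}")
        # An Unusable node with no errors usually means Batch cannot reach the VM;
        # failed start tasks or application packages show up here with error codes.
        for err in node.errors or []:
            print(f"  {err.code}: {err.message}")
    ```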

    To debug the issue, could you please share the information below:

    • Is the issue limited to one pool, or does it affect multiple pools in the Batch account?
    • What VM image reference are you using?
    • Are you using a custom image or a Shared Image Gallery (SIG) image?
    • What is the current quota for your Batch account in the affected region?

    Please review the documentation on nodes in the Unusable state and the troubleshooting guide for when an Azure Batch node gets stuck in the Unusable state.

    Thank you.


  2. Arko 4,060 Reputation points Microsoft External Staff Moderator
    2025-06-04T08:40:56.49+00:00

    Hello Angelina Souy,

    Thank you for the update. I’m glad to hear that creating a new Batch account in another region resolved the issue and that nodes are now provisioning correctly there.

    This confirmation further supports the earlier analysis that the root cause is not related to your configuration or quota limits but rather points to a platform-level fault that is scoped to the original Batch account or its deployment fabric in the West Europe region.

    As mentioned earlier, your per-series quotas for F2s and F4s were correctly configured, and no recent changes were made to your pool setup, start task, image reference, or networking configuration. The autoStorage.lastKeySync logs were inconsequential, and no Azure Health advisories indicated issues in West Europe. Given this, it’s highly likely that a backend issue such as an image provisioning regression, node agent mismatch, or a fabric-specific allocation fault is causing new nodes to enter the "Unusable" state in your original Batch account.

    Your test with a fresh account in a new region provides a valuable reference point to help isolate the issue further.

    In the meantime, if your production workflows allow, using the newly created Batch account as a temporary workaround is a good path forward. As discussed in private message, I’ll keep you posted.

    Update:

    The issue was caused by Azure Batch's migration from classic node communication mode to simplified node communication mode, with classic mode being retired on March 31, 2026. Following this change and upon investigation, the Microsoft support team found that the customer's node management access was set to "Deny". This configuration was contributing to the nodes entering the Unusable state.

    After the customer removed the access rule restriction, the nodes started working again; the same configuration had previously worked under classic communication mode.
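    For readers hitting the same problem: under simplified node communication mode, pool nodes need outbound HTTPS access to the BatchNodeManagement.<region> service tag, so a blocking rule in the subnet's NSG has to be replaced with an allow rule. The sketch below shows one way to add such a rule with the azure-mgmt-network Python SDK; the resource group, NSG name, region, and priority are hypothetical and should be adapted to your environment.

    ```python
    # Minimal sketch: add an outbound allow rule to the Batch node management
    # service tag, assuming simplified node communication mode and an NSG on
    # the pool's subnet. Resource group, NSG name, region, and priority are
    # hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient
    from azure.mgmt.network.models import SecurityRule

    network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

    rule = SecurityRule(
        name="AllowBatchNodeManagementOutbound",
        priority=200,
        direction="Outbound",
        access="Allow",                 # the blocking rule in this case was effectively "Deny"
        protocol="Tcp",
        source_address_prefix="*",
        source_port_range="*",
        destination_address_prefix="BatchNodeManagement.WestEurope",  # service tag
        destination_port_range="443",
    )

    network_client.security_rules.begin_create_or_update(
        "my-resource-group", "my-nsg", rule.name, rule
    ).result()
    ```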

    The issue is now solved.

