Pool Nodes Stuck in Unusable State

Angelina Souy 0 Reputation points
2025-05-28T14:11:59.25+00:00

Since this morning, my pool nodes have been stuck in the "Unusable" state, despite no recent changes on my end.

I don't see any error that would help me understand where the problem comes from.

I attempted to delete and recreate the pool, but the issue persists. The nodes take a long time to start and then transition back to the "Unusable" state.
I also tried recreating the pool without any start task or attached subnet, and it still fails.

Could you please advise?

Azure Batch

2 answers

  1. Dharani Reguri 1,250 Reputation points Microsoft External Staff Moderator
    2025-05-28T15:18:10.8+00:00

    Hi Angelina Souy,

    If a node is in the Unusable state but has no computeNodeError, it means Batch is unable to communicate with the VM. In this case, Batch always tries to recover the VM. However, Batch doesn't automatically attempt to recover VMs that failed to install application packages or containers, even if their state is Unusable.
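    A quick way to confirm whether your nodes carry any computeNodeError is to list them and print their errors. Below is a minimal sketch using the azure-batch Python data-plane SDK; the account name, key, endpoint, and pool ID are placeholders for your own values.

    ```python
    # Minimal sketch: list the nodes in a pool and print any computeNodeError
    # entries, assuming the azure-batch data-plane SDK with shared-key auth.
    # The account name, key, endpoint, and pool ID are placeholders.
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
    client = BatchServiceClient(
        credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
    )

    for node in client.compute_node.list("mypool"):
        print(f"{node.id}: {node.state}")
        # An Unusable node with no errors usually means Batch cannot reach the VM;
        # failed start tasks or application packages show up here with error codes.
        for err in node.errors or []:
            print(f"  {err.code}: {err.message}")
    ```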

    To debug the issue, could you please share the information below:

    • Is the issue limited to one pool, or does it affect multiple pools in the Batch account?
    • What VM image reference are you using?
    • Are you using a custom image or a Shared Image Gallery (SIG) image?
    • What is the current quota for your Batch account in the affected region?

    Please review the documentation on nodes in the Unusable state and the troubleshooting guide for when an Azure Batch node gets stuck in the Unusable state.

    Thank you.


  2. Arko 4,060 Reputation points Microsoft External Staff Moderator
    2025-06-04T08:40:56.49+00:00

    Hello Angelina Souy,

    Thank you for the update. I’m glad to hear that creating a new Batch account in another region resolved the issue and that nodes are now provisioning correctly there.

    This confirmation further supports the earlier analysis that the root cause is not related to your configuration or quota limits but rather points to a platform-level fault that is scoped to the original Batch account or its deployment fabric in the West Europe region.

    As mentioned earlier, your per-series quotas for F2s and F4s were correctly configured, and no recent changes were made to your pool setup, start task, image reference, or networking configuration. The autoStorage.lastKeySync logs were inconsequential, and no Azure Health advisories indicated issues in West Europe. Given this, it’s highly likely that a backend issue such as an image provisioning regression, node agent mismatch, or a fabric-specific allocation fault is causing new nodes to enter the "Unusable" state in your original Batch account.

    Your test with a fresh account in a new region provides a valuable reference point to help isolate the issue further.

    In the meantime, if your production workflows allow, using the newly created Batch account as a temporary workaround is a good path forward. As discussed in private message, I’ll keep you posted.

    Update:

    The issue was caused by Azure Batch's migration from classic node communication mode to simplified node communication mode, with classic mode being retired on March 31, 2026. Following this change and upon investigation, the Microsoft support team found that the customer's node management access was set to "Deny". This configuration was contributing to the nodes entering the Unusable state.

    After the customer removed the access rule restriction, the nodes started working again; the same configuration had previously worked under classic communication mode.
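    For readers hitting the same problem: under simplified node communication mode, pool nodes need outbound HTTPS access to the BatchNodeManagement.<region> service tag, so a blocking rule in the subnet's NSG has to be replaced with an allow rule. The sketch below shows one way to add such a rule with the azure-mgmt-network Python SDK; the resource group, NSG name, region, and priority are hypothetical and should be adapted to your environment.

    ```python
    # Minimal sketch: add an outbound allow rule to the Batch node management
    # service tag, assuming simplified node communication mode and an NSG on
    # the pool's subnet. Resource group, NSG name, region, and priority are
    # hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient
    from azure.mgmt.network.models import SecurityRule

    network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

    rule = SecurityRule(
        name="AllowBatchNodeManagementOutbound",
        priority=200,
        direction="Outbound",
        access="Allow",                 # the blocking rule in this case was effectively "Deny"
        protocol="Tcp",
        source_address_prefix="*",
        source_port_range="*",
        destination_address_prefix="BatchNodeManagement.WestEurope",  # service tag
        destination_port_range="443",
    )

    network_client.security_rules.begin_create_or_update(
        "my-resource-group", "my-nsg", rule.name, rule
    ).result()
    ```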

    The issue is now solved.

