Tasks fail to detect GPU on some Pool nodes due to early startup race condition

JeffreyCMI 6 Reputation points
2025-06-18T20:33:27.05+00:00

We’re running container tasks on a GPU-enabled Azure Batch pool (Docker-based, Standard_NC4as_T4_v3). All nodes are configured identically, and tasks are configured identically, running the same task in the same Docker image and entrypoint.

Despite this, the entire first round of tasks on certain pool nodes intermittently fails with:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Investigation shows:

  • Only ~10% of new nodes are affected. The first round of tasks on other (equivalent) nodes does find the GPU, logging: Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3072 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5
  • Affected tasks are the first tasks to run on the VM after provisioning. Later Batch tasks on the same node succeed and detect the GPU correctly.
  • nvidia-smi works on the host of an affected node: the GPU appears idle and no processes are listed as using it. By contrast, on a node where a task has found the GPU, nvidia-smi shows GPU usage and lists the running processes.
  • No node configuration drift: By the time I can SSH into the VM (some minutes after task start), docker info, nvidia-container-cli info, /etc/docker/daemon.json, and container runtime versions are consistent across nodes (good and bad).
  • No task configuration drift: same containerRunOptions (I am not using --gpus all), image name and tag, entrypoint.
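
Because I can only SSH in some minutes after the task starts, the state at the moment of failure is never observed directly. A minimal sketch of a snapshot that could be printed at the very start of a task is below. The function name gpu_snapshot and the choice of checks (device nodes under /dev/nvidia* and the output of nvidia-smi -L) are assumptions for illustration, not something we currently run, and nvidia-smi may not be on the PATH inside every container.

```python
import glob
import json
import shutil
import subprocess
import time


def gpu_snapshot():
    """Collect a small, timestamped view of GPU visibility from inside the task.

    Hypothetical helper: the fields recorded here are an assumption about
    what is useful to compare between good and bad nodes.
    """
    snapshot = {
        "timestamp": time.time(),
        # Device nodes that are present once the NVIDIA driver is loaded
        # and the devices have been passed through to this environment.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "nvidia_smi_returncode": None,
        "nvidia_smi_output": "",
    }
    if snapshot["nvidia_smi_found"]:
        try:
            result = subprocess.run(
                ["nvidia-smi", "-L"],  # -L lists the GPUs the driver can see
                capture_output=True,
                text=True,
                timeout=30,
            )
            snapshot["nvidia_smi_returncode"] = result.returncode
            snapshot["nvidia_smi_output"] = (result.stdout + result.stderr).strip()
        except (subprocess.TimeoutExpired, OSError) as exc:
            snapshot["nvidia_smi_output"] = f"probe failed: {exc}"
    return snapshot


if __name__ == "__main__":
    # Print as JSON so the snapshot lands in the task's stdout log.
    print(json.dumps(gpu_snapshot(), indent=2))
```

Comparing this output between the first task on an affected node and a later task on the same node would show whether the device nodes were missing at container start or only CUDA initialization failed.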

It seems to me there is a race condition at node startup, where the GPU driver or NVIDIA container runtime stack is not fully initialized when the first task is scheduled. This feels very much within Azure's responsibility and out of my control.
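
If the race hypothesis is right, one possible workaround is to delay work until the GPU is visible. The following is only a sketch under that assumption: wait_for_gpu and nvidia_smi_sees_gpu are hypothetical names, the 300 second timeout and 5 second interval are arbitrary, and I have not verified that this avoids the failure.

```python
import subprocess
import time


def nvidia_smi_sees_gpu():
    """Return True if `nvidia-smi -L` exits cleanly and lists at least one GPU."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # nvidia-smi missing, not executable, or hanging: treat as "not ready".
        return False
    return result.returncode == 0 and "GPU " in result.stdout


def wait_for_gpu(probe=nvidia_smi_sees_gpu, timeout_s=300.0, interval_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A wrapper script could call wait_for_gpu() and exit non-zero when it returns False. Running it on the host as the pool start task, with waitForSuccess enabled so that Batch holds tasks back until it succeeds, seems the more plausible placement: if the container is created before the GPU devices are available, waiting inside the container may not help.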

We've encountered the bug intermittently since we upgraded our Batch pool image from

    publisher = "microsoft-azure-batch"
    offer     = "ubuntu-server-container-rdma"
    sku       = "20-04-lts"
    version   = "latest"

to

    publisher = "microsoft-dsvm"
    offer     = "ubuntu-hpc"
    sku       = "2204"
    version   = "latest"
Azure Batch