Scale Ray clusters on Vertex AI

Ray clusters on Vertex AI offer two scaling options: autoscaling and manual scaling. Autoscaling lets the cluster automatically adjust the number of worker nodes based on the resources that Ray tasks and actors require. If you run a heavy workload and are unsure of the resources needed, autoscaling is recommended. Manual scaling gives you more granular control over the number of nodes.

Autoscaling can reduce workload costs, but it adds node launch overhead and can be tricky to configure. If you are new to Ray, start with non-autoscaling clusters and use the manual scaling feature.

Autoscaling

Enable a Ray cluster's autoscaling feature by specifying the minimum replica count (min_replica_count) and maximum replica count (max_replica_count) of a worker pool.

Note the following:

  • Configure the autoscaling specification of all worker pools.
  • Custom upscaling and downscaling speed is not supported. For default values, see Upscaling and downscaling speed in the Ray documentation.

Set worker pool autoscaling specification

Use the Google Cloud console or Vertex AI SDK for Python to enable a Ray cluster's autoscaling feature.

Ray on Vertex AI SDK

from google.cloud import aiplatform
import vertex_ray
from vertex_ray import AutoscalingSpec, Resources

autoscaling_spec = AutoscalingSpec(
    min_replica_count=1,
    max_replica_count=3,
)

head_node_type = Resources(
    machine_type="n1-standard-16",
    node_count=1,
)

worker_node_types = [Resources(
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    autoscaling_spec=autoscaling_spec,
)]

# Create the Ray cluster on Vertex AI
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
    head_node_type=head_node_type,
    worker_node_types=worker_node_types,
    ...
)
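
After create_ray_cluster returns, you can fetch the cluster and confirm that each worker pool carries the configuration you set. A minimal sketch that reuses CLUSTER_RESOURCE_NAME from the previous step:

# Fetch the cluster and inspect each worker pool's configuration.
cluster = vertex_ray.get_ray_cluster(CLUSTER_RESOURCE_NAME)

for pool in cluster.worker_node_types:
    # node_count is the current pool size; with autoscaling enabled, it
    # stays between min_replica_count and max_replica_count.
    print(pool.machine_type, pool.node_count)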

Console

Following the OSS Ray best practice recommendation, the logical CPU count on the Ray head node is forced to 0 so that no workloads run on the head node.

  1. In the Google Cloud console, go to the Ray on Vertex AI page.

    Go to the Ray on Vertex AI page

  2. Click Create cluster to open the Create cluster panel.

  3. For each step in the Create cluster panel, review or replace the default cluster information. Click Continue to complete each step:

    1. For Name and region, specify a Name and choose a ___location for your cluster.
    2. For Compute settings, specify the configuration of the Ray cluster on the head node, including its machine type, accelerator type and count, disk type and size, and replica count. Optionally, add a custom image URI to specify a custom container image to add Python dependencies not provided by the default container image. See Custom image.

      Under Advanced options, you can:

      • Specify your own encryption key.
      • Specify a custom service account.
      • If you don't need to monitor the resource statistics of your workload during training, disable the metrics collection.
    3. To create a cluster with an autoscaling worker pool, provide a value for the worker pool's maximum replica count.

      Compute settings for autoscaling

  4. Click Create.

Manual scaling

As your workloads surge or decrease on your Ray clusters on Vertex AI, manually scale the number of replicas to match demand. For example, if you have excess capacity, scale down your worker pools to save costs.

Limitations

When you scale clusters, you can change only the number of replicas in your existing worker pools. For example, you can't add or remove worker pools from your cluster or change the machine type of your worker pools. Also, the number of replicas for your worker pools can't be lower than one.

If you use a VPC peering connection to connect to your clusters, there's a limit on the maximum number of nodes, and the limit depends on the number of nodes the cluster had when you created it. For more information, see Maximum number of nodes calculation. This maximum includes not just your worker pools but also your head node. If you use the default network configuration, the number of nodes can't exceed the upper limits described in the create clusters documentation.

Maximum number of nodes calculation

If you use private services access (VPC peering) to connect to your nodes, use the following formulas to check that you don't exceed the maximum number of nodes (M), where f(x) = min(29, 32 - ceiling(log2(x))):

  • f(2 * M) = f(2 * N)
  • f(64 * M) = f(64 * N)
  • f(max(32, 16 + M)) = f(max(32, 16 + N))

The maximum total number of nodes in the Ray on Vertex AI cluster you can scale up to (M) depends on the initial total number of nodes you set up (N). After you create the Ray on Vertex AI cluster, you can scale the total number of nodes to any amount between P and M inclusive, where P is the number of pools in your cluster.

The initial total number of nodes in the cluster and the scale-up target must fall within the same color block in the following diagram.

Diagram showing the relationship between initial and scaled node counts
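
To make the check concrete, the following sketch evaluates the three conditions for a candidate scale-up target M against the initial node count N. The helper names are illustrative, not part of the SDK:

import math

def f(x):
    # f(x) = min(29, 32 - ceiling(log2(x))), as defined above.
    return min(29, 32 - math.ceil(math.log2(x)))

def can_scale_to(m, n):
    # True if scaling from an initial total of n nodes to m nodes
    # satisfies all three conditions.
    return (
        f(2 * m) == f(2 * n)
        and f(64 * m) == f(64 * n)
        and f(max(32, 16 + m)) == f(max(32, 16 + n))
    )

print(can_scale_to(30, 20))  # True: 20 initial nodes can scale to 30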

Update replica count

Use the Google Cloud console or Vertex AI SDK for Python to update your worker pool's replica count. If your cluster includes multiple worker pools, you can change each pool's replica count individually in a single request.

Ray on Vertex AI SDK

import vertexai
import vertex_ray

vertexai.init()
cluster = vertex_ray.get_ray_cluster("CLUSTER_NAME")

# Get the resource name.
cluster_resource_name = cluster.cluster_resource_name

# Build the updated worker pool list with the new replica count.
new_worker_node_types = []
for worker_node_type in cluster.worker_node_types:
    worker_node_type.node_count = REPLICA_COUNT  # new worker pool size
    new_worker_node_types.append(worker_node_type)

# Make the update call.
updated_cluster_resource_name = vertex_ray.update_ray_cluster(
    cluster_resource_name=cluster_resource_name,
    worker_node_types=new_worker_node_types,
)
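
Because update_ray_cluster takes the full worker pool list, you can also give each pool its own target size in the same call. A sketch assuming a cluster with two worker pools and illustrative per-pool sizes:

# Assign a different replica count to each worker pool, then update once.
target_sizes = [2, 5]  # illustrative per-pool sizes

for worker_node_type, size in zip(cluster.worker_node_types, target_sizes):
    worker_node_type.node_count = size

vertex_ray.update_ray_cluster(
    cluster_resource_name=cluster_resource_name,
    worker_node_types=cluster.worker_node_types,
)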

Console

  1. In the Google Cloud console, go to the Ray on Vertex AI page.

    Go to the Ray on Vertex AI page

  2. From the list of clusters, click the cluster to modify.

  3. On the Cluster details page, click Edit cluster.

  4. In the Edit cluster pane, select the worker pool to update and then modify the replica count.

  5. Click Update.

    Wait a few minutes for your cluster to update. When the update is complete, you can see the updated replica count on the Cluster details page.
