The Google Kubernetes Engine (GKE) Volume Populator can help you automate and streamline the process of preloading data from Cloud Storage buckets to destination PersistentVolumeClaims (PVCs) during dynamic provisioning.
How GKE Volume Populator works
GKE Volume Populator builds on the core Kubernetes Volume Populator concept. Instead of provisioning an empty volume, GKE Volume Populator lets a PVC reference a GCPDataSource custom resource, which specifies the source Cloud Storage bucket and the credentials needed to access it.
When you create a PVC with a dataSourceRef field that points to a GCPDataSource resource, GKE Volume Populator initiates the data transfer. It copies data from the specified Cloud Storage bucket URI into the underlying persistent storage volume before making the volume available to your Pods.
This process reduces your need to use manual data transfer scripts or CLI commands, and automates the transfer of large datasets to persistent volumes. GKE Volume Populator supports data transfers between the following source and destination types:
- Cloud Storage to Parallelstore
- Cloud Storage to Hyperdisk ML
GKE Volume Populator is a GKE-managed component that's enabled by default on both Autopilot and Standard clusters. You interact with GKE Volume Populator primarily through the gcloud CLI and kubectl.
Architecture
The following diagram shows how data flows from the source storage to the destination storage, and how the PersistentVolume for the destination storage is created by using GKE Volume Populator.
1. You create a PVC that references a GCPDataSource custom resource.
2. GKE Volume Populator detects the PVC and initiates a data transfer Job.
3. The transfer Job runs on an existing node pool, or on a new node pool that GKE creates if node auto-provisioning is enabled.
4. The transfer Job copies data from the Cloud Storage bucket specified in the GCPDataSource resource to the destination storage volume.
5. After the transfer is complete, the PVC is bound to the destination storage volume, making the data available to the workload Pod.
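The first step above can be sketched as a PVC manifest. The dataSourceRef field is standard Kubernetes; the StorageClass name, capacity, and GCPDataSource name here are placeholders, and the datalayer.gke.io API group is an assumption:

```yaml
# Hypothetical PVC that asks GKE Volume Populator to prefill the volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: populated-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: parallelstore-class   # placeholder StorageClass for the destination storage
  resources:
    requests:
      storage: 12Gi                        # placeholder capacity
  dataSourceRef:
    apiGroup: datalayer.gke.io             # assumed API group of the GCPDataSource CRD
    kind: GCPDataSource
    name: gcs-training-data                # must match the name of your GCPDataSource resource
```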
Key benefits
The GKE Volume Populator offers several benefits:
- Automated data population: automatically populate volumes with data from Cloud Storage during provisioning, which helps reduce operational overhead.
- Seamless data portability: move data from object storage to high-performance file (Parallelstore) or block storage (Hyperdisk) systems to help optimize for price or performance based on your workload needs.
- Simplified workflows: reduce the need for separate data loading Jobs or manual intervention to prepare persistent volumes.
- Integration with Identity and Access Management (IAM): use IAM-based authentication through Workload Identity Federation for GKE to help ensure secure data transfer with fine-grained access control.
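One hedged sketch of that IAM setup: grant the workload's Kubernetes ServiceAccount read access to the source bucket through Workload Identity Federation for GKE. The bucket, project, namespace, and ServiceAccount names below are placeholders:

```shell
# Grant the Kubernetes ServiceAccount read access to the source bucket.
# BUCKET_NAME, PROJECT_NUMBER, PROJECT_ID, NAMESPACE, and KSA_NAME are placeholders.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member "principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
  --role "roles/storage.objectViewer"
```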
- Accelerated AI/ML workloads: quickly preload large datasets, models, and weights directly into high-performance storage to help speed up training and inference tasks.
Use cases for GKE Volume Populator
You can use GKE Volume Populator to load large training datasets for AI/ML. Imagine you have a multi-terabyte dataset for training a large language model (LLM) stored in a Cloud Storage bucket. Your training Job runs on GKE and requires high I/O performance. Instead of manually copying the data, you can use the GKE Volume Populator to automatically provision a Parallelstore or Hyperdisk ML volume, and populate it with the dataset from Cloud Storage when the PVC is created. This automated process helps ensure that your training Pods start with immediate, high-speed access to the data.
Here are some more examples where you can use the GKE Volume Populator:
- Pre-caching AI/ML model weights and assets from Cloud Storage into Hyperdisk ML volumes to accelerate model loading times for inference serving.
- Migrating data from Cloud Storage to persistent volumes for stateful applications requiring performant disk access.