We’re excited to announce the general availability of fine-grained compute and memory quota allocation with HyperPod task governance. With this capability, customers can optimize Amazon SageMaker HyperPod cluster utilization on Amazon Elastic Kubernetes Service (Amazon EKS), enforce fair usage, and support efficient resource allocation across different teams or projects. For more information, see Best practices for maximizing the value of SageMaker HyperPod task governance.
Compute quota management is an administrative mechanism that sets and controls compute resource limits across users, teams, and projects. It enforces fair resource distribution, preventing a single entity from monopolizing cluster resources, thereby optimizing overall computational efficiency.
Due to budget constraints, customers might want to allocate compute resources across multiple teams fairly. For example, a data scientist might need some GPUs (for example, 4 H100 GPUs) for model development, but not the entire instance’s compute capacity. In other cases, customers have limited compute resources but many teams, and they want to share compute resources fairly across these teams, so that no idle capacity is left unused.
With HyperPod task governance, administrators can now allocate granular GPU, vCPU, and vCPU memory to teams and projects, in addition to entire instance resources, based on their preferred strategy. Key capabilities include GPU-level quota allocation by instance type and family or by hardware type (supporting both Trainium and NVIDIA GPUs), and optional CPU and memory allocation for fine-tuned resource control. Administrators can also define the weight (or priority level) a team is given for fair-share idle compute allocation.
“With a wide variety of frontier AI data experiments and production pipelines, being able to maximize SageMaker HyperPod cluster utilization is extremely high impact. This requires fair and managed access to shared resources like state-of-the-art GPUs, granular hardware allocation, and more. This is exactly what HyperPod task governance is built for, and we’re excited to see AWS pushing efficient cluster utilization for a variety of AI use cases.”
– Daniel Xu, Director of Product at Snorkel AI, whose AI data development platform empowers enterprises to build specialized AI applications by leveraging their organizational expertise at scale.
In this post, we dive deep into how to define quotas for teams or projects based on granular or instance-level allocation. We discuss different methods to define such policies, and how data scientists can schedule their jobs seamlessly with this new capability.
Solution overview
Prerequisites
To follow the examples in this blog post, you must meet the following prerequisites:
To schedule and execute the example jobs in the Submitting tasks section, you will also need:
- A local environment (either your local machine or a cloud-based compute environment), from which to run the HyperPod CLI and kubectl commands, configured as follows:
- HyperPod Training Operator installed in the cluster
Allocating granular compute and memory quota using the AWS console
Administrators are the primary persona interacting with SageMaker HyperPod task governance and are responsible for managing cluster compute allocation in alignment with the organization’s strategic priorities and goals.
Implementing this feature follows the familiar compute allocation creation workflow of HyperPod task governance. To get started, sign in to the AWS Management Console and navigate to Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console. After selecting your HyperPod cluster, choose the Policies tab on the cluster detail page. Navigate to Compute allocations and choose Create.
As with existing functionality, you can enable task prioritization and fair-share resource allocation through cluster policies that prioritize critical workloads and distribute idle compute across teams. Using HyperPod task governance, you can define queue admission policies (first-come-first-serve by default or task ranking) and idle compute allocation methods (first-come-first-serve or fair-share by default). In the Compute allocation section, you can create and edit allocations to distribute resources among teams, enable lending and borrowing of idle compute, configure preemption of low-priority tasks, and assign fair-share weights.
The key innovation is in the Allocations section shown in the following figure, where you will now find fine-grained options for resource allocation. In addition to the existing instance-level quotas, you can now directly specify GPU quotas by instance type and family or by hardware type. When you define GPU allocations, HyperPod task governance intelligently calculates appropriate default values for vCPUs and memory, which are set proportionally.
For example, when allocating 2 GPUs from a single ml.p5.48xlarge instance (which has 8 GPUs, 192 vCPUs, and 2 TiB memory) in your HyperPod cluster, HyperPod task governance assigns 48 vCPUs and 512 GiB memory as default values, which is equivalent to one quarter of the instance’s total resources. Similarly, if your HyperPod cluster contains 2 ml.g5.2xlarge instances (each with 1 GPU, 8 vCPUs, and 32 GiB memory), allocating 2 GPUs would automatically assign 16 vCPUs and 64 GiB memory from both instances, as shown in the following image.
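The proportional defaults can be reproduced with a quick back-of-the-envelope calculation. The following sketch illustrates the arithmetic only; it is not the actual HyperPod implementation:

```python
# Illustrative sketch: how proportional default vCPU and memory values
# can be derived from a GPU allocation (not the HyperPod implementation).
def default_allocation(gpus_requested, pool_gpus, pool_vcpus, pool_memory_gib):
    """Scale vCPUs and memory by the fraction of GPUs requested from a pool."""
    fraction = gpus_requested / pool_gpus
    return {
        "vcpus": pool_vcpus * fraction,
        "memory_gib": pool_memory_gib * fraction,
    }

# 2 of 8 GPUs on a p5.48xlarge (192 vCPUs, 2048 GiB memory)
print(default_allocation(2, 8, 192, 2048))  # -> {'vcpus': 48.0, 'memory_gib': 512.0}

# 2 GPUs across two g5.2xlarge instances (2 GPUs, 16 vCPUs, 64 GiB total)
print(default_allocation(2, 2, 16, 64))  # -> {'vcpus': 16.0, 'memory_gib': 64.0}
```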

You’ll be able to both proceed with these routinely calculated default values or customise the allocation by manually adjusting the vCPUs and vCPU reminiscence fields as seen within the following picture.

Amazon SageMaker HyperPod supports clusters that include CPU-based instances, GPU-based instances, and AWS Neuron-based hardware (AWS Inferentia and AWS Trainium chips). You can specify resource allocation for your team by instances, GPUs, vCPUs, vCPU memory, or Neuron devices, as shown in the following image.

Quota allocation can be greater than capacity. Resources added to the compute allocation policy that aren’t currently available in the cluster represent planning for future capacity upgrades. Jobs that require these unprovisioned resources will be automatically queued and remain in a pending state until the required resources become available. It’s important to understand that in SageMaker HyperPod, compute allocations function as quotas, which are verified during workload scheduling to determine whether a workload should be admitted, regardless of actual capacity availability. When resource requests are within these defined allocation limits and current utilization, the Kubernetes scheduler (kube-scheduler) handles the actual distribution and placement of pods across the HyperPod cluster nodes.
Allocating granular compute and memory quota using the AWS CLI
You can also create or update compute quotas using the AWS CLI. The following is an example of creating a compute quota with only a GPU count specification using the AWS CLI:
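A sketch of such a call with the `aws sagemaker create-compute-quota` command follows. The command exists in the AWS CLI, but the field names inside `--compute-quota-config` below are illustrative assumptions; verify the exact schema against the CLI reference before use:

```bash
# Sketch: create a team quota specifying only a GPU count.
# Config field names are illustrative; check the create-compute-quota reference.
aws sagemaker create-compute-quota \
  --name "gpu-only-team-quota" \
  --cluster-arn "arn:aws:sagemaker:us-west-2:123456789012:cluster/<cluster-id>" \
  --compute-quota-config '{
      "ComputeQuotaResources": [
          {"InstanceType": "ml.g6.12xlarge", "Accelerators": 2}
      ]
  }' \
  --compute-quota-target '{"TeamName": "onlygputeam", "FairShareWeight": 10}' \
  --activation-state "Enabled"
```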
Compute quotas can also be created with mixed quota types, including a certain number of instances and granular compute resources, as shown in the following example:
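A mixed-type quota might be sketched as follows; again, the resource field names are illustrative assumptions to be checked against the CLI reference:

```bash
# Sketch: mixed quota combining whole instances and granular resources.
# Config field names are illustrative; check the create-compute-quota reference.
aws sagemaker create-compute-quota \
  --name "mixed-team-quota" \
  --cluster-arn "arn:aws:sagemaker:us-west-2:123456789012:cluster/<cluster-id>" \
  --compute-quota-config '{
      "ComputeQuotaResources": [
          {"InstanceType": "ml.g5.8xlarge", "Count": 2},
          {"InstanceType": "ml.g6.12xlarge", "Accelerators": 2, "VCpu": 48, "MemoryInGiB": 192}
      ]
  }' \
  --compute-quota-target '{"TeamName": "mixedteam", "FairShareWeight": 5}' \
  --activation-state "Enabled"
```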
HyperPod task governance deep dive
SageMaker HyperPod task governance enables allocation of GPU, CPU, and memory resources by integrating with Kueue, a Kubernetes-native system for job queueing.
Kueue doesn’t replace existing Kubernetes scheduling components, but rather integrates with the kube-scheduler, such that Kueue decides whether a workload should be admitted based on the resource quotas and current utilization, and then the kube-scheduler takes care of pod placement on the nodes.
When a workload requests specific resources, Kueue selects an appropriate resource flavor based on availability, node affinity, and job priority. The scheduler then injects the corresponding node labels and tolerations into the PodSpec, allowing Kubernetes to place the pod on nodes with the requested hardware configuration. This supports precise resource governance and efficient allocation for multi-tenant clusters.
When a SageMaker HyperPod task governance compute allocation is created, Kueue creates ClusterQueues that define resource quotas and scheduling policies, along with ResourceFlavors for the selected instance types with their unique resource characteristics.
For example, the following compute allocation policy allocates ml.g6.12xlarge instances with 2 GPUs and 48 vCPUs to the onlygputeam team, implementing a LendAndBorrow strategy with an up to 50% borrowing limit. This configuration enables flexible resource sharing while maintaining priority through a fair-share weight of 10 and the ability to preempt lower-priority tasks from other teams.
The corresponding Kueue ClusterQueue is configured with the ml.g6.12xlarge flavor, providing quotas for 2 NVIDIA GPUs, 48 CPU cores, and 192 Gi memory.
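Since the ClusterQueue API is part of the public Kueue project, the generated object looks roughly like the following sketch; the object and flavor names are illustrative of what task governance generates, while the `kueue.x-k8s.io/v1beta1` schema is the real API:

```yaml
# Sketch of the ClusterQueue created for the team quota (names illustrative).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hyperpod-ns-onlygputeam-clusterqueue
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: hyperpod-ns-onlygputeam
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
      flavors:
        - name: ml.g6.12xlarge
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 2
            - name: cpu
              nominalQuota: 48
            - name: memory
              nominalQuota: 192Gi
```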
A Kueue LocalQueue will also be created, referencing the corresponding ClusterQueue. The LocalQueue acts as the namespace-scoped resource through which users submit workloads, and these workloads are then admitted and scheduled according to the quotas and policies defined in the ClusterQueue.
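A sketch of the matching LocalQueue, again with illustrative names, would be:

```yaml
# Sketch of the namespace-scoped LocalQueue that points at the team's
# ClusterQueue (names illustrative).
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: hyperpod-ns-onlygputeam-localqueue
  namespace: hyperpod-ns-onlygputeam
spec:
  clusterQueue: hyperpod-ns-onlygputeam-clusterqueue
```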
Submitting tasks
There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: the SageMaker HyperPod CLI and the Kubernetes command-line tool, kubectl. With both options, data scientists need to reference their team’s namespace and task priority class, in addition to the requested GPU and vCPU compute and memory resources, to use their granular allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, then SageMaker HyperPod task governance will automatically assume the lowest priority. The specific GPU type comes from an instance type selection, because data scientists want to use GPUs with certain capabilities (for example, H100 instead of H200) to perform their tasks efficiently.
HyperPod CLI
The HyperPod CLI was created to abstract the complexities of working with kubectl and so that developers using SageMaker HyperPod can iterate faster with custom commands. The following is an example of a job submission with the HyperPod CLI requesting both compute and memory resources:
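A minimal sketch of such a submission follows. The flag names mirror the `hyp` CLI but should be treated as assumptions and verified against the SageMaker HyperPod CLI reference; the queue, namespace, priority class, and image are placeholders:

```bash
# Sketch: HyperPod CLI job submission with granular resource requests.
# Flag names should be verified against the HyperPod CLI reference.
hyp create hyp-pytorch-job \
  --job-name fine-tune-llm \
  --image <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag> \
  --namespace hyperpod-ns-onlygputeam \
  --queue-name hyperpod-ns-onlygputeam-localqueue \
  --priority training-priority \
  --accelerators 1 \
  --vcpu 8 \
  --memory 32
```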
The highlighted parameters enable requesting granular compute and memory resources. The HyperPod CLI requires the HyperPod Training Operator to be installed in the cluster, and a container image that includes the HyperPod Elastic Agent. For further instructions on how to build such a container image, refer to the HyperPod Training Operator documentation.
For more information on the supported HyperPod CLI arguments and their descriptions, see the SageMaker HyperPod CLI reference documentation.
Kubectl
The following is an example of a kubectl command to submit a job to the HyperPod cluster using the specified queue. This is a simple example of a PyTorch job that checks for GPU availability and then sleeps for 5 minutes. Compute and memory resources are requested using the standard Kubernetes resource management constructs.
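A minimal sketch of such a manifest, assuming a team namespace `hyperpod-ns-onlygputeam` and the queue and priority-class names used earlier (all illustrative):

```yaml
# Sketch: a Job that checks GPU availability, then sleeps for 5 minutes.
# Namespace, queue, priority class, and image are illustrative placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-check-job
  namespace: hyperpod-ns-onlygputeam
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-onlygputeam-localqueue
    kueue.x-k8s.io/priority-class: training-priority
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch
          image: pytorch/pytorch:latest
          command: ["/bin/sh", "-c"]
          args:
            - python -c "import torch; print('CUDA available:', torch.cuda.is_available())" && sleep 300
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: 32Gi
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: 32Gi
```

The job would then be submitted with `kubectl apply -f job.yaml`.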
The following is a short reference guide of useful commands for interacting with SageMaker HyperPod task governance:
- Describing cluster policy with the AWS CLI – This AWS CLI command is useful for viewing the cluster policy settings for your cluster.
- List compute quota allocations with the AWS CLI – Use this AWS CLI command to view the different teams set up with task governance and their respective quota allocation settings.
- HyperPod CLI – The HyperPod CLI abstracts common kubectl commands used to interact with SageMaker HyperPod clusters, such as submitting, listing, and canceling tasks. See the SageMaker HyperPod CLI reference documentation for a full list of commands.
- kubectl – You can also use kubectl to interact with task governance; some helpful commands are:
kubectl get workloads -n hyperpod-ns-<team-name> and kubectl describe workload <workload-name> -n hyperpod-ns-<team-name>. These commands show the workloads running in your cluster per namespace and provide detailed reasoning on Kueue admission decisions. You can use these commands to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”
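Put together, a quick inspection workflow might look like the following sketch; the AWS CLI commands exist, but all identifiers are placeholders:

```bash
# View the cluster policy and team quota allocations (task governance APIs).
aws sagemaker describe-cluster-scheduler-config \
  --cluster-scheduler-config-id <config-id>

aws sagemaker list-compute-quotas --cluster-arn <cluster-arn>

# Inspect Kueue workloads and admission decisions for a team namespace.
kubectl get workloads -n hyperpod-ns-<team-name>
kubectl describe workload <workload-name> -n hyperpod-ns-<team-name>
```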
Common scenarios
A common use case for more granular allocation of GPU compute is fine-tuning small and medium sized large language models (LLMs). A single H100 or H200 GPU can be sufficient to handle such a use case (also depending on the chosen batch size and other factors), and machine learning (ML) platform administrators can choose to allocate a single GPU to each data scientist or ML researcher to optimize the utilization of an instance like ml.p5.48xlarge, which comes with 8 H100 GPUs onboard.
Small language models (SLMs) have emerged as a significant trend in generative AI, offering lower latency, reduced deployment costs, and enhanced privacy capabilities while maintaining impressive performance on targeted tasks, making them increasingly vital for agentic workflows and edge computing scenarios. The new SageMaker HyperPod task governance with fine-grained GPU, CPU, and memory allocation significantly enhances SLM development by enabling precise matching of resources to model requirements, allowing teams to efficiently run multiple experiments concurrently with different architectures. This resource optimization is particularly valuable as organizations develop specialized SLMs for domain-specific applications, with priority-based scheduling so that critical model training jobs receive resources first while maximizing overall cluster utilization. By providing exactly the right resources at the right time, HyperPod accelerates the development of specialized, domain-specific SLMs that can be deployed as efficient agents in complex workflows, enabling more responsive and cost-effective AI solutions across industries.
With the growing popularity of SLMs, organizations can use granular quota allocation to create targeted quota policies that prioritize GPU resources, addressing the budget-sensitive nature of ML infrastructure where GPUs represent the most significant cost and performance factor. Organizations can now selectively apply CPU and memory limits where needed, creating a granular resource management approach that efficiently supports diverse machine learning workloads regardless of model size.
Similarly, to support inference workloads, multiple teams might not require an entire instance to deploy their models, helping to avoid having entire instances equipped with multiple GPUs allocated to each team while GPU compute sits idle.
Finally, during experimentation and algorithm development, data scientists and ML researchers can choose to deploy a container hosting their preferred IDE on HyperPod, like JupyterLab or Code-OSS (Visual Studio Code open source). In this scenario, they often experiment with smaller batch sizes before scaling to multi-GPU configurations, hence not needing entire multi-GPU instances to be allocated. Similar considerations apply to CPU instances; for example, an ML platform administrator might decide to use CPU instances for IDE deployment, because data scientists prefer to scale their training or fine-tuning with jobs rather than experimenting with the local IDE compute. In such cases, depending on the instances of choice, partitioning CPU cores across the team can be beneficial.
Conclusion
The introduction of fine-grained compute quota allocation in SageMaker HyperPod represents a significant advancement in ML infrastructure management. By enabling GPU-level resource allocation alongside instance-level controls, organizations can now precisely tailor their compute resources to match their specific workloads and team structures.
This granular approach to resource governance addresses critical challenges faced by ML teams today: balancing budget constraints, maximizing expensive GPU utilization, and ensuring fair access across data science teams of all sizes. Whether fine-tuning SLMs that require single GPUs, running inference workloads with varied resource needs, or supporting development environments that don’t require full instance power, this flexible capability helps ensure that no compute resources sit idle unnecessarily.
ML workloads continue to diversify in their resource requirements, and SageMaker HyperPod task governance now provides the adaptability organizations need to optimize their GPU capacity investments. To learn more, visit the SageMaker HyperPod product page and the HyperPod task governance documentation.
Give this a try in the Amazon SageMaker AI console and leave your comments here.
About the authors
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Zhenshan Jin is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in various domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.
Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in the data analytics domain.
