Today, we're excited to announce a new capability of Amazon SageMaker HyperPod task governance that helps you optimize the training efficiency and network latency of your AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient compute resource utilization across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation and enforce task priority policies, improving resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market, rather than coordinating resource allocation and replanning tasks. Refer to Best practices for Amazon SageMaker HyperPod task governance for more information.
Generative AI workloads typically demand extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, where network bandwidth impacts both workload runtime and processing latency. The network latency of these communications depends on the physical placement of instances within a data center's hierarchical infrastructure. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. Instances within the same organizational unit communicate faster than those in different units, because fewer network hops between instances results in lower communication latency.
To optimize the placement of your generative AI workloads on your SageMaker HyperPod clusters by taking into account the physical and logical arrangement of resources, you can use EC2 network topology information in your job submissions. An EC2 instance's topology is described by a set of nodes, with one node in each layer of the network. Refer to How Amazon EC2 instance topology works for details on how EC2 topology is organized. Network topology labels offer the following key benefits:
- Reduced latency by minimizing network hops and routing traffic to nearby instances
- Improved training efficiency by optimizing workload placement across network resources
With topology-aware scheduling for SageMaker HyperPod task governance, you can use network topology labels to schedule your jobs with optimized network communication, thereby improving task efficiency and resource utilization for your AI workloads.
In this post, we introduce topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. We provide details about how to use SageMaker HyperPod task governance to optimize your job efficiency.
Solution overview
Data scientists interact with SageMaker HyperPod clusters, where they are responsible for training, fine-tuning, and deploying models on accelerated compute instances. It's important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.
To implement topology-aware scheduling, you first confirm the topology information for all nodes in your cluster, then run a script that tells you which instances are on the same network nodes, and finally schedule a topology-aware training task on your cluster. This workflow gives you greater visibility into and control over the placement of your training instances.
In this post, we walk through viewing node topology information and submitting topology-aware tasks to your cluster. For reference, NetworkNodes describes the network node set of an instance. In each network node set, three layers comprise the hierarchical view of the topology for each instance. Instances that are closest to each other share the same layer 3 network node. If there are no common network nodes in the bottom layer (layer 3), check whether there is commonality at layer 2.
Prerequisites
To get started with topology-aware scheduling, you must have the following prerequisites:
- An EKS cluster
- A SageMaker HyperPod cluster with instances enabled for topology information
- The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later)
- kubectl installed
- (Optional) The SageMaker HyperPod CLI installed
Get node topology information
Run the following command to show the node labels in your cluster. This command provides network topology information for each instance.
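For example, the following kubectl invocation lists each node alongside its network-node labels. This is a sketch: it assumes your instances expose the standard `topology.k8s.aws` labels, which requires topology information to be enabled for the cluster.

```shell
# List nodes with their EC2 network topology labels (one column per layer)
kubectl get nodes \
  -L topology.k8s.aws/network-node-layer-1 \
  -L topology.k8s.aws/network-node-layer-2 \
  -L topology.k8s.aws/network-node-layer-3
```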
Instances that share the same layer 3 network node are as close together as possible, following the EC2 topology hierarchy. You should see a list of node labels that look like the following: topology.k8s.aws/network-node-layer-3: nn-33333example. Run the following script to show the nodes in your cluster that share the same layer 1, 2, and 3 network nodes:
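The original script is not reproduced here; the following is a minimal Python sketch of the same idea. It groups nodes by their `topology.k8s.aws` layer labels and emits a Mermaid flowchart. The `fetch_node_labels` helper (which shells out to kubectl) and the label keys are assumptions based on the labels shown above.

```python
# Sketch: group cluster nodes by shared EC2 network-node layers and emit a
# Mermaid flowchart (layer 1 -> layer 2 -> layer 3 -> instance).
import json
import subprocess

LAYER_KEYS = [
    "topology.k8s.aws/network-node-layer-1",
    "topology.k8s.aws/network-node-layer-2",
    "topology.k8s.aws/network-node-layer-3",
]

def build_mermaid(node_labels):
    """node_labels: dict of node name -> label dict. Returns Mermaid graph text."""
    edges = set()
    for node, labels in sorted(node_labels.items()):
        # Walk layer 1 -> layer 2 -> layer 3, then attach the instance itself.
        chain = [labels.get(k, "unknown") for k in LAYER_KEYS] + [node]
        for parent, child in zip(chain, chain[1:]):
            edges.add(f"    {parent} --> {child}")
    return "graph TD\n" + "\n".join(sorted(edges))

def fetch_node_labels():
    """Read node labels from the cluster with kubectl (requires cluster access)."""
    out = subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"])
    items = json.loads(out)["items"]
    return {i["metadata"]["name"]: i["metadata"]["labels"] for i in items}

if __name__ == "__main__":
    print(build_mermaid(fetch_node_labels()))
```

Nodes that share an edge from the same layer 3 identifier in the resulting diagram are the closest to each other in the network.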
The output of this script prints a flowchart that you can paste into a flow diagram editor such as Mermaid.js.org to visualize the node topology of your cluster. The following figure is an example of the cluster topology for a seven-instance cluster.
Submit tasks
SageMaker HyperPod task governance offers two ways to submit tasks using topology awareness. In this section, we discuss these two options, along with a third option that is an alternative to task governance.
Modify your Kubernetes manifest file
First, you can modify your existing Kubernetes manifest file to include one of two annotation options:
- kueue.x-k8s.io/podset-required-topology – Use this option if all pods must be scheduled on nodes in the same network node layer for the job to begin
- kueue.x-k8s.io/podset-preferred-topology – Use this option if you would prefer all pods to be scheduled on nodes in the same network node layer, but have flexibility
The following code is an example of a sample job that uses the kueue.x-k8s.io/podset-required-topology setting to schedule pods that share the same layer 3 network node:
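The original manifest is not reproduced here; the following is a minimal sketch of such a job. The namespace, queue name, image, and resource values are placeholders for illustration; the annotation value names the deepest topology level (layer 3) that all pods in the pod set must share.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: topology-aware-sample-job        # placeholder name
  namespace: hyperpod-ns-team-a          # placeholder team namespace
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue  # placeholder queue
spec:
  parallelism: 4
  completions: 4
  suspend: true                          # let Kueue admit and schedule the job
  template:
    metadata:
      annotations:
        # Require all pods in this pod set to land on the same layer 3 network node
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: trainer
          image: public.ecr.aws/docker/library/python:3.11   # placeholder image
          command: ["python", "-c", "print('training step')"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```

Swapping the annotation key to kueue.x-k8s.io/podset-preferred-topology keeps the same placement goal but allows the job to start even when a single layer 3 node cannot hold all pods.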
To verify which nodes your pods are running on, use the following command to view the node ID for each pod: kubectl get pods -n hyperpod-ns-team-a -o wide
Use the SageMaker HyperPod CLI
The second way to submit a job is through the SageMaker HyperPod CLI. Make sure you install the latest version (version pending) to use topology-aware scheduling. To use topology-aware scheduling with the SageMaker HyperPod CLI, you can include either the --preferred-topology parameter or the --required-topology parameter in your create job command.
The following code is an example command to start a topology-aware mnist training job using the SageMaker HyperPod CLI; replace XXXXXXXXXXXX with your AWS account ID:
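The original command is not reproduced here; the following is a hypothetical invocation. Only the --required-topology parameter is confirmed by the text above; the subcommand name, the other flags, and the ECR image path are illustrative assumptions and may differ across CLI versions, so check your installed CLI's help output for the exact shape.

```shell
# Hypothetical sketch of a topology-aware job submission via the HyperPod CLI.
# Flags other than --required-topology are illustrative; verify with `--help`.
hyp create hyp-pytorch-job \
  --job-name mnist-topology-job \
  --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/mnist:latest \
  --namespace hyperpod-ns-team-a \
  --required-topology "topology.k8s.aws/network-node-layer-3"
```

Use --preferred-topology instead of --required-topology when co-location is desirable but not mandatory.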
Clean up
If you deployed new resources while following this post, refer to the Clean Up section in the SageMaker HyperPod EKS workshop to make sure you don't accrue unwanted charges.
Conclusion
During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, requiring frequent data exchange between those instances. In this post, we discussed how SageMaker HyperPod task governance helps schedule workloads to improve job efficiency by optimizing throughput and latency. We also walked through how to schedule jobs using SageMaker HyperPod network topology information to optimize network communication latency for your AI tasks.
We encourage you to try out this solution and share your feedback in the comments section.
About the authors
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large-scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for Task Governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture for AI infrastructure.

