    Machine Learning & Research

    Schedule topology-aware workloads using Amazon SageMaker HyperPod task governance

    By Oliver Chambers · September 16, 2025 · 8 min read


    Today, we’re excited to announce a new capability of Amazon SageMaker HyperPod task governance to help you optimize the training efficiency and network latency of your AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient compute resource utilization across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation and enforce task priority policies, improving resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market, rather than coordinating resource allocation and replanning tasks. Refer to Best practices for Amazon SageMaker HyperPod task governance for more information.

    Generative AI workloads typically demand extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, where network bandwidth impacts both workload runtime and processing latency. The network latency of these communications depends on the physical placement of instances within a data center’s hierarchical infrastructure. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. For example, instances within the same organizational unit experience faster processing time compared to those across different units. This means fewer network hops between instances results in lower communication latency.

    To optimize the placement of your generative AI workloads on your SageMaker HyperPod clusters by considering the physical and logical arrangement of resources, you can use EC2 network topology information during your job submissions. An EC2 instance’s topology is described by a set of nodes, with one node in each layer of the network. Refer to How Amazon EC2 instance topology works for details on how EC2 topology is organized. Network topology labels offer the following key benefits:

    • Reduced latency by minimizing network hops and routing traffic to nearby instances
    • Improved training efficiency by optimizing workload placement across network resources

    With topology-aware scheduling for SageMaker HyperPod task governance, you can use topology network labels to schedule your jobs with optimized network communication, thereby improving task efficiency and resource utilization for your AI workloads.

    In this post, we introduce topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. We provide details about how to use SageMaker HyperPod task governance to optimize your job efficiency.

    Solution overview

    Data scientists interact with SageMaker HyperPod clusters. They are responsible for training, fine-tuning, and deploying models on accelerated compute instances. It’s important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.

    To implement topology-aware scheduling, you first confirm the topology information for all nodes in your cluster, then run a script that tells you which instances are on the same network nodes, and finally schedule a topology-aware training task on your cluster. This workflow facilitates greater visibility and control over the placement of your training instances.

    In this post, we walk through viewing node topology information and submitting topology-aware tasks to your cluster. For reference, NetworkNodes describes the network node set of an instance. In each network node set, three layers make up the hierarchical view of the topology for each instance. Instances that are closest to one another share the same layer 3 network node. If there are no common network nodes in the bottom layer (layer 3), check whether there is commonality at layer 2.
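    This bottom-up commonality check can be sketched in a few lines of Python. This is an illustrative sketch only, not an AWS API; the `nn-*` label values and instance tuples below are hypothetical:

```python
# Illustrative sketch (not an AWS API): find the deepest network layer two
# instances have in common. Labels are ordered layer 1 -> layer 3; all
# nn-* values below are made up for illustration.

def deepest_common_layer(a, b):
    """Return the deepest layer (1-3) shared by two instances, 0 if none."""
    depth = 0
    for layer_a, layer_b in zip(a, b):
        if layer_a != layer_b:
            break
        depth += 1
    return depth

inst_a = ("nn-aaaa1111", "nn-bbbb2222", "nn-cccc3333")
inst_b = ("nn-aaaa1111", "nn-bbbb2222", "nn-cccc3333")  # same layer 3 node
inst_c = ("nn-aaaa1111", "nn-dddd4444", "nn-eeee5555")  # diverges at layer 2

print(deepest_common_layer(inst_a, inst_b))  # 3: as close as possible
print(deepest_common_layer(inst_a, inst_c))  # 1: only layer 1 in common
```

    The deeper the common layer, the fewer network hops separate the two instances.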

    Prerequisites

    To get started with topology-aware scheduling, you must have the following prerequisites:

    • An EKS cluster
    • A SageMaker HyperPod cluster with instances enabled for topology information
    • The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later)
    • Kubectl installed
    • (Optional) The SageMaker HyperPod CLI installed

    Get node topology information

    Run the following commands to show node labels in your cluster. These commands provide network topology information for each instance.

    kubectl get nodes -L topology.k8s.aws/network-node-layer-1
    kubectl get nodes -L topology.k8s.aws/network-node-layer-2
    kubectl get nodes -L topology.k8s.aws/network-node-layer-3

    Instances with the same layer 3 network node are as close as possible, following the EC2 topology hierarchy. You should see a list of node labels that look like the following:

    topology.k8s.aws/network-node-layer-3: nn-33333example

    Run the following script to show the nodes in your cluster that are on the same layer 1, 2, and 3 network nodes:

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance 
    chmod +x visualize_topology.sh
    bash visualize_topology.sh

    The output of this script is a flow chart that you can paste into a flow diagram editor such as Mermaid.js.org to visualize the node topology of your cluster. The following figure is an example of the cluster topology for a seven-instance cluster.

    Submit tasks

    SageMaker HyperPod task governance offers two ways to submit tasks using topology awareness. In this section, we discuss these two options and a third option that serves as an alternative to task governance.

    Modify your Kubernetes manifest file

    First, you can modify your existing Kubernetes manifest file to include one of two annotation options:

    • kueue.x-k8s.io/podset-required-topology – Use this option if all pods must be scheduled on nodes in the same network node layer in order for the job to begin
    • kueue.x-k8s.io/podset-preferred-topology – Use this option if you would ideally like all pods scheduled on nodes in the same network node layer, but you have flexibility

    The following code is an example of a sample job that uses the kueue.x-k8s.io/podset-required-topology setting to schedule pods that share the same layer 3 network node:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-tas-job
      namespace: hyperpod-ns-team-a
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
        kueue.x-k8s.io/priority-class: inference-priority
    spec:
      parallelism: 10
      completions: 10
      suspend: true
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
          annotations:
            kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        spec:
          containers:
            - name: dummy-job
              image: public.ecr.aws/docker/library/alpine:latest
              command: ["sleep", "3600s"]
              resources:
                requests:
                  cpu: "1"
          restartPolicy: Never

    To verify which nodes your pods are running on, use the following command to view node IDs per pod:

    kubectl get pods -n hyperpod-ns-team-a -o wide
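    As a hypothetical sketch (pod names, node names, and label values below are illustrative, not taken from a real cluster), you could cross-check that output against the node labels to confirm that every pod of the job landed on the same layer 3 network node:

```python
# Hypothetical sketch: confirm all pods of a job share one layer 3 network
# node. The two dicts mimic `kubectl get pods -o wide` output and node
# labels; every name and nn-* value here is illustrative.

def pods_share_layer3(pod_to_node, node_to_layer3):
    """True if every pod's node carries the same layer 3 network node label."""
    labels = {node_to_layer3[node] for node in pod_to_node.values()}
    return len(labels) == 1

pod_to_node = {
    "test-tas-job-abc12": "hyperpod-i-0aaa",
    "test-tas-job-def34": "hyperpod-i-0bbb",
}
node_to_layer3 = {
    "hyperpod-i-0aaa": "nn-33333example",
    "hyperpod-i-0bbb": "nn-33333example",
}

print(pods_share_layer3(pod_to_node, node_to_layer3))  # True
```

    If the check returns False with a required-topology annotation in place, that suggests the job was scheduled before the annotation took effect.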

    Use the SageMaker HyperPod CLI

    The second way to submit a job is through the SageMaker HyperPod CLI. Make sure you install the latest version (version pending) to use topology-aware scheduling. To use topology-aware scheduling with the SageMaker HyperPod CLI, you can include either the --preferred-topology parameter or the --required-topology parameter in your create job command.

    The following code is an example command to start a topology-aware mnist training job using the SageMaker HyperPod CLI; replace XXXXXXXXXXXX with your AWS account ID:

    hyp create hyp-pytorch-job \
    --job-name test-pytorch-job-cli \
    --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
    --pull-policy "Always" \
    --tasks-per-node 1 \
    --max-retry 1 \
    --preferred-topology topology.k8s.aws/network-node-layer-3

    Clean up

    If you deployed new resources while following this post, refer to the Clean Up section in the SageMaker HyperPod EKS workshop to make sure you don’t accrue unwanted charges.

    Conclusion

    During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, requiring frequent data exchange between those instances. In this post, we discussed how SageMaker HyperPod task governance helps schedule workloads to improve job efficiency by optimizing throughput and latency. We also walked through how to schedule jobs using SageMaker HyperPod topology network information to optimize network communication latency for your AI tasks.

    We encourage you to try out this solution and share your feedback in the comments section.


    About the authors

    Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large-scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

    Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

    Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for Task Governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.

    Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
