    Streamline machine learning workflows with SkyPilot on Amazon SageMaker HyperPod

    By Oliver Chambers | July 13, 2025


    This post is co-written with Zhanghao Wu, co-creator of SkyPilot.

    The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.

    SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.

    Amazon SageMaker HyperPod is purpose-built infrastructure for developing and deploying large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also delivers optimal performance by placing instances on the same network spine, along with built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.

    In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.

    Challenges of orchestrating machine learning workloads

    Kubernetes has become popular for ML workloads because of its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.

    ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster administration can pose significant challenges, potentially slowing down development cycles and hurting resource utilization.

    Additionally, AI infrastructure teams faced the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They required a solution that could offer both high-level control and ease of use for day-to-day operations.

    SageMaker HyperPod with SkyPilot

    To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.

    Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.

    SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workload on the best infrastructure: find the available GPUs, provision them, run the job, and manage its lifecycle. A minimal task definition is shown below.
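    As a quick illustration, a task definition can be as small as the following YAML (a minimal sketch; the accelerator choice and the train.py entry point are placeholders, not part of this post's example):

    # task.yaml -- a minimal SkyPilot task definition (illustrative)
    resources:
      accelerators: H100:1   # request one H100 GPU wherever SkyPilot can find it

    run: |
      python train.py        # hypothetical script in the synced working directory

    Launching it is then a single command: sky launch -c demo task.yaml.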

    Solution overview

    Implementing this solution is straightforward, whether you're working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).

    The following diagram illustrates the solution architecture.

    In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We walk through the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.

    Prerequisites

    You must have the following prerequisites:

    • An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You must provision a single ml.p5.48xlarge instance for the code samples in the following sections.
    • Access to the AWS CLI and kubectl command line tools.
    • A Python environment for installing SkyPilot.

    Create a SageMaker HyperPod cluster

    You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
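    If you prefer the CLI over the console for this step, creating the stack looks roughly like the following (a sketch; the stack name is arbitrary and the template URL is a placeholder for the one referenced in the instructions):

    aws cloudformation create-stack \
        --stack-name hyperpod-eks-stack \
        --template-url <template-URL-from-the-instructions> \
        --capabilities CAPABILITY_NAMED_IAM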

    To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or the AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:

    cat > cluster-config.json << EOL
    {
        "ClusterName": "hp-cluster",
        "Orchestrator": {
            "Eks": {
                "ClusterArn": "${EKS_CLUSTER_ARN}"
            }
        },
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 2,
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://${BUCKET_NAME}",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "${EXECUTION_ROLE}",
                "ThreadsPerCore": 1,
                "OnStartDeepHealthChecks": [
                    "InstanceStress",
                    "InstanceConnectivity"
                ]
            },
      ....
        ],
        "VpcConfig": {
            "SecurityGroupIds": [
                "$SECURITY_GROUP"
            ],
            "Subnets": [
                "$SUBNET_ID"
            ]
        },
        "ResilienceConfig": {
            "NodeRecovery": "Automatic"
        }
    }
    EOL

    You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes, as in the fragment below.
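    For example, adding the following fragment to an instance group attaches an extra 500 GB EBS volume to each node in that group (a sketch following the CreateCluster API shape; the volume size is illustrative):

    "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ]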

    To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:

    aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

    You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.

    Connect to your SageMaker HyperPod EKS cluster

    From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):

    aws eks update-kubeconfig --name $EKS_CLUSTER_NAME

    You can verify that you're connected to the EKS cluster by running the following command:

    kubectl config current-context
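    As an optional sanity check, you can also confirm that the HyperPod worker nodes are visible through the new context:

    # the ml.p5.48xlarge worker nodes should appear in the output
    kubectl get nodes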

    Install SkyPilot with Kubernetes support

    Use the following code to install SkyPilot with Kubernetes support using pip:

    pip install "skypilot[kubernetes]"

    This installs the latest release of SkyPilot, which includes the necessary Kubernetes integrations.
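    You can confirm the installation succeeded before moving on (the version string you see will differ):

    sky --version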

    Verify SkyPilot's connection to the EKS cluster

    Check if SkyPilot can connect to your Kubernetes cluster:

    sky check k8s

    The output should look similar to the following code:

    Checking credentials to enable clouds for SkyPilot.
    Kubernetes: enabled [compute]
    
    To enable a cloud, follow the hints above and rerun: sky check
    If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
    
    🎉 Enabled clouds 🎉
    Kubernetes [compute]
    Active context: arn:aws:eks:us-east-2:XXXXXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster
    
    Using SkyPilot API server: http://127.0.0.1:46580

    If this is your first time using SkyPilot with this Kubernetes cluster, you may see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:

    python -m sky.utils.kubernetes.gpu_labeler --context <your-eks-context>

    This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job can take a few minutes depending on the number of GPU resources in your cluster.
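    When the labeling job completes, you can optionally verify the result; the labeler sets a skypilot.co/accelerator label on each GPU node (label key per SkyPilot's documentation at the time of writing):

    # each GPU node should report its accelerator type (for example, h100)
    kubectl get nodes -L skypilot.co/accelerator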

    Discover available GPUs in the cluster

    To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:

    sky show-gpus --cloud k8s

    This lists the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:

    Kubernetes GPUs
    GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    H100  1, 2, 4, 8                16          16
    
    Kubernetes per node accelerator availability
    NODE_NAME                     GPU_NAME  TOTAL_GPUS  FREE_GPUS
    hyperpod-i-00baa178bc31afde3  H100      8           8
    hyperpod-i-038beefa954efab84  H100      8           8

    Launch an interactive development environment

    With SkyPilot, you can launch a SkyPilot cluster for interactive development:

    sky launch -c dev --gpus H100

    This command creates an interactive development environment (IDE) with a single H100 GPU and syncs the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.

    Considered resources (1 node):
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
     CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                                 COST ($)   CHOSEN   
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Kubernetes   2CPU--8GB--H100:1   2       8         H100:1         arn:aws:eks:us-east-2:XXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster   0.00          ✔     
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Launching a new cluster 'dev'. Proceed? [Y/n]: Y
    • Launching on Kubernetes.
    Pod is up.
    ✔ Cluster launched: dev. View logs: sky api logs -l sky-2025-05-05-15-28-47-523797/provision.log
    • Syncing files.
    Run commands not specified or empty.
    Useful Commands
    Cluster name: dev
    To log into the head VM:   ssh dev
    To submit a job:           sky exec dev yaml_file
    To stop the cluster:       sky stop dev
    To teardown the cluster:   sky down dev

    After it's launched, you can connect to your IDE:

    ssh dev

    This gives you an interactive shell in your IDE, where you can run your code, install packages, and perform ML experiments.
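    You can also iterate without keeping a shell open: sky exec re-syncs your local working directory and runs a command on the existing cluster (train.py here is a hypothetical script in that directory):

    sky exec dev -- python train.py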

    Run training jobs

    With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.

    First, create a file named train.yaml with your training job configuration:

    resources:
        accelerators: H100
    
    num_nodes: 1
    
    setup: |
        git clone --depth 1 https://github.com/pytorch/examples || true
        cd examples
        git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
        # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
        uv venv --python 3.10
        source .venv/bin/activate
        uv pip install -r requirements.txt "numpy<2" "torch"
    
    run: |
        cd examples
        source .venv/bin/activate
        cd mingpt
        export LOGLEVEL=INFO
    
        MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
        echo "Starting distributed training, head node: $MASTER_ADDR"
    
        torchrun \
        --nnodes=$SKYPILOT_NUM_NODES \
        --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
        --master_addr=$MASTER_ADDR \
        --master_port=8008 \
        --node_rank=${SKYPILOT_NODE_RANK} \
        main.py

    Then launch your training job:

    sky launch -c train train.yaml

    This creates a training job on a single p5.48xlarge node, which is equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:

    sky logs train
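    A few other SkyPilot commands are useful while the job runs (a quick reference, not an exhaustive list):

    sky status         # list your clusters and their state
    sky queue train    # show the job queue on the 'train' cluster
    sky logs train 1   # stream logs for a specific job ID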

    Run multi-node training jobs with EFA

    Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that lets you run applications requiring high levels of inter-node communication at scale on AWS through its custom-built operating system bypass hardware interface. This enables applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.

    The following code snippet shows how to incorporate this into your SkyPilot job:

    name: nccl-test-efa
    
    resources:
      cloud: kubernetes
      accelerators: H100:8
      image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest
    
    num_nodes: 2
    
    envs:
      USE_EFA: "true"
    
    run: |
      if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
        echo "Head node"
    
        # Total number of processes; NP should be the total number of GPUs in the cluster
        NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    
        # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
        nodes=""
        for ip in $SKYPILOT_NODE_IPS; do
          nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
        done
        nodes=${nodes::-1}
        echo "All nodes: ${nodes}"
    
        # Set environment variables
        export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
        export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
        export NCCL_HOME=/opt/nccl
        export CUDA_HOME=/usr/local/cuda-12.2
        export NCCL_DEBUG=INFO
        export NCCL_BUFFSIZE=8388608
        export NCCL_P2P_NET_CHUNKSIZE=524288
        export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
    
        if [ "${USE_EFA}" == "true" ]; then
          export FI_PROVIDER="efa"
        else
          export FI_PROVIDER=""
        fi
    
        /opt/amazon/openmpi/bin/mpirun \
          --allow-run-as-root \
          --tag-output \
          -H $nodes \
          -np $NP \
          -N $SKYPILOT_NUM_GPUS_PER_NODE \
          --bind-to none \
          -x FI_PROVIDER \
          -x PATH \
          -x LD_LIBRARY_PATH \
          -x NCCL_DEBUG=INFO \
          -x NCCL_BUFFSIZE \
          -x NCCL_P2P_NET_CHUNKSIZE \
          -x NCCL_TUNER_PLUGIN \
          --mca pml ^cm,ucx \
          --mca btl tcp,self \
          --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
          /opt/nccl-tests/build/all_reduce_perf \
          -b 8 \
          -e 2G \
          -f 2 \
          -g 1 \
          -c 5 \
          -w 5 \
          -n 100
      else
        echo "Worker nodes"
      fi
    
    config:
      kubernetes:
        pod_config:
          spec:
            containers:
            - resources:
                limits:
                  vpc.amazonaws.com/efa: 32
                requests:
                  vpc.amazonaws.com/efa: 32
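    To run the test, save the configuration (for example as nccl-test-efa.yaml; the file name is arbitrary) and launch it like any other SkyPilot job. The all_reduce_perf output in the job logs reports the bus bandwidth achieved between the two nodes:

    sky launch -c nccl-test nccl-test-efa.yaml
    sky logs nccl-test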

    Clean up

    To delete your SkyPilot cluster, run the following command:

    sky down <cluster-name>

    To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:

    aws sagemaker delete-cluster --cluster-name <cluster-name>

    Cluster deletion takes a few minutes. You can confirm successful deletion once no clusters appear on the SageMaker AI console.
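    You can also confirm the deletion from the AWS CLI; after deletion completes, the cluster no longer appears in the list:

    aws sagemaker list-clusters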

    If you used the CloudFormation stack to create resources, you can delete it using the following command:

    aws cloudformation delete-stack --stack-name <stack-name>

    Conclusion

    By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot's user-friendly interface, we've showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.


    About the authors

    Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.

    Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on the SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.

    Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit's experience includes product management expertise within the financial services industry for high-frequency and low-latency trading, as well as business development for Amazon Alexa.
