This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.
SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
Amazon SageMaker HyperPod is purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also delivers optimal performance through same-spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.
In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.
Challenges of orchestrating machine learning workloads
Kubernetes has become popular for ML workloads due to its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.
ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster administration can pose significant challenges, potentially slowing down development cycles and reducing resource utilization.
Additionally, AI infrastructure teams face the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They require a solution that offers both high-level control and ease of use for day-to-day operations.
SageMaker HyperPod with SkyPilot
To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.
Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.
SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workloads on the best infrastructure: find the available GPUs, provision them, run the job, and manage its lifecycle.
Solution overview
Implementing this solution is straightforward, whether you're working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).
The following diagram illustrates the solution architecture.
In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.
Prerequisites
You must have the following prerequisites:
- An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You must provision a single ml.p5.48xlarge instance for the code samples in the following sections.
- Access to the AWS CLI and kubectl command line tools.
- A Python environment for installing SkyPilot.
Create a SageMaker HyperPod cluster
You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or the AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:
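For reference, a minimal configuration sketch follows; the cluster name, ARNs, S3 URI, lifecycle script, and subnet and security group IDs are placeholders you must replace with your own values:

```json
{
  "ClusterName": "ml-cluster",
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
    }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/LifecycleScripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerClusterExecutionRole",
      "ThreadsPerCore": 1,
      "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["sg-xxxxxxxx"],
    "Subnets": ["subnet-xxxxxxxx"]
  },
  "NodeRecovery": "Automatic"
}
```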
You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:
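For example, with your cluster configuration saved in a file named cluster-config.json (an illustrative name), and substituting your own AWS Region:

```shell
# Create the SageMaker HyperPod cluster from the JSON configuration file
aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region us-west-2
```

You can check provisioning progress with aws sagemaker describe-cluster or on the SageMaker AI console.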
You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.
Connect to your SageMaker HyperPod EKS cluster
From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
You can verify that you're connected to the EKS cluster by running the following command:
kubectl config current-context
Install SkyPilot with Kubernetes support
Use the following code to install SkyPilot with Kubernetes support using pip:
pip install skypilot[kubernetes]
This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.
Verify SkyPilot's connection to the EKS cluster
Check whether SkyPilot can connect to your Kubernetes cluster:
sky check k8s
The output should look similar to the following code:
If this is your first time using SkyPilot with this Kubernetes cluster, you may see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:
python -m sky.utils.kubernetes.gpu_labeler --context
This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
Discover available GPUs in the cluster
To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:
sky show-gpus --cloud k8s
This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:
Launch an interactive development environment
With SkyPilot, you can launch a SkyPilot cluster for interactive development:
sky launch -c dev --gpus H100
This command creates an interactive development environment (IDE) with a single H100 GPU and syncs the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.
After it's launched, you can connect to your IDE:
ssh dev
This gives you an interactive shell on your IDE, where you can run your code, install packages, and perform ML experiments.
Run training jobs
With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.
First, create a file named train.yaml with your training job configuration:
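The exact contents depend on your workload. The following is a minimal sketch that requests all 8 H100 GPUs on one node and launches a hypothetical train.py with torchrun; the setup commands and script name are placeholders:

```yaml
# train.yaml -- illustrative SkyPilot task definition
name: train

resources:
  accelerators: H100:8

num_nodes: 1

# Sync the current working directory (containing train.py) to the cluster
workdir: .

setup: |
  pip install torch torchvision

run: |
  torchrun --nproc_per_node=8 train.py
```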
Then launch your training job:
sky launch -c train train.yaml
This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:
sky logs train
Run multi-node training jobs with EFA
Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This allows applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.
The following code snippet shows how to incorporate this into your SkyPilot job:
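One approach, sketched here under the assumption that the EFA Kubernetes device plugin is installed on the cluster, is to request the vpc.amazonaws.com/efa extended resource through SkyPilot's Kubernetes pod_config passthrough (the exact field name varies by SkyPilot version; recent versions accept a task-level config block). p5.48xlarge instances expose 32 EFA interfaces; verify the count for your instance type:

```yaml
# Additions to a SkyPilot task YAML for EFA-enabled multi-node training
num_nodes: 2

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 32
              requests:
                vpc.amazonaws.com/efa: 32
```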
Clean up
To delete your SkyPilot cluster, run the following command:
sky down
To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:
aws sagemaker delete-cluster --cluster-name
Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:
aws cloudformation delete-stack --stack-name
Conclusion
By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot's user-friendly interface, we've showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
About the authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.
Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit's experience includes product management expertise within the financial services industry for high-frequency and low-latency trading, as well as business development for Amazon Alexa.