This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.
SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
Amazon SageMaker HyperPod is purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also delivers optimal performance through same-spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.
In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.
Challenges of orchestrating machine learning workloads
Kubernetes has become popular for ML workloads due to its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.
ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster administration can pose significant challenges, potentially slowing down development cycles and reducing resource utilization.
Additionally, AI infrastructure teams face the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They require a solution that offers both high-level control and ease of use for day-to-day operations.
SageMaker HyperPod with SkyPilot
To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.
Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.
SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workloads on the best infrastructure: find the available GPUs, provision them, run the job, and manage its lifecycle.
Solution overview
Implementing this solution is straightforward, whether you're working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).
The following diagram illustrates the solution architecture.
In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.
Prerequisites
You must have the following prerequisites:
- An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You must provision a single ml.p5.48xlarge instance for the code samples in the following sections.
- Access to the AWS CLI and kubectl command line tools.
- A Python environment for installing SkyPilot.
Create a SageMaker HyperPod cluster
You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or the AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:
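For reference, a minimal configuration sketch follows; the cluster name, ARNs, S3 URI, lifecycle script, and subnet and security group IDs are placeholders you must replace with your own values:

```json
{
  "ClusterName": "ml-cluster",
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
    }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/LifecycleScripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerClusterExecutionRole",
      "ThreadsPerCore": 1,
      "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["sg-xxxxxxxx"],
    "Subnets": ["subnet-xxxxxxxx"]
  },
  "NodeRecovery": "Automatic"
}
```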
You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:
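For example, with your cluster configuration saved in a file named cluster-config.json (an illustrative name), and substituting your own AWS Region:

```shell
# Create the SageMaker HyperPod cluster from the JSON configuration file
aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region us-west-2
```

You can check provisioning progress with aws sagemaker describe-cluster or on the SageMaker AI console.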
You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.
Connect to your SageMaker HyperPod EKS cluster
From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
You can verify that you're connected to the EKS cluster by running the following command:
kubectl config current-context
Install SkyPilot with Kubernetes support
Use the following code to install SkyPilot with Kubernetes support using pip:
pip install skypilot[kubernetes]
This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.
Verify SkyPilot's connection to the EKS cluster
Check whether SkyPilot can connect to your Kubernetes cluster:
sky check k8s
The output should look similar to the following code:
If this is your first time using SkyPilot with this Kubernetes cluster, you may see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:
python -m sky.utils.kubernetes.gpu_labeler --context
This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
Discover available GPUs in the cluster
To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:
sky show-gpus --cloud k8s
This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:
Launch an interactive development environment
With SkyPilot, you can launch a SkyPilot cluster for interactive development:
sky launch -c dev --gpus H100
This command creates an interactive development environment (IDE) with a single H100 GPU and syncs the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.
After it's launched, you can connect to your IDE:
ssh dev
This gives you an interactive shell on your IDE, where you can run your code, install packages, and perform ML experiments.
Run training jobs
With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.
First, create a file named train.yaml with your training job configuration:
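The exact contents depend on your workload. The following is a minimal sketch that requests all 8 H100 GPUs on one node and launches a hypothetical train.py with torchrun; the setup commands and script name are placeholders:

```yaml
# train.yaml -- illustrative SkyPilot task definition
name: train

resources:
  accelerators: H100:8

num_nodes: 1

# Sync the current working directory (containing train.py) to the cluster
workdir: .

setup: |
  pip install torch torchvision

run: |
  torchrun --nproc_per_node=8 train.py
```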
Then launch your training job:
sky launch -c train train.yaml
This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:
sky logs train
Run multi-node training jobs with EFA
Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This allows applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.
The following code snippet shows how to incorporate this into your SkyPilot job:
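One approach, sketched here under the assumption that the EFA Kubernetes device plugin is installed on the cluster, is to request the vpc.amazonaws.com/efa extended resource through SkyPilot's Kubernetes pod_config passthrough (the exact field name varies by SkyPilot version; recent versions accept a task-level config block). p5.48xlarge instances expose 32 EFA interfaces; verify the count for your instance type:

```yaml
# Additions to a SkyPilot task YAML for EFA-enabled multi-node training
num_nodes: 2

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 32
              requests:
                vpc.amazonaws.com/efa: 32
```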
Clean up
To delete your SkyPilot cluster, run the following command:
sky down
To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:
aws sagemaker delete-cluster --cluster-name
Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:
aws cloudformation delete-stack --stack-name
Conclusion
By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot's user-friendly interface, we've showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
About the authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.
Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit's experience includes product management expertise within the financial services industry for high-frequency and low-latency trading, as well as business development for Amazon Alexa.