Handle Amazon SageMaker HyperPod clusters utilizing the HyperPod CLI and SDK

Coaching and deploying giant AI fashions requires superior distributed computing capabilities, however managing these distributed methods shouldn’t be advanced for knowledge scientists and machine studying (ML) practitioners. The command line interface (CLI) and software program improvement package (SDK) for Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplify the way you handle cluster infrastructure and use the service’s distributed coaching and inference capabilities.

The SageMaker HyperPod CLI supplies knowledge scientists with an intuitive command-line expertise, abstracting away the underlying complexity of distributed methods. Constructed on high of the SageMaker HyperPod SDK, the CLI affords simple instructions for managing HyperPod clusters and customary workflows like launching coaching or fine-tuning jobs, deploying inference endpoints, and monitoring cluster efficiency. This makes it supreme for fast experimentation and iteration.

A layered structure for simplicity

The HyperPod CLI and SDK observe a multi-layered, shared structure. The CLI and the Python module function user-facing entry factors and are each constructed on high of widespread SDK parts to supply constant conduct throughout interfaces. For infrastructure automation, the SDK orchestrates cluster lifecycle administration by means of a mixture of AWS CloudFormation stack provisioning and direct AWS API interactions. Coaching and inference workloads and built-in improvement environments (IDEs) (Areas) are expressed as Kubernetes Customized Useful resource Definitions (CRDs), which the SDK manages by means of the Kubernetes API.

On this publish, we exhibit the best way to use the CLI and the SDK to create and handle SageMaker HyperPod clusters in your AWS account. We stroll by means of a sensible instance and dive deeper into the consumer workflow and parameter decisions.

This publish focuses on cluster creation and administration. For a deep dive into utilizing the HyperPod CLI and SDK to submit coaching jobs and deploy inference endpoints, see our companion publish: Practice and deploy fashions on Amazon SageMaker HyperPod utilizing the brand new HyperPod CLI and SDK.

Stipulations

To observe the examples on this publish, you need to have the next conditions:

Set up the SageMaker HyperPod CLI

First, set up the most recent model of the SageMaker HyperPod CLI and SDK. The examples on this publish are primarily based on model 3.5.0. Out of your native surroundings, run the next command, you possibly can alternatively set up the CLI in a Python digital surroundings:

# Set up the HyperPod CLI and SDK
pip set up sagemaker-hyperpod

This command units up the instruments wanted to work together with SageMaker HyperPod clusters. For an current set up, be sure to have the most recent model of the bundle put in (SageMaker HyperPod 3.5.0 or later) to have the ability to use the related set of options described on this publish. To confirm if the CLI is put in appropriately, run the hyp command and verify the outputs:

# Test if the HyperPod CLI is appropriately put in
hyp

The output will probably be just like the next, and consists of directions on the best way to use the CLI:

Utilization: hyp [OPTIONS] COMMAND [ARGS]...

Choices:
  --version  Present model info
  --help     Present this message and exit.

Instructions:
  configure                       Replace any subset of fields in ./config.yaml by passing -- flags.
  create                          Create endpoints, pytorch jobs, cluster stacks, area, area entry or area admin config.
  delete                          Delete endpoints, pytorch jobs, area, area entry or area template.
  describe                        Describe endpoints, pytorch jobs or cluster stacks, areas or area template.
  exec                            Execute instructions in pods for endpoints or pytorch jobs.
  get-cluster-context             Get context associated to the present set cluster.
  get-logs                        Get pod logs for endpoints, pytorch jobs or areas.
  get-monitoring                  Get monitoring configurations for Hyperpod cluster.
  get-operator-logs               Get operator logs for endpoints.
  init                            Initialize a TEMPLATE scaffold in DIRECTORY.
  invoke                          Invoke mannequin endpoints.
  checklist                            Checklist endpoints, pytorch jobs, cluster stacks, areas, and area templates.
  list-accelerator-partition-type
                                  Checklist obtainable accelerator partition varieties for an occasion sort.
  list-cluster                    Checklist SageMaker Hyperpod Clusters with metadata.
  list-pods                       Checklist pods for endpoints or pytorch jobs.
  reset                           Reset the present listing's config.yaml to an "empty" scaffold: all schema keys set to default values (however maintaining the...
  set-cluster-context             Hook up with a HyperPod EKS cluster.
  begin                           Begin area sources.
  cease                            Cease area sources.
  replace                          Replace an current HyperPod cluster configuration, area, or area template.
  validate                        Validate this listing's config.yaml towards the suitable schema.

For extra info on CLI utilization and the obtainable instructions and respective parameters, see the CLI reference documentation.

The HyperPod CLI supplies instructions to handle the total lifecycle of HyperPod clusters. The next sections clarify the best way to create new clusters, monitor their creation, modify occasion teams, and delete clusters.

Creating a brand new HyperPod cluster

HyperPod clusters will be created by means of the AWS Administration Console or the HyperPod CLI, each of which offer streamlined experiences for cluster creation. The console affords the simplest and most guided method, whereas the CLI is particularly helpful for purchasers preferring a programmatic expertise—for instance, to allow reproducibility or to construct automation round cluster creation. Each strategies use the identical underlying CloudFormation template, which is out there within the SageMaker HyperPod cluster setup GitHub repository. For a walkthrough of the console-based expertise, see the cluster creation expertise weblog publish.

Creating a brand new cluster by means of the HyperPod CLI follows a configuration-based workflow: the CLI first generates configuration recordsdata, that are then edited to match the supposed cluster specs. These recordsdata are subsequently submitted as a CloudFormation stack that creates the HyperPod cluster together with the required sources, corresponding to a VPC and FSx for Lustre filesystem, amongst others.To initialize a brand new cluster configuration by operating the next command:hyp init cluster-stack

This initializes a brand new cluster configuration within the present listing and generates a config.yaml file that you need to use to specify the configuration of the cluster stack. Moreover it should create a README.md with details about the performance and workflow along with a template for the CloudFormation stack parameters in cfn_params.jinja.

(base) xxxxxxxx@3c06303f9abb hyperpod % hyp init cluster-stack
Initializing new scaffold for 'cluster-stack'…
✔️ cluster-stack for schema model='1.0' is initialized in .
🚀 Welcome!
📘 See ./README.md for utilization.

The cluster stack’s configuration variables are outlined in config.yaml. The next is an excerpt from the file:

...
# Prefix for use for all sources. A 4-digit UUID will probably be added to prefix throughout submission
resource_name_prefix: hyp-eks-stack
# Boolean to Create HyperPod Cluster Stack
create_hyperpod_cluster_stack: True
# Identify of SageMaker HyperPod Cluster
hyperpod_cluster_name: hyperpod-cluster
# Boolean to Create EKS Cluster Stack
create_eks_cluster_stack: True
# The Kubernetes model
kubernetes_version: 1.31
...

The resource_name_prefix parameter serves as the first identifier for the AWS sources created throughout deployment. Every deployment should use a singular useful resource identify prefix to keep away from conflicts. The worth of the prefix parameter is routinely appended with a singular identifier throughout cluster creation to supply useful resource uniqueness.

The configuration will be edited both instantly by opening config.yaml in an editor of your alternative or by operating the hyp configure command. The next instance exhibits the best way to specify the Kubernetes model of the Amazon EKS cluster that will probably be created by the stack:

hyp configure --kubernetes-version 1.33

Updating variables by means of the CLI instructions supplies added safety by performing validation towards the outlined schema earlier than setting the worth in config.yaml.

Moreover the Kubernetes model and the useful resource identify prefix, some examples of serious parameters are listed beneath:

# Checklist of string containing occasion group configurations
instance_group_settings:
  - {'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}

# Boolean to Create EKS Cluster Stack
create_eks_cluster_stack: True

# The identify of the S3 bucket used to retailer the cluster lifecycle scripts
s3_bucket_name: amzn-s3-demo-bucket

# Storage capability for the FSx file system in GiB
storage_capacity: 1200

There are two vital nuances when updating the configuration values by means of hyp configure instructions:

Underscores (_) in variable names inside config.yaml develop into hyphens (-) within the CLI instructions. Thus kubernetes_version in config.yaml is configured by way of hyp configure --kubernetes-version within the CLI.
Variables that comprise lists of entries inside config.yaml are configured as JSON lists within the CLI command. For instance, a number of occasion teams are configured inside config.yaml as the next:

instance_group_settings:
  - {'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}
  - {'InstanceCount': 2, 'InstanceGroupName': 'employee', 'InstanceType': 'ml.t3.giant', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 1000}}]}

Which interprets to the next CLI command:

hyp configure —instance-group-settings "[{'InstanceCount': 1, 'InstanceGroupName': 'default', 'InstanceType': 'ml.t3.medium', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 500}}]}, {'InstanceCount': 2, 'InstanceGroupName': 'employee', 'InstanceType': 'ml.t3.giant', 'TargetAvailabilityZoneId': 'use2-az2', 'ThreadsPerCore': 1, 'InstanceStorageConfigs': [{'EbsVolumeConfig': {'VolumeSizeInGB': 1000}}]}]"

After you’re accomplished making the specified modifications, validate your configuration file by operating the next command:hyp validate

It will validate the parameters in config.yaml towards the outlined schema. If profitable, the CLI will output the next:

(base) xxxxxxxx@3c06303f9abb hyperpod % hyp validate
✔️  config.yaml is legitimate!

The cluster creation stack will be submitted to CloudFormation by operating the next command:hyp create --region

The hyp create command performs validation and injects values from config.yaml into the cfn_params.jinja template. If no AWS Area is explicitly supplied, the command makes use of the default Area out of your AWS credentials configuration. The resolved configuration file and CloudFormation template values are saved to a timestamped subdirectory beneath the ./run/ listing, offering a light-weight native versioning mechanism to trace which configuration was used to create a cluster at a given cut-off date. You may as well select to commit these artifacts to your model management system to enhance reproducibility and auditability. If profitable, the command outputs the CloudFormation stack ID:

(base) xxxxxxxx@3c06303f9abb dev % hyp create
✔️ config.yaml is legitimate!
✔️ Submitted! Information written to run/20251118T101501
Submitting to default area: us-east-1.
Stack creation initiated. Stack ID: arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1

Monitoring the HyperPod cluster creation course of

You’ll be able to checklist the present CloudFormation stacks by operating the next command:hyp checklist cluster-stack --region

You’ll be able to optionally filter the output by stack standing by including the next flag: --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']".

The output of this command will look just like the next:

(base) xxxxxxxx@3c06303f9abb dev % hyp checklist cluster-stack
📋 HyperPod Cluster Stacks (94 discovered)

[1] Stack Particulars:
 Subject | Worth
---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------
 StackId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8/e2898250-c491-11f0-bf25-0afff7e082cf
 StackName | HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
 TemplateDescription | S3 Endpoint Stack
 CreationTime | :18:50
 StackStatus | CREATE_COMPLETE
 ParentId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
 RootId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
 DriftInformation | {'StackDriftStatus': 'NOT_CHECKED'}

Relying on the configuration in config.yaml, a number of nested stacks are created that cowl totally different facets of the HyperPod cluster setup such because the EKSClusterStack, FsxStack and the VPCStack.

You should use the describe command to view particulars about any of the person stacks:hyp describe cluster-stack --region

The output for an exemplary substack, S3EndpointStack, will seem like the next:

(base) xxxxxxxx@3c06303f9abb dev % hyp describe cluster-stack HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
📋 Stack Particulars for: HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
Standing: CREATE_COMPLETE
 Subject | Worth 
-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------
 StackId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8/e2898250-c491-11f0-bf25-0afff7e082cf
 StackName | HyperpodClusterStack-d5351-S3EndpointStack-10JBD25F965A8
 Description | S3 Endpoint Stack
 Parameters | [
 |  "ParameterValue": "hyp-eks-demo-stack"
 ,
 |  "ParameterKey": "VpcId",
 ,
 |  ,
 |  
 | ]
 CreationTime | :18:50.007000+00:00
 RollbackConfiguration | {}
 StackStatus | CREATE_COMPLETE
 DisableRollback | True
 NotificationARNs | []
 Capabilities | [
 | "CAPABILITY_AUTO_EXPAND",
 | "CAPABILITY_IAM",
 | "CAPABILITY_NAMED_IAM"
 | ]
 Tags | []
 EnableTerminationProtection | False
 ParentId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
 RootId | arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/HyperpodClusterStack-d5351/5b83ed40-c491-11f0-a31f-1234073395a1
 DriftInformation | {
 | "StackDriftStatus": "NOT_CHECKED"

If any of the stacks present CREATE_FAILED, ROLLBACK_* or DELETE_*, open the CloudFormation web page within the console to analyze the foundation trigger. Failed cluster creation stacks are sometimes associated to inadequate service quotas for the cluster itself, the occasion teams, or the community parts corresponding to VPCs or NAT gateways. Test the corresponding SageMaker HyperPod Quotas to study extra in regards to the required quotas for SageMaker HyperPod.

Connecting to a cluster

After the cluster stack has efficiently created the required sources and the standing has modified to CREATE_COMPLETE, you possibly can configure the CLI and your native Kubernetes surroundings to work together with the HyperPod cluster.

hyp set-cluster-context --cluster-name —area

The --cluster-name choice specifies the identify of the HyperPod cluster to connect with and the --region choice specifies the Area the place the cluster has been created. Optionally, a particular namespace will be configured utilizing the --namespace parameter. The command updates your native Kubernetes config in ./kube/config, as a way to use each the HyperPod CLI and Kubernetes utilities corresponding to kubectl to handle the sources in your HyperPod cluster.

See our companion weblog publish for additional details about the best way to use the CLI to submit coaching jobs and inference deployments to your newly created HyperPod cluster: Practice and deploy fashions on Amazon SageMaker HyperPod utilizing the brand new HyperPod CLI and SDK.

Modifying an current HyperPod cluster

The HyperPod CLI supplies a command to switch the occasion teams and node restoration mode of an current HyperPod cluster by means of the hyp replace cluster command. This may be helpful if it’s good to scale your cluster by including or eradicating employee nodes, or if you wish to change the occasion varieties utilized by the node teams.

To replace the occasion teams, run the next command, tailored along with your cluster identify and desired occasion group settings:

hyp replace cluster --cluster-name  --region  
 --instance-groups '[{
        "instance_count": 2,
        "instance_group_name": "worker-nodes",
        "instance_type": "ml.m5.large",
        "execution_role": "arn:aws:iam:::role/",
        "life_cycle_config": {
            "source_s3_uri": "s3:///amzn-s3-demo-source-bucket/",
            "on_create": "on_create.sh"
        }
    }]'

Notice that all the fields within the previous command are required to run the replace command, even when, for instance, solely the occasion depend is modified. You’ll be able to checklist the present cluster and occasion group configurations to acquire the required values by operating the hyp describe cluster --region command.

The output of the replace command will seem like the next:

[11/18/25 13:21:57] Replace Params: {'instance_groups': [ClusterInstanceGroupSpecification(instance_count=2, instance_group_name="worker-nodes", instance_type="ml.m5.large", life_cycle_config=ClusterLifeCycleConfig(source_s3_uri='s3://amzn-s3-demo-source-bucket2', on_create="on_create.sh"), execution_role="arn:aws:iam::037065979077:role/hyp-eks-stack-4e5aExecRole", threads_per_core=, instance_storage_configs=, on_start_deep_health_checks=, training_plan_arn=, override_vpc_config=, scheduled_update_config=, image_id=)], 'node_recovery': 'Computerized'}
[11/18/25 13:21:58]  Updating cluster useful resource. sources.py:3506
INFO:sagemaker_core.principal.sources:Updating cluster useful resource.
Cluster has been up to date
Cluster hyperpod-cluster has been up to date

The --node-recovery choice enables you to configure the node restoration conduct, which will be set to both Computerized or None. For details about the SageMaker HyperPod computerized node restoration function, see Computerized node restoration.

Deleting an current HyperPod cluster

To delete an current HyperPod cluster, run the next command. Notice that this motion is not reversible:

hyp delete cluster-stack --region

This command removes the desired CloudFormation stack and the related AWS sources. You should use the non-obligatory --retain-resources flag to specify a comma-separated checklist of logical useful resource IDs to retain throughout the deletion course of. It’s vital to rigorously think about which sources it’s good to retain, as a result of the delete operation can’t be undone.

The output of this command will seem like the next, asking you to verify the useful resource deletion:

⚠ WARNING: It will delete the next 12 sources:

Different (12):
 - EKSClusterStack
 - FsxStack
 - HelmChartStack
 - HyperPodClusterStack
 - HyperPodParamClusterStack
 - LifeCycleScriptStack
 - PrivateSubnetStack
 - S3BucketStack
 - S3EndpointStack
 - SageMakerIAMRoleStack
 - SecurityGroupStack
 - VPCStack

Proceed? [y/N]: y
✓ Stack 'HyperpodClusterStack-d5351' deletion initiated efficiently

SageMaker HyperPod SDK

SageMaker HyperPod additionally features a Python SDK for programmatic entry to the options described earlier. The Python SDK is utilized by the CLI instructions and is put in once you set up the sagemaker-hyperpod Python bundle as described to start with of this publish. The HyperPod CLI is greatest suited to customers preferring a streamlined, interactive expertise for widespread HyperPod administration duties like creating and monitoring clusters, coaching jobs, and inference endpoints. It’s significantly useful for fast prototyping, experimentation, and automating repetitive HyperPod workflows by means of scripts or steady integration and supply (CI/CD) pipelines. In distinction, the HyperPod SDK supplies extra programmatic management and adaptability, making it the popular alternative when it’s good to embed HyperPod performance instantly into your software, combine with different AWS or third-party providers, or construct advanced, custom-made HyperPod administration workflows. Think about the complexity of your use case, the necessity for automation and integration, and your staff’s familiarity with programming languages when deciding whether or not to make use of the HyperPod CLI or SDK.

The SageMaker HyperPod CLI GitHub repository exhibits examples of how cluster creation and administration will be applied utilizing the Python SDK.

Conclusion

The SageMaker HyperPod CLI and SDK simplify cluster creation and administration. With the examples on this publish, we’ve demonstrated how these instruments present worth by means of:

Simplified lifecycle administration – From preliminary configuration to cluster updates and cleanup, the CLI aligns with how groups handle long-running coaching and inference environments and abstracts away pointless complexity.
Declarative management when wanted – The SDK exposes the underlying configuration mannequin, in order that groups can codify cluster specs, occasion teams, storage filesystems, and extra.
Built-in observability – Visibility into CloudFormation stacks is out there with out switching instruments, supporting easy iteration throughout improvement and operation.

Getting began with these instruments is as simple as putting in the SageMaker HyperPod bundle. The SageMaker HyperPod CLI and SDK present the correct stage of abstraction for each knowledge scientists trying to shortly experiment with distributed coaching and ML engineers constructing manufacturing methods.

For those who’re desirous about the best way to use the HyperPod CLI and SDK for submitting coaching jobs and deploying fashions to your new cluster, be sure to verify our companion weblog publish: Practice and deploy fashions on Amazon SageMaker HyperPod utilizing the brand new HyperPod CLI and SDK.

Main Menu

What's Hot

Setting Up a Google Colab AI-Assisted Coding Surroundings That Really Works

Pricing Breakdown and Core Characteristic Overview

65% of Organisations Nonetheless Detect Unauthorised Shadow AI Regardless of Visibility Optimism

Handle Amazon SageMaker HyperPod clusters utilizing the HyperPod CLI and SDK

Setting Up a Google Colab AI-Assisted Coding Surroundings That Really Works

We ran 16 AI Fashions on 9,000+ Actual Paperwork. Here is What We Discovered.

Quick Paths and Sluggish Paths – O’Reilly

Evaluating the Finest AI Video Mills for Social Media

Utilizing AI To Repair The Innovation Drawback: The Three Step Resolution

Midjourney V7: Quicker, smarter, extra reasonable

Meta resumes AI coaching utilizing EU person knowledge

Setting Up a Google Colab AI-Assisted Coding Surroundings That Really Works

Pricing Breakdown and Core Characteristic Overview

65% of Organisations Nonetheless Detect Unauthorised Shadow AI Regardless of Visibility Optimism

Nvidia's new open weights Nemotron 3 tremendous combines three totally different architectures to beat gpt-oss and Qwen in throughput

Main Menu

Subscribe to Updates

What's Hot

Handle Amazon SageMaker HyperPod clusters utilizing the HyperPod CLI and SDK

A layered structure for simplicity

Stipulations

Set up the SageMaker HyperPod CLI

Creating a brand new HyperPod cluster

Monitoring the HyperPod cluster creation course of

Connecting to a cluster

Modifying an current HyperPod cluster

Deleting an current HyperPod cluster

SageMaker HyperPod SDK

Conclusion

In regards to the authors

Related Posts