    Amazon SageMaker HyperPod enhances ML infrastructure with scalability and customizability

    By Oliver Chambers | August 26, 2025


    Amazon SageMaker HyperPod is purpose-built infrastructure for optimizing foundation model (FM) training and inference at scale. SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training FMs, reducing training time by up to 40%.

    SageMaker HyperPod offers persistent clusters with built-in resiliency, while also providing deep infrastructure control by allowing users to SSH into the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances. It helps efficiently scale model development and deployment tasks such as training, fine-tuning, or inference across a cluster of hundreds or thousands of AI accelerators, while reducing the operational heavy lifting involved in managing such clusters. As AI deployment expands across a multitude of domains and use cases, the need for flexibility and control becomes more pertinent. Large enterprises want to make sure that their GPU clusters follow organization-wide policies and security rules. Mission-critical AI/ML workloads often require specialized environments that align with the organization's software stack and operational requirements.

    SageMaker HyperPod supports Amazon Elastic Kubernetes Service (Amazon EKS) and offers two new features that enhance this control and flexibility to enable production deployment of large-scale ML workloads:

    • Continuous provisioning – SageMaker HyperPod now supports continuous provisioning, which improves cluster scalability through features like partial provisioning, rolling updates, concurrent scaling operations, and continuous retries when launching and configuring your HyperPod cluster.
    • Custom AMIs – You can now use custom Amazon Machine Images (AMIs), which enables the preconfiguration of software stacks, security agents, and proprietary dependencies that would otherwise require complex post-launch bootstrapping. Customers can create custom AMIs using the HyperPod public AMI as a base and install additional software required to meet their organization's specific security and compliance requirements.

    In this post, we dive deeper into each of these features.

    Continuous provisioning

    The new continuous provisioning feature in SageMaker HyperPod is a significant advance for organizations running intensive ML workloads, delivering flexibility and operational efficiency that accelerates AI innovation. This feature provides the following benefits:

    • Partial provisioning – SageMaker HyperPod prioritizes delivering the maximum possible number of instances without failure. You can begin running your workload while the cluster attempts to provision the remaining instances.
    • Concurrent operations – SageMaker HyperPod supports simultaneous scaling and maintenance actions (such as scale up, scale down, and patching) on a single instance group without waiting for earlier operations to complete.
    • Continuous retries – SageMaker HyperPod persistently attempts to fulfill the user's request until it encounters a NonRecoverable error from which recovery isn't possible.
    • Increased customer visibility – SageMaker HyperPod maps customer-initiated and service-initiated operations to structured activity streams, providing real-time status updates and detailed progress tracking.

    For ML teams facing tight deadlines and resource constraints, this means dramatically reduced wait times and the ability to begin model training and deployment with whatever compute capacity is immediately available, while the system works in the background to provision the remaining requested resources.
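With partial provisioning, a cluster can be usable while some instances are still pending. A minimal sketch of how a team might track that from the ListClusterNodes response is shown below; the response shape (`InstanceStatus.Status`) follows the SageMaker API, but treat the field names as assumptions to verify against your SDK version, and the sample `nodes` payload is mocked for illustration.

```python
from collections import Counter

def summarize_node_status(node_summaries):
    """Tally HyperPod node states from a ListClusterNodes-style response.

    With partial provisioning, some nodes may be Running while others are
    still Pending; this shows at a glance how much capacity is usable now.
    """
    counts = Counter(n["InstanceStatus"]["Status"] for n in node_summaries)
    return {
        "total": sum(counts.values()),
        "ready": counts.get("Running", 0),
        "by_status": dict(counts),
    }

# Mocked payload standing in for sagemaker.list_cluster_nodes(...) output
nodes = [
    {"InstanceId": "i-0aaa", "InstanceStatus": {"Status": "Running"}},
    {"InstanceId": "i-0bbb", "InstanceStatus": {"Status": "Running"}},
    {"InstanceId": "i-0ccc", "InstanceStatus": {"Status": "Pending"}},
]
print(summarize_node_status(nodes))
# {'total': 3, 'ready': 2, 'by_status': {'Running': 2, 'Pending': 1}}
```

In a real workflow you would feed this the `ClusterNodeSummaries` list returned by the `list-cluster-nodes` call shown later in this post.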

    Implement continuous provisioning in a SageMaker HyperPod cluster

    The architecture introduces an intuitive yet powerful parameter that puts scaling strategy control directly in your hands: --node-provisioning-mode. Continuous provisioning maximizes resource utilization and operational agility.

    The following code creates a cluster with one instance group and continuous provisioning mode enabled using --node-provisioning-mode:

    aws sagemaker create-cluster \
    --cluster-name $HP_CLUSTER_NAME \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --vpc-config '{
       "SecurityGroupIds": ["'$SECURITY_GROUP'"],
       "Subnets": ["'$SUBNET'"]
    }' \
    --instance-groups '{
       "InstanceGroupName": "ig-1",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1
    }' \
    --node-provisioning-mode Continuous
    {
        "ClusterArn": "arn:aws:sagemaker:us-west-2:530295135845:cluster/pv09azbjo6hs"
    }
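The same request can also be expressed through the Python SDK. The sketch below only assembles and prints the request payload; the cluster name, ARNs, and bucket are placeholders, and the `NodeProvisioningMode` key is assumed to mirror the CLI's --node-provisioning-mode flag, so verify it against your boto3 version before uncommenting the client call.

```python
import json
# import boto3  # uncomment to make the real call

request = {
    "ClusterName": "my-hyperpod-cluster",  # placeholder
    "Orchestrator": {
        "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks"}  # placeholder ARN
    },
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "Subnets": ["subnet-0123456789abcdef0"],       # placeholder
    },
    "InstanceGroups": [{
        "InstanceGroupName": "ig-1",
        "InstanceType": "ml.p6-b200.48xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {"SourceS3Uri": "s3://my-bucket", "OnCreate": "on_create.sh"},
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecRole",  # placeholder
        "ThreadsPerCore": 1,
    }],
    "NodeProvisioningMode": "Continuous",  # assumed SDK-side name of the CLI flag
}
print(json.dumps(request, indent=2))

# sagemaker = boto3.client("sagemaker")
# sagemaker.create_cluster(**request)
```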

    Additional features are launched with continuous provisioning:

    • Cron job scheduling for instance group software updates:
    # Cron format: cron(Minutes Hours Day-of-month Month Day-of-week Year)
    aws sagemaker update-cluster --cluster-name $HP_CLUSTER_NAME \
    --instance-groups '[{
       "InstanceGroupName": "group2",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1,
       "ScheduledUpdateConfig": {
          "ScheduleExpression": "cron(30 19 27 * ? *)"
       }
    }]'

    • Rolling updates with safety measures. With rolling deployment, HyperPod gradually shifts traffic from your old fleet to a new fleet. If there is an issue during deployment, it shouldn't affect the whole cluster.
    aws sagemaker update-cluster --cluster-name $HP_CLUSTER_NAME \
    --instance-groups '[{
       "InstanceGroupName": "group4",
       "ScheduledUpdateConfig": {
          "ScheduleExpression": "cron(45 14 25 * ? *)",
          "DeploymentConfig": {
             "AutoRollbackConfiguration": [{
                "AlarmName": "RollbackPatchingAlarm"
             }],
             "RollingUpdatePolicy": {
                "MaximumBatchSize": {
                   "Type": "INSTANCE_COUNT",
                   "Value": 1
                }
             },
             "WaitIntervalInSeconds": 15
          }
       }
    }]'

    aws sagemaker list-cluster-nodes --cluster-name $HP_CLUSTER_NAME

    • Batch add nodes (add nodes to specific instance groups):
    aws sagemaker batch-add-cluster-nodes --cluster-name $HP_CLUSTER_NAME \
    --nodes-to-add '[{
       "InstanceGroupName": "group1",
       "IncrementTargetCountBy": 5
    }]'

    • Batch delete nodes (remove specific nodes by ID):
    aws sagemaker batch-delete-cluster-nodes --cluster-name $HP_CLUSTER_NAME \
    --node-ids i-0b949a3867b2a963a

    • Enable Training Plan capacity for instance provisioning by adding the TrainingPlanArn parameter during instance group creation:
    aws sagemaker update-cluster --cluster-name $HP_CLUSTER_NAME \
    --instance-groups '[{
       "InstanceGroupName": "training-group",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 3,
       "TrainingPlanArn": "YOUR_TRAINING_PLAN_ARN"
    }]'

    • Cluster event observability:
    aws sagemaker list-cluster-events --cluster-name $HP_CLUSTER_NAME
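As a side note on the rolling-update settings shown earlier (MaximumBatchSize and WaitIntervalInSeconds), a quick back-of-the-envelope helper can estimate how an update will proceed. This is a hypothetical utility, not part of any AWS SDK, and it deliberately ignores per-instance update time:

```python
import math

def rolling_update_batches(instance_count: int, batch_size: int,
                           wait_interval_s: int):
    """Estimate batch count and minimum total inter-batch wait for a
    rolling update with MaximumBatchSize Type=INSTANCE_COUNT."""
    batches = math.ceil(instance_count / batch_size)
    # The service waits wait_interval_s between batches, so at minimum
    # there are (batches - 1) waits; real updates add per-node time.
    total_wait = max(batches - 1, 0) * wait_interval_s
    return batches, total_wait

# 10 instances updated one at a time with a 15-second wait between batches
print(rolling_update_batches(10, 1, 15))
# (10, 135)
```

A batch size of 1, as in the example above, is the most conservative setting: an alarm-triggered rollback can fire after any single instance, at the cost of the longest update window.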

    Custom AMIs

    To reduce operational overhead, nodes in a SageMaker HyperPod cluster are launched with the AWS Deep Learning AMIs (DLAMIs). AWS DLAMIs are pre-built AMIs optimized for running deep learning workloads on EC2 instances. They come pre-installed with popular deep learning frameworks, libraries, and tools to make it easy to get started with training and deploying deep learning models.

    The new custom AMI feature of SageMaker HyperPod unlocks even greater value for enterprise customers by delivering the granular control and operational excellence needed to accelerate AI initiatives while maintaining security standards. It bridges high-performance computing requirements with enterprise-grade security and operations.

    Organizations can now build customized AMIs using SageMaker HyperPod performance-tuned public AMIs as a foundation; teams can pre-install security agents, compliance tools, proprietary software, and specialized libraries directly into optimized images.

    This feature offers the following benefits:

    • It accelerates time-to-value by minimizing runtime installation delays and reducing cluster initialization time through pre-built configurations.
    • From a security standpoint, it enables enterprise-grade centralized control, so security teams can maintain full oversight while meeting their compliance requirements.
    • Operationally, the feature promotes excellence through standardized, reproducible environments using version-controlled AMIs, while providing seamless integration with existing workflows.

    The following sections outline a step-by-step approach to build your own AMI and apply it to your SageMaker HyperPod cluster.

    Select and obtain your SageMaker HyperPod base AMI

    You can choose from two options to retrieve the SageMaker HyperPod base AMI. To use the Amazon EC2 console, complete the following steps:

    1. On the Amazon EC2 console, choose AMIs under Images in the navigation pane.
    2. Choose Public images as the image type and set the Owner alias filter to Amazon.
    3. Search for AMIs prefixed with HyperPod EKS.
    4. Choose the appropriate AMI (ideally the latest).

    Alternatively, you can use the AWS Command Line Interface (AWS CLI) with AWS Systems Manager to fetch the latest SageMaker HyperPod base AMI:

    aws ssm get-parameter \
      --name "/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.31-amazon-linux-2/latest/ami-id" \
      --region us-west-2 \
      --query "Parameter.Value" \
      --output text

    # Replace the Kubernetes version in the parameter name as required.
    # For example, to use Kubernetes 1.30, substitute eks-1.30 for eks-1.31.
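If you resolve AMIs for several Kubernetes versions, a small helper that builds the parameter path programmatically can avoid typos. The path layout below is taken directly from the CLI example above; treat it as illustrative and verify it against current AWS documentation before relying on it.

```python
def hyperpod_ami_parameter(eks_version: str,
                           os_name: str = "amazon-linux-2",
                           arch: str = "x86_64") -> str:
    """Build the SSM parameter path for a HyperPod EKS base AMI.

    Layout copied from the aws ssm get-parameter example; confirm the
    supported version/os/arch combinations in the AWS docs.
    """
    return (f"/aws/service/sagemaker-hyperpod/ami/{arch}/"
            f"eks-{eks_version}-{os_name}/latest/ami-id")

print(hyperpod_ami_parameter("1.30"))
# /aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.30-amazon-linux-2/latest/ami-id
```

The returned string can be passed as `Name` to `ssm get-parameter` (or boto3's `ssm.get_parameter`) exactly as in the CLI example.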

    Build your custom AMI

    After you select a SageMaker HyperPod public AMI, use it as the base to build your own custom AMI with one of the following methods. This isn't an exhaustive list; you can use your preferred method, and SageMaker HyperPod doesn't have any strong recommendations.

    • Amazon EC2 console – Choose your customized EC2 instance, then choose Actions, Image and templates, Create image.
    • AWS CLI – Use the aws ec2 create-image command.
    • HashiCorp Packer – Packer is an open source tool from HashiCorp that you can use to create identical machine images for multiple platforms from a single source configuration. It supports creating AMIs for AWS, as well as images for other cloud providers and virtualization platforms.
    • EC2 Image Builder – EC2 Image Builder is a fully managed AWS service that makes it easy to automate the creation, maintenance, validation, sharing, and deployment of Linux or Windows Server images.

    Set up the required permissions

    Before you start using custom AMIs, confirm you have the required AWS Identity and Access Management (IAM) policies configured. Make sure to add the following policies to your ClusterAdmin user permissions (IAM policy):

    # Minimal set of permissions for admin to run the HyperPod core APIs
    "sagemaker:CreateCluster",
    "sagemaker:DeleteCluster",
    "sagemaker:DescribeCluster",
    "sagemaker:DescribeCluterNode",
    "sagemaker:ListClusterNodes",
    "sagemaker:ListClusters",
    "sagemaker:UpdateCluster",
    "sagemaker:UpdateClusterSoftware",
    "sagemaker:BatchDeleteClusterNodes",
    "eks:DescribeCluster",
    "eks:CreateAccessEntry",
    "eks:DescribeAccessEntry",
    "eks:DeleteAccessEntry",
    "eks:AssociateAccessPolicy",
    "iam:CreateServiceLinkedRole",
    
    # Permissions required to manage HyperPod clusters with a custom AMI
    "ec2:DescribeImages",
    "ec2:ModifyImageAttribute",
    "ec2:modifySnapshotAttribute",
    "ec2:DescribeSnapshots"

    Run cluster management operations

    To create a cluster with a custom AMI, use the aws sagemaker create-cluster command. Specify your custom AMI in the ImageId parameter, and include the other required cluster configurations:

    aws sagemaker create-cluster \
       --cluster-name clusterNameHere \
       --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
       --node-provisioning-mode Continuous \
       --instance-groups '{
       "InstanceGroupName": "groupNameHere",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ImageId": "amiIdHere",
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1,
       "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 500
                }
            }
       ]
    }' --vpc-config '{
       "SecurityGroupIds": ["'$SECURITY_GROUP'"],
       "Subnets": ["'$SUBNET'"]
    }'

    Scale up an instance group with the following code:

    aws sagemaker update-cluster \
        --cluster-name $HP_CLUSTER_NAME --instance-groups '[{
       "InstanceGroupName": "groupNameHere",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 10,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1,
       "ImageId": "amiIdHere"
    }]'

    Add an instance group with the following code:

    aws sagemaker update-cluster \
       --cluster-name "clusterNameHere" \
       --instance-groups '{
       "InstanceGroupName": "groupNameHere",
       "InstanceType": "ml.p6-b200.48xlarge",
       "InstanceCount": 10,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1,
       "ImageId": "amiIdHere"
    }' '{
       "InstanceGroupName": "groupNameHere2",
       "InstanceType": "ml.c5.2xlarge",
       "InstanceCount": 1,
       "LifeCycleConfig": {
          "SourceS3Uri": "s3://'$BUCKET_NAME'",
          "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$EXECUTION_ROLE'",
       "ThreadsPerCore": 1,
       "ImageId": "amiIdHere"
    }'

    Considerations

    When using custom AMIs with your cluster, be aware of the following requirements and limitations:

    • Snapshot support – Custom AMIs must contain only the root snapshot. Additional snapshots are not supported; cluster creation or update operations will fail with a validation exception if the AMI contains snapshots beyond the root volume.
    • Patching – ImageId in update-cluster is immutable. To patch existing instance groups, you must use UpdateClusterSoftware with ImageId.
    • AMI versions and deprecation – The public AMI releases page lists the public AMI versions and their deprecation status. Customers are expected to monitor this page for AMI vulnerabilities and deprecation status and patch the cluster with an updated custom AMI.
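The root-snapshot requirement can be checked before cluster creation. The sketch below validates a `describe-images`-style record; the `BlockDeviceMappings`/`Ebs.SnapshotId` field names follow the EC2 DescribeImages response, and the sample dicts are mocked for illustration rather than taken from a real account.

```python
def root_snapshot_only(image: dict) -> bool:
    """Check that an AMI record has exactly one EBS snapshot, matching
    HyperPod's single-root-snapshot requirement for custom AMIs."""
    snapshots = [
        m for m in image.get("BlockDeviceMappings", [])
        if "Ebs" in m and "SnapshotId" in m["Ebs"]
    ]
    return len(snapshots) == 1

# Mocked describe-images records (real ones come from ec2 describe-images)
ok_ami = {"BlockDeviceMappings": [
    {"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-0aaa"}}]}
bad_ami = {"BlockDeviceMappings": [
    {"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-0aaa"}},
    {"DeviceName": "/dev/xvdb", "Ebs": {"SnapshotId": "snap-0bbb"}}]}

print(root_snapshot_only(ok_ami), root_snapshot_only(bad_ami))
# True False
```

Running this against the `Images[0]` entry returned by `aws ec2 describe-images --image-ids <your-ami>` catches the validation exception case before it surfaces in a cluster operation.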

    Clean up

    To clean up your resources and avoid incurring additional charges, complete the following steps:

    1. Delete your SageMaker HyperPod cluster.
    2. If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

    Conclusion

    In this post, we introduced two features in SageMaker HyperPod that enhance scalability and customizability for ML infrastructure. Continuous provisioning offers flexible resource provisioning to help you start training and deploying your models faster and manage your cluster more efficiently. With custom AMIs, you can align your ML environments with organizational security standards and software requirements. To learn more about these features, see the Amazon SageMaker HyperPod documentation.

    About the authors

    Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he attended Boston University and graduated with a degree in Computer Engineering. You can connect with him on LinkedIn.

    Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

    Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

    Arun Nagpal is a Sr Technical Account Manager & Enterprise Support Lead at Amazon Web Services (AWS), specializing in driving generative AI adoption and supporting startups through enterprise-wide cloud transformations. He focuses on adopting AI services within AWS and aligning technology strategies with business goals to achieve impactful outcomes.

    Daiming Yang is a technical leader at AWS, working on machine learning infrastructure that enables large-scale training and inference workloads. He has contributed to multiple AWS services and is proficient in various AWS technologies, with expertise in distributed systems, Kubernetes, and cloud-native architecture. Passionate about building reliable, customer-focused solutions, he specializes in transforming complex technical challenges into simple, robust systems that scale globally.

    Kunal Jha is a Principal Product Manager at AWS, where he focuses on building Amazon SageMaker HyperPod to enable scalable distributed training and fine-tuning of foundation models. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can connect with him on LinkedIn.

    Sai Kiran Akula is an engineering leader at AWS, working on the HyperPod team focused on improving infrastructure for machine learning training and inference jobs. He has contributed to core AWS services like EC2, ECS, Fargate, and SageMaker partner AI apps. With a background in distributed systems, he focuses on building reliable and scalable solutions across teams.
