This post was written with Mohamed Hossam of Brightskies.
Research universities engaged in large-scale AI and high-performance computing (HPC) often face significant infrastructure challenges that impede innovation and delay research outcomes. Traditional on-premises HPC clusters come with long GPU procurement cycles, rigid scaling limits, and complex maintenance requirements. These obstacles restrict researchers’ ability to iterate quickly on AI workloads such as natural language processing (NLP), computer vision, and foundation model (FM) training. Amazon SageMaker HyperPod alleviates the undifferentiated heavy lifting involved in building AI models. It helps quickly scale model development tasks such as training, fine-tuning, or inference across a cluster of hundreds or thousands of AI accelerators (NVIDIA H100, A100, and other GPUs) integrated with preconfigured HPC tools and automated scaling.
In this post, we demonstrate how a research university implemented SageMaker HyperPod to accelerate AI research by using dynamic SLURM partitions, fine-grained GPU resource management, budget-aware compute cost tracking, and multi-login node load balancing, all integrated seamlessly into the SageMaker HyperPod environment.
Solution overview
Amazon SageMaker HyperPod is designed to support large-scale machine learning operations for researchers and ML scientists. The service is fully managed by AWS, removing operational overhead while maintaining enterprise-grade security and performance.
The following architecture diagram illustrates how to access SageMaker HyperPod to submit jobs. End users can use AWS Site-to-Site VPN, AWS Client VPN, or AWS Direct Connect to securely access the SageMaker HyperPod cluster. These connections terminate on the Network Load Balancer that efficiently distributes SSH traffic to login nodes, which are the primary entry points for job submission and cluster interaction. At the core of the architecture is SageMaker HyperPod compute, a controller node that orchestrates cluster operations, and multiple compute nodes arranged in a grid configuration. This setup supports efficient distributed training workloads with high-speed interconnects between nodes, all contained within a private subnet for enhanced security.
The storage infrastructure is built around two main components: Amazon FSx for Lustre provides high-performance file system capabilities, and Amazon S3 provides dedicated storage for datasets and checkpoints. This dual-storage approach provides both fast data access for training workloads and secure persistence of valuable training artifacts.
The implementation consisted of several stages. In the following steps, we demonstrate how to deploy and configure the solution.
Prerequisites
Before deploying Amazon SageMaker HyperPod, make sure that the following prerequisites are in place:
- AWS configuration:
- The AWS Command Line Interface (AWS CLI) configured with appropriate permissions
- Cluster configuration files prepared: cluster-config.json and provisioning-parameters.json
- Network setup:
- An AWS Identity and Access Management (IAM) role with permissions for the following:
Launch the CloudFormation stack
We launched an AWS CloudFormation stack to provision the necessary infrastructure components, including a VPC and subnet, FSx for Lustre file system, S3 bucket for lifecycle scripts and training data, and IAM roles with scoped permissions for cluster operation. Refer to the Amazon SageMaker HyperPod workshop for CloudFormation templates and automation scripts.
Customize SLURM cluster configuration
To align compute resources with departmental research needs, we created SLURM partitions to reflect the organizational structure, for example NLP, computer vision, and deep learning teams. We used the SLURM partition configuration to define slurm.conf with custom partitions. SLURM accounting was enabled by configuring slurmdbd and linking usage to departmental accounts and supervisors.
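As a rough illustration of the partition layout described above, the following Python sketch renders one slurm.conf PartitionName line per team. The team names come from the example in the text; the node hostlists, time limits, and account names are hypothetical placeholders, and the `Partition` class and `render_partitions` helper are illustrative names, not part of SLURM or SageMaker HyperPod.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str            # partition name, here one per research team
    nodes: str           # SLURM hostlist expression for the member nodes
    max_time: str        # wall-clock limit in SLURM time format (D-HH:MM:SS)
    allow_accounts: str  # comma-separated accounting accounts allowed to submit
    default: bool = False


def render_partitions(partitions):
    """Render slurm.conf PartitionName lines, one per partition."""
    lines = []
    for p in partitions:
        lines.append(
            f"PartitionName={p.name} Nodes={p.nodes} "
            f"Default={'YES' if p.default else 'NO'} "
            f"MaxTime={p.max_time} AllowAccounts={p.allow_accounts} State=UP"
        )
    return "\n".join(lines)


# Hypothetical layout: node names, limits, and accounts are placeholders.
TEAMS = [
    Partition("nlp", "ip-10-1-0-[1-4]", "2-00:00:00", "nlp_dept", default=True),
    Partition("cv", "ip-10-1-0-[5-8]", "2-00:00:00", "cv_dept"),
    Partition("dl", "ip-10-1-0-[9-12]", "1-00:00:00", "dl_dept"),
]

if __name__ == "__main__":
    print(render_partitions(TEAMS))
```

The AllowAccounts field is one way to tie a partition to a departmental account once slurmdbd accounting is enabled; whether the university used it is not stated in this post.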
To support fractional GPU sharing and efficient utilization, we enabled Generic Resource (GRES) configuration. With GPU stripping, multiple users can access GPUs on the same node without contention. The GRES setup followed the guidelines from the Amazon SageMaker HyperPod workshop.
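To make the GRES idea concrete, here is a minimal sketch that generates the two configuration fragments SLURM typically needs for GPU scheduling: a gres.conf line that maps device files to the `gpu` resource, and the matching slurm.conf lines. The node name, GPU count, and GPU type are assumptions for illustration, and the helper names are hypothetical. On SageMaker HyperPod these files are normally produced by lifecycle scripts, so treat this as a picture of the result, not of the workshop's procedure.

```python
def render_gres_conf(node_name, gpu_count, gpu_type=None):
    """Return a gres.conf line exposing the node's NVIDIA devices to SLURM."""
    type_part = f"Type={gpu_type} " if gpu_type else ""
    if gpu_count == 1:
        devices = "/dev/nvidia0"
    else:
        devices = f"/dev/nvidia[0-{gpu_count - 1}]"
    return f"NodeName={node_name} Name=gpu {type_part}File={devices}"


def render_node_gres(node_name, gpu_count, gpu_type=None):
    """Return the slurm.conf fragment: GresTypes plus the node's Gres count."""
    gres = f"gpu:{gpu_type}:{gpu_count}" if gpu_type else f"gpu:{gpu_count}"
    return f"GresTypes=gpu\nNodeName={node_name} Gres={gres}"


if __name__ == "__main__":
    # Hypothetical 8-GPU node; real node definitions carry more fields (CPUs, memory).
    print(render_gres_conf("ip-10-1-0-1", 8, "h100"))
    print(render_node_gres("ip-10-1-0-1", 8, "h100"))
```

With GRES in place, a job can request a subset of a node's GPUs (for example `srun --gres=gpu:1 ...`), which lets several users share the same node.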
Provision and validate the cluster
We validated the cluster-config.json and provisioning-parameters.json files using the AWS CLI and a SageMaker HyperPod validation script:
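As a rough illustration of the kind of cross-check such a validation performs (this is not the workshop's script), the following sketch confirms that every instance group named in provisioning-parameters.json is defined in cluster-config.json. The field names `controller_group`, `login_group`, `worker_groups`, and `instance_group_name` are assumptions about the provisioning file's layout; `ClusterName`, `InstanceGroups`, and `InstanceGroupName` follow the SageMaker CreateCluster request shape.

```python
import json


def load_json(path):
    """Read one of the two configuration files from disk."""
    with open(path) as f:
        return json.load(f)


def check_configs(cluster_cfg, prov_params):
    """Return a list of problems; an empty list means the two files agree."""
    problems = []
    if not cluster_cfg.get("ClusterName"):
        problems.append("cluster-config.json: ClusterName is missing")

    groups = {g.get("InstanceGroupName") for g in cluster_cfg.get("InstanceGroups", [])}
    if not groups:
        problems.append("cluster-config.json: no InstanceGroups defined")

    # Collect every instance group the provisioning file refers to (assumed keys).
    referenced = [prov_params.get("controller_group")]
    if "login_group" in prov_params:
        referenced.append(prov_params["login_group"])
    referenced += [w.get("instance_group_name") for w in prov_params.get("worker_groups", [])]

    for name in referenced:
        if name not in groups:
            problems.append(f"provisioning-parameters.json: unknown instance group {name!r}")
    return problems


# Example usage (file names as given in this post):
#   issues = check_configs(load_json("cluster-config.json"),
#                          load_json("provisioning-parameters.json"))
#   for issue in issues:
#       print(issue)
```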
Then we created the cluster:
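One way to script this step is with the AWS SDK for Python. The sketch below assumes that cluster-config.json uses the same top-level keys as the SageMaker `create_cluster` request (for example `ClusterName`, `InstanceGroups`, and `VpcConfig`); the helper name `create_cluster_from_file` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
import json


def create_cluster_from_file(sagemaker_client, config_path="cluster-config.json"):
    """Create a HyperPod cluster from a JSON file and return the new cluster's ARN."""
    with open(config_path) as f:
        config = json.load(f)
    # Assumes the file's keys match the CreateCluster parameter names one-to-one.
    response = sagemaker_client.create_cluster(**config)
    return response["ClusterArn"]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = create_cluster_from_file(boto3.client("sagemaker"))
#   print(arn)
```

The AWS CLI equivalent is `aws sagemaker create-cluster --cli-input-json file://cluster-config.json`.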
Implement cost tracking and budget enforcement
To monitor usage and control costs, each SageMaker HyperPod resource (for example, Amazon EC2, FSx for Lustre, and others) was tagged with a unique ClusterName tag. AWS Budgets and AWS Cost Explorer reports were configured to track monthly spending per cluster. Additionally, alerts were set up to notify researchers if they approached their quota or budget thresholds.
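The per-cluster budget described above can be sketched as a request payload for the AWS Budgets `create_budget` operation. The budget name pattern, dollar limit, 80 percent alert threshold, and email address below are hypothetical; only the ClusterName tag key comes from this post. The sketch assumes ClusterName has been activated as a cost allocation tag, which is required before tag-based cost filters return data.

```python
def build_cluster_budget(cluster_name, monthly_limit_usd, alert_percent=80.0,
                         email="researcher@example.edu"):
    """Build the Budget and NotificationsWithSubscribers arguments for one cluster."""
    budget = {
        "BudgetName": f"hyperpod-{cluster_name}-monthly",  # hypothetical naming scheme
        "BudgetLimit": {"Amount": f"{monthly_limit_usd:.2f}", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # User-defined tags are filtered as "user:<key>$<value>".
        "CostFilters": {"TagKeyValue": [f"user:ClusterName${cluster_name}"]},
    }
    notifications = [{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": alert_percent,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
    }]
    return budget, notifications


# Example usage (requires boto3 and AWS credentials; account ID is a placeholder):
#   import boto3
#   budget, notifications = build_cluster_budget("ml-cluster", 5000)
#   boto3.client("budgets").create_budget(
#       AccountId="111122223333", Budget=budget,
#       NotificationsWithSubscribers=notifications)
```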
This integration helped facilitate efficient utilization and predictable research spending.
Enable load balancing for login nodes
As the number of concurrent users increased, the university adopted a multi-login node architecture. Two login nodes were deployed in EC2 Auto Scaling groups. A Network Load Balancer was configured with target groups to route SSH and Systems Manager traffic. Lastly, AWS Lambda functions enforced session limits per user using Run-As tags with Session Manager, a capability of Systems Manager.
For details about the full implementation, see Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience.
Configure federated access and user mapping
To facilitate secure and seamless access for researchers, the institution integrated AWS IAM Identity Center with their on-premises Active Directory (AD) using AWS Directory Service. This allowed for unified control and administration of user identities and access privileges across SageMaker HyperPod accounts. The implementation consisted of the following key components:
- Federated user integration – We mapped AD users to POSIX user names using Session Manager run-as tags, allowing fine-grained control over compute node access
- Secure session management – We configured Systems Manager to make sure users access compute nodes using their own accounts, not the default ssm-user
- Identity-based tagging – Federated user names were automatically mapped to user directories, workloads, and budgets through resource tags
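The mapping from an AD identity to a POSIX user name can be pictured with the small sketch below. The normalization rule (lowercase, strip the domain, replace unsupported characters with an underscore) is an assumption for illustration; the post does not describe the university's actual rule. `SSMSessionRunAs` is the tag key that Session Manager reads to select the operating system user when Run As support is enabled.

```python
import re


def posix_username(ad_principal, max_len=32):
    """Derive a POSIX-safe login name from 'user@domain' or 'DOMAIN\\user'."""
    local = ad_principal.split("@", 1)[0].split("\\")[-1].lower()
    name = re.sub(r"[^a-z0-9_-]", "_", local)
    # POSIX user names conventionally start with a letter or underscore.
    if not re.match(r"[a-z_]", name):
        name = "_" + name
    return name[:max_len]


def run_as_tag(ad_principal):
    """Return the tag that tells Session Manager which OS user to run as."""
    return {"Key": "SSMSessionRunAs", "Value": posix_username(ad_principal)}


if __name__ == "__main__":
    # Hypothetical principal for illustration only.
    print(run_as_tag("Jane.Doe@corp.example.edu"))
```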
For complete step-by-step guidance, refer to the Amazon SageMaker HyperPod workshop.
This approach streamlined user provisioning and access control while maintaining strong alignment with institutional policies and compliance requirements.
Post-deployment optimizations
To help prevent unnecessary consumption of compute resources by idle sessions, the university configured SLURM with Pluggable Authentication Modules (PAM). This setup enforces automatic logout for users after their SLURM jobs are complete or canceled, supporting prompt availability of compute nodes for queued jobs.
The configuration improved job scheduling throughput by freeing idle nodes immediately and reduced administrative overhead in managing inactive sessions.
Additionally, QoS policies were configured to control resource consumption, limit job durations, and enforce fair GPU access across users and departments. For example:
- MaxTRESPerUser – Makes sure GPU or CPU usage per user stays within defined limits
- MaxWallDurationPerJob – Helps prevent excessively long jobs from monopolizing nodes
- Priority weights – Aligns priority scheduling based on research group or project
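The QoS settings listed above are applied with SLURM's `sacctmgr` tool. The following sketch builds the commands for one QoS without running them; the QoS name, GPU limit, wall-time limit, and priority value are hypothetical, and the per-QoS `Priority` field is used here as one possible reading of "priority weights".

```python
import shlex


def qos_commands(name, max_gpus_per_user, max_wall, priority):
    """Return sacctmgr argument lists that create and configure one QoS."""
    return [
        ["sacctmgr", "--immediate", "add", "qos", name],
        [
            "sacctmgr", "--immediate", "modify", "qos", name, "set",
            f"MaxTRESPerUser=gres/gpu={max_gpus_per_user}",  # cap GPUs per user
            f"MaxWallDurationPerJob={max_wall}",             # cap job run time
            f"Priority={priority}",                          # relative scheduling priority
        ],
    ]


if __name__ == "__main__":
    # Hypothetical values: 4 GPUs per user, 24-hour jobs, priority 100.
    for cmd in qos_commands("nlp_qos", 4, "24:00:00", 100):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a node where `sacctmgr` is available and slurmdbd accounting is enabled.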
These enhancements facilitated an optimized, balanced HPC environment that aligns with the shared infrastructure model of academic research institutions.
Clean up
To delete the resources and avoid incurring ongoing charges, complete the following steps:
- Delete the SageMaker HyperPod cluster:
- Delete the CloudFormation stack used for the SageMaker HyperPod infrastructure:
This will automatically remove associated resources, such as the VPC and subnets, FSx for Lustre file system, S3 bucket, and IAM roles. If you created these resources outside of CloudFormation, you must delete them manually.
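The two cleanup steps can also be scripted with the AWS SDK for Python, as in this minimal sketch. The helper names and the cluster and stack names are placeholders. Cluster deletion is asynchronous, so the sketch keeps the two steps as separate functions; delete the stack only after the cluster is gone.

```python
def delete_hyperpod_cluster(sagemaker_client, cluster_name):
    """Request deletion of the SageMaker HyperPod cluster."""
    sagemaker_client.delete_cluster(ClusterName=cluster_name)


def delete_infrastructure_stack(cloudformation_client, stack_name):
    """Request deletion of the CloudFormation stack that holds the infrastructure."""
    cloudformation_client.delete_stack(StackName=stack_name)


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   delete_hyperpod_cluster(boto3.client("sagemaker"), "ml-cluster")
#   # ...wait until the cluster no longer exists, then:
#   delete_infrastructure_stack(boto3.client("cloudformation"), "hyperpod-infra")
```

The AWS CLI equivalents are `aws sagemaker delete-cluster --cluster-name <name>` and `aws cloudformation delete-stack --stack-name <name>`.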
Conclusion
SageMaker HyperPod provides research universities with a powerful, fully managed HPC solution tailored for the unique demands of AI workloads. By automating infrastructure provisioning, scaling, and resource optimization, institutions can accelerate innovation while maintaining budget control and operational efficiency. Through customized SLURM configurations, GPU sharing using GRES, federated access, and robust login node balancing, this solution highlights the potential of SageMaker HyperPod to transform research computing, so researchers can focus on science, not infrastructure.
For more details on making the most of SageMaker HyperPod, check out the SageMaker HyperPod workshop and explore further blog posts about SageMaker HyperPod.
About the authors
Tasneem Fathima is a Senior Solutions Architect at AWS. She supports Higher Education and Research customers in the United Arab Emirates to adopt cloud technologies, improve their time to science, and innovate on AWS.
Mohamed Hossam is a Senior HPC Cloud Solutions Architect at Brightskies, specializing in high-performance computing (HPC) and AI infrastructure on AWS. He supports universities and research institutions across the Gulf and Middle East in harnessing GPU clusters, accelerating AI adoption, and migrating HPC/AI/ML workloads to the AWS Cloud. In his free time, Mohamed enjoys playing video games.

