Modern generative AI model providers require unprecedented computational scale, with pre-training often involving thousands of accelerators running continuously for days, and sometimes months. Foundation models (FMs) demand distributed training clusters (coordinated groups of accelerated compute instances, using frameworks like PyTorch) to parallelize workloads across hundreds of accelerators, such as AWS Trainium and AWS Inferentia chips or NVIDIA GPUs.
Orchestrators like SLURM and Kubernetes manage these complex workloads, scheduling jobs across nodes, managing cluster resources, and processing requests. Paired with AWS infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances, Elastic Fabric Adapter (EFA), and distributed file systems like Amazon Elastic File System (Amazon EFS) and Amazon FSx, these ultra clusters can run large-scale machine learning (ML) training and inference, handling parallelism, gradient synchronization, collective communications, and even routing and load balancing. However, at scale, even robust orchestrators face challenges around cluster resilience. Distributed training workloads in particular run synchronously, because each training step requires the participating instances to complete their calculations before proceeding to the next step. This means that if a single instance fails, the entire job fails. The likelihood of these failures increases with the size of the cluster.
Although resilience and infrastructure reliability can be a challenge, the developer experience is equally pivotal. Traditional ML workflows create silos: data and research scientists prototype on local Jupyter notebooks or Visual Studio Code instances that lack access to cluster-scale storage, while engineers manage production jobs through separate SLURM or Kubernetes (kubectl or helm, for example) interfaces. This fragmentation has consequences, including mismatches between notebook and production environments, lack of local access to cluster storage, and, most importantly, sub-optimal use of ultra clusters.
In this post, we explore these challenges. In particular, we propose a solution to enhance the data scientist experience on Amazon SageMaker HyperPod, a resilient ultra cluster solution.
Amazon SageMaker HyperPod
SageMaker HyperPod is a compute environment purpose-built for large-scale frontier model training. You can build resilient clusters for ML workloads and develop state-of-the-art frontier models. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation removes the need for manual intervention, which means you can train in distributed settings for weeks or months with minimal disruption.
To learn more about the resilience and Total Cost of Ownership (TCO) benefits of SageMaker HyperPod, check out Reduce ML training costs with Amazon SageMaker HyperPod. As of writing this post, SageMaker HyperPod supports both SLURM and Amazon Elastic Kubernetes Service (Amazon EKS) as orchestrators.
To deploy a SageMaker HyperPod cluster, refer to the SageMaker HyperPod workshops (SLURM, Amazon EKS). To learn more about what is being deployed, check out the architecture diagrams later in this post. You can choose either of the two orchestrators based on your preference.
Amazon SageMaker Studio
Amazon SageMaker Studio is a fully integrated development environment (IDE) designed to streamline the end-to-end ML lifecycle. It provides a unified, web-based interface where data scientists and developers can perform ML tasks, including data preparation, model building, training, tuning, evaluation, deployment, and monitoring.
By centralizing these capabilities, SageMaker Studio removes the need to switch between multiple tools, significantly enhancing productivity and collaboration. SageMaker Studio supports a variety of IDEs, such as JupyterLab notebooks, Code Editor (based on Code-OSS, Visual Studio Code Open Source), and RStudio, offering flexibility for different development preferences. SageMaker Studio supports private and shared spaces, so teams can collaborate effectively while optimizing resource allocation. Shared spaces allow multiple users to access the same compute resources across profiles, and private spaces provide dedicated environments for individual users. This flexibility empowers data scientists and developers to seamlessly scale their compute resources and enhance collaboration within SageMaker Studio. Additionally, it integrates with advanced tooling like managed MLflow and Partner AI Apps to streamline experiment tracking and accelerate AI-driven innovation.
Distributed file systems: Amazon FSx
Amazon FSx for Lustre is a fully managed file storage service designed to provide high-performance, scalable, and cost-effective storage for compute-intensive workloads. Powered by the Lustre architecture, it is optimized for applications requiring access to fast storage, such as ML, high-performance computing, video processing, financial modeling, and big data analytics.
FSx for Lustre delivers sub-millisecond latencies, scales up to 1 GBps per TiB of throughput, and provides millions of IOPS. This makes it ideal for workloads demanding rapid data access and processing. The service integrates with Amazon Simple Storage Service (Amazon S3), enabling seamless access to S3 objects as files and facilitating fast data transfers between Amazon FSx and Amazon S3. Updates in S3 buckets are automatically reflected in FSx file systems and vice versa. For more information on this integration, check out Exporting files using HSM commands and Linking your file system to an Amazon S3 bucket.
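The S3 synchronization described above is driven by a data repository association on the file system. A minimal AWS CLI sketch is shown below, assuming a Persistent 2 file system; the file system ID, mount path, and bucket name are placeholders, not values from this solution:
# Link an S3 prefix to a directory on the FSx for Lustre file system, with automatic
# import and export of new, changed, and deleted objects
aws fsx create-data-repository-association \
  --file-system-id fs-0123456789abcdef0 \
  --file-system-path /s3-data \
  --data-repository-path s3://my-training-data-bucket/datasets \
  --s3 'AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}'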
The concept behind mounting an FSx for Lustre file system to SageMaker Studio spaces
You can use FSx for Lustre as a shared high-performance file system to connect SageMaker Studio domains with SageMaker HyperPod clusters, streamlining ML workflows for data scientists and researchers. By using FSx for Lustre as a shared volume, you can build and refine your training or fine-tuning code using IDEs like JupyterLab and Code Editor in SageMaker Studio, prepare datasets, and save your work directly in the FSx for Lustre volume. This same volume is mounted by SageMaker HyperPod during the execution of training workloads, enabling direct access to prepared data and code without repetitive data transfers or custom image creation. Data scientists can iteratively make changes, prepare data, and submit training workloads directly from SageMaker Studio, providing consistency across development and execution environments while enhancing productivity. This integration removes the overhead of moving data between environments and provides a seamless workflow for large-scale ML projects requiring high throughput and low-latency storage. You can configure FSx for Lustre volumes to provide file system access to SageMaker Studio user profiles in two distinct ways, each tailored to different collaboration and data management needs.
Option 1: Shared file system partition across all user profiles
Infrastructure administrators can set up a single FSx for Lustre file system partition shared across the user profiles within a SageMaker Studio domain, as illustrated in the following diagram.
Figure 1: An FSx for Lustre file system partition shared across multiple user profiles within a single SageMaker Studio domain
Key benefits of this approach include:
- Shared project directories – Teams working on large-scale projects can collaborate seamlessly by accessing a shared partition. This makes it possible for multiple users to work on the same files, datasets, and FMs without duplicating resources.
- Simplified file management – You don't have to manage private storage; instead, you can rely on the shared directory for your file-related needs, reducing complexity.
- Improved data governance and security – The shared FSx for Lustre partition is centrally managed by the infrastructure admin, enabling robust access controls and data policies that maintain the security and integrity of shared resources.
Option 2: Dedicated file system partition for each user profile
Alternatively, administrators can configure dedicated FSx for Lustre file system partitions for each individual user profile in SageMaker Studio, as illustrated in the following diagram.

Figure 2: An FSx for Lustre file system with a dedicated partition per user
This setup provides personalized storage and facilitates data isolation. Key benefits include:
- Individual data storage and analysis – Each user gets a private partition to store personal datasets, models, and files. This facilitates independent work on projects with clear segregation by user profile.
- Centralized data management – Administrators retain centralized control over the FSx for Lustre file system, facilitating secure backups and direct access while maintaining data protection for users.
- Cross-instance file sharing – You can access your private files across multiple SageMaker Studio spaces and IDEs, because the FSx for Lustre partition provides persistent storage at the user profile level.
Solution overview
The following diagram illustrates the architecture of SageMaker HyperPod with SLURM integration.

Figure 3: Architecture diagram for SageMaker HyperPod with SLURM as the orchestrator
The following diagram illustrates the architecture of SageMaker HyperPod with Amazon EKS integration.

Figure 4: Architecture diagram for SageMaker HyperPod with Amazon EKS as the orchestrator
These diagrams illustrate what you will provision as part of this solution. In addition to the SageMaker HyperPod cluster you already have, you provision a SageMaker Studio domain and attach the cluster's FSx for Lustre file system to that domain. Depending on whether you choose a SharedFSx, you can either attach the file system so that it is mounted with a single partition shared across the user profiles you configure within your SageMaker domain, or attach it so that it is mounted with multiple partitions for multiple isolated users. To learn more about this distinction, refer to the earlier section on the concept behind mounting an FSx for Lustre file system to SageMaker Studio spaces.
In the following sections, we present a walkthrough of this integration by demonstrating, on a SageMaker HyperPod with Amazon EKS cluster, how to:
- Attach a SageMaker Studio domain.
- Use that domain to fine-tune the DeepSeek-R1-Distill-Qwen-14B model using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.
Prerequisites
This post assumes that you have a SageMaker HyperPod cluster.
Deploy resources using AWS CloudFormation
As part of this integration, we provide an AWS CloudFormation stack template (SLURM, Amazon EKS). Before deploying the stack, make sure you have a SageMaker HyperPod cluster set up.
In the stack for SageMaker HyperPod with SLURM, you create the following resources:
- A SageMaker Studio domain.
- Lifecycle configurations for installing the necessary packages for the SageMaker Studio IDEs, including SLURM. Lifecycle configurations are created for both JupyterLab and Code Editor. We set them up so that your Code Editor or JupyterLab instance is essentially configured as a login node for your SageMaker HyperPod cluster.
- An AWS Lambda function that:
  - Associates the created security-group-for-inbound-nfs security group with the SageMaker Studio domain.
  - Associates the security-group-for-inbound-nfs security group with the FSx for Lustre ENIs.
  - Optionally:
    - If SharedFSx is set to True, shares the created partition in the FSx for Lustre volume and associates it with the SageMaker Studio domain.
    - If SharedFSx is set to False, creates the partition /{user_profile_name} and associates it with the corresponding SageMaker Studio user profile.
In the stack for SageMaker HyperPod with Amazon EKS, you create the following resources:
- A SageMaker Studio domain.
- Lifecycle configurations for installing the necessary packages for the SageMaker Studio IDEs, such as kubectl and jq. Lifecycle configurations are created for both JupyterLab and Code Editor.
. Lifecycle configurations will probably be created for each JupyterLab and Code Editor. - A Lambda operate that:
- Associates the created
security-group-for-inbound-nfs
safety group to the SageMaker Studio area. - Associates the
security-group-for-inbound-nfs
safety group to the FSx for Lustre ENIs. - Optionally available:
- If
SharedFSx
is about toTrue
, the created partition is shared within the FSx for Lustre quantity and related to the SageMaker Studio area. - If
SharedFSx
is about toFalse
, a Lambda operate creates the partition/{user_profile_name}
and associates it to the SageMaker Studio person profile.
- If
- Associates the created
The main difference between the two implementations lies in the lifecycle configurations for the JupyterLab or Code Editor servers, because of how you interact with the cluster through the different orchestrators (kubectl or helm for Amazon EKS, and ssm or ssh for SLURM). In addition to mounting your cluster's FSx for Lustre file system, for SageMaker HyperPod with Amazon EKS the lifecycle scripts configure your JupyterLab or Code Editor server to run the well-known Kubernetes command line interfaces, including kubectl, eksctl, and helm. They also preconfigure your Kubernetes context, so that your cluster is ready to use as soon as your JupyterLab or Code Editor instance is up.
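Conceptually, the Amazon EKS lifecycle configuration boils down to a handful of shell steps like the following. This is a simplified sketch, not the exact script from the CloudFormation template; the kubectl version, Region, and cluster name are placeholders:
# Install kubectl (pin a version compatible with your EKS cluster)
curl -LO "https://dl.k8s.io/release/v1.29.0/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

# Install Helm using the official installer script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Preconfigure the Kubernetes context so the cluster is ready as soon as the space starts
aws eks update-kubeconfig --name my-hyperpod-eks-cluster --region us-east-1

# Sanity check: the HyperPod worker nodes should be listed
kubectl get nodes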
You can find the lifecycle configuration for SageMaker HyperPod with Amazon EKS in the deployed CloudFormation stack template. SLURM works a bit differently. We designed the lifecycle configuration so that your JupyterLab or Code Editor instance serves as a login node for your SageMaker HyperPod with SLURM cluster. Login nodes let you log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. This also makes it possible to run monitoring servers like Aim, TensorBoard, Grafana, or Prometheus. The lifecycle configuration therefore automatically installs SLURM and configures it so that you can interface with your cluster from your JupyterLab or Code Editor instance. You can find the script used to configure SLURM on these instances on GitHub.
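After the lifecycle configuration runs, the JupyterLab or Code Editor instance behaves like any other SLURM login node, so the usual client commands work from its terminal. The batch script name and options below are illustrative:
# Inspect the partitions and nodes of the HyperPod cluster
sinfo

# Submit a distributed fine-tuning job described in an sbatch script
sbatch --nodes=8 train-deepseek.sbatch

# Check the queue and follow the job output
squeue -u $USER
tail -f slurm-<job-id>.out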
Both configurations use the same logic to mount the file systems. The instructions found in Adding a custom file system to a domain are implemented in a custom resource (a Lambda function) defined in the CloudFormation stack template.
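In practice, that custom resource automates roughly the equivalent of registering the FSx for Lustre volume as a custom file system on the domain (or on individual user profiles). The following AWS CLI sketch shows the shared case; the domain ID, file system ID, and path are placeholders, and the exact field names (notably FSxLustreFileSystemConfig) should be verified against the current SageMaker API reference:
aws sagemaker update-domain \
  --domain-id d-xxxxxxxxxxxx \
  --default-user-settings '{
      "CustomFileSystemConfigs": [
        {
          "FSxLustreFileSystemConfig": {
            "FileSystemId": "fs-0123456789abcdef0",
            "FileSystemPath": "/shared"
          }
        }
      ]
    }'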
For more details on deploying the provided stacks, check out the respective workshop pages for SageMaker HyperPod with SLURM and SageMaker HyperPod with Amazon EKS.
Data science journey on SageMaker HyperPod with SageMaker Studio
As a data scientist, after you set up the SageMaker HyperPod and SageMaker Studio integration, you can log in to the SageMaker Studio environment through your user profile.

Figure 5: You can log in to your SageMaker Studio environment through your created user profile
In SageMaker Studio, you can select your preferred IDE to start prototyping your fine-tuning workload, and create a managed MLflow tracking server to track training and system metrics during the execution of the workload.

Figure 6: Select your preferred IDE to connect to your HyperPod cluster
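You can create the managed MLflow tracking server from the SageMaker Studio UI, or with a single AWS CLI call along these lines. This is a sketch; the server name, bucket, and role ARN are placeholders, and the role must grant the server access to the artifact bucket:
aws sagemaker create-mlflow-tracking-server \
  --tracking-server-name hyperpod-finetuning-experiments \
  --artifact-store-uri s3://my-mlflow-artifacts-bucket/experiments \
  --role-arn arn:aws:iam::111122223333:role/MyMLflowTrackingServerRole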
The SageMaker HyperPod clusters page provides information about the available clusters and details about their nodes.
Figures 7, 8: You can also see information about your SageMaker HyperPod cluster in SageMaker Studio
For this post, we selected Code Editor as our preferred IDE. The automation provided by this solution preconfigures the FSx for Lustre file system and the lifecycle configuration that installs the modules required to submit workloads on the cluster using the hyperpod-cli or kubectl. For the instance type, you can choose from a range of available instances; in our case, we opted for the default ml.t3.medium.

Figure 9: Code Editor configuration
The development environment already presents the partition mounted as a file system, where you can start prototyping your code for data preparation or model fine-tuning. For the purpose of this example, we fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Figure 10: Your cluster's files are accessible directly in your Code Editor space, because the file system is mounted directly to it. This means you can develop locally and deploy onto your ultra cluster.
The repository is organized as follows (a short usage sketch for the preparation scripts follows the list):
- download_model.py – The script to download the open source model directly into the FSx for Lustre volume. This provides faster and more consistent execution of the training workload on SageMaker HyperPod.
- scripts/dataprep.py – The script to download and prepare the dataset for the fine-tuning workload. In the script, we format the dataset using the prompt style defined for the DeepSeek-R1 models and save the dataset in the FSx for Lustre volume. This provides faster execution of the training workload by avoiding asset copies from other data repositories.
- scripts/train.py – The script containing the fine-tuning logic, using open source modules like Hugging Face Transformers, with optimization and distribution techniques such as FSDP and QLoRA.
- scripts/evaluation.py – The script to run a ROUGE evaluation on the fine-tuned model.
- pod-finetuning.yaml – The manifest file containing the definition of the container used to execute the fine-tuning workload on the SageMaker HyperPod cluster.
- pod-evaluation.yaml – The manifest file containing the definition of the container used to execute the evaluation workload on the SageMaker HyperPod cluster.
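From the Code Editor terminal, the preparation steps amount to running these scripts against the mounted FSx for Lustre path. The invocations below are illustrative; check each script in the repository for the exact arguments it expects:
# Download the base model weights into the shared FSx for Lustre volume
python download_model.py

# Download the dataset, apply the DeepSeek-R1 prompt format, and write the
# train/test splits to the shared volume
python scripts/dataprep.py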
After downloading the model and preparing the dataset, you can start prototyping the fine-tuning script directly in the IDE.

Figure 11: You can start developing locally
The updates made to the script are automatically reflected in the container that executes the workload. When you are ready, you can define the manifest file for the execution of the workload on SageMaker HyperPod. The following list highlights the key components of the manifest, and a condensed sketch follows the list. For a complete example of a Kubernetes manifest file, refer to the awsome-distributed-training GitHub repository.
The key components are as follows:
- replicas: 8 – Specifies that eight worker pods are created for this PyTorchJob. This is particularly important for distributed training because it determines the scale of your training job: having eight replicas means your PyTorch training is distributed across eight separate pods, allowing for parallel processing and faster training times.
- Persistent volume configuration – This includes the following:
  - name: fsx-volume – Defines a named volume that is used for storage.
  - persistentVolumeClaim – Indicates that the volume uses the Kubernetes persistent storage mechanism.
  - claimName: fsx-claim – References a pre-created PersistentVolumeClaim that points to the FSx for Lustre file system used in the SageMaker Studio environment.
- Container image – The container image used to run the fine-tuning workload on the cluster.
- Training command – The command defines the execution instructions for the training workload:
  - pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt – Installs dependencies at runtime, customizing the container with the packages and modules required for the fine-tuning workload.
  - torchrun … /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py – The actual training script, pointing to the shared FSx for Lustre file system, in the partition created for the Data-Scientist SageMaker Studio user profile.
  - --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml – Arguments provided to the training script, including the definition of the training parameters and additional variables used during the execution of the workload.
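The following condensed sketch shows how these components fit together in the PyTorchJob manifest. It is not the complete manifest: the container image URI is a placeholder, the torchrun launch flags are elided, and fields such as resource requests and EFA settings are omitted; refer to the complete example in the awsome-distributed-training repository:
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: deepseek-r1-qwen-14b-fine-tuning
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8                          # eight worker pods participate in the distributed job
      restartPolicy: OnFailure
      template:
        spec:
          volumes:
            - name: fsx-volume             # named volume used for storage
              persistentVolumeClaim:
                claimName: fsx-claim       # pre-created PVC pointing to the FSx for Lustre file system
          containers:
            - name: pytorch
              image: <your-training-image-uri>   # placeholder: the training container image
              volumeMounts:
                - name: fsx-volume
                  mountPath: /data         # same paths as seen from SageMaker Studio
              command:
                - /bin/bash
                - -c
                - >-
                  pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt &&
                  torchrun <launch options>
                  /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py
                  --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml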
The args-fine-tuning.yaml file contains the definition of the training parameters to provide to the script. In addition, the training script is set up to save training and system metrics to the managed MLflow server in SageMaker Studio when the tracking server Amazon Resource Name (ARN) and experiment name are provided:
# Location in the FSx for Lustre file system where the base model was saved
model_id: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/DeepSeek-R1-Distill-Qwen-14B"
mlflow_uri: "${MLFLOW_ARN}"
mlflow_experiment_name: "deepseek-r1-distill-llama-8b-agent"
# sagemaker specific parameters
# File system path where the workload will store the model
output_dir: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/model/"
# File system path where the workload can access the train dataset
train_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/train/"
# File system path where the workload can access the test dataset
test_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/test/"
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
learning_rate: 2e-4                  # learning rate scheduler
num_train_epochs: 1                  # number of training epochs
per_device_train_batch_size: 2       # batch size per device during training
per_device_eval_batch_size: 2        # batch size for evaluation
gradient_accumulation_steps: 2       # number of steps before performing a backward/update pass
gradient_checkpointing: true         # use gradient checkpointing
bf16: true                           # use bfloat16 precision
tf32: false                          # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  backward_prefetch: "backward_pre"
  cpu_ram_efficient_loading: true
  offload_params: true
  forward_prefetch: false
  use_orig_params: true
merge_weights: true
The parameters model_id, output_dir, train_dataset_path, and test_dataset_path follow the same logic described for the manifest file and refer to the location where the FSx for Lustre volume is mounted in the container, under the Data-Scientist partition created for the SageMaker Studio user profile.
When you have finished developing the fine-tuning script and defined the training parameters for the workload, you can deploy the workload with the following command:
$ kubectl apply -f pod-finetuning.yaml
service/etcd unchanged
deployment.apps/etcd unchanged
pytorchjob.kubeflow.org/deepseek-r1-qwen-14b-fine-tuning created
You can explore the logs of the workload execution directly from the SageMaker Studio IDE.

Figure 12: View the logs of the submitted training run directly in your Code Editor terminal
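For example, you can follow the run from the Code Editor terminal with standard kubectl commands (the worker pod name follows the PyTorchJob naming convention and is shown here for illustration):
# List the worker pods created by the PyTorchJob
kubectl get pods

# Stream the logs of the first worker
kubectl logs -f deepseek-r1-qwen-14b-fine-tuning-worker-0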
You can track training and system metrics from the managed MLflow server in SageMaker Studio.

Figure 13: SageMaker Studio integrates directly with a managed MLflow server, which you can use to track training and system metrics directly from your Studio domain
In the SageMaker HyperPod clusters section, you can explore cluster metrics thanks to the integration of SageMaker Studio with SageMaker HyperPod observability.

Figure 14: You can view additional cluster-level and infrastructure metrics, including GPU utilization, in the Compute > SageMaker HyperPod clusters section
At the conclusion of the fine-tuning workload, you can use the same cluster to run batch evaluation workloads on the model by deploying the pod-evaluation.yaml manifest file, which evaluates the fine-tuned model using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum). These metrics measure the similarity between machine-generated text and human-written reference text.
The evaluation script uses the same SageMaker HyperPod cluster and compares the results against the previously downloaded base model.
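Submitting the evaluation workload follows the same pattern as the fine-tuning job:
# Launch the batch evaluation on the same cluster
kubectl apply -f pod-evaluation.yaml

# Follow its progress (replace with the actual evaluation pod name)
kubectl logs -f <evaluation-pod-name>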
Clean up
To clean up your resources and avoid incurring additional charges, follow these steps:
- Delete unused SageMaker Studio resources.
- Optionally, delete the SageMaker Studio domain.
- If you created a SageMaker HyperPod cluster, delete the cluster to stop incurring costs.
- If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
Conclusion
In this post, we discussed how SageMaker HyperPod and SageMaker Studio can improve and speed up the development experience of data scientists by combining the IDEs and tooling of SageMaker Studio with the scalability and resiliency of SageMaker HyperPod with Amazon EKS. The solution simplifies the setup for the administrator of the centralized system by using the governance and security capabilities provided by the AWS services.
We recommend starting your journey by exploring the workshops Amazon EKS Support in Amazon SageMaker HyperPod and Amazon SageMaker HyperPod, and prototyping your customized large language model by using the resources available in the awsome-distributed-training GitHub repository.
A special thanks to our colleagues Nisha Nadkarni (Sr. WW Specialist SA GenAI), Anoop Saha (Sr. Specialist WW Foundation Models), and Mair Hasco (Sr. WW GenAI/ML Specialist) in the AWS ML Frameworks team for their support in the publication of this post.
About the authors
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.