This post is the second part of the GPT-OSS series focusing on model customization with Amazon SageMaker AI. In Part 1, we demonstrated fine-tuning GPT-OSS models using open source Hugging Face libraries with SageMaker training jobs, which support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand.
In this post, we show how you can fine-tune GPT-OSS models using recipes on SageMaker HyperPod and training jobs. SageMaker HyperPod recipes help you get started with training and fine-tuning popular publicly available foundation models (FMs) such as Meta's Llama, Mistral, and DeepSeek in just minutes, using either SageMaker HyperPod or training jobs. The recipes provide pre-built, validated configurations that alleviate the complexity of setting up distributed training environments while maintaining enterprise-grade performance and scalability. We outline the steps to fine-tune the GPT-OSS model on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, so GPT-OSS can handle structured chain-of-thought (CoT) reasoning across multiple languages.
Solution overview
This solution uses SageMaker HyperPod recipes to run a fine-tuning job on HyperPod using Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, or on training jobs. Recipes are processed through the SageMaker HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture, such as SageMaker HyperPod (Slurm or Amazon EKS) or training jobs. To learn more, see SageMaker HyperPod recipes.
For details on fine-tuning the GPT-OSS model, see Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging Face libraries.
In the following sections, we discuss the prerequisites for both options, and then move on to data preparation. The prepared data is saved to Amazon FSx for Lustre, which is used as the persistent file system for SageMaker HyperPod, or to Amazon Simple Storage Service (Amazon S3) for training jobs. We then use recipes to submit the fine-tuning job, and finally deploy the trained model to a SageMaker endpoint for testing and evaluation. The following diagram illustrates this architecture.
Prerequisites
To follow along, you must have the following prerequisites:
- A local development environment with AWS credentials configured for creating and accessing SageMaker resources, or a remote environment such as Amazon SageMaker Studio.
- For SageMaker HyperPod fine-tuning, complete the following:
- For fine-tuning the model using SageMaker training jobs, you must have one ml.p5.48xlarge instance (with 8 x NVIDIA H100 GPUs). If you don't have sufficient limits, request the following SageMaker quota on the Service Quotas console: P5 instance (ml.p5.48xlarge) for training jobs: 1.
It might take up to 24 hours for these limits to be approved. You can also use SageMaker training plans to reserve these instances for a specific timeframe and use case (cluster or training jobs usage). For more details, see Reserve training plans for your training jobs or HyperPod clusters.
Next, use your preferred development environment to prepare the dataset for fine-tuning. You can find the full code in the Generative AI using Amazon SageMaker repository on GitHub.
Data tokenization
We use the HuggingFaceH4/Multilingual-Thinking dataset, a multilingual reasoning dataset containing CoT examples translated into languages such as French, Spanish, and German. The recipe supports a sequence length of 4,000 tokens for the GPT-OSS 120B model. The following example code demonstrates how to tokenize the multilingual-thinking dataset. The recipe accepts data in Hugging Face (Arrow) format. After it's tokenized, you can save the processed dataset to disk.
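The tokenization step can be sketched as follows. The dataset and model identifiers match the post; the output path and the exact preprocessing columns are assumptions, and downloading the dataset and tokenizer requires network access, so that part is kept in an uncalled `main()`:

```python
from typing import Dict

def to_chat_text(example: Dict, tokenizer) -> Dict:
    """Render the dataset's `messages` field with the model's chat template."""
    text = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

def main():
    # Requires network access to download the dataset and tokenizer.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
    ds = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
    ds = ds.map(lambda ex: to_chat_text(ex, tokenizer))
    tokenized = ds.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=4000),
        remove_columns=ds.column_names,
    )
    # Saved in Arrow format, which the recipe accepts; path is an assumption
    tokenized.save_to_disk("/fsx/data/multilingual_thinking_tokenized")

# main()  # uncomment in your environment
```

The chat template renders each multilingual CoT conversation into a single string before tokenization, which keeps the reasoning turns intact within the 4,000-token window.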
Now that you have prepared and tokenized the dataset, you can fine-tune the GPT-OSS model on your data using either SageMaker HyperPod or training jobs. SageMaker training jobs are ideal for one-off or periodic training workloads that need temporary compute resources, offering a fully managed, on-demand experience. SageMaker HyperPod is optimal for continuous development and experimentation, providing a persistent, preconfigured, and failure-resilient cluster. Depending on your choice, skip to the appropriate section for the next steps.
Fine-tune the model using SageMaker HyperPod
To fine-tune the model using HyperPod, start by setting up the virtual environment and installing the necessary dependencies to execute the training job on the EKS cluster. Make sure the cluster is in the InService state before proceeding, and that you're using Python 3.9 or greater in your development environment.
Next, download and set up the SageMaker HyperPod recipes repository:
You can now use the SageMaker HyperPod recipe launch scripts to submit your training job. Using a recipe involves updating the k8s.yaml configuration file and executing the launch script.
In recipes_collection/cluster/k8s.yaml, update the persistent_volume_claims section. It mounts the FSx claim to the /fsx directory of each compute pod:
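A minimal sketch of that section, assuming an FSx for Lustre persistent volume claim named fsx-claim (check your actual claim name with kubectl get pvc; the exact keys should be verified against the k8s.yaml shipped in the recipes repository):

```yaml
# recipes_collection/cluster/k8s.yaml (excerpt)
persistent_volume_claims:
  - claimName: fsx-claim   # name of your FSx for Lustre PVC (assumption)
    mountPath: fsx         # mounted as /fsx inside each compute pod
```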
SageMaker HyperPod recipes provide a launch script for each recipe in the launcher_scripts directory. To fine-tune the GPT-OSS-120B model, update the launch script located at launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh and update the cluster_type parameter.
The updated launch script should look similar to the following code when running SageMaker HyperPod with Amazon EKS. Make sure cluster=k8s and cluster_type=k8s are set in the launch script:
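A sketch of what the updated script might contain; the recipe name and directory layout follow the post, while the hydra-style override keys are assumptions to verify against the script shipped in the repository. The command is echoed rather than executed here:

```shell
#!/bin/bash
# Sketch of launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh for EKS.
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-$(pwd)}"
TRAIN_DIR="/fsx/data/multilingual_thinking_tokenized"  # tokenized dataset on FSx
EXP_DIR="/fsx/experiment"                              # checkpoints are written here

# Assemble the launcher invocation; cluster=k8s and cluster_type=k8s select EKS.
CMD="python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
recipes=fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora \
cluster=k8s cluster_type=k8s \
recipes.run.name=hf-gpt-oss-120b-lora \
recipes.exp_manager.exp_dir=${EXP_DIR} \
recipes.model.data.train_dir=${TRAIN_DIR}"

echo "$CMD"   # dry run; run the command itself to actually submit the job
```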
When the script is ready, you can launch fine-tuning of the GPT-OSS 120B model using the following code:
After submitting the fine-tuning job, you can use the following command to verify successful submission. You should see the pods running in your cluster:
To check the logs for the job, use the kubectl logs command:
kubectl logs -f hf-gpt-oss-120b-lora-h2cwd-worker-0
You should see logs like the following when training starts and completes, with checkpoints written to the /fsx/experiment/checkpoints folder.
When training is complete, the final merged model can be found in the experiment directory path you defined in the launcher script, under /fsx/experiment/checkpoints/peft_full/steps_50/final-model.
Fine-tune using SageMaker training jobs
You can also use recipes directly with SageMaker training jobs using the SageMaker Python SDK. Training jobs automatically spin up the compute, load the input data, run the training script, save the model to your output location, and tear down the instances, for a smooth training experience.
The following code snippet shows how to use recipes with the PyTorch estimator. You can use the training_recipe parameter to specify the training or fine-tuning recipe to be used, and recipe_overrides for any parameters that need replacement. For training jobs, update the input, output, and results directories to locations in /opt/ml as required by SageMaker training jobs.
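This setup can be sketched as follows. The recipe path and the override keys are assumptions to check against the recipe's YAML, and the role and bucket placeholders must be filled in; the job submission needs AWS credentials, so it lives in an uncalled `main()`:

```python
def build_recipe_overrides() -> dict:
    """Point the recipe's data/results paths at the /opt/ml locations
    that SageMaker training jobs mount (override keys are assumptions)."""
    return {
        "run": {"results_dir": "/opt/ml/model"},
        "exp_manager": {"explicit_log_dir": "/opt/ml/output/tensorboard"},
        "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
    }

def main():
    # Requires the sagemaker SDK and AWS credentials.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        base_job_name="gpt-oss-recipe",
        role="<your-sagemaker-execution-role-arn>",
        instance_type="ml.p5.48xlarge",
        instance_count=1,
        training_recipe="fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora",
        recipe_overrides=build_recipe_overrides(),
    )
    estimator.fit(inputs={"train": "s3://<your-bucket>/multilingual-thinking/train"})

# main()  # uncomment in your environment
```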
After the job is submitted, you can monitor the status of your training job on the SageMaker console by choosing Training jobs under Training in the navigation pane. Choose the training job that starts with gpt-oss-recipe to view its details and logs. When the training job is complete, the outputs are saved to an S3 location. You can get the location of the output artifacts from the S3 model artifact section on the job details page.
Run inference
After you fine-tune your GPT-OSS model with SageMaker recipes on either SageMaker training jobs or SageMaker HyperPod, the output is a customized model artifact that merges the base model with the trained PEFT adapters. This final model is stored in Amazon S3 and can be deployed directly from Amazon S3 to SageMaker endpoints for real-time inference.
To serve GPT-OSS models, you must have the latest vLLM containers (v0.10.1 or later). A full list of vllm-openai Docker image versions is available on Docker Hub.
The steps to deploy your fine-tuned GPT-OSS model are outlined in this section.
Build the latest GPT-OSS container for your SageMaker endpoint
If you're deploying the model from SageMaker Studio using JupyterLab or the Code Editor, both environments come with Docker preinstalled. Make sure that you're using the SageMaker Distribution image v3.0 or later for compatibility. You can build your deployment container by running the following commands:
If you're running these commands from a local terminal or another environment, simply omit the %%bash line and run the commands as regular shell commands.
The build.sh script automatically builds and pushes a vllm-openai container optimized for SageMaker endpoints. After it's built, the custom SageMaker endpoint-compatible vLLM image is pushed to Amazon Elastic Container Registry (Amazon ECR). SageMaker endpoints can then pull this image from Amazon ECR at runtime to spin up the container for inference.
The following is an example of the build.sh script:
The Dockerfile defines how we convert an open source vLLM Docker image into a SageMaker hosting-compatible image. This involves extending the base vllm-openai image, adding the serve entrypoint script, and making it executable. See the following example Dockerfile:
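A minimal sketch of such a Dockerfile; the base image tag follows the v0.10.1 requirement above, and the serve script's name and location in the image are assumptions:

```dockerfile
# Extend the open source vLLM OpenAI-compatible image for SageMaker hosting
FROM vllm/vllm-openai:v0.10.1

# Add the SageMaker-style entrypoint and make it executable
COPY serve /usr/bin/serve
RUN chmod +x /usr/bin/serve

# SageMaker sends inference traffic to port 8080
EXPOSE 8080
ENTRYPOINT ["/usr/bin/serve"]
```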
The serve script acts as a translation layer between SageMaker hosting conventions and the vLLM runtime. You can keep the same deployment workflow you're accustomed to when hosting models on SageMaker endpoints, while SageMaker-specific configurations are automatically converted into the format expected by vLLM.
Key points to note about this script:
- It enforces the use of port 8080, which SageMaker requires for inference containers
- It dynamically translates environment variables prefixed with OPTION_ into CLI arguments for vLLM (for example, OPTION_MAX_MODEL_LEN=4096 becomes --max-model-len 4096)
- It prints the final set of arguments for visibility
- It launches the vLLM API server with the translated arguments
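The translation step above can be sketched as follows, shown in Python for clarity (the actual serve script in the container is a shell entrypoint; the helper name and the final exec line are illustrative):

```python
import os

def option_env_to_args(environ) -> list:
    """Translate OPTION_* environment variables into vLLM CLI flags,
    e.g. OPTION_MAX_MODEL_LEN=4096 -> --max-model-len 4096."""
    args = ["--port", "8080"]  # SageMaker requires inference containers on 8080
    for key in sorted(environ):
        if key.startswith("OPTION_"):
            flag = "--" + key[len("OPTION_"):].lower().replace("_", "-")
            args += [flag, environ[key]]
    return args

def main():
    args = option_env_to_args(os.environ)
    print("vLLM args:", args)  # print final arguments for visibility
    # The real entrypoint would now exec the vLLM API server with these
    # arguments (hypothetical invocation):
    # os.execvp("vllm", ["vllm", "serve", *args])

# main()  # uncomment inside the container
```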
The following is an example serve script:
Host customized GPT-OSS as a SageMaker real-time endpoint
Now you can deploy your fine-tuned GPT-OSS model using the ECR image URI you built in the previous step. In this example, the model artifacts are stored securely in an S3 bucket, and SageMaker downloads them into the container at runtime. Complete the following configurations:
- Set model_data to point to the S3 prefix where your model artifacts are located
- Set the OPTION_MODEL environment variable to /opt/ml/model, which is where SageMaker mounts the model inside the container
- (Optional) If you're serving a model from the Hugging Face Hub instead of Amazon S3, you can set OPTION_MODEL directly to the Hugging Face model ID instead
The endpoint startup might take several minutes while the model artifacts are downloaded and the container is initialized. The following is example deployment code:
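The deployment can be sketched as follows. The image URI, bucket, and endpoint name are placeholders; how model_data references your artifacts depends on whether they are compressed (the SageMaker SDK also accepts an S3DataSource dictionary for uncompressed prefixes). The AWS calls live in an uncalled `main()`:

```python
def vllm_env(model_path: str = "/opt/ml/model", max_model_len: int = 4000) -> dict:
    """Container environment; OPTION_* keys become vLLM CLI flags."""
    return {
        "OPTION_MODEL": model_path,  # where SageMaker mounts the model artifacts
        "OPTION_MAX_MODEL_LEN": str(max_model_len),
    }

def main():
    # Requires the sagemaker SDK and AWS credentials.
    import sagemaker
    from sagemaker.model import Model

    model = Model(
        image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-sagemaker:latest",
        model_data="s3://<your-bucket>/gpt-oss-recipe/output/model.tar.gz",
        role=sagemaker.get_execution_role(),
        env=vllm_env(),
    )
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.p5.48xlarge",
        endpoint_name="gpt-oss-120b-ft",  # name is an arbitrary choice
    )

# main()  # uncomment in your environment
```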
Sample inference
After your endpoint is deployed and in the InService state, you can invoke your fine-tuned GPT-OSS model using the SageMaker Python SDK.
The following is an example predictor setup:
The modified vLLM container is fully compatible with the OpenAI-style messages input format, making it straightforward to send chat-style requests:
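A chat-style request can be sketched as follows; the endpoint name matches the placeholder used at deployment and the system prompt is illustrative, and the invocation itself needs AWS credentials, so it sits in an uncalled `main()`:

```python
def build_chat_payload(prompt: str, max_tokens: int = 512) -> dict:
    """OpenAI-style messages payload accepted by the vLLM container."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful multilingual assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def main():
    # Requires the sagemaker SDK, AWS credentials, and a deployed endpoint.
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    predictor = Predictor(
        endpoint_name="gpt-oss-120b-ft",  # endpoint name is an assumption
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
    response = predictor.predict(
        build_chat_payload("Explique le raisonnement en chaîne en une phrase.")
    )
    print(response["choices"][0]["message"]["content"])

# main()  # uncomment in your environment
```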
You have successfully deployed and invoked your custom fine-tuned GPT-OSS model on SageMaker real-time endpoints, using the vLLM framework for optimized, low-latency inference. You can find more GPT-OSS hosting examples in the OpenAI gpt-oss examples GitHub repo.
Clean up
To avoid incurring additional charges, complete the following steps to clean up the resources used in this post:
- Delete the SageMaker endpoint:
pretrained_predictor.delete_endpoint()
- If you created a SageMaker HyperPod cluster for the purposes of this post, delete the cluster by following the instructions in Deleting a SageMaker HyperPod cluster.
- Clean up the FSx for Lustre volume if it's no longer needed by following the instructions in Deleting a file system.
- If you used training jobs, the training instances are automatically deleted when the jobs are complete.
Conclusion
In this post, we showed how to fine-tune OpenAI's GPT-OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker AI using SageMaker HyperPod recipes. We discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs) including GPT-OSS, using either a persistent cluster through SageMaker HyperPod or an ephemeral cluster using SageMaker training jobs. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes. We also showed how these fine-tuned models can be seamlessly deployed to production using SageMaker endpoints with vLLM optimization, providing enterprise-grade inference capabilities with OpenAI-compatible APIs. This end-to-end workflow, from training to deployment, helps organizations build and serve custom LLM solutions while using the scalable infrastructure of AWS and the comprehensive ML platform capabilities of SageMaker.
To get started with SageMaker HyperPod recipes, visit the Amazon SageMaker HyperPod recipes GitHub repo for comprehensive documentation and example implementations. If you're interested in exploring fine-tuning further, the Generative AI using Amazon SageMaker GitHub repo has the necessary code and notebooks. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.
Special thanks to everyone who contributed to the launch: Hengzhi Pei, Zach Kimberg, Andrew Tian, Leonard Lausen, Sanjay Dorairaj, Manish Agarwal, Sareeta Panda, Chang Ning Tsai, Maxwell Nuyens, Natasha Sivananjaiah, and Kanwaljit Khurmi.
About the authors
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML systems. When she's not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with generative AI, deep learning, and machine learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems, from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he's not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.
Sumedha Swamy is a Senior Manager of Product Management at Amazon Web Services (AWS), where he leads several areas of Amazon SageMaker, including SageMaker Studio, the industry-leading integrated development environment for machine learning; developer and administrator experiences; AI infrastructure; and the SageMaker SDK.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry's work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on machine learning. He holds a Master's in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows.

