This post is the second part of the DeepSeek series focusing on model customization with Amazon SageMaker HyperPod recipes (or recipes for brevity). In Part 1, we demonstrated the performance and ease of fine-tuning DeepSeek-R1 distilled models using these recipes. In this post, we use the recipes to fine-tune the original DeepSeek-R1 671b parameter model. We demonstrate this through the step-by-step implementation of the recipes using both SageMaker training jobs and SageMaker HyperPod.
Enterprise use case
After its public release, the DeepSeek-R1 model, developed by DeepSeek AI, showed impressive results across multiple evaluation benchmarks. The model follows the Mixture of Experts (MoE) architecture and has 671 billion parameters. Traditionally, large models adapt well to a wide spectrum of generalized tasks by virtue of being trained on large volumes of data. The DeepSeek-R1 model was trained on 14.8 trillion tokens. The original R1 model demonstrates strong few-shot and zero-shot learning capabilities, allowing it to generalize to new tasks and scenarios that weren't part of its original training.
However, many customers prefer to either fine-tune or run continuous pre-training of these models to adapt them to their specific business applications or to optimize them for specific tasks. A financial organization might want to customize the model with its own data to assist with data processing tasks, or a hospital network might fine-tune it with patient records so it can act as a medical assistant for its doctors. Fine-tuning can also extend the model's generalization ability: customers can fine-tune it with a corpus of text in languages that aren't fully represented in the original training data. For example, a model fine-tuned with an additional trillion tokens of Hindi text would be able to extend the same generalization capabilities to Hindi.
The decision of which model to fine-tune depends on the end application as well as the available dataset. Based on the volume of proprietary data, customers may decide to fine-tune the larger DeepSeek-R1 model rather than one of the distilled versions. In addition, the R1 models have their own set of guardrails; customers might want to fine-tune the model to update those guardrails or expand on them.
Fine-tuning larger models like DeepSeek-R1 requires careful optimization to balance cost, deployment requirements, and performance effectiveness. To achieve optimal results, organizations must meticulously select an appropriate environment, determine the best hyperparameters, and implement efficient model sharding strategies.
Solution architecture
SageMaker HyperPod recipes effectively address these requirements by providing a carefully curated mix of distributed training techniques, optimizations, and configurations for state-of-the-art (SOTA) open source models. These recipes have undergone extensive benchmarking, testing, and validation to provide seamless integration with the SageMaker training and fine-tuning processes.
In this post, we explore solutions that demonstrate how to fine-tune the DeepSeek-R1 model using these recipes on either SageMaker HyperPod or SageMaker training jobs. Your choice between the two services depends on your specific requirements and preferences. If you require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more about these service features, refer to Generative AI foundation model training on Amazon SageMaker.
The following diagram illustrates the solution architecture for training using SageMaker HyperPod. With HyperPod, users begin the process by connecting to the login/head node of the Slurm cluster. Each step runs as a Slurm job and uses Amazon FSx for Lustre to store model checkpoints. For DeepSeek-R1, the process consists of the following steps:
- Download the DeepSeek-R1 model and convert the weights from FP8 to BF16 format
- Load the model into memory and perform fine-tuning using Quantized Low-Rank Adaptation (QLoRA)
- Merge the QLoRA adapters with the base model
- Convert and load the model for batch evaluation
The following diagram illustrates the solution architecture for SageMaker training jobs. You can execute each step in the training pipeline by initiating the process through the SageMaker control plane using APIs, the AWS Command Line Interface (AWS CLI), or the SageMaker ModelTrainer SDK. In response, SageMaker launches training jobs with the requested number and type of compute instances to run specific tasks. For DeepSeek-R1, the process consists of three main steps:
- Download R1 and convert it to the BF16 datatype format
- Load the model into memory and perform fine-tuning
- Consolidate and load the checkpoints into memory, then run inference and metrics to evaluate performance improvements
Prerequisites
Complete the following prerequisites before running the DeepSeek-R1 671B model fine-tuning notebook:
- Make the following quota increase requests for SageMaker. You need to request a minimum of two ml.p5.48xlarge instances (each with 8 x NVIDIA H100 GPUs), up to a maximum of four ml.p5.48xlarge instances (depending on the time-to-train and cost-to-train trade-offs for your use case). On the Service Quotas console, request the following SageMaker quotas; it can take up to 24 hours for the quota increase to be approved:
  - P5 instances (ml.p5.48xlarge) for training job usage: 2–4
  - P5 instances (ml.p5.48xlarge) for HyperPod cluster usage: 2–4
- If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster by referring to the Amazon SageMaker HyperPod Developer Guide. Alternatively, you can use the AWS CloudFormation template provided in the Own Account workshop and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
- (Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
- Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonFSxFullAccess, and AmazonS3FullAccess to give SageMaker the necessary access to run the examples.
- Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references the training assets:
Solution walkthrough
To perform the solution, follow the steps in the next sections.
Technical considerations
The default weights provided by the DeepSeek team on their official R1 repository are of type FP8. However, we chose to disable FP8 in our recipes because we empirically found that training with BF16 improves generalization across diverse datasets with minimal changes to the recipe hyperparameters. Therefore, to achieve stable fine-tuning for a model of 671b parameter size, we recommend first converting the model from FP8 to BF16 using the fp8_cast_bf16.py command-line script provided by DeepSeek. Executing this script copies the converted BF16 weights in safetensors format to the specified output directory. Remember to copy the model's config.json to the output directory so the weights load correctly. These steps are encapsulated in a prologue script and are documented step by step under the Fine-tuning section.
Customers can use a sequence length of 8K for training, as tested on p5.48xlarge instances, each equipped with eight NVIDIA H100 GPUs. You can also choose a smaller sequence length if needed. Training with a sequence length greater than 8K might lead to out-of-memory issues on the GPUs. Also, converting the model weights from FP8 to BF16 requires a p5.48xlarge instance, which is also recommended for training because of the model's high host memory requirements during initialization.
Customers must upgrade their transformers version to transformers==4.48.2 to run the training.
Fine-tuning
Run the finetune_deepseek_r1_671_qlora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.
Prepare the dataset
This section covers loading the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:
- Format the dataset by applying the prompt format for DeepSeek-R1:
- Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:
- Load the DeepSeek-R1 tokenizer from the Hugging Face Transformers library and generate tokens for the training and validation datasets. We use the original sequence length of 8K:
- Prepare the training and validation datasets for SageMaker training by saving them as arrow files, as required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded. This dataset will be used in both the SageMaker training jobs and SageMaker HyperPod examples (a consolidated sketch of these steps follows this list):
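The following is a minimal, consolidated sketch of these preparation steps, assuming the Hugging Face datasets and transformers libraries and the SageMaker Python SDK. The prompt template, dataset configuration and column names, and the S3 bucket are illustrative placeholders; the notebook's actual values may differ.

```python
# Consolidated sketch of the dataset preparation steps (illustrative values only).
from datasets import load_dataset
from transformers import AutoTokenizer
from sagemaker.s3 import S3Uploader

# Hypothetical prompt template in the spirit of the DeepSeek-R1 chat format
PROMPT_TEMPLATE = (
    "Below is a medical question. Think step by step, then answer.\n\n"
    "### Question:\n{question}\n\n### Response:\n{response}"
)

# Load the dataset and split it into training and validation sets
# (the "en" configuration and column names are assumptions about this dataset)
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

def tokenize(example):
    text = PROMPT_TEMPLATE.format(question=example["Question"], response=example["Response"])
    return tokenizer(text, truncation=True, max_length=8192)  # 8K sequence length

train_ds = dataset["train"].map(tokenize, remove_columns=dataset["train"].column_names)
val_ds = dataset["test"].map(tokenize, remove_columns=dataset["test"].column_names)

# Save as Arrow files and upload them to S3 as the training and validation data channels
train_ds.save_to_disk("/tmp/train")
val_ds.save_to_disk("/tmp/val")
bucket = "s3://<your-bucket>/deepseek-r1-medical"  # placeholder bucket/prefix
train_s3_path = S3Uploader.upload("/tmp/train", f"{bucket}/train")
val_s3_path = S3Uploader.upload("/tmp/val", f"{bucket}/val")
```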
The next section describes how to run a fine-tuning example with SageMaker training jobs.
Option A: Fine-tune using SageMaker training jobs
Follow these high-level steps:
- Download DeepSeek-R1 to the FSx for Lustre mounted directory
- Convert DeepSeek-R1 from FP8 to BF16
- Fine-tune the DeepSeek-R1 model
- Merge the trained adapter with the base model
Define a utility function to create the ModelTrainer class for every step of the SageMaker training jobs pipeline:
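As an illustration, a helper along these lines could construct a ModelTrainer for each pipeline step. This is a sketch under the assumption that the newer ModelTrainer SDK (sagemaker.modules) is used; the image URI, script names, and directory layout are placeholders, and the notebook's actual helper may accept different arguments.

```python
# Hypothetical helper that builds a ModelTrainer for a given pipeline step.
# Assumes the sagemaker.modules ModelTrainer SDK; all concrete values are placeholders.
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

def create_model_trainer(instance_type: str, instance_count: int, image_uri: str,
                         entry_script: str, job_prefix: str) -> ModelTrainer:
    source_code = SourceCode(
        source_dir="./scripts",     # directory holding the step's script (placeholder)
        entry_script=entry_script,  # for example "download_model.sh" or "convert.sh"
    )
    compute = Compute(
        instance_type=instance_type,
        instance_count=instance_count,
    )
    return ModelTrainer(
        training_image=image_uri,
        source_code=source_code,
        compute=compute,
        base_job_name=job_prefix,
    )
```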
Download DeepSeek-R1 to the FSx for Lustre mounted directory
Follow these steps:
- Select the instance type, Amazon FSx data channel, network configuration for the training job, and source code, then define the ModelTrainer class to run the training job on an ml.c5.18xlarge instance to download DeepSeek-R1 from the Hugging Face DeepSeek-R1 hub:
- Initiate the training by calling the train function of the ModelTrainer class (see the sketch after this list):
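The following is a sketch of the download step, reusing the hypothetical helper above; the image URI and script name are placeholders, and the FSx data channel and network configuration that the notebook attaches here are omitted.

```python
# Sketch: build and start the model download step (illustrative values only).
model_downloader = create_model_trainer(
    instance_type="ml.c5.18xlarge",
    instance_count=1,
    image_uri="<training-image-uri>",   # placeholder container image
    entry_script="download_model.sh",   # hypothetical script that pulls DeepSeek-R1 from the Hugging Face hub
    job_prefix="deepseek-r1-download",
)

# The notebook also wires in the FSx for Lustre data channel and VPC settings here.
model_downloader.train(wait=True)
```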
Convert DeepSeek-R1 from FP8 to BF16
Use ModelTrainer to convert the downloaded DeepSeek-R1 model weights from FP8 to BF16 format for optimal PEFT training. We use the convert.sh script to run the conversion on an ml.c5.18xlarge instance.
Use the SageMaker training warm pool configuration to retain and reuse the provisioned infrastructure after the completion of the model download training job in the previous step:
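A hedged sketch of a warm pool configuration follows, assuming the Compute configuration exposes keep_alive_period_in_seconds in line with the underlying CreateTrainingJob ResourceConfig; the value shown is illustrative.

```python
# Sketch: keep the provisioned instance warm so the conversion step can reuse it.
# Assumes Compute accepts keep_alive_period_in_seconds (mirroring ResourceConfig).
from sagemaker.modules.configs import Compute

compute = Compute(
    instance_type="ml.c5.18xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=3600,  # retain the instance for up to 1 hour after the job
)
```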
Fine-tune the DeepSeek-R1 model
The next phase involves fine-tuning the DeepSeek-R1 model using two ml.p5.48xlarge instances with distributed training. You implement this through the SageMaker recipe hf_deepseek_r1_671b_seq8k_gpu_qlora, which incorporates the QLoRA methodology. QLoRA makes the large language model (LLM) trainable on limited compute by quantizing the base model to 4-bit precision while using small, trainable low-rank adapters for fine-tuning, dramatically reducing memory requirements without sacrificing model quality:
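To make the QLoRA idea concrete, here is a generic illustration using the Hugging Face transformers and peft libraries. It shows the technique in general rather than the recipe's internals, and the checkpoint path, target modules, and hyperparameter values are illustrative.

```python
# Generic QLoRA illustration: a 4-bit quantized base model plus trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "<path-to-bf16-deepseek-r1>",           # placeholder: the converted BF16 checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                   # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights remain trainable
model.print_trainable_parameters()
```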
Initiate the training job to fine-tune the model. SageMaker training jobs will provision two P5 instances, orchestrate the SageMaker model parallel container smdistributed-modelparallel:2.4.1-gpu-py311-cu121, and execute the recipe to fine-tune DeepSeek-R1 with the QLoRA strategy on an ephemeral cluster:
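The launch can look roughly like the following sketch, assuming the ModelTrainer SDK's recipe support (ModelTrainer.from_recipe); the image URI, override keys, and paths are placeholders, and the notebook's exact arguments may differ.

```python
# Sketch: launch the QLoRA fine-tuning recipe as a SageMaker training job (placeholder values).
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

recipe_overrides = {
    # Hypothetical override; the real keys are defined in the recipe YAML
    "run": {"results_dir": "/opt/ml/model"},
}

model_trainer = ModelTrainer.from_recipe(
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora",
    training_image="<smdistributed-modelparallel-image-uri>",  # placeholder image URI
    recipe_overrides=recipe_overrides,
    compute=Compute(instance_type="ml.p5.48xlarge", instance_count=2),
)
model_trainer.train(wait=False)
```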
Merge the trained adapter with the base model
Merge the trained adapters with the base model so it can be used for inference:
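Conceptually, this is a standard PEFT adapter merge. The following generic sketch (not the recipe's own merge script, which handles sharded 671B checkpoints) uses placeholder paths.

```python
# Generic PEFT adapter merge sketch (placeholder paths; illustrative only).
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "<path-to-bf16-deepseek-r1>", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "<path-to-trained-qlora-adapter>")
merged = model.merge_and_unload()  # fold the adapter weights into the base weights
merged.save_pretrained("<path-to-merged-model>", safe_serialization=True)
```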
The next section shows how you can run similar steps on HyperPod to run your generative AI workloads.
Option B: Fine-tune using SageMaker HyperPod with Slurm
To fine-tune the model using HyperPod, make sure that your cluster is up and ready by following the prerequisites mentioned earlier. To access the login/head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at SSH into Cluster in the workshop.
Alternatively, you can also use AWS Systems Manager and run a command such as the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.
- When you're in the cluster's login/head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID for accessing the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.
- Create a squash file using Enroot to run the job on the cluster. The Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with HPC environments, making it ideal for running such workflows securely.
- After you've created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:
Also update the file recipes_collection/cluster/slurm.yaml to add container_mounts pointing to the FSx for Lustre file system used in your cluster.
Follow these high-level steps to set up, fine-tune, and evaluate the model using HyperPod recipes:
- Download the model and convert the weights to BF16
- Fine-tune the model using QLoRA
- Merge the trained model adapter
- Evaluate the fine-tuned model
Download the model and convert the weights to BF16
Download the DeepSeek-R1 model from the Hugging Face hub and convert the model weights from FP8 to BF16. You need to do this conversion to use QLoRA for fine-tuning. Copy and execute the following bash script:
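For reference, a minimal Python equivalent of the download portion of that script could use huggingface_hub; the FSx target path is a placeholder, and the FP8-to-BF16 conversion is still performed afterward with DeepSeek's fp8_cast_bf16.py script.

```python
# Minimal sketch of the model download step (placeholder output path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",
    local_dir="/fsx/models/DeepSeek-R1-fp8",  # placeholder FSx for Lustre path
    # token="hf_...",                          # only needed for gated or private repos
)
```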
Fine-tune the model using QLoRA
Download the prepared dataset that you uploaded to Amazon S3 into the FSx for Lustre volume attached to the cluster.
- Enter the following commands to download the data from Amazon S3:
- Update the launcher script to fine-tune the DeepSeek-R1 671B model. The launcher scripts serve as convenient wrappers for executing the training script, main.py, simplifying the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek R1 671B model, you can find the specific script at:
Before running the script, you need to modify the location of the training and validation files, update the Hugging Face model ID, and optionally provide the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you're using a multi-node cluster):
You can view the recipe for this fine-tuning task under recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora.yaml and override additional parameters as needed.
- Submit the job by running the launcher script:
Monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. The logs can be found in the results folder in the launch directory. When the job is complete, the model adapters are stored in the EXP_DIR that you defined in the launch. The structure of the directory should look like this:
You can see that the trained adapter weights are stored as part of the checkpointing under ./checkpoints/peft_sharded/step_N. We will later use these to merge with the base model.
Merge the trained model adapter
Follow these steps:
- Run a job using the smdistributed-modelparallel Enroot image to merge the adapter with the base model.
- Download the merge_peft_checkpoint.py code from the sagemaker-hyperpod-training-adapter-for-nemo repository and store it in Amazon FSx. Modify the export variables in the following scripts accordingly to reflect the paths for SOURCE_DIR, ADAPTER_PATH, BASE_MODEL_BF16, and MERGE_MODEL_PATH.
Evaluate the fine-tuned model
Use the basic testing scripts provided by DeepSeek to deploy the merged model.
- Start by cloning their repo:
- You need to convert the merged model to a specific format for running inference. In this case, you need 4*P5 instances to deploy the model because the merged model is in BF16. Enter the following command to convert the model:
- When the conversion is complete, use the following sbatch script to run the batch inference, making the following adjustments:
  - Update the ckpt-path to the converted model path from the previous step.
  - Create a new prompts.txt file with each line containing a prompt. The job will use the prompts from this file and generate output.
Cleanup
To clean up your resources and avoid incurring additional charges, follow these steps:
- Delete any unused SageMaker Studio resources.
- (Optional) Delete the SageMaker Studio domain.
- Verify that your training job isn't running anymore. To do so, on the SageMaker console, choose Training and check Training jobs.
- If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
Conclusion
In this post, we demonstrated how to fine-tune large models such as DeepSeek-R1 671B using either SageMaker training jobs or SageMaker HyperPod with HyperPod recipes in a few steps. This approach minimizes the complexity of identifying optimal distributed training configurations and provides a simple way to properly size your workloads with the best price-performance architecture on AWS.
To start using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes GitHub repository for comprehensive documentation and example implementations. Our team continually expands the recipes based on customer feedback and emerging machine learning (ML) trends, making sure you have the necessary tools for successful AI model training.
About the Authors
Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized applications and high-performance computing solutions.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large companies, primarily focusing on silicon and system architecture of AI infrastructure.
Rohith Nadimpally is a Software Development Engineer working on AWS SageMaker, where he accelerates large-scale AI/ML workflows. Before joining Amazon, he graduated with Honors from Purdue University with a degree in Computer Science. Outside of work, he enjoys playing tennis and watching movies.