This post is the second part of the DeepSeek series focusing on model customization with Amazon SageMaker HyperPod recipes (or recipes for brevity). In Part 1, we demonstrated the performance and ease of fine-tuning DeepSeek-R1 distilled models using these recipes. In this post, we use the recipes to fine-tune the original DeepSeek-R1 671b parameter model. We demonstrate this through the step-by-step implementation of the recipes using both SageMaker training jobs and SageMaker HyperPod.
Enterprise use case
After its public release, the DeepSeek-R1 model, developed by DeepSeek AI, showed impressive results across multiple evaluation benchmarks. The model follows the Mixture of Experts (MoE) architecture and has 671 billion parameters. Traditionally, large models adapt well to a wide spectrum of generalized tasks by virtue of being trained on large volumes of data. The DeepSeek-R1 model was trained on 14.8 trillion tokens. The original R1 model demonstrates strong few-shot and zero-shot learning capabilities, allowing it to generalize to new tasks and scenarios that weren't part of its original training.
However, many customers prefer to either fine-tune or run continuous pre-training of these models to adapt them to their specific business applications or to optimize them for specific tasks. A financial organization might want to customize the model with its own data to assist with data processing tasks, or a hospital network might fine-tune it with patient records so it can act as a medical assistant for its doctors. Fine-tuning can also extend the model's generalization ability: customers can fine-tune it with a corpus of text in languages that aren't fully represented in the original training data. For example, a model fine-tuned with an additional trillion tokens of Hindi text would be able to extend the same generalization capabilities to Hindi.
The decision of which model to fine-tune depends on the end application as well as the available dataset. Based on the volume of proprietary data, customers may decide to fine-tune the larger DeepSeek-R1 model rather than one of the distilled versions. In addition, the R1 models have their own set of guardrails; customers might want to fine-tune the model to update those guardrails or expand on them.
Fine-tuning larger models like DeepSeek-R1 requires careful optimization to balance cost, deployment requirements, and performance effectiveness. To achieve optimal results, organizations must meticulously select an appropriate environment, determine the best hyperparameters, and implement efficient model sharding strategies.
Solution architecture
SageMaker HyperPod recipes effectively address these requirements by providing a carefully curated mix of distributed training techniques, optimizations, and configurations for state-of-the-art (SOTA) open source models. These recipes have undergone extensive benchmarking, testing, and validation to provide seamless integration with the SageMaker training and fine-tuning processes.
In this post, we explore solutions that demonstrate how to fine-tune the DeepSeek-R1 model using these recipes on either SageMaker HyperPod or SageMaker training jobs. Your choice between the two services depends on your specific requirements and preferences. If you require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more about these service features, refer to Generative AI foundation model training on Amazon SageMaker.
The following diagram illustrates the solution architecture for training using SageMaker HyperPod. With HyperPod, users begin the process by connecting to the login/head node of the Slurm cluster. Each step runs as a Slurm job and uses Amazon FSx for Lustre to store model checkpoints. For DeepSeek-R1, the process consists of the following steps:
- Download the DeepSeek-R1 model and convert the weights from FP8 to BF16 format
- Load the model into memory and perform fine-tuning using Quantized Low-Rank Adaptation (QLoRA)
- Merge the QLoRA adapters with the base model
- Convert and load the model for batch evaluation
The following diagram illustrates the solution architecture for SageMaker training jobs. You can execute each step in the training pipeline by initiating the process through the SageMaker control plane using APIs, the AWS Command Line Interface (AWS CLI), or the SageMaker ModelTrainer SDK. In response, SageMaker launches training jobs with the requested number and type of compute instances to run specific tasks. For DeepSeek-R1, the process consists of three main steps:
- Download R1 and convert it to the BF16 datatype format
- Load the model into memory and perform fine-tuning
- Consolidate and load the checkpoints into memory, then run inference and metrics to evaluate performance improvements
Prerequisites
Complete the following prerequisites before running the DeepSeek-R1 671B model fine-tuning notebook:
- Make the following quota increase requests for SageMaker. You need to request a minimum of two ml.p5.48xlarge instances (each with 8 x NVIDIA H100 GPUs), up to a maximum of four ml.p5.48xlarge instances (depending on the time-to-train and cost-to-train trade-offs for your use case). On the Service Quotas console, request the following SageMaker quotas; it can take up to 24 hours for the quota increase to be approved:
  - P5 instances (ml.p5.48xlarge) for training job usage: 2–4
  - P5 instances (ml.p5.48xlarge) for HyperPod cluster usage: 2–4
- If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster by referring to the Amazon SageMaker HyperPod Developer Guide. Alternatively, you can use the AWS CloudFormation template provided in the Own Account workshop and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
- (Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
- Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonFSxFullAccess, and AmazonS3FullAccess to give SageMaker the necessary access to run the examples.
- Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references the training assets:
Solution walkthrough
To perform the solution, follow the steps in the next sections.
Technical considerations
The default weights provided by the DeepSeek team on their official R1 repository are of type FP8. However, we chose to disable FP8 in our recipes because we empirically found that training with BF16 improves generalization across diverse datasets with minimal changes to the recipe hyperparameters. Therefore, to achieve stable fine-tuning for a model of 671b parameter size, we recommend first converting the model from FP8 to BF16 using the fp8_cast_bf16.py command-line script provided by DeepSeek. Executing this script copies the converted BF16 weights in safetensors format to the specified output directory. Remember to copy the model's config.json to the output directory so the weights load correctly. These steps are encapsulated in a prologue script and are documented step by step under the Fine-tuning section.
Customers can use a sequence length of 8K for training, as tested on p5.48xlarge instances, each equipped with eight NVIDIA H100 GPUs. You can also choose a smaller sequence length if needed. Training with a sequence length greater than 8K might lead to out-of-memory issues on the GPUs. Also, converting the model weights from FP8 to BF16 requires a p5.48xlarge instance, which is also recommended for training because of the model's high host memory requirements during initialization.
Customers must upgrade their transformers version to transformers==4.48.2 to run the training.
Fine-tuning
Run the finetune_deepseek_r1_671_qlora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.
Prepare the dataset
This section covers loading the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:
- Format the dataset by applying the prompt format for DeepSeek-R1:
- Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:
- Load the DeepSeek-R1 tokenizer from the Hugging Face Transformers library and generate tokens for the training and validation datasets. We use the original sequence length of 8K:
- Prepare the training and validation datasets for SageMaker training by saving them as arrow files, as required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded. This dataset will be used in both the SageMaker training jobs and SageMaker HyperPod examples (a consolidated sketch of these steps follows this list):
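The following is a minimal, consolidated sketch of these preparation steps, assuming the Hugging Face datasets and transformers libraries and the SageMaker Python SDK. The prompt template, dataset configuration and column names, and the S3 bucket are illustrative placeholders; the notebook's actual values may differ.

```python
# Consolidated sketch of the dataset preparation steps (illustrative values only).
from datasets import load_dataset
from transformers import AutoTokenizer
from sagemaker.s3 import S3Uploader

# Hypothetical prompt template in the spirit of the DeepSeek-R1 chat format
PROMPT_TEMPLATE = (
    "Below is a medical question. Think step by step, then answer.\n\n"
    "### Question:\n{question}\n\n### Response:\n{response}"
)

# Load the dataset and split it into training and validation sets
# (the "en" configuration and column names are assumptions about this dataset)
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

def tokenize(example):
    text = PROMPT_TEMPLATE.format(question=example["Question"], response=example["Response"])
    return tokenizer(text, truncation=True, max_length=8192)  # 8K sequence length

train_ds = dataset["train"].map(tokenize, remove_columns=dataset["train"].column_names)
val_ds = dataset["test"].map(tokenize, remove_columns=dataset["test"].column_names)

# Save as Arrow files and upload them to S3 as the training and validation data channels
train_ds.save_to_disk("/tmp/train")
val_ds.save_to_disk("/tmp/val")
bucket = "s3://<your-bucket>/deepseek-r1-medical"  # placeholder bucket/prefix
train_s3_path = S3Uploader.upload("/tmp/train", f"{bucket}/train")
val_s3_path = S3Uploader.upload("/tmp/val", f"{bucket}/val")
```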
The next section describes how to run a fine-tuning example with SageMaker training jobs.
Option A: Fine-tune using SageMaker training jobs
Follow these high-level steps:
- Download DeepSeek-R1 to the FSx for Lustre mounted directory
- Convert DeepSeek-R1 from FP8 to BF16
- Fine-tune the DeepSeek-R1 model
- Merge the trained adapter with the base model
Define a utility function to create the ModelTrainer class for every step of the SageMaker training jobs pipeline:
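As an illustration, a helper along these lines could construct a ModelTrainer for each pipeline step. This is a sketch under the assumption that the newer ModelTrainer SDK (sagemaker.modules) is used; the image URI, script names, and directory layout are placeholders, and the notebook's actual helper may accept different arguments.

```python
# Hypothetical helper that builds a ModelTrainer for a given pipeline step.
# Assumes the sagemaker.modules ModelTrainer SDK; all concrete values are placeholders.
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

def create_model_trainer(instance_type: str, instance_count: int, image_uri: str,
                         entry_script: str, job_prefix: str) -> ModelTrainer:
    source_code = SourceCode(
        source_dir="./scripts",     # directory holding the step's script (placeholder)
        entry_script=entry_script,  # for example "download_model.sh" or "convert.sh"
    )
    compute = Compute(
        instance_type=instance_type,
        instance_count=instance_count,
    )
    return ModelTrainer(
        training_image=image_uri,
        source_code=source_code,
        compute=compute,
        base_job_name=job_prefix,
    )
```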
Download DeepSeek-R1 to the FSx for Lustre mounted directory
Follow these steps:
- Select the instance type, Amazon FSx data channel, network configuration for the training job, and source code, then define the ModelTrainer class to run the training job on an ml.c5.18xlarge instance to download DeepSeek-R1 from the Hugging Face DeepSeek-R1 hub:
- Initiate the training by calling the train function of the ModelTrainer class (see the sketch after this list):
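The following is a sketch of the download step, reusing the hypothetical helper above; the image URI and script name are placeholders, and the FSx data channel and network configuration that the notebook attaches here are omitted.

```python
# Sketch: build and start the model download step (illustrative values only).
model_downloader = create_model_trainer(
    instance_type="ml.c5.18xlarge",
    instance_count=1,
    image_uri="<training-image-uri>",   # placeholder container image
    entry_script="download_model.sh",   # hypothetical script that pulls DeepSeek-R1 from the Hugging Face hub
    job_prefix="deepseek-r1-download",
)

# The notebook also wires in the FSx for Lustre data channel and VPC settings here.
model_downloader.train(wait=True)
```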
Convert DeepSeek-R1 from FP8 to BF16
Use ModelTrainer to convert the downloaded DeepSeek-R1 model weights from FP8 to BF16 format for optimal PEFT training. We use the convert.sh script to run the conversion on an ml.c5.18xlarge instance.
Use the SageMaker training warm pool configuration to retain and reuse the provisioned infrastructure after the completion of the model download training job in the previous step:
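A hedged sketch of a warm pool configuration follows, assuming the Compute configuration exposes keep_alive_period_in_seconds in line with the underlying CreateTrainingJob ResourceConfig; the value shown is illustrative.

```python
# Sketch: keep the provisioned instance warm so the conversion step can reuse it.
# Assumes Compute accepts keep_alive_period_in_seconds (mirroring ResourceConfig).
from sagemaker.modules.configs import Compute

compute = Compute(
    instance_type="ml.c5.18xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=3600,  # retain the instance for up to 1 hour after the job
)
```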
Fine-tune the DeepSeek-R1 model
The next phase involves fine-tuning the DeepSeek-R1 model using two ml.p5.48xlarge instances with distributed training. You implement this through the SageMaker recipe hf_deepseek_r1_671b_seq8k_gpu_qlora, which incorporates the QLoRA methodology. QLoRA makes the large language model (LLM) trainable on limited compute by quantizing the base model to 4-bit precision while using small, trainable low-rank adapters for fine-tuning, dramatically reducing memory requirements without sacrificing model quality:
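To make the QLoRA idea concrete, here is a generic illustration using the Hugging Face transformers and peft libraries. It shows the technique in general rather than the recipe's internals, and the checkpoint path, target modules, and hyperparameter values are illustrative.

```python
# Generic QLoRA illustration: a 4-bit quantized base model plus trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "<path-to-bf16-deepseek-r1>",           # placeholder: the converted BF16 checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                   # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights remain trainable
model.print_trainable_parameters()
```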
Initiate the training job to fine-tune the model. SageMaker training jobs will provision two P5 instances, orchestrate the SageMaker model parallel container smdistributed-modelparallel:2.4.1-gpu-py311-cu121, and execute the recipe to fine-tune DeepSeek-R1 with the QLoRA strategy on an ephemeral cluster:
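The launch can look roughly like the following sketch, assuming the ModelTrainer SDK's recipe support (ModelTrainer.from_recipe); the image URI, override keys, and paths are placeholders, and the notebook's exact arguments may differ.

```python
# Sketch: launch the QLoRA fine-tuning recipe as a SageMaker training job (placeholder values).
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

recipe_overrides = {
    # Hypothetical override; the real keys are defined in the recipe YAML
    "run": {"results_dir": "/opt/ml/model"},
}

model_trainer = ModelTrainer.from_recipe(
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora",
    training_image="<smdistributed-modelparallel-image-uri>",  # placeholder image URI
    recipe_overrides=recipe_overrides,
    compute=Compute(instance_type="ml.p5.48xlarge", instance_count=2),
)
model_trainer.train(wait=False)
```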
Merge the trained adapter with the base model
Merge the trained adapters with the base model so it can be used for inference:
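Conceptually, this is a standard PEFT adapter merge. The following generic sketch (not the recipe's own merge script, which handles sharded 671B checkpoints) uses placeholder paths.

```python
# Generic PEFT adapter merge sketch (placeholder paths; illustrative only).
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "<path-to-bf16-deepseek-r1>", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "<path-to-trained-qlora-adapter>")
merged = model.merge_and_unload()  # fold the adapter weights into the base weights
merged.save_pretrained("<path-to-merged-model>", safe_serialization=True)
```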
The next section shows how you can run similar steps on HyperPod to run your generative AI workloads.
Option B: Fine-tune using SageMaker HyperPod with Slurm
To fine-tune the model using HyperPod, make sure that your cluster is up and ready by following the prerequisites mentioned earlier. To access the login/head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at SSH into Cluster in the workshop.
Alternatively, you can also use AWS Systems Manager and run a command such as the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.
- When you're in the cluster's login/head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID for accessing the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.
- Create a squash file using Enroot to run the job on the cluster. The Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with HPC environments, making it ideal for running such workflows securely.
- After you've created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:
Also update the file recipes_collection/cluster/slurm.yaml to add container_mounts pointing to the FSx for Lustre file system used in your cluster.
Follow these high-level steps to set up, fine-tune, and evaluate the model using HyperPod recipes:
- Download the model and convert the weights to BF16
- Fine-tune the model using QLoRA
- Merge the trained model adapter
- Evaluate the fine-tuned model
Download the model and convert the weights to BF16
Download the DeepSeek-R1 model from the Hugging Face hub and convert the model weights from FP8 to BF16. You need to do this conversion to use QLoRA for fine-tuning. Copy and execute the following bash script:
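For reference, a minimal Python equivalent of the download portion of that script could use huggingface_hub; the FSx target path is a placeholder, and the FP8-to-BF16 conversion is still performed afterward with DeepSeek's fp8_cast_bf16.py script.

```python
# Minimal sketch of the model download step (placeholder output path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",
    local_dir="/fsx/models/DeepSeek-R1-fp8",  # placeholder FSx for Lustre path
    # token="hf_...",                          # only needed for gated or private repos
)
```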
Fine-tune the model using QLoRA
Download the prepared dataset that you uploaded to Amazon S3 into the FSx for Lustre volume attached to the cluster.
- Enter the following commands to download the data from Amazon S3:
- Update the launcher script to fine-tune the DeepSeek-R1 671B model. The launcher scripts serve as convenient wrappers for executing the training script, main.py, simplifying the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek R1 671B model, you can find the specific script at:
Before running the script, you need to modify the location of the training and validation files, update the Hugging Face model ID, and optionally provide the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you're using a multi-node cluster):
You can view the recipe for this fine-tuning task under recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora.yaml and override additional parameters as needed.
- Submit the job by running the launcher script:
Monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. The logs can be found in the results folder in the launch directory. When the job is complete, the model adapters are stored in the EXP_DIR that you defined in the launch. The structure of the directory should look like this:
You can see that the trained adapter weights are stored as part of the checkpointing under ./checkpoints/peft_sharded/step_N. We will later use these to merge with the base model.
Merge the trained model adapter
Follow these steps:
- Run a job using the smdistributed-modelparallel Enroot image to merge the adapter with the base model.
- Download the merge_peft_checkpoint.py code from the sagemaker-hyperpod-training-adapter-for-nemo repository and store it in Amazon FSx. Modify the export variables in the following scripts accordingly to reflect the paths for SOURCE_DIR, ADAPTER_PATH, BASE_MODEL_BF16, and MERGE_MODEL_PATH.
Evaluate the fine-tuned model
Use the basic testing scripts provided by DeepSeek to deploy the merged model.
- Start by cloning their repo:
- You need to convert the merged model to a specific format for running inference. In this case, you need 4*P5 instances to deploy the model because the merged model is in BF16. Enter the following command to convert the model:
- When the conversion is complete, use the following sbatch script to run the batch inference, making the following adjustments:
  - Update the ckpt-path to the converted model path from the previous step.
  - Create a new prompts.txt file with each line containing a prompt. The job will use the prompts from this file and generate output.
Cleanup
To clean up your resources and avoid incurring additional charges, follow these steps:
- Delete any unused SageMaker Studio resources.
- (Optional) Delete the SageMaker Studio domain.
- Verify that your training job isn't running anymore. To do so, on the SageMaker console, choose Training and check Training jobs.
- If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
Conclusion
In this post, we demonstrated how to fine-tune large models such as DeepSeek-R1 671B using either SageMaker training jobs or SageMaker HyperPod with HyperPod recipes in a few steps. This approach minimizes the complexity of identifying optimal distributed training configurations and provides a simple way to properly size your workloads with the best price-performance architecture on AWS.
To start using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes GitHub repository for comprehensive documentation and example implementations. Our team continually expands the recipes based on customer feedback and emerging machine learning (ML) trends, making sure you have the necessary tools for successful AI model training.
About the Authors
Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized applications and high-performance computing solutions.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large companies, primarily focusing on silicon and system architecture of AI infrastructure.
Rohith Nadimpally is a Software Development Engineer working on AWS SageMaker, where he accelerates large-scale AI/ML workflows. Before joining Amazon, he graduated with Honors from Purdue University with a degree in Computer Science. Outside of work, he enjoys playing tennis and watching movies.