    Machine Learning & Research

    Building a custom model provider for Strands Agents with LLMs hosted on SageMaker AI endpoints

    By Oliver Chambers — March 7, 2026 — 9 Mins Read


    Organizations increasingly deploy custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving frameworks—such as SGLang, vLLM, or TorchServe—to gain greater control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility introduces a critical technical challenge: response format incompatibility with Strands agents. While these custom serving frameworks typically return responses in OpenAI-compatible formats for broad ecosystem support, Strands agents expect model responses aligned with the Bedrock Messages API format.

    The problem is particularly significant because support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. While the Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats since December 2025, the flexibility of SageMaker AI allows customers to host diverse foundation models—some requiring esoteric prompt and response formats that don't conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration despite both systems being technically functional. The solution lies in implementing custom model parsers that extend SageMakerAIModel and translate the model server's response format into what Strands expects, enabling organizations to use their preferred serving frameworks without sacrificing compatibility with the Strands Agents SDK.

    This post demonstrates how to build custom model parsers for Strands agents when working with LLMs hosted on SageMaker that don't natively support the Bedrock Messages API format. We'll walk through deploying Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-creator, then implementing a custom parser to integrate it with Strands agents.

    Strands Custom Parsers

    Strands agents expect model responses in a specific format aligned with the Bedrock Messages API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats—often OpenAI-compatible for broad ecosystem support. Without a custom parser, you'll encounter errors like:

    TypeError: 'NoneType' object is not subscriptable

    This happens because the default Strands Agents SageMakerAIModel class attempts to parse responses assuming a specific structure that your custom endpoint doesn't provide. In this post and the companion code base, we illustrate how to extend the SageMakerAIModel class with custom parsing logic that translates your model server's response format into what Strands expects.
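    To make the failure mode concrete, here is a minimal sketch (the exact field access inside the default parser is an assumption for illustration): a Bedrock Messages-style parser reaches for an "output" key that an OpenAI-compatible payload simply does not contain, gets None back, and then subscripts it.

```python
# Hypothetical OpenAI-compatible payload from an SGLang-backed endpoint
openai_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ]
}

# A Bedrock Messages-style parser expects an "output" key that is absent here.
# dict.get() returns None, and subscripting None raises the familiar TypeError.
try:
    text = openai_response.get("output")["message"]["content"]
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable
```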

    Implementation Overview

    Our implementation consists of three layers:

    1. Model Deployment Layer: Llama 3.1 served by SGLang on SageMaker, returning OpenAI-compatible responses
    2. Parser Layer: Custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1's response format
    3. Agent Layer: Strands agent that uses the custom provider for conversational AI, correctly parsing the model's response

    We start by using awslabs/ml-container-creator, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts needed to build LLM serving containers, including Dockerfiles, CodeBuild configurations, and deployment scripts.

    Install ml-container-creator

    The first step is to build the serving container for our model. We use an open-source project to build the container and generate deployment scripts for it. The following commands show how to install awslabs/ml-container-creator and its dependencies, which include npm and Yeoman. For more information, review the project's README and Wiki to get started.

    # Install Yeoman globally
    npm install -g yo
    
    # Clone and install ml-container-creator
    git clone https://github.com/awslabs/ml-container-creator
    cd ml-container-creator
    npm install && npm link
    
    # Verify installation
    yo --generators # Should list ml-container-creator

    Generate Deployment Project

    Once installed and linked, the yo command lets you run installed generators; yo ml-container-creator runs the generator we need for this exercise.

    # Run the generator
    yo ml-container-creator
    
    # Configuration options:
    # - Framework: transformers
    # - Model Server: sglang
    # - Model: meta-llama/Llama-3.1-8B-Instruct
    # - Deploy Target: codebuild
    # - Instance Type: ml.g6.12xlarge (GPU)
    # - Region: us-east-1

    The generator creates a complete project structure:

    /
    ├── Dockerfile           # Container with SGLang and dependencies
    ├── buildspec.yml        # CodeBuild configuration
    ├── code/
    │   └── serve            # SGLang server startup script
    ├── deploy/
    │   ├── submit_build.sh  # Triggers CodeBuild
    │   └── deploy.sh        # Deploys to SageMaker
    └── test/
        └── test_endpoint.sh # Endpoint testing script

    Build and Deploy

    Projects generated by awslabs/ml-container-creator include templatized build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.

    cd llama-31-deployment
    
    # Build container with CodeBuild (no local Docker required)
    ./deploy/submit_build.sh
    
    # Deploy to SageMaker
    ./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

    The deployment process:
    
    1. CodeBuild builds the Docker image with SGLang and Llama 3.1
    2. The image is pushed to Amazon ECR
    3. SageMaker creates a real-time endpoint
    4. SGLang downloads the model from Hugging Face and loads it into GPU memory
    5. The endpoint reaches InService status (approximately 10–15 minutes)

    We can test the endpoint using ./test/test_endpoint.sh, or with a direct invocation:

    import boto3
    import json
    
    runtime_client = boto3.client('sagemaker-runtime', region_name="us-east-1")
    
    payload = {
      "messages": [
        {"role": "user", "content": "Hello, how are you?"}
      ],
      "max_tokens": 100,
      "temperature": 0.7
    }
    
    response = runtime_client.invoke_endpoint(
      EndpointName="llama-31-deployment-endpoint",
      ContentType="application/json",
      Body=json.dumps(payload)
    )
    
    result = json.loads(response['Body'].read().decode('utf-8'))
    print(result['choices'][0]['message']['content'])

    Understanding the Response Format

    Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses that adhere to the Bedrock Messages API format. Until late last year, this was a standard compatibility mismatch. Since December 2025, the Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats:

    {
      "id": "cmpl-abc123",
      "object": "chat.completion",
      "created": 1704067200,
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
        "finish_reason": "stop"
      }],
      "usage": {
        "prompt_tokens": 23,
        "completion_tokens": 12,
        "total_tokens": 35
      }
    }

    However, support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI lets customers host many kinds of foundation models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt/response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that don't exist in the standard OpenAI Messages format, causing TypeError-style failures.
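    To see the translation a custom parser has to perform, here is a minimal non-streaming sketch; the helper name is ours, and the Bedrock-side field names follow Messages API conventions rather than anything confirmed by this article's companion code.

```python
def openai_to_messages_api(openai_response: dict) -> dict:
    """Translate an OpenAI-style chat completion into a Bedrock Messages-style dict."""
    choice = openai_response["choices"][0]
    usage = openai_response.get("usage", {})
    finish = choice.get("finish_reason")
    return {
        "output": {
            "message": {
                "role": choice["message"]["role"],
                "content": [{"text": choice["message"]["content"]}],
            }
        },
        # OpenAI's "stop" maps onto the Messages API "end_turn" stop reason
        "stopReason": "end_turn" if finish == "stop" else finish,
        "usage": {
            "inputTokens": usage.get("prompt_tokens", 0),
            "outputTokens": usage.get("completion_tokens", 0),
            "totalTokens": usage.get("total_tokens", 0),
        },
    }
```

    Feeding the example response above through this helper yields the nested output.message.content structure that a Bedrock-style consumer can subscript safely.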

    Implementing a Custom Model Parser

    Custom model parsers are a feature of the Strands Agents SDK that provides robust compatibility and flexibility for customers building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel:

    def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
      # Build payload messages
      payload_messages = []
      if system_prompt:
        payload_messages.append({"role": "system", "content": system_prompt})
    
      # Extract message content from Strands format
      for msg in messages:
        payload_messages.append({"role": "user", "content": msg['content'][0]['text']})
    
      # Build full payload with streaming enabled
      payload = {
        "messages": payload_messages,
        "max_tokens": kwargs.get('max_tokens', self.max_tokens),
        "temperature": kwargs.get('temperature', self.temperature),
        "top_p": kwargs.get('top_p', self.top_p),
        "stream": True
      }
    
      try:
        # Invoke SageMaker endpoint with streaming
        response = self.runtime_client.invoke_endpoint_with_response_stream(
          EndpointName=self.endpoint_name,
          ContentType="application/json",
          Accept="application/json",
          Body=json.dumps(payload)
        )
    
        # Process streaming response
        accumulated_content = ""
        for event in response['Body']:
          chunk = event['PayloadPart']['Bytes'].decode('utf-8')
          if not chunk.strip():
            continue
    
          # Parse SSE format: "data: {json}\n"
          for line in chunk.split('\n'):
            if line.startswith('data: '):
              try:
                json_str = line.replace('data: ', '').strip()
                if not json_str:
                  continue
    
                chunk_data = json.loads(json_str)
                if 'choices' in chunk_data and chunk_data['choices']:
                  delta = chunk_data['choices'][0].get('delta', {})
    
                  # Yield content delta in Strands format
                  if 'content' in delta:
                    content_chunk = delta['content']
                    accumulated_content += content_chunk
                    yield {
                      "type": "contentBlockDelta",
                      "delta": {"text": content_chunk},
                      "contentBlockIndex": 0
                    }
    
                  # Check for completion
                  finish_reason = chunk_data['choices'][0].get('finish_reason')
                  if finish_reason:
                    yield {
                      "type": "messageStop",
                      "stopReason": finish_reason
                    }
    
                  # Yield usage metadata
                  if 'usage' in chunk_data:
                    yield {
                      "type": "metadata",
                      "usage": chunk_data['usage']
                    }
    
              except json.JSONDecodeError:
                continue
    
      except Exception as e:
        yield {
          "type": "error",
          "error": {
            "message": f"Endpoint invocation failed: {str(e)}",
            "type": "EndpointInvocationError"
          }
        }

    The stream method overrides the behavior of the SageMakerAIModel and allows the agent to parse responses based on the requirements of the underlying model. While the vast majority of models do support OpenAI's Messages API protocol, this capability lets power users run highly specialized LLMs on SageMaker AI to power agent workloads with the Strands Agents SDK. Once the custom model response logic is built, the Strands Agents SDK makes it simple to initialize agents with custom model providers:

    from strands.agent import Agent
    
    # Initialize custom provider
    provider = LlamaModelProvider(
      endpoint_name="llama-31-deployment-endpoint",
      region_name="us-east-1",
      max_tokens=1000,
      temperature=0.7
    )
    
    # Create agent with custom provider
    agent = Agent(
      name="llama-assistant",
      model=provider,
      system_prompt=(
        "You are a helpful AI assistant powered by Llama 3.1, "
        "deployed on Amazon SageMaker. You provide clear, accurate, "
        "and friendly responses to user questions."
      )
    )
    
    # Test the agent
    response = agent("What are the key benefits of deploying LLMs on SageMaker?")
    print(response.content)

    The complete implementation for this custom parser, including the Jupyter notebook with detailed explanations and the ml-container-creator deployment project, is available in the companion GitHub repository.

    Conclusion

    Building custom model parsers for Strands agents lets users take advantage of different LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining the clean agent interface of Strands.

    Key takeaways:

    1. awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
    2. Custom parsers bridge the gap between model server response formats and Strands expectations
    3. The stream() method is the critical integration point for custom providers

    About the authors

    Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
