Accelerate Enterprise AI Development using Weights & Biases and Amazon Bedrock AgentCore

This post is co-written by Thomas Capelle and Ray Strickland from Weights & Biases (W&B).

Generative artificial intelligence (AI) adoption is accelerating across enterprises, evolving from simple foundation model interactions to sophisticated agentic workflows. As organizations transition from proof-of-concepts to production deployments, they require robust tools for development, evaluation, and monitoring of AI applications at scale.

In this post, we demonstrate how to use Foundation Models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore alongside W&B Weave to help build, evaluate, and monitor enterprise AI solutions. We cover the complete development lifecycle, from tracking individual FM calls to monitoring complex agent workflows in production.

Overview of W&B Weave

Weights & Biases (W&B) is an AI developer system that provides comprehensive tools for training models, fine-tuning, and leveraging foundation models for enterprises of all sizes across various industries.

W&B Weave offers a unified suite of developer tools to support every stage of your agentic AI workflows. It enables:

Tracing & monitoring: Track large language model (LLM) calls and application logic to debug and analyze production systems.
Systematic iteration: Refine and iterate on prompts, datasets, and models.
Experimentation: Experiment with different models and prompts in the LLM Playground.
Evaluation: Use custom or pre-built scorers alongside our comparison tools to systematically assess and enhance application performance. Collect user and expert feedback for real-life testing and evaluation.
Guardrails: Help protect your application with safeguards for content moderation, prompt safety, and more. Use custom or third-party guardrails (including Amazon Bedrock Guardrails) or W&B Weave’s native guardrails.

W&B Weave can be fully managed by Weights & Biases in a multi-tenant or single-tenant environment, or it can be deployed directly in a customer’s Amazon Virtual Private Cloud (VPC). In addition, W&B Weave’s integration into the W&B Development Platform gives organizations a seamlessly integrated experience between the model training/fine-tuning workflow and the agentic AI workflow.

To get started, subscribe to the Weights & Biases AI Development Platform through AWS Marketplace. Individuals and academic teams can subscribe to W&B at no additional cost.

Monitoring Amazon Bedrock FMs with W&B Weave SDK

W&B Weave integrates seamlessly with Amazon Bedrock through Python and TypeScript SDKs. After installing the library and patching your Bedrock client, W&B Weave automatically tracks the LLM calls:

!pip install weave
import weave
import boto3
import json
from weave.integrations.bedrock.bedrock_sdk import patch_client

weave.init("my_bedrock_app")

# Create and patch the Bedrock client
client = boto3.client("bedrock-runtime")
patch_client(client)

# Use the client as usual
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType="application/json",
    accept="application/json"
)
response_dict = json.loads(response.get('body').read())
print(response_dict["content"][0]["text"])

This integration automatically versions experiments and tracks configurations, providing full visibility into your Amazon Bedrock applications without modifying core logic.
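
As a minimal sketch of keeping the request and response handling around the patched client reusable, the two helpers below build the Anthropic Messages payload used in the call above and pull the text out of the response body. The names build_claude_request and extract_text are hypothetical and not part of boto3 or W&B Weave; the payload fields are the ones shown in the example above.

```python
import json
from typing import Union


def build_claude_request(prompt: str, max_tokens: int = 100) -> str:
    """Serialize a single-turn Anthropic Messages request for invoke_model."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })


def extract_text(response_body: Union[str, bytes]) -> str:
    """Return the text of the first content block in a Messages response."""
    payload = json.loads(response_body)
    return payload["content"][0]["text"]
```

With a patched client, the helpers would slot in as body=build_claude_request("What is the capital of France?") and extract_text(response["body"].read()), so the traced call itself stays unchanged.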

Experimenting with Amazon Bedrock FMs in W&B Weave Playground

The W&B Weave Playground accelerates prompt engineering with an intuitive interface for testing and comparing Bedrock models. Key features include:

Direct prompt editing and message retrying
Side-by-side model comparison
Access from trace views for rapid iteration

To begin, add your AWS credentials in the Playground settings, select your preferred Amazon Bedrock FMs, and start experimenting. The interface enables rapid iteration on prompts while maintaining full traceability of experiments.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations

W&B Weave Evaluations provides dedicated tools for evaluating generative AI models effectively. By using W&B Weave Evaluations alongside Amazon Bedrock, users can efficiently evaluate these models, analyze outputs, and visualize performance across key metrics. Users can combine built-in scorers from W&B Weave, third-party or custom scorers, and human/expert feedback. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.

W&B Weave has a first-class way to track evaluations with Model & Evaluation classes. To set up an evaluation job, customers can:

Define a dataset or list of dictionaries with a collection of examples to be evaluated
Create a list of scoring functions. Each function should accept the model output (the output argument in the example that follows; older Weave releases name it model_output) and, optionally, other inputs from your examples, and return a dictionary with the scores
Define an Amazon Bedrock model by using the Model class
Evaluate this model by calling Evaluation

Here’s an example of setting up an evaluation job:

import weave
from weave import Evaluation
import asyncio

# Collect your examples
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]

# Define any custom scoring function
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    # Here is where you'd define the logic to score the model output
    return {'match': expected == output['generated_text']}

@weave.op()
def function_to_evaluate(question: str):
    # Here's where you would add your LLM call and return the output
    return {'generated_text': 'Paris'}

# Score your examples using scoring functions
evaluation = Evaluation(
    dataset=examples, scorers=[match_score1]
)

# Start tracking the evaluation
weave.init('intro-example')
# Run the evaluation
asyncio.run(evaluation.evaluate(function_to_evaluate))
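
The exact-match scorer above fails on harmless differences such as casing or a trailing period, which FM outputs often contain. The following is one possible sketch of a more forgiving scorer; normalize and normalized_match are hypothetical names, and the normalization rules (trim, lowercase, drop ASCII punctuation) are an assumption to adapt to your data rather than a W&B Weave default.

```python
import string


def normalize(text: str) -> str:
    """Trim, lowercase, and drop ASCII punctuation so 'Paris.' matches 'paris'."""
    stripped = text.strip().lower()
    return stripped.translate(str.maketrans("", "", string.punctuation))


def normalized_match(expected: str, output: dict) -> dict:
    """Scorer sketch: compare the expected answer with the generated text."""
    generated = output.get("generated_text", "")
    return {"match": normalize(expected) == normalize(generated)}
```

To use it in an evaluation, decorate normalized_match with @weave.op() and pass it in the scorers list in the same way as match_score1.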

The evaluation dashboard visualizes performance metrics, enabling informed decisions about model selection and configuration. For detailed guidance, see our previous post on evaluating LLM summarization with Amazon Bedrock and Weave.

Enhancing Amazon Bedrock AgentCore Observability with W&B Weave

Amazon Bedrock AgentCore is a complete set of services for deploying and operating highly capable agents more securely at enterprise scale. It provides more secure runtime environments, workflow execution tools, and operational controls that work with popular frameworks like Strands Agents, CrewAI, LangGraph, and LlamaIndex, as well as many LLMs, whether from Amazon Bedrock or external sources.

AgentCore includes built-in observability through Amazon CloudWatch dashboards that track key metrics like token usage, latency, session duration, and error rates. It also traces workflow steps, showing which tools were invoked and how the model responded, providing essential visibility for debugging and quality assurance in production.

When working with AgentCore and W&B Weave together, teams can use AgentCore’s built-in operational monitoring and security foundations while also using W&B Weave if it aligns with their existing development workflows. Organizations already invested in the W&B environment may choose to incorporate W&B Weave’s visualization tools alongside AgentCore’s native capabilities. This approach gives teams the flexibility to use the observability solution that best fits their established processes and preferences when developing complex agents that chain multiple tools and reasoning steps.

There are two main approaches to add W&B Weave observability to your AgentCore agents: using the native W&B Weave SDK or integrating through OpenTelemetry.

Native W&B Weave SDK

The simplest approach is to use W&B Weave’s @weave.op decorator to automatically track function calls. Initialize W&B Weave with your project name and wrap the functions you want to monitor:

import os
from typing import Any, Dict

import weave
from strands import Agent  # assumption: a Strands Agents agent; use your framework's agent type

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> Dict[str, Any]:
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}

Because AgentCore runs as a Docker container, add W&B Weave to your dependencies (for example, uv add weave) to include it in your container image.
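
If the same agent code also has to run in images or local environments where weave is not installed, one possible pattern is to fall back to a no-op decorator. This is only a sketch under that assumption: the op alias and its fallback are hypothetical, not something W&B Weave provides, and tracing still requires weave.init to have been called when the package is present.

```python
try:
    import weave

    op = weave.op  # real tracing decorator when the package is installed
except ImportError:
    def op(func=None, **_kwargs):
        """Fallback: return the function unchanged, with or without parentheses."""
        def decorate(f):
            return f
        return decorate(func) if callable(func) else decorate


@op()
def word_count_op(text: str) -> int:
    # Traced when weave is available and initialized; a plain function otherwise.
    return len(text.split())
```

The trade-off is that a missing dependency silently disables tracing, so a startup log line stating which branch was taken may be worth adding.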

OpenTelemetry Integration

For teams already using OpenTelemetry or wanting vendor-neutral instrumentation, W&B Weave supports OTLP (OpenTelemetry Protocol) directly:

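import base64
import json

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# WANDB_API_KEY, WEAVE_PROJECT, agent, and user_message are assumed to be defined elsewhere
auth_b64 = base64.b64encode(f"api:{WANDB_API_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint="https://trace.wandb.ai/otel/v1/traces",
    headers={"Authorization": f"Basic {auth_b64}", "project_id": WEAVE_PROJECT}
)

# Register the exporter with a tracer provider and obtain a tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans to track execution
with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("input.value", json.dumps({"prompt": user_message}))
    result = agent(user_message)
    span.set_attribute("output.value", json.dumps({"message": result.message}))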
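
The exporter above authenticates with HTTP Basic authentication, using the literal user name api and the W&B API key as the password, plus a project_id header. As a small sketch, that header construction can be isolated in a helper so it can be checked without sending any traces; build_weave_otlp_headers is a hypothetical name and the values in the usage note are placeholders.

```python
import base64
from typing import Dict


def build_weave_otlp_headers(api_key: str, project_id: str) -> Dict[str, str]:
    """Build the Basic-auth and project headers passed to the OTLP exporter."""
    token = base64.b64encode(f"api:{api_key}".encode()).decode()
    return {"Authorization": f"Basic {token}", "project_id": project_id}
```

The result would be passed as headers=build_weave_otlp_headers(WANDB_API_KEY, WEAVE_PROJECT) when constructing OTLPSpanExporter.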

This approach maintains compatibility with AgentCore’s existing OpenTelemetry infrastructure while routing traces to W&B Weave for visualization.

When using both AgentCore and W&B Weave together, teams have multiple options for observability. AgentCore’s CloudWatch integration monitors system health, resource utilization, and error rates while providing tracing for agent reasoning and tool selection. W&B Weave offers visualization capabilities that present execution data in formats familiar to teams already using the W&B environment. Both solutions provide visibility into how agents process information and make decisions, allowing organizations to choose the observability approach that best aligns with their existing workflows and preferences.

This dual-layer approach means users can:

Monitor production service level agreements (SLAs) through CloudWatch alerts
Debug complex agent behaviors in W&B Weave’s trace explorer
Optimize token usage and latency with detailed execution breakdowns
Compare agent performance across different prompts and configurations

The integration requires minimal code changes, preserves your existing AgentCore deployment, and scales with your agent complexity. Whether you’re building simple tool-calling agents or orchestrating multi-step workflows, this observability stack provides the insights needed to iterate quickly and deploy confidently.

For implementation details and complete code examples, refer to our previous post.

Conclusion

In this post, we demonstrated how to build and optimize enterprise-grade agentic AI solutions by combining Amazon Bedrock’s FMs and AgentCore with W&B Weave’s comprehensive observability toolkit. We explored how W&B Weave can enhance every stage of the LLM development lifecycle, from initial experimentation in the Playground to systematic evaluation of model performance, and finally to production monitoring of complex agent workflows.

The integration between Amazon Bedrock and W&B Weave provides several key capabilities:

Automatic tracking of Amazon Bedrock FM calls with minimal code changes using the W&B Weave SDK
Rapid experimentation through the W&B Weave Playground’s intuitive interface for testing prompts and comparing models
Systematic evaluation with custom scoring functions to evaluate different Amazon Bedrock models
Comprehensive observability for AgentCore deployments, with CloudWatch metrics providing more robust operational monitoring supplemented by detailed execution traces

To get began:

Request a free trial or subscribe to the Weights & Biases AI Development Platform through AWS Marketplace
Install the W&B Weave SDK and follow our code examples to begin tracking your Bedrock FM calls
Experiment with different models in the W&B Weave Playground by adding your AWS credentials and testing various Amazon Bedrock FMs
Set up evaluations using the W&B Weave Evaluation framework to systematically compare model performance for your use cases
Enhance your AgentCore agents by adding W&B Weave observability using either the native SDK or OpenTelemetry integration

Start with a simple integration to track your Amazon Bedrock calls, then progressively adopt more advanced features as your AI applications grow in complexity. The combination of Amazon Bedrock and W&B Weave’s comprehensive development tools provides the foundation needed to build, evaluate, and maintain production-ready AI solutions at scale.

About the authors

James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Ray Strickland is a Senior Partner Solutions Architect at AWS specializing in AI/ML, Agentic AI, and Intelligent Document Processing. He enables partners to deploy scalable generative AI solutions using AWS best practices and drives innovation through strategic partner enablement programs. Ray collaborates across multiple AWS teams to accelerate AI adoption and has extensive experience in partner evaluation and enablement.

Thomas Capelle is a Machine Learning Engineer at Weights & Biases. He’s responsible for keeping the www.github.com/wandb/examples repository live and up to date. He also builds content on MLOps, applications of W&B to industries, and fun deep learning in general. Previously, he used deep learning to solve short-term forecasting for solar energy. He has a background in Urban Planning, Combinatorial Optimization, Transportation Economics, and Applied Math.

Scott Juang is the Director of Alliances at Weights & Biases. Prior to W&B, he led a number of strategic alliances at AWS and Cloudera. Scott studied Materials Engineering and has a passion for renewable energy.
