You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models, such as Llama, Mistral, and Qwen, that you have fine-tuned elsewhere into Amazon Bedrock. The experience is fully serverless, minimizing infrastructure management while providing your imported models with the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Understanding how confident a model is in its predictions is essential for building reliable AI applications, particularly when working with specialized custom models that may encounter domain-specific queries.
With log probability support now added to Custom Model Import, you can access information about your models' confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering strategies.
In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You'll learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications, from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models, that demonstrate how these insights can improve your AI applications, helping you build more trustworthy solutions with your custom models.
Understanding log probabilities
In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to roughly 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By analyzing these values, you can identify when a model is highly certain versus when it's making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:
- Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
- Score and compare outputs: Compare overall sequence likelihood (by summing or averaging log probabilities) to rank or filter multiple model outputs.
- Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
- Reduce RAG costs with early pruning: Run short, low-cost draft generations based on retrieved contexts, compute log probabilities for those drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
- Build confidence-aware applications: Adapt system behavior based on certainty levels, for example by triggering clarifying prompts, providing fallback responses, or flagging content for human review.
Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty, which is particularly valuable for applications where understanding why a model responded in a certain way can be as important as the response itself.
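The confidence figures cited above (for example, -0.1 corresponding to roughly 90%) come from exponentiating the log probability. A minimal sketch of that conversion:

```python
import math

def logprob_to_confidence(logprob):
    """Convert a (typically negative) log probability to a probability in [0, 1]."""
    return math.exp(logprob)

# Values closer to zero mean higher confidence, as in the examples above.
print(round(logprob_to_confidence(-0.1), 3))  # ~0.905, roughly 90% confidence
print(round(logprob_to_confidence(-3.0), 3))  # ~0.05, about 5% confidence
```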
Prerequisites
To use log probability support with Custom Model Import in Amazon Bedrock, you need:
- An active AWS account with access to Amazon Bedrock
- A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when log probabilities support was launched
- Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime
Introducing log probabilities support in Amazon Bedrock
With this launch, Amazon Bedrock now enables models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.
When invoking a model through the Amazon Bedrock InvokeModel API, you can access token log probabilities by setting "return_logprobs": true in the JSON request body. With this flag enabled, the model's response will include additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that customers can analyze the model's confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.
Let's walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you have already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:
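The original code listing is not reproduced here, so the following is a minimal sketch of such a call. The request-body field names (prompt, max_gen_len, temperature, stop) follow the Llama-style schema implied by the parameters discussed below and should be confirmed against your imported model's expected schema; the ARN in the usage comment is a placeholder.

```python
import json

def build_request_body(prompt, max_gen_len=50, temperature=0.5, stop=None):
    """Build an InvokeModel request body with log probabilities enabled."""
    return {
        "prompt": prompt,
        "max_gen_len": max_gen_len,   # cap the generation at 50 tokens
        "temperature": temperature,   # 0.5 for moderate randomness
        "stop": stop or [".", "\n"],  # stop at a period or a newline
        "return_logprobs": True,      # ask Bedrock for token log probabilities
    }

def invoke_with_logprobs(client, model_arn, prompt):
    """Call InvokeModel on a Bedrock Runtime client and parse the JSON response."""
    response = client.invoke_model(
        modelId=model_arn,
        body=json.dumps(build_request_body(prompt)),
    )
    return json.loads(response["body"].read())

# Usage (requires AWS credentials and your imported model's ARN):
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# result = invoke_with_logprobs(client, "arn:aws:bedrock:...:imported-model/...", "The quick brown fox jumps")
# print(result["generation"])
```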
In the preceding code, we send a prompt, "The quick brown fox jumps", to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a temperature of 0.5 for moderate randomness, and a stop condition (either a period or a newline). The "return_logprobs": True parameter tells Amazon Bedrock to return log probabilities in the response.
The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model's internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the "quick brown fox jumps" prompt, showing log probabilities (appearing as negative numbers):
The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We have implemented these transformations using custom code (not shown here) to produce a human-readable format where each token appears alongside its probability, making the model's confidence in its predictions immediately clear.
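A minimal sketch of that decoding step might look like the following. The token IDs, vocabulary mapping, and log-probability values here are made up for illustration; in practice you would decode IDs with the model's own tokenizer (for example, the Llama 3.2 1B tokenizer).

```python
import math

def decode_logprobs(logprob_entries, id_to_text):
    """Turn a list of {token_id: logprob} dicts into (token_text, probability) pairs.

    `id_to_text` maps token IDs to text; in practice this lookup comes from
    the model's own tokenizer rather than a hand-built dictionary.
    """
    readable = []
    for entry in logprob_entries:
        if entry is None:  # the first prompt token has no preceding context
            readable.append(None)
            continue
        for token_id, logprob in entry.items():
            # exp() converts the log probability into a 0-1 probability
            readable.append((id_to_text[token_id], round(math.exp(logprob), 3)))
    return readable

# Toy vocabulary and logprob entries, purely for illustration.
vocab = {791: "The", 4062: " quick", 14198: " brown", 39935: " fox"}
entries = [None, {4062: -9.2}, {14198: -2.1}, {39935: -0.05}]
print(decode_logprobs(entries, vocab))
```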
Let's break down what this tells us about the model's internal processing:
- generation: This is the actual text generated by the model (in our example, a continuation of the prompt that we sent to the model). This is the same field you'd normally get from any model invocation.
- prompt_token_count and generation_token_count: These indicate the number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
- stop_reason: The reason the generation stopped ("stop" means the model naturally stopped at a stop sequence or end-of-text, "length" means it hit the max token limit, and so on). In our case it shows "stop", indicating the model stopped on its own or because of the stop condition we supplied.
- prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continuously predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.
  - The first entry is None because the very first token has no preceding context; the model cannot predict anything without prior input. Each subsequent entry contains token IDs mapped to their log probabilities. We have converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
  - You can observe the model's increasing confidence as it processes familiar sequences. For example, after seeing "The quick brown", the model predicted "fox" with 95.1% confidence. After seeing the full context up to "fox", it predicted "jumps" with 81.1% confidence.
  - Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both "The" (2.7%) and "Question" (30.6%), which means the model considered both tokens viable at that position. This added visibility helps you understand where the model weighed alternatives and can reveal when it was more uncertain or had difficulty choosing among multiple options.
  - Notably low probabilities appear for some tokens: "quick" received just 0.01%, indicating the model found these words unexpected in their context.
  - The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete "quick brown fox jumps" phrase emerged, the model's confidence increased dramatically, showing it recognized this as a familiar expression.
  - When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.
- logprobs: This array contains log probabilities for each token in the model's generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.
  - After decoding these values, we can see that the tokens "over", "the", "lazy", and "dog" all have high probabilities. This demonstrates the model recognized it was completing the well-known phrase "the quick brown fox jumps over the lazy dog", a common pangram that the model appears to have strong familiarity with.
  - In contrast, the final period (newline) token has a much lower probability (30.3%), revealing the model's uncertainty about how to conclude the sentence. This makes sense because the model had multiple valid options: ending the sentence with a period, continuing with more content, or choosing another punctuation mark altogether.
Practical use cases of log probabilities
Token-level log probabilities from the Custom Model Import feature provide valuable insights into your model's decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. Here are impactful ways to use these insights:
Ranking multiple completions
You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different potential completions, whether for summarization, translation, or creative writing, you can calculate each completion's overall likelihood by averaging or summing the log probabilities across all its tokens.
Example:
Prompt: Translate the phrase "Battre le fer pendant qu'il est chaud"
- Completion A: "Strike while the iron is hot" (Average log probability: -0.39)
- Completion B: "Beat the iron while it is hot." (Average log probability: -0.46)
In this example, Completion A receives a higher log probability score (closer to zero), indicating the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach lets your application automatically select the most probable output or present multiple candidates ranked by the model's confidence level.
This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist, including content generation, code completion, and creative writing, providing an objective quality metric based on the model's confidence rather than relying solely on subjective human judgment.
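The ranking itself is a small computation once per-token log probabilities are available. A sketch, using hypothetical per-token values chosen to roughly match the averages above:

```python
def average_logprob(token_logprobs):
    """Average per-token log probability; closer to zero means more confident."""
    return sum(token_logprobs) / len(token_logprobs)

def rank_completions(completions):
    """Sort (text, [token logprobs]) pairs from most to least likely."""
    return sorted(completions, key=lambda c: average_logprob(c[1]), reverse=True)

# Hypothetical per-token log probabilities for the two translations.
candidates = [
    ("Beat the iron while it is hot.", [-0.8, -0.5, -0.4, -0.3, -0.5, -0.4, -0.3]),
    ("Strike while the iron is hot", [-0.9, -0.3, -0.3, -0.2, -0.4, -0.25]),
]
best_text, _ = rank_completions(candidates)[0]
print(best_text)  # the idiomatic translation wins on average log probability
```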
Detecting hallucinations and low-confidence answers
Models can produce hallucinations, plausible-sounding but factually incorrect statements, when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these situations by revealing the model's internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.
By analyzing token-level log probabilities, you can identify which parts of a response the model was potentially unsure about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses should be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be producing content without sufficient supporting information.
Example:
- Prompt:
- Model output:
In this example, we deliberately asked about a fictional metric, Portfolio Synergy Quotient (PSQ), to demonstrate how log probabilities reveal uncertainty in model responses. Despite generating a professional-sounding definition for this non-existent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.
- PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn't highly certain about this specific term.
- Common finance terminology like classes (98.2%) and portfolio (92.8%) shows high confidence, likely because these are standard concepts widely used in financial contexts.
- Important connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%) reveal the model's uncertainty when attempting to explain what PSQ means or does.
- Functional words like is (45.9%) and of (56.6%) hover in the medium confidence range, suggesting uncertainty about the overall structure of the explanation.
By identifying these low-confidence segments, you can implement targeted safeguards in your applications, such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence information and uncertain responses.
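One way to operationalize such a safeguard is a simple threshold filter over token confidences. A sketch, seeded with the confidence figures from the PSQ example (converted back to log scale purely for illustration):

```python
import math

def flag_low_confidence(tokens_with_logprobs, threshold=0.5):
    """Return (token, probability) pairs whose probability falls below `threshold`."""
    return [
        (token, round(math.exp(lp), 3))
        for token, lp in tokens_with_logprobs
        if math.exp(lp) < threshold
    ]

# Selected tokens from the PSQ example, with their confidences re-expressed
# as log probabilities (values are illustrative).
response_tokens = [
    ("PSQ", math.log(0.638)),
    ("measure", math.log(0.140)),
    ("classes", math.log(0.982)),
    ("diversification", math.log(0.318)),
    ("portfolio", math.log(0.928)),
]
print(flag_low_confidence(response_tokens))
# Flagged spans could then be routed to verification or human review.
```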
Monitoring prompt quality
When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you're asking.
By monitoring the average log probability of the initial tokens, typically the first 5–10 generated tokens, you can quantitatively measure prompt clarity. Well-structured prompts with clear context typically produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.
Example:
Prompt comparison for customer service responses:
- Basic prompt:
  - Average log probability of first five tokens: -1.215 (lower confidence)
- Optimized prompt:
  - Average log probability of first five tokens: -0.333 (higher confidence)
The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model's uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements, such as role definitions, contextual details, and explicit expectations, improve model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous data-driven refinement.
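The scoring step is a small computation. A sketch, with hypothetical token log probabilities chosen to reproduce the averages quoted above:

```python
def initial_confidence(generation_logprobs, first_n=5):
    """Average log probability of the first `first_n` generated tokens."""
    head = generation_logprobs[:first_n]
    return sum(head) / len(head)

# Hypothetical per-token log probabilities for the two prompt variants.
basic_prompt_lps = [-2.1, -1.4, -0.9, -1.0, -0.675]
optimized_prompt_lps = [-0.5, -0.3, -0.25, -0.3, -0.315]

print(round(initial_confidence(basic_prompt_lps), 3))      # -1.215
print(round(initial_confidence(optimized_prompt_lps), 3))  # -0.333
```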
Reducing RAG costs with early pruning
In traditional RAG implementations, systems retrieve 5–20 documents and generate full responses using these retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of its actual usefulness.
Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing the retrieved documents in full:
- Generate draft responses based on each retrieved context
- Calculate the average log probability across these short drafts
- Rank contexts by their average log probability scores
- Discard low-scoring contexts that fall below a confidence threshold
- Generate the complete response using only the highest-confidence contexts
This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.
By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
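Once draft scores are computed, the pruning step reduces to a filter and a sort. A sketch with hypothetical document IDs, scores, and thresholds:

```python
def prune_contexts(draft_scores, threshold=-1.0, keep_top=2):
    """Keep only contexts whose draft average log probability clears the threshold.

    `draft_scores` maps a context ID to the average log probability of a
    short draft generated from that context; higher (closer to zero) is better.
    """
    survivors = {c: s for c, s in draft_scores.items() if s >= threshold}
    ranked = sorted(survivors, key=survivors.get, reverse=True)
    return ranked[:keep_top]

# Hypothetical draft scores for five retrieved documents.
scores = {"doc-1": -0.4, "doc-2": -1.8, "doc-3": -0.7, "doc-4": -2.5, "doc-5": -0.9}
print(prune_contexts(scores))  # only the highest-confidence contexts survive
```

Only the surviving contexts would then be passed to the full-length generation step.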
Fine-tuning evaluation
When you have fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine whether your model has developed proper calibration, showing high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.
A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Problems with calibration appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn't properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite generating accurate answers, indicating that training might not have sufficiently reinforced correct patterns.
By systematically testing your model across diverse scenarios and analyzing the log probabilities, you can identify areas needing additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvements, making sure your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
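One simple calibration check is to compare average confidence on answers you have labeled correct versus incorrect: a well-calibrated model shows a clear gap between the two. A sketch with hypothetical evaluation results:

```python
def _avg(xs):
    """Mean of a list, or NaN when the list is empty."""
    return sum(xs) / len(xs) if xs else float("nan")

def calibration_summary(results):
    """Summarize average confidence on correct vs. incorrect answers.

    `results` is a list of (average_probability, is_correct) pairs taken
    from a labeled evaluation set.
    """
    correct = [p for p, ok in results if ok]
    wrong = [p for p, ok in results if not ok]
    return {"correct_avg": round(_avg(correct), 3), "incorrect_avg": round(_avg(wrong), 3)}

# Hypothetical evaluation results for a fine-tuned model.
results = [(0.92, True), (0.88, True), (0.45, False), (0.81, True), (0.35, False)]
print(calibration_summary(results))
```

A large gap (high confidence when correct, low when wrong) suggests good calibration; similar averages for both groups point to over- or underconfidence worth investigating.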
Getting started
Here's how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:
- Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs. Start with familiar prompts to observe which tokens your model predicts with high confidence compared to which it finds surprising.
- Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain, showing high confidence where it should be certain.
- Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.
Conclusion
Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and use.
Throughout this post, we have demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.
For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains requiring high reliability, such as finance, healthcare, or enterprise applications, where incorrect outputs can have significant consequences.
By revealing confidence patterns across different types of queries, log probabilities help you assess how well your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you're evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important advancement in bringing greater transparency and control to generative AI development on Amazon Bedrock.
We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability demonstrates the commitment from Amazon Bedrock to provide developers with tools that enable confident innovation while delivering the scalability, security, and simplicity of a fully managed service.
About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale, from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.

