Improving response quality for user queries is crucial for AI-driven applications, especially those focused on user satisfaction. For example, an HR chat-based assistant should strictly follow company policies and respond using a certain tone. A deviation from that can be corrected by feedback from users. This post demonstrates how Amazon Bedrock, combined with a user feedback dataset and few-shot prompting, can refine responses for higher user satisfaction. By using Amazon Titan Text Embeddings v2, we demonstrate a statistically significant improvement in response quality, making it a valuable tool for applications seeking accurate and personalized responses.
Recent studies have highlighted the value of feedback and prompting in refining AI responses. Prompt Optimization with Human Feedback proposes a systematic approach to learning from user feedback, using it to iteratively fine-tune models for improved alignment and robustness. Similarly, Black-Box Prompt Optimization: Aligning Large Language Models without Model Training demonstrates how retrieval augmented chain-of-thought prompting enhances few-shot learning by integrating relevant context, enabling better reasoning and response quality. Building on these ideas, our work uses the Amazon Titan Text Embeddings v2 model to optimize responses using available user feedback and few-shot prompting, achieving statistically significant improvements in user satisfaction. Amazon Bedrock already provides an automatic prompt optimization feature to adapt and optimize prompts without additional user input. In this blog post, we showcase how to use open source libraries for a more customized optimization based on user feedback and few-shot prompting.
We’ve developed a practical solution using Amazon Bedrock that automatically improves chat assistant responses based on user feedback. This solution uses embeddings and few-shot prompting. To demonstrate the effectiveness of the solution, we used a publicly available user feedback dataset. However, when applying it within a company, the model can use its own feedback data provided by its users. With our test dataset, it shows a 3.67% increase in user satisfaction scores. The key steps include:
- Retrieve a publicly available user feedback dataset (for this example, the Unified Feedback dataset on Hugging Face).
- Create embeddings for queries to capture semantically similar examples, using Amazon Titan Text Embeddings.
- Use similar queries as examples in a few-shot prompt to generate optimized prompts.
- Compare optimized prompts against direct large language model (LLM) calls.
- Validate the improvement in response quality using a paired sample t-test.
The following diagram is an overview of the system.
The key benefits of using Amazon Bedrock are:
- Zero infrastructure management – Deploy and scale without managing complex machine learning (ML) infrastructure
- Cost-effective – Pay only for what you use with the Amazon Bedrock pay-as-you-go pricing model
- Enterprise-grade security – Use AWS built-in security and compliance features
- Simple integration – Integrate seamlessly with existing applications and open source tools
- Multiple model options – Access various foundation models (FMs) for different use cases
The following sections dive deeper into these steps, providing code snippets from the notebook to illustrate the process.
Prerequisites
Prerequisites for implementation include an AWS account with Amazon Bedrock access, Python 3.8 or later, and configured AWS credentials.
Data collection
We downloaded a user feedback dataset from Hugging Face, llm-blender/Unified-Feedback. The dataset contains fields such as conv_A_user (the user query) and conv_A_rating (a binary rating; 0 means the user doesn’t like it and 1 means the user likes it). The following code retrieves the dataset and focuses on the fields needed for embedding generation and feedback analysis. It can be run in an Amazon SageMaker notebook or a Jupyter notebook that has access to Amazon Bedrock.
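The following is a minimal sketch of that retrieval step using the Hugging Face datasets library; the "all" configuration name and the column selection are our assumptions, while the field names follow the dataset description above.

```python
# Minimal sketch: pull the feedback dataset and keep the fields used downstream.
# The "all" configuration name is an assumption; adjust it to the subset you need.
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("llm-blender/Unified-Feedback", "all", split="train")

# Keep only the user query and the binary rating used for feedback analysis
df = pd.DataFrame(dataset)[["conv_A_user", "conv_A_rating"]]
print(df.shape)
print(df.head())
```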
Data sampling and embedding generation
To manage the process effectively, we sampled 6,000 queries from the dataset. We used Amazon Titan Text Embeddings v2 to create embeddings for these queries, transforming text into high-dimensional representations that allow for similarity comparisons. See the following code:
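The snippet below is a minimal sketch of this step with the Bedrock runtime client; the `get_embedding` helper, Region, and sampling seed are our assumptions, while the model ID is the Amazon Titan Text Embeddings v2 identifier.

```python
# Sketch: embed 6,000 sampled queries with Amazon Titan Text Embeddings v2.
import json

import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def get_embedding(text: str) -> np.ndarray:
    """Return the Titan Text Embeddings v2 vector for a piece of text."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

# Sample 6,000 queries and store their embeddings for later similarity search
sampled_df = df.sample(n=6000, random_state=42).reset_index(drop=True)
sampled_df["embedding"] = sampled_df["conv_A_user"].apply(get_embedding)
```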
Few-shot prompting with similarity search
For this part, we took the following steps:
- Sample 100 queries from the dataset for testing. Sampling 100 queries helps us run multiple trials to validate our solution.
- Compute cosine similarity (a measure of similarity between two non-zero vectors) between the embeddings of these test queries and the stored 6,000 embeddings.
- Select the top k queries most similar to each test query to serve as few-shot examples. We set k = 10 to balance computational efficiency and diversity of the examples.
See the following code:
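The following sketch shows one way to implement this with scikit-learn; the variable and helper names (`test_df`, `top_k_examples`, plus `get_embedding` and `sampled_df` from the earlier sketches) are our own.

```python
# Sketch: retrieve the top k=10 most similar stored queries for each test query.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

K = 10

# 100 held-out queries used for the evaluation trials
test_df = df.sample(n=100, random_state=7).reset_index(drop=True)
test_embeddings = np.vstack(test_df["conv_A_user"].apply(get_embedding).to_numpy())

# Matrix of the 6,000 stored query embeddings
stored_embeddings = np.vstack(sampled_df["embedding"].to_numpy())

# similarities[i, j] = cosine similarity between test query i and stored query j
similarities = cosine_similarity(test_embeddings, stored_embeddings)

def top_k_examples(test_idx: int, k: int = K):
    """Return the k most similar stored queries (and their ratings) for a test query."""
    top_idx = np.argsort(similarities[test_idx])[::-1][:k]
    return sampled_df.iloc[top_idx][["conv_A_user", "conv_A_rating"]]
```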
This code provides a few-shot context for each test query, using cosine similarity to retrieve the closest matches. These example queries and their feedback serve as additional context to guide the prompt optimization. The following function generates the few-shot prompt:
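Below is a hedged sketch of such a function; the template wording is illustrative rather than the notebook's exact prompt.

```python
def build_few_shot_prompt(user_query: str, examples) -> str:
    """Format retrieved similar queries and their feedback as few-shot context.

    `examples` is a DataFrame with conv_A_user and conv_A_rating columns,
    as returned by top_k_examples above.
    """
    lines = [
        "Here are similar user queries and whether the user liked the response "
        "(1 = liked, 0 = not liked):"
    ]
    for _, row in examples.iterrows():
        lines.append(f"Query: {row['conv_A_user']}\nUser rating: {row['conv_A_rating']}")
    lines.append(
        "Using these examples as guidance, write an optimized system prompt that will "
        f"produce a satisfying response to the following query:\nQuery: {user_query}"
    )
    return "\n\n".join(lines)
```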
The get_optimized_prompt function performs the following tasks:
- The user query and similar examples generate a few-shot prompt.
- We use the few-shot prompt in an LLM call to generate an optimized prompt.
- Make sure the output is in the expected format using Pydantic.
See the following code:
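The following is a sketch of get_optimized_prompt, assuming the Bedrock Converse API; the Pydantic schema, prompt wording, and the choice of Claude 3.5 Haiku for this intermediate step are assumptions (the model ID may need to be an inference profile in your Region).

```python
import boto3
from pydantic import BaseModel

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

class OptimizedPrompt(BaseModel):
    """Pydantic model that constrains the output to a single prompt string."""
    optimized_prompt: str

def get_optimized_prompt(user_query: str, examples) -> str:
    """Generate an optimized prompt for a user query from its few-shot examples."""
    few_shot_prompt = build_few_shot_prompt(user_query, examples)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        system=[{"text": "Return only the optimized prompt text, nothing else."}],
        messages=[{"role": "user", "content": [{"text": few_shot_prompt}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    # Validate the output shape with Pydantic before using it downstream
    return OptimizedPrompt(optimized_prompt=text).optimized_prompt
```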
The make_llm_call_with_optimized_prompt function uses an optimized prompt and user query to make the LLM (Anthropic's Claude 3.5 Haiku) call to get the final response:
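The following is a minimal sketch of that call using the Converse API; the inference parameters are illustrative.

```python
def make_llm_call_with_optimized_prompt(optimized_prompt: str, user_query: str) -> str:
    """Answer the user query with Claude 3.5 Haiku, guided by the optimized prompt."""
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        system=[{"text": optimized_prompt}],
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```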
Comparative evaluation of optimized and unoptimized prompts
To compare the optimized prompt with the baseline (in this case, the unoptimized prompt), we defined a function that returned a result without an optimized prompt for all the queries in the evaluation dataset:
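A sketch of that baseline, reusing the Bedrock client from the earlier snippets; the function name and parameters are assumptions.

```python
def make_llm_call_without_optimized_prompt(user_query: str) -> str:
    """Direct LLM call used as the unoptimized baseline."""
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Baseline responses for every query in the evaluation dataset
unoptimized_responses = [
    make_llm_call_without_optimized_prompt(q) for q in test_df["conv_A_user"]
]
```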
The following function generates the query response using similarity search and intermediate optimized prompt generation for all the queries in the evaluation dataset:
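A sketch of that end-to-end path, built from the helpers defined in the earlier snippets.

```python
# For each evaluation query: retrieve similar examples, generate an intermediate
# optimized prompt, then produce the final optimized response.
optimized_responses = []
for i, query in enumerate(test_df["conv_A_user"]):
    examples = top_k_examples(i)                              # similarity search
    optimized_prompt = get_optimized_prompt(query, examples)  # intermediate prompt
    optimized_responses.append(
        make_llm_call_with_optimized_prompt(optimized_prompt, query)
    )
```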
This code compares responses generated with and without few-shot optimization, setting up the data for evaluation.
LLM as a judge and evaluation of responses
To quantify response quality, we used an LLM as a judge to score the optimized and unoptimized responses for alignment with the user query. We used Pydantic here to make sure the output sticks to the desired pattern of 0 (the LLM predicts the response won't be liked by the user) or 1 (the LLM predicts the response will be liked by the user):
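A minimal sketch of the judge's output schema; the class and field names are our own.

```python
from typing import Literal

from pydantic import BaseModel

class JudgeVerdict(BaseModel):
    # 1 = the judge predicts the user will like the response, 0 = it predicts they won't
    satisfaction: Literal[0, 1]
```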
LLM-as-a-judge is a capability where an LLM can judge the accuracy of a text using certain grounding examples. We used that capability here to assess the difference between the results obtained from the optimized and unoptimized prompts. Amazon Bedrock launched an LLM-as-a-judge capability in December 2024 that can be used for such use cases. In the following function, we demonstrate how the LLM acts as an evaluator, scoring responses based on their alignment and satisfaction for the full evaluation dataset:
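The sketch below assumes the same Claude 3.5 Haiku model acts as the judge and that it returns a small JSON object; the judge prompt wording is illustrative.

```python
import json

def judge_response(user_query: str, response_text: str) -> int:
    """Ask the judge model whether the user would like this response (1) or not (0)."""
    judge_prompt = (
        "You are judging whether a user would be satisfied with a response.\n"
        f"Query: {user_query}\nResponse: {response_text}\n"
        'Answer with JSON only, for example {"satisfaction": 1}.'
    )
    result = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
    )
    raw = result["output"]["message"]["content"][0]["text"]
    # Pydantic enforces that the verdict is exactly 0 or 1
    return JudgeVerdict(**json.loads(raw)).satisfaction

# Score every query in the evaluation dataset for both variants
optimized_verdicts = [
    judge_response(q, r) for q, r in zip(test_df["conv_A_user"], optimized_responses)
]
unoptimized_verdicts = [
    judge_response(q, r) for q, r in zip(test_df["conv_A_user"], unoptimized_responses)
]
```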
In the following example, we repeated this process for 20 trials, capturing user satisfaction scores each time, as shown in the sketch below. The overall score for the dataset is the sum of the user satisfaction scores.
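A hedged sketch of the trial loop: here each trial repeats the judging pass and records the per-query mean satisfaction (which matches the means reported below), rather than regenerating responses; the exact per-trial procedure in the notebook may differ.

```python
import numpy as np

optimized_scores, unoptimized_scores = [], []
for trial in range(20):
    opt = [judge_response(q, r) for q, r in zip(test_df["conv_A_user"], optimized_responses)]
    unopt = [judge_response(q, r) for q, r in zip(test_df["conv_A_user"], unoptimized_responses)]
    optimized_scores.append(np.mean(opt))
    unoptimized_scores.append(np.mean(unopt))
```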
Results analysis
The following line chart shows the performance improvement of the optimized solution over the unoptimized one. Green areas indicate positive improvements, whereas red areas show negative changes.
As we gathered the results of the 20 trials, we observed that the mean of the satisfaction scores from the unoptimized prompt was 0.8696, whereas the mean of the satisfaction scores from the optimized prompt was 0.9063. Therefore, our method outperforms the baseline by 3.67%.
Finally, we ran a paired sample t-test to compare the satisfaction scores from the optimized and unoptimized prompts. This statistical test validated whether prompt optimization significantly improved response quality. See the following code:
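A minimal sketch using SciPy, operating on the per-trial score lists from the previous sketch.

```python
from scipy import stats

# Paired t-test: the same 20 trials produce both score series
t_stat, p_value = stats.ttest_rel(optimized_scores, unoptimized_scores)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.6f}")
```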
After running the t-test, we obtained a p-value of 0.000762, which is less than 0.05. Therefore, the performance boost of optimized prompts over unoptimized prompts is statistically significant.
Key takeaways
We learned the following key takeaways from this solution:
- Few-shot prompting improves query responses – Using highly similar few-shot examples leads to significant improvements in response quality.
- Amazon Titan Text Embeddings enables contextual similarity – The model produces embeddings that facilitate effective similarity searches.
- Statistical validation confirms effectiveness – A p-value of 0.000762 indicates that our optimized approach meaningfully enhances user satisfaction.
- Improved business impact – This approach delivers measurable business value through improved AI assistant performance. The 3.67% increase in satisfaction scores translates to tangible outcomes: HR departments can expect fewer policy misinterpretations (reducing compliance risks), and customer service teams might see a significant reduction in escalated tickets. The solution's ability to continuously learn from feedback creates a self-improving system that increases ROI over time without requiring specialized ML expertise or infrastructure investments.
Limitations
Although the system shows promise, its performance heavily depends on the availability and volume of user feedback, especially in closed-domain applications. In scenarios where only a handful of feedback examples are available, the model might struggle to generate meaningful optimizations or fail to capture the nuances of user preferences effectively. Additionally, the current implementation assumes that user feedback is reliable and representative of broader user needs, which might not always be the case.
Next steps
Future work could focus on expanding this approach to support multilingual queries and responses, enabling broader applicability across diverse user bases. Incorporating Retrieval Augmented Generation (RAG) techniques could further improve context handling and accuracy for complex queries. Additionally, exploring ways to address the limitations in low-feedback scenarios, such as synthetic feedback generation or transfer learning, could make the approach more robust and versatile.
Conclusion
In this post, we demonstrated the effectiveness of query optimization using Amazon Bedrock, few-shot prompting, and user feedback to significantly improve response quality. By aligning responses with user-specific preferences, this approach reduces the need for expensive model fine-tuning, making it practical for real-world applications. Its flexibility makes it suitable for chat-based assistants across various domains, such as ecommerce, customer service, and hospitality, where high-quality, user-aligned responses are essential.
To learn more, refer to the following resources:
About the Authors
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Parth Patwa is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Yingwei Yu is an Applied Science Manager at the Generative AI Innovation Center at Amazon Web Services.