The Amazon.com Catalog is the foundation of every customer's shopping experience: the definitive source of product information, with attributes that power search, recommendations, and discovery. When a seller lists a new product, the catalog system must extract structured attributes (dimensions, materials, compatibility, and technical specs) while generating content such as titles that match how customers search. A title isn't a simple enumeration like color or size; it must balance seller intent, customer search behavior, and discoverability. This complexity, multiplied by millions of daily submissions, makes catalog enrichment an ideal proving ground for self-learning AI.
In this post, we demonstrate how the Amazon Catalog team built a self-learning system that continuously improves accuracy while reducing costs at scale using Amazon Bedrock.
The challenge
In generative AI deployment environments, improving model performance requires constant attention. Because models process millions of products, they inevitably encounter edge cases, evolving terminology, and domain-specific patterns where accuracy may degrade. The traditional approach, in which applied scientists analyze failures, update prompts, test changes, and redeploy, works but is resource-intensive and struggles to keep pace with real-world volume and variety. The challenge isn't whether we can improve these systems, but how to make improvement scalable and automatic rather than dependent on manual intervention. At Amazon Catalog, we faced this challenge head-on. The tradeoffs seemed impossible: large models would deliver accuracy but wouldn't scale efficiently to our volume, while smaller models struggled with the complex, ambiguous cases where sellers needed the most help.
Solution overview
Our breakthrough came from an unconventional experiment. Instead of choosing a single model, we deployed multiple smaller models to process the same products. When these models agreed on an attribute extraction, we could trust the result. But when they disagreed, whether from genuine ambiguity, missing context, or one model making an error, we discovered something profound. These disagreements weren't always errors, but they were almost always signals of complexity.

This led us to design a self-learning system that reimagines how generative AI scales. Multiple smaller models process routine cases through consensus, invoking larger models only when disagreements occur. The larger model is implemented as a supervisor agent with access to specialized tools for deeper investigation and analysis. But the supervisor doesn't just resolve disputes; it generates reusable learnings, stored in a dynamic knowledge base, that help prevent entire classes of future disagreements. We invoke more powerful models only when the system detects high learning value at inference time, while correcting the output.

The result is a self-learning system where costs decrease and quality increases, because the system learns to handle the edge cases that previously triggered supervisor calls. Error rates fell consistently, not through retraining but through accumulated learnings from resolved disagreements injected into smaller model prompts. The following figure shows the architecture of this self-learning system.
In the self-learning architecture, product data flows through generator-evaluator workers, with disagreements routed to a supervisor for investigation. Post-inference, the system also captures feedback signals from sellers (such as listing updates and appeals) and customers (such as returns and negative reviews). Learnings from these sources are stored in a hierarchical knowledge base and injected back into worker prompts, creating a continuous improvement loop.
The following describes a simplified reference architecture that demonstrates how this self-learning pattern can be implemented using AWS services. While our production system has more complexity, this example illustrates the core components and data flows.
This system can be built with Amazon Bedrock, which provides the essential infrastructure for multi-model architectures. The ability of Amazon Bedrock to access diverse foundation models allows teams to deploy smaller, efficient models like Amazon Nova Lite as workers and more capable models like Anthropic Claude Sonnet as supervisors, optimizing both cost and performance. For even greater cost efficiency at scale, teams can also deploy open source small models on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances, providing full control over worker model selection and batch throughput optimization. For productionizing a supervisor agent with its specialized tools and dynamic knowledge base, Amazon Bedrock AgentCore provides the runtime scalability, memory management, and observability needed to deploy self-learning systems reliably at scale.
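To make the routing concrete, here is a minimal sketch of the consensus-or-escalate pattern using the Amazon Bedrock Converse API. The prompts, agreement check, and two-worker setup are illustrative simplifications, not our production implementation:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

WORKER_MODEL_ID = "amazon.nova-lite-v1:0"                          # lightweight worker
SUPERVISOR_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # capable supervisor


def invoke(model_id: str, prompt: str) -> str:
    """Single model call via the Bedrock Converse API."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()


def extract_attribute(product_description: str, attribute: str) -> dict:
    prompt = f"Extract the '{attribute}' attribute from this product:\n{product_description}"
    # Two independent worker passes; in production these are distinct
    # generator and evaluator roles rather than identical calls.
    first = invoke(WORKER_MODEL_ID, prompt)
    second = invoke(WORKER_MODEL_ID, prompt)

    if first == second:
        # Consensus path: trust the result at minimal cost.
        return {"value": first, "route": "consensus"}

    # Disagreement path: escalate to the larger supervisor model.
    supervisor_prompt = (
        f"Two extractions of '{attribute}' disagree: '{first}' vs '{second}'.\n"
        f"Product:\n{product_description}\nResolve the correct value."
    )
    return {"value": invoke(SUPERVISOR_MODEL_ID, supervisor_prompt), "route": "supervisor"}
```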

Our supervisor agent integrates with Amazon's extensive Selection and Catalog Systems. The preceding diagram is a simplified view showing the key features of the agent and some of the AWS services that make it possible. Product data flows through generator-evaluator workers (Amazon EC2 and Amazon Bedrock Runtime), with agreements stored directly and disagreements routed to a supervisor agent (Amazon Bedrock AgentCore). The learning aggregator and memory manager use Amazon DynamoDB for the knowledge base, with learnings injected back into worker prompts. Human review (Amazon Simple Queue Service (Amazon SQS)) and observability (Amazon CloudWatch) complete the architecture. Production implementations will likely require additional components for scale, reliability, and integration with existing systems.
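As a sketch of the DynamoDB-backed knowledge base, the following assumes a hypothetical `catalog-learnings` table keyed by product category; the schema is illustrative, not the production data model:

```python
import time

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("catalog-learnings")


def save_learning(category: str, learning_text: str, source: str) -> None:
    """Persist a supervisor-generated learning under its category path."""
    table.put_item(
        Item={
            "category": category,            # partition key: knowledge-tree path
            "created_at": int(time.time()),  # sort key: when the learning was added
            "learning": learning_text,       # concrete, actionable guidance
            "source": source,                # e.g., "worker_disagreement", "seller_appeal"
        }
    )


def learnings_for(category: str) -> list[str]:
    """Fetch the learnings to inject into worker prompts for this category."""
    result = table.query(KeyConditionExpression=Key("category").eq(category))
    return [item["learning"] for item in result["Items"]]
```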
But how did we arrive at this architecture? The key insight came from an unexpected place.
The insight: Turning disagreements into opportunities
Our perspective shifted during a debugging session. When multiple smaller models (such as Amazon Nova Lite) disagreed on product attributes, interpreting the same specification differently based on how they understood technical terminology, we initially saw this as a failure. But the data told a different story: products where our smaller models disagreed correlated with cases requiring more manual review and clarification. When models disagreed, those were precisely the products that needed more investigation. The disagreements were surfacing learning opportunities, but we couldn't have engineers and scientists deep-dive on every case. The supervisor agent does this automatically at scale. And crucially, the goal isn't just to determine which model was right; it's to extract learnings that help prevent similar disagreements in the future. This is the key to efficient scaling.

Disagreements don't just come from AI workers at inference time. Post-inference, sellers express disagreement through listing updates and appeals, signals that our original extraction might have missed crucial context. Customers disagree through returns and negative reviews, often indicating that product information didn't match expectations. These post-inference human signals feed into the same learning pipeline, with the supervisor investigating patterns and generating learnings that help prevent similar issues across future products.

We found a sweet spot: attributes with moderate AI worker disagreement rates yielded the richest learnings, high enough to surface meaningful patterns, low enough to indicate solvable ambiguity. When disagreement rates are too low, they often reflect noise or fundamental model limitations rather than learnable patterns; for these, we consider using more capable workers. When disagreement rates are too high, it signals that worker models or prompts aren't yet mature enough, triggering excessive supervisor calls that undermine the efficiency gains of the architecture. These thresholds vary by task and domain; the key is identifying your own sweet spot, where disagreements represent genuine complexity worth investigating rather than fundamental gaps in worker capability or random noise.
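A small sketch of this triage logic follows. The band boundaries are illustrative placeholders, since the right sweet spot varies by task and domain:

```python
def triage_attribute(disagreement_rate: float, low: float = 0.02, high: float = 0.30) -> str:
    """Classify an attribute by its worker disagreement rate."""
    if disagreement_rate < low:
        return "noise_or_model_limits"   # consider more capable workers
    if disagreement_rate > high:
        return "immature_workers"        # refine prompts or upgrade workers first
    return "sweet_spot"                  # rich, solvable learning opportunities
```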
Deep dive: How it works
At the heart of our system are multiple lightweight worker models running in parallel: some as generators extracting attributes, others as evaluators assessing those extractions. These workers can be implemented in a non-agentic way with fixed inputs, making them batch-friendly and scalable.

The generator-evaluator pattern creates productive tension, conceptually similar to the productive tension in generative adversarial networks (GANs), though our approach operates at inference time through prompting rather than training. We explicitly prompt evaluators to be critical, instructing them to scrutinize extractions for ambiguities, missing context, or potential misinterpretations. This adversarial dynamic surfaces disagreements that represent genuine complexity rather than letting ambiguous cases pass through undetected.

When the generator and evaluator agree, we have high confidence in the result and process it at minimal computational cost. This consensus path handles most product attributes. When they disagree, we've identified a case worth investigating, triggering the supervisor to resolve the dispute and extract reusable learnings.
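The following sketch shows one way to prompt the pair, reusing the `invoke()` helper and `WORKER_MODEL_ID` from the earlier routing sketch. The prompts and the AGREE/DISAGREE contract are illustrative assumptions:

```python
GENERATOR_PROMPT = (
    "Extract the '{attribute}' attribute from this product listing. "
    "Return only the value.\n\n{listing}"
)

# The evaluator is explicitly prompted to be critical rather than to rubber-stamp.
EVALUATOR_PROMPT = (
    "You are a critical reviewer. Scrutinize this extraction for ambiguities, "
    "missing context, or misinterpretations.\n\n"
    "Listing:\n{listing}\n\nAttribute: {attribute}\nExtracted value: {value}\n\n"
    "Reply AGREE if the value is clearly correct, otherwise reply DISAGREE "
    "followed by your concern."
)


def generate_and_evaluate(listing: str, attribute: str) -> dict:
    value = invoke(WORKER_MODEL_ID, GENERATOR_PROMPT.format(attribute=attribute, listing=listing))
    verdict = invoke(
        WORKER_MODEL_ID,
        EVALUATOR_PROMPT.format(listing=listing, attribute=attribute, value=value),
    )
    # A DISAGREE verdict routes this case to the supervisor for investigation.
    return {"value": value, "agreed": verdict.startswith("AGREE"), "evaluator_notes": verdict}
```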
Our architecture treats disagreement as a universal learning signal. At inference time, worker-to-worker disagreements catch ambiguity. Post-inference, seller feedback catches misalignments with intent and customer feedback catches misalignments with expectations. All three channels feed the supervisor, which extracts learnings that improve accuracy across the board.

When workers disagree, we invoke a supervisor agent, a more capable model that resolves the dispute and investigates why it happened. The supervisor determines what context or reasoning the workers lacked, and these insights become reusable learnings for future cases. For example, when workers disagreed about usage classification for a product based on certain technical terms, the supervisor investigated and clarified that those terms alone were insufficient; visual context and other signals needed to be considered together. The supervisor generated a learning about how to properly weight different signals for that product category. This learning immediately updated our knowledge base, and when injected into worker prompts for similar products, helped prevent future disagreements across thousands of items.

While the workers could theoretically be the same model as the supervisor, using smaller models is crucial for efficiency at scale. The architectural advantage emerges from this asymmetry: lightweight workers handle routine cases through consensus, while the more capable supervisor is invoked only when disagreements surface high-value learning opportunities. As the system accumulates learnings and disagreement rates drop, supervisor calls naturally decline; efficiency gains are baked directly into the architecture.

This worker-supervisor heterogeneity also enables richer investigation. Because supervisors are invoked selectively, they can afford to pull in additional signals (customer reviews, return reasons, seller history) that would be impractical to retrieve for every product but provide crucial context when resolving complex disagreements. When these signals yield generalizable insights about how customers want product information presented, such as which attributes to highlight, what terminology resonates, and how to frame specifications, the resulting learnings benefit future inferences across similar products without retrieving those resource-intensive signals again. Over time, this creates a feedback loop: better product information leads to fewer returns and negative reviews, which in turn reflects improved customer satisfaction.
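A minimal sketch of the supervisor's two outputs, a resolution for the case at hand and a reusable learning, follows. It assumes `invoke()` and `SUPERVISOR_MODEL_ID` from the routing sketch and `save_learning()` from the knowledge base sketch; the prompt and JSON contract are illustrative, and production code would validate the model output rather than parsing it directly:

```python
import json

SUPERVISOR_PROMPT = (
    "Workers disagreed while extracting '{attribute}'.\n"
    "Listing:\n{listing}\n"
    "Generator value: {value}\nEvaluator concern: {concern}\n\n"
    "1. Resolve the correct value.\n"
    "2. State a generalizable learning that would prevent similar disagreements "
    "for this product category (not a fix for this one item).\n"
    'Respond as JSON: {{"resolved_value": ..., "learning": ..., "category": ...}}'
)


def supervise(listing: str, attribute: str, value: str, concern: str) -> dict:
    raw = invoke(
        SUPERVISOR_MODEL_ID,
        SUPERVISOR_PROMPT.format(attribute=attribute, listing=listing, value=value, concern=concern),
    )
    outcome = json.loads(raw)  # production code would add schema validation here
    # Persist the generalizable learning so future worker prompts benefit from it.
    save_learning(outcome["category"], outcome["learning"], source="worker_disagreement")
    return outcome
```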
The knowledge base: Making learnings scalable
The supervisor investigates disagreements at the individual product level. With millions of items to process, we need a scalable way to transform these product-specific insights into reusable learnings. Our aggregation strategy adapts to context: high-volume patterns get synthesized into broader learnings, while unique or critical cases are preserved individually.

We use a hierarchical structure where a large language model (LLM)-based memory manager navigates the knowledge tree to place each learning. Starting from the root, it traverses categories and subcategories, deciding at each level whether to continue down an existing path, create a new branch, merge with existing knowledge, or replace outdated information. This dynamic organization allows the knowledge base to evolve with emerging patterns while maintaining logical structure. During inference, workers receive relevant learnings in their prompts based on product category, automatically incorporating domain knowledge from past disagreements.

The knowledge base also introduces traceability: when an extraction seems incorrect, we can pinpoint exactly which learning influenced it. This shifts auditing from an unscalable task to a practical one. Instead of reviewing a sample of millions of outputs, where human effort grows proportionally with scale, teams can audit the knowledge base itself, which stays relatively fixed in size regardless of inference volume. Domain experts can directly contribute by adding or refining entries, no retraining required. A single well-crafted learning can immediately improve accuracy across thousands of products. The knowledge base bridges human expertise and AI capability, where automated learnings and human insights work together.
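The following sketch illustrates one way the memory manager's tree traversal could work. The in-memory tree, placement prompt, and action vocabulary are all assumptions for illustration; the production placement, merging, and replacement logic is richer:

```python
PLACEMENT_PROMPT = (
    "You organize a knowledge tree of catalog learnings.\n"
    "Current node: {node_path}\nChild branches: {children}\n"
    "New learning: {learning}\n\n"
    "Choose one action: DESCEND:<child>, NEW_BRANCH:<name>, MERGE, or REPLACE."
)


def place_learning(tree: dict, learning: str) -> str:
    """Walk the tree one level at a time until the learning finds its home."""
    node, path = tree, "root"
    while True:
        children = list(node.get("children", {}))
        action = invoke(
            SUPERVISOR_MODEL_ID,
            PLACEMENT_PROMPT.format(node_path=path, children=children, learning=learning),
        )
        if action.startswith("DESCEND:") and action.split(":", 1)[1] in node.get("children", {}):
            child = action.split(":", 1)[1]
            node, path = node["children"][child], f"{path}/{child}"
        elif action.startswith("NEW_BRANCH:"):
            name = action.split(":", 1)[1]
            node.setdefault("children", {})[name] = {"learnings": [learning]}
            return f"{path}/{name}"
        else:  # MERGE or REPLACE at the current node (merge logic elided)
            node.setdefault("learnings", []).append(learning)
            return path
```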
Lessons learned and best practices
When this self-learning architecture works best:
- High-volume inference where input diversity drives compounded learning
- Quality-critical applications where consensus provides natural quality assurance
- Evolving domains where new patterns and terminology constantly emerge
It's less suitable for low-volume scenarios (insufficient disagreements for learning) or use cases with fixed, unchanging rules.
Critical success factors:
- Defining disagreements: With a generator-evaluator pair, disagreement occurs when the evaluator flags the extraction as needing improvement. With multiple workers, scale thresholds accordingly. The key is maintaining productive tension between workers. If disagreement rates fall outside the productive range (too low or too high), consider more capable workers or refined prompts.
- Monitoring learning effectiveness: Disagreement rates should decrease over time; this is your primary health metric (see the sketch after this list). If rates stay flat, investigate knowledge retrieval, prompt injection, or evaluator criticality.
- Knowledge organization: Structure learnings hierarchically and keep them actionable. Abstract guidance doesn't help; specific, concrete learnings directly improve future inferences.
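A minimal sketch of the health metric, assuming per-inference routing labels like those returned by the earlier routing sketch are logged; the trend check is an illustrative simplification:

```python
from collections import Counter


def disagreement_rate(routes: list[str]) -> float:
    """Fraction of inferences escalated to the supervisor."""
    counts = Counter(routes)
    total = counts["consensus"] + counts["supervisor"]
    return counts["supervisor"] / total if total else 0.0


def learning_is_effective(daily_rates: list[float], tolerance: float = 0.01) -> bool:
    """Rates should trend down as learnings accumulate; flat rates warrant
    investigating knowledge retrieval, prompt injection, or evaluator criticality."""
    return len(daily_rates) >= 2 and daily_rates[-1] < daily_rates[0] - tolerance
```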
Common pitfalls
- Focusing on cost over intelligence: Cost reduction is a byproduct, not the goal
- Rubber-stamp evaluators: Evaluators that simply approve generator outputs won't surface meaningful disagreements; prompt them to actively challenge and critique extractions
- Poor learning extraction: Supervisors must identify generalizable patterns, not just fix individual cases
- Knowledge rot: Without organization, learnings become unsearchable and unusable
The key insight: treat declining disagreement rates as your north star metric; they show the system is truly learning.
Deployment strategies: Two approaches
- Learn-then-deploy: Start with basic prompts and let the system learn aggressively in a pre-production environment. Domain experts then audit the knowledge base, not individual outputs, to make sure learned patterns align with desired outcomes. When approved, deploy with validated learnings. This is ideal for new use cases where you don't yet know what good looks like; disagreements help discover the right patterns, and knowledge base auditing lets you shape them before production.
- Deploy-and-learn: Start with refined prompts and good initial quality, then continuously improve through ongoing learning in production. This works best for well-understood use cases where you can define quality upfront but still want to capture domain-specific nuances over time.
Both approaches use the same architecture; the choice depends on whether you're exploring new territory or optimizing familiar ground.
Conclusion
What began as an experiment in catalog enrichment revealed a fundamental truth: AI systems don't have to be frozen in time. By embracing disagreements as learning signals rather than failures, we've built an architecture that accumulates domain knowledge through actual usage. We watched the system evolve from generic understanding to domain-specific expertise. It learned industry-specific terminology. It discovered contextual rules that vary across categories. It adapted to requirements no pre-trained model would encounter, all without retraining, through learnings stored in a knowledge base and injected back into worker prompts. For teams operationalizing similar architectures, Amazon Bedrock AgentCore offers purpose-built capabilities:
- AgentCore Runtime handles fast consensus decisions for routine cases while supporting extended reasoning when supervisors investigate complex disagreements
- AgentCore Observability provides visibility into which learnings drive impact, helping teams refine knowledge propagation and maintain reliability at scale
The implications extend beyond catalog management. High-volume AI applications can benefit from this approach, and the ability of Amazon Bedrock to access diverse models makes this architecture straightforward to implement. The key insight is this: we've shifted from asking "which model should we use?" to "how do we build systems that learn our specific patterns?" Whether you learn-then-deploy for new use cases or deploy-and-learn for established ones, the implementation is straightforward: start with workers suited to your task, choose a supervisor, and let disagreements drive learning. With the right architecture, every inference can become an opportunity to capture domain knowledge. That's not just scaling; that's building institutional knowledge into your AI systems.
Acknowledgement
This work wouldn't have been possible without the contributions and support from Ankur Datta (Senior Principal Applied Scientist, science leader in Everyday Essentials Stores), Zhu Cheng (Applied Scientist), Xuan Tang (Software Engineer), and Mohammad Ghasemi (Applied Scientist). We sincerely appreciate their contributions to designs and implementations, numerous fruitful brainstorming sessions, and all the insightful ideas and suggestions.
About the authors
Tarik Arici is a Principal Scientist at Amazon Selection and Catalog Systems (ASCS), where he pioneers self-learning generative AI systems design for catalog quality enhancement at scale. His work focuses on building AI systems that automatically accumulate domain knowledge through production usage, learning from customer reviews and returns, seller feedback, and model disagreements to improve quality while reducing costs. Tarik holds a PhD in Electrical and Computer Engineering from Georgia Institute of Technology.
Sameer Thombare is a Senior Product Manager at Amazon with over a decade of experience in product management and category/P&L management across diverse industries, including heavy engineering, telecommunications, finance, and eCommerce. Sameer is passionate about developing continuously improving closed-loop systems and leads strategic initiatives within Amazon Selection and Catalog Systems (ASCS) to build an advanced self-learning closed-loop system that synthesizes signals from customers, sellers, and supply chain operations to optimize outcomes. Sameer holds an MBA from the Indian Institute of Management Bangalore and an engineering degree from Mumbai University.
Amin Banitalebi received his PhD in Digital Media from the University of British Columbia (UBC), Canada, in 2014. Since then, he has held various applied science roles spanning computer vision, natural language processing, recommendation systems, classical machine learning, and generative AI. Amin has co-authored over 90 publications and patents. He is currently an Applied Science Manager in Amazon Everyday Essentials.
Puneet Sahni is a Senior Principal Engineer at Amazon Selection and Catalog Systems (ASCS), where he has spent over 8 years improving the completeness, consistency, and correctness of catalog data. He focuses on catalog data modeling and its application to improving Selling Partner and customer experiences, while using ML/DL and LLM-based enrichment to drive improvements in catalog data quality.
Erdinc Basci joined Amazon in 2015 and brings over 23 years of technology industry experience. At Amazon, he has led the evolution of Catalog system architectures, including ingestion pipelines, prioritized processing, and traffic shaping, as well as catalog data architecture improvements such as segmented offers, product specifications for manufacture-on-demand products, and catalog data experimentation. Erdinc has championed a hands-on performance engineering culture across Amazon services, unlocking $1B+ in annualized cost savings and 20%+ latency wins across core Stores services. He is currently focused on improving generative AI application performance and GPU efficiency across Amazon. Erdinc holds a BS in Computer Science from Bilkent University, Turkey, and an MBA from Seattle University, US.
Mey Meenakshisundaram is a Director in Amazon Selection and Catalog Systems, where he leads innovative GenAI solutions to establish Amazon's worldwide catalog as the best-in-class source for product information. His team pioneers advanced machine learning techniques, including multi-agent systems and large language models, to automatically enrich product attributes and improve catalog quality at scale. High-quality product information in the catalog is critical for delighting customers in finding the right products, empowering selling partners to list their products effectively, and enabling Amazon operations to reduce manual effort.

