This publish was co-written with Saurabh Gupta and Todd Colby from Pushpay.
Pushpay is a market-leading digital giving and engagement platform designed to assist church buildings and faith-based organizations drive neighborhood engagement, handle donations, and strengthen generosity fundraising processes effectively. Pushpay’s church administration system gives church directors and ministry leaders with insight-driven reporting, donor improvement dashboards, and automation of economic workflows.
Utilizing the facility of generative AI, Pushpay developed an progressive agentic AI search function constructed for the distinctive wants of ministries. The strategy makes use of pure language processing so ministry workers can ask questions in plain English and generate real-time, actionable insights from their neighborhood information. The AI search function addresses a important problem confronted by ministry leaders: the necessity for fast entry to neighborhood insights with out requiring technical experience. For instance, ministry leaders can enter “present me people who find themselves members in a bunch, however haven’t given this yr” or “present me people who find themselves not engaged in my church,” and use the outcomes to take significant motion to higher help people of their neighborhood. Most neighborhood leaders are time-constrained and lack technical backgrounds; they’ll use this resolution to acquire significant information about their congregations in seconds utilizing pure language queries.
By empowering ministry workers with quicker entry to neighborhood insights, the AI search function helps Pushpay’s mission to encourage generosity and connection between church buildings and their neighborhood members. Early adoption customers report that this resolution has shortened their time to insights from minutes to seconds. To realize this end result, the Pushpay group constructed the function utilizing agentic AI capabilities on Amazon Net Providers (AWS) whereas implementing strong high quality assurance measures and establishing a fast iterative suggestions loop for steady enhancements.
On this publish, we stroll you thru Pushpay’s journey in constructing this resolution and discover how Pushpay used Amazon Bedrock to create a customized generative AI analysis framework for steady high quality assurance and establishing fast iteration suggestions loops on AWS.
Resolution overview: AI powered search structure
The answer consists of a number of key parts that work collectively to ship an enhanced search expertise. The next determine exhibits the answer structure diagram and the general workflow.
Determine 1: AI Search Resolution Structure
- Person interface layer: The answer begins with Pushpay customers submitting pure language queries via the prevailing Pushpay software interface. By utilizing pure language queries, church ministry workers can receive information insights utilizing AI capabilities with out studying new instruments or interfaces.
- AI search agent: On the coronary heart of the system lies the AI search agent, which consists of two key parts:
- System immediate: Accommodates the big language mannequin (LLM) position definitions, directions, and software descriptions that information the agent’s conduct.
- Dynamic immediate constructor (DPC): routinely constructs extra personalized system prompts based mostly on the person particular data, corresponding to church context, pattern queries, and software filter stock. Additionally they use semantic search to pick out solely related filters amongst tons of of accessible software filters. The DPC improves response accuracy and person expertise.
- Amazon Bedrock superior function: The answer makes use of the next Amazon Bedrock managed companies:
- Immediate caching: Reduces latency and prices by caching ceaselessly used system immediate.
- LLM processing: Makes use of Claude Sonnet 4.5 to course of prompts and generate JSON output required by the appliance to show the specified question outcomes as insights to customers.
- Analysis system: The analysis system implements a closed-loop enchancment resolution the place person interactions are instrumented, captured and evaluated offline. The analysis outcomes feed right into a dashboard for product and engineering groups to investigate and drive iterative enhancements to the AI search agent. Throughout this course of, the info science group collects a golden dataset and constantly curates this dataset based mostly on the precise person queries coupled with validated responses.
The challenges of preliminary resolution with out analysis
To create the AI search function, Pushpay developed the primary iteration of the AI search agent. The answer implements a single agent configured with a rigorously tuned system immediate that features the system position, directions, and the way the person interface works with detailed clarification of every filter device and their sub-settings. The system immediate is cached utilizing Amazon Bedrock immediate caching to cut back token price and latency. The agent makes use of the system immediate to invoke an Amazon Bedrock LLM which generates the JSON doc that Pushpay’s software makes use of to use filters and current question outcomes to customers.
Nevertheless, this primary iteration shortly revealed some limitations. Whereas it demonstrated a 60-70% success charge with fundamental enterprise queries, the group reached an accuracy plateau. The analysis of the agent was a handbook and tedious course of Tuning the system immediate past this accuracy threshold proved difficult given the varied spectrum of person queries and the appliance’s protection of over 100 distinct configurable filters. These offered important blockers for the group’s path to manufacturing.

Determine 2: AI Search First Resolution
Enhancing the answer by including a customized generative AI analysis framework
To handle the challenges of measuring and bettering agent accuracy, the group carried out a generative AI analysis framework built-in into the prevailing structure, proven within the following determine. This framework consists of 4 key parts that work collectively to supply complete efficiency insights and allow data-driven enhancements.

Determine 3: Introducing the GenAI Analysis Framework
- The golden dataset: A curated golden dataset containing over 300 consultant queries, every paired with its corresponding anticipated output, types the muse of automated analysis. The product and information science groups rigorously developed and validated this dataset to realize complete protection of real-world use instances and edge instances. Moreover, there’s a steady curation strategy of including consultant precise person queries with validated outcomes.
- The evaluator: The evaluator element processes person enter queries and compares the agent-generated output towards the golden dataset utilizing the LLM as a choose sample This strategy generates core accuracy metrics whereas capturing detailed logs and efficiency information, corresponding to latency, for additional evaluation and debugging.
- Area class: Area classes are developed utilizing a mixture of generative AI area summarization and human-defined common expressions to successfully categorize person queries. The evaluator determines the area class for every question, enabling nuanced, category-based analysis as a further dimension of analysis metrics.
- Generative AI analysis dashboard: The dashboard serves because the mission management for Pushpay’s product and engineering groups, displaying area category-level metrics to evaluate efficiency and latency and information choices. It shifts the group from single mixture scores to nuanced, domain-based efficiency insights.
The accuracy dashboard: Pinpointing weaknesses by area
As a result of person queries are categorized into area classes, the dashboard incorporates statistical confidence visualization utilizing a 95% Wilson rating interval to show accuracy metrics and question volumes at every area stage. By utilizing classes, the group can pinpoint the AI agent’s weaknesses by area. Within the following instance , the “exercise” area exhibits considerably decrease accuracy than different classes.

Determine 4: Pinpointing Agent Weaknesses by Area
Moreover, a efficiency dashboard, proven within the following determine, visualizes latency indicators on the area class stage, together with latency distributions from p50 to p90 percentiles. Within the following instance, the exercise area reveals notably increased latency than others.

Determine 5: Figuring out Latency Bottlenecks by Area
Strategic rollout via domain-Degree insights
Area-based metrics revealed various efficiency ranges throughout semantic domains, offering essential insights into agent effectiveness. Pushpay used this granular visibility to make strategic function rollout choices. By quickly suppressing underperforming classes—corresponding to exercise queries—whereas present process optimization, the system achieved 95% general accuracy. By utilizing this strategy, customers skilled solely the highest-performing options whereas the group refined others to manufacturing requirements.

Determine 6: Reaching 95% Accuracy with Area-Degree Characteristic Rollout
Strategic prioritization: Specializing in high-impact domains
To prioritize enhancements systematically, Pushpay employed a 2×2 matrix framework plotting subjects towards two dimensions (proven within the following determine): Enterprise precedence (vertical axis) and present efficiency or feasibility (horizontal axis). This visualization positioned subjects with each excessive enterprise worth and robust current efficiency within the top-right quadrant. The group then targeted on these areas as a result of they required much less heavy lifting to realize additional accuracy enchancment from already-good ranges to an distinctive 95% accuracy for the enterprise targeted subjects.
The implementation adopted an iterative cycle: after every spherical of enhancements, they re-analyze the outcomes to establish the subsequent set of high-potential subjects. This systematic, cyclical strategy enabled steady optimization whereas sustaining give attention to business-critical areas.

Determine 7: Strategic Prioritization Framework for Area Class Optimization
Dynamic immediate development
The insights gained from the analysis framework led to an architectural enhancement: the introduction of a dynamic immediate constructor. This element enabled fast iterative enhancements by permitting fine-grained management over which area classes the agent might deal with. The structured discipline stock – beforehand embedded within the system immediate – was reworked right into a dynamic component, utilizing semantic search to assemble contextually related prompts for every person question. This strategy tailors the immediate filter stock based mostly on three key contextual dimensions: question content material, person persona, and tenant-specific necessities. The result’s a extra exact and environment friendly system that generates extremely related responses whereas sustaining the flexibleness wanted for steady optimization.
Enterprise influence
The generative AI analysis framework turned the cornerstone of Pushpay’s AI function improvement, delivering measurable worth throughout three dimensions:
- Person expertise: The AI search function diminished time-to-insight from roughly 120 seconds (skilled customers manually navigating complicated UX) to below 4 seconds – a 15-fold acceleration that immediately helps improve ministry leaders’ productiveness and decision-making pace. This function democratized information insights, in order that customers of various technical ranges can entry significant intelligence with out requiring specialised experience.
- Growth velocity: The scientific analysis strategy reworked optimization cycles. Reasonably than debating immediate modifications, the group now validates modifications and measures domain-specific impacts inside minutes, changing extended deliberations with data-driven iteration.
- Manufacturing readiness: Enhancements from 60–70% accuracy to greater than 95% accuracy utilizing high-performance domains offered the quantitative confidence required for customer-facing deployment, whereas the framework’s structure permits steady refinement throughout different area classes.
Key takeaways on your AI agent journey
The next are key takeaways from Pushpay’s expertise that you need to use in your individual AI agent journey.
1/ Construct with manufacturing in thoughts from day one
Constructing agentic AI methods is simple, however scaling them to manufacturing is difficult. Builders ought to undertake a scaling mindset throughout the proof-of-concept section, not after. Implementing strong tracing and analysis frameworks early, gives a transparent pathway from experimentation to manufacturing. By utilizing this technique, groups can establish and deal with accuracy points systematically earlier than they turn into blockers.
2/ Make the most of the superior options of Amazon Bedrock
Amazon Bedrock immediate caching considerably reduces token prices and latency by caching ceaselessly used system prompts. For brokers with massive, steady system prompts, this function is crucial for production-grade efficiency.
3/ Suppose past mixture metrics
Mixture accuracy scores can generally masks important efficiency variations. By evaluating agent efficiency on the area class stage, Pushpay uncovered weaknesses past what a single accuracy metric can seize. This granular strategy permits focused optimization and knowledgeable rollout choices, ensuring customers solely expertise high-performing options whereas others are refined.
4/ Information safety and accountable AI
When growing agentic AI methods, contemplate data safety and LLM safety issues from the outset, following the AWS Shared Accountability Mannequin, as a result of safety necessities basically influence the architectural design. Pushpay’s clients are church buildings and faith-based organizations who’re stewards of delicate data—together with pastoral care conversations, monetary giving patterns, household struggles, prayer requests and extra. On this implementation instance, Pushpay set a transparent strategy to incorporating AI ethically inside its product ecosystem, sustaining strict safety requirements to make sure church information and personally identifiable data (PII) stays inside its safe partnership ecosystem. Information is shared solely with safe and applicable information protections utilized and isn’t used to coach exterior fashions. To be taught extra about Pushpay’s requirements for incorporating AI inside their merchandise, go to the Pushpay Information Middle for a extra in-depth evaluate of firm requirements.
Conclusion: Your Path to Manufacturing-Prepared AI Brokers
Pushpay’s journey from a 60–70% accuracy prototype to a 95% correct production-ready AI agent demonstrates that constructing dependable agentic AI methods requires extra than simply subtle prompts—it calls for a scientific, data-driven strategy to analysis and optimization. The important thing breakthrough wasn’t within the AI know-how itself, however in implementing a complete analysis framework constructed on sturdy observability basis that offered granular visibility into agent efficiency throughout completely different domains. This systematic strategy enabled fast iteration, strategic rollout choices, and steady enchancment.
Able to construct your individual production-ready AI agent?
- Discover Amazon Bedrock: Start constructing your agent with Amazon Bedrock
- Implement LLM-as-a-judge: Create your individual analysis system utilizing the patterns described on this LLM-as-a-judge on Amazon Bedrock Mannequin Analysis
- Construct your golden dataset: Begin curating consultant queries and anticipated outputs on your particular use case
Concerning the authors
Roger Wang is a Senior Resolution Architect at AWS. He’s a seasoned architect with over 20 years of expertise within the software program trade. He helps New Zealand and international software program and SaaS firms use cutting-edge know-how at AWS to unravel complicated enterprise challenges. Roger is captivated with bridging the hole between enterprise drivers and technological capabilities and thrives on facilitating conversations that drive impactful outcomes.
Melanie Li, PhD, is a Senior Generative AI Specialist Options Architect at AWS based mostly in Sydney, Australia, the place her focus is on working with clients to construct options leveraging state-of-the-art AI and machine studying instruments. She has been actively concerned in a number of Generative AI initiatives throughout APJ, harnessing the facility of Giant Language Fashions (LLMs). Previous to becoming a member of AWS, Dr. Li held information science roles within the monetary and retail industries.
Frank Huang, PhD, is a Senior Analytics Specialist Options Architect at AWS based mostly in Auckland, New Zealand. He focuses on serving to clients ship superior analytics and AI/ML options. All through his profession, Frank has labored throughout a wide range of industries corresponding to monetary companies, Web3, hospitality, media and leisure, and telecommunications. Frank is raring to make use of his deep experience in cloud structure, AIOps, and end-to-end resolution supply to assist clients obtain tangible enterprise outcomes with the facility of knowledge and AI.
Saurabh Gupta is an information science and AI skilled at Pushpay based mostly in Auckland, New Zealand, the place he focuses on implementing sensible AI options and statistical modeling. He has in depth expertise in machine studying, information science, and Python for information science purposes, with specialised expertise coaching in database brokers and AI implementation. Previous to his present position, he gained expertise in telecom, retail and monetary companies, growing experience in advertising analytics and buyer retention packages. He has a Grasp’s in Statistics from College of Auckland and a Grasp’s in Enterprise Administration from the Indian Institute of Administration, Calcutta.
Todd Colby is a Senior Software program Engineer at Pushpay based mostly in Seattle. His experience is targeted on evolving complicated legacy purposes with AI, and translating person wants into structured, high-accuracy options. He leverages AI to extend supply velocity and produce leading edge metrics and enterprise resolution instruments.

