PDI Technologies is a global leader in the convenience retail and petroleum wholesale industries. They help businesses around the globe increase efficiency and profitability by securely connecting their data and operations. With 40 years of experience, PDI Technologies assists customers in all aspects of their business, from understanding consumer behavior to simplifying technology ecosystems across the supply chain.
Enterprises face a significant challenge in making their knowledge bases accessible, searchable, and usable by AI systems. Internal teams at PDI Technologies were struggling with knowledge scattered across disparate systems, including websites, Confluence pages, SharePoint sites, and various other data sources. To address this, PDI Technologies built PDI Intelligence Query (PDIQ), an AI assistant that gives employees access to company knowledge through an easy-to-use chat interface. The solution is powered by a custom Retrieval Augmented Generation (RAG) system built on Amazon Web Services (AWS) using serverless technologies. Building PDIQ required addressing the following key challenges:
- Automatically extracting content from diverse sources with different authentication requirements
- Needing the flexibility to select, apply, and interchange the most suitable large language model (LLM) for different processing requirements
- Processing and indexing content for semantic search and contextual retrieval
- Creating a knowledge foundation that enables accurate, relevant AI responses
- Continuously refreshing information through scheduled crawling
- Supporting enterprise-specific context in AI interactions
In this post, we walk through the PDIQ process flow and architecture, focusing on the implementation details and the business outcomes it has helped PDI achieve.
Solution architecture
In this section, we explore PDIQ's complete end-to-end design. We examine the data ingestion pipeline from initial processing through storage to user search capabilities, as well as the zero-trust security framework that protects key user personas throughout their platform interactions. The architecture consists of these components:
- Scheduler – Amazon EventBridge maintains and executes the crawler schedules.
- Crawlers – AWS Lambda invokes crawlers, which are executed as tasks by Amazon Elastic Container Service (Amazon ECS).
- Amazon DynamoDB – Persists crawler configurations and other metadata, such as Amazon Simple Storage Service (Amazon S3) image locations and captions.
- Amazon S3 – All source documents are stored in Amazon S3. Amazon S3 events trigger the downstream flow for every object that is created or deleted.
- Amazon Simple Notification Service (Amazon SNS) – Receives notifications from Amazon S3 events.
- Amazon Simple Queue Service (Amazon SQS) – Subscribed to Amazon SNS to hold the incoming requests in a queue.
- AWS Lambda – Handles the business logic for chunking, summarizing, and generating vector embeddings (a handler sketch follows this list).
- Amazon Bedrock – Provides API access to the foundation models (FMs) used by PDIQ, including Amazon Nova Lite, Amazon Nova Pro, and Amazon Titan Text Embeddings V2.
- Amazon Aurora PostgreSQL-Compatible Edition – Stores vector embeddings.
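To make the event-driven hand-off concrete, the following is a minimal sketch, assuming a Python Lambda function subscribed to the SQS queue; the helper names `process_document` and `delete_embeddings` are hypothetical stand-ins for PDIQ's internal logic.

```python
import json
import urllib.parse

def process_document(bucket: str, key: str) -> None:
    """Hypothetical stub for the chunk/summarize/embed pipeline."""

def delete_embeddings(bucket: str, key: str) -> None:
    """Hypothetical stub that removes a document's vectors from the store."""

def handler(event, context):
    """Lambda entry point; each SQS record wraps an SNS envelope that
    wraps the original S3 event notification."""
    for record in event["Records"]:
        sns_envelope = json.loads(record["body"])
        s3_event = json.loads(sns_envelope["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(s3_record["s3"]["object"]["key"])
            if s3_record["eventName"].startswith("ObjectCreated"):
                process_document(bucket, key)   # new or updated source object
            elif s3_record["eventName"].startswith("ObjectRemoved"):
                delete_embeddings(bucket, key)  # cleanup for deleted objects
```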
The following diagram shows the solution architecture.
Next, we review how PDIQ implements a zero-trust security model with role-based access control for two key personas:
- Administrators configure knowledge bases and crawlers through Amazon Cognito user groups integrated with enterprise single sign-on. Crawler credentials are encrypted at rest using AWS Key Management Service (AWS KMS) and are only accessible inside isolated execution environments (a credential-handling sketch follows this list).
- End users access knowledge bases based on group permissions validated at the application layer. Users can belong to multiple groups (such as human resources or compliance) and switch contexts to query role-appropriate datasets.
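As an illustration of the credential handling, the following is a minimal sketch using AWS KMS through boto3; the key alias and payload shape are hypothetical, not PDIQ's actual values.

```python
import json
import boto3

kms = boto3.client("kms")

def encrypt_credentials(credentials: dict) -> bytes:
    """Encrypt crawler credentials before persisting them (key alias is hypothetical)."""
    response = kms.encrypt(
        KeyId="alias/pdiq-crawler-credentials",
        Plaintext=json.dumps(credentials).encode("utf-8"),
    )
    return response["CiphertextBlob"]

def decrypt_credentials(ciphertext: bytes) -> dict:
    """Decrypt credentials inside the isolated crawler execution environment."""
    response = kms.decrypt(CiphertextBlob=ciphertext)
    return json.loads(response["Plaintext"])
```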
Process flow
In this section, we review the end-to-end process flow. We break it down into sections to dive deeper into each step and explain the functionality.

Crawlers
Crawlers are configured by administrators to collect data from the variety of sources that PDI relies on. Crawlers hydrate the data into the knowledge base so that this information can be retrieved by end users. PDIQ currently supports the following crawler configurations:
- Web crawler – By using Puppeteer for headless browser automation, the crawler converts HTML web pages to markdown format using turndown. By following the embedded links on a website, the crawler can capture full context and relationships between pages. Additionally, the crawler downloads assets such as PDFs and images while preserving the original reference, and it offers users configuration options such as rate limiting.
- Confluence crawler – This crawler uses the Confluence REST API with authenticated access to extract page content, attachments, and embedded images. It preserves page hierarchy and relationships and handles special Confluence elements such as info containers, notes, and many more.
- Azure DevOps crawler – PDI uses Azure DevOps to manage its code base, track commits, and maintain project documentation in a centralized repository. PDIQ uses the Azure DevOps REST API with OAuth or personal access token (PAT) authentication to extract this information. The Azure DevOps crawler preserves project hierarchy, sprint relationships, and backlog structure, and it also maps work item relationships (such as parent/child or linked items), thereby providing a complete view of the dataset.
- SharePoint crawler – It uses the Microsoft Graph API with OAuth authentication to extract document libraries, lists, pages, and file content. The crawler processes MS Office documents (Word, Excel, PowerPoint) into searchable text and maintains document version history and permission metadata.
By building separate crawler configurations, PDIQ makes the platform easily extensible so that additional crawlers can be configured on demand. It also offers administrator users the flexibility to configure the settings for their respective crawlers (such as frequency, depth, or rate limits).
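The following is a minimal sketch of persisting one such crawler configuration to DynamoDB; the table name, attribute names, and values are hypothetical illustrations rather than PDIQ's actual schema.

```python
import boto3

# Hypothetical table that holds one item per configured crawler
table = boto3.resource("dynamodb").Table("pdiq-crawler-config")

table.put_item(
    Item={
        "crawlerId": "confluence-engineering-wiki",
        "crawlerType": "confluence",
        "knowledgeBaseId": "kb-support",
        "baseUrl": "https://example.atlassian.net/wiki",
        "schedule": "rate(1 day)",          # consumed by the EventBridge scheduler
        "maxDepth": 5,                      # how far to follow page links
        "rateLimitPerSecond": 2,            # throttle requests to the source system
        "credentialsCiphertext": "a1b2...", # KMS-encrypted secret, never plaintext
    }
)
```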
The following figure shows the PDIQ UI for configuring a knowledge base.

The following figure shows the PDIQ UI for configuring a crawler (such as Confluence).

The following figure shows the PDIQ UI for scheduling crawlers.

Handling images
Crawled data is stored in Amazon S3 with proper metadata tags. If the source is in HTML format, the task converts the content into markdown (.md) files. For these markdown files, an additional optimization step replaces the images in the document with their Amazon S3 reference locations. Key benefits of this approach include:
- PDI can use S3 object keys to uniquely reference each image, thereby optimizing the synchronization process to detect changes in source data
- Storage is optimized by replacing images with captions, avoiding the need to store duplicate images
- The content of the images becomes searchable and relatable to the text content in the document
- Original images can be seamlessly injected when rendering a response to a user inquiry
The following is a sample markdown file where images are replaced with the S3 file location:
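A minimal illustration of the format; the document text, bucket name, and object key are hypothetical.

```markdown
# Configuring the POS Gateway

Follow the steps below to register a new store.

![Store registration screen](s3://pdiq-knowledge-base/images/pos-gateway/store-registration.png)

After saving, the store appears in the dashboard.
```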
Document processing
This is the most critical step of the process. The key objective of this step is to generate vector embeddings so that they can be used for similarity matching and effective retrieval based on user inquiries. The process follows several steps, starting with image captioning, then document chunking, summary generation, and embedding generation. To caption the images, PDIQ scans the markdown files to locate image tags.
The following is an example of an image caption prompt:
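The following is a minimal sketch, assuming Amazon Nova Lite through the Amazon Bedrock Converse API; the prompt wording is a hypothetical illustration, not PDIQ's production prompt.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical caption prompt
CAPTION_PROMPT = (
    "You are generating alt text for enterprise documentation. Describe the "
    "image in two or three sentences, naming any visible UI elements, labels, "
    "and data so the caption is useful for text search."
)

def caption_image(image_bytes: bytes, image_format: str = "png") -> str:
    """Ask the model to caption one image extracted from a markdown file."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"text": CAPTION_PROMPT},
                {"image": {"format": image_format, "source": {"bytes": image_bytes}}},
            ],
        }],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```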
The following is a snippet of a markdown file that contains the image tag, the LLM-generated caption, and the corresponding S3 file location:
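An illustration of the resulting format; the caption text, bucket, and key are hypothetical.

```markdown
![Store registration screen showing the Store ID, Region, and Timezone fields, with a Save button in the lower right](s3://pdiq-knowledge-base/images/pos-gateway/store-registration.png)
```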
Now that the markdown files are injected with image captions, the next step is to break the original document into chunks that fit into the context window of the embeddings model. PDIQ uses the Amazon Titan Text Embeddings V2 model to generate vectors and stores them in Aurora PostgreSQL-Compatible Serverless. Based on internal accuracy testing and chunking best practices from AWS, PDIQ performs chunking as follows:
- 70% of the tokens for content
- 10% overlap between chunks
- 20% for summary tokens
Using the document chunking logic from the previous step, the document is converted into vector embeddings. The process, illustrated in the sketch after this list, consists of:
- Calculate chunk parameters – Determine the size and total number of chunks required for the document based on the 70% calculation.
- Generate document summary – Use Amazon Nova Lite to create a summary of the entire document, constrained by the 20% token allocation. This summary is reused across all chunks to provide consistent context.
- Chunk and prepend summary – Split the document into overlapping chunks (10%), with the summary prepended at the top.
- Generate embeddings – Use Amazon Titan Text Embeddings V2 to generate vector embeddings for each chunk (summary plus content), which are then stored in the vector store.
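The following is a minimal sketch of these four steps, assuming the Bedrock Converse and InvokeModel APIs; the 8,192-token window, the whitespace tokenizer, and the summarization prompt are simplifying assumptions, not PDIQ's exact implementation.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Assumed budget; Titan Text Embeddings V2 accepts up to 8,192 input tokens
WINDOW = 8192
CONTENT_TOKENS = int(WINDOW * 0.70)   # 70% of the window for chunk content
OVERLAP_TOKENS = int(WINDOW * 0.10)   # 10% overlap between consecutive chunks
SUMMARY_TOKENS = int(WINDOW * 0.20)   # 20% reserved for the shared summary

def summarize_document(text: str) -> str:
    """Step 2: one whole-document summary, reused across all chunks."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": f"Summarize this document:\n\n{text}"}]}],
        inferenceConfig={"maxTokens": SUMMARY_TOKENS},
    )
    return response["output"]["message"]["content"][0]["text"]

def embed(text: str) -> list[float]:
    """Step 4: vector embedding for one summary-plus-content chunk."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024, "normalize": True}),
    )
    return json.loads(response["body"].read())["embedding"]

def process(document: str) -> list[tuple[str, list[float]]]:
    summary = summarize_document(document)
    words = document.split()                   # crude whitespace "tokens" for illustration
    stride = CONTENT_TOKENS - OVERLAP_TOKENS   # step 1: chunk size and count
    chunks = []
    for start in range(0, len(words), stride):
        content = " ".join(words[start:start + CONTENT_TOKENS])
        chunk = f"{summary}\n\n{content}"      # step 3: summary prepended at the top
        chunks.append((chunk, embed(chunk)))
    return chunks
```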
By designing a customized approach that places a summary section atop all chunks, PDIQ ensures that when a specific chunk is matched through similarity search, the LLM has access to the summary of the entire document and not only the chunk that matched. This approach enriches the end user experience, increasing the accuracy approval rate from 60% to 79%.
The following is an example of a summarization prompt:
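A hypothetical illustration of such a prompt; the wording is an assumption, not PDIQ's production prompt.

```
Summarize the following document in at most {summary_tokens} tokens. Preserve
product names, version numbers, and procedural steps, because this summary is
prepended to every chunk of the document to provide context during retrieval.

Document:
{document_text}
```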
The following is an example of the summary text, available on each chunk:
Chunk 1 has the summary at the top, followed by details from the source:
Chunk 2 has the summary at the top, followed by a continuation of details from the source:
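A schematic illustration of that layout; all text is hypothetical.

```
[Summary] This guide explains how to register a new store in the POS Gateway,
covering prerequisites, the registration form, and verification.

[Chunk 1] Prerequisites: an administrator account and a store ID are required...

[Summary] This guide explains how to register a new store in the POS Gateway,
covering prerequisites, the registration form, and verification.

[Chunk 2] ...after saving, verify that the store appears in the dashboard...
```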
PDIQ scans each document chunk and generates vector embeddings. This data is stored in an Aurora PostgreSQL database with key attributes, including a unique knowledge base ID, the corresponding embeddings attribute, the original text (summary plus chunk plus image caption), and a JSON binary object that includes metadata fields for extensibility. To keep the knowledge base in sync, PDI implements the following steps, with a storage sketch after the list:
- Add – These are net new source objects that need to be ingested. PDIQ implements the document processing flow described previously.
- Update – If PDIQ determines the same object is present, it compares the hash value from the source with the hash value stored in the JSON metadata object and reprocesses the document when they differ.
- Delete – If PDIQ determines that a specific source document no longer exists, it triggers a delete operation on the S3 bucket (s3:ObjectRemoved:*), which results in a cleanup job that deletes the records corresponding to the key value in the Aurora table.
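The following is a minimal sketch of the insert and cleanup against Aurora PostgreSQL with the pgvector extension, using psycopg2; the table name, columns, and connection string are hypothetical illustrations of the schema described above.

```python
import json
import psycopg2

# Placeholder connection string; real credentials come from configuration
conn = psycopg2.connect("host=... dbname=pdiq user=... password=...")

def store_chunk(kb_id: str, s3_key: str, text: str,
                embedding: list[float], meta: dict) -> None:
    """Persist one chunk: knowledge base ID, original text, vector, and JSONB metadata."""
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO kb_chunks (kb_id, s3_key, chunk_text, embedding, metadata)
            VALUES (%s, %s, %s, %s::vector, %s::jsonb)
            """,
            (kb_id, s3_key, text, vector_literal, json.dumps(meta)),
        )
    conn.commit()

def delete_document(kb_id: str, s3_key: str) -> None:
    """Cleanup job triggered by s3:ObjectRemoved:* for a source document."""
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM kb_chunks WHERE kb_id = %s AND s3_key = %s",
            (kb_id, s3_key),
        )
    conn.commit()
```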
PDI uses Amazon Nova Pro to retrieve the most relevant documents and generate a response by following these key steps (a retrieval sketch follows the list):
- Using similarity search, retrieve the most relevant document chunks, which include the summary, chunk data, image caption, and image link.
- For the matching chunk, retrieve the entire document.
- The LLM then replaces the image link with the actual image from Amazon S3.
- The LLM generates a response based on the retrieved data and the preconfigured system prompt.
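The following is a minimal sketch of the similarity search and answer generation, reusing the hypothetical embed helper, bedrock client, and kb_chunks table from the earlier sketches and assuming Amazon Nova Pro through the Converse API.

```python
def answer(kb_id: str, question: str, system_prompt: str) -> str:
    """Retrieve the top chunks by cosine distance and generate a grounded answer."""
    qvec = "[" + ",".join(str(x) for x in embed(question)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_text, metadata
            FROM kb_chunks
            WHERE kb_id = %s
            ORDER BY embedding <=> %s::vector   -- pgvector cosine distance
            LIMIT 5
            """,
            (kb_id, qvec),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",
        system=[{"text": system_prompt}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```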
The following is a snippet of the system prompt:
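A hypothetical illustration of such a system prompt; the wording is an assumption, not PDIQ's production prompt.

```
You are PDIQ, an internal assistant for PDI Technologies employees. Answer
only from the provided context. If the context does not contain the answer,
say that you do not know. When a chunk references an image by its S3
location, include that reference so the application can render the original
image inline.
```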
Results and next steps
By building this customized RAG solution on AWS, PDI realized the following benefits:
- Flexible configuration options allow data ingestion at user-preferred frequencies.
- Scalable design enables future ingestion from additional source systems through easily configurable crawlers.
- Crawler configuration supports multiple authentication methods, including username and password, secret key-value pairs, and API keys.
- Customizable metadata fields enable advanced filtering and improve query performance.
- Dynamic token management helps PDI intelligently balance tokens between content and summaries, improving user responses.
- Diverse source data formats are consolidated into a unified architecture for streamlined storage and retrieval.
PDIQ delivers key business outcomes, including:
- Improved efficiency and resolution rates – The tool empowers PDI support teams to resolve customer queries significantly faster, often automating routine issues and providing immediate, precise responses. This has led to shorter customer waits for case resolution and more productive agents.
- High customer satisfaction and loyalty – By delivering accurate, relevant, and personalized answers grounded in live documentation and company knowledge, PDIQ increased customer satisfaction (CSAT) scores, net promoter scores (NPS), and overall loyalty. Customers feel heard and supported, strengthening PDI brand relationships.
- Cost reduction – PDIQ handles the bulk of repetitive queries, allowing limited support staff to focus on expert-level cases, which improves productivity and morale. Additionally, PDIQ is built on a serverless architecture, which automatically scales while minimizing operational overhead and cost.
- Business flexibility – A single platform can serve different business units, which can curate content by configuring their respective data sources.
- Incremental value – Each new content source adds measurable value without system redesign.
PDI continues to enhance the application with several planned improvements in the pipeline, including:
- Building additional crawler configurations for new data sources (for example, GitHub).
- Building an agentic implementation so PDIQ can be integrated into larger, complex business processes.
- Enhanced document understanding with table extraction and structure preservation.
- Multilingual support for global operations.
- Improved relevance ranking with hybrid retrieval strategies.
- The ability to invoke PDIQ based on events (for example, source commits).
Conclusion
The PDIQ service has transformed how users access and use enterprise knowledge at PDI Technologies. By using AWS serverless services, PDIQ can automatically scale with demand, reduce operational overhead, and optimize costs. The solution's unique approach to document processing, including dynamic token management and the custom image captioning system, represents significant technical innovation in enterprise RAG systems. The architecture successfully balances performance, cost, and scalability while maintaining security and authentication requirements. As PDI Technologies continues to expand PDIQ's capabilities, they're excited to see how this architecture can adapt to new sources, formats, and use cases.
About the authors
Samit Kumbhani is an Amazon Web Services (AWS) Senior Solutions Architect in the New York City area with over 18 years of experience. He currently partners with independent software vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.
Jhorlin De Armas is an Architect II at PDI Technologies, where he leads the design of AI-driven platforms on Amazon Web Services (AWS). Since joining PDI in 2024, he has architected a compositional AI service that enables configurable assistants, agents, knowledge bases, and guardrails using Amazon Bedrock, Aurora Serverless, AWS Lambda, and DynamoDB. With over 18 years of experience building enterprise software, Jhorlin specializes in cloud-centered architectures, serverless platforms, and AI/ML solutions.
David Mbonu is a Sr. Solutions Architect at Amazon Web Services (AWS), helping horizontal business application ISV customers build and deploy transformational solutions on AWS. David has over 27 years of experience in enterprise solutions architecture and systems engineering across software, FinTech, and public cloud companies. His recent interests include AI/ML, data strategy, observability, resiliency, and security. David and his family reside in Sugar Hill, GA.

