    UK Tech Insider
    Machine Learning & Research

    Shreya Shankar on AI for Enterprise Data Processing – O’Reilly

    By Oliver Chambers | June 4, 2025 | 9 min read


    Generative AI in the Real World: Shreya Shankar on AI for Enterprise Data Processing



    [Audio: 30m 14s]


    Companies have a lot of data, but most of that data is unstructured textual data: reports, catalogs, emails, notes, and much more. Without structure, business analysts can’t make sense of the data; there’s value in the data, but it can’t be put to use. AI can be a tool for finding and extracting the structure that’s hidden in textual data. In this episode, Ben and Shreya talk about a new generation of tooling that brings AI to enterprise data processing.

    Check out other episodes of this podcast on the O’Reilly learning platform.

    About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

    Points of Interest

    • 0:00: Introduction to Shreya Shankar.
    • 0:18: One of the themes of your work is a particular kind of data processing. Before we go into tools, what’s the problem you’re trying to address?
    • 0:52: For decades, organizations have been struggling to make sense of unstructured data. There’s a huge amount of text that people make sense of. We didn’t have the technology to do this until LLMs came around.
    • 1:38: I’ve spent the last couple of years building a processing framework for people to manipulate unstructured data with LLMs. How do we extract semantic data?
    • 1:55: The prior art would be using NLP libraries and doing bespoke tasks?
    • 2:12: We’ve seen two flavors of approach: bespoke code and crowdsourcing. People still do both. But now LLMs can simplify the process.
    • 2:45: The typical task is “I have a large collection of unstructured text and I want to extract as much structure as possible.” An extreme would be a knowledge graph; in the middle would be the things that NLP people do. Your data pipelines are designed to do this using LLMs.
    • 3:22: Broadly, the tasks are thematic extraction: I want to extract themes from documents. You can program LLMs to find themes. You want some user guidance for what a theme is, then use the LLM for grouping.
    • 4:04: One of the tools you built is DocETL. What’s the typical workflow?
    • 4:19: The idea is to write MapReduce pipelines, where map extracts insights and group does aggregation. Doing this with LLMs means that the map is described by an LLM prompt. Maybe the prompt is “Extract all the pain points and any relevant quotes.” Then you can imagine flattening this across all the documents, grouping them by the pain points, and another LLM can do the summary to produce a report. DocETL exposes these data processing primitives and orchestrates them to scale up and across task complexity.
    • 5:52: What if you want to extract 50 things from a map operation? You shouldn’t ask an LLM to do 50 things at once. You should group them and decompose them into subtasks. DocETL does some optimizations to do this.
    • 6:18: The user could be a noncoder and might not be working on the entire pipeline.
    • 7:00: People do that a lot; they might just write a single map operation.
    • 7:16: But the end user you have in mind doesn’t even know the terms “map” and “filter.”
    • 7:22: That’s the point. Right now, people still have to learn data processing primitives.
    • 7:49: These LLMs are probabilistic; do you also set the expectation with the user that you might get different results every time you run the pipeline?
    • 8:16: There are two different types of tasks. One is where you want the LLM to be accurate and there’s a real ground truth, for example, entity extraction. The other kind is where you want to offload a creative process to the LLM, for example, “Tell me what’s interesting in this data.” They’ll run it until there are no new insights to be gleaned. When is nondeterminism a problem? How do you engineer systems around it?
    • 9:56: You might also have a data engineering team that uses this and turns PDF files into something like a data warehouse that people can query. In this setting, are you familiar with lakehouse architecture and the notion of the medallion architecture?
    • 10:49: People actually use DocETL to create a table out of PDFs and put it in a relational database. That’s the best way to think about how to move forward in the enterprise setting. I’ve also seen people using these tables in RAG or downstream LLM applications.
    • 11:31: I realize that this is a fast-moving space. To what extent can DocETL leverage other libraries like BAML? It’s a domain-specific language that turns prompts into instructions. And there are other things on the data extraction side, for example, getting data from images in PDF files. To what extent can DocETL leverage the best of breed?
    • 12:54: We have plug-ins, and operators as plug-ins. Users can write their own; community members have contributed different plug-ins. We’re thinking about native integrations with RAG.
    • 14:01: What are the most common data types?
    • 14:11: PDFs (some people will run OCR on PDFs), so unstructured text, transcripts, JSON-formatted logs. The nice thing is that so much data can be represented as a string.
    • 14:36: So your starting point is strings. So I can have MCP servers that suck data from Confluence and wikis, and you start from there.
    • 14:53: Our datasets are in JSON or CSV format. So imagine a CSV with one or two columns.
    • 15:03: Do you provide users of this tool with diagnostics or evaluation tools?
    • 15:14: This brings me to DocWrangler, which is a specialized IDE for writing DocETL pipelines. You get more observability, it’s easier to engineer prompts, and we have automated prompt writing and LLMs that edit prompts. It gets you from zero to a starting pipeline.
    • 16:00: People are now using things like expectations and assertions. Is there an equivalent?
    • 16:13: We have guardrails on LLM-powered operations: We can check for hallucination; we can use LLMs as guardrails or LLM-as-judge; we can loop on an operation if it doesn’t pass; we can also write pipelines that query an external data source and drop documents that don’t meet criteria.
    • 17:16: A separate thing we’re exploring is how to do this in teams.
    • 17:39: If the goal is to onboard noncoders, a lot of this work is going to be on the UX side.
    • 18:03: The DocWrangler project is all about what the right UX is. How do we leverage AI assistance as much as possible? The semantic data processing ecosystem is super new. The user has an intent that’s hard to express. There’s the semantic pipeline. And there’s the actual data: the documents. When you think about building UX, you have to optimize the interaction between all three. Where does AI help? Where does AI not help?
    • 20:06: Everything that we’ve discussed is in the context of a fast-moving foundation model world. Now we have reasoning models. How do you feel about reasoning models in the context of what you’re doing? They’re expensive and slower. What advice do you give users of DocETL?
    • 21:03: Reasoning is most helpful in bridging the understanding between the user and the initial pipeline that they write. A reasoning model can go from a crudely specified intent to a well-specified pipeline. The o1 model is better at this than GPT-4o. But if you already have a well-defined prompt, the reasoning model doesn’t give you much leverage.
    • 23:10: I would imagine that supervised fine-tuning would pay off for a pipeline. Are people using DocETL to generate data for fine-tuning LLMs?
    • 23:36: I haven’t seen people doing this, but I’m sure they are. People are writing DocETL pipelines with their own LLMs, but I’m not sure how they fine-tune them.
    • 24:09: I always use two or three LLMs and try to get a consensus. The LLM depends on your use case and your data, right?
    • 24:46: Absolutely. In our user studies, people say the same thing: The standard pipeline is to use OpenAI or Gemini for extraction, and Claude for content generation and aggregation. Some are using DeepSeek, but we ran the pilot before DeepSeek became popular. I’m sure it has risen.
    • 25:33: I think you boxed yourself in with the name DocETL; we’re seeing multimodal models. As models become more capable, you’ll move with the capabilities of the foundation model.
    • 26:05: When we first launched the project, a bunch of people said we should do multimodal: images, audio. But those questions just vanished. More people said “I just have text problems.” We’re in the gritty phases of real enterprise use cases, which are text wrangling problems.
    • 26:50: The default is to use text, but there’s a lot of nuance in these other modalities, especially video. So I have to ask about related projects.
    • 27:20: I just met the aryn.ai people at a conference. We all share the interest in doing semantic data processing. Many institutions have people building this kind of system. It’s interesting to see where we differ. DocETL has a single map operator; other systems have many map operators. So there are interesting implementation differences.
    • 28:58: You’re in Berkeley; tell me you’re using Ray.
    • 29:06: Everything runs on a single machine right now, but we will scale up with Ray. These LLMs are not cheap, though they’re getting cheaper. Gemini is really cheap.
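The map-flatten-group-summarize pattern Shreya describes at 4:19 can be sketched in plain Python. This is an illustrative sketch, not DocETL's actual API: `run_pipeline` and the `fake_*` stand-ins are hypothetical names, and the two LLM prompts are injected as functions so the orchestration logic runs without an API key.

```python
from collections import defaultdict
from typing import Callable

def run_pipeline(docs: list[str],
                 extract: Callable[[str], list[dict]],
                 summarize: Callable[[str, list[str]], str]) -> dict[str, str]:
    # Map: one LLM prompt per document, each returning a list of
    # {"pain_point": ..., "quote": ...} records; flatten across documents.
    items = [item for doc in docs for item in extract(doc)]
    # Group: bucket the supporting quotes by pain point.
    groups: dict[str, list[str]] = defaultdict(list)
    for item in items:
        groups[item["pain_point"]].append(item["quote"])
    # Reduce: a second LLM prompt summarizes each group into a report line.
    return {point: summarize(point, quotes) for point, quotes in groups.items()}

# Toy stand-ins for the two LLM prompts, so the sketch runs as-is.
def fake_extract(doc: str) -> list[dict]:
    return [{"pain_point": w, "quote": doc} for w in ("latency",) if w in doc]

def fake_summarize(point: str, quotes: list[str]) -> str:
    return f"{point}: {len(quotes)} mention(s)"

report = run_pipeline(["latency is bad", "ui is fine", "latency again"],
                      fake_extract, fake_summarize)
# report == {"latency": "latency: 2 mention(s)"}
```

Keeping the prompts as injectable functions is what lets an optimizer, like the one mentioned at 5:52, decompose an overloaded map prompt into subtasks without touching the orchestration.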
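The guardrail behavior described at 16:13 (check each LLM output, loop on the operation if it doesn't pass, and drop documents that never meet the criteria) can be sketched as a wrapper. All names here are hypothetical, not DocETL's interface, and the validator shown is a cheap string-grounding check standing in for a real LLM-as-judge call:

```python
def with_guardrail(operation, validate, max_retries=2):
    """Wrap an LLM-powered operation with a retry-and-drop guardrail."""
    def guarded(doc):
        for _ in range(max_retries + 1):
            out = operation(doc)
            if validate(doc, out):
                return out
        return None  # drop documents that never pass the check
    return guarded

# Toy operation: extract a 4-digit year, "hallucinating" 1999 if none exists.
def extract_year(doc):
    return next((t for t in doc.split() if t.isdigit() and len(t) == 4), "1999")

# Toy judge: the answer must appear verbatim in the source text
# (a simple hallucination check; a real judge might be another LLM call).
def grounded(doc, out):
    return out is not None and out in doc

guarded = with_guardrail(extract_year, grounded)
print(guarded("Founded in 2014 in Berkeley"))  # → 2014
print(guarded("No year mentioned here"))       # → None (dropped)
```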
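The consensus tactic Ben mentions at 24:09 (run the same prompt through two or three LLMs and keep an answer only when they agree) can be sketched as a majority vote; the model callables below are placeholders for whatever clients you actually use:

```python
from collections import Counter

def consensus(prompt, models, threshold=2):
    """Query several LLMs; keep an answer only if enough of them agree."""
    answers = [model(prompt) for model in models]
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= threshold else None

# Stand-ins for real clients (e.g. OpenAI, Gemini, Claude wrappers).
model_a = lambda p: "Berkeley"
model_b = lambda p: "Berkeley"
model_c = lambda p: "Stanford"

print(consensus("Where is DocETL developed?", [model_a, model_b, model_c]))
# → Berkeley
```

Exact-match voting only suits extraction-style tasks with a ground truth (the first task type from 8:16); creative outputs would need a softer agreement measure.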