Keep Deterministic Work Deterministic – O'Reilly

By Oliver Chambers | March 20, 2026 | 18 Mins Read


This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O'Reilly Radar.

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
—Tom Cargill, Bell Labs

One of the experiments I've been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses these strategy descriptions to decide how to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.

Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer's turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player's total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.

There's a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it's the "easy" one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.

Here's a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I'll wait.

Prompt 1: Track a running "score" through a 7-step game. Don't use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.

CRITICAL INSTRUCTION: You must respond with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Don't list the words you counted, don't explain your reasoning, and don't write any other text. Just the equation.

Start with a score of 10. I'll give you the first step in the next prompt.

Prompt 2: "The sudden blizzard chilled the small village communities." Add the number of words containing double letters (two of the exact same letter back-to-back, like 'tt' or 'mm').

Prompt 3: "The clever engineer needed seven perfect pieces of cheese." If your score is ODD, add the number of words that contain EXACTLY two 'e's. If your score is EVEN, subtract the number of words that contain EXACTLY two 'e's. (Don't count words with one, three, or zero 'e's).

Prompt 4: "The good sailor joined the eager crew aboard the wooden boat." If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like 'ea', 'oo', or 'oi'). If your score is 10 or less, multiply your score by this number.

Prompt 5: "The quick brown fox jumps over the lazy dog." Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).

Prompt 6: "Three brave kings stand under black skies." If your score is an ODD number, subtract the number of words that have exactly five letters. If your score is an EVEN number, multiply your score by the number of words that have exactly five letters.

Prompt 7: "Look down, you shy owl, go fly away." Subtract the number of words that contain NONE of these letters: a, e, or i.

Prompt 8: "Green apples fall from tall trees." If your score is greater than 15, subtract the number of words containing the letter 'a'. If your score is 15 or less, add the number of words containing the letter 'l'.

The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here's the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).

I ran this twice at the same time (using ChatGPT 5.3 Instant), and got two completely different wrong answers the first time I tried it. Neither run reached the correct score of 60:

Step | Correct | Run 1 (transcript) | Run 2 (transcript)
1. Double letters | 10 + 6 = 16 | 10 + 2 = 12 ❌ | 10 + 5 = 15 ❌
2. Exactly two 'e's | 16 − 4 = 12 | 12 − 4 = 8 ❌ | 15 + 4 = 19 ❌
3. Consecutive vowels | 12 − 7 = 5 | 8 × 7 = 56 ❌ | 19 − 5 = 14 ❌
4. Third letter vowel | 5 + 5 = 10 | 56 + 5 = 61 ❌ | 14 + 3 = 17 ❌
5. Exactly five letters | 10 × 7 = 70 | 61 − 7 = 54 ❌ | 17 − 4 = 13 ❌
6. No a, e, or i | 70 − 7 = 63 | 54 − 7 = 47 ❌ | 13 − 3 = 10 ❌
7. Words with 'a' or 'l' | 63 − 3 = 60 | 47 − 3 = 44 ❌ | 10 + 4 = 14 ❌

The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn't matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.

Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That's closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.

What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you'd have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. That's typical of any process where the output of one LLM call becomes the input for the next one.

These failures don't demonstrate the March of Nines itself; that's specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It's possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, which you can easily try out yourself, that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters within words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don't actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step's result determines the next step's operation, so a single miscount in Step 1 cascades through the entire sequence.

I also want to be clear about exactly what a deterministic version of this simulation looks like. Fortunately, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:

Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.

Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it's producing deterministic logic instead of trying to count characters through its tokenizer.
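For reference, here's a minimal sketch of what such a script might look like. This is my own reconstruction, not the model's output, and the helper names are mine:

```python
import re

VOWELS = set("aeiou")

def words(sentence):
    # Lowercase and strip punctuation so the counting rules see bare words.
    return re.findall(r"[a-z]+", sentence.lower())

def step(score, op, n):
    # Apply one operation and print the equation, as the exercise requires.
    new = {"+": score + n, "-": score - n, "*": score * n}[op]
    print(f"{score} {op} {n} = {new}")
    return new

score = 10

# Step 1: add words containing a doubled letter.
w = words("The sudden blizzard chilled the small village communities.")
score = step(score, "+", sum(any(a == b for a, b in zip(x, x[1:])) for x in w))

# Step 2: words with exactly two 'e's; add if score is odd, subtract if even.
w = words("The clever engineer needed seven perfect pieces of cheese.")
score = step(score, "+" if score % 2 else "-", sum(x.count("e") == 2 for x in w))

# Step 3: words with consecutive vowels; subtract if score > 10, else multiply.
w = words("The good sailor joined the eager crew aboard the wooden boat.")
n = sum(any(a in VOWELS and b in VOWELS for a, b in zip(x, x[1:])) for x in w)
score = step(score, "-" if score > 10 else "*", n)

# Step 4: add words whose third letter is a vowel.
w = words("The quick brown fox jumps over the lazy dog.")
score = step(score, "+", sum(len(x) >= 3 and x[2] in VOWELS for x in w))

# Step 5: words with exactly five letters; subtract if odd, multiply if even.
w = words("Three brave kings stand under black skies.")
score = step(score, "-" if score % 2 else "*", sum(len(x) == 5 for x in w))

# Step 6: subtract words containing none of a, e, or i.
w = words("Look down, you shy owl, go fly away.")
score = step(score, "-", sum(not set(x) & set("aei") for x in w))

# Step 7: subtract 'a' words if score > 15, else add 'l' words.
w = words("Green apples fall from tall trees.")
if score > 15:
    score = step(score, "-", sum("a" in x for x in w))
else:
    score = step(score, "+", sum("l" in x for x in w))

print("Final score:", score)  # Final score: 60
```

Every rule collapses to a one-line comprehension, which is exactly the point: the work is trivially deterministic once it's code.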

    Reproducing a cascading failure in a chat

I deliberately engineered the exercise earlier to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters within tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn't go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn't.

I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step by step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you'll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But "half as many errors" is still not zero. Plus, it's expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you're running the AI locally, for orders of magnitude less CPU usage). That's the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.

Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren't like that. You can't write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline, a reproducible series of steps (some deterministic, some requiring an LLM) that lead to a single result, where each step's output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.

LLM pipelines are especially susceptible to the March of Nines

I've been spending a lot of time thinking about LLM pipelines, and I think I'm in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step (whether that's a content generation pipeline, a data processing chain, or a simulation), you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I've been running it hundreds of times per iteration.
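The arithmetic behind those multiplying ceilings is easy to sketch. The per-step probabilities below are illustrative numbers of my own, not measurements from the pipeline:

```python
# If each step in a chain succeeds independently with probability p,
# a seven-step chain (like the scoring exercise) succeeds with p ** 7.
for p in (0.90, 0.99, 0.999):
    print(f"per-step {p} -> 7-step chain {p ** 7:.3f}")
```

Even 99% per-step reliability leaves a seven-step chain at roughly 93% end-to-end, and a 90% step drags it below a coin flip. That's why every step you can push to 100% matters so much.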

The blackjack pipeline in Octobatch, an open source batch orchestrator for multistep LLM workflows that I introduced in "The Accidental Orchestrator."

That's a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions, and how I really learned the hard way that the March of Nines wasn't just a theoretical problem but something I could watch happening in real time across hundreds of data points.

Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn't require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That's arithmetic and a lookup table, work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
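Both of those checks are a few lines of ordinary code. Here's a sketch under obvious assumptions: standard blackjack card values, and a deliberately tiny excerpt of a basic-strategy table rather than the full chart (the table entries and names here are mine, not the pipeline's actual code):

```python
# Standard blackjack card values; aces start at 11 and drop to 1 on a bust.
CARD_VALUES = {str(n): n for n in range(2, 11)}
CARD_VALUES.update({"J": 10, "Q": 10, "K": 10, "A": 11})

def hand_total(cards):
    """Total a hand, demoting aces from 11 to 1 as needed to avoid busting."""
    total = sum(CARD_VALUES[c] for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

# Tiny excerpt of a basic-strategy lookup:
# (player total, dealer upcard value) -> correct action.
BASIC_STRATEGY = {(15, 10): "hit", (15, 6): "stand", (11, 10): "double"}

print(hand_total(["J", "5", "8"]))   # 23: the jack-five-eight check from the text
print(BASIC_STRATEGY[(15, 10)])      # hit: standing on 15 against a 10 fails compliance
```

Neither function can hallucinate, drift, or cost an API call, which is the whole argument.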

Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in "AI, MCP, and the Hidden Costs of Data Hoarding." Teams dump everything into the AI's context because the AI can handle it, until it can't. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But "mostly works" is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you, which is exactly what Prompt 9 demonstrated.

    Getting cascading failures out of the blackjack pipeline

I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That's why I'm writing this article: the iteration arc turned out to be one of the clearest illustrations I've found of how the principle works in practice.

I addressed failures in two ways, and the distinction matters.

Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn't require an API call, so it's free, instant, and 100% reproducible. There's a math verification step that uses code to recalculate totals from the actual cards dealt and compares them against what the LLM reported, and a strategy compliance step checks the player's first action against a deterministic lookup table. Neither of these steps requires any AI to make a judgment call; when I initially ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.

Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer's turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don't eliminate the LLM from the step; they make the LLM more reliable within it.
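To make the second category concrete, here's what a rigid-format constraint can look like on the validation side: a deterministic check that rejects any transcript where the dealer's turn doesn't appear before the verdict. The line formats and field names are hypothetical, not Octobatch's actual schema:

```python
import re

# Hypothetical transcript line formats the prompt would require the LLM to emit.
DEALER_STEP = re.compile(r"^DEALER: (draws \w+ \(total \d+\)|stands at \d+)$", re.M)
WINNER_LINE = re.compile(r"^WINNER: (player|dealer|push)$", re.M)

def valid_transcript(text):
    # Pass only if a dealer step exists and precedes the winner declaration,
    # which makes "skip straight to declaring a winner" mechanically impossible.
    dealer = DEALER_STEP.search(text)
    winner = WINNER_LINE.search(text)
    return bool(dealer and winner and dealer.start() < winner.start())

good = "DEALER: draws 9 (total 19)\nDEALER: stands at 19\nWINNER: player"
bad = "WINNER: player"
print(valid_transcript(good), valid_transcript(bad))  # True False
```

The LLM still plays the dealer's turn; the format just leaves it nowhere to hide a skipped step.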

But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when relying on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were clearly wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce them didn't have sufficient guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn't add up. If you let probabilistic behavior into a step that needs to be deterministic, the output will look plausible and the system will report success, but you have no way to know something's wrong until you go looking for it.
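The kind of aggregate check that exposes a bug like that is itself deterministic and tiny. This is a hypothetical version with invented records, not the pipeline's data: a hand should only count as passed if every verification verdict for it is true, so the recomputed total has to match the reported one.

```python
# Invented records: each hand carries its deterministic check verdicts and
# the pass/fail verdict the (buggy) enforcement step reported.
hands = [
    {"id": 1, "checks": {"math_ok": True,  "strategy_ok": True},  "reported_pass": True},
    {"id": 2, "checks": {"math_ok": False, "strategy_ok": True},  "reported_pass": True},
    {"id": 3, "checks": {"math_ok": True,  "strategy_ok": False}, "reported_pass": True},
]

recomputed = sum(all(h["checks"].values()) for h in hands)
reported = sum(h["reported_pass"] for h in hands)
print(f"reported passes: {reported}, recomputed passes: {recomputed}")
# reported passes: 3, recomputed passes: 1 -> the aggregates don't add up
```

When the two counts disagree, something between the checks and the verdict is leaking probabilistic behavior.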

Once I fixed the bug, the real pass rate emerged: 31%. Here's how the next seven iterations played out:

    • Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from examining what was actually failing and asking whether the LLM needed to be doing that work at all.
    • Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it's also more expensive because it uses more tokens and takes more time.
    • Replacing the LLM validator with deterministic code (48% → 79%). This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I'd given it. But there's a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
    • Rigid output format (79% → 81%). The LLM kept skipping the dealer's turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
    • Overriding the model's priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn't help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
    • Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.

Find the best ways to earn your nines

If you're building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one recognition was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.

I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc, 48% to 79%, came from replacing an LLM validator with a 10-line expression.

Here's the bottom line for me: If you can write a short function that does the job, don't give it to the LLM. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I saw it wasn't one at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.

At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.

The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans, and it changes the way you think about what a user manual is for.
