Transcript
Tim Williams: Hello and welcome! I'm Tim Williams and this is episode 17 of Rubber Duck Radio. How are you, Paul?
Paul Mason: I'm good. [chuckle] Let's jump into what we were just talking about off air. What's the challenge this week?
Tim Williams: [sigh] So here's the thing that's been driving me absolutely nuts this week. [pause] I spent three days — [emphasis] three days — trying to get an AI agent to help me manage a multi-environment AWS setup. [inhale] Dev, staging, production. Standard stuff. I gave it the Terraform. I gave it the architecture docs. I gave it — [short pause] I mean, I practically wrote it a novel. [exhale] And it still couldn't keep straight which environment had which resources.
Paul Mason: Yeah. [short pause] Let me guess — it kept suggesting changes that would've nuked production?
Tim Williams: [laughing] Oh, only about forty percent of the time. [inhale] The other sixty percent it was confidently telling me to apply dev configs to staging and then acting like [emphasis] I was the one who was confused.
Paul Mason: [tsk] That's exactly what I've been seeing. [inhale] And here's my theory — I think a lot of developers using these AI tools on greenfield projects, they're working with like, Cloudflare Workers or Vercel. Stripped-down infrastructure. [short pause] And honestly? The agents handle that fine. Because there's just not that much to keep track of.
Tim Williams: [exhale] Right. [pause] And that's the whole problem in a nutshell. When you're on AWS or Azure, you're not dealing with a handful of primitives — you've got, what, [short pause] two hundred plus services? Each with its own IAM permissions, its own networking model, its own quirks. [inhale] The agent has to pull in [emphasis] so much context just to understand the landscape that it gets completely lost in the weeds.
Paul Mason: Yeah, that's it. [pause] It's the context window problem but on steroids. [inhale] With Cloudflare, you've got what — Workers, D1, R2, KV, maybe Durable Objects. Like five or six things. The agent can hold that entire mental model in its head. [short pause] But AWS? [tsk] Good luck getting it to remember that your RDS instance is in us-east-1 but your Lambda's in us-west-2 and they need to talk through a VPC peering connection.
Tim Williams: And the thing is, [pause] Terraform helps. It really does. Having infrastructure as code gives the agent [emphasis] something to chew on. [inhale] But here's the part that nobody talks about — the agent doesn't know what to [emphasis] discover. It can read the Terraform, sure. But it doesn't know to ask, "Hey, is this security group actually open to the right CIDR block?" or "Are these two environments accidentally sharing a state file?" [short pause] It can see the code, but it can't see the [emphasis] implications.
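The two questions Tim poses here are the kind of check that can be scripted rather than left to an agent to think of. The following is a minimal sketch, not anything described on the show: it assumes you have already collected each environment's state location and a security group rule's CIDR by some other means. The `BACKENDS` inventory, the bucket and key names, and both function names are hypothetical, and the script does not parse Terraform itself.

```python
import ipaddress
from collections import defaultdict

# Hypothetical inventory: environment name -> (state bucket, state key).
# In practice these values would be read from each environment's backend
# configuration; the names below are made up for illustration.
BACKENDS = {
    "dev": ("example-tf-state", "dev/terraform.tfstate"),
    "staging": ("example-tf-state", "staging/terraform.tfstate"),
    "production": ("example-tf-state", "staging/terraform.tfstate"),
}


def find_shared_state(backends):
    """Return {(bucket, key): [envs]} for every state location used by 2+ environments."""
    by_location = defaultdict(list)
    for env, location in backends.items():
        by_location[location].append(env)
    return {loc: sorted(envs) for loc, envs in by_location.items() if len(envs) > 1}


def cidr_within(rule_cidr, allowed_cidr):
    """True if the rule's CIDR block sits entirely inside the allowed block (same IP version)."""
    rule = ipaddress.ip_network(rule_cidr)
    allowed = ipaddress.ip_network(allowed_cidr)
    return rule.subnet_of(allowed)


if __name__ == "__main__":
    for (bucket, key), envs in find_shared_state(BACKENDS).items():
        print(f"WARNING: {', '.join(envs)} share s3://{bucket}/{key}")
    if not cidr_within("0.0.0.0/0", "10.0.0.0/16"):
        print("WARNING: rule is wider than the allowed range")
```

A check like this only covers the implications someone already knew to look for, which is the point Tim is making: the list of questions still has to come from a person who understands the setup.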
Paul Mason: And that's the kicker, right? [pause] The moral of this whole thing — sorry, stealing your line — is that the developer still has to [emphasis] actually understand the infrastructure. [short pause] Tight docs and IaC help the agent not be completely useless, but they're not a replacement for having someone on the team who knows what a VPC is and why you can't just — [chuckle] — you know, slap everything in the default one and call it a day.
Tim Williams: You know, Paul — [pause] that point you just made about the developer still needing to understand the infrastructure? [short pause] That actually connects to something bigger I've been stewing on. [pause] There's this really blurry line right now between what [emphasis] should be agentic and what should be [emphasis] deterministic. [inhale] And I'm seeing a lot of people — especially folks who are newer to writing software — throwing [emphasis] everything at an agent. Tasks that have always been faster and more reliable as deterministic workflows.
Paul Mason: Yeah. [short pause] Totally. [inhale] And here's the thing I keep noticing — [pause] a lot of people who aren't familiar with software development don't realize you [emphasis] can ask the agent to build you a reusable pipeline. [tsk] They treat the LLM like it [emphasis] is the application, instead of a tool for [emphasis] building the application.
Tim Williams: Exactly. [pause] And that's [emphasis] crazy expensive. Not just in compute — in time, in reliability, in debugging headache. [inhale] It's like — [short pause] you're calling in a Michelin-star chef to boil water. [chuckle] Every. Single. Day. [pause] Once, sure — have the chef design the kitchen. But after that, [emphasis] just turn on the stove. [inhale] And what's wild is you can even create your own MCPs to reduce the cognitive lift for your LLMs. Build the deterministic scaffolding once, then only bring the agent in for the parts that actually require thinking, or decision-making, or summarization.
Paul Mason: [chuckle] I have a [emphasis] perfect example of exactly this. [inhale] So — [pause] I've got a trip coming up, right? And I wanted to track flight costs over a month or so, figure out the best time to buy. [short pause] Now, the way I see most people approaching this — they'd just ask an agent. "Hey, track this flight route for me and tell me when to buy." Or worse, set up a daily cron job where a frontier model runs the [emphasis] entire pipeline, every single day. Re-discovering the data, re-parsing, re-deciding what to do.
Tim Williams: [laughing] Oh, I can already see where this is going. [inhale] So walk me through it — what'd you do instead?
Paul Mason: So I sat down with a frontier model — [short pause] one shot, one conversation — and said, "Build me a data aggregation pipeline for this specific flight route." [inhale] It generated the whole thing. The scraping, the parsing, the error handling. [pause] Then I pointed it at a local SQLite database — super simple — and had it store all the price data there, day by day. [short pause] Set up a little daily graph so I could see the trend visually. [chuckle] Nothing fancy, just enough to spot the pattern.
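The storage half of a pipeline like the one Paul describes can be very small. This is a sketch under stated assumptions, not Paul's actual code: the table layout, the `flights.db` filename, and the function names are invented for illustration, and the scraping step is left out entirely because the source does not say how the prices were fetched.

```python
import sqlite3
from datetime import date


def open_db(path="flights.db"):
    """Open (or create) the local database and make sure the prices table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS prices (
            observed_on TEXT NOT NULL,   -- ISO date, one row per day per route
            route       TEXT NOT NULL,
            price_usd   REAL NOT NULL,
            PRIMARY KEY (observed_on, route)
        )
        """
    )
    return conn


def record_price(conn, route, price_usd, observed_on=None):
    """Store today's price; re-running on the same day overwrites that day's row."""
    observed_on = observed_on or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO prices (observed_on, route, price_usd) VALUES (?, ?, ?)",
        (observed_on, route, price_usd),
    )
    conn.commit()


def price_history(conn, route):
    """Return [(observed_on, price_usd), ...] for one route, oldest first."""
    rows = conn.execute(
        "SELECT observed_on, price_usd FROM prices WHERE route = ? ORDER BY observed_on",
        (route,),
    )
    return rows.fetchall()
```

A daily scheduled job would call `record_price` once with whatever the fetch step returns. Keying the table on date and route keeps the job safe to re-run, since a retry replaces the day's row instead of duplicating it.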
Tim Williams: Okay, [short pause] so the pipeline's running, the data's pouring in — [inhale] where does the LLM actually come in?
Paul Mason: [inhale] And here's where it gets interesting — [pause] [emphasis] after the pipeline had been running for about a month, I fed all that accumulated data into a [emphasis] small local LLM. [short pause] Not a frontier model. Not an agent. Just a tiny model running on my machine. And I asked it, "Based on this data, when's the best time to buy?" [exhale] The LLM only touched the one part that required actual [emphasis] reasoning — looking at the trend and making a judgment call. The data collection? Completely deterministic. No AI involved. Day after day, just a script doing its job.
Paul Mason: [tsk] Compare that to what I see out there. [inhale] People are setting up daily jobs where a GPT-5-class agent re-discovers the flight data, re-parses it, re-decides what to do — [short pause] it's like rebuilding your car every time you need to drive to the grocery store. [chuckle] And then they wonder why their API bill is four hundred bucks a month.
Tim Williams: [exhale] That's such a clean example. [pause] And it gets at exactly the distinction I've been trying to put my finger on. [inhale] If the task is [emphasis] data collection — scraping, storing, transforming — that's a pipeline. Deterministic. Build it once, run it a thousand times. [short pause] But if the task is [emphasis] judgment — looking at a month of price data and saying "this trend suggests Tuesday is your day" — [pause] that's where the LLM earns its keep. [inhale] And the tragedy is, people who don't know the difference are burning frontier-model compute on the part that should be a cron job with a shell script.
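The hand-off between the two halves can itself be deterministic. The sketch below assumes the accumulated data arrives as `(ISO date, price)` pairs and condenses it into a short prompt for a small local model. The function names and the weekday summary are illustrative choices, not something described in the episode, and the model call is deliberately omitted because the source does not name a runtime.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_averages(rows):
    """rows: iterable of (ISO date string, price). Returns {weekday: mean price}."""
    buckets = defaultdict(list)
    for observed_on, price in rows:
        day = WEEKDAYS[date.fromisoformat(observed_on).weekday()]
        buckets[day].append(price)
    # Emit weekdays in calendar order, skipping days with no observations.
    return {day: round(mean(buckets[day]), 2) for day in WEEKDAYS if day in buckets}


def build_prompt(route, rows):
    """Turn a month of observations into a compact question for the model."""
    rows = sorted(rows)
    if not rows:
        raise ValueError("no price observations to summarize")
    averages = weekday_averages(rows)
    summary = "\n".join(f"{day}: {avg:.2f}" for day, avg in averages.items())
    return (
        f"Daily fare observations for route {route}, "
        f"{rows[0][0]} through {rows[-1][0]} ({len(rows)} days).\n"
        f"Average price by weekday:\n{summary}\n"
        "Based on this data, when is the best time to buy?"
    )
```

Everything up to the returned string is plain computation. Only the final question, sent to whatever local model is available, asks for judgment.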
Tim Williams: [inhale] And Paul, I think that last point you made — [pause] about people not realizing the agent can build the pipeline — it connects to something bigger that I've been seeing everywhere lately. [exhale] It's this collective push from management. [emphasis] Indiscriminate AI adoption. Like, the API bill itself becomes the proof of success.
Paul Mason: [tsk] Yeah. [short pause] I've literally been in meetings where someone bragged about their OpenAI bill. [chuckle] Like it was a badge of honor. [inhale] Six figures a month, and nobody could tell me what business outcome it drove.
Tim Williams: [laughing] Right. [inhale] It's the worst kind of metric. [pause] Spend as a proxy for innovation. [emphasis] And here's the thing — it's not just wasteful, it's actively harmful. Because when the API bill is the KPI, you're incentivized to shove AI into places it has no business being.
Paul Mason: Totally. [inhale] And what gets me is — [pause] it's like we forgot how to measure things. [emphasis] Business outcomes. Did we ship faster? Did we reduce churn? Did the team actually get more done? [exhale] Those are hard to measure. An API bill? That's just a number on a dashboard. Easy to flaunt, easy to game.
Tim Williams: [pause] So let's actually draw the line. Because I think this is the conversation every engineering team needs to have with their leadership right now. [inhale] What should be deterministic, and what should be agentic — meaning, what actually requires judgment, reasoning, transformation.
Paul Mason: Yeah. [short pause] And I think the heuristic is actually pretty simple. [inhale] If the answer exists in the data — like, it's just a computation away — it's deterministic. Run a query, run a script, move on. [pause] If the answer requires weighing ambiguity, interpreting nuance, or generating something new? [emphasis] That's where the LLM earns its keep.
Tim Williams: [exhale] And the tragedy is — [pause] [emphasis] most of the truly valuable AI use cases I've seen are modest. They're targeted. They're not these sprawling agentic architectures that touch every system. [inhale] It's things like summarizing customer feedback threads before a sprint planning. Or generating test cases from a PR diff. Small, high-judgment tasks.
Paul Mason: [chuckle] It's the unsexy stuff. [inhale] Nobody's getting a promotion for a Slack bot that summarizes Jira tickets. [short pause] But that's exactly the kind of thing that actually makes a team faster. [emphasis] Not replacing the build pipeline with an agent. Just greasing the wheels where humans actually spend time.
Tim Williams: [pause] So here it is then — [inhale] measure business outcomes, not AI spend. Keep deterministic workflows deterministic. [emphasis] And use LLMs for what they're actually good at: judgment, transformation, and summarization. Everything else? [short pause] Write a script. [chuckle] Your CFO will thank you.
Paul Mason: [laughing] Yes. [inhale] And I think — [pause] the companies that figure this out first? The ones that treat AI as a precision tool instead of a status symbol? [emphasis] Those are the ones that are gonna be around in five years. The ones flaunting their API bills? [tsk] They're gonna flaunt themselves right out of business.
Tim Williams: [pause] Alright — [inhale] let me push back on that a bit. [short pause] Because I think there's a grey area we're glossing over.
Paul Mason: [chuckle] Uh oh. Here we go.
Tim Williams: [laughing] No, no — hear me out. [inhale] Your heuristic is clean. If the answer's in the data, it's deterministic. If it takes judgment, it's LLM territory. I like it. [pause] But there's a whole category of interaction that doesn't fit neatly into either bucket. [short pause] And that's user interaction with software.
Paul Mason: Okay. [inhale] Walk me through it.
Tim Williams: So here's the thing — [inhale] if you know a piece of software inside and out, you're right. [emphasis] You're faster than an agent. You know the keyboard shortcuts, you know the data model, you know where the traps are. Deterministic wins. [pause] But [emphasis] what if you don't know the software?
Paul Mason: Yeah. [short pause] That's... [pause] that's actually where it gets tricky.
Tim Williams: Right? [inhale] So imagine you're someone who's never touched Photoshop. You need to remove a background, adjust the lighting, export in a specific format. [pause] For a designer, that's a thirty-second deterministic workflow. For you? It's an hour of googling, trial and error, maybe giving up. [short pause] But you hand it to an agent that can puppet Photoshop — and suddenly you're done in two minutes.
Paul Mason: So you're saying — [inhale] the line between deterministic and agentic isn't just about the task. It's about [emphasis] who's doing it.
Tim Williams: Exactly. [inhale] The task itself is deterministic — it's the same pixels, the same transformations every time. [pause] But the [emphasis] knowledge asymmetry makes it nondeterministic in practice. You don't know the steps, so you can't encode them. The agent becomes a translation layer between intent and execution.
Paul Mason: Hmm. [pause] So it's like — [short pause] I know exactly how to write a SQL query to get the flight data I want. That's deterministic for me. [emphasis] But for someone who's never touched SQL, asking an agent to do it — that's not them being lazy. They genuinely don't have the skill.
Tim Williams: That's it. [inhale] And I think this is the nuance that gets lost. [pause] The purist argument says everything with a determined outcome should be a script. But [emphasis] that assumes the person has the knowledge to write the script in the first place. [short pause] For a lot of people, the agent isn't replacing efficiency — it's [emphasis] replacing impossibility.
Paul Mason: That's a really good point. [pause] And I think — [inhale] this is where my flight tracker example actually has a flaw. [chuckle] I knew SQL. I knew the pipeline. So of course I could build it deterministically. [short pause] But if I handed that same problem to someone who's never touched a database? [emphasis] The agent is the right call.
Tim Williams: And here's the thing that makes it even murkier — [inhale] [emphasis] where's the line between using the agent to puppet software you don't know, and using the agent as a crutch so you never learn it?
Paul Mason: [pause] Oof. [tsk] That's the real question, isn't it?
Tim Williams: [chuckle] I told you it was a grey area. [inhale] Because on one hand, you don't need to know how a compiler works to write good code. Abstraction is the entire history of computing. [pause] On the other hand — [emphasis] if you never learn what the agent is actually doing, you can't verify the output. You can't catch when it's confidently wrong.
Paul Mason: Yeah. [inhale] And this loops right back to your first topic — [emphasis] the multi-environment problem. The agent puppeting your AWS console when you don't know what a security group is? [short pause] That's terrifying.
Tim Williams: [laughing] Right — we've come full circle. [inhale] So I think the refined heuristic has to be something like: [pause] [emphasis] use determinism where you have mastery, use agents to bridge knowledge gaps, but never let the agent be a substitute for understanding the domain you're responsible for.
Paul Mason: Yeah. [short pause] That lands. [inhale] It's not a simple flowchart — it's a judgment call about [emphasis] what you can verify. [pause] If you can verify the output, the agent's fair game. If you're just trusting it? [tsk] That's when you're in trouble.
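Paul is clear that this is a judgment call rather than a flowchart, so the following is only one possible reading of the refined heuristic, written down to make its ordering explicit. The `Approach` names and the two boolean inputs are assumptions introduced for illustration.

```python
from enum import Enum


class Approach(Enum):
    SCRIPT = "write or reuse a deterministic script"
    AGENT = "let an agent bridge the knowledge gap"
    LEARN_FIRST = "learn enough to verify before delegating"


def choose_approach(has_mastery: bool, can_verify_output: bool) -> Approach:
    """Mastery wins first; otherwise an agent is fair game only if its output can be checked."""
    if has_mastery:
        return Approach.SCRIPT
    if can_verify_output:
        return Approach.AGENT
    return Approach.LEARN_FIRST


if __name__ == "__main__":
    # Someone who cannot write the steps but can check the result.
    print(choose_approach(has_mastery=False, can_verify_output=True).value)
    # Someone changing infrastructure they could not review.
    print(choose_approach(has_mastery=False, can_verify_output=False).value)
```

The third branch is the case both hosts call dangerous: delegating work in a domain you are responsible for without any way to catch a confidently wrong answer.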
Tim Williams: Speaking of being in trouble — [pause] there's a new MIT study out that puts some [emphasis] very uncomfortable numbers behind exactly what we've been talking about. [short pause] The NANDA initiative at MIT just dropped a report. Fortune covered it a few weeks ago. [exhale] [emphasis] Ninety-five percent of generative AI pilots at companies are failing to deliver any measurable profit impact. [pause] Ninety-five percent.
Paul Mason: [pause] [tsk] Ninety-five. [short pause] [inhale] And these are the same companies whose management is out there bragging about their AI spend, right? [emphasis] That's the part that gets me.
Tim Williams: [sigh] Exactly. [inhale] The MIT team did 150 interviews with leaders, surveyed 350 employees, analyzed 300 public AI deployments. [pause] The core finding isn't that the models are bad — it's that there's a [emphasis] learning gap. The tools don't adapt to workflows, and organizations don't adapt to the tools. [short pause] But here's the part that made my jaw drop — [inhale] more than half of generative AI budgets are going to sales and marketing tools. [pause] And the [emphasis] actual ROI? It's in back-office automation. Eliminating BPO, cutting agency costs, streamlining operations. [exhale] Companies are pouring money into the [emphasis] wrong end of the pipeline.
Paul Mason: [chuckle] So they're spending on the flashy stuff while the boring automation that would [emphasis] actually move the needle gets ignored. [pause] That's — [inhale] that's not even an AI problem at this point. That's an organizational stupidity problem.
Tim Williams: [laughing] Right. [inhale] But it gets worse. [pause] The article profiles this CEO — Eric Vaughan at IgniteTech — who mandated [emphasis] "AI Monday." Every Monday, you couldn't have customer calls, couldn't work on budgets — [short pause] you had to [emphasis] only work on AI projects. [pause] And within a year, he'd replaced [emphasis] eighty percent of the staff.
Paul Mason: [long pause] Wait. [inhale] He — [short pause] he replaced eighty percent of his workforce because they weren't AI enough on Mondays? [tsk] That's not a strategy. That's a [emphasis] hostage situation.
Tim Williams: [exhale] And that's what we're seeing, right? [pause] Management conflating [emphasis] activity with [emphasis] outcomes. The API bill is the new proxy metric. [short pause] Nobody's asking "what did the AI actually improve?" — they're asking "how much did we spend on AI this quarter?" [sigh] As if the spend [emphasis] itself is proof of innovation.
Paul Mason: Yeah. [inhale] And that's the exact opposite of what the MIT data says actually works. [pause] The five percent that [emphasis] are succeeding? [short pause] They're picking one pain point, executing well, and partnering with vendors who specialize. [emphasis] Vendor partnerships succeed about sixty-seven percent of the time. Internal builds? [tsk] Only about a third as often.
Tim Williams: And yet — [inhale] the report says companies are [emphasis] overwhelmingly choosing the internal build route. [pause] They're going solo even though the data screams otherwise. [short pause] It's like — [chuckle] it's like watching someone try to build their own car because they don't trust Toyota, and then being surprised when it doesn't start.
Paul Mason: [laughing] And spending ten times the cost of a Toyota in the process. [exhale] But here's the thing that ties it all back — [inhale] I'd bet real money that [emphasis] the five percent who are succeeding are the ones being selective about what's agentic and what's deterministic. [pause] They're not the companies running AI Mondays and flaunting their token count like it's a scoreboard.
Tim Williams: [emphasis] That's it. [inhale] The successful ones are probably asking the same question we've been asking this whole episode — [pause] what should the model [emphasis] actually be doing? [short pause] Not "how much can we make the model do?" [exhale] The API bill as a status symbol is [emphasis] the most reliable indicator that nobody in the room is thinking critically about where the value is.
Paul Mason: [chuckle] The API bill as an inverse KPI. [short pause] The higher it is, the worse your strategy. [inhale] I love it. [pause] And look — [exhale] the report also mentions that the most advanced orgs are starting to experiment with agentic systems that can learn and remember and act within boundaries. [short pause] That's the next wave. [emphasis] But if you can't even get the deterministic stuff right first — [tsk] you're gonna be in that ninety-five percent forever.
Tim Williams: [exhale] Alright, speaking of getting things right — [pause] I need to confess something. [inhale] I finally broke up with VS Code.
Paul Mason: [surprised] Wait — [short pause] like, [emphasis] actually broke up? [chuckle] After all these years?
Tim Williams: [laughing] After all these years. [inhale] Both flavors — the Cursor fork and the stock Microsoft branch. [pause] It's done. [short pause] I was sitting there last week watching an agent fly through my codebase at a speed that my editor — [emphasis] my actual editor — couldn't keep up with. [exhale] The keystroke lag, the extension weight, the constant "window is taking too long" messages. [sigh] It was like watching a race car stuck in LA traffic.
Paul Mason: [chuckle] Yeah, [inhale] I've noticed the same thing. [short pause] When Cursor's own agent is spawning terminals faster than the editor can render the diff — [tsk] that's a bad sign.
Tim Williams: Right? [inhale] So I tried Zed. [pause] [emphasis] And I am in love. [short pause] The responsiveness — [exhale] I'm not exaggerating, it might as well be Vim. The layout is clean, it's not trying to be a platform, and the agentic experience — [emphasis] it just makes sense. You can plug in just about any existing API. OpenAI lets you use your developer license for it. [pause] No weird enterprise gatekeeping.
Paul Mason: So what'd you throw at it? [inhale] First impressions from actually using it?
Tim Williams: [exhale] Okay so — [inhale] I wired up a [emphasis] local agent running Gemma 4. Not a frontier model. A local open weights model, right on my machine. [pause] And I pointed it at a decently large codebase — not a toy, like a real project with multiple services, configs everywhere. [short pause] And it [emphasis] handled it. Semi-complex refactoring, navigating across files, understanding the architecture enough to make informed suggestions. [exhale] I was genuinely surprised.
Paul Mason: [delight] Okay that's — [pause] that's actually huge. [inhale] A local model on a real codebase? [short pause] That hits on exactly what we've been saying about deterministic versus agentic — you're running the [emphasis] reasoning locally, for the stuff that actually requires thinking, without burning frontier compute.
Tim Williams: [emphasis] That's exactly what got me excited. [inhale] Because here's the thing — I can't live full time in the Claude Code or Codex workflow. You know the one — terminal only, black box, the model's doing everything and you're just... [pause] watching. [short pause] I still need to [emphasis] traverse the codebase. I need to see the full context of files, do visual inspections, understand the shape of things before I hand off a task. [exhale] Those agentic harnesses are incredible — but they're not the full picture for a developer.
Paul Mason: Yeah, [inhale] the harness-versus-workspace thing. [pause] Claude Code is great when you know exactly what you want and you just need it done. [short pause] But when you're still [emphasis] understanding the problem? [tsk] You need visibility. You need an editor that doesn't get in the way.
Tim Williams: Right. [inhale] And that's where Zed clicked for me. [pause] The editor is fast enough that the agent [emphasis] is the bottleneck now, not the UI. [short pause] Which is how it should be. [exhale] The thing I'm trying to figure out next — and this is where I think the real unlock is — [inhale] how do you automatically split the work between the frontier models and the open weights models? [pause] Opus for the high-level architecture decisions, the big-picture design work. [short pause] And then Gemma 4, Qwen 3.6 — these new, surprisingly capable open models — for the implementation, the refactoring, the stuff that needs to be fast and cheap and doesn't require a PhD in everything.
Paul Mason: [excited] Oh, [inhale] I love that. [pause] That's — [short pause] that's the exact same pattern as the flight tracker, right? [emphasis] Frontier model for the architecture, local model for the execution. [pause] You're just applying it to the [emphasis] development workflow itself.
Tim Williams: [laughing] Right — we've come full circle [emphasis] again. [inhale] The flight tracker pattern, the deterministic-versus-agentic pattern — it applies to the tools we use every day too. [pause] Why am I burning Opus compute on a refactor that Gemma 4 can handle locally in under a second? [short pause] And why am I asking a local model to architect a distributed system? [exhale] The right tool for the right job — it sounds so obvious when you say it out loud.
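Tim frames the automatic split as an open question, so what follows is only a starting-point sketch of the simplest version: a static routing rule keyed on task kind. The task kinds, the `Task` type, and the two model labels are placeholders invented for illustration. They are not real model identifiers or any editor's configuration format.

```python
from dataclasses import dataclass

# Placeholder labels, not real model IDs.
FRONTIER = "frontier-model"
LOCAL = "local-open-weights-model"

# Assumed split, following the hosts' description: big-picture design work
# goes to the frontier model, everything else stays local.
FRONTIER_KINDS = {"architecture", "system_design"}


@dataclass(frozen=True)
class Task:
    kind: str          # e.g. "architecture", "refactor", "implementation"
    description: str


def route(task: Task) -> str:
    """Pick a model label for a task; unknown kinds default to the cheap local model."""
    return FRONTIER if task.kind in FRONTIER_KINDS else LOCAL


if __name__ == "__main__":
    tasks = [
        Task("architecture", "Design the service boundaries"),
        Task("refactor", "Rename a helper across the package"),
    ]
    for task in tasks:
        print(f"{task.kind}: {route(task)}")
```

Defaulting unknown kinds to the local model is a design choice that favors cost. A team that cares more about quality on unclassified work might default the other way, or escalate to the frontier model only when the local result fails review.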
Paul Mason: [chuckle] Yeah, [inhale] but [emphasis] most teams aren't even asking the question. [pause] They're just throwing Opus at everything because someone set the API key once and nobody's thought about it since. [short pause] The fact that you're experimenting with a local Gemma model on real work — [exhale] that's the kind of intentionality we've been talking about this whole episode.
Tim Williams: [inhale] Well and that's the moral of the story, isn't it? [pause] Whether we're talking about multi-environment infrastructure, agentic versus deterministic workflows, how businesses measure AI success — [short pause] it all comes down to [emphasis] intentionality. [exhale] Not "how much AI can we use" but "where does AI actually help?" [pause] And sometimes that means using a frontier model to [emphasis] design the pipeline and then getting out of its way. Sometimes it means running a small local model because the task doesn't need more. [short pause] And sometimes — [chuckle] it means switching editors because your old one can't keep up with the agent you unleashed.
Paul Mason: [laughing] You'll thank yourself for taking heed. [inhale] No, but seriously — [pause] I think this episode really did connect every dot. [short pause] We started with your AWS agent nightmare, walked through the deterministic-versus-agentic line, the MIT numbers, the grey areas — and landed at... [exhale] be intentional. Verify what you can verify. Use the right model for the right job. And [emphasis] don't let your API bill be your KPI.
Tim Williams: [emphasis] That's the episode. [inhale] [pause] If you're in that ninety-five percent — or you're worried you might be — [short pause] take one thing from this. Ask yourself: what should the model [emphasis] actually be doing? Not what [emphasis] can it do. What [emphasis] should it do. [exhale] Everything else is just an API bill with a marketing budget.
Paul Mason: [chuckle] The API bill as an inverse KPI. [pause] We should put that on a T-shirt. [inhale] Alright — I'm Paul Mason.
Tim Williams: And I'm Tim Williams. [inhale] This has been episode seventeen of Rubber Duck Radio. [short pause] Gemma 4, Qwen 3.6 — go try a local model on your actual codebase. You might be surprised. [exhale] And if your editor can't keep up with your agent — [chuckle] maybe it's time.
Paul Mason: [laughing] [emphasis] Here's looking at you, VS Code. [pause] See you next time.