[>>] S1E21July 2, 202637:02

The Local AI Win Nobody's Talking About

Tim Williams (host)Paul Mason (host)

0:00

37:02

Now playing:Feeling Optimistic About Local AI

Chapters

Show Notes

Tim Williams opens up about his surprising optimism for on-device AI, from Google's encoder-free Gemma 4 to Apple's upcoming AFM 3, and what it means when a 12B model runs entirely offline. The duo also dissect Claude Sonnet 5's lukewarm reception and Mythos 5's restricted comeback, then pivot to a practical breakthrough: sub-agent architecture that pairs a frontier orchestrator with cheap local workers. It’s a workflow that slashes cloud costs, preserves privacy, and puts the power back in your hands—proving the local AI future isn't just theoretical.

Transcript

Tim Williams: Welcome back to Rubber Duck Radio, episode 21. I'm Tim Williams and I've got Paul Mason with me. Paul, [pause] I've got a few things on my mind today, and the first one is — [emphasis] I'm feeling optimistic. Paul Mason: That's a good start. [chuckle] Optimistic about what? Tim Williams: Local AI. On-device models. [pause] I know, I know — it sounds like I'm about to pitch you on something that doesn't exist yet. But hear me out. [inhale] I've been running Gemma 4 twelve-B on my laptop and it [emphasis] genuinely changed how I think about what's possible offline. Paul Mason: Gemma 4? That's Google's new one, right? The one from like, uh, early June? Tim Williams: Yeah, dropped June 3rd. And here's what's wild — [pause] it's encoder-free. No separate vision encoder, no separate audio encoder. Vision is literally a single matrix multiply. Raw audio gets projected straight into token space. [exhale] One model, text and vision and audio, running on a laptop with sixteen gigs of RAM. Paul Mason: [surprised] Wait — so it can see images and hear audio natively? Not via some pipeline where it calls out to a separate model? Tim Williams: [emphasis] One model. And it doesn't just run — it runs well. Agentic coding, data analysis, voice dictation, all offline. [pause] This is a twelve billion parameter model. That's not massive. That's the kind of thing that fits on a phone in a year or two. Paul Mason: Okay, [short pause] that's actually impressive. I've been so heads-down in the cloud API workflow that I haven't really been tracking the small model scene. [tsk] Have you actually run it locally, or is this more of a, uh, theoretical optimism? Tim Williams: [laughing] No, I ran it. I pointed it at some photos on my machine and asked it to categorize them, and it just — [pause] did it. No network request. No API key. No latency spike while it phoned home. [exhale] There's something almost unsettling about it. You forget what local compute feels like. Paul Mason: [chuckle] You forget because we've been trained out of it. Like, for years now the answer to anything AI has been 'send it to the cloud.' And now a twelve-B model on your laptop is like — [short pause] nah, I got this. Tim Williams: Exactly. And that brings me to Apple. [pause] WWDC was June 8th, and they showed off AFM 3 Core Advanced — it's a twenty-B sparse model that only activates one to four billion parameters per prompt. It runs on-device. Ships with image input. Structured output. Tool calling. [emphasis] All free, offline, no API key. Paul Mason: [doubtful] Okay, but — this is Apple. They've been promising Siri improvements for like a decade. What makes this different? Tim Williams: Two things. [pause] First, they released a Python SDK. That's — [short pause] that's not the Apple I'm used to complaining about. Second, they built the Foundation Models framework so you can plug in [emphasis] any model — Qwen, whatever — through Core AI. They're not locking you into their model. [inhale] Apple just made on-device LLMs a baseline iOS capability. That's a signal. Paul Mason: So you're saying the takeaway isn't 'Apple made a good model' — it's 'Apple just normalized on-device AI as a platform feature.' Tim Williams: [emphasis] That's exactly it. And once it's a platform feature, the build decision changes. It stops being 'how big a model do I need' and starts being 'what [emphasis] can't run on-device.' [pause] That's a fundamentally different question. Paul Mason: Alright, [short pause] so let me get this straight. You're saying we're heading toward a world where your phone has a genuinely capable local model that can — what? Translate voice in real time? Look at a photo and tell you what's in it? Categorize your stuff without ever touching a server? Tim Williams: That's the dream. [inhale] Imagine a small agentic model living on your device — it knows your calendar, your photos, your messages. You say 'Hey, find that restaurant Paul mentioned last month and add it to my places to try' — and it just does it. [pause] No cloud. No latency. No privacy trade-off. It's all local. Paul Mason: [chuckle] So Siri but actually useful. Tim Williams: [laughing] Siri but actually useful, yes. [pause] But more than that — agentic. It can chain actions. 'Read this PDF, pull out the action items, and add them to my task list.' That's not a query. That's a workflow. And the idea of doing that entirely offline, with no data ever leaving the device — [exhale] that's the kind of thing that makes me optimistic. Paul Mason: Yeah, and I think the piece that matters for developers — for people like us actually building stuff — is the bridge. Because right now we're all building against the OpenAI API spec. Every agentic app I've built, every tool chain, it's all OpenAI-compatible endpoints. Tim Williams: [emphasis] Right. And here's the thing — [pause] you don't have to wait. Google's LiteRT-LM already ships with an OpenAI-compatible server. You run `litert-lm serve` and suddenly Aider, Continue, OpenCode — all your tools — they just point at localhost. Same API. Zero code changes. Paul Mason: [excited] Oh, that's slick. So the strategy is — build today against the API you know, and when the local SDK matures, you just swap the endpoint. Your app doesn't care where the model lives. Tim Williams: Build for the cloud, deploy to the device. [pause] And we're not talking about some distant future — this is working [emphasis] right now. The twelve-B model is already here. The on-device frameworks are shipping. The OpenAI-compatible local server exists. [exhale] We're at the point where the pieces are on the table. Paul Mason: Okay, [short pause] I'll admit — I came into this conversation skeptical and you've actually got me feeling pretty good about it. [chuckle] That doesn't happen often. Tim Williams: [laughing] I'll take that as a win. [pause] Look, I'm not saying every problem is solved. The models still hallucinate. The tool calling isn't perfect. There's plenty of edge cases. [inhale] But for the first time in a while, I look at the trajectory and I think — this is going in a direction that's good for users. Good for privacy. Good for developers. Paul Mason: Yeah, and I think the developer piece is key. Because if I can prototype locally, iterate fast, and only pay for cloud when I actually need scale — [short pause] that changes the economics of building AI features entirely. Tim Williams: Local AI isn't just about privacy or offline capability — it's about [emphasis] agency. For users and for developers. You're not renting intelligence from someone else's data center. You've got a model that lives on your machine, works for you, and doesn't report back to anyone. [exhale] That's worth being optimistic about. Paul Mason: Alright, [pause] I'm sold. [chuckle] So what else is on your mind today? Tim Williams: [pause] Well — [inhale] speaking of AI news, did you see the Sonnet 5 launch this week? [chuckle] The internet was not kind. Paul Mason: [laughing] Not kind is an understatement. I was scrolling Reddit and it's just — [short pause] brutal. The top thread is basically 'what's the point?' [chuckle] People are saying it's the dumbest Claude model they've used. Tim Williams: Yeah, and here's the thing — [pause] I get the frustration. Anthropic's own benchmarks show Sonnet 5 at high reasoning is more expensive AND less capable than Opus 4.8 at low. Their own charts! [exhale] That's a rough look. Paul Mason: Right, and the default performance is actually worse than Sonnet 4.6. So if you're just firing it up without tweaking the effort settings — which is what most people do — you're getting a downgrade. Tim Williams: [chuckle] But here's where I think it gets funny. [pause] If Sonnet 5 had dropped a year ago — June 2025 — we would have lost our minds. We'd be writing blog posts about the revolution. [emphasis] Guaranteed. Paul Mason: Oh, a hundred percent. [short pause] A year ago we were still on GPT-4o and Sonnet 3.5. This thing would've blown the doors off. Now? [tsk] It's the middle child nobody asked for. Tim Williams: And that's the wild thing — it's not that Sonnet 5 is bad. It writes better than 4.6, it's less robotic, some people actually prefer it for creative stuff. [inhale] But it's competing against Opus 4.8 and GPT-5.6 and Gemma 4. The floor has just risen so fast. Paul Mason: Yeah, Simon Willison had a good take on this — he pointed out Sonnet 5 is intentionally less capable at cybersecurity stuff. It's not trying to be Mythos. It's the safe model. The Register called it 'straight down the middle of the road to dodge controversy.' Tim Williams: Which honestly makes sense given everything going on with the government and Mythos and Fable 5. [pause] Anthropic is playing it safe because they kind of have to right now. But the result is a model that feels... [short pause] unnecessary? At least at launch. Paul Mason: Yeah, and the pricing doesn't help. When you can use Opus 4.8 on low and get better results for less money — [short pause] who's the target audience here? [incredulously] People who ran out of Opus credits? Tim Williams: [laughing] That's literally what Hacker News said. [pause] But you know, the moral of the story isn't that Sonnet 5 is bad. It's that our expectations have shifted so dramatically that a genuinely capable model — and it IS capable — gets roasted because it's not mind-blowing. [exhale] That's actually kind of incredible. Paul Mason: Yeah. [short pause] The baseline moved. That's a good thing, even if it makes for some rough launch weeks. Tim Williams: Speaking of Anthropic launches — [pause] Mythos 5 is back. Well, [chuckle] sort of. Paul Mason: Right, the 'trusted partners' thing. The government cleared it for about a hundred companies — cybersecurity firms, critical infrastructure. Very exclusive club. Tim Williams: And Fable 5 is still blocked. So it's not like everything's back to normal. [inhale] I'm genuinely happy Mythos is available again — it's an incredibly powerful model for defensive cybersecurity. But I gotta be honest with you... Paul Mason: [chuckle] It doesn't change anything for us, does it? Tim Williams: [emphasis] Exactly. I can do everything I need with Opus 4.8, with GPT-5.6, and increasingly — as we just talked about — with Gemma 4 running locally. I don't need access to a model that's guarded like a nuclear weapon. Paul Mason: Same. And honestly, the whole thing feels a little weird — Sam Altman even said he doesn't like the idea of the government picking the customers. [short pause] I mean, he's got his own interests there, but he's not wrong. Tim Williams: It sets a strange precedent. [pause] But here's the thing — the conversation around Mythos is about a model most developers will never touch. Meanwhile, what we were just talking about with local AI — that's something I can download and run [emphasis] right now. On my laptop. For free. Paul Mason: Yeah, and that's actually the more exciting story. [short pause] Mythos coming back is good news for cybersecurity. I'm not mad about it. But it doesn't change my workflow. A local model that can do agentic stuff offline? [emphasis] That changes everything. Tim Williams: The accessible stuff is what moves the needle. [pause] A model behind a velvet rope is interesting. A model on my machine that I can actually build with — that's [emphasis] useful. [exhale] And I think that's where my head's been at lately. The local AI future is way more compelling to me than the frontier model arms race. Paul Mason: Alright, [pause] so you've got me excited about local AI, mildly amused by Sonnet 5's rough launch, and vaguely interested in Mythos. [chuckle] What else you got? Tim Williams: [laughing] I like that ranking. [pause] Actually, there's one more thing I wanted to dig into. And this one's personal — it's something I stumbled into building recently that completely changed how I think about AI workflows. [pause] What do you know about sub-agent architecture? Paul Mason: Uh, [short pause] I know the concept. I've seen it in Claude Code and Codex — they spin off sub-agents for different tasks. But honestly, [chuckle] I've mostly just let them do their thing without thinking too hard about it. You've been building with it directly? Tim Williams: [excited] Yeah, and here's the thing — I knew it was important. I'd seen the pattern everywhere. Every major AI coding harness uses it. But I hadn't really [emphasis] felt why until I did it myself. [pause] And now it's one of those things where I can't un-see it. Tim Williams: So here's the basic idea. [inhale] When you're working on a long-horizon project — writing code, building structured data, ingesting a mountain of documents — you don't give one model the whole thing and say "go." Instead, you break it into sub-tasks. Each sub-agent gets a [emphasis] fresh context window, a scoped set of directions, and clear bounds for what's acceptable to return. Paul Mason: So it's basically like... breaking a big software project into components and features. You don't build the whole app in one file. You scope things out, define interfaces, build each piece separately. Tim Williams: [emphasis] Exactly. [pause] And the reason this matters so much — and this is the part that really clicked for me — is that LLMs have a limited attention capacity. Just like humans. [inhale] The smaller the scope you give them, and the larger the context budget you preserve, the better and more well-thought-out the output is. Paul Mason: Okay, [short pause] but hold on. These models have massive context windows now. Opus 4.8 has a million tokens. That's like... [pause] several novels. Are you telling me they still run out of steam? Tim Williams: [laughing] I'm so glad you asked. Because I tested this. [pause] I gave Opus 4.8 — a SOTA model, million-token window — a long-horizon complex task where [emphasis] everything happened inside one context window. Tim Williams: And you know what happened? [pause] Even with that [emphasis] absurdly generous window, the model ran out of steam. The context got polluted. It lost the thread. The early decisions were fine, the middle decisions were shaky, and by the end... [exhale] it was making choices that didn't even align with the original spec. Paul Mason: [surprised] Wait — with a million tokens? It just... lost focus? Tim Williams: Like a developer who's been coding for fourteen hours straight and forgot what the ticket was even about. [chuckle] The context window isn't the problem — it's what's [emphasis] in the context window. All those intermediate decisions, the false starts, the corrections... they pile up. The signal-to-noise ratio collapses. Tim Williams: So then I ran the same project — [pause] same Opus 4.8, same spec — but this time I had it [emphasis] orchestrate. Break the full project into sub-tasks. Define the scope for each one. Define testing methodologies. Define what acceptable output looks like. Then pass each sub-task to a sub-agent to actually do the work. Paul Mason: And the sub-agent was...? Tim Williams: [pause] A local Qwen 3.6 27B dense model. [emphasis] Not a frontier model. A model you can run on a decent GPU at home. Paul Mason: [surprised] Okay. And how'd that go? Tim Williams: Night and day. [emphasis] Night and day. [inhale] The orchestrator — Opus 4.8 — stayed sharp because its context stayed clean. It just had to think about architecture, decomposition, and review. Each sub-agent got a pristine context window with a laser-focused task. And when the sub-agent finished, the orchestrator got a summary of the completed work, tested it, and spun up new sub-agents if anything needed refinement. Tim Williams: The orchestrator's attention capacity was preserved because it was never down in the weeds. [pause] It was doing what a senior engineer does — not writing every line of code, but making sure all the pieces fit together. Paul Mason: So you had a frontier model doing the architecture and a local model doing the implementation. [short pause] That's... [pause] actually kind of brilliant. And cost-effective. Tim Williams: [emphasis] That's the part that gets me. [inhale] If you're using something like Claude Code or Codex, and it's spinning up sub-agents, those sub-agents are burning premium tokens. You're paying frontier-model prices for work a local model could handle. It's like hiring a senior architect to dig footings. Paul Mason: [chuckle] Future you will not thank present you's credit card. Tim Williams: [laughing] No, he absolutely will not. [pause] And here's where it gets... let's call it what it is. [inhale] These coding harnesses [emphasis] should give you the option to delegate work to local sub-agents. It's a no-brainer for token cost. But they don't. And I don't think they will. Paul Mason: Because it cannibalizes their high-end token revenue. Tim Williams: Bingo. And they [emphasis] know this. [pause] Dario Amodei just went on record — well, his 2023 Senate testimony resurfaced — saying open-source AI is moving down a "very dangerous path." That Chinese models are a threat. Paul Mason: [doubtful] Dangerous how, exactly? Tim Williams: Well, he framed it as a safety thing — once models are released openly, you can't control them, can't revoke access, can't update safeguards. [pause] And look, [sigh] there's probably a real argument in there somewhere. But here's the part that jumps out at me — [emphasis] the market is already moving in the opposite direction. Tim Williams: Companies are quietly switching to Chinese models — DeepSeek, Qwen, GLM — because the performance gap has basically closed and the economics are [emphasis] dramatically better. GLM-5.2 is reportedly matching Mythos on some cybersecurity benchmarks. And it's open-weight. Anyone can download it. Paul Mason: [tsk] So when Dario says "dangerous," what he actually means is... Tim Williams: [interrupting] Dangerous to his bottom line. [pause] And I don't say that to be cynical — I say it because the evidence is right there. Chinese open-weight models can already do [emphasis] 95% of what the American SOTA models do. For a fraction of the cost. And you can run them locally, offline, with no API key. Paul Mason: [low voice] Which brings us right back to where we started. Local AI. Tim Williams: [emphasis] Exactly. [exhale] The sub-agent architecture proves that you don't need a frontier model for every task in the pipeline. You need a smart orchestrator, and then you need capable, cheap, local workers. [pause] And the fact that I can run Qwen 3.6 locally and get production-quality code out of it — while Opus handles the architecture — that's not a demo. That's a real workflow. Today. Paul Mason: Dario's [emphasis] right to be scared, just not for the reasons he's saying out loud. Tim Williams: [laughing] The moral of the story is: sub-agent architecture works, local models are ready for it, and the incumbents have every incentive to pretend otherwise. [pause] But we can see the writing on the wall. [inhale] And honestly? The local AI future I was talking about at the top of the episode — this is the proof point. It's not theoretical. It works. [pause] Here's looking at you, Dario. Paul Mason: [laughing] Alright. [exhale] That's a lot. I need a minute to process all that. [pause] You got anything else, or are we wrapping this episode? Tim Williams: Actually, [short pause] yeah — one more thing. And it connects directly to everything we just said. Tim Williams: So while I've been experimenting with this sub-agent pattern on my own, [pause] the industry has apparently been watching. Because in the last two weeks, two major coding tools shipped sub-agent features. Paul Mason: [surprised] Wait — really? Which ones? Tim Williams: Cursor and Claude Code. [pause] Cursor shipped 'Cloud Subagents' in late June — you type /in-cloud and it spins up isolated VMs that work on tasks in parallel. There's even a /babysit command where a cloud agent manages your entire PR. [pause] And Anthropic released 'Agent Teams' in Claude Code — every session has an implicit team, you spawn teammates directly. Paul Mason: [tsk] So they built exactly what you just spent ten minutes describing. Except— Tim Williams: [emphasis] Except they put it in the cloud. [pause] Cursor's sub-agents run in their VMs. Claude Code's agent teams burn Claude tokens — probably Opus tokens — for every sub-agent in the team. [inhale] They adopted the pattern, but they missed the point. Paul Mason: The point being that you want the sub-agents to be [emphasis] cheap and local. Tim Williams: Or at least — [short pause] under my control. [pause] Look, I'm not saying everything has to run on-device. If I want to push a sub-agent task through OpenRouter to a Qwen model that costs a fraction of what Opus costs — I want that option. [emphasis] I want control over the cost-quality tradeoff for sub-agent work. Paul Mason: Yeah. And the Reddit reactions to that Cursor launch were, uh — [chuckle] mixed, let's say. Tim Williams: [laughing] Oh, I saw those. What was the line — 'the last thing you want is more time managing tabs than actually building'? Paul Mason: Yeah, that one. And there was another guy complaining that Cursor was spinning up multiple sub-agents without even asking. [tsk] People were not thrilled about losing control. [pause] Which, again, proves your point. Tim Williams: Here's the irony. [inhale] OpenCode has had sub-agent support for a while now. You can configure it to use different models for different tasks. [pause] And I haven't even set it up. Paul Mason: [surprised] Really? You, the local AI guy? Tim Williams: [laughing] I know, I know. But it's a headache. You have to configure each sub-agent, set up the model routing, manage the context handoffs — [sigh] it's the kind of thing that should be one toggle. 'Use cheaper models for sub-tasks.' Done. Paul Mason: So the feature request is basically: give us the sub-agent architecture, [emphasis] but let us choose what model does the work. Tim Williams: [emphasis] Exactly. Let me point my sub-agents at a local model. Or OpenRouter. Or whatever I want. [pause] Don't lock me into your cloud VMs and your token pricing. [inhale] Because the tools are catching up to the pattern — which is great — but they're building a walled garden around it. Paul Mason: And that's the whole episode, right there. [chuckle] Local AI, sub-agent architecture, the incumbents building moats — it all connects. Tim Williams: It really does. [exhale] Alright, I think that's a good place to wrap. [pause] Episode 21 — local AI optimism, Sonnet 5's bad reviews, Mythos is back, sub-agent architecture, and the tools that are almost getting it right. Paul Mason: Guys, just go out and build with local models. You'll thank me later. [chuckle] Tim Williams: [laughing] Here's looking at you. [pause] We'll see you next time on Rubber Duck Radio.

Related Projects

DevOps Infrastructure Solution

Engineered comprehensive DevOps solution optimizing workflows and reducing bugs.

Senior Web DeveloperView project ->

AI Charts

AI-powered flowchart, ERD, and swimlane diagram builder with a built-in AI assistant and an MCP server exposing 18+ tools for external AI integration. Works with any OpenAI-compatible LLM — no vendor lock-in.

Solo DeveloperView project ->

AI Sound

AI-native audio editor built as a modern replacement for Audacity, with LLM integration at its core. Features multi-track editing, AI transcription, speaker diarization, semantic search, and a full MCP server for external AI assistant integration.

Solo DeveloperView project ->

GTZenda

Enterprise document intelligence pipeline that ingests procurement data from AI agents, classifies and normalizes documents using LLM processing, and pushes structured data into a government sales intelligence platform. Built on AWS with SQS-driven async processing and OpenAI integration.

Lead DeveloperView project ->