
Running Claude Code on a Local Model Just Got 2x Faster (Thanks to Ollama + MLX)

Ollama 0.19 doubles local inference speed on Apple Silicon via MLX. I tested it with Claude Code and Qwen 3.5. Here's what actually changed for coding agents.

Kemal Esensoy · Modified on April 27, 2026

I spend a lot of time inside Claude Code. It's the tool that runs most of my coding workflow these days. And every month, the API bill reminds me exactly how much time that is.

So when Ollama dropped version 0.19 with Apple's MLX framework baked in, I didn't just read the blog post. I pulled it onto my Mac Studio, pointed Claude Code at it, and ran the same tasks I'd been paying Anthropic for. Here's what actually happened.

Why I Even Care About Running Coding Agents Locally

Let me put it this way: if you run Claude Code for a few hours a day, five days a week, the API costs add up. Fast. We're talking hundreds of dollars a month for a one-person agency. That's real money.

[Image: API costs piling up versus running coding agents locally for free]

But it's not just about the bill. I travel. I work from trains and planes. I've had projects where client data couldn't leave the machine for compliance reasons. In all of those cases, a local model that could actually keep up with my workflow would be a game changer.

The problem was always the same: local models felt like talking to a brick wall. You'd type a prompt, wait, watch the cursor blink, and eventually get something back that was... fine. Usable for simple stuff. Nowhere near fast enough for the back-and-forth rhythm that makes coding agents useful.

That changed last month.

What Ollama 0.19 Actually Changed

This isn't a minor version bump. Ollama 0.19 swapped out the entire inference backend on Apple Silicon. Instead of llama.cpp (which has served the community well for years), Ollama now runs on Apple's MLX framework.

Why does that matter? MLX was built specifically for Apple's unified memory architecture. On a Mac, your CPU and GPU share the same memory pool. llama.cpp is a general-purpose backend with roots in the CUDA world, where CPU and GPU memory are separate and data gets shuttled between them; MLX is designed around unified memory from the ground up and skips all of that.

On top of the backend switch, Ollama 0.19 introduced NVFP4 quantization. This is NVIDIA's format for squeezing models into less memory while keeping the output quality close to full precision. The practical result: you can run larger models on the same hardware, and the outputs match what cloud inference providers are serving.

The caching got smarter too. Ollama now reuses its cache across conversations, stores snapshots at intelligent points in the prompt, and keeps shared prefixes alive longer. For coding agents that hammer the same system prompt over and over, this is a big deal.

The Benchmarks That Actually Matter

Ollama's official numbers on an M5 Max are impressive: 1,810 tokens per second for prefill (up from 1,154) and 112 tokens per second for decode (up from 58). That's a 57% and 93% improvement respectively, running Qwen3.5-35B-A3B in NVFP4.

[Image: Ollama MLX benchmark comparison showing dramatically faster performance on Apple Silicon]

But I don't have an M5 Max. I have an M1 Max Mac Studio with 64GB of unified memory. And honestly, the difference was even more dramatic on my machine: decode speed went from 3.19 tokens per second with the old llama.cpp backend to 23.39 tokens per second with MLX. That's roughly 7x faster. Same machine, same model, same prompt.

At 23 tokens per second, you're getting responses fast enough to maintain a real conversation. It's not Sonnet-level quality, but the speed gap between local and cloud just closed dramatically. Benchmarks only tell part of the story, of course. The real test is whether you can actually use it for work.

Setting It Up: From Zero to Local Coding Agent

The setup is surprisingly simple. Download Ollama 0.19, then run:

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

[Image: Setting up Ollama with Claude Code on a Mac for local AI coding]

That spins up an Ollama instance with Anthropic API compatibility built in. Since Ollama v0.14, it has spoken the same API format as Anthropic's cloud, so Claude Code doesn't need a proxy or any special configuration. You point ANTHROPIC_BASE_URL at your local Ollama instance and you're done.
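To make that concrete, here's a minimal sketch of the environment setup. It assumes Ollama is listening on its default port (11434) and that your Claude Code version reads the standard Anthropic environment variables:

# Route Claude Code to the local Ollama instance instead of Anthropic's cloud.
export ANTHROPIC_BASE_URL=http://localhost:11434
# If your Claude Code version insists on a key, a placeholder value is typically enough for a local backend.
export ANTHROPIC_API_KEY=local
# Start Claude Code as usual; requests now hit the local model.
claude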

This also works with LM Studio (v0.4.1+), llama.cpp directly, or vLLM if you prefer those. The key requirement is the same across all of them: the model needs to support a 64k+ token context window and tool calling. Without tool calling, coding agents can't do their thing.
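Before pointing Claude Code at any of these, it's worth a quick sanity check that the local server actually accepts tool definitions. Here's a rough example; it assumes the Anthropic-compatible endpoint is exposed at /v1/messages, and the read_file tool is purely illustrative:

# Sanity check: does the local endpoint accept a tool definition?
curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -d '{
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "max_tokens": 256,
    "tools": [{
      "name": "read_file",
      "description": "Read a file from disk",
      "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]
      }
    }],
    "messages": [{"role": "user", "content": "Open package.json"}]
  }'

If the response comes back with a tool_use block instead of plain text, tool calling works and the agent loop will behave.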

One important note: you need a Mac with at least 32GB of unified memory. The Qwen3.5-35B-A3B model is a Mixture of Experts architecture, which helps (only 3B parameters are active per token), but the full model still needs to fit in memory. If you're curious about how Claude Code communicates with tools under the hood, MCP is the protocol that makes it work.
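The back-of-the-envelope math, assuming NVFP4 stores weights at roughly 4 bits each: 35 billion parameters × 0.5 bytes ≈ 17.5GB for the weights alone, before you add the KV cache for a 64k context window, the OS, and everything else you have open. That's why 32GB is the realistic floor and 64GB is comfortable.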

Qwen 3.5 vs 3.6: Which Model to Actually Run

Ollama 0.19 shipped optimized for Qwen3.5-35B-A3B. It's a Mixture of Experts model: 35 billion parameters total, but only 3 billion active per token. That's why it flies on consumer hardware. You get the knowledge of a 35B model with the speed of a 3B model.

But here's the thing: Qwen3.6-35B-A3B already dropped on April 16th. It scores 73.4% on SWE-bench Verified and 51.5 on Terminal-Bench 2.0. Those are serious numbers for a model you can run on a MacBook.

For straightforward code generation, test writing, and refactoring within a single file, both models are surprisingly good. Where they fall short is complex multi-file reasoning, navigating unfamiliar codebases, and catching subtle bugs that require deep context. For that kind of work, a spec-driven approach with a frontier model still makes more sense.

What This Actually Changes for Solo Devs

Here's what $0 marginal cost does to your behavior: you stop rationing.

[Image: Solo developer using local AI coding agent offline on a plane]

When every API call costs money, you develop this habit of mentally weighing whether a prompt is "worth it." Should I ask it to refactor this function, or just do it myself? Is this question complex enough to justify the tokens? With a local model, that friction disappears. You just ask. You experiment. You try things you wouldn't have tried when each attempt cost a few cents.

Offline coding is real now too. I wrote parts of a client project on a flight last week using Ollama locally. No wifi, no API, just my Mac and a local model. It handled boilerplate generation and test scaffolding perfectly.

It's not just coding agents either. I built WunderType, a macOS menu bar app that corrects and rewrites text in any app using Ollama locally. The MLX update made it noticeably snappier. What used to feel like a pause now feels instant. If you're curious about why I built it, here's the story.

But let's be honest: local models break things differently than cloud models. They hallucinate different things, miss different edge cases, and fail in ways you don't expect if you've been spoiled by Sonnet or Opus. The errors are cheaper, but they're still errors.

My Honest Take After a Week

After running this setup daily for a week, here's where I landed.

Good for: quick edits, boilerplate generation, test writing, code explanation, simple refactors, inline documentation, and anything where speed matters more than depth.

Not ready for: complex multi-step reasoning across files, debugging subtle issues in unfamiliar code, or anything that needs the kind of deep contextual understanding that frontier models excel at.

The real unlock isn't replacing cloud models entirely. It's the 80/20 split: local for the 80% of tasks that don't need frontier intelligence, API for the 20% that do. That alone can cut your monthly API bill significantly.
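One small trick to make that split frictionless (a sketch, assuming the default Ollama port and a POSIX shell) is an alias so switching backends is a one-word decision:

# Local model for the everyday 80%; plain `claude` keeps hitting the API for the hard 20%.
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 claude'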

This is the first time running a local coding agent has felt viable for daily use, not just for demos or weekend experiments. When I let Claude Code cook on a weekend project, it was using a frontier model. Could a local model do the same? Not yet. But at 7x the speed on my M1 Max, it's getting closer every month.

If you're running an agency, freelancing, or just building things on your own and want to cut costs without cutting capability, this is worth trying. Let's talk if you want help setting up an AI-assisted development workflow that actually makes sense for your budget.

About the Author


Kemal Esensoy

Kemal Esensoy, founder of Wunderlandmedia, started his journey as a freelance web developer and designer. He conducted web design courses with over 3,000 students. Today, he leads an award-winning full-stack agency specializing in web development, SEO, and digital marketing.
