Local AI Clusters Using Exo Platform

April 10, 2026 · 4 min read

Creator of DevoxxGenie

Cloud APIs are convenient, but they come with per-token costs, rate limits, and the nagging question of where your code actually goes. What if you could run the same frontier-class models (GLM, MiniMax, Kimi, DeepSeek) at full Q6 or even Q8 quality, entirely on hardware you already own?

With DevoxxGenie's refreshed Exo integration, you can. Pool the memory of multiple Apple Silicon Macs into a single inference cluster and run models that would never fit on one machine. No subscription. No token budget. Just electricity.

Exo dashboard showing a multi-node cluster

Why Run Local?

Cloud LLM providers charge per token, and those costs add up fast during agentic coding sessions where the model reads, writes, and re-reads files across dozens of tool calls. A single Agent Mode session can easily burn through thousands of input and output tokens.

Running locally flips the economics:

Zero marginal cost: Once the hardware is on your desk, every token is free.
Full data privacy: Your source code never leaves your network.
No rate limits: No throttling, no quota exhaustion mid-session.
Higher quantisation: Cloud providers typically serve heavily quantised models. With enough cluster RAM you can run Q6 or Q8 variants for noticeably better reasoning and code quality.

Big Models, Small Cluster

A single M4 Max MacBook with 128 GB of unified memory can already run a lot. But the truly large models, the ones that rival cloud APIs in quality, need more. That's where Exo comes in.

Exo automatically discovers nearby Apple Silicon devices and distributes model layers across them using pipeline or tensor parallelism. Connect two Macs with a Thunderbolt cable and you've doubled your available memory.

Example setups:

Cluster	Combined RAM	What you can run
2x Mac Mini M4 Pro (48 GB)	96 GB	DeepSeek-R1 70B Q6, GLM-4 9B Q8
MacBook Pro M4 Max + Mac Studio M1 Ultra (128 GB each)	256 GB	MiniMax M2.5 6-bit (173 GB), Kimi-VL
2x Mac Studio M2 Ultra (192 GB each)	384 GB	Qwen3 Coder 480B Q4, DeepSeek-R1 671B Q3

All using the MLX backend, optimised for Apple Silicon.

Exo instance ready with MiniMax M2.5 loaded across two nodes A MiniMax M2.5 instance fully loaded and ready for chat across a two-node cluster.

Getting Started

1. Install Exo on every device

Download the latest DMG and launch it. Exo runs in the background and exposes a web dashboard at http://localhost:52415.

2. Connect your Macs

Plug in a Thunderbolt cable for the fastest interconnect. Exo also works over regular networking, but Thunderbolt gives you significantly lower latency, especially important for tensor parallelism.

3. Download a model

Open the Exo dashboard and pick a model. Start with something modest (Llama 3.3 70B Q4 at ~39 GB) to verify your cluster, then go bigger.

4. Point DevoxxGenie at Exo

In Settings > DevoxxGenie > Large Language Models, enable Exo and set the URL to http://localhost:52415/v1/. Select Exo as your provider in the chat panel, choose your model, and start coding.

DevoxxGenie handles the rest: it previews placements, creates an optimal instance across your cluster, waits for warmup, and notifies you when everything is ready.

DevoxxGenie warming up Exo model runners DevoxxGenie shows a progress bar while the model instance loads across your cluster.

DevoxxGenie Exo cluster panel The Exo cluster panel in DevoxxGenie shows connected nodes, memory usage, GPU stats, and the active model instance.

Works with Agent Mode

The best part: Exo models work seamlessly with DevoxxGenie's Agent Mode. The LLM gets full access to built-in tools like file reading, writing, editing, searching, and command execution, all running through your local cluster.

Pair a powerful local model with Agent Mode and you have an autonomous coding assistant that:

Costs nothing per token
Keeps your code on your hardware
Runs as many tool calls as you need without worrying about API bills

The Bottom Line

You don't need a cloud subscription to use frontier-class LLMs for coding. A couple of Apple Silicon Macs, a Thunderbolt cable, and Exo give you a private AI cluster that rivals cloud APIs, at the cost of electricity.

Check out the full Exo documentation for setup details, supported models, sharding strategies, and troubleshooting tips.

Happy (local) coding!

Why Run Local?​

Big Models, Small Cluster​

Getting Started​

1. Install Exo on every device​

2. Connect your Macs​

3. Download a model​

4. Point DevoxxGenie at Exo​

Works with Agent Mode​

The Bottom Line​