We had a hypothesis: an LLM router for coding agents could fit entirely on Cloudflare. No VPS, no Postgres, no Redis. One wrangler.toml, the whole stack. Five days later, here we are. This is the architecture, the decisions, and the tradeoffs.
The shape of the problem
We needed:
- Two client API surfaces: Anthropic /v1/messages (so Claude Code Just Works) and OpenAI /v1/chat/completions (so every other SDK Just Works).
- Multi-tenant authentication. Each customer holds their own API key; we hold one shared upstream key per provider. Customer billing flows through us.
- Per-customer prepaid wallet — top up via Stripe, debit per request, block at zero balance.
- Routing across providers: same model id can be served by Together, Fireworks, DeepInfra. We pick the right host per request.
- Real-time usage in a dashboard. No data warehouse.
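The multi-provider routing requirement above boils down to a catalog lookup. A minimal sketch, assuming a preference-ordered catalog and a health filter (the model ids, provider slugs, and selection policy here are illustrative, not the shipped routing logic):

```typescript
// Hypothetical provider catalog: which hosts can serve a given model id,
// in preference order. The real catalog lives in a D1 table.
type Provider = "together" | "fireworks" | "deepinfra";

const CATALOG: Record<string, Provider[]> = {
  "deepseek-r1": ["fireworks", "deepinfra"],
  "llama-3.3-70b": ["together", "deepinfra"],
};

// Pick the host for a request, skipping any providers currently marked unhealthy.
export function pickProvider(
  model: string,
  unhealthy: ReadonlySet<string> = new Set(),
): Provider {
  const candidates = CATALOG[model]?.filter((p) => !unhealthy.has(p)) ?? [];
  if (candidates.length === 0) throw new Error(`no provider for model ${model}`);
  return candidates[0];
}
```

Adding a new provider is then just another row in the catalog, which is what makes the "one gateway URL, varying slug" design pay off.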
The stack — one CF account, ~6 bindings
```
Workers          # /v1/messages, /v1/chat/completions, /dashboard/api/*
Durable Objects  # one wallet per tenant (atomic balance + ledger)
KV               # bearer token → Clerk user_id cache (60s TTL)
D1               # users, usage_events, wallet_ledger, processed_webhooks
R2               # raw event archive (Logpush)
AI Gateway       # routing/cache/retry/analytics for every upstream call
```

That's the entire production surface. No long-running process to babysit. Auto-scales from 1 RPS to 10k. Billed on usage.
Cloudflare AI Gateway is the load-bearing piece
We considered building our own retry/cache/observability around the upstream providers. Then we tried using AIG's universal endpoint instead and deleted ~600 lines of plumbing.
What AIG gives us for free:
- Single fetch path — every upstream call goes to `gateway.ai.cloudflare.com/v1/{acct}/{gw}/{provider}/v1/chat/completions`. The provider slug is the only thing that varies. Adding a new provider = adding a catalog row.
- Exact-match cache — identical prompts return for $0, sub-50ms. This alone pays for the whole stack on tree-search workloads.
- Per-customer attribution — we attach `cf-aig-metadata: {user_id, key_id, model, surface}` on every call. AIG's analytics segment by those fields out of the box. We didn't have to build a customer dashboard for upstream cost — Cloudflare built it.
- Custom-cost reporting — the `cf-aig-custom-cost` header lets us tell AIG "record this request as costing $X to the customer" (our marked-up price), so finance views match invoices.
- Sequential retries on 5xx from upstream, configurable backoff.
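Putting those pieces together, assembling one AIG call looks roughly like this. A sketch under assumptions: `buildAigRequest` and its parameter names are ours, and the exact JSON shape accepted by the cost header should be checked against Cloudflare's AIG docs:

```typescript
// Metadata attached to every upstream call for per-customer attribution.
interface AigCallMeta {
  user_id: string;
  key_id: string;
  model: string;
  surface: "anthropic" | "openai";
}

// Build the single fetch path: only the provider slug varies per request.
export function buildAigRequest(
  accountTag: string,
  gatewayName: string,
  provider: string,
  body: unknown,
  upstreamKey: string,
  meta: AigCallMeta,
  customCostUsd: number, // our marked-up customer price, not upstream cost
): Request {
  const url = `https://gateway.ai.cloudflare.com/v1/${accountTag}/${gatewayName}/${provider}/v1/chat/completions`;
  return new Request(url, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${upstreamKey}`,
      // AIG analytics segment on these fields out of the box.
      "cf-aig-metadata": JSON.stringify(meta),
      // Report the customer-facing price so finance views match invoices.
      "cf-aig-custom-cost": JSON.stringify({ total_cost: customCostUsd }),
    },
    body: JSON.stringify(body),
  });
}
```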
Wallet = one Durable Object per tenant
Every customer has their own DO instance, addressed by Clerk user id. The DO owns a SQLite-backed balance + ledger. Three methods, atomic by construction:
```
canSpend()        → { allowed: bool, balance_micros }
debit(amt, ref)   → idempotent on ref (PK constraint)
credit(amt, ref)  → idempotent on ref
```

The idempotency-on-ref design is the entire reason we're comfortable letting Stripe redeliver webhooks ten times. The DO ledger's primary key is the request id. Retried debit? Already there, no-op.
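The ledger semantics can be sketched in-memory, with a `Map` standing in for the DO's SQLite table (the class name and micro-dollar units are ours; the real DO enforces idempotency with a primary-key constraint rather than a `Map` check):

```typescript
// In-memory sketch of the wallet DO's three methods.
// ref is the ledger primary key: replaying a debit/credit is a no-op.
export class WalletSketch {
  private ledger = new Map<string, number>(); // ref → signed amount (micro-dollars)
  private balanceMicros = 0;

  canSpend(): { allowed: boolean; balance_micros: number } {
    return { allowed: this.balanceMicros > 0, balance_micros: this.balanceMicros };
  }

  debit(amountMicros: number, ref: string): void {
    this.apply(-amountMicros, ref);
  }

  credit(amountMicros: number, ref: string): void {
    this.apply(amountMicros, ref);
  }

  private apply(signedMicros: number, ref: string): void {
    if (this.ledger.has(ref)) return; // PK hit in the real DO: already applied
    this.ledger.set(ref, signedMicros);
    this.balanceMicros += signedMicros;
  }
}
```

Because every mutation is keyed by `ref`, a Stripe webhook delivered ten times credits the wallet exactly once.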
The simplification we picked over precision
Original plan: a reserve/settle protocol — pre-debit the worst-case cost before the upstream call, refund the difference after. This guarantees a customer never overdrafts.
We threw it out. Instead:
- Pre-flight: `balance > 0` → allow the request. No estimate, no reservation.
- After the upstream stream closes, debit the actual cost from the upstream's reported `usage` field. Run via `ctx.waitUntil()` so it never blocks the customer's response.
- A single coding turn at near-zero balance can push you a few cents negative. Documented in the T's & C's. Next request is blocked until you top up.
This deletes: token counting in the request hot path, mid-stream cap logic, SSE injection of stop signals, the reserve/settle protocol on the DO, and a whole class of race conditions around concurrent streams. Worst-case failure is bounded (cents), recoverable (top up), and visible to the customer (next request 402's with a clear message).
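The settlement path that survives is small. A sketch, assuming per-token micro-dollar pricing and OpenAI-style `usage` field names (the pricing numbers and function names are illustrative):

```typescript
// Upstream-reported usage, as it appears after the stream closes.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Hypothetical marked-up price in micro-dollars per token.
interface PricingMicros {
  inPerTok: number;
  outPerTok: number;
}

// Price the request from actual usage: no estimation in the hot path.
export function costMicros(usage: Usage, price: PricingMicros): number {
  return (
    usage.prompt_tokens * price.inPerTok +
    usage.completion_tokens * price.outPerTok
  );
}

// After the SSE stream closes, settle off the hot path, e.g.:
//   ctx.waitUntil(wallet.debit(costMicros(usage, price), requestId));
// The debit is idempotent on requestId, so a retry can't double-charge.
```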
The hard part: SSE translation
Anthropic's streaming format and OpenAI's are both SSE, but structurally different. Anthropic emits typed events (message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop) with strict block lifecycles. OpenAI emits one chunk type with cumulative deltas.
We wrote a state machine that consumes OpenAI chunks and emits the Anthropic event sequence faithfully. The hardest cases:
- Tool calls mid-stream: OpenAI sends cumulative `tool_calls[]` deltas keyed by index. We have to open an Anthropic `tool_use` content block on first sighting (when we have the id + name) and emit `input_json_delta` for every subsequent argument fragment.
- Block transitions: customer sends a request with text + tools, model returns text + tool_use. We need to close the text block before opening the tool_use block, in order, even though OpenAI's deltas can interleave.
- Reasoning content: DeepSeek-R1 puts its chain-of-thought in `delta.reasoning_content`. We translate it into a proper Anthropic `thinking` content block, distinct from the final answer's text block.
The whole translator is ~250 lines, lives in packages/translate/src/stream.ts, and has fixture-based tests recorded from real upstream responses per provider. It's the only piece of code we'd rebuild from scratch given the chance — it's the actual product.
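To make the state machine concrete, here is a heavily simplified sketch covering only the text path. It is not the shipped translator: real chunks carry more fields, and the ~250-line version also handles `tool_use` and `thinking` blocks and multi-block indexing. Event shapes follow Anthropic's documented SSE types; the class and chunk interface are ours:

```typescript
type AnthropicEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number; block: { type: "text" } }
  | { type: "content_block_delta"; index: number; delta: { type: "text_delta"; text: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string } }
  | { type: "message_stop" };

// Minimal view of one OpenAI streaming chunk (choices[0] flattened).
interface OpenAIChunk {
  delta: { content?: string };
  finish_reason?: string | null;
}

export class StreamTranslator {
  private started = false;
  private blockOpen = false;

  // Consume one OpenAI chunk, emit zero or more Anthropic events in order.
  push(chunk: OpenAIChunk): AnthropicEvent[] {
    const out: AnthropicEvent[] = [];
    if (!this.started) {
      out.push({ type: "message_start" });
      this.started = true;
    }
    if (chunk.delta.content !== undefined) {
      if (!this.blockOpen) {
        // Anthropic requires an explicit block lifecycle: open before deltas.
        out.push({ type: "content_block_start", index: 0, block: { type: "text" } });
        this.blockOpen = true;
      }
      out.push({
        type: "content_block_delta",
        index: 0,
        delta: { type: "text_delta", text: chunk.delta.content },
      });
    }
    if (chunk.finish_reason) {
      // Close any open block before the message-level terminators.
      if (this.blockOpen) {
        out.push({ type: "content_block_stop", index: 0 });
        this.blockOpen = false;
      }
      out.push({ type: "message_delta", delta: { stop_reason: "end_turn" } });
      out.push({ type: "message_stop" });
    }
    return out;
  }
}
```

The hard cases in the bullets above are all elaborations of the same pattern: more block types, more indices, and strict ordering of `content_block_stop` before the next `content_block_start`.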
What we'd do differently
- Start with AIG. We initially tried direct upstream fetches and only added AIG later. Reverse the order — AIG first, direct as fallback for providers AIG doesn't cover (none of ours, as it turned out).
- Drop BYOK in v1. We almost shipped per-tenant upstream key custody. The encryption envelope, key rotation cron, dashboard UI to manage per-provider keys — it would have doubled the surface area. Killed it. Phase 2 if customers ask.
- Use Clerk's API Keys feature, not roll our own. We spiked a custom `qlk_*` issuance flow with bcrypt hashing in D1 before realizing Clerk had shipped exactly this in their SDK. Rip-and-replace took an afternoon. Save yourself the day.
Try it
Sign up, mint a key, point Claude Code at us. The whole stack you just read about runs every request you make. The source is on GitHub.