We had a hypothesis: an LLM router for coding agents could fit entirely on Cloudflare. No VPS, no Postgres, no Redis. One wrangler.toml, the whole stack. Five days later, here we are. This is the architecture, the decisions, and the tradeoffs.
The shape of the problem
We needed:
- Two client API surfaces: Anthropic /v1/messages (so Claude Code Just Works) and OpenAI /v1/chat/completions (so every other SDK Just Works).
- Multi-tenant authentication. Each customer holds their own API key; we hold one shared upstream key per provider. Customer billing flows through us.
- Per-customer prepaid wallet — top up via Stripe, debit per request, block at zero balance.
- Routing across providers: same model id can be served by Together, Fireworks, DeepInfra. We pick the right host per request.
- Real-time usage in a dashboard. No data warehouse.
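The multi-provider routing requirement above boils down to a catalog lookup. A minimal sketch, assuming a preference-ordered catalog and a health filter (the model ids, provider slugs, and selection policy here are illustrative, not the shipped routing logic):

```typescript
// Hypothetical provider catalog: which hosts can serve a given model id,
// in preference order. The real catalog lives in a D1 table.
type Provider = "together" | "fireworks" | "deepinfra";

const CATALOG: Record<string, Provider[]> = {
  "deepseek-r1": ["fireworks", "deepinfra"],
  "llama-3.3-70b": ["together", "deepinfra"],
};

// Pick the host for a request, skipping any providers currently marked unhealthy.
export function pickProvider(
  model: string,
  unhealthy: ReadonlySet<string> = new Set(),
): Provider {
  const candidates = CATALOG[model]?.filter((p) => !unhealthy.has(p)) ?? [];
  if (candidates.length === 0) throw new Error(`no provider for model ${model}`);
  return candidates[0];
}
```

Adding a new provider is then just another row in the catalog, which is what makes the "one gateway URL, varying slug" design pay off.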
The stack — one CF account, ~6 bindings
```
Workers          # /v1/messages, /v1/chat/completions, /dashboard/api/*
Durable Objects  # one wallet per tenant (atomic balance + ledger)
KV               # bearer token → Clerk user_id cache (60s TTL)
D1               # users, usage_events, wallet_ledger, processed_webhooks
R2               # raw event archive (Logpush)
AI Gateway       # routing/cache/retry/analytics for every upstream call
```

That's the entire production surface. No long-running process to babysit. Auto-scales from 1 RPS to 10k. Billed on usage.
Cloudflare AI Gateway is the load-bearing piece
We considered building our own retry/cache/observability around the upstream providers. Then we tried using AIG's universal endpoint instead and deleted ~600 lines of plumbing.
What AIG gives us for free:
- Single fetch path — every upstream call goes to `gateway.ai.cloudflare.com/v1/{acct}/{gw}/{provider}/v1/chat/completions`. The provider slug is the only thing that varies. Adding a new provider = adding a catalog row.
- Exact-match cache — identical prompts return for $0, sub-50ms. This alone pays for the whole stack on tree-search workloads.
- Per-customer attribution — we attach `cf-aig-metadata: {user_id, key_id, model, surface}` on every call. AIG's analytics segment by those fields out of the box. We didn't have to build a customer dashboard for upstream cost — Cloudflare built it.
- Custom-cost reporting — the `cf-aig-custom-cost` header lets us tell AIG "record this request as costing $X to the customer" (our marked-up price), so finance views match invoices.
- Sequential retries on 5xx from upstream, configurable backoff.
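Putting those pieces together, assembling one AIG call looks roughly like this. A sketch under assumptions: `buildAigRequest` and its parameter names are ours, and the exact JSON shape accepted by the cost header should be checked against Cloudflare's AIG docs:

```typescript
// Metadata attached to every upstream call for per-customer attribution.
interface AigCallMeta {
  user_id: string;
  key_id: string;
  model: string;
  surface: "anthropic" | "openai";
}

// Build the single fetch path: only the provider slug varies per request.
export function buildAigRequest(
  accountTag: string,
  gatewayName: string,
  provider: string,
  body: unknown,
  upstreamKey: string,
  meta: AigCallMeta,
  customCostUsd: number, // our marked-up customer price, not upstream cost
): Request {
  const url = `https://gateway.ai.cloudflare.com/v1/${accountTag}/${gatewayName}/${provider}/v1/chat/completions`;
  return new Request(url, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${upstreamKey}`,
      // AIG analytics segment on these fields out of the box.
      "cf-aig-metadata": JSON.stringify(meta),
      // Report the customer-facing price so finance views match invoices.
      "cf-aig-custom-cost": JSON.stringify({ total_cost: customCostUsd }),
    },
    body: JSON.stringify(body),
  });
}
```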
Wallet = one Durable Object per tenant
Every customer has their own DO instance, addressed by Clerk user id. The DO owns a SQLite-backed balance + ledger. Three methods, atomic by construction:
```
canSpend()        → { allowed: bool, balance_micros }
debit(amt, ref)   → idempotent on ref (PK constraint)
credit(amt, ref)  → idempotent on ref
```

The idempotency-on-ref design is the entire reason we're comfortable letting Stripe redeliver webhooks ten times. The DO ledger's primary key is the request id. Retried debit? Already there, no-op.
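The ledger semantics can be sketched in-memory, with a `Map` standing in for the DO's SQLite table (the class name and micro-dollar units are ours; the real DO enforces idempotency with a primary-key constraint rather than a `Map` check):

```typescript
// In-memory sketch of the wallet DO's three methods.
// ref is the ledger primary key: replaying a debit/credit is a no-op.
export class WalletSketch {
  private ledger = new Map<string, number>(); // ref → signed amount (micro-dollars)
  private balanceMicros = 0;

  canSpend(): { allowed: boolean; balance_micros: number } {
    return { allowed: this.balanceMicros > 0, balance_micros: this.balanceMicros };
  }

  debit(amountMicros: number, ref: string): void {
    this.apply(-amountMicros, ref);
  }

  credit(amountMicros: number, ref: string): void {
    this.apply(amountMicros, ref);
  }

  private apply(signedMicros: number, ref: string): void {
    if (this.ledger.has(ref)) return; // PK hit in the real DO: already applied
    this.ledger.set(ref, signedMicros);
    this.balanceMicros += signedMicros;
  }
}
```

Because every mutation is keyed by `ref`, a Stripe webhook delivered ten times credits the wallet exactly once.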
The simplification we picked over precision
Original plan: a reserve/settle protocol — pre-debit the worst-case cost before the upstream call, refund the difference after. This guarantees a customer never overdrafts.
We threw it out. Instead:
- Pre-flight: `balance > 0` → allow the request. No estimate, no reservation.
- After the upstream stream closes, debit the actual cost from the upstream's reported `usage` field. Run via `ctx.waitUntil()` so it never blocks the customer's response.
- A single coding turn at near-zero balance can push you a few cents negative. Documented in the T's & C's. Next request is blocked until you top up.
This deletes: token counting in the request hot path, mid-stream cap logic, SSE injection of stop signals, the reserve/settle protocol on the DO, and a whole class of race conditions around concurrent streams. Worst-case failure is bounded (cents), recoverable (top up), and visible to the customer (next request 402's with a clear message).
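The settlement path that survives is small. A sketch, assuming per-token micro-dollar pricing and OpenAI-style `usage` field names (the pricing numbers and function names are illustrative):

```typescript
// Upstream-reported usage, as it appears after the stream closes.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Hypothetical marked-up price in micro-dollars per token.
interface PricingMicros {
  inPerTok: number;
  outPerTok: number;
}

// Price the request from actual usage: no estimation in the hot path.
export function costMicros(usage: Usage, price: PricingMicros): number {
  return (
    usage.prompt_tokens * price.inPerTok +
    usage.completion_tokens * price.outPerTok
  );
}

// After the SSE stream closes, settle off the hot path, e.g.:
//   ctx.waitUntil(wallet.debit(costMicros(usage, price), requestId));
// The debit is idempotent on requestId, so a retry can't double-charge.
```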
The hard part: SSE translation
Anthropic's streaming format and OpenAI's are both SSE, but structurally different. Anthropic emits typed events (message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop) with strict block lifecycles. OpenAI emits one chunk type with cumulative deltas.
We wrote a state machine that consumes OpenAI chunks and emits the Anthropic event sequence faithfully. The hardest cases:
- Tool calls mid-stream: OpenAI sends cumulative `tool_calls[]` deltas keyed by index. We have to open an Anthropic `tool_use` content block on first sighting (when we have the id + name) and emit `input_json_delta` for every subsequent argument fragment.
- Block transitions: customer sends a request with text + tools, model returns text + tool_use. We need to close the text block before opening the tool_use block, in order, even though OpenAI's deltas can interleave.
- Reasoning content: DeepSeek-R1 puts its chain-of-thought in `delta.reasoning_content`. We translate it into a proper Anthropic `thinking` content block, distinct from the final answer's text block.
The whole translator is ~250 lines, lives in packages/translate/src/stream.ts, and has fixture-based tests recorded from real upstream responses per provider. It's the only piece of code we'd rebuild from scratch given the chance — it's the actual product.
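To make the state machine concrete, here is a heavily simplified sketch covering only the text path. It is not the shipped translator: real chunks carry more fields, and the ~250-line version also handles `tool_use` and `thinking` blocks and multi-block indexing. Event shapes follow Anthropic's documented SSE types; the class and chunk interface are ours:

```typescript
type AnthropicEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number; block: { type: "text" } }
  | { type: "content_block_delta"; index: number; delta: { type: "text_delta"; text: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string } }
  | { type: "message_stop" };

// Minimal view of one OpenAI streaming chunk (choices[0] flattened).
interface OpenAIChunk {
  delta: { content?: string };
  finish_reason?: string | null;
}

export class StreamTranslator {
  private started = false;
  private blockOpen = false;

  // Consume one OpenAI chunk, emit zero or more Anthropic events in order.
  push(chunk: OpenAIChunk): AnthropicEvent[] {
    const out: AnthropicEvent[] = [];
    if (!this.started) {
      out.push({ type: "message_start" });
      this.started = true;
    }
    if (chunk.delta.content !== undefined) {
      if (!this.blockOpen) {
        // Anthropic requires an explicit block lifecycle: open before deltas.
        out.push({ type: "content_block_start", index: 0, block: { type: "text" } });
        this.blockOpen = true;
      }
      out.push({
        type: "content_block_delta",
        index: 0,
        delta: { type: "text_delta", text: chunk.delta.content },
      });
    }
    if (chunk.finish_reason) {
      // Close any open block before the message-level terminators.
      if (this.blockOpen) {
        out.push({ type: "content_block_stop", index: 0 });
        this.blockOpen = false;
      }
      out.push({ type: "message_delta", delta: { stop_reason: "end_turn" } });
      out.push({ type: "message_stop" });
    }
    return out;
  }
}
```

The hard cases in the bullets above are all elaborations of the same pattern: more block types, more indices, and strict ordering of `content_block_stop` before the next `content_block_start`.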
What we'd do differently
- Start with AIG. We initially tried direct upstream fetches and only added AIG later. Reverse the order — AIG first, direct as fallback for providers AIG doesn't cover (none of ours, as it turned out).
- Drop BYOK in v1. We almost shipped per-tenant upstream key custody. The encryption envelope, key rotation cron, dashboard UI to manage per-provider keys — it would have doubled the surface area. Killed it. Phase 2 if customers ask.
- Use Clerk's API Keys feature, not roll our own. We spiked a custom `qlk_*` issuance flow with bcrypt hashing in D1 before realizing Clerk had shipped exactly this in their SDK. Rip-and-replace took an afternoon. Save yourself the day.
Try it
Sign up, mint a key, point Claude Code at us. The whole stack you just read about runs every request you make. The source is on GitHub.