You shipped an AI app. Day-one users love it. Day-seven you notice your provider bill is climbing fast and you have no idea which user is responsible. Day-fourteen one user's leaked browser key burns through $400 in three hours. This post walks through the four-step pattern that prevents that — without ripping out the SDK you're already using.
The pattern: mint, cap, meter, bill
Same shape as Stripe Connect, AWS sub-accounts, or any other tenant-scoped infra primitive — but for AI inference. You hold one master key. On signup you mint a per-user child key with a hard spend cap. Your users' requests use their own key, and the gateway enforces the cap before forwarding to the provider. At month-end you pull a per-user spend rollup and push it into whatever billing system you already run.
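The cap half of the pattern reduces to a pre-flight check at the gateway. Here is a toy model of that check (illustrative only: the real accounting lives inside qlaud, and these field names are ours):

```js
// Toy model of the gateway's pre-flight cap check. Field names are
// illustrative; the real enforcement happens inside the gateway.
function capAllows(key, estimatedCostUsd) {
  return key.spentUsd + estimatedCostUsd <= key.maxSpendUsd;
}

const userKey = { spentUsd: 4.9, maxSpendUsd: 5.0 };
capAllows(userKey, 0.05); // true  -> forwarded to the provider
capAllows(userKey, 0.5);  // false -> rejected before it reaches the provider
```

The point is that the check runs before the upstream call, so a capped user costs you nothing, not "a little more until the meter catches up".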
Step 1: mint your master key
Sign up at qlaud.ai, top up the wallet, and in /keys create a master-scope key. Store it as a server-side secret — this key can mint, list, and revoke other keys, so treat it like an admin token.
```bash
# .env (server-side only — never ship to the browser)
QLAUD_MASTER_KEY=qlk_live_<your_master_key>
```

Step 2: mint a per-user key on signup, with a cap
Wherever your user-signup hook lives (Clerk user.created, a Supabase auth trigger, your own POST /signup), add a few lines that mint a qlaud key for the new user with a hard spend cap.
```js
// On user signup in your app
const res = await fetch('https://api.qlaud.ai/v1/keys', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.QLAUD_MASTER_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: `user_${userId}`,
    scope: 'standard', // inference-only, can't mint sub-keys
    max_spend_usd: 5,  // hard cap — gateway enforces, returns 402 over
  }),
});

const { id: keyId, secret } = await res.json();
// secret = qlk_live_... — returned ONCE, store with the user
await db.users.update({ id: userId }, { qlaudKeyId: keyId, qlaudKey: secret });
```

The secret is shown once — store it with the user record. The id is what you use to revoke or look up usage later. The cap is enforced at the gateway, not in your app code: a request that would push the user over their cap returns 402 Payment Required before it ever reaches the upstream provider.
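Because the cap surfaces as a 402 at request time, your request path needs one branch that turns it into a product decision instead of a generic failure. A minimal sketch (the helper and callback names are ours; only the 402 status comes from the API above):

```js
// Turn the gateway's 402 into an explicit "cap hit" signal. `res` is any
// fetch-style response object; the helper name is ours, not qlaud's.
function handleGatewayResponse(res, onCapHit) {
  if (res.status === 402) {
    onCapHit(); // e.g. show an upgrade prompt, pause the feature
    return null;
  }
  return res;
}

let capped = false;
handleGatewayResponse({ status: 402 }, () => { capped = true; });
// returns null; capped is now true
```

What you do in the callback is product-specific: upsell, queue the request, or raise the user's cap with another call to the keys API.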
Step 3: your user's app calls qlaud with their key
This is the part where you change nothing about your existing SDK code. You're already using the OpenAI SDK, the Anthropic SDK, the ElevenLabs SDK, Vercel AI SDK, LangChain — point the base URL at qlaud and pass the user's minted key. Same request shape, same response shape, same streaming semantics.
```js
// Anthropic SDK — Claude with cache_control, tool_use, thinking blocks all preserved
import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic({
  baseURL: 'https://api.qlaud.ai',
  apiKey: user.qlaudKey, // qlk_live_... minted in step 2
});

await claude.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: prompt }],
});
```

```js
// OpenAI SDK — same key, same gateway
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.qlaud.ai/v1',
  apiKey: user.qlaudKey,
});

await openai.chat.completions.create({
  model: 'gpt-5.4',
  messages: [{ role: 'user', content: prompt }],
});
```

Anthropic, OpenAI, ElevenLabs, Deepgram, Google AI Studio, xAI Grok, Alibaba Qwen, Moonshot Kimi, MiniMax, DeepSeek, Mistral, Cerebras — all accessible through the same key and the same base URL. Whatever auth-header convention your SDK uses (x-api-key, xi-api-key, x-goog-api-key, Authorization: Bearer, Authorization: Token), qlaud accepts it.
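If you route several SDKs, it can help to build the per-user options in one place. A sketch mirroring the two snippets above (the helper name and the provider switch are ours; only the base URLs and key field come from those snippets):

```js
// One place to build per-user client options: OpenAI-style SDKs get the
// /v1 path, Anthropic's SDK gets the bare host, matching the snippets above.
// Helper name is ours, not part of qlaud's API.
function qlaudOptionsFor(user, sdk) {
  const baseURL = sdk === 'openai'
    ? 'https://api.qlaud.ai/v1'
    : 'https://api.qlaud.ai';
  return { baseURL, apiKey: user.qlaudKey };
}

qlaudOptionsFor({ qlaudKey: 'qlk_live_abc' }, 'openai');
// -> { baseURL: 'https://api.qlaud.ai/v1', apiKey: 'qlk_live_abc' }
```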
Step 4: pull per-user spend at month-end and bill
Whenever your billing cron fires — first of the month, every Sunday, end of trial — pull the rollup with a single GET. The response groups spend by key, broken down by model, with token counts and latency.
```js
// Pull last month's per-user usage
const fromMs = startOfLastMonth.getTime();
const toMs = endOfLastMonth.getTime();

const res = await fetch(
  `https://api.qlaud.ai/v1/usage?from_ms=${fromMs}&to_ms=${toMs}`,
  { headers: { Authorization: `Bearer ${process.env.QLAUD_MASTER_KEY}` } },
);
const { by_key } = await res.json();
// by_key: [{ key_id, name, cost_micros, by_model: [...] }, ...]

for (const row of by_key) {
  const user = await db.users.findOne({ qlaudKeyId: row.key_id });
  const upstreamUsd = row.cost_micros / 1_000_000; // micros -> dollars
  const yourMargin = upstreamUsd * YOUR_MARGIN_MULTIPLIER;
  const billUsd = upstreamUsd + yourMargin;
  // Push to whatever billing tool you're already using
  // (Stripe, Paddle, Lemon Squeezy, in-app credits, custom invoicing)
  await yourBillingTool.charge(user, billUsd);
}
```

Costs are returned in micros (1/1,000,000 USD) for exact decimal arithmetic — no floating-point rounding when you reconcile. The by_key array only contains keys you minted, so you see only your own users' usage. Want a single user's breakdown? Hit /v1/keys/:keyId/usage for an event-level drill-down with cap progress.
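Since costs arrive as integer micros, you can keep the margin math exact by staying in integers until the final conversion. A sketch (the 20% margin expressed in basis points and the helper name are ours, not part of qlaud's API):

```js
// Compute the billed amount in integer micros, converting to dollars only
// at the edge. MARGIN_BPS and the helper name are illustrative.
const MARGIN_BPS = 2000; // 20.00% margin, in basis points

function billUsdFromMicros(costMicros) {
  const marginMicros = Math.floor((costMicros * MARGIN_BPS) / 10_000);
  const totalMicros = costMicros + marginMicros;
  return totalMicros / 1_000_000; // one float conversion, at the very end
}

billUsdFromMicros(1_234_567); // $1.234567 upstream -> bills 1.48148
```

The integer path means two runs over the same usage rows always produce the same invoice, which is what you want when reconciling against qlaud's own totals.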
Why this beats rolling your own
Per-user metering looks deceptively simple — it's "just" counting tokens and writing rows. The hard parts are the ones you don't see until you ship:
- Streaming usage is inconsistent across providers. Some include usage in the final SSE chunk, some include it only when you set stream_options.include_usage = true, some omit it entirely on partial streams. We patch this per-provider.
- Caps have to be enforced before the request, not after. A naive "decrement after response" lets one user fan out 50 parallel requests and overspend by 50× the cap.
- Cost calc has to track every model price change. Providers re-price quietly. We track the catalog so your invoice math doesn't silently go wrong.
- Native SDK shapes are not all OpenAI-compatible. Anthropic cache_control, ElevenLabs voice cloning, Deepgram diarization — these don't fit OpenAI's schema. We do native passthrough so the SDK you're using actually works.
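The "enforce before the request" point is easy to reproduce deterministically: if every in-flight request checks the cap before any of them has recorded its spend, they all pass. A toy simulation (names and numbers are illustrative):

```js
// Deterministic replay of the check-then-act race: n parallel requests
// each pass the cap check before any of them records spend.
function simulateParallel(key, costUsd, n) {
  // Phase 1: every in-flight request runs its cap check first...
  const allowed = Array.from({ length: n }, () => key.spentUsd < key.maxSpendUsd);
  // Phase 2: ...and only records spend after its response returns.
  for (const ok of allowed) if (ok) key.spentUsd += costUsd;
  return allowed.filter(Boolean).length;
}

const key = { spentUsd: 4, maxSpendUsd: 5 };
simulateParallel(key, 1, 50); // all 50 forwarded; key.spentUsd is now 54
```

The fix is to reserve the request's worst-case cost against the cap before forwarding and settle the difference afterward, which is the kind of bookkeeping a gateway can do in one place.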
You spend a quarter writing this and you're still patching it next year. Or you change a base URL and add four lines on signup.
Get started
Sign up for qlaud, top up $5, mint a master key, ship per-user billing tonight. Your users' SDK code stays exactly the same.