Last month my OpenAI bill was $5,247. I'd budgeted $1,000. The Stripe email landed at 6:30 AM and I sat there refreshing the page, because there was no way that number could be right.
It was right. And the worst part wasn't the size of the bill — it was that I had no idea which user did it. OpenAI's dashboard showed the aggregate; my own logs showed thousands of requests; I couldn't connect the two. By the time I'd traced it (one user, automation script, ~60 requests/min for two days), the damage was done.
Six weeks later my AI bill is $1,650 — a 69% reduction — with the same product features and a growing user base. This post is what I changed, with the actual code, the math, and the architecture decisions that mattered. Most of it boils down to one pattern: per-user API keys with hard spend caps.
The naive approach I tried first (and why it failed)
My first instinct was to add cost tracking myself. Postgres table: requests(user_id, model, input_tokens, output_tokens, cost_micros, created_at). Every request logs a row. Sum by user_id. Done.
It was not done. Three things broke:
1. Streaming responses don't tell you the token count until the end
When you stream from OpenAI, you don't know completion_tokens until the stream finishes. If the request is canceled mid-stream, you have to count tokens yourself from the chunks you received. I got this wrong twice — first by under-counting (the canceled-stream case), then by double-counting (when I added a retry on transient 5xx errors and didn't dedupe).
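For reference, here's roughly what the chunk accounting looks like with the official SDK. This is a sketch, not my production code: OpenAI's stream_options.include_usage flag covers the happy path, and the catch branch is where the manual counting has to live.

// A sketch of the two cases, assuming the official openai SDK.
// stream_options.include_usage makes OpenAI append one final chunk whose
// `usage` field carries the real token counts, but only if the stream
// runs to completion.
import OpenAI from "openai";

const client = new OpenAI();

async function streamWithUsage(messages) {
  const stream = await client.chat.completions.create({
    model: "gpt-5.4",
    messages,
    stream: true,
    stream_options: { include_usage: true },
  });

  let text = "";
  let usage = null;

  try {
    for await (const chunk of stream) {
      text += chunk.choices[0]?.delta?.content ?? "";
      if (chunk.usage) usage = chunk.usage; // only present on the final chunk
    }
  } catch (err) {
    // Canceled or aborted mid-stream: the usage chunk never arrives, so
    // `usage` stays null and you're left running `text` through a tokenizer
    // yourself. This is the under-counting trap described above.
  }

  return { text, usage };
}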
2. Caps need to be enforced BEFORE the request fires, not after
The "log every request and sum it up" approach is observability, not enforcement. The runaway user could already have burned $400 by the time my nightly cron noticed. To enforce a cap I'd need a synchronous read before every request — which means a Postgres lookup in the hot path of every API call, which is its own performance problem.
3. Cross-provider attribution turns into a special-case nightmare
I wasn't just on OpenAI. Anthropic Claude for some flows, DeepSeek for others. Each provider has different pricing, different streaming shapes, different ways of counting tokens (Anthropic counts cache reads separately, OpenAI bundles them). My single Postgres rollup needed provider-specific pricing logic + token-counting code per shape. Every new model I added meant new code in the metering path.
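A trimmed sketch of what that pricing logic looks like. The prices are illustrative placeholders, not a real price sheet, and real providers need extra fields (cache reads, per-shape token counting) on top of this.

// Illustrative prices only. $ per million tokens equals micro-dollars per
// token, so tokens * price gives cost in micro-dollars directly.
const PRICING_PER_MTOK = {
  "gpt-5.4": { input: 1.25, output: 10 },
  "claude-sonnet-4.6": { input: 3, output: 15 },
  "deepseek-v3": { input: 0.27, output: 1.1 },
};

function costMicros(model, inputTokens, outputTokens) {
  const p = PRICING_PER_MTOK[model];
  if (!p) throw new Error(`no pricing entry for ${model}`);
  return Math.round(inputTokens * p.input + outputTokens * p.output);
}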
Around week two of this rabbit hole I realized I was building the wrong layer. What I actually wanted was a gateway that did the metering for me, so my application code could go back to being application code.
The pattern: per-user API keys with hard caps
Stripe Connect, AWS IAM sub-accounts, GitHub fine-grained PATs — every modern tenant-scoped infra primitive uses the same pattern. You hold one master credential. You mint child credentials per user, each with their own scoped permissions and limits. Cost tracking and access control fall out automatically because they're defined at the credential level.
For AI inference, this looks like:
- You sign up for qlaud (or build your own gateway). You get one master key.
- On every user signup in YOUR app, you mint a per-user child key with a $10 cap.
- That user's requests carry their own key. The gateway enforces the cap before forwarding.
- You pull per-user usage at month-end and bill however you want.
The end-to-end code is roughly 30 lines:
Step 1 — Mint a key on signup
// At user signup, server-side
const userKey = await qlaud.keys.create({
user_id: user.id,
name: user.email,
scope: "standard",
max_spend_usd: 10, // hard cap — this is the magic
});
await db.users.update(user.id, {
qlaud_key: userKey.secret,
});

Step 2 — Use the per-user key in the official OpenAI SDK
import OpenAI from "openai";
// Per-request, get the user's stored key from your DB
const client = new OpenAI({
baseURL: "https://api.qlaud.ai/v1",
apiKey: user.qlaud_key,
});
const completion = await client.chat.completions.create({
model: "gpt-5.4",
messages: [...],
stream: true,
});

That's the whole client-side change. The OpenAI SDK doesn't know it's hitting a gateway. Same response shape, same error handling, same streaming — only the baseURL changed. Anthropic SDK works the same way (set ANTHROPIC_BASE_URL=https://api.qlaud.ai).
Step 3 — When a user hits the cap, return 402 cleanly
qlaud automatically returns 402 Payment Required when a user's cap is exhausted. In your UI, catch that response and surface an upgrade prompt:
try {
const completion = await client.chat.completions.create(...);
} catch (err) {
if (err.status === 402) {
showUpgradeModal({
message: "You've hit your daily AI credit limit. Upgrade to keep going.",
planLink: "/pricing",
});
return;
}
throw err;
}

Step 4 — Pull usage at month-end
const usage = await fetch("https://api.qlaud.ai/v1/usage?from_ms=...&to_ms=...", {
headers: { Authorization: `Bearer ${process.env.QLAUD_MASTER_KEY}` },
}).then(r => r.json());
// usage.by_key[].cost_micros — divide by 1_000_000 for dollars
for (const k of usage.by_key) {
await stripe.invoiceItems.create({
customer: getStripeCustomerByQlaudUser(k.user_id),
amount: Math.ceil(k.cost_micros / 10_000), // micro-dollars → cents, at cost; add your markup here
currency: "usd",
});
}

Whatever margin you want is between you and your customer. qlaud charges you upstream cost + 7% gateway fee. Your invoice line items can be per-token, per-feature, flat tier with usage allowance — your call.
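If you do want the margin applied in the invoice line itself rather than in your plan pricing, the conversion is one multiplier. A sketch with a hypothetical 30% markup:

// Hypothetical 30% markup. cost_micros / 10_000 is cost in cents at par;
// multiply before rounding up so the margin survives small line items.
const MARKUP = 1.3;

function invoiceAmountCents(costMicros) {
  return Math.ceil((costMicros * MARKUP) / 10_000);
}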
What changed in 6 weeks (the real numbers)
Here's what happened to the bill week-by-week after I made the switch. These are real numbers from my dogfood usage on qlaud — same product, same growing user base.
| Week | Bill ($) | Active users | Notes |
| ---- | -------- | ------------ | ----- |
| 0 | 5,247 | 74 | baseline (the $5K shock month) |
| 1 | 3,180 | 78 | user_42 hit their $10 cap on day 2 |
| 2 | 2,720 | 85 | 5 users hit caps; 3 upgraded to paid |
| 3 | 2,340 | 91 | added cap-warn email at 80% |
| 4 | 2,015 | 97 | tightened tier defaults: $5 free, $25 paid |
| 5 | 1,810 | 104 | churn cleaned up — bad-actor user gone |
| 6 | 1,650 | 112 | steady state |

Couple of observations from the numbers, since they're worth more than the percentage drop in isolation:
- The runaway user got contained immediately. Day 2 of week 1, user_42 hit the cap. Their script kept retrying and getting 402'd. Total spend on that user for the month: $10. Previous month: estimated $1,800.
- Caps surface upgrade signal. Five users hit caps legitimately in week 2 — power users actually using the product. Three of them upgraded when shown the modal. That's a 60% upgrade-on-cap-hit rate that I had no way to surface before. Hitting the cap is now my best lead source.
- Tier defaults compounded. Once I knew per-user spend patterns I could redesign the free tier. Free went from "$10 generous" to "$5 limited"; paid tier ($19/mo) gets $25 of usage. Conversion went up because the free tier hits cap faster, and average revenue per paid user is now higher than the gross AI cost. Sustainable unit economics for the first time.
What this unlocks beyond cost control
The killer feature isn't the 69% reduction. It's that per-user attribution is now a primitive in my product, and that primitive composes:
Cohort analysis
Group users by signup date, plan, geography, referral source. For each cohort, see "average cost per user", "% of users who hit cap", "median time to first dollar of value." This is the kind of analysis that turns gut-feel pricing into actual unit economics.
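A sketch of what that rollup can look like, assuming `users` carries { id, signedUpAt } from your own DB, `usage.by_key` comes from the month-end pull in Step 4, and `capMicros` is the cap in micro-dollars:

// Group month-end usage into signup-month cohorts (hypothetical shapes).
function cohortStats(users, usage, capMicros) {
  const costByUser = new Map(usage.by_key.map(k => [k.user_id, k.cost_micros]));
  const cohorts = {};
  for (const u of users) {
    const cohort = u.signedUpAt.slice(0, 7); // "YYYY-MM" signup cohort
    const c = (cohorts[cohort] ??= { users: 0, costMicros: 0, hitCap: 0 });
    const cost = costByUser.get(u.id) ?? 0;
    c.users += 1;
    c.costMicros += cost;
    if (cost >= capMicros) c.hitCap += 1;
  }
  return Object.entries(cohorts).map(([cohort, c]) => ({
    cohort,
    avgCostUsd: c.costMicros / c.users / 1_000_000,
    pctHitCap: (100 * c.hitCap) / c.users,
  }));
}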
Anomaly detection
A user spending 10x the cohort median in week 1? Either they're a power user (good signal — reach out, offer a custom plan) or a bad actor (also good signal — review and ban before they cost you money).
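The check itself is small. A sketch over the same `usage.by_key` shape, flagging anyone above 10x the median:

// Flag users spending more than 10x the median of the group.
function flagAnomalies(byKey) {
  const costs = byKey.map(k => k.cost_micros).sort((a, b) => a - b);
  const median = costs[Math.floor(costs.length / 2)] ?? 0;
  return byKey
    .filter(k => median > 0 && k.cost_micros > 10 * median)
    .map(k => k.user_id); // review each one: power user or bad actor
}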
Granular feature pricing
Previously I priced features in averages: "AI features cost about $0.05 per use." Now I have actual numbers: image-gen costs $0.04, summarization costs $0.003, agent loop costs $0.18. Pricing decisions stop being guesses.
Per-user model routing
For free-tier users, route to DeepSeek V3 ($0.27/MTok in). For paid, route to Claude Sonnet 4.6 ($3/MTok in). The cost-per-user gap shrinks 10x without changing the perceived product quality much.
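The routing can live right next to the per-user key lookup. A sketch using the model names above, assuming `user.plan` and `prompt` come from your own request context:

// Route by plan: cheap model for free tier, premium model for paid.
function modelForUser(user) {
  return user.plan === "paid" ? "claude-sonnet-4.6" : "deepseek-v3";
}

const completion = await client.chat.completions.create({
  model: modelForUser(user),
  messages: [{ role: "user", content: prompt }],
  stream: true,
});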
Things I'd do differently if starting over
A few decisions in retrospect — short list because hindsight is long-winded:
- Mint per-user keys from day one, not after the $5K shock. The cost of adding it later is one messy migration; the cost of not having it is being one bad-actor user away from a surprise bill.
- Default the free-tier cap lower — start at $3, not $10. People who care will pay; people who don't were never going to pay.
- Build the cap-warn email at 80% on day one, not week 3 (a sketch follows this list). Users who hit cap unexpectedly churn; users who get warned and given a clear upgrade path convert.
- Use the gateway's latency time series, not just spend. Latency-by-user surfaces issues spend doesn't (a user hitting slow paths repeatedly = something to fix in the product).
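Here's the shape of the cap-warn job mentioned above: a sketch assuming a daily cron, the usage endpoint from Step 4, and hypothetical capUsd/capWarned fields plus a sendCapWarningEmail() helper on your side.

const WARN_AT = 0.8; // warn at 80% of the user's cap

async function sendCapWarnings(users) {
  const to = Date.now();
  const from = to - 30 * 24 * 60 * 60 * 1000; // current billing window
  const usage = await fetch(`https://api.qlaud.ai/v1/usage?from_ms=${from}&to_ms=${to}`, {
    headers: { Authorization: `Bearer ${process.env.QLAUD_MASTER_KEY}` },
  }).then(r => r.json());

  const costByUser = new Map(usage.by_key.map(k => [k.user_id, k.cost_micros]));
  for (const u of users) {
    const spentUsd = (costByUser.get(u.id) ?? 0) / 1_000_000;
    if (!u.capWarned && spentUsd >= WARN_AT * u.capUsd) {
      await sendCapWarningEmail(u, spentUsd); // hypothetical mailer
      await db.users.update(u.id, { capWarned: true });
    }
  }
}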
The takeaway
Per-user AI cost attribution stops being an afterthought when you make it a credential primitive. Mint a key per user, cap it, drop it into the official SDK, pull usage at month-end. That's the playbook. The infrastructure to do this yourself is doable but distracting; the infrastructure to do it via qlaud is one signup and a base-URL change away.
Free tier with $200 starter credit if you want to kick the tires. Drop-in compatible with the OpenAI, Anthropic, ElevenLabs, Vercel AI, LangChain, and LlamaIndex SDKs. The first-month bill anomaly is the one you'll never have again.