Architecture · Apr 30, 2026 · 9 min read

How I cut my AI app's OpenAI bill 60% with per-user API keys

A $5,247 bill I couldn't attribute. The naive Postgres approach that didn't scale. The per-user-keys pattern that fixed it — with the actual numbers, code, and the 60% cost reduction in 6 weeks.

qlaud team · Engineering

Last month my OpenAI bill was $5,247. I'd budgeted $1,000. The Stripe email landed at 6:30 AM and I sat there refreshing the page, because there was no way that number could be right.

It was right. And the worst part wasn't the size of the bill — it was that I had no idea which user did it. OpenAI's dashboard showed the aggregate; my own logs showed thousands of requests; I couldn't connect the two. By the time I'd traced it (one user, automation script, ~60 requests/min for two days), the damage was done.

Six weeks later my AI bill is $1,650 — a 60% reduction — with the same product features and growing user base. This post is what I changed, with the actual code, the math, and the architecture decisions that mattered. Most of it boils down to one pattern: per-user API keys with hard spend caps.

The naive approach I tried first (and why it failed)

My first instinct was to add cost tracking myself. A Postgres table: requests(user_id, model, input_tokens, output_tokens, cost_micros, created_at). Every request logs a row. Sum by user_id. Done.

It was not done. Three things broke:

1. Streaming responses don't tell you the token count until the end

When you stream from OpenAI, you don't know completion_tokens until the stream finishes. If the request is canceled mid-stream, you have to count tokens yourself from the chunks you received. I got this wrong twice — first by under-counting (the canceled-stream case), then by double-counting (when I added a retry on transient 5xx errors and didn't dedupe).
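Both bugs reduce to the same sketch: count from the chunks you actually received, and make the usage write idempotent. A minimal illustration, where `countTokens` is a stand-in heuristic (a real implementation would use a tokenizer like tiktoken matched to the model) and the in-memory `Set` stands in for a dedupe table:

```javascript
// Rough stand-in tokenizer — replace with a real one (e.g. tiktoken) in prod.
const countTokens = (text) => Math.ceil(text.length / 4);

// Dedupe store. In production this is a unique index in Postgres or Redis,
// not process memory.
const loggedRequests = new Set();

function recordStreamUsage(requestId, chunks) {
  // Bug #2 fix: a retry re-sends the same requestId — only log it once.
  if (loggedRequests.has(requestId)) return null;
  loggedRequests.add(requestId);

  // Bug #1 fix: on a canceled stream the final usage object never arrives,
  // so count tokens from whatever chunks actually came through.
  const text = chunks.map((c) => c.delta ?? "").join("");
  return { requestId, completion_tokens: countTokens(text) };
}
```

The key design point is that the dedupe key has to be assigned *before* the first attempt, so a transient-5xx retry carries the same id.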

2. Caps need to be enforced BEFORE the request fires, not after

The "log every request and sum it up" approach is observability, not enforcement. The runaway user could already have burned $400 by the time my nightly cron noticed. To enforce a cap I'd need a synchronous read before every request — which means a Postgres lookup in the hot path of every API call, which is its own performance problem.
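What enforcement actually requires is a synchronous check shaped roughly like this — a sketch with an in-memory map standing in for whatever fast store (Redis, or the gateway itself) holds running spend:

```javascript
// Per-user lifetime spend in micro-dollars. In-memory here for illustration;
// in production this read has to be fast and in the request hot path.
const spentMicros = new Map();
const CAP_MICROS = 10_000_000; // $10 hard cap

function checkCap(userId, estimatedCostMicros) {
  const spent = spentMicros.get(userId) ?? 0;
  if (spent + estimatedCostMicros > CAP_MICROS) {
    // Refuse before a single upstream token is bought.
    return { allowed: false, status: 402 };
  }
  return { allowed: true, status: 200 };
}

function recordSpend(userId, costMicros) {
  spentMicros.set(userId, (spentMicros.get(userId) ?? 0) + costMicros);
}
```

The check is cheap; the hard part is keeping the spend counter fresh and fast, which is exactly the infrastructure I didn't want to run.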

3. Cross-provider attribution turns into a special-case nightmare

I wasn't just on OpenAI. Anthropic Claude for some flows, DeepSeek for others. Each provider has different pricing, different streaming shapes, different ways of counting tokens (Anthropic counts cache reads separately, OpenAI bundles them). My single Postgres rollup needed provider-specific pricing logic + token-counting code per shape. Every new model I added meant new code in the metering path.
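To make the special-casing concrete, here's the shape the pricing code kept growing into. A convenient trick: dollars-per-million-tokens equals micro-dollars-per-token, so list prices drop in directly. The output and cache-read prices below are illustrative placeholders, not current list prices:

```javascript
// $/MTok == micro-dollars/token, so list prices can be used as-is.
// Numbers are illustrative, not live pricing.
const PRICING = {
  "openai:gpt-5.4":          { in: 2.0,  out: 8.0,  cacheRead: null },
  "anthropic:claude-sonnet": { in: 3.0,  out: 15.0, cacheRead: 0.3 },
  "deepseek:v3":             { in: 0.27, out: 1.1,  cacheRead: null },
};

function costMicros(provider, model, usage) {
  const p = PRICING[`${provider}:${model}`];
  if (!p) throw new Error(`no pricing for ${provider}:${model}`);
  let micros = usage.input_tokens * p.in + usage.output_tokens * p.out;
  // Anthropic bills cache reads separately; OpenAI folds them into input.
  if (p.cacheRead != null && usage.cache_read_tokens) {
    micros += usage.cache_read_tokens * p.cacheRead;
  }
  return Math.round(micros);
}
```

Every new provider means a new entry, a new usage shape, and usually a new branch — which is how this table metastasizes into the metering path.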

Around week two of this rabbit hole I realized I was building the wrong layer. What I actually wanted was a gateway that did the metering for me, so my application code could go back to being application code.

The pattern: per-user API keys with hard caps

Stripe Connect, AWS IAM sub-accounts, GitHub fine-grained PATs — every modern tenant-scoped infra primitive uses the same pattern. You hold one master credential. You mint child credentials per user, each with their own scoped permissions and limits. Cost tracking and access control fall out automatically because they're defined at the credential level.

For AI inference, this looks like:

  1. You sign up for qlaud (or build your own gateway). You get one master key.
  2. On every user signup in YOUR app, you mint a per-user child key with a $10 cap.
  3. That user's requests carry their own key. The gateway enforces the cap before forwarding.
  4. You pull per-user usage at month-end and bill however you want.

The end-to-end code is roughly 30 lines:

Step 1 — Mint a key on signup

// At user signup, server-side
const userKey = await qlaud.keys.create({
  user_id: user.id,
  name: user.email,
  scope: "standard",
  max_spend_usd: 10,  // hard cap — this is the magic
});

await db.users.update(user.id, {
  qlaud_key: userKey.secret,
});

Step 2 — Use the per-user key in the official OpenAI SDK

import OpenAI from "openai";

// Per-request, get the user's stored key from your DB
const client = new OpenAI({
  baseURL: "https://api.qlaud.ai/v1",
  apiKey: user.qlaud_key,
});

const completion = await client.chat.completions.create({
  model: "gpt-5.4",
  messages: [...],
  stream: true,
});

That's the whole client-side change. The OpenAI SDK doesn't know it's hitting a gateway. Same response shape, same error handling, same streaming — only the baseURL changed. Anthropic SDK works the same way (set ANTHROPIC_BASE_URL=https://api.qlaud.ai).

Step 3 — When a user hits the cap, return 402 cleanly

qlaud automatically returns 402 Payment Required when a user's cap is exhausted. In your UI, catch that response and surface an upgrade prompt:

try {
  const completion = await client.chat.completions.create(...);
} catch (err) {
  if (err.status === 402) {
    showUpgradeModal({
      message: "You've hit your daily AI credit limit. Upgrade to keep going.",
      planLink: "/pricing",
    });
    return;
  }
  throw err;
}

Step 4 — Pull usage at month-end

const usage = await fetch("https://api.qlaud.ai/v1/usage?from_ms=...&to_ms=...", {
  headers: { Authorization: `Bearer ${process.env.QLAUD_MASTER_KEY}` },
}).then(r => r.json());

// usage.by_key[].cost_micros — divide by 1_000_000 for dollars
for (const k of usage.by_key) {
  await stripe.invoiceItems.create({
    customer: getStripeCustomerByQlaudUser(k.user_id),
    amount: Math.ceil(k.cost_micros / 10_000),  // micros → cents at cost; apply your markup here
    currency: "usd",
  });
}

Whatever margin you want is between you and your customer. qlaud charges you upstream cost + 7% gateway fee. Your invoice line items can be per-token, per-feature, flat tier with usage allowance — your call.
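The unit conversion is worth pinning down, since Stripe invoice amounts are integer cents. A small helper that makes the markup explicit rather than implicit (the percentage is your pricing decision, not a qlaud parameter):

```javascript
// Convert gateway cost (micro-dollars) to a Stripe invoice amount (cents),
// with an explicit markup percentage. 1_000_000 micros = $1 = 100 cents,
// so micros / 10_000 gives cents at cost.
function invoiceCents(costMicros, markupPct = 0) {
  const centsAtCost = costMicros / 10_000;
  return Math.ceil(centsAtCost * (1 + markupPct / 100));
}
```

`Math.ceil` rounds in your favor; at micro-dollar granularity the rounding error per invoice is under a cent.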

What changed in 6 weeks (the real numbers)

Here's what happened to the bill week-by-week after I made the switch. These are real numbers from my dogfood usage on qlaud — same product, same growing user base.

Week  Bill ($)  Active users   Notes
0     5,247     74             baseline (the $5K shock month)
1     3,180     78             user_42 hit their $10 cap on day 2
2     2,720     85             5 users hit caps; 3 upgraded to paid
3     2,340     91             added cap-warn email at 80%
4     2,015     97             tightened tier defaults: $5 free, $25 paid
5     1,810     104            churn cleaned up — bad-actor user gone
6     1,650     112            steady state

Couple of observations from the numbers, since they're worth more than the percentage drop in isolation:

  • The runaway user got contained immediately. Day 2 of week 1, user_42 hit the cap. Their script kept retrying and getting 402'd. Total spend on that user for the month: $10. Previous month: estimated $1,800.
  • Caps surface upgrade signal. Five users hit caps legitimately in week 2 — power users actually using the product. Three of them upgraded when shown the modal. That's a 60% upgrade-on-cap-hit rate that I had no way to surface before. Hitting the cap is now my best lead source.
  • Tier defaults compounded. Once I knew per-user spend patterns I could redesign the free tier. Free went from "$10 generous" to "$5 limited"; paid tier ($19/mo) gets $25 of usage. Conversion went up because the free tier hits cap faster, and average revenue per paid user is now higher than the gross AI cost. Sustainable unit economics for the first time.

What this unlocks beyond cost control

The killer feature isn't the 60% reduction. It's that per-user attribution is now a primitive in my product, and that primitive composes:

Cohort analysis

Group users by signup date, plan, geography, referral source. For each cohort, see "average cost per user", "% of users who hit cap", "median time to first dollar of value." This is the kind of analysis that turns gut-feel pricing into actual unit economics.
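A sketch of that rollup over per-user monthly summaries. The row shape here (`{ user_id, cohort, cost_micros, hit_cap }`) is my assumption for illustration, not the gateway's actual response schema:

```javascript
// Cohort rollup over per-user monthly summary rows.
// Row shape is assumed: { user_id, cohort, cost_micros, hit_cap }.
function cohortStats(rows) {
  const byCohort = new Map();
  for (const r of rows) {
    const c = byCohort.get(r.cohort) ?? { users: 0, micros: 0, capped: 0 };
    c.users += 1;
    c.micros += r.cost_micros;
    if (r.hit_cap) c.capped += 1;
    byCohort.set(r.cohort, c);
  }
  return [...byCohort].map(([cohort, c]) => ({
    cohort,
    avg_cost_usd: c.micros / 1_000_000 / c.users,
    pct_hit_cap: (100 * c.capped) / c.users,
  }));
}
```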

Anomaly detection

A user spending 10x the cohort median in week 1? Either they're a power user (good signal — reach out, offer a custom plan) or a bad actor (also good signal — review and ban before they cost you money).
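The check itself is a few lines once you have per-user spend. A sketch of the 10x-median flag (the threshold multiple is a judgment call, not a recommendation):

```javascript
// Flag users whose spend exceeds `multiple` × the cohort median.
// Input: Map of user_id -> spend in micro-dollars.
function flagAnomalies(spendByUser, multiple = 10) {
  const values = [...spendByUser.values()].sort((a, b) => a - b);
  const mid = values.length / 2;
  const median = values.length % 2
    ? values[Math.floor(mid)]
    : (values[mid - 1] + values[mid]) / 2;
  return [...spendByUser]
    .filter(([, spend]) => spend > median * multiple)
    .map(([userId]) => userId);
}
```

Median rather than mean matters here: one whale inflates the mean enough to hide the next whale.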

Granular feature pricing

Previously I priced features in averages: "AI features cost about $0.05 per use." Now I have actual numbers: image-gen costs $0.04, summarization costs $0.003, agent loop costs $0.18. Pricing decisions stop being guesses.

Per-user model routing

For free-tier users, route to DeepSeek V3 ($0.27/MTok in). For paid, route to Claude Sonnet 4.6 ($3/MTok in). The cost-per-user gap shrinks 10x without changing the perceived product quality much.
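In code this is a lookup, not an architecture. A sketch of plan-based routing — the model names come from the numbers above, but the routing table itself is my illustration, not qlaud configuration:

```javascript
// Plan-based model routing. Table is illustrative; swap in your own tiers.
const ROUTES = {
  free: { provider: "deepseek",  model: "deepseek-v3" },
  paid: { provider: "anthropic", model: "claude-sonnet-4.6" },
};

function routeFor(user) {
  // Unknown or missing plans fall back to the cheap tier.
  return ROUTES[user.plan] ?? ROUTES.free;
}
```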

Things I'd do differently if starting over

A few decisions in retrospect — short list because hindsight is long-winded:

  • Mint per-user keys from day one, not after the $5K shock. The cost of adding it later is one messy migration; the cost of not having it is that you're always one bad-actor user away from another surprise bill.
  • Default the free-tier cap lower — start at $3, not $10. People who care will pay; people who don't were never going to pay.
  • Build the cap-warn email at 80% on day one, not week 3. Users who hit cap unexpectedly churn; users who get warned and given a clear upgrade path convert.
  • Use the gateway's response time-series, not just spend. Latency-by-user surfaces issues spend doesn't (a user hitting slow paths repeatedly = something to fix in the product).
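The cap-warn check from that second bullet is small enough to show whole — run it after each recorded request, with the `alreadyWarned` flag stored per user so the email fires once per billing period:

```javascript
// Fire a cap warning once per billing period when spend crosses the threshold.
function shouldWarn(spentMicros, capMicros, alreadyWarned, threshold = 0.8) {
  return !alreadyWarned && spentMicros >= capMicros * threshold;
}
```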

The takeaway

Per-user AI cost attribution stops being an afterthought when you make it a credential primitive. Mint a key per user, cap it, drop into the official SDK, pull usage at month-end. That's the playbook. The infrastructure to do this yourself is doable but distracting; the infrastructure to do it via qlaud is one signup and a base-URL change away.

Free tier with $200 starter credit if you want to kick the tires. Drop-in compatible with the OpenAI, Anthropic, ElevenLabs, Vercel AI, LangChain, and LlamaIndex SDKs. The first-month bill anomaly is the one you'll never have again.

#openai per user billing · #ai cost attribution · #ai cost tracking · #per user openai api key · #openai cost attribution per user · #anthropic cost per user · #ai app cost surprise · #llm gateway · #ai usage metering · #stripe ai billing

Frequently asked questions

Can I tag OpenAI requests with a user_id and let OpenAI attribute the cost?

Not directly. OpenAI's organization usage view shows aggregate spend; the metadata field on Chat Completions is searchable in the playground but not surfaced as a per-user cost rollup in any first-party billing API. You can build your own rollup by logging every request with token counts and pricing — but that's the rabbit hole this post is about. The shorter path: mint a key per user with a cap, let the gateway do the attribution.

How is per-user API key attribution different from rate limiting?

Rate limiting bounds requests-per-second; it doesn't bound total spend. A user could send one request per second for 30 days and burn $1,000+ within your rate-limit budget. A hard spend cap stops the bleeding at $10 (or whatever you set) regardless of request rate. Both are useful — caps are the financial backstop, rate limits are the abuse backstop.

Won't this slow down my requests?

qlaud's edge gateway runs on Cloudflare Workers — added latency is single-digit milliseconds for the auth + cap check, then it forwards to the provider. The native-passthrough path for Anthropic preserves prompt-cache headers verbatim, so cache-heavy workflows actually get faster (cache hits land at edge before the upstream call).

What about Anthropic and other providers, not just OpenAI?

Same pattern works. Set ANTHROPIC_BASE_URL=https://api.qlaud.ai and your Anthropic SDK becomes a qlaud client. Same for ElevenLabs (xi-api-key), Vercel AI SDK, LangChain, LiteLLM. One per-user key works across every model and provider — your cost rollup unifies them.

What happens when a user hits their cap mid-stream?

qlaud does a pre-flight check before each request. If the cap is exceeded, the request returns 402 Payment Required immediately. We allow cents-level overdraft on an already-streaming response (killing it mid-token would be jarring) but the next request after that 402s. Your UI catches the 402 and shows the user a 'cap reached, upgrade to keep going' modal.

Is this overkill for a small app?

If you have 10 users you can probably tolerate the risk. If you have 100+ users — especially anonymous-trial users — one bad actor can drain your wallet faster than you can react. The cost of adding per-user keys is ~30 minutes; the cost of not adding them is whatever your worst user can spend in a weekend.
