Architecture · Apr 30, 2026 · 11 min read

The hidden infrastructure you build when you ship AI chat

A chat textarea is a 30-minute build. The persistence, streaming reassembly, dedup, tool-call audit trail, and history endpoint that turn it into a real product take 3+ weeks. A walkthrough of what breaks, in what order, and how to skip most of it.

qlaud team · Engineering

I added a chat UI to my app in 30 minutes. The actual chat infrastructure took three weeks. This post is what those three weeks were spent on, in what order each piece broke, and how I'd shortcut most of it if I were starting over.

If you've shipped AI chat before, you know what I'm about to describe. If you haven't yet, this is the post I wish I'd read first — half warning, half tutorial, with a path through the swamp at the end.

What "ship AI chat" actually means

A chat box is a textarea, an EventSource for streaming, and a list of message bubbles. That's the part that takes 30 minutes. Then real users start using it and you discover everything chat needs to be a real product:

  • Stream reassembly when the connection drops mid-response
  • Persistence so refresh doesn't lose the conversation
  • Deduplication when the user retries on a 5xx
  • Tool calls that need to be auditable in conversation history
  • Per-user sequencing so two browsers don't desync
  • A history endpoint, sortable + paginated, with backpressure
  • Cleanup of dangling streams that the client never closed
  • A schema that doesn't fall apart when you add models with different shapes

Each of these is a fixable problem. The aggregate cost is what surprises you.

What breaks first: streaming reassembly on flaky networks

Your first beta tester closes their laptop mid-response. They reopen ten minutes later. The chat shows a half-finished assistant message ending in "Therefore, the optimal strate—" and that's it. Reload the page; the half-message is gone, no recovery, no resume.

The fix is buffering the stream server-side, not just relaying it. You need a server endpoint that:

  1. Receives chunks from the upstream model
  2. Forwards them to the client over an SSE/websocket
  3. Also writes them to durable storage as they land
  4. On client reconnect, replays whatever was buffered + continues the live stream

That's a non-trivial pattern. Cloudflare Durable Objects work well for it (single writer, state survives reconnects). Postgres + LISTEN/NOTIFY works too if you commit to managing the connection pool. On AWS, the usual combination is API Gateway WebSockets + DynamoDB Streams.
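Whichever store it is, the loop has the same shape. Here's a minimal TypeScript sketch of the buffer-and-relay pattern; ChunkStore and its saveChunk / loadChunks methods are hypothetical stand-ins for whatever durable storage you commit to.

// A buffer-and-relay handler: everything that reaches the client has already
// reached durable storage. ChunkStore is a hypothetical interface over the
// store you pick (Durable Object storage, Postgres, DynamoDB, ...).
type ChunkStore = {
  saveChunk(messageId: string, index: number, text: string): Promise<void>;
  loadChunks(messageId: string): Promise<string[]>;
};

async function relayStream(
  upstream: ReadableStream<Uint8Array>,   // body of the model provider's response
  messageId: string,
  store: ChunkStore,
): Promise<Response> {
  const encoder = new TextEncoder();
  const decoder = new TextDecoder();
  let index = 0;

  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      // 1. Replay whatever was buffered before this (re)connect.
      for (const chunk of await store.loadChunks(messageId)) {
        controller.enqueue(encoder.encode(chunk));
      }
      // 2. Persist, then forward, each live chunk.
      const reader = upstream.getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        const text = decoder.decode(value, { stream: true });
        await store.saveChunk(messageId, index++, text); // durable first
        controller.enqueue(value);                       // then relay
      }
      controller.close();
    },
  });

  return new Response(body, { headers: { "Content-Type": "text/event-stream" } });
}

In practice you'd batch the saveChunk writes rather than await one per token, and handle an upstream that dies mid-read, but the invariant is the point: the buffer is never behind the client.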

Whatever you pick, you've now committed to a specific infra primitive that's load-bearing in the hot path of every chat interaction. Pick wrong here and you're refactoring it in six months when the reconnect story falls apart at scale.

What breaks second: the persistence schema

OK, you persist messages now. What's the schema?

The naive version:

-- v1 schema
CREATE TABLE messages (
  id        uuid PRIMARY KEY,
  thread_id uuid NOT NULL,
  role      text NOT NULL,  -- 'user' | 'assistant'
  content   text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

That works for two days. Then you discover:

Multi-modal content. "Content" isn't just text — it's an array of blocks: text, image, tool_use, tool_result. Now you need either JSONB or a separate message_blocks table.

Token counts. You need them for billing, retention, rate limiting. They have to land on the message row at the moment the stream completes, which is a separate event from when you first inserted the row. You either eagerly upsert them (hot-path write contention) or eventually-update via background job (eventual consistency in your UI).
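The write itself is small; the decision is where it runs. A sketch of the eventually-update variant, assuming the thread_messages table and token_count column the v3 schema below ends up with (db.query is a stand-in for your Postgres client):

// Hypothetical post-stream hook: once the provider reports final usage, attach
// it to the row that was inserted when the stream started.
async function recordTokenCount(
  db: { query(sql: string, params: unknown[]): Promise<unknown> },
  threadId: string,
  seq: number,
  tokenCount: number,
): Promise<void> {
  await db.query(
    `UPDATE thread_messages
        SET token_count = $3
      WHERE thread_id = $1 AND seq = $2`,
    [threadId, seq, tokenCount],
  );
}

Run it inline and you contend with the hot path; run it from a queue and your UI shows a missing count for a few seconds. Pick your poison.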

Sequencing. Two browser tabs send a message concurrently. You need a strict ordering. Naive auto-incrementing integer doesn't work across writers; UUIDs don't sort; clock-based ordering breaks under skew. You end up with a per-thread sequence number that requires a row lock or a Durable Object.

Per-end-user scoping. If you're building a B2B product where companies have many end-users, you need to scope every query by end_user_id. Add a column, add an index, add an explicit WHERE clause to every read path. Forget one and you have a tenant-leak bug.

Schema v3, eight commits later, looks more like:

CREATE TABLE threads (
  id          uuid PRIMARY KEY,
  user_id     text NOT NULL,
  end_user_id text NOT NULL,
  metadata    jsonb,
  created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE thread_messages (
  thread_id    uuid NOT NULL REFERENCES threads(id),
  seq          integer NOT NULL,  -- per-thread sequence
  role         text NOT NULL,
  content      jsonb NOT NULL,    -- array of blocks
  request_id   text,              -- for dedup
  token_count  integer,
  created_at   timestamptz NOT NULL,
  PRIMARY KEY (thread_id, seq)
);

-- the (thread_id, seq) primary key already covers by-thread reads; what you
-- add is the uniqueness that makes retries conflict instead of duplicating:
CREATE UNIQUE INDEX idx_thread_messages_request
  ON thread_messages(thread_id, request_id);
CREATE INDEX idx_threads_end_user ON threads(end_user_id);
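The part of this schema that needs real care is allocating seq under concurrent writers, the problem the sequencing paragraph above describes. One plain-Postgres way is to lock the parent thread row for the duration of the insert. A sketch, assuming db is a single pooled connection and query is a stand-in for your Postgres client:

// Allocate the next per-thread seq and insert in one transaction. Locking the
// parent thread row serializes writers on the same thread without blocking
// writes to other threads.
async function appendMessage(
  db: { query(sql: string, params?: unknown[]): Promise<{ rows: any[] }> },
  threadId: string,
  role: "user" | "assistant",
  content: unknown[],
  requestId: string | null,
): Promise<number> {
  await db.query("BEGIN");
  try {
    // Concurrent writers to the same thread queue on this row lock.
    await db.query("SELECT 1 FROM threads WHERE id = $1 FOR UPDATE", [threadId]);
    const { rows } = await db.query(
      `INSERT INTO thread_messages (thread_id, seq, role, content, request_id, created_at)
       SELECT $1, COALESCE(MAX(seq), 0) + 1, $2, $3::jsonb, $4, now()
         FROM thread_messages
        WHERE thread_id = $1
       RETURNING seq`,
      [threadId, role, JSON.stringify(content), requestId],
    );
    await db.query("COMMIT");
    return rows[0].seq as number;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}

The other route is to put each thread behind a Durable Object, which gives you the single-writer property without the lock.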

Plus migrations. Plus a backfill plan when you change anything. Plus rate-limited cleanup of orphaned threads. Plus a vacuum strategy for the inevitable bloat.

What breaks third: tool calls that vanish from history

Then you add tool calling. The model emits a tool_use block ("call get_weather('SF')"). Your code dispatches the tool, gets a result ("72°F, sunny"), feeds it back into the next request as a tool_result block.

Question: do you persist those tool_use and tool_result blocks?

First reaction: "no, those are internal — only show user-visible messages." Two weeks later: a user asks "why did the AI tell me my appointment was at 3pm? It should have said 2pm." You go to debug it. You can see the user's message ("when's my appointment?") and the assistant's reply ("3pm"). You CAN'T see the get_calendar tool call, what it returned, or whether it errored.

Tool call history isn't optional. It's THE audit trail when an agent makes a wrong decision. Persist it. Now your thread_messages content blocks include four kinds: text, image, tool_use, tool_result. Your read paths have to filter to user-visible content unless an admin is debugging.
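A minimal way to carve that split is to persist every block and filter at read time. A sketch in TypeScript, with shapes chosen for illustration rather than any particular provider's exact wire format:

// The four block kinds described above, modelled as a tagged union.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { url: string } }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: unknown; is_error?: boolean };

interface StoredMessage {
  role: "user" | "assistant";
  seq: number;
  content: ContentBlock[];
}

// End users see text and images; an admin debugging an agent sees everything,
// including the tool calls and their (possibly failed) results.
function visibleBlocks(msg: StoredMessage, audience: "user" | "admin"): ContentBlock[] {
  if (audience === "admin") return msg.content;
  return msg.content.filter((b) => b.type === "text" || b.type === "image");
}

The filter lives in the read path; the write path always persists the full record.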

What breaks fourth: deduplication

The user sends a message. Your client gets a 502 from your edge worker (CDN hiccup, container restart, whatever). The client retries. Now you have two identical user messages in the thread.

Dedup is per-request-id, not per-content. Generate an idempotency key client-side, send it as a header, persist it as a column, fail gracefully on conflict. Easy to describe, easy to forget, hard to retrofit into a system that already has dirty data.
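Concretely, the client generates one key per logical send and reuses it on retries; the server turns a conflict into a no-op. A sketch (the X-Request-Id header name is illustrative, and the ON CONFLICT target assumes the unique (thread_id, request_id) index from the v3 schema above):

import { randomUUID } from "node:crypto";

// Client side: one idempotency key per logical send, reused on every retry.
async function sendMessage(threadId: string, text: string) {
  const requestId = randomUUID();
  return fetch(`/api/threads/${threadId}/messages`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Request-Id": requestId, // same value on every retry of this send
    },
    body: JSON.stringify({ content: [{ type: "text", text }] }),
  });
}

// Server side: the conflict on (thread_id, request_id) makes the retry a no-op
// instead of a duplicate row.
const INSERT_DEDUPED = `
  INSERT INTO thread_messages (thread_id, seq, role, content, request_id, created_at)
  VALUES ($1, $2, 'user', $3::jsonb, $4, now())
  ON CONFLICT (thread_id, request_id) DO NOTHING
`;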

What breaks fifth: the history endpoint

Your client needs to load past messages on page refresh. So you write GET /api/threads/:id/messages. Easy. It returns a JSON array of messages.

Then a power user has a 6,000-message conversation. Your endpoint returns 12MB of JSON. The browser hangs while parsing. You add ?limit=50. Now you need cursor-based pagination because offset-based starts skipping messages when new ones arrive while scrolling. You write the cursor encoding, decode it, validate it, handle malformed cursors gracefully. Another two days.
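The cursor can be as simple as the last seq you returned, encoded so clients treat it as opaque. A sketch of the read path with the per-end-user scoping from earlier folded in (the encoding scheme is illustrative; Buffer assumes a Node runtime):

// Cursor = the last seq the client saw, base64url-encoded so it stays opaque.
function encodeCursor(seq: number): string {
  return Buffer.from(String(seq)).toString("base64url");
}

function decodeCursor(cursor: string | null): number {
  if (!cursor) return 0; // first page
  const n = Number(Buffer.from(cursor, "base64url").toString("utf8"));
  if (!Number.isInteger(n) || n < 0) throw new Error("malformed cursor");
  return n;
}

// Keyset pagination: stable when new messages arrive mid-scroll, unlike OFFSET.
// Note the end_user_id check; every read path gets one, per the scoping section.
const PAGE_QUERY = `
  SELECT m.seq, m.role, m.content, m.created_at
    FROM thread_messages m
    JOIN threads t ON t.id = m.thread_id
   WHERE m.thread_id = $1
     AND t.end_user_id = $2
     AND m.seq > $3
   ORDER BY m.seq
   LIMIT $4
`;

The next cursor is encodeCursor of the last row's seq; an empty page means the client is caught up.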

The accumulated cost

Let's tally:

  • Streaming reassembly + buffer: ~2 days
  • Persistence schema (with the migrations to get to v3): ~3 days
  • Tool call history: ~1.5 days
  • Deduplication: ~0.5 days
  • History endpoint with cursor pagination: ~1 day
  • Per-end-user scoping audit (going through every endpoint): ~1 day
  • Bug fixes from production usage: ~3 days
  • Vector search for "find past conversation about X" (Pinecone setup, embedding pipeline): ~2 days

Total: ~14 days of senior engineer time, conservatively. Plus the ongoing cost of maintaining all of it as your model providers add new block types, your schema needs new fields, and you have to keep migrations + backfills consistent.

That's the time you spent NOT shipping product features. For the many teams whose differentiation is the chat application rather than the infrastructure, this is dead weight.

The shortcut: a managed thread API

Most of the above isn't novel. Every team building AI chat re-derives the same pattern, which is why we built the gateway version: qlaud's threads API. A couple of endpoints replace the entire stack:

Create a thread

curl https://api.qlaud.ai/v1/threads \
  -H "Authorization: Bearer qlk_live_…" \
  -H "Content-Type: application/json" \
  -d '{ "end_user_id": "user_42", "metadata": { "topic": "support" } }'

Send a message — streams + persists in one call

curl https://api.qlaud.ai/v1/threads/{id}/messages \
  -H "Authorization: Bearer qlk_live_…" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "stream": true,
    "content": [{ "type": "text", "text": "What is my plan?" }],
    "tools_mode": "tenant"
  }'

Fetch sequenced history

const { data } = await qlaud.threads.messages({
  thread_id: "thread_eace4f23",
});
// → ordered messages, with text + tool_use + tool_result blocks intact
// → token counts, request_ids, end_user_ids all there
// → automatic dedup, cursor pagination via ?limit + ?cursor

What you get for free, in order of "what would otherwise have taken you a week":

  • Streams that don't lose data. Server-side buffering of every chunk as it lands. Reconnect = resume; refresh = full history. No half-finished messages.
  • Tool calls persisted as first-class history. Every tool_use and tool_result block lands in the same conversation record. Audit trails, debugging, agent retry logic — all become readable.
  • Sequencing per end-user. Each thread carries an end_user_id; messages get a per-thread sequence number; two browsers writing to the same thread don't desync.
  • Deduplication on request-id. Retries don't double-write. 5xx-then-retry just hits the same message slot.
  • Vector search built in. Every assistant message gets embedded and indexed. GET /v1/search returns semantically similar past messages — no Pinecone integration, no embedding pipeline.
  • Cross-model. Claude, GPT, DeepSeek, Gemini, etc. Same thread can flip models mid-conversation; the history shape stays consistent.

Time to integrate: roughly the same 30 minutes as the original chat textarea. The 14 days of stack-building skipped.

When to roll your own anyway

I'm not arguing nobody should build their own chat backend. The honest framing is: build it when chat infrastructure is your competitive advantage. Notion, Linear, Slack — those companies own their persistence layer because the database IS the product.

For everyone else — and that's most teams — the persistence layer is a tax. You pay it because you have to, and you'd rather pay $15/month for someone else to operate it than pay 14 days of senior eng time + ongoing on-call.

Some heuristics for when to take on the work yourself:

  • You need data residency in a specific region that no gateway offers (uncommon — most ship on Cloudflare, AWS, or GCP edge).
  • Your conversation messages are 100KB+ each (rich embedded media that doesn't fit JSON). Most managed stacks have row size limits around 64-128KB.
  • You need millisecond-tight read latency (e.g., autocomplete). Most managed gateways are P99 ~50-150ms. Below that requires colocation.
  • Compliance audit requires that no data ever leaves your infra. Some healthcare and government contracts require this.

If none of those apply, the math is straightforward. Use the managed version, ship the actual product.

The bottom line

Shipping AI chat is shipping ~12 features wearing a textarea costume. The textarea is one feature. The other eleven take three weeks. Most teams either ship without them (which feels hacky), or build them all (which delays the product). The third option — outsource the eleven so you can focus on the textarea your users actually see — is the one I wish I'd taken from day one.

If you want to try the gateway version, qlaud has a free tier with $200 starter credit and works with the OpenAI / Anthropic / ElevenLabs SDKs you already use. The threads API is documented at docs.qlaud.ai/api-reference/threads; the recipe book for tools is at docs.qlaud.ai/api-reference/tool-examples.

Or build it yourself. Just count the days first.

#ai chat persistence · #ai chat streaming · #openai streaming reassembly · #ai conversation history database · #ai agent infrastructure · #ai app backend · #tool calling persistence · #stream chunks ai chat · #managed ai stack · #llm gateway

Frequently asked questions

Why not just use the OpenAI Assistants API for chat persistence?

Assistants API gives you threads + persistence on OpenAI's infra, but only for OpenAI models. The moment you want Anthropic, DeepSeek, or any local model in the same conversation history, you're back to building it yourself. Plus Assistants pricing is unpredictable (file storage + retrieval costs add up). A model-agnostic gateway with persistence is the cleaner long-term architecture.

Can I use Vercel KV / Supabase for chat persistence and skip the gateway?

Yes — but you still own the streaming reassembly, the deduplication on retries, the tool_use/tool_result audit trail, and the history endpoint. Persistence storage is the easy 20%; reliable streams + tool history is the hard 80%. The gateway approach addresses all of it server-side. The right call depends on whether your competitive advantage is in chat infrastructure (then build it) or in your domain product (then don't).

What happens to in-flight messages when the user's browser disconnects?

With a roll-your-own setup, the partial response is gone unless you've explicitly buffered it server-side. With qlaud's threads API, the message is captured at the gateway as it streams, persisted, and retrievable on reconnect. The user reloads the page and the message is still there. This is the kind of polish that makes chat apps feel solid vs. feel hacky.

Do I need a vector database for chat history?

If you want semantic search over past conversations (e.g., 'find all messages where I asked about X'), yes. If you only want last-N-messages context, no — a simple ordered list is enough. qlaud bundles vector search via Cloudflare Vectorize so semantic retrieval is one API call rather than a separate Pinecone integration; for last-N you just hit the messages endpoint with a limit.

How does this work with tool calls that return errors mid-conversation?

Tool failures are first-class in the persisted history — every tool_use block has a matching tool_result block, including is_error: true and the error message. The model sees the failure on retry and can route around it. This is the 'tool history' part most roll-your-own implementations skip, and it's why 'AI agent loops' end up brittle without it.
