I added a chat UI to my app in 30 minutes. The actual chat infrastructure took three weeks. This post is what those three weeks were spent on, in what order each piece broke, and how I'd shortcut most of it if I were starting over.
If you've shipped AI chat before, you know what I'm about to describe. If you haven't yet, this is the post I wish I'd read first — half warning, half tutorial, with a path through the swamp at the end.
What "ship AI chat" actually means
A chat box is a textarea, an `EventSource` for streaming, and a list of message bubbles. That's the part that takes 30 minutes. Then real users start using it and you discover everything chat needs to be a real product:
- Stream reassembly when the connection drops mid-response
- Persistence so refresh doesn't lose the conversation
- Deduplication when the user retries on a 5xx
- Tool calls that need to be auditable in conversation history
- Per-user sequencing so two browsers don't desync
- A history endpoint, sortable + paginated, with backpressure
- Cleanup of dangling streams that the client never closed
- A schema that doesn't fall apart when you add models with different shapes
Each of these is a fixable problem. The aggregate cost is what surprises you.
What breaks first: streaming reassembly on flaky networks
Your first beta tester closes their laptop mid-response. They reopen ten minutes later. The chat shows a half-finished assistant message ending in "Therefore, the optimal strate—" and that's it. Reload the page; the half-message is gone, no recovery, no resume.
The fix is buffering the stream server-side, not just relaying it. You need a server endpoint that:
- Receives chunks from the upstream model
- Forwards them to the client over an SSE/websocket
- Also writes them to durable storage as they land
- On client reconnect, replays whatever was buffered + continues the live stream
That's a non-trivial pattern. Cloudflare Durable Objects work well for it (single writer per thread, state survives reconnects). Postgres + LISTEN/NOTIFY works too if you commit to managing the connection pool. On AWS, the usual combination is API Gateway WebSockets + DynamoDB.
Whatever you pick, you've now committed to a specific infra primitive that's load-bearing in the hot path of every chat interaction. Pick wrong here and you're refactoring it in 6 months when the reconnect story leaks at scale.
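Whichever primitive you pick, the replay core looks the same. Here's a minimal in-memory sketch of the buffer-and-replay logic; the `StreamBuffer` class and its method names are illustrative, not any particular platform's API, and in production the `chunks` array would live in your durable store:

```javascript
// In-memory sketch of buffer-and-replay. In production, `chunks` lives in
// durable storage (Durable Object state, a Postgres row, etc.).
class StreamBuffer {
  constructor() {
    this.chunks = [];    // everything received from the upstream model
    this.done = false;   // upstream stream finished
    this.listeners = []; // live client connections
  }

  // Called as chunks arrive from the model: persist first, then fan out.
  push(chunk) {
    this.chunks.push(chunk); // the durable write happens here
    this.listeners.forEach((fn) => fn(chunk));
  }

  finish() {
    this.done = true;
  }

  // Called when a client connects or reconnects. `fromIndex` is the number
  // of chunks the client already has (0 on first connect).
  subscribe(fromIndex, onChunk) {
    // Replay the buffered portion first...
    this.chunks.slice(fromIndex).forEach(onChunk);
    // ...then attach for the live tail, unless the stream already ended.
    if (!this.done) this.listeners.push(onChunk);
  }
}
```

A reconnecting client sends the index of the last chunk it received; `subscribe` replays the gap and then attaches the client to the live tail, so the half-finished "Therefore, the optimal strate—" case resolves itself on reconnect.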
What breaks second: the persistence schema
OK, you persist messages now. What's the schema?
The naive version:
```sql
-- v1 schema
CREATE TABLE messages (
  id uuid PRIMARY KEY,
  thread_id uuid NOT NULL,
  role text NOT NULL, -- 'user' | 'assistant'
  content text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);
```

That works for two days. Then you discover:
Multi-modal content. "Content" isn't just text — it's an array of blocks: text, image, tool_use, tool_result. Now you need either JSONB or a separate message_blocks table.
Token counts. You need them for billing, retention, rate limiting. They have to land on the message row at the moment the stream completes, which is a separate event from when you first inserted the row. You either eagerly upsert them (hot-path write contention) or eventually-update via background job (eventual consistency in your UI).
Sequencing. Two browser tabs send a message concurrently. You need a strict ordering. A naive auto-incrementing integer doesn't work across writers; UUIDs don't sort chronologically; clock-based ordering breaks under skew. You end up with a per-thread sequence number that requires a row lock or a Durable Object.
Per-end-user scoping. If you're building a B2B product where companies have many end-users, you need to scope every query by end_user_id. Add a column, add an index, add an explicit WHERE clause to every read path. Forget one and you have a tenant-leak bug.
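Of these, sequencing is the one that bites in production. The allocation logic itself is tiny; the hard part is making it single-writer. A sketch, with a Map standing in for the locked counter row or Durable Object state (names illustrative):

```javascript
// Per-thread sequence allocation, sketched in memory. In Postgres this is
// an UPDATE ... RETURNING on a per-thread counter row (which takes the row
// lock for you); in a Durable Object it's plain local state, because the
// object is the single writer for its thread.
const lastSeq = new Map(); // thread_id -> last allocated seq

function nextSeq(threadId) {
  const next = (lastSeq.get(threadId) ?? 0) + 1;
  lastSeq.set(threadId, next);
  return next;
}
```

Two tabs writing to the same thread get 1 and 2 in some serialized order; a different thread starts back at 1.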
Schema v3, eight commits later, looks more like:
```sql
CREATE TABLE threads (
  id uuid PRIMARY KEY,
  user_id text NOT NULL,
  end_user_id text NOT NULL,
  metadata jsonb,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE thread_messages (
  thread_id uuid NOT NULL REFERENCES threads(id),
  seq integer NOT NULL,   -- per-thread sequence
  role text NOT NULL,
  content jsonb NOT NULL, -- array of blocks
  request_id text,        -- for dedup
  token_count integer,
  created_at timestamptz NOT NULL,
  PRIMARY KEY (thread_id, seq)
);

-- Lookups by thread_id are already covered by the composite primary key.
CREATE INDEX idx_threads_end_user ON threads(end_user_id);
```

Plus migrations. Plus a backfill plan when you change anything. Plus rate-limited cleanup of orphaned threads. Plus a vacuum strategy for the inevitable bloat.
What breaks third: tool calls that vanish from history
Then you add tool calling. The model emits a tool_use block ("call get_weather('SF')"). Your code dispatches the tool, gets a result ("72°F, sunny"), feeds it back into the next request as a tool_result block.
Question: do you persist those tool_use and tool_result blocks?
First reaction: "no, those are internal — only show user-visible messages." Two weeks later: a user asks "why did the AI tell me my appointment was at 3pm? It should have said 2pm." You go to debug it. You can see the user's message ("when's my appointment?") and the assistant's reply ("3pm"). You CAN'T see the get_calendar tool call, what it returned, or whether it errored.
Tool call history isn't optional. It's THE audit trail when an agent makes a wrong decision. Persist it. Now your thread_messages content blocks include four kinds: text, image, tool_use, tool_result. Your read paths have to filter to user-visible content unless an admin is debugging.
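The read-path filter is small; the decision to store everything is the part that matters. A sketch, assuming blocks carry a `type` field matching the four kinds above (the `admin` flag is illustrative):

```javascript
// Blocks are stored in full; the read path decides what each viewer sees.
const USER_VISIBLE = new Set(["text", "image"]);

function visibleBlocks(blocks, { admin = false } = {}) {
  if (admin) return blocks; // debugging view: tool traffic included
  return blocks.filter((b) => USER_VISIBLE.has(b.type));
}
```

The appointment bug above becomes debuggable: the admin view shows the `get_calendar` call and its result sitting between the user's question and the wrong answer.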
What breaks fourth: deduplication
The user sends a message. Your client gets a 502 from your edge worker (CDN hiccup, container restart, whatever). The client retries. Now you have two identical user messages in the thread.
Dedup is per-request-id, not per-content. Generate an idempotency key client-side, send it as a header, persist it as a column, fail gracefully on conflict. Easy to describe, easy to forget, hard to retrofit into a system that already has dirty data.
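The whole mechanism fits in a few lines. A sketch, with a Map standing in for a UNIQUE constraint on the `request_id` column (names illustrative):

```javascript
// First write wins. A retry carrying the same idempotency key gets the
// original row back instead of inserting a duplicate.
const byRequestId = new Map();

function insertMessage(requestId, message) {
  const existing = byRequestId.get(requestId);
  if (existing) return { deduped: true, message: existing };
  byRequestId.set(requestId, message);
  return { deduped: false, message };
}
```

The client generates the key once per send (e.g. `crypto.randomUUID()`) and reuses it on every retry of that send; a fresh send gets a fresh key.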
What breaks fifth: the history endpoint
Your client needs to load past messages on page refresh. So you write GET /api/threads/:id/messages. Easy. It returns a JSON array of messages.
Then a power user has a 6,000-message conversation. Your endpoint returns 12MB of JSON. The browser hangs while parsing. You add ?limit=50. Now you need cursor-based pagination because offset-based starts skipping messages when new ones arrive while scrolling. You write the cursor encoding, decode it, validate it, handle malformed cursors gracefully. Another two days.
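Keying the cursor on the per-thread `seq` rather than an offset is what keeps pages stable while new messages arrive. A minimal sketch of the encode/decode/paginate trio (Node's `Buffer` does the base64url work; the shapes are illustrative):

```javascript
// Cursor pagination keyed on the per-thread seq. The cursor is just an
// encoded "last seq seen", so new messages never shift earlier pages.
function encodeCursor(seq) {
  return Buffer.from(String(seq)).toString("base64url");
}

function decodeCursor(cursor) {
  const n = Number(Buffer.from(cursor, "base64url").toString());
  if (!Number.isInteger(n) || n < 0) throw new Error("malformed cursor");
  return n;
}

function page(messages, { limit = 50, cursor = null } = {}) {
  const after = cursor ? decodeCursor(cursor) : 0;
  const slice = messages.filter((m) => m.seq > after).slice(0, limit);
  const last = slice[slice.length - 1];
  return {
    data: slice,
    // A short page means we've reached the end; no cursor to hand back.
    nextCursor: slice.length === limit && last ? encodeCursor(last.seq) : null,
  };
}
```

The client loops on `nextCursor` until it comes back null; a message inserted mid-scroll lands after the cursor position and shows up on a later page instead of shifting the current one.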
The accumulated cost
Let's tally:
- Streaming reassembly + buffer: ~2 days
- Persistence schema (with the migrations to get to v3): ~3 days
- Tool call history: ~1.5 days
- Deduplication: ~0.5 days
- History endpoint with cursor pagination: ~1 day
- Per-end-user scoping audit (going through every endpoint): ~1 day
- Bug fixes from production usage: ~3 days
- Vector search for "find past conversation about X" (Pinecone setup, embedding pipeline): ~2 days
Total: ~14 days of senior engineer time, conservatively. Plus the ongoing cost of maintaining all of it as your model providers add new block types, your schema needs new fields, and you have to keep migrations + backfills consistent.
That's the time you spent NOT shipping product features. For the chunk of teams whose differentiation is the chat application — not the infrastructure — this is dead weight.
The shortcut: a managed thread API
Most of the above isn't novel. Every team building AI chat re-derives the same pattern. So we wrote the gateway version: qlaud's threads API. Two endpoints replace the entire stack:
Create a thread
```shell
curl https://api.qlaud.ai/v1/threads \
  -H "Authorization: Bearer qlk_live_…" \
  -d '{ "end_user_id": "user_42", "metadata": { "topic": "support" } }'
```

Send a message — streams + persists in one call
```shell
curl https://api.qlaud.ai/v1/threads/{id}/messages \
  -H "Authorization: Bearer qlk_live_…" \
  -d '{
    "model": "claude-sonnet-4-6",
    "stream": true,
    "content": [{ "type": "text", "text": "What is my plan?" }],
    "tools_mode": "tenant"
  }'
```

Fetch sequenced history
```js
const { data } = await qlaud.threads.messages({
  thread_id: "thread_eace4f23",
});
// → ordered messages, with text + tool_use + tool_result blocks intact
// → token counts, request_ids, end_user_ids all there
// → automatic dedup, cursor pagination via ?limit + ?cursor
```

What you get for free, in order of "what would otherwise have taken you a week":
- Streams that don't lose data. Server-side buffering of every chunk as it lands. Reconnect = resume; refresh = full history. No half-finished messages.
- Tool calls persisted as first-class history. Every `tool_use` and `tool_result` block lands in the same conversation record. Audit trails, debugging, agent retry logic — all become readable.
- Sequencing per end-user. Each thread carries an `end_user_id`; messages get a per-thread sequence number; two browsers writing to the same thread don't desync.
- Deduplication on request-id. Retries don't double-write. 5xx-then-retry just hits the same message slot.
- Vector search built in. Every assistant message gets embedded and indexed. `GET /v1/search` returns semantically similar past messages — no Pinecone integration, no embedding pipeline.
- Cross-model. Claude, GPT, DeepSeek, Gemini, etc. The same thread can flip models mid-conversation; the history shape stays consistent.
Time to integrate: roughly the same 30 minutes as the original chat textarea. The 14 days of stack-building skipped.
When to roll your own anyway
I'm not arguing nobody should build their own chat backend. The honest framing is: build it when chat infrastructure is your competitive advantage. Notion, Linear, Slack — those companies own their persistence layer because the database IS the product.
For everyone else — and that's most teams — the persistence layer is a tax. You pay it because you have to, and you'd rather pay $15/month for someone else to operate it than pay 14 days of senior eng time + ongoing on-call.
Some heuristics for when to take on the work yourself:
- You need data residency in a specific region that no gateway offers (uncommon — most ship on Cloudflare, AWS, or GCP edge).
- Your conversation messages are 100KB+ each (rich embedded media that doesn't fit comfortably in a JSON blob). Most managed stacks cap row size around 64-128KB.
- You need millisecond-tight read latency (e.g., autocomplete). Most managed gateways are P99 ~50-150ms. Below that requires colocation.
- Compliance audit requires that no data ever leaves your infra. Some healthcare and government contracts require this.
If none of those apply, the math is straightforward. Use the managed version, ship the actual product.
The bottom line
Shipping AI chat is shipping ~12 features wearing a textarea costume. The textarea is one feature. The other eleven take three weeks. Most teams either ship without them (which feels hacky), or build them all (which delays the product). The third option — outsource the eleven so you can focus on the textarea your users actually see — is the one I wish I'd taken from day one.
If you want to try the gateway version, qlaud has a free tier with $200 starter credit and works with the OpenAI / Anthropic / ElevenLabs SDKs you already use. The threads API is documented at docs.qlaud.ai/api-reference/threads; the recipe book for tools is at docs.qlaud.ai/api-reference/tool-examples.
Or build it yourself. Just count the days first.