
Streaming AI Agent Output in Real-Time: Our WebSocket Architecture

How we built a real-time streaming pipeline from Claude API through Go WebSockets to React with backpressure.

websocket, real-time, streaming, go, react, system-design, backpressure
ChatML Team


When you watch a Claude agent edit a file, run a bash command, and explain its reasoning in ChatML, the output appears token by token in your session view. Behind that simple experience is a five-stage streaming pipeline that routes structured, multiplexed, real-time data from the Claude API through a Go WebSocket hub to a React frontend --- without dropping events or overwhelming the client.

TL;DR Architecture Decisions: Five-stage pipeline (Claude API → Node.js agent runner → Go WebSocket hub → browser WebSocket → React renderer). Non-blocking fan-out disconnects slow clients instead of blocking fast ones. Three-level backpressure: per-client buffer (256 messages), agent runner throttling, and client-side requestAnimationFrame batching. End-to-end latency under 50ms. If you're building real-time LLM streaming for a browser-based or desktop tool, this architecture shows how to stream LLM output to the browser reliably.

This post is a reference architecture for anyone building LLM-powered developer tools with real-time streaming. We will walk through every stage of the pipeline, share the Go hub implementation that sits at its center, and explain the backpressure strategy that keeps the system stable under load.

If you are curious about the broader technology decisions behind ChatML, start with why we chose four languages for one app. For the desktop shell that hosts this pipeline, see our post on choosing Tauri 2 over Electron.

The streaming challenge

AI agent output is not just text. A single Claude session produces a structured stream of heterogeneous events: text tokens, tool calls (file reads, writes, bash commands), tool results with potentially large payloads, sub-agent spawns, status updates, and errors. Each event type requires different rendering in the UI. A text delta appends to a running markdown block. A tool call creates an expandable card. A sub-agent spawn opens a nested conversation thread.

And ChatML runs multiple agent sessions simultaneously --- each one an independent stream. A user might have three agents working across different git worktrees, all producing output at the same time. The system must multiplex these streams to the correct UI views without cross-contamination, handle late-joining clients that connect to a session already in progress, and degrade gracefully when any part of the pipeline falls behind.

This is not a simple "pipe SSE to the browser" problem. It is a structured, multiplexed, real-time data routing problem with backpressure requirements at every boundary.

The five-stage pipeline

Here is the full data flow from Claude's response to rendered pixels:

Stage 1: Claude API to Agent Runner (Node.js). The Claude API delivers responses as a server-sent event stream. Our Node.js agent runner consumes this SSE stream and processes each event --- accumulating text deltas, detecting tool use blocks, executing tools, and feeding results back to the API for the next turn.

Stage 2: Agent Runner to Go Backend. As the agent runner processes events, it emits structured JSON messages to stdout. Each line is a self-contained JSON object describing what happened: a text delta, a tool invocation, a tool result, or a lifecycle event. The Go backend spawns the agent runner as a child process and reads these messages line by line.

Stage 3: Go Backend (Hub). The Go backend parses each JSON message, tags it with the session ID, and hands it to the WebSocket hub. The hub maintains a registry of connected browser clients and their session subscriptions. It routes each message to every client subscribed to that session, buffers messages for history replay, and enforces backpressure limits.

Stage 4: Go Backend to Browser. The hub writes each message as a WebSocket text frame. The browser receives a stream of typed JSON objects, each one a discrete event in the agent's execution.

Stage 5: Browser (React). The React frontend dispatches each event to the appropriate renderer --- markdown for text, expandable cards for tool calls, syntax-highlighted blocks for tool results, nested threads for sub-agents --- and composites them into a scrollable session timeline.

The boundaries between stages are intentional. Each one is a natural point for buffering, backpressure, and independent failure recovery.

The Go WebSocket hub

The hub is the heart of the WebSocket streaming architecture. It is a concurrent pub/sub router that accepts messages from agent runners and fans them out to browser clients. The design follows a pattern that will be familiar to anyone who has read the Gorilla WebSocket chat example, but extended for multi-session routing and history replay.

The core data structures:

type Hub struct {
    // Registered clients, keyed by session ID
    sessions map[string]map[*Client]bool
 
    // Buffered message history per session for replay
    history map[string][]*Message
 
    // Channels for the hub's main loop
    register   chan *Subscription
    unregister chan *Subscription
    broadcast  chan *Message
 
    mu sync.RWMutex
}
 
type Client struct {
    conn *websocket.Conn
    send chan []byte // buffered outbound channel
}
 
type Subscription struct {
    client    *Client
    sessionID string
}
 
type Message struct {
    SessionID string          `json:"sessionId"`
    Type      string          `json:"type"`
    Payload   json.RawMessage `json:"payload"`
    Timestamp int64           `json:"ts"`
}

The hub runs a single goroutine with a select loop. This is the standard Go pattern for a concurrent coordinator --- a single goroutine owns the mutable state and communicates with the outside world through channels:

func (h *Hub) Run() {
    for {
        select {
        case sub := <-h.register:
            clients, ok := h.sessions[sub.sessionID]
            if !ok {
                clients = make(map[*Client]bool)
                h.sessions[sub.sessionID] = clients
            }
            clients[sub.client] = true
 
            // Replay buffered history to the new client
            if history, ok := h.history[sub.sessionID]; ok {
            replay:
                for _, msg := range history {
                    data, _ := json.Marshal(msg)
                    select {
                    case sub.client.send <- data:
                    default:
                        // Client buffer full on connect; drop it.
                        // A bare break would only exit the select,
                        // so a labeled break ends the whole replay.
                        h.removeClient(sub)
                        break replay
                    }
                }
            }
 
        case sub := <-h.unregister:
            h.removeClient(sub)
 
        case msg := <-h.broadcast:
            // Append to session history
            h.appendHistory(msg)
 
            // Fan out to all subscribed clients
            data, _ := json.Marshal(msg)
            if clients, ok := h.sessions[msg.SessionID]; ok {
                for client := range clients {
                    select {
                    case client.send <- data:
                    default:
                        // Slow client; disconnect rather than block
                        h.removeClient(&Subscription{
                            client:    client,
                            sessionID: msg.SessionID,
                        })
                    }
                }
            }
        }
    }
}

Each connected client has two goroutines: a read pump that listens for control messages (subscribe, unsubscribe, ping) and a write pump that drains the send channel to the WebSocket:

func (c *Client) writePump() {
    ticker := time.NewTicker(pingPeriod)
    defer func() {
        ticker.Stop()
        c.conn.Close()
    }()
    for {
        select {
        case message, ok := <-c.send:
            c.conn.SetWriteDeadline(time.Now().Add(writeWait))
            if !ok {
                c.conn.WriteMessage(websocket.CloseMessage, []byte{})
                return
            }
            if err := c.conn.WriteMessage(
                websocket.TextMessage, message,
            ); err != nil {
                return
            }
        case <-ticker.C:
            c.conn.SetWriteDeadline(time.Now().Add(writeWait))
            if err := c.conn.WriteMessage(
                websocket.PingMessage, nil,
            ); err != nil {
                return
            }
        }
    }
}

This Go WebSocket hub design gives us several properties that matter for real-time AI streaming: the hub never blocks on a slow client (it disconnects them instead), late-joining clients get history replay so they see the full session context, and the single-goroutine coordinator eliminates lock contention on the session map.
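For symmetry with the write pump above, here is a simplified, self-contained sketch of the routing the read pump performs for subscribe and unsubscribe control messages. The `controlMessage` shape and `dispatchControl` helper are assumptions; the post does not show the real implementation, and the real read pump also manages pong deadlines:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// controlMessage is what the browser sends on the socket's read side.
// The field names here are assumptions, not ChatML's exact wire format.
type controlMessage struct {
	Type      string `json:"type"` // "subscribe" or "unsubscribe"
	SessionID string `json:"sessionId"`
}

// dispatchControl parses one inbound frame and forwards the session ID to
// the hub's register or unregister channel (simplified to string channels
// here; the real hub uses *Subscription values).
func dispatchControl(raw []byte, register, unregister chan<- string) error {
	var msg controlMessage
	if err := json.Unmarshal(raw, &msg); err != nil {
		return err
	}
	switch msg.Type {
	case "subscribe":
		register <- msg.SessionID
	case "unsubscribe":
		unregister <- msg.SessionID
	}
	return nil
}

func main() {
	reg := make(chan string, 1)
	unreg := make(chan string, 1)
	_ = dispatchControl([]byte(`{"type":"subscribe","sessionId":"s_abc123"}`), reg, unreg)
	fmt.Println(<-reg) // s_abc123
}
```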

Backpressure management

Uncontrolled real-time streaming is a recipe for memory exhaustion and cascading failures. We manage backpressure at three distinct levels.

Level 1: Per-client send buffer

Each WebSocket client has a buffered channel (default capacity: 256 messages). When the hub broadcasts a message, it attempts a non-blocking send to each client's channel. If the channel is full --- meaning the client has fallen 256 messages behind --- the hub disconnects the client instead of blocking.

This is the critical design decision. The alternative, blocking the broadcast until the slow client catches up, would stall every other client subscribed to the same session. One laggy browser tab would freeze every other viewer. Disconnecting the slow client and letting it reconnect (with history replay) is strictly better.

select {
case client.send <- data:
    // Delivered to client buffer
default:
    // Buffer full: slow client gets disconnected
    h.removeClient(sub)
}

Level 2: Agent runner throttling

If the Go backend's internal message buffers grow beyond a high-water mark, it signals the agent runner to slow down. The agent runner and Go backend communicate bidirectionally: stdout carries messages from the runner to Go, and stdin carries control signals from Go to the runner.

When buffer pressure exceeds the threshold, the Go backend writes a {"type": "throttle", "delay_ms": 100} message to the runner's stdin. The runner inserts a small delay between API calls to Claude, reducing the inbound event rate. When pressure drops below the low-water mark, a {"type": "resume"} message clears the throttle.

This is a coarse-grained backpressure mechanism. We rarely hit it in practice --- the Claude API's own latency is usually the bottleneck --- but it prevents runaway memory growth if the frontend is overwhelmed (e.g., the user minimizes the window and the OS throttles the renderer).

Level 3: Client-side batching

Even with a well-managed WebSocket pipeline, naively calling setState on every incoming token would overwhelm React's reconciliation. A fast Claude response can produce 50--80 tokens per second. Rendering each one individually would mean 50--80 re-renders per second, each triggering layout and paint.

Instead, the frontend collects incoming events in a buffer and flushes them to React state in a requestAnimationFrame loop:

const bufferRef = useRef<AgentEvent[]>([]);
const rafRef = useRef<number | null>(null);
 
function onMessage(event: AgentEvent) {
  bufferRef.current.push(event);
 
  if (rafRef.current === null) {
    rafRef.current = requestAnimationFrame(() => {
      const batch = bufferRef.current;
      bufferRef.current = [];
      rafRef.current = null;
 
      dispatch({ type: "BATCH_EVENTS", events: batch });
    });
  }
}

This batches all events that arrive within a single animation frame (roughly 16ms at 60fps) into a single state update. The result is smooth, jank-free streaming even during bursts of rapid token output. The approach is similar to what the React team recommends for high-frequency updates --- decouple the event ingestion rate from the render rate.

The agent runner JSON protocol

The Node.js agent runner communicates with the Go backend through a line-delimited JSON protocol over stdout. Each line is a complete JSON object with a type field that determines how the message is processed and rendered.

Here is a simplified sequence showing a typical agent interaction:

{"type":"text_delta","sessionId":"s_abc123","delta":"Let me read the configuration file."}
{"type":"text_delta","sessionId":"s_abc123","delta":"\n"}
{"type":"tool_use","sessionId":"s_abc123","tool":"read_file","input":{"path":"/app/config.ts"}}
{"type":"tool_result","sessionId":"s_abc123","tool":"read_file","output":"export const config = {\n  port: 3000,\n  host: 'localhost'\n};\n"}
{"type":"text_delta","sessionId":"s_abc123","delta":"I can see the config uses port 3000. Let me update it."}
{"type":"tool_use","sessionId":"s_abc123","tool":"edit_file","input":{"path":"/app/config.ts","old":"port: 3000","new":"port: 8080"}}
{"type":"tool_result","sessionId":"s_abc123","tool":"edit_file","output":"OK"}
{"type":"agent_spawn","sessionId":"s_abc123","childId":"s_def456","task":"Run the test suite"}
{"type":"agent_complete","sessionId":"s_abc123","status":"success"}

A few design decisions in this protocol are worth noting. First, text_delta messages carry incremental text, not the full accumulated text. This keeps message sizes small and lets the frontend do its own accumulation. Second, tool_use and tool_result are separate messages because tool execution can take significant time --- the frontend shows a "running" state between them. Third, agent_spawn includes the child session ID so the frontend can open a nested subscription and render the sub-agent's output inline.
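One way to consume this protocol on the receiving side is a two-pass decode: unmarshal the shared envelope first, then the concrete payload for known types. The structs below cover only the fields visible in the example lines above and are a sketch, not the full schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope carries the fields common to every protocol line.
type envelope struct {
	Type      string `json:"type"`
	SessionID string `json:"sessionId"`
}

// Concrete payloads for two of the event types shown above.
type textDelta struct {
	Delta string `json:"delta"`
}

type toolUse struct {
	Tool  string          `json:"tool"`
	Input json.RawMessage `json:"input"`
}

// decodeEvent returns the envelope plus a typed payload for event types the
// backend understands; unknown types pass through with a nil payload.
func decodeEvent(line []byte) (envelope, any, error) {
	var env envelope
	if err := json.Unmarshal(line, &env); err != nil {
		return env, nil, err
	}
	switch env.Type {
	case "text_delta":
		var ev textDelta
		err := json.Unmarshal(line, &ev)
		return env, ev, err
	case "tool_use":
		var ev toolUse
		err := json.Unmarshal(line, &ev)
		return env, ev, err
	default:
		return env, nil, nil
	}
}

func main() {
	line := []byte(`{"type":"tool_use","sessionId":"s_abc123","tool":"read_file","input":{"path":"/app/config.ts"}}`)
	env, payload, err := decodeEvent(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(env.Type, payload.(toolUse).Tool) // tool_use read_file
}
```

Passing unknown types through untyped keeps the backend forward-compatible: a new event type added to the runner can reach the browser without a Go redeploy.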

The protocol uses stderr for out-of-band diagnostics (logs, performance traces) that are captured by the Go backend but not forwarded to the browser.

Frontend rendering

The React frontend maps each event type to a specialized renderer:

Text deltas stream into a markdown component. We accumulate deltas into a string buffer and render it through a lightweight markdown parser. The parser runs incrementally --- it does not re-parse the entire document on every delta. For long responses, this keeps render time constant rather than linear in response length.

Tool calls render as collapsible cards showing the tool name, the input (file path, command, etc.), and a spinning indicator while the tool executes. When the corresponding tool_result arrives, the card expands to show the output with syntax highlighting. File diffs get a specialized renderer with additions and deletions highlighted.

Sub-agents render as nested conversation threads. When an agent_spawn event arrives, the frontend opens a new WebSocket subscription for the child session and renders it as an indented, collapsible thread within the parent session view. This can nest multiple levels deep --- an agent spawning a sub-agent that spawns another sub-agent.

The scroll problem deserves special mention. The session view is a long, vertically scrollable timeline where new content pushes in at the bottom. The expected behavior: if the user is at the bottom, they should stay at the bottom as new content arrives (auto-scroll). If the user has scrolled up to review earlier output, they should not be yanked back down.

We track this with a simple heuristic:

import { RefObject, useCallback, useEffect, useRef } from "react";

function useAutoScroll(containerRef: RefObject<HTMLElement>) {
  const isNearBottom = useRef(true);
 
  useEffect(() => {
    const el = containerRef.current;
    if (!el) return;
 
    const onScroll = () => {
      const threshold = 80; // px from bottom
      isNearBottom.current =
        el.scrollHeight - el.scrollTop - el.clientHeight < threshold;
    };
 
    el.addEventListener("scroll", onScroll, { passive: true });
    return () => el.removeEventListener("scroll", onScroll);
  }, [containerRef]);
 
  const scrollToBottom = useCallback(() => {
    if (isNearBottom.current && containerRef.current) {
      containerRef.current.scrollTop = containerRef.current.scrollHeight;
    }
  }, [containerRef]);
 
  return scrollToBottom;
}

The scrollToBottom function is called after each batched state update. It only scrolls if the user was already near the bottom, preserving their scroll position during review.

For long-running sessions that accumulate thousands of events, we virtualize the timeline using a windowed list. Only the visible events (plus a small overscan buffer) are mounted in the DOM. This keeps the React component tree bounded regardless of session length.

Performance characteristics

The end-to-end latency from a token leaving the Claude API to a pixel changing on screen is typically under 50ms. The breakdown:

  • Claude API to agent runner: Near-zero additional latency (SSE is consumed as it arrives).
  • Agent runner to Go backend: Sub-millisecond (stdout pipe on the same machine).
  • Go hub routing: Sub-millisecond (channel send plus JSON marshal).
  • WebSocket frame to browser: 1--2ms (local loopback; this is a desktop app).
  • React render: 5--15ms depending on event type and batch size.

The dominant latency is React rendering, which is why the client-side batching strategy matters.

The system handles 10+ concurrent agent sessions on a standard MacBook without degradation. Each session consumes roughly 2MB of memory in the Go backend (history buffer plus client send buffers) and a proportional amount in the React frontend (component tree for rendered events). Memory usage is bounded by configurable buffer limits --- when a session's history exceeds the cap, the oldest events are evicted.

CPU utilization during active streaming is modest. The Go hub spends most of its time blocked in select, waking briefly to route each message. The React frontend's batching strategy keeps it at a steady 60fps with CPU usage proportional to the visible content rather than the inbound event rate.

Wrapping up

Building a real-time streaming pipeline for AI agent output is a different problem from building a chat interface. The heterogeneous event types, concurrent sessions, and variable-rate output require a structured routing layer with explicit backpressure at every boundary.

The Go WebSocket hub pattern --- a single coordinator goroutine with channel-based fan-out and non-blocking sends --- has proven to be a reliable foundation. It is simple enough to reason about under pressure and performant enough that we have never needed to optimize it. The three-level backpressure strategy (per-client buffer, runner throttling, client-side batching) keeps the system stable across a wide range of conditions without requiring manual tuning.

If you are building developer tools powered by LLMs, we hope this architecture serves as a useful reference. The specific technology choices (Go, Node.js, React, WebSockets) matter less than the structural decisions: explicit event typing, buffered boundaries between stages, and backpressure that disconnects rather than blocks.

Want to try ChatML?

Download ChatML