
SQLite Concurrency in Go: What We Learned Building a Desktop AI IDE

How we went from random 'database is locked' crashes to zero errors with WAL mode, connection pooling, and a 138-line retry function with exponential backoff and jitter.

sqlite, go, concurrency, desktop, database, architecture

Marcio Castilho

ChatML Team


Two weeks after we shipped the first beta of ChatML, a user reported that sessions would occasionally fail to save. No stack trace, no panic, just a quiet failure. I dug into the logs and found the error that every SQLite developer dreads:

database is locked

It was intermittent. Impossible to reproduce on demand. And it only showed up when users ran multiple AI agents in parallel --- which happens to be the entire point of the product.

ChatML is a desktop AI IDE. The whole premise is that you should be able to kick off three, four, five AI agents at once, each working in its own git worktree, each streaming output in real-time. And under the hood, all of those agents are writing to the same SQLite database on the user's machine.

We have a polyglot architecture --- Go backend, Node.js agent runner, React frontend, Rust desktop shell --- and the Go backend and every agent runner process all share a single database file. The backend writes sessions and metadata. Each agent runner writes messages and tool results as they stream in from the Claude API. Three agents running means four processes fighting over one SQLite file.

This post is about how we fixed it. We went through three iterations before landing on something that worked, and I think the journey is more interesting than the final answer.

Why SQLite in the first place

Before getting into the fix, I should explain why we chose SQLite for a production desktop app instead of something like PostgreSQL.

The short answer is simplicity. ChatML stores sessions, conversations, messages, attachments, and user preferences --- all local to the user's machine. SQLite gives us a single file we can back up by copying it. Zero configuration. No daemon to manage. No port conflicts. No "works in development but you need Postgres in production" problem. SQLite is production.

We also use the pure Go driver from modernc.org/sqlite, which means no CGO dependency. That matters more than you'd think --- it means we compile to a single static binary and cross-compile for macOS, Linux, and Windows without any C toolchain headaches.

The trade-off, of course, is that SQLite allows only one writer at a time. When multiple goroutines or processes try to write concurrently, someone loses. And when they lose, you get "database is locked."

The naive fix: just add a mutex

My first instinct was the Go developer's default reaction to any concurrency problem: throw a sync.Mutex at it.

type SQLiteStore struct {
    mu sync.Mutex
    db *sql.DB
}
 
func (s *SQLiteStore) AddSession(ctx context.Context, session *models.Session) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    _, err := s.db.ExecContext(ctx, `INSERT INTO sessions ...`, ...)
    return err
}
 
func (s *SQLiteStore) GetSessions(ctx context.Context) ([]*models.Session, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    rows, err := s.db.QueryContext(ctx, `SELECT * FROM sessions`)
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    // ... scan rows into []*models.Session
}

This worked. The "database is locked" errors disappeared immediately. But so did our performance.

The problem is that a Go mutex serializes everything. Not just writes against writes --- reads against reads too. When a user has the session list open and three agents running, the UI polls for updates constantly. With the mutex, ten goroutines reading session lists would queue behind each other, and behind any write in progress. The UI felt sluggish, updates lagged, and scroll position would jump around because reads were getting backed up behind agent writes.

We had traded one problem for another. The "database is locked" errors were gone, but the app felt slow in exactly the scenario where it needed to feel fast.

Trying again: WAL mode

After some research, I realized we were ignoring the single most important SQLite configuration for concurrent access: Write-Ahead Logging.

By default, SQLite uses a rollback journal. In this mode, writers block readers and readers block writers. It's the worst possible mode for our use case. WAL mode flips this --- readers and writers can operate concurrently. Readers see a consistent snapshot of the database while a writer is mid-transaction. For an app where the UI is constantly reading while agents are constantly writing, this changes everything.

We ripped out the mutex and configured SQLite properly:

// SQLiteStore implements data persistence using SQLite
// Note: We don't use a Go mutex because SQLite with WAL mode handles
// concurrency. The busy_timeout pragma handles lock contention at the
// database level.
type SQLiteStore struct {
    db     *sql.DB
    dbPath string
}
 
func NewSQLiteStore() (*SQLiteStore, error) {
    dbPath := appdir.DBPath()
 
    db, err := sql.Open("sqlite",
        dbPath+"?_pragma=foreign_keys(1)&_pragma=journal_mode(WAL)&_pragma=busy_timeout(5000)")
    if err != nil {
        return nil, err
    }
 
    db.SetMaxOpenConns(10)
    db.SetMaxIdleConns(5)
    db.SetConnMaxLifetime(0)
 
    // ...
}

That one DSN string does a lot of heavy lifting. journal_mode(WAL) enables Write-Ahead Logging. busy_timeout(5000) tells SQLite to wait up to 5 seconds for a lock instead of failing immediately --- it uses its own internal retry strategy during that window. foreign_keys(1) turns on referential integrity, which SQLite has disabled by default for backwards compatibility reasons.

The connection pool settings matter too. We allow 10 connections so that concurrent reads don't serialize --- with WAL mode, multiple readers can proceed simultaneously on different connections. We never expire connections because the app runs continuously and SQLite connections are cheap. There's no TCP state to worry about, no server-side session limits. Recycling them would just add overhead for no benefit.

I left a comment in the code to explain why there's no mutex. I knew future me (or someone else on the team) would look at a concurrent data store with no lock and immediately reach for one.

This was a huge improvement. The UI was responsive again. Reads never blocked behind writes. And for normal usage --- one or two agents running --- the "database is locked" errors were gone.

But we pushed harder.

The problem that WAL mode doesn't solve

With three or more agents running simultaneously, all producing rapid-fire writes --- saving messages, tool results, streaming snapshots --- we started seeing the errors again. Not as often, but they were there.

The reason is that busy_timeout has a limit. We set it to 5 seconds, which handles the vast majority of contention. But when one agent is in the middle of a transaction that inserts a batch of messages, and two other agents are also trying to write, the queue can back up. The 5-second window expires, and someone gets SQLITE_BUSY.

We needed one more layer: application-level retries.

The right solution: retry with exponential backoff and jitter

I knew we needed a retry wrapper, but I wanted to get it right. A naive retry loop --- just sleep for a fixed duration and try again --- has a well-known problem called the thundering herd. If three goroutines all fail at the same time, they all sleep for the same duration and retry at the same time, colliding again. And again. And again.

The answer is exponential backoff with jitter. You double the wait time on each attempt, and you add a random offset so concurrent retriers spread out over time instead of hitting the lock simultaneously.

But before we retry anything, we need to know which errors are worth retrying. A UNIQUE constraint failed will never succeed on retry --- retrying it just wastes time. Only transient lock errors should be retried.

func IsTransientDBError(err error) bool {
    if err == nil {
        return false
    }
 
    errStr := strings.ToLower(err.Error())
    transientPatterns := []string{
        "database is locked",
        "database table is locked",
        "sqlite_busy",
        "sqlite_locked",
    }
    for _, pattern := range transientPatterns {
        if strings.Contains(errStr, pattern) {
            return true
        }
    }
 
    return false
}

You might wonder why we use string matching instead of typed error codes. The reason is practical: in a real codebase, database errors get wrapped multiple times as they propagate through layers. fmt.Errorf("AddSession: %w", err) wraps the original error, and then the next layer wraps it again. Using errors.Is with driver-specific error types breaks as soon as anyone wraps the error without preserving the type chain. Calling err.Error() and checking the string works regardless of how deeply the error is wrapped. The strings "database is locked" and "SQLITE_BUSY" are canonical SQLite messages that haven't changed in decades. It's pragmatic, not elegant, and it works.

Here's the retry function itself:

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    JitterPct  float64
}
 
func DefaultRetryConfig() RetryConfig {
    return RetryConfig{
        MaxRetries: 5,
        BaseDelay:  50 * time.Millisecond,
        MaxDelay:   2 * time.Second,
        JitterPct:  0.25,
    }
}
 
func RetryDBOperation[T any](
    ctx context.Context,
    opName string,
    config RetryConfig,
    operation func(context.Context) (T, error),
) (T, error) {
    var result T
    var lastErr error
 
    totalAttempts := config.MaxRetries + 1
 
    for attempt := 0; attempt < totalAttempts; attempt++ {
        if ctx.Err() != nil {
            return result, fmt.Errorf("%s: context cancelled: %w",
                opName, ctx.Err())
        }
 
        result, lastErr = operation(ctx)
        if lastErr == nil {
            return result, nil
        }
 
        if !IsTransientDBError(lastErr) {
            return result, lastErr
        }
 
        if attempt < config.MaxRetries {
            delay := calculateBackoff(attempt, config)
            logger.DBRetry.Warnf(
                "%s: transient error (attempt %d/%d), retrying in %v: %v",
                opName, attempt+1, totalAttempts, delay, lastErr)
 
            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return result, fmt.Errorf(
                    "%s: context cancelled during retry: %w",
                    opName, ctx.Err())
            }
        }
    }
 
    return result, fmt.Errorf("%s: max retries exceeded: %w",
        opName, lastErr)
}

A few things to note about this. It uses Go generics so it works for both value-returning queries and void writes. The opName parameter shows up in every log message and error, so when you see "AddMessageToConversation: transient error (attempt 1/6), retrying in 52ms" in the logs, you know exactly what's happening. It checks ctx.Done() during the backoff sleep, so if the app is shutting down or an HTTP request times out, retries stop immediately instead of burning through all attempts. And if the error isn't transient --- a constraint violation, a malformed query --- it returns immediately without retrying.

For write operations that don't return a value, we have a convenience wrapper:

func RetryDBExec(
    ctx context.Context,
    opName string,
    config RetryConfig,
    operation func(context.Context) error,
) error {
    _, err := RetryDBOperation(ctx, opName, config,
        func(ctx context.Context) (struct{}, error) {
            return struct{}{}, operation(ctx)
        })
    return err
}

Nothing fancy. It just adapts a func() error into the generic signature so we don't have to duplicate the retry logic.

The backoff calculation is where the jitter lives:

func calculateBackoff(attempt int, config RetryConfig) time.Duration {
    delay := float64(config.BaseDelay) * math.Pow(2, float64(attempt))
 
    if delay > float64(config.MaxDelay) {
        delay = float64(config.MaxDelay)
    }
 
    if config.JitterPct > 0 {
        jitter := delay * config.JitterPct * (2*rand.Float64() - 1)
        delay += jitter
    }
 
    if delay < 0 {
        delay = 0
    }
 
    return time.Duration(delay)
}

The formula 2*rand.Float64() - 1 produces a random number between -1 and 1, which gets scaled by the jitter percentage. With the default 25% jitter, the first retry happens somewhere between 37ms and 62ms instead of exactly 50ms. The second retry lands between 75ms and 125ms. By the third retry, concurrent goroutines are spread across a wide enough window that contention resolves naturally.

That delay < 0 guard at the bottom looks paranoid, but we actually tested it --- with extreme jitter percentages close to 100%, the math can produce a negative number. We run 1,000 iterations with 99% jitter in our test suite to make sure it never happens.

Wiring it up

Every write operation in our store now wraps itself with RetryDBExec. The pattern is always the same:

func (s *SQLiteStore) AddSession(ctx context.Context, session *models.Session) error {
    return RetryDBExec(ctx, "AddSession", DefaultRetryConfig(),
        func(ctx context.Context) error {
            _, err := s.db.ExecContext(ctx,
                `INSERT INTO sessions (id, workspace_id, name, ...) VALUES (?, ?, ?, ...)`,
                session.ID, session.WorkspaceID, session.Name, ...)
            return err
        })
}
 
func (s *SQLiteStore) DeleteSession(ctx context.Context, id string) error {
    return RetryDBExec(ctx, "DeleteSession", DefaultRetryConfig(),
        func(ctx context.Context) error {
            _, err := s.db.ExecContext(ctx,
                `DELETE FROM sessions WHERE id = ?`, id)
            return err
        })
}

We use this across all 11 write operations --- sessions, conversations, messages, tool actions, streaming snapshots, file tabs, attachments. The operation name always matches the function name, so logs are self-documenting.

For tests, we use an in-memory store with a single connection:

func NewSQLiteStoreInMemory() (*SQLiteStore, error) {
    db, err := sql.Open("sqlite", ":memory:?_pragma=foreign_keys(1)")
    // ...
    db.SetMaxOpenConns(1)
    // ...
}

Single connection means no contention, which means the retry logic never fires in tests. But it still wraps every operation, so we're testing the exact same code paths as production without introducing flaky timing-dependent failures.

The results

After deploying all three layers --- WAL mode, tuned connection pooling, and the retry function --- "database is locked" errors dropped to zero. Not reduced. Zero.

The retry logic does fire occasionally. When a user runs three or more agents and they all hit a write-heavy phase at the same time, we see log lines like "AddMessageToConversation: transient error (attempt 1/6), retrying in 52ms". But it always succeeds within one or two retries. The 50ms base delay means the first retry happens within a few frames at 60fps --- far below the threshold of a perceptible pause. Users never notice.

The total implementation is 138 lines of Go for the retry logic and about 10 lines of SQLite configuration. That's it. No external dependencies, no message queues, no background workers, no separate write service. Just SQLite doing what SQLite does best, with a thin layer of application logic to smooth over the one thing it doesn't handle well.

What I'd tell someone starting out

If you're building a desktop application in Go and you're reaching for PostgreSQL out of habit, stop and think about whether you actually need it. For local-first data that lives on the user's machine, SQLite with WAL mode is almost certainly enough. The "database is locked" problem is real, but it has a straightforward solution: configure WAL mode, set a reasonable busy_timeout, and add a generic retry with backoff and jitter.

The entire retry module is open source in backend/store/retry.go. If you're dealing with SQLite concurrency in Go, take it, adapt it, use it. It's 138 lines that saved us from a class of bugs we never had to think about again.

Want to try ChatML?

Download ChatML