Access Pattern Analysis

Schema and index decisions must follow from actual query patterns, not from assumptions about what might be queried. Before adding a column, index, or table, identify the queries that will use it and verify the design supports them efficiently.

Start With the Queries, Not the Schema

MUST answer these questions before finalizing any schema:

What are the most frequent queries? What columns appear in WHERE, JOIN, and ORDER BY?
What are the most expensive queries? (Use EXPLAIN QUERY PLAN and ANALYZE.)
Is this table read-heavy or write-heavy? What is the read-to-write ratio?
Are reads latency-sensitive (user-facing) or throughput-sensitive (background batch)?
What is the expected row count — today and at 10x scale?

The answers drive every subsequent decision: index selection, composite key ordering, WAL vs rollback journal, batch size, and connection count.

Design for WHERE, JOIN, and ORDER BY

A column belongs in an index if and only if queries will filter, join, or sort on it.

-- Before creating this index, confirm: which queries use status and created_date together?
CREATE INDEX idx_orders_status_date ON orders(status, created_date DESC);

MUST NOT add an index speculatively. An index that no query uses costs write performance with no read benefit.

SHOULD maintain a query inventory: a list of the significant queries for each table, their frequency, and which columns they filter on. Use this inventory to verify that every index earns its place, and that every frequent query has index support.

Column order in composite indexes follows query patterns:

Equality filters (=, IN) go first
Range filters (<, >, BETWEEN) go last
ORDER BY columns go last when the query has no range filter, enabling index-ordered scans that eliminate sort steps

Read-Heavy vs Write-Heavy Tradeoffs

Read-heavy tables (lookups, reporting, list views):

Add covering indexes for the hottest queries — eliminate table lookups entirely
Use mmap_size and a larger cache_size to keep working set in memory
Prefer SELECT with specific columns over SELECT * to maximize covering index usage
Run ANALYZE regularly to keep query planner statistics current

Write-heavy tables (event logs, outboxes, audit trails, sync change logs):

Minimize the number of indexes — each index adds overhead to every INSERT, UPDATE, and DELETE
A heavily indexed table can sustain 5x slower inserts compared to the same table with no secondary indexes
Use partial indexes to limit index size to the rows that matter
Batch writes in explicit transactions (100–1,000 rows per transaction for general use)
Append-only tables (insert but never update) may need no secondary indexes at all beyond the primary key

Mixed workload tables (the common case):

Profile the actual read/write ratio before adding indexes
One index on the most selective filter column often covers 80% of the read benefit
Add additional indexes only when profiling confirms they are needed

Identifying Missing Indexes

The query planner signals missing index opportunities in EXPLAIN QUERY PLAN output:

EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42;

Signals to act on:

SCAN orders — full table scan; an index on customer_id would help
AUTOMATIC INDEX — SQLite built a temporary index at query time; a permanent index would be faster
USE TEMP B-TREE FOR ORDER BY — the sort is happening in memory; an index on the ORDER BY columns may eliminate it
CORRELATED SCALAR SUBQUERY — runs per outer row; restructure as a JOIN

-- Enable automatic query plan output in the CLI during development
.eqp on

Also monitor sqlite_stat1 after running ANALYZE to see actual row counts per index. If an index covers a column with very low cardinality on a large table, it may not help — the planner may prefer a full scan.

Sync-Specific Access Patterns

Sync workloads have predictable query shapes. Design for them explicitly.

Unsynced record queries

The most critical sync query: "what changed since the last sync?"

SELECT * FROM tasks
WHERE last_synced_at IS NULL
   OR updated_at > last_synced_at;

This query runs on every sync cycle. MUST have an index that covers it:

CREATE INDEX idx_tasks_sync_status ON tasks(last_synced_at, updated_at)
  WHERE last_synced_at IS NULL OR updated_at > last_synced_at;

Verify with EXPLAIN QUERY PLAN that this shows SEARCH ... USING INDEX, not SCAN.

Outbox and change log queries

-- Outbox: pending entries to upload
SELECT * FROM outbox WHERE status = 'pending' ORDER BY next_attempt_at;

-- Change log: unsynced entries
SELECT * FROM change_log WHERE synced = 0 ORDER BY changed_at;

Both benefit from partial indexes on the filter condition:

CREATE INDEX idx_outbox_pending ON outbox(status, next_attempt_at)
  WHERE status = 'pending';

CREATE INDEX idx_changelog_unsynced ON change_log(synced, changed_at)
  WHERE synced = 0;

Partial indexes remain small even as the underlying table grows, because they only index the rows still in the active state.

Soft-delete filter

Nearly every application query excludes deleted rows. Account for this in index design:

-- If most queries filter WHERE is_deleted = 0, include is_deleted in composite indexes
-- OR use a partial index on the active subset
CREATE INDEX idx_tasks_active_status ON tasks(status, created_date)
  WHERE is_deleted = 0;

Batch Size for Sync Operations

Batch size is an access pattern decision: too small, and HTTP overhead dominates; too large, and a failed request wastes work.

Context	Recommended Batch Size	Rationale
Mobile (unstable network)	50–100 records	Smaller batches survive connection drops
Desktop (stable network)	500–1,000 records	Reduces HTTP round-trips
Initial bootstrap	1,000–5,000 records	Fast initial sync is critical for UX
Background sync	100–500 records	Balances throughput with UI responsiveness

MUST wrap each sync batch in a single transaction. Applying changes row-by-row in autocommit mode pays an fsync per row — at 30ms+ per fsync, a 500-record batch would take 15 seconds.

BEGIN IMMEDIATE;
-- apply all changes in the batch
COMMIT;

For bulk ingestion (initial sync or large imports), use JSON bulk operations to avoid SQLite's per-statement parameter limit and reduce round-trips:

-- Bulk insert via JSON: single statement for an entire batch
INSERT INTO tasks (id, title, status)
SELECT e->>'id', e->>'title', e->>'status'
FROM json_each(?) e;

Transaction Management During Sync

MUST use BEGIN IMMEDIATE for sync write transactions (not BEGIN DEFERRED). IMMEDIATE acquires the write lock at transaction start, preventing mid-batch lock failures that would require rolling back the entire batch.

def apply_sync_batch(changes, db):
    db.execute("BEGIN IMMEDIATE")
    try:
        for change in changes:
            apply_change(db, change)
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise

MUST NOT hold the write lock open while waiting for network responses. The pattern is: fetch the batch from the server, then open the transaction, apply all changes, commit. Network I/O happens outside the transaction boundary.

Connection separation for sync:

One dedicated write connection for sync writes (prevents SQLITE_BUSY from competing writers)
Separate read connections for UI queries that run during sync
Set busy_timeout = 5000 on all connections to absorb brief contention without returning errors

This connection strategy, combined with WAL mode, allows the sync writer to apply batches while the UI continues to read — the fundamental concurrency requirement for a responsive offline-first application.

Change History

Version	Date	Author	Summary
1.0.2	2026-04-09	Mike Fullerton	Add trigger tags
1.0.1	2026-04-09	Mike Fullerton	Reorganize into use-case directory
1.0.0	2026-04-06	Mike Fullerton	Initial version