Giving an LLM access to a million rows
The structured data architecture that actually works
You might have built RAG before: ingest documents, split them into chunks, embed, retrieve, generate. It works. Now a client hands you a database: millions of rows, structured fields, years of records. The instinct is to either dump some or all rows straight into context, or reach for the same embedding playbook. More often than not, both strategies are wrong — and the failure mode is subtle enough that you won’t catch it until the system is in production.
The fix is matching your retrieval strategy to your data: SQL for structured fields, pre-computed summaries for trend questions, and semantic search — constrained by structured filters — for genuinely unstructured text. Three tools, three jobs. That’s how you give an LLM real access to a million rows — without maxing out context, degrading attention, or burning through your budget.
Here’s what the architecture looks like, and why.
The constraint isn’t what you think
The context window is the main bottleneck — but not in the way it’s usually framed. With 200k tokens as the new baseline and some models pushing past a million, “you can’t fit it all in” is becoming less true every year. The more honest constraint is both economic and qualitative: model attention degrades over long contexts, cost scales with tokens, and hallucinations multiply when the model is searching through a wall of loosely relevant data.
The goal isn’t to fit everything in. The goal is architectural discipline — getting only the precisely required data into context, with as little noise as possible. Lean context is fast, cheap, and accurate. Bloated context is slow, expensive, and unreliable. The problem is that with structured data at scale, getting to “lean” is most often non-trivial.
Archibald: where I learned this
I built Archibald as a production AI system for the art market — eight million-plus auction records, fifty thousand-plus unstructured documents, and art professionals asking questions that blend precise lookups with narrative context: “what did this artist’s late-period work achieve at auction, and is there anything notable in the provenance of recent major lots?”
Text-to-SQL handles the structured auction data. Pre-computed summaries catch trend questions. Vector search handles auction research essays, exhibition histories, provenance records, and artist biographies. The LLM orchestrates across all three without the caller needing to know which retrieval strategy was used. Grounding every response in current, structured data rather than the base model’s training knowledge eliminates the vast majority of hallucinations. In a domain where base models miss entire auction records, and where provenance or exhibition details move valuations, a confident wrong answer is worse than no answer at all.
Hardcoded routing vs. agentic tool selection
Before the retrieval tools, there’s a higher-level decision to make: how does the system decide which one to invoke? There are two approaches to this:
Hardcoded routing pipelines classify the query upfront — with a lightweight classifier call or even rule-based heuristics — and give you explicit control over which tools fire and in what combination. The upside is control and debuggability. You know exactly which path fired, you can log and test each branch independently, and you’re not relying on a model to correctly interpret tool descriptions. Tool-calling has improved significantly across frontier models, but it’s not yet bulletproof — a hardcoded pipeline sidesteps that entirely. For Archibald, this is still the approach I use: the query types are predictable enough to classify reliably, and a deterministic pipeline makes the system straightforward to evaluate and debug.
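A rule-based version of that router can be very small. A sketch, with illustrative keyword patterns rather than Archibald's actual rules:

```python
import re

# Illustrative patterns only, not Archibald's actual routing rules
TREND_PAT = re.compile(r"\b(trend|trajectory|over time|how has|since \d{4})\b", re.I)
SQL_PAT = re.compile(r"\b(above|below|between|top \d+|average|highest)\b|\$\d", re.I)

def route(query: str) -> str:
    """Deterministic router: each query goes to exactly one retrieval tool.
    Order matters: trend phrasing wins over filterable-lookup phrasing."""
    if TREND_PAT.search(query):
        return "summaries"       # pre-computed summaries
    if SQL_PAT.search(query):
        return "text_to_sql"     # precise, filterable lookup
    return "vector_search"       # semantic search over free text
```

Because each branch is a plain function of the query string, every path can be unit-tested and logged independently, which is exactly the debuggability argument above.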
Agentic tool selection gives the LLM a toolset and lets it decide what to call based on the query. The upside is flexibility: the model can call multiple tools for a single query, chain results, and handle ambiguous intent without you pre-classifying it. “Show me Picasso’s auction trajectory and any notable sales from the last five years” needs SQL and vector search in one response. A hard router needs a separate branch for every combination, which gets messy quickly and only gets worse as the number of data sources and tools scales.
Agentic orchestration is where the field is heading. Regardless of which routing approach you choose for now, the underlying tool architecture looks the same:
Three tools, three jobs
A large database rarely has just one kind of content — these three retrieval strategies cover most of what you’ll encounter.
Text-to-SQL
Job: precise lookups on structured fields.
An LLM interprets the user’s question and generates a SQL command as a structured response — so you always get a clean, parseable query rather than free text with SQL buried somewhere in it. That query then runs against the database and returns ground truth. For precise, filterable queries — “show me Picasso auction results above $1M between 2020 and 2023” — this is the right tool.
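A minimal sketch of the "structured response" half of that, using a plain dataclass and JSON validation. The `sql` and `rationale` field names are illustrative; in practice you would enforce this shape through your provider's structured-output or function-calling API:

```python
import json
from dataclasses import dataclass

@dataclass
class SQLResponse:
    sql: str        # the generated query, guaranteed present and parseable
    rationale: str  # illustrative extra field: why the model wrote this query

def parse_sql_response(raw: str) -> SQLResponse:
    """Validate the model's JSON payload into a typed object.
    Raises immediately on malformed output instead of passing junk downstream."""
    payload = json.loads(raw)
    return SQLResponse(sql=payload["sql"], rationale=payload["rationale"])
```

The point is that downstream code only ever sees a typed `SQLResponse`, never free text with SQL buried in it.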
Failures come in two forms. The first is an outright database error — a hallucinated column name, a bad join, a syntax mismatch. The second is harder: the query executes fine but returns the wrong data — an empty result set when there should be rows, the wrong aggregation level, a date filter that quietly excludes half the results. Both can be mitigated with a self-correcting loop.
The loop works in three stages: generate SQL, execute it against the database, then judge the result semantically — did the query actually answer the question, or did it return nonsense? A scratchpad tracks what was tried and why it failed, so each retry builds on previous mistakes rather than re-rolling the same dice.
```python
async def self_correcting_sql(
    nl_query: str,
    schema_context: str,
    max_attempts: int = 5,
) -> tuple:
    # Tracks what was tried and why it failed — each retry builds on previous mistakes
    scratchpad = []

    for attempt in range(max_attempts):
        # Format prior attempts as context for the SQL generator
        scratchpad_text = "\n".join(
            f"Attempt {i+1}: {e['sql']}\n  → Failed: {e['issue']}"
            for i, e in enumerate(scratchpad)
        ) if scratchpad else "No previous attempts."

        # Generate SQL with context from previous failures (if any)
        sql = await generate_sql(nl_query, schema_context, scratchpad_text)

        try:
            # Execute against the database
            results = db.execute(sql)
        except SQLError as e:
            # Log execution failures (syntax errors, bad joins, missing columns)
            scratchpad.append({"sql": sql, "issue": f"Execution error: {e}"})
            continue

        # Semantic validation: did this query actually answer the question?
        validation = await judge_results(nl_query, sql, results)
        if validation.valid:
            return results, sql

        # Log semantic failures (empty results, wrong aggregation, implausible values)
        scratchpad.append({"sql": sql, "issue": validation.issue})

    raise MaxAttemptsExceeded(scratchpad)
```

`generate_sql` produces a structured SQL command from the natural language query and schema context.
judge_results is a second LLM call that acts as a semantic validator — catching empty result sets when there should be data, implausible values, or aggregations at the wrong granularity (“average price per artist” returning one row per auction lot, etc).
scratchpad is what makes this an agentic loop rather than a retry mechanism: the system observes outcomes, updates its understanding of what went wrong, and adjusts accordingly. It adds a few hundred tokens per retry and materially improves correction on complex queries.
Four production notes on text-to-SQL:
- Result-set size explosion. Even a perfectly valid query can return thousands of rows (“show me every Picasso lot above $100k”). The self-correcting loop guarantees correctness but not usefulness. Depending on the use case, a hard `LIMIT` at the SQL level may be appropriate — or, if truncating results would lose important signal, a summarisation step is the better option: aggregate the raw output (price bands, top lots, trends) before it reaches the final LLM context. This is where pre-computed summary logic can be reused on the fly to condense large result sets into narrative without bloating tokens.
- Few-shot examples improve SQL quality noticeably for domain-specific databases. A small curated set of `(natural language question → validated SQL)` pairs injected directly into the generation prompt is usually enough — start there. Only reach for dynamic retrieval from a vector store if your query patterns are diverse enough that a static set stops covering them.
- Schema injection. For simpler databases with well-named columns, passing the full schema into context works fine. For complex multi-table enterprise databases, embed each table’s schema description and retrieve only the relevant fragments via vector search at query time — the same mechanism as any other RAG retrieval, just applied to schema rather than content. This keeps prompt size manageable and reduces hallucinated column names.
- Security. There’s obviously more to it, but some first measures to consider: never let generated SQL touch your production database directly. Use a read-only replica, prepared statements, and a lightweight whitelist that rejects anything outside `SELECT`. The judge model can flag dangerous patterns before execution.
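As a sketch of that last point, a first-pass whitelist can be a few lines of Python. The pattern list below is illustrative; it belongs on top of a read-only replica, not in place of one:

```python
import re

# Statement types a read-only query should never contain (illustrative list;
# over-rejecting is acceptable here — a false positive just triggers a retry)
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create|copy)\b", re.I
)

def is_safe_select(sql: str) -> bool:
    """Accept only a single SELECT (or WITH ... SELECT) statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not re.match(r"(?is)^\s*(select|with)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)
```

Run this check before execution; anything it rejects goes back into the self-correcting loop with the rejection logged to the scratchpad.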
Pre-computed summaries
Job: trend and analysis questions over aggregated data.
Some questions don’t need raw rows — they need interpretation. “How has Picasso’s auction trajectory changed since 2020?” You could pass thousands of auction records into context and ask the model to spot the trend. But that’s exactly the kind of context bloat this architecture is designed to avoid: millions of tokens spent on work that a deterministic pipeline handles better and faster. Solve the analysis deterministically, pass the distilled result to the model — more complete understanding, fewer tokens.
The pipeline is straightforward: aggregate structured data with pandas — price trajectories, market share shifts, sale-over-estimate ratios, top-lot breakdowns — then generate natural language summaries from the results: “Picasso’s auction prices trended +23% in 2022, led by Christie’s New York, sustained by three lots above $10M.” The model gets a single dense paragraph instead of ten thousand rows, and the analysis itself is deterministic and auditable.
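A minimal sketch of that aggregation step, assuming a hypothetical schema with `artist`, `year`, `hammer_price`, and `house` columns (the real pipeline computes more metrics than this):

```python
import pandas as pd

def summarise_artist_year(df: pd.DataFrame, artist: str, year: int) -> str:
    """Deterministic aggregation distilled into one dense sentence.
    Same data in, same summary out — cheap to run and easy to validate."""
    cur = df[(df.artist == artist) & (df.year == year)]
    prev = df[(df.artist == artist) & (df.year == year - 1)]
    if cur.empty or prev.empty:
        return f"Insufficient data for {artist} in {year}."

    # Year-over-year change in average hammer price
    change = (cur.hammer_price.mean() / prev.hammer_price.mean() - 1) * 100
    # Auction house with the highest total sales for the year
    top_house = cur.groupby("house").hammer_price.sum().idxmax()
    # Count of headline lots
    big_lots = int((cur.hammer_price > 10_000_000).sum())

    return (
        f"{artist}'s average auction price moved {change:+.0f}% in {year}, "
        f"led by {top_house}, with {big_lots} lots above $10M."
    )
```

The model receives only the returned sentence, not the rows behind it.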
This is where vectors earn their place in a structured data architecture — not on raw rows, but on derived, narrative representations of the data. Store summaries wherever makes sense for your stack — a vector store for semantic retrieval, a database table for structured lookup, or both. Inject them directly into context for known high-value entities, or retrieve them on demand when users ask ad hoc analysis questions.
Run the generation pipeline as a scheduled job. For Archibald, weekly refreshes work well — auction data updates daily, but the market moves slowly enough that weekly summaries remain accurate. The right cadence is dataset- and industry-dependent. The computation is deterministic: same data in, same summaries out. That makes it cheap to run, easy to validate, and important for client trust.
Vector search on genuinely unstructured fields
Job: semantic search over free-text content.
Auction catalogue essays, exhibition histories, provenance records, artist CVs, condition reports — this is genuinely unstructured content where meaning matters more than exact match, and vectors are the right tool. But unlike a standalone document corpus, these fields sit alongside precise, queryable attributes in the same database — dates, categories, prices, IDs. That co-location is the key advantage: the structured fields aren’t just metadata, they’re precision filters for metadata-constrained retrieval. You’re not doing a full-corpus semantic search — you’re running it against a pre-filtered candidate set. Same principle as everywhere else in this architecture: narrow aggressively to keep context lean.
There are two approaches to storing your embeddings. You can keep them in a dedicated vector store — Chroma, Pinecone, Weaviate — alongside your relational database; all of these support metadata filtering natively, so the pattern works, though it requires some additional glue to keep metadata in sync across systems. Or you can colocate them with your structured data via pgvector in PostgreSQL. If you’re already on PostgreSQL — say, through Supabase — pgvector is worth reaching for first: your SQL layer and your vector layer share the same filtering vocabulary, the same transaction guarantees, and the same infrastructure. No synchronising metadata between two systems, no coordinating filtering logic in two places. Structured filtering and semantic search collapse into a single query.
Here’s what that looks like end-to-end with pgvector:
```python
async def vector_search(
    query: str,
    embedding_col: str,  # e.g. "provenance_embedding", "catalogue_embedding"
    filters: dict,
    db,
    top_k: int = 10,
) -> list[dict]:
    # Embed the user's query
    response = await openai.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    )
    query_embedding = response.data[0].embedding

    # Metadata-constrained vector search in a single query
    rows = await db.fetch(
        f"""
        SELECT
            lot_id, artist, sale_date, hammer_price,
            1 - ({embedding_col} <=> $1::vector) AS similarity
        FROM auction_results
        WHERE artist = $2
          AND sale_date >= $3
        ORDER BY {embedding_col} <=> $1::vector
        LIMIT $4
        """,
        query_embedding,
        filters["artist"],
        filters["after_date"],
        top_k,
    )

    return [dict(row) for row in rows]
```

Each row in `auction_results` can carry multiple embedding columns — one per distinct free-text field: `provenance_embedding`, `condition_embedding`, `catalogue_embedding`, `exhibition_embedding`. Each is the pre-computed vector of its corresponding text column, stored at ingestion time. The `embedding_col` parameter lets you target whichever field is relevant to the query.
When a query like “Picasso works previously owned by Daniel-Henry Kahnweiler, sold after 2020” comes in, it gets embedded on the fly and compared against `provenance_embedding`. The WHERE clauses — standard SQL, the same predicates your text-to-SQL layer already generates — first narrow the candidate set to Picasso lots sold after 2020. The `<=>` operator then ranks within that filtered set using pgvector’s cosine distance: a measure of how similar two embedding vectors are in meaning, independent of their magnitude. Subtracting the distance from 1 converts it to a similarity score, where 1 means identical direction; for typical text embeddings the scores land between 0 and 1.
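The distance-to-similarity conversion is easy to sanity-check in NumPy. A toy reimplementation of the `<=>` operator, not pgvector itself:

```python
import numpy as np

def cosine_distance(a, b) -> float:
    """pgvector's <=> operator in NumPy: 1 minus the cosine of the angle
    between the two vectors. It depends on direction, not magnitude."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical directions give distance 0 (similarity 1), orthogonal vectors give distance 1, and scaling either vector changes nothing, which is the magnitude-independence mentioned above.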
One more note on retrieval quality: once the core architecture is in place, don’t stop at naive embedding lookup. Hybrid retrieval — combining keyword search (BM25) with semantic similarity and reranking — improves results substantially.
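One common way to combine the keyword and semantic rankings is reciprocal rank fusion (RRF), which needs only the rank positions, not the raw BM25 or cosine scores. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. BM25 and vector search) into one.
    Standard RRF: score(doc) = sum over lists of 1 / (k + rank).
    k=60 is the conventional default; it damps the influence of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers float to the top; a reranker model can then refine the fused head of the list.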
Reading the shape of your data
“RAG” has become a catch-all, and it’s causing real architectural mistakes. The document retrieval playbook — embed, retrieve, generate — is a more mature pattern, but structured data at scale doesn’t respond to it without meaningful modification.
The actual skill is reading the shape of your data and matching the retrieval strategy to each part of it. Structured fields need SQL, with a self-correcting loop and validated few-shot examples. Trend questions need pre-computed summaries, vectorised for retrieval if needed. Genuinely unstructured text — the narrative layer attached to structured records — needs semantic search, constrained by structured filters.
Vectors aren’t wrong for structured data. They’re just often applied to the wrong part of it. The architecture that works isn’t a choice between text-to-SQL and embeddings — it’s both, plus summaries, with each doing the job it’s suited for. Whether you wire them together with an explicit routing pipeline or let an LLM select tools at runtime is a separate decision — one that depends on your model, your reliability requirements, and how compound your users’ queries get. Either way, the retrieval strategies underneath are the same.
That’s how I give LLMs access to a million rows — not by cramming data into context, but by building the architecture that knows exactly which rows to retrieve, and how.
If you’re dealing with structured data at scale, the kind where naive RAG falls apart and your users need real answers from real databases — let’s talk architecture. I’ll tell you what I’d build, and you can decide if it’s worth building together.
Last modified: 27 Mar 2026