March 4, 2026 · Data & AI · 8 min read
The first version I built was embarrassingly bad. The LLM kept hallucinating column names, and once — I still think about this — it generated a DELETE statement because someone asked "remove the duplicates from my view." In production. Nearly ran it.
I've built a few of these systems since. Here's what I actually learned, not the sanitised tutorial version.
Every Text-to-SQL demo you see online uses a clean, three-table schema with sensible column names like customer_name and order_date. Real databases have columns named cust_id_v2_new, seven different tables that all have an id column, and a fact table someone built in 2019 that nobody fully understands anymore.
The gap between "works on the demo" and "works on your actual data" is where 90% of these projects die.
If I had to pick one variable that determines whether a Text-to-SQL system is good or useless, it's how you describe your schema to the model. Not the model you pick. Not the temperature setting. The schema description.
Most people dump CREATE TABLE statements into the prompt and call it done. That's not enough. You need to tell the model what the columns mean — business meaning, not just data types. And include sample values. The difference in SQL quality is not small.
def build_schema_prompt(db_connection):
    schema_parts = []
    tables = db_connection.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
    ).fetchall()
    for (table,) in tables:
        cols = db_connection.execute(f"""
            SELECT column_name, data_type
            FROM information_schema.columns
            WHERE table_name = '{table}'
        """).fetchall()
        # Don't skip this — sample values change everything
        sample = db_connection.execute(
            f"SELECT * FROM {table} LIMIT 3"
        ).fetchall()
        col_desc = ", ".join(f"{c} ({t})" for c, t in cols)
        schema_parts.append(
            f"Table: {table}\nColumns: {col_desc}\nSample rows: {sample}\n"
        )
    return "\n".join(schema_parts)
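The function above covers names, types, and sample values, but the business meaning has to come from somewhere, and in my experience it's worth maintaining by hand for the columns that trip the model up. A sketch of how I bolt that on (COLUMN_NOTES and annotate_schema are illustrative names, not from any library, and the example columns are made up):

```python
# Hand-written business meanings for the confusing columns.
# These table/column names are examples, not a real schema.
COLUMN_NOTES = {
    "users.cust_id_v2_new": "canonical customer ID; supersedes the legacy cust_id column",
    "orders.status": "one of 'pending', 'paid', 'refunded'; only 'paid' counts as revenue",
}

def annotate_schema(schema_prompt: str, notes: dict[str, str]) -> str:
    """Append business-meaning notes after the auto-generated schema text."""
    if not notes:
        return schema_prompt
    lines = [f"- {col}: {meaning}" for col, meaning in notes.items()]
    return schema_prompt + "\nColumn notes (business meaning):\n" + "\n".join(lines)
```

A few dozen of these notes, written once, pay for themselves quickly.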
I've tried a lot of variations. This is roughly the prompt structure I've settled on with Gemini — the key things are: be explicit about what you don't want, set the output format hard, and don't give the model wiggle room on destructive operations.
SYSTEM_PROMPT = """You are a SQL expert. Generate a single valid {dialect} SQL query.
Database schema:
{schema}
Rules — follow these exactly:
- Return ONLY the SQL query. No explanation, no markdown, no backticks.
- Only use tables and columns that exist in the schema above.
- Never use DROP, DELETE, UPDATE, INSERT, ALTER, TRUNCATE under any circumstances.
- If the question is ambiguous, make the most reasonable assumption and proceed.
- Limit results to 500 rows unless the user is asking for aggregations.
Question: {question}
SQL:"""
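One practical note: even with the "no markdown, no backticks" rule, models occasionally wrap the query in a fence anyway, so I strip defensively before validating anything. A small helper, nothing model-specific:

```python
import re

def extract_sql(raw: str) -> str:
    """Strip markdown code fences and stray backticks the model adds despite the prompt."""
    text = raw.strip()
    # Remove a leading ```sql (or bare ```) fence and its closing fence, if present
    fenced = re.match(r"^```(?:sql)?\s*(.*?)\s*```$", text, re.DOTALL | re.IGNORECASE)
    if fenced:
        text = fenced.group(1)
    return text.strip().strip("`").strip()
```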
This is the part people skip. Don't.
Before you execute anything the LLM generates, check it. At minimum: scan for forbidden keywords, validate that the tables and columns in the query actually exist in your schema, and parse the SQL to catch syntax errors. It adds maybe 50ms and it's the difference between a useful tool and a liability.
import re

import sqlparse

FORBIDDEN = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'TRUNCATE', 'GRANT', 'REVOKE']

def validate_sql(sql: str) -> tuple[bool, str]:
    # Keyword scan first: cheap, and it catches the worst failure mode outright
    for keyword in FORBIDDEN:
        if re.search(rf'\b{keyword}\b', sql.upper()):
            return False, f"Blocked: {keyword} is not allowed"
    # Then a parse pass to reject SQL that isn't even syntactically plausible
    try:
        parsed = sqlparse.parse(sql)
        if not parsed or not parsed[0].tokens:
            return False, "Couldn't parse this SQL"
    except Exception as e:
        return False, str(e)
    return True, "OK"
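That handles the keyword and parse checks. The table-existence check I mentioned is, in my setups, a crude regex pass along these lines (check_known_tables is an illustrative sketch; a real SQL parser handles aliases, subqueries, and CTEs better, but this catches the common case of a flat-out hallucinated table):

```python
import re

def check_known_tables(sql: str, known_tables: set[str]) -> tuple[bool, str]:
    """Reject queries that reference tables not present in the schema.
    Crude: grabs every identifier that follows FROM or JOIN."""
    referenced = {
        m.group(2).lower()
        for m in re.finditer(r"\b(FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE)
    }
    unknown = referenced - {t.lower() for t in known_tables}
    if unknown:
        return False, f"Unknown tables: {sorted(unknown)}"
    return True, "OK"
```

The same idea extends to columns if you keep a per-table column set around.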
Ambiguous column names. If five tables all have an id column, the model guesses. Sometimes right, often wrong. Prefix your column descriptions in the schema prompt — "user_id: unique identifier for a user in the users table" — and this problem mostly goes away.
Date handling. LLMs are weirdly inconsistent at date arithmetic. "Last 30 days" might become DATEADD(day, -30, GETDATE()) in SQL Server syntax even when you're on Postgres. Add date query examples to your prompt. It helps.
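Concretely, for Postgres I keep a couple of few-shot date examples appended to the prompt, along these lines (the orders table and its columns here are placeholders for your own schema):

```python
# Few-shot examples pinning the model to Postgres date syntax.
# Table and column names are placeholders; substitute your own.
DATE_EXAMPLES = """Examples of correct PostgreSQL date handling:
Q: How many orders in the last 30 days?
SQL: SELECT COUNT(*) FROM orders WHERE created_at >= NOW() - INTERVAL '30 days';
Q: Revenue by month this year?
SQL: SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount)
     FROM orders
     WHERE created_at >= DATE_TRUNC('year', NOW())
     GROUP BY 1 ORDER BY 1;
"""
```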
Large schemas. If your database has 80+ tables, you can't fit everything in context and you shouldn't try. Build a retrieval step first — embed the question, find the 5-10 most relevant tables, and only send those to the model. This is just RAG applied to your schema.
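A minimal sketch of that retrieval step, using plain word overlap as a stand-in for embedding similarity so it runs anywhere; in production, swap the scorer for cosine similarity over real embeddings of the question and each table's description:

```python
def pick_relevant_tables(question: str, table_docs: dict[str, str], k: int = 5) -> list[str]:
    """Rank tables by word overlap between the question and each table's description.
    Stand-in scorer: replace with embedding cosine similarity for real use."""
    q_words = set(question.lower().split())
    scored = sorted(
        table_docs,
        key=lambda t: len(q_words & set(table_docs[t].lower().split())),
        reverse=True,
    )
    return scored[:k]
```

Only the winning tables go through build_schema_prompt, which keeps the context small and the model focused.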
Temperature. Use 0 or 0.1 for SQL. Higher values feel creative but produce queries with subtle logic errors that look plausible. I learned this the hard way — a temperature-0.7 query once returned a number that looked right but was calculated wrong. Took me two days to catch.
With Gemini 1.5 Flash: a typical query with schema context is around 2,000 input tokens plus your question. At current pricing that's about $0.00015 per query — roughly $5/month at 1,000 queries a day. For most small-to-medium deployments this is noise.
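The arithmetic, if you want to plug in your own volumes (the per-token price here is what I paid at the time and will drift, so treat it as a variable, not a fact; output tokens are ignored since a SQL query is tiny):

```python
PRICE_PER_M_INPUT = 0.075      # USD per 1M input tokens; verify against current pricing
tokens_per_query = 2_000       # schema context dominates; the question itself is small

cost_per_query = tokens_per_query / 1_000_000 * PRICE_PER_M_INPUT
monthly = cost_per_query * 1_000 * 30   # 1,000 queries/day

print(f"${cost_per_query:.5f} per query, ${monthly:.2f}/month")
# prints: $0.00015 per query, $4.50/month
```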
It works well when your schema is reasonably documented, your questions are fact-based ("how many users signed up last week?"), and you're aggregating over one or two tables. Internal analytics tools are the sweet spot.
It struggles with complex multi-table joins, questions that require knowing your specific business context ("what counts as an activated user in our system?"), and anything real-time where query cost matters.
The honest answer is: build a small prototype on your real schema and see if the query quality is actually useful. If 70% of queries come back right without much iteration, you have something worth investing in. If you're constantly debugging hallucinated column names, the schema documentation problem needs solving first.
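To make that 70% figure measurable rather than vibes, I compare result sets, not SQL text — two queries that look nothing alike but return the same rows both count as correct. A self-contained harness, shown against SQLite for portability; feed it (generated, golden) query pairs from your own eval set:

```python
import sqlite3

def results_match(conn, generated_sql: str, golden_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return False  # the generated query didn't even run
    want = sorted(conn.execute(golden_sql).fetchall())
    return got == want

def accuracy(conn, cases: list[tuple[str, str]]) -> float:
    """cases: (generated_sql, golden_sql) pairs; returns fraction correct."""
    hits = sum(results_match(conn, gen, gold) for gen, gold in cases)
    return hits / len(cases)
```

Thirty or forty real questions from your team, run through this weekly, tells you far more than any benchmark.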