By The DDH Team · Digital Dashboard Hub

How to Write Prompts for Data Extraction

Reliable extraction is a schema problem, not a wording problem: tell the model exactly which fields to return, in what type and shape, and what to do when a value is missing.


To write a prompt for data extraction, define a strict output schema first (list every field by name with its type), then instruct the model to return only that structure (usually JSON), show one worked example, and add an explicit fallback rule such as **use null when a value is absent**. That pairing of schema and null fallback is what turns flaky free-text answers into parseable, production-grade output.

This guide covers the field-naming, type, and fallback patterns that make extraction reliable, plus a before/after prompt you can copy. For the downstream side — designing the schema itself and validating what comes back — pair this with our structured output schema design patterns and function calling vs structured output guides. To draft an extraction prompt fast, the ChatGPT Prompt Generator gives you a structured starting point — no signup, free forever.


Extraction prompt building blocks

| Building block | Why it matters | Skip it and you get |
| --- | --- | --- |
| Named, typed fields | Removes ambiguity about shape and format | Renamed keys, strings where you wanted numbers |
| Closed enum for categories | Constrains output to valid labels | Invented or inconsistent category values |
| Null fallback rule | Stops the model guessing absent values | Hallucinated phone numbers, fake totals |
| One worked example | Anchors exact JSON shape and casing | Format drift between runs |
| Source-span evidence field | Auditable, exposes hallucinations | No way to verify a suspicious value |
| Real structured-output mode | Guarantees well-formed JSON at decode time | Prose-wrapped or truncated JSON |

Sources: [OpenAI prompt guide](https://platform.openai.com/docs/guides/prompt-engineering), [Anthropic prompt engineering](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/overview), [DAIR.ai Prompt Engineering Guide](https://www.promptingguide.ai/). Verified June 2026.

What makes an extraction prompt reliable?

Extraction is the task of pulling specific, named values out of unstructured text — an invoice total, a person's title, a list of skills — and returning them in a fixed shape. The failure modes are predictable: the model invents fields you didn't ask for, guesses a value that isn't actually in the source, changes the JSON key names between runs, or wraps the output in prose that breaks your parser.

Every one of those failures has the same root cause: ambiguity about what "done" looks like. A reliable extraction prompt removes that ambiguity by being explicit on four axes — **which fields**, **what type** each field is, **what shape** the overall output takes, and **what to do when a value is missing**. Nail those four and the model has nothing left to improvise.

Where the platform supports it, prefer a real structured-output mode over asking for JSON in plain text. OpenAI's Structured Outputs feature and the equivalent constrained-decoding features described in the Anthropic prompt engineering overview enforce the schema at decode time, so the model cannot return JSON that violates it. Plain JSON mode is weaker: it guarantees syntactically valid JSON, but not that the keys and types match your schema. Prompt wording is your second line of defense, not your only one.
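As a rough sketch of what a structured-output mode consumes, a resume extraction could be described with a JSON Schema like the one below. The field names are illustrative, and how the schema is passed to a provider (parameter names, supported keywords) is an assumption to check against that provider's current documentation; no API call is shown here.

```python
import json

# Hypothetical schema for a resume extraction; field names are illustrative.
# Nullable scalars list "null" as an allowed type, and arrays are never null.
RESUME_SCHEMA = {
    "type": "object",
    "properties": {
        "full_name": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
        "years_experience": {"type": ["integer", "null"]},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    # Every key is required, so an absent value must be an explicit null or [].
    "required": ["full_name", "email", "years_experience", "skills"],
    # Forbid keys that are not listed above.
    "additionalProperties": False,
}

print(json.dumps(RESUME_SCHEMA, indent=2))
```

Marking every key as required while allowing null keeps "missing" visible in the output instead of letting the key silently disappear.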


How do I name fields and set types so the model doesn't drift?

Name fields the way you want them in your database, in snake_case or camelCase, and use the exact same names in the instruction and the example. If you call it invoice_total in the schema but Total in the example, the model may use either name from one run to the next. Consistency in your own prompt is the cheapest reliability win available.
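One way to catch this kind of mismatch before the prompt ever reaches a model is to compare the example's keys against the schema's field list. The sketch below is a minimal illustration; the helper name `key_mismatches` and the sample values are hypothetical.

```python
import json

def key_mismatches(schema_fields, example_json):
    """Return (missing, extra) key names when comparing the example to the schema."""
    example_keys = set(json.loads(example_json))
    schema_keys = set(schema_fields)
    missing = sorted(schema_keys - example_keys)  # in the schema, absent from the example
    extra = sorted(example_keys - schema_keys)    # in the example, not in the schema
    return missing, extra

schema_fields = ["invoice_number", "invoice_total"]
# The example uses "Total" where the schema says "invoice_total".
example = '{"invoice_number": "INV-001", "Total": 1240.0}'

missing, extra = key_mismatches(schema_fields, example)
print("missing:", missing, "extra:", extra)
```

Running a check like this whenever the prompt template changes keeps the instruction and the example from drifting apart.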

State the type next to every field: string, number, boolean, ISO-8601 date, or an array of one of those. Numbers are the classic trap — if you don't say number, you'll get "$1,240.00" as a string with a currency symbol and a comma, which your parser then has to clean. Saying **number, no currency symbol, no thousands separator** fixes it at the source.

For categorical fields, give the model a closed list and forbid anything outside it: "priority: one of [low, medium, high]". This is the same enum discipline that powers classification — see our how to write prompts for classification guide for the edge-case handling that applies equally here.
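The type and enum rules can also be checked in code after the model responds. The sketch below assumes a record with the invoice_total and priority fields used as examples in this section; the function name and the error wording are hypothetical.

```python
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def check_record(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    total = record.get("invoice_total")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(total, (int, float)) and not isinstance(total, bool)
    if total is not None and not is_number:
        problems.append(f"invoice_total should be a number, got {total!r}")

    priority = record.get("priority")
    in_enum = isinstance(priority, str) and priority in ALLOWED_PRIORITIES
    if priority is not None and not in_enum:
        problems.append(f"priority {priority!r} is outside the closed list")

    return problems

good = {"invoice_total": 1240.0, "priority": "high"}
bad = {"invoice_total": "$1,240.00", "priority": "urgent"}

print(check_record(good))
print(check_record(bad))
```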

Order matters less than you'd think for accuracy, but keeping the schema, the example, and the requested output in the same field order makes the result far easier to eyeball and diff in review.


What's the right fallback for missing or uncertain values?

The single most important instruction in any extraction prompt is the missing-value rule. Without it, the model fills gaps with plausible-sounding hallucinations — a phone number that looks real but appears nowhere in the source. With it, you get an honest null you can detect and handle.

Pick one fallback convention and state it explicitly: **"If a value is not present in the source text, return null — never guess."** null is preferable to an empty string because it's unambiguous in JSON and easy to filter. For array fields, the empty fallback is [] rather than null, so downstream code can always iterate without a type check.
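The same convention can be enforced defensively on the parsing side. The following is a minimal sketch, assuming a record with two scalar fields and one array field; the field lists and the helper name are hypothetical.

```python
SCALAR_FIELDS = ["full_name", "email"]
ARRAY_FIELDS = ["skills"]

def apply_fallbacks(record):
    """Fill absent scalars with None and absent arrays with []."""
    cleaned = {}
    for name in SCALAR_FIELDS:
        value = record.get(name)
        # Assumption: treat empty or whitespace-only strings as missing too.
        if isinstance(value, str) and not value.strip():
            value = None
        cleaned[name] = value
    for name in ARRAY_FIELDS:
        value = record.get(name)
        # Anything that is not a list (including null) becomes an empty list.
        cleaned[name] = value if isinstance(value, list) else []
    return cleaned

raw = {"full_name": "Jordan Lee", "email": "", "skills": None}
print(apply_fallbacks(raw))
```

After this step, downstream code can loop over `skills` without a type check and can test scalars with a single `is None`.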

Add a confidence or evidence field when the cost of a wrong extraction is high. Asking the model to also return the exact source substring it pulled each value from ("quote the span you extracted this from") gives you an audit trail and makes hallucinations obvious — the quoted span won't exist in the source. This is a lightweight, prompt-only form of grounding; for heavier grounding over large corpora see what is RAG.
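Checking the quoted spans can be automated. The sketch below assumes the model was asked to return, for each field, an object with a `value` and an `evidence` string; that response shape and the function name are assumptions made for illustration.

```python
def unsupported_fields(source_text, extraction):
    """Return field names whose quoted evidence span does not occur in the source.

    `extraction` maps field name -> {"value": ..., "evidence": str or None}.
    Whitespace and case are normalized before comparing.
    """
    normalized_source = " ".join(source_text.split()).lower()
    flagged = []
    for field, item in extraction.items():
        if item["value"] is None:
            continue  # nothing was extracted, so there is nothing to verify
        evidence = item.get("evidence") or ""
        normalized_evidence = " ".join(evidence.split()).lower()
        if not normalized_evidence or normalized_evidence not in normalized_source:
            flagged.append(field)
    return flagged

source = "Contact Jordan Lee at jordan@example.com for details."
extraction = {
    "email": {"value": "jordan@example.com", "evidence": "Jordan Lee at jordan@example.com"},
    "phone": {"value": "555-0100", "evidence": "Phone: 555-0100"},  # not in the source
    "title": {"value": None, "evidence": None},
}

print(unsupported_fields(source, extraction))
```

A flagged field is not proof of a hallucination (the model may have paraphrased the span), but it is a cheap signal for which values to review by hand.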

Never put real personal data, financial records, or confidential documents into a public chatbot to test extraction. Redact identifiers first, or use sample data.


Before / after: a real extraction prompt

Here is a vague prompt that produces inconsistent, unparseable output:

```
Pull the key details out of this resume:
{resume_text}
```

It will return prose, invent fields, and format dates differently every run. Now the schema-first version with types and a fallback:

```
Extract the following fields from the resume below.
Return ONLY a JSON object with exactly these keys and types:

{
  "full_name": string,
  "email": string,               // exact email, or null if absent
  "years_experience": number,    // integer, or null if not stated
  "current_title": string,       // most recent role, or null
  "skills": string[],            // empty array [] if none listed
  "has_security_clearance": boolean
}

Rules:
- If a value is not present in the text, return null. Never guess.
- Do not add keys that are not listed above.
- No prose, no markdown: JSON only.

Example output:
{"full_name":"Jordan Lee","email":"jordan@example.com","years_experience":7,"current_title":"Senior Analyst","skills":["SQL","Python"],"has_security_clearance":false}

Resume:
{resume_text}
```

The second version closes each gap in the first: a closed key set, explicit types, one example to anchor the shape, and a null fallback that discourages hallucinated values. Wire this through a real structured-output mode and the JSON is guaranteed well-formed.
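If you maintain several extraction prompts, it can help to generate them from one field definition so the schema and the example cannot drift apart. This is a sketch under stated assumptions: `FIELDS`, `EXAMPLE`, and `build_prompt` are hypothetical names, and the prompt wording mirrors the schema-first version above with a shortened field list.

```python
import json

# Field name -> type description, in the order the output should use.
FIELDS = {
    "full_name": "string, or null if absent",
    "email": "string, or null if absent",
    "years_experience": "number (integer), or null if not stated",
    "skills": "array of strings, [] if none listed",
}

EXAMPLE = {
    "full_name": "Jordan Lee",
    "email": "jordan@example.com",
    "years_experience": 7,
    "skills": ["SQL", "Python"],
}

def build_prompt(fields, example, document):
    """Assemble instruction, schema, rules, example, and source into one prompt."""
    if list(fields) != list(example):
        raise ValueError("example keys must match schema keys, in the same order")
    schema_lines = "\n".join(f'  "{name}": {spec}' for name, spec in fields.items())
    return (
        "Extract the following fields from the text below.\n"
        "Return ONLY a JSON object with exactly these keys and types:\n"
        f"{schema_lines}\n"
        "Rules:\n"
        "- If a value is not present in the text, return null. Never guess.\n"
        "- Do not add keys that are not listed above.\n"
        "- No prose, no markdown: JSON only.\n"
        f"Example output:\n{json.dumps(example)}\n"
        f"Text:\n{document}"
    )

prompt = build_prompt(FIELDS, EXAMPLE, "Jordan Lee, Senior Analyst. Skills: SQL, Python.")
print(prompt)
```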


How do I handle long documents and batches?

When the source exceeds what fits comfortably in the model's working window, don't paste the whole thing and hope — chunk it. Extract per-chunk into the same schema, then merge. For fields that should appear once (an invoice number), take the first non-null; for list fields (line items), concatenate and de-duplicate. Mind the context window limits of whichever model you choose.
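The merge rules described here (first non-null for single-value fields, concatenate and de-duplicate for list fields) might look like the sketch below. The function name and the sample invoice data are hypothetical.

```python
def merge_chunks(chunk_results, single_fields, list_fields):
    """Merge per-chunk extractions that all share one schema."""
    merged = {name: None for name in single_fields}
    merged.update({name: [] for name in list_fields})
    for result in chunk_results:
        for name in single_fields:
            # Keep the first non-null value and ignore later ones.
            if merged[name] is None and result.get(name) is not None:
                merged[name] = result[name]
        for name in list_fields:
            # Concatenate in document order, skipping items already seen.
            for item in result.get(name) or []:
                if item not in merged[name]:
                    merged[name].append(item)
    return merged

chunks = [
    {"invoice_number": None, "line_items": ["Widget"]},
    {"invoice_number": "INV-001", "line_items": ["Widget", "Gadget"]},
    {"invoice_number": "INV-999", "line_items": []},
]

print(merge_chunks(chunks, ["invoice_number"], ["line_items"]))
```

Taking the first non-null value is a policy choice; if two chunks disagree on a single-value field, logging the conflict for review may be safer than silently keeping the first.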

For batch extraction across many documents, keep the schema identical and version it. The moment you change a field name or type, re-run a small golden set of known-answer documents to confirm the change didn't silently break parsing. Reusing a stable schema also lets you cache the instruction portion — see LLM caching strategies — which cuts cost on high-volume jobs.
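A golden-set check can be a few lines of code. In the sketch below, `run_golden_set` is a hypothetical helper, and `naive_extract` is a deliberately crude stand-in for whatever function wraps your real prompt and model call.

```python
def run_golden_set(extract, golden_cases):
    """Run `extract` on each known-answer document and return the cases that differ."""
    failures = []
    for case in golden_cases:
        actual = extract(case["text"])
        if actual != case["expected"]:
            failures.append(
                {"id": case["id"], "expected": case["expected"], "actual": actual}
            )
    return failures

def naive_extract(text):
    # Stand-in extractor: takes the last word as the invoice number, right or wrong.
    return {"invoice_number": text.split()[-1]}

golden_cases = [
    {"id": "a", "text": "Invoice INV-001", "expected": {"invoice_number": "INV-001"}},
    {"id": "b", "text": "No invoice number on this page", "expected": {"invoice_number": None}},
]

failures = run_golden_set(naive_extract, golden_cases)
for failure in failures:
    print(failure)
```

Case "b" fails because the stand-in guesses instead of returning null, which is exactly the kind of regression a golden set is meant to surface after a schema or prompt change.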

Model choice matters less than schema discipline here, but for very long or multi-format sources a long-context, multimodal model helps. See how to choose an AI model in 2026 for the trade-offs and check live capabilities on the official model pages.

How to write a data-extraction prompt, step by step

  1. Define the output schema first

    Before writing any instruction text, list every field you need with its name and type: string, number, boolean, ISO date, or array. The schema is the contract; everything else just enforces it. See structured output schema design patterns for naming conventions.

  2. Constrain the shape and forbid extras

    Tell the model to return ONLY a JSON object with exactly those keys, and explicitly forbid adding keys that aren't listed. State 'no prose, no markdown: JSON only.'

  3. Set types and formats per field

    Specify number with no currency symbol or thousands separator, dates as ISO-8601, and closed enums for categoricals: 'priority: one of [low, medium, high]'. This is where most format drift gets eliminated.

  4. Add the missing-value fallback

    State the rule plainly: 'If a value is not present in the source, return null. Never guess.' Use [] for absent arrays. This is the single instruction that prevents hallucinated values.

  5. Give exactly one worked example

    Show one filled-in JSON object matching the schema exactly, with the same keys and the same casing. One example anchors the shape; you rarely need more for extraction (few-shot helps more for fuzzy classification).

  6. Prefer a real structured-output mode

    Where the API supports it, enable JSON/structured-output mode so the schema is enforced at decode time, not just requested in text. See the OpenAI prompt guide for current options.

  7. Validate, version, and add an evidence field

    Parse and schema-check every response in code; on parse failure, retry with the error fed back. For high-stakes fields, ask the model to quote the exact source span so you can audit and catch hallucinations.
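The validate-and-retry loop from the last step could be sketched as follows. `call_model` stands for any function that takes a prompt string and returns the model's raw text; it is a hypothetical placeholder, not a real library call, and the demo uses a canned stand-in so the example runs without network access.

```python
import json

REQUIRED_KEYS = {"full_name", "email", "skills"}

def validate(raw):
    """Parse raw model output and check the key set; return (record, error)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Output was not valid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "Output must be a JSON object."
    if set(record) != REQUIRED_KEYS:
        return None, f"Keys must be exactly {sorted(REQUIRED_KEYS)}, got {sorted(record)}."
    return record, None

def extract_with_retry(call_model, prompt, max_attempts=3):
    """Call the model, validate, and on failure retry with the error fed back."""
    current_prompt = prompt
    error = "no attempts were made"
    for _ in range(max_attempts):
        record, error = validate(call_model(current_prompt))
        if error is None:
            return record
        current_prompt = (
            f"{prompt}\n\nYour previous output was rejected: {error}\n"
            "Return corrected JSON only."
        )
    raise ValueError(f"No valid output after {max_attempts} attempts: {error}")

# Demo with a canned stand-in model: the first reply is prose-wrapped, the second is clean.
replies = iter([
    'Here you go: {"full_name": "Jordan Lee"}',
    '{"full_name": "Jordan Lee", "email": null, "skills": []}',
])
prompts_seen = []

def fake_model(prompt):
    prompts_seen.append(prompt)
    return next(replies)

record = extract_with_retry(fake_model, "Extract fields.")
print(record)
```

Capping the attempts matters: a prompt that fails validation three times in a row usually needs a human to fix the schema or the instructions, not a fourth retry.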

Frequently Asked Questions

How do I write a prompt to extract data as JSON?

Define a strict schema (every field named with its type), tell the model to return ONLY that JSON object with no extra keys and no prose, give one worked example, and add a fallback: 'return null if a value is absent — never guess.' Where the API supports it, enable a structured-output/JSON mode so the shape is enforced at decode time.

How do I stop the AI from making up values during extraction?

Add an explicit missing-value rule: 'If a value is not present in the source text, return null — never guess.' For high-stakes fields, also ask the model to quote the exact source span it pulled each value from; a fabricated value's quoted span won't exist in the source, exposing the hallucination.

What's the best output format for data extraction prompts?

JSON with named, typed fields, returned via a real structured-output mode where available. JSON is unambiguous, parseable, and supported by constrained-decoding features that guarantee well-formed output. Reserve free text only for one-off, human-read extractions.

How do I extract data from a document longer than the context window?

Chunk the document, extract each chunk into the same schema, then merge — first non-null for single-value fields, concatenate-and-dedupe for list fields. Check your model's context window and pick a long-context model for big sources.

Should I use few-shot examples for data extraction?

Usually one example is enough for extraction, because the schema does most of the work. Add more examples only when fields are fuzzy or format-sensitive. Few-shot pays off more in classification than in straight field extraction.

How do I extract numbers without currency symbols and commas?

Type the field as number and add a format rule: 'number, no currency symbol, no thousands separator.' Without it you'll get strings like "$1,240.00" that your parser must clean. Specifying the type and format at the source eliminates the cleanup step.

Is it safe to use ChatGPT to extract data from confidential documents?

Do not paste real personal, financial, or confidential data into a public chatbot. Redact identifiers first or use sample data, and check your provider's data-handling and retention terms. For regulated data, use an enterprise tier with appropriate controls and verify outputs before relying on them.

Why does my extraction prompt return different JSON keys each time?

Almost always because your prompt uses inconsistent field names — e.g. invoice_total in the schema but Total in the example. Use the exact same key names in the instruction and the example, forbid extra keys, and enable a structured-output mode to lock the shape across runs.
