By The DDH Team · Digital Dashboard Hub

Multi-Modal Prompting: Images, Audio & Video (2026)

Multi-modal prompting covers two directions: feeding a model images, audio, or video to understand, and prompting a model to generate them. The principles differ by direction and by model. This guide walks through both, with pricing and capability tables current as of June 2026.

Multi-modal prompting is the practice of prompting a model with more than just text — giving it an image, audio clip, or video to analyze, or instructing it to generate visual or audio media. The core skill splits in two: for understanding tasks, you attach the media and write a precise text instruction about what to extract or do with it; for generation tasks, you describe the desired output in the structured vocabulary each generator responds to (subject, style, composition, and constraints).

Below we cover both directions across the current model landscape — GPT-5.x vision and gpt-image-2, Sora-2 for video, Claude vision, Gemini 3.x multimodal, and Google's Imagen and Veo families — with capability and pricing tables tied to the live provider docs. To practice generation prompts now, our Midjourney Prompt Builder, DALL-E Prompt Creator, and AI Art Style Mixer scaffold the structure these models reward.

Multi-modal capabilities & pricing (as of June 2026)

| Feature | GPT-5.x / OpenAI | Claude (Anthropic) | Gemini 3.x (Google) |
| --- | --- | --- | --- |
| Image input (vision) | Yes | Yes | Yes |
| Audio input | Yes (model-dependent) | Check current docs | Yes |
| Video input | Model-dependent | Check current docs | Yes |
| Native image generation | gpt-image-2 | No (text models) | Imagen family |
| Video generation | Sora-2 | No | Veo family |
| Example text price (USD per 1M tokens, in / out) | gpt-5.4: $2.50 / $15.00 | Sonnet 4.6: $3 / $15 | Gemini 3.5 Flash: $1.50 / $9.00 |
| Generation price | gpt-image-2: $8 / $30 per 1M tokens (in / out); Sora-2: $0.10-$0.50/sec | n/a | See Google pricing |

Sources, all June 2026: [OpenAI pricing](https://developers.openai.com/api/docs/pricing); [Anthropic/Claude pricing](https://claude.com/pricing); [Google Gemini pricing](https://ai.google.dev/gemini-api/docs/pricing). Capability rows summarize provider docs; exact format support and limits change — verify on the live docs before building.

What's in this guide

This is a long-form walkthrough of multi-modal prompting, in order:

1. Two directions of multi-modal prompting — understanding vs. generation.

2. Prompting with images as input (vision) — GPT-5.x, Claude, Gemini 3.x.

3. Prompting with audio as input — transcription, analysis, and instruction.

4. Prompting with video as input — what the current generation can and can't do.

5. Generating images — gpt-image-2, Imagen, Midjourney, and prompt structure.

6. Generating video — Sora-2 and Veo, and how to prompt them.

7. Cost: what multi-modal actually charges for, with current prices.

8. Practical patterns and pitfalls that apply across modalities.

The guide also includes a capabilities/pricing comparison table, FAQs, and a Sources section. All prices are as of June 2026 and link to the live provider pages, which are the authoritative source.


Two directions: understanding vs. generation

Before any specifics, separate the two things people mean by "multi-modal prompting," because the technique is different.

**Understanding (media in, text out).** You give the model an image, audio file, or video and a text instruction: "transcribe this," "what's the error in this screenshot," "summarize this meeting recording." The model reasons over the media. Here, the prompt skill is precision about the task and the output format — the media carries the content, your text directs the analysis.

**Generation (text in, media out).** You describe what you want and the model produces an image, video, or audio. Here, the prompt skill is descriptive: subject, style, composition, lighting, motion, duration, and explicit constraints. Image and video generators each have their own dialect.

A third, increasingly common case is interleaved: a chat where you paste a screenshot, ask a question, get text back, then ask for an edited image. The frontier chat models (GPT-5.x, Claude, Gemini 3.x) handle the understanding side natively; generation usually routes to a dedicated model like gpt-image-2 or a video model. Knowing which direction you're in tells you which prompting rules apply.


Prompting with images as input (vision)

All three frontier chat families accept images as input as of June 2026: OpenAI's GPT-5.x models, Anthropic's Claude (Opus 4.8, Sonnet 4.6), and Google's Gemini 3.x. The prompting approach is the same across them — what varies is exact format support and pricing, so check each provider's docs.

**Be explicit about the task.** "Describe this image" gets you a generic caption. "List every product name and price visible in this shelf photo as a table" gets you structured, useful output. Vision models follow specific instructions far better than open-ended ones.

**Tell it where to look.** For dense images such as dashboards, documents, and diagrams, name the region: "In the top-right chart, what's the Q3 value?" This keeps the model from wandering to irrelevant parts of the image.

**Pin the output format.** Ask for JSON, a table, or a labeled list when you'll parse the result. Vision output is text, so the same structured-output discipline applies.

**Combine image and text reasoning.** The strongest pattern is giving context in text and the evidence in the image: "Here's our brand style guide [text]. Does the attached mockup follow it? List violations."
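The four habits above can be folded into one small helper. This is a minimal sketch, not any provider's API: `build_vision_instruction` and `parse_rows` are hypothetical names, and the helper only produces the text half of the request. Attaching the image itself is provider-specific and is left out here.

```python
import json


def build_vision_instruction(task, region=None, output_keys=None, context=None):
    """Compose the text part of a vision request (hypothetical helper)."""
    parts = []
    if context:
        # Context goes first so the model reads it before the task.
        parts.append(f"Context:\n{context}")
    if region:
        # Naming the region keeps the model off irrelevant parts of the image.
        parts.append(f"Look only at: {region}.")
    parts.append(f"Task: {task}")
    if output_keys:
        keys = ", ".join(output_keys)
        parts.append(
            "Reply with a JSON array of objects and nothing else. "
            f"Each object must have exactly these keys: {keys}."
        )
    return "\n\n".join(parts)


def parse_rows(reply_text, output_keys):
    """Parse the model's reply and check every row has the pinned keys."""
    rows = json.loads(reply_text)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    for row in rows:
        if not isinstance(row, dict) or set(row) != set(output_keys):
            raise ValueError(f"unexpected row shape: {row!r}")
    return rows


prompt = build_vision_instruction(
    task="List every product name and price visible in this shelf photo.",
    region="the middle shelf",
    output_keys=["product", "price"],
)
```

Validating the reply against the same key list used in the prompt means a drifting output format fails loudly instead of silently corrupting downstream data.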

For the authoritative format rules and limits per provider, see the OpenAI prompting guide, the Claude prompt engineering overview, and Google's Gemini prompting strategies. Image inputs are billed as tokens at the per-model input rates in the capabilities and pricing table, and large or high-resolution images consume more input tokens.


Prompting with audio as input

Audio understanding spans transcription (speech to text), analysis (summarize a call, extract action items), and audio reasoning (identify speakers, tone, or non-speech events). The frontier multimodal models increasingly accept audio directly, and dedicated speech models handle high-volume transcription.

**For transcription, specify the deliverable.** "Transcribe this" differs from "Transcribe with speaker labels and timestamps every 30 seconds" or "Transcribe, then give me a 5-bullet summary and a list of decisions made." State whether you want verbatim or cleaned-up text.

**For analysis, give the audio a job.** Treat the recording as the source and your text as the task: "From this sales-call recording, extract the customer's stated objections, the next steps agreed, and any pricing mentioned. Output as three labeled lists."

**Mind the input cost.** Audio is converted to tokens, and long recordings add up fast — a one-hour call is a large input. Where possible, transcribe once with a cheaper model, then run analysis prompts over the text transcript rather than re-sending the audio. That two-step pattern is usually cheaper and more controllable.
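The transcribe-once, analyze-many pattern can be sketched as follows. The `transcribe` and `ask` callables are assumptions standing in for whatever speech and chat models you use; they are passed in rather than imported so that no real SDK call is implied. The stand-in functions at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Dict, List


def analyze_recording(
    audio_path: str,
    questions: List[str],
    transcribe: Callable[[str], str],
    ask: Callable[[str], str],
    cache: Dict[str, str],
) -> Dict[str, str]:
    """Transcribe the audio once, then run each question over the text."""
    if audio_path not in cache:
        # The only step that sends audio; everything after it is text-only.
        cache[audio_path] = transcribe(audio_path)
    transcript = cache[audio_path]
    answers = {}
    for question in questions:
        answers[question] = ask(f"{question}\n\nTranscript:\n{transcript}")
    return answers


# Stand-in functions so the sketch runs without any provider SDK.
calls = {"transcribe": 0}


def fake_transcribe(path: str) -> str:
    calls["transcribe"] += 1
    return "Customer: the price is too high. Rep: I'll send a quote Friday."


def fake_ask(prompt: str) -> str:
    return f"(answer based on {len(prompt)} characters of text)"


cache: Dict[str, str] = {}
first = analyze_recording(
    "call.mp3", ["List the objections."], fake_transcribe, fake_ask, cache
)
second = analyze_recording(
    "call.mp3", ["List the next steps."], fake_transcribe, fake_ask, cache
)
```

Both calls share one transcript, so the audio is sent a single time no matter how many follow-up questions are asked.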

Capabilities and exact audio format support change frequently; the Google Gemini docs and OpenAI docs are the live references for what each model accepts.


Prompting with video as input

Video understanding is the youngest of the input modalities and the most variable across providers. The frontier multimodal models can reason over video clips (answering questions about what happens, summarizing footage, locating moments), but support, maximum duration, and frame-sampling behavior differ widely, so verify against current docs before building on it.

**Anchor questions in time.** "What happens in this video" is weak. "At roughly the 2-minute mark, what does the presenter click? List the UI steps in order" gives the model a target and a structure.

**Expect frame sampling, not every frame.** Models typically sample frames rather than watch continuously, so fast events between sampled frames can be missed. For precise frame-level work, extract key frames as images and prompt over those instead.

**Two-step for long video.** As with audio, a transcript-plus-keyframes approach is often cheaper and more reliable than sending raw long video: pull the audio transcript and a handful of representative frames, then reason over those.
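One simple way to choose the "handful of representative frames" is to split the clip into equal segments and take the midpoint of each. The sketch below assumes uniform sampling is good enough, which is an assumption on our part, not a provider recommendation; `keyframe_timestamps` is a hypothetical helper, and the actual frame extraction is left to your video tooling.

```python
def keyframe_timestamps(duration_s: float, count: int) -> list:
    """Return `count` timestamps (in seconds), one at the middle of each equal segment."""
    if duration_s <= 0 or count <= 0:
        raise ValueError("duration_s and count must be positive")
    segment = duration_s / count
    return [round(segment * (i + 0.5), 2) for i in range(count)]


# A 2-minute clip sampled at four points.
stamps = keyframe_timestamps(120, 4)
print(stamps)  # [15.0, 45.0, 75.0, 105.0]
```

Uniform sampling has the same blind spot as a model's own frame sampling: a fast event between two timestamps is still missed, so sample more densely around any moment you already know matters.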

Because video input pricing and limits move quickly, treat the live Gemini and OpenAI pricing pages as the source of truth rather than any figure quoted in an article.


Generating images: gpt-image-2, Imagen, Midjourney

Image generation is a descriptive prompting task. The structure that works across generators: subject, then descriptors (attributes, materials, mood), then style/medium, then composition (framing, angle, lighting), then constraints (aspect ratio, what to exclude). Models differ in dialect but reward the same specificity.

**gpt-image-2** (OpenAI) is the current OpenAI image model; per the OpenAI pricing page it is billed at $8.00 per 1M input tokens and $30.00 per 1M output tokens as of June 2026. It responds well to detailed natural-language descriptions and is strong at rendering text in images and following precise layout instructions.

**Imagen** (Google) is Google's image family, prompted through the Gemini/Vertex tooling; see Google's pricing for current rates.

**Midjourney** uses a more keyword-and-parameter-driven style, with trailing flags for aspect ratio, stylization, and version; the Midjourney docs are the authoritative reference for parameters. Our Midjourney Prompt Builder assembles a well-formed prompt with the right parameter order.

**A reliable starting template** for any generator:

```
[subject], [key attributes], [style/medium], [composition: framing, angle, lighting], [color/mood], [aspect ratio], avoid: [unwanted elements]
```

Example: "a ceramic coffee mug on a wooden table, morning light from the left, shallow depth of field, photographic, warm tones, 3:2, avoid: text, hands." Build DALL-E-style prompts with our DALL-E Prompt Creator, or blend two aesthetics with the AI Art Style Mixer. For how the dialects differ in practice, see our DALL-E vs Midjourney prompt differences guide.
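If you generate prompts programmatically, the template maps naturally onto a small data class. This is a minimal sketch; `ImagePrompt` and its field names are hypothetical and simply mirror the template slots. The output is plain text, so any generator-specific flags (Midjourney parameters, for instance) would still need to be added per that tool's docs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImagePrompt:
    subject: str
    attributes: str = ""
    style: str = ""
    composition: str = ""
    mood: str = ""
    aspect_ratio: str = ""
    avoid: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Same slot order as the template; empty slots are skipped.
        slots = [self.subject, self.attributes, self.style,
                 self.composition, self.mood, self.aspect_ratio]
        text = ", ".join(s for s in slots if s)
        if self.avoid:
            text += ", avoid: " + ", ".join(self.avoid)
        return text


mug = ImagePrompt(
    subject="a ceramic coffee mug on a wooden table",
    style="photographic",
    composition="morning light from the left, shallow depth of field",
    mood="warm tones",
    aspect_ratio="3:2",
    avoid=["text", "hands"],
)
print(mug.render())
```

The rendered prompt follows the template's slot order strictly, so "photographic" lands before the lighting details rather than after them as in the hand-written example; for natural-language generators that reordering is usually harmless.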


Generating video: Sora-2 and Veo

Video generation prompting adds time: beyond what's in frame, you describe motion, camera movement, pacing, and shot length. Prompts are still descriptive, but now include verbs of motion and camera direction.

**Sora-2** (OpenAI) generates video; per the OpenAI pricing page, Sora-2 is priced per second of output: $0.10/sec at 720p and $0.50/sec at 1024p as of June 2026. That per-second billing makes prompt precision, clip length, and resolution the main cost levers: a 10-second clip costs $1.00 at 720p and $5.00 at 1024p.

**Veo** (Google) is Google's video-generation family, accessed through Google's tooling; consult Google's pricing for current rates.

**Prompt structure for video:** describe the scene, then the subject's action, then the camera (static, slow pan, dolly in), then style and mood, then duration. Example: "A red kite drifting over a beach at sunset. The camera slowly tilts up to follow it. Cinematic, warm golden light, gentle motion, 6 seconds."
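The same ordering can be enforced in code so that the duration never gets left out. This is a small sketch with a hypothetical `video_prompt` helper; it only assembles text and makes no assumption about how a given video model receives duration (some tools may take it as a separate setting instead of prompt text).

```python
def video_prompt(scene_and_action: str, camera: str, style: str, seconds: int) -> str:
    """Join scene and action, camera move, look, and duration into one prompt."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    sentences = [
        scene_and_action.rstrip(".") + ".",
        camera.rstrip(".") + ".",
        f"{style}, {seconds} seconds.",
    ]
    return " ".join(sentences)


kite = video_prompt(
    scene_and_action="A red kite drifting over a beach at sunset",
    camera="The camera slowly tilts up to follow it",
    style="Cinematic, warm golden light, gentle motion",
    seconds=6,
)
print(kite)
```

Running it reproduces the kite example above word for word, which makes it easy to vary one slot at a time (camera move, duration) while holding the rest constant between iterations.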

**Practical advice:** generate short, iterate cheaply at lower resolution, and only render the final at high resolution once the motion and composition are right. Because video is billed by duration and resolution, an undisciplined prompt loop is the fastest way to run up a bill. Always confirm current rates on the live pricing page before a large batch.


Cost: what multi-modal actually charges for

Multi-modal billing is less intuitive than text, so it's worth being explicit about what you pay for. The figures here are as of June 2026 and link to the live pages, which override anything quoted here.

**Image and audio input = tokens.** When you feed an image or audio clip to a chat model, it's converted into input tokens and billed at that model's per-token input rate (see the table). Bigger images and longer audio cost more.

**Image generation = its own token rates.** gpt-image-2 bills $8.00 in / $30.00 out per 1M tokens per OpenAI pricing.

**Video generation = per second of output.** Sora-2 is $0.10/sec (720p) and $0.50/sec (1024p). Resolution is a 5x cost multiplier here, so resolution choice is a real budget decision.

**The cost-control patterns:** transcribe-then-analyze for audio, keyframes-plus-transcript for long video, low-res-iterate-then-final-render for generation, and downscaling images to the smallest size that preserves the detail the task needs. For a deeper treatment of token economics across models, see our token cost by model comparison and LLM cost engineering guides.
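To make the billing rules concrete, here is a rough estimator using only the rates quoted in this article. Treat it as a sketch: the function names are hypothetical, the rates are the June 2026 figures above and will drift, and token counts for a given image or audio clip must come from the provider, since they are not derivable from file size here.

```python
# Rates quoted in this article (June 2026); confirm on the live pricing pages.
SORA2_USD_PER_SEC = {"720p": 0.10, "1024p": 0.50}
GPT_IMAGE_2_USD_PER_M = {"input": 8.00, "output": 30.00}


def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` at a per-1M-token rate."""
    return tokens / 1_000_000 * usd_per_million


def clip_cost(seconds: float, resolution: str) -> float:
    """Cost of one generated clip, billed per second of output."""
    return seconds * SORA2_USD_PER_SEC[resolution]


def loop_cost(drafts: int, seconds: float, draft_res: str, final_res: str) -> float:
    """Cost of `drafts` draft clips plus one final render."""
    return drafts * clip_cost(seconds, draft_res) + clip_cost(seconds, final_res)


# Five 10-second drafts plus one final render, two ways.
all_high = loop_cost(drafts=5, seconds=10, draft_res="1024p", final_res="1024p")
low_then_final = loop_cost(drafts=5, seconds=10, draft_res="720p", final_res="1024p")
print(round(all_high, 2), round(low_then_final, 2))
```

With these rates, drafting at 720p and rendering only the final at 1024p comes to about $10 versus about $30 for doing every pass at 1024p, which is the low-res-iterate pattern expressed as arithmetic.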

Frequently Asked Questions

What is multi-modal prompting?

It's prompting a model with more than text — either feeding it an image, audio, or video to understand, or instructing it to generate visual or audio media. Understanding tasks need precise text instructions about what to extract; generation tasks need descriptive prompts covering subject, style, composition, and constraints.

Which models accept image, audio, and video input in 2026?

As of June 2026, all three frontier chat families accept images: OpenAI's GPT-5.x, Anthropic's Claude (Opus 4.8, Sonnet 4.6), and Google's Gemini 3.x. Audio and video input support is more variable by model — Gemini 3.x is strong on both. Always verify current format support and limits on the OpenAI, Claude, and Gemini docs.

How much does image and video generation cost?

As of June 2026, per the OpenAI pricing page: gpt-image-2 is $8.00 input / $30.00 output per 1M tokens, and Sora-2 video is $0.10/sec at 720p and $0.50/sec at 1024p. Google's Imagen and Veo rates are on the Gemini pricing page. These move, so treat the live pages as authoritative.

How do I write a good image-generation prompt?

Use the structure: subject, key attributes, style/medium, composition (framing, angle, lighting), color/mood, aspect ratio, and an explicit list of things to avoid. Midjourney uses a keyword-plus-parameter dialect (see the Midjourney docs); gpt-image-2 and DALL-E respond to detailed natural language. Our Midjourney Prompt Builder and DALL-E Prompt Creator build well-formed prompts for you.

How do I keep multi-modal costs down?

Image and audio inputs are billed as tokens, so downscale images to the smallest size that preserves needed detail. For long audio or video, transcribe once with a cheaper model or extract keyframes, then run analysis over the transcript and frames instead of re-sending the raw media. For generation, iterate at low resolution and only do the final render at high resolution: Sora-2's 1024p costs 5x its 720p rate.

What's the difference between prompting Midjourney and DALL-E / gpt-image-2?

Midjourney favors a compact keyword-and-parameter style with trailing flags for aspect ratio and stylization, while DALL-E and gpt-image-2 respond best to detailed, natural-language scene descriptions; gpt-image-2 is also strong at rendering text and precise layouts. See our DALL-E vs Midjourney prompt differences breakdown.
