What's in this guide
This is a long-form walkthrough of multi-modal prompting. It covers, in order:
1. Two directions of multi-modal prompting — understanding vs. generation.
2. Prompting with images as input (vision) — GPT-5.x, Claude, Gemini 3.x.
3. Prompting with audio as input — transcription, analysis, and instruction.
4. Prompting with video as input — what the current generation can and can't do.
5. Generating images — gpt-image-2, Imagen, Midjourney, and prompt structure.
6. Generating video — Sora-2 and Veo, and how to prompt them.
7. Cost: what providers actually bill for in multi-modal requests, with current prices.
8. Practical patterns and pitfalls that apply across modalities.
We finish with a capabilities/pricing comparison table, FAQs, and a Sources section. All prices are as of June 2026; each links to the live provider page, which is the authoritative source.