What is a vision prompt?
A vision prompt is a prompt that includes one or more images alongside text. The model 'sees' the image and reasons over it together with your instruction. Common tasks include reading text from a photo (OCR-style), describing a scene, extracting structured data from a document, comparing two images, answering a question about a chart, or judging whether an image meets some criteria.
The key difference from text-only prompting is that you have two channels. The image carries the visual content; the text carries the intent. Vague text wastes the image — "describe this" gets you a generic caption — while precise text ("List every line item and its price as a table") gets you exactly what you need.
Note that vision prompting (image as input) is the opposite of image generation (image as output). If you want to create images, you need a tool like the DALL-E Prompt Creator or Midjourney Prompt Builder, not a vision prompt.