How does RAG work?
A RAG pipeline has two phases. The first is ingestion, done ahead of time: you split your source documents into smaller passages (chunking), convert each passage into an embedding (a numeric vector that captures its meaning), and store those vectors alongside the passage text in a vector database or index.
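To make the ingestion phase concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the names chunk_text, toy_embed, and ingest are hypothetical, the "database" is just a list in memory, and toy_embed is a word-count vector standing in for a real embedding model, which would capture meaning far better.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into passages of at most max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: unit-length sparse word counts.

    A real pipeline would call a learned embedding model here instead.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def ingest(documents: dict[str, str], max_words: int = 40) -> list[dict]:
    """Chunk and embed every document, returning an in-memory list of records."""
    store = []
    for doc_id, text in documents.items():
        for n, passage in enumerate(chunk_text(text, max_words)):
            store.append({
                "doc_id": doc_id,   # which source document the passage came from
                "chunk": n,         # position of the passage within that document
                "text": passage,    # kept so it can be pasted into a prompt later
                "vector": toy_embed(passage),
            })
    return store
```

Keeping doc_id and chunk next to each vector is what later lets an answer point back to its source. Splitting on a fixed word count is the simplest possible chunking rule; splitting on paragraphs or headings is a common alternative.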
Second, retrieval and generation (at query time): you embed the user's question, find the most semantically similar passages via vector search, and insert those passages into the prompt alongside the question. The model then generates an answer grounded in the supplied text, ideally with citations back to the source.
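The query-time steps can be sketched the same way. This is an illustration under stated assumptions, not a reference implementation: toy_embed is a word-count stand-in for a real embedding model, retrieve does a brute-force cosine-similarity scan where a real system would query a vector index, and the function names, sample passages, and prompt wording are all made up for the example. The final call to a language model is left out, since it depends on whichever model you use.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: unit-length sparse word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity; a plain dot product because both vectors are unit length."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def retrieve(question: str, store: list[dict], k: int = 2) -> list[dict]:
    """Embed the question and return the k most similar stored passages."""
    q = toy_embed(question)
    ranked = sorted(store, key=lambda rec: cosine(q, rec["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[dict]) -> str:
    """Insert the retrieved passages into the prompt, numbered so they can be cited."""
    context = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered passages below and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical sample passages, embedded ahead of time as in the ingestion phase.
passages = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days.",
    "Returns are accepted within thirty days of delivery.",
]
store = [{"text": p, "vector": toy_embed(p)} for p in passages]

question = "How long does the warranty cover the battery?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)  # this string is what you would send to the model
```

Numbering the passages in the prompt is one simple way to get citations: the model can refer to [1], and you can map that number back to the stored passage.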
The whole point is that the model answers from evidence you placed in its context window, not from whatever it happened to memorize during training. That makes answers traceable, because you can show which passage a claim came from, and it keeps them current without retraining: when the source documents change, you re-ingest them and the next query retrieves the updated text.