By The DDH Team · Digital Dashboard Hub

Build RAG with Pinecone (2026): Serverless Index to Streaming Answer

By The DDH Team at Digital Dashboard Hub·Updated June 21, 2026

Retrieval-augmented generation (RAG) is the architecture teams use when they need an LLM to answer questions grounded in a private corpus — internal docs, product manuals, support tickets, legal filings. The core loop is: embed the user query → retrieve the top-K semantically similar chunks from a vector database → pass retrieved chunks as context → generate a grounded answer. Pinecone remains the most widely deployed managed vector database in 2026. Its serverless tier removed the pod-sizing tax that made the old starter tier awkward — you now pay per query and per write, with no always-on infrastructure.

This tutorial builds the full pipeline end-to-end: create a Pinecone serverless index, chunk and embed your documents in batches, upsert vectors with metadata, retrieve at query time, assemble an XML-tagged context block, call Claude Sonnet 4.6 with a streaming response, and return the answer to the user. Each phase shows the exact Python code with real SDK calls. Production considerations — metadata filtering, per-tenant namespaces, hybrid retrieval, monitoring — are covered in the final phases.

Related: Pinecone vs Weaviate vs Qdrant comparison · RAG cost per query calculator · pgvector alternative if you want to stay in Postgres · RAG architecture decision tree. If your queries mix exact-keyword lookups with semantic search, read hybrid search BM25 + dense before locking in Pinecone-only retrieval.

Stack components and cost (2026)

Component	Role	Cost / license
Pinecone serverless	Vector store + ANN search	$0.000000096/vector stored/hr + $0.04/1M read units (as of June 2026, pinecone.io/pricing)
OpenAI text-embedding-3-small	1536-dim embeddings	$0.02/1M tokens (as of June 2026, platform.openai.com/docs/models)
Anthropic Claude Sonnet 4.6	Answer generation	$3/1M input tokens, $15/1M output tokens (docs.anthropic.com/en/docs/about-claude/pricing)
Python pinecone SDK v6	Index/upsert/query client	MIT, pip install pinecone
Python openai SDK v1.x	Embedding calls	MIT, pip install openai
Python anthropic SDK v0.x	Claude generation calls	MIT, pip install anthropic

Pricing sourced from pinecone.io/pricing, platform.openai.com/docs/models, and docs.anthropic.com/en/docs/about-claude/pricing — all checked June 2026. Pinecone serverless read-unit pricing depends on index dimension and query configuration; benchmark with your own workload before committing.

Phase 1: Install dependencies and initialize clients

Pin exact versions in production. The Pinecone v6 SDK dropped the `pinecone-client` name; import from `pinecone` directly. The `grpc` extra is no longer required for serverless — skip it.

```bash
pip install pinecone==6.0.0 openai==1.50.0 anthropic==0.40.0 tiktoken==0.8.0
```

```python
import os

from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI
import anthropic

# Load keys from the environment; never hardcode them
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anth = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

INDEX_NAME = "rag-docs"
EMBED_MODEL = "text-embedding-3-small"
DIMENSION = 1536
METRIC = "cosine"
```

Why cosine? The text-embedding-3-small model produces L2-normalized vectors by default, so cosine and dot-product are equivalent for it. Cosine is the safer default — it works regardless of whether your embedding library normalizes or not. If you switch to a model that produces unnormalized vectors (e.g., some open-source models), cosine remains correct while dot-product silently breaks.
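To make the cosine versus dot-product point concrete, here is a minimal sketch in plain Python. The two-dimensional vectors are made-up illustrations, not real embeddings; the helper names (`dot`, `cosine`, `normalize`) are assumptions introduced for this example.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


query = [1.0, 0.0]
close_but_short = [0.9, 0.1]  # nearly the same direction as the query
far_but_long = [5.0, 5.0]     # 45 degrees off, but much larger magnitude

candidates = [close_but_short, far_but_long]

# Unnormalized: dot product rewards magnitude, cosine rewards direction.
dot_winner = max(candidates, key=lambda v: dot(query, v))
cos_winner = max(candidates, key=lambda v: cosine(query, v))

# After L2 normalization the two metrics pick the same vector.
unit_vectors = [normalize(v) for v in candidates]
dot_winner_norm = max(unit_vectors, key=lambda v: dot(query, v))
cos_winner_norm = max(unit_vectors, key=lambda v: cosine(query, v))
```

With unnormalized inputs the dot product picks the long vector even though it points in a different direction, which is the silent failure described above.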

Phase 2: Create the serverless index

Serverless indexes are created once. Check existence before creating to make the script idempotent — safe to re-run on every deploy.

```python
def ensure_index(name: str, dimension: int, metric: str) -> None:
    """Create the index if it doesn't exist. Idempotent."""
    existing = [idx.name for idx in pc.list_indexes()]
    if name in existing:
        print(f"Index '{name}' already exists, skipping create.")
        return
    pc.create_index(
        name=name,
        dimension=dimension,
        metric=metric,
        spec=ServerlessSpec(
            cloud="aws",  # or "gcp" / "azure"
            region="us-east-1",
        ),
    )
    print(f"Created serverless index '{name}'.")


ensure_index(INDEX_NAME, DIMENSION, METRIC)
index = pc.Index(INDEX_NAME)
```

The `ServerlessSpec` region must be in a region where Pinecone has a serverless deployment. As of June 2026, supported clouds are AWS (us-east-1, us-west-2, eu-west-1), GCP (us-central1), and Azure (eastus2). Check pinecone.io/docs/serverless for the current list — it expands frequently.

Pinecone returns from `create_index` before the index is ready. If you call `index.upsert()` immediately after creating, you may get a `NOT_READY` error. The v6 SDK exposes `pc.describe_index(name).status.ready` — poll it if you need synchronous creation in a script (not needed for idempotent re-runs since the index will already be ready).
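The polling step could look something like the sketch below. `wait_until_ready` is a hypothetical helper, not part of the Pinecone SDK; it takes any zero-argument callable so the readiness check stays decoupled from the client, and the timeout and interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> None:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready after {timeout_s} seconds")
        time.sleep(interval_s)


# Usage with the readiness flag mentioned above (requires a live client):
# wait_until_ready(lambda: pc.describe_index(INDEX_NAME).status.ready)
```

Raising on timeout rather than returning silently keeps a stuck index from turning into a confusing upsert error later in the script.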

Phase 3: Chunk documents with overlapping token windows

Chunk size is the most impactful hyperparameter in a RAG pipeline. Too large and each chunk is noisy — you retrieve a lot of irrelevant text alongside the answer. Too small and the chunk loses enough surrounding context that the LLM can't form a coherent answer. For general text, 512 tokens with 50-token overlap is a reasonable starting point. Overlap prevents answers from falling at chunk boundaries.
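To see how size and overlap interact, the sketch below computes window boundaries as token offsets, with no tokenizer involved. `window_spans` is a hypothetical helper, and it assumes the window stops as soon as it reaches the final token, so it never emits a tail chunk that contains only overlap.

```python
def window_spans(
    n_tokens: int, size: int = 512, overlap: int = 50
) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for overlapping windows."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # final window reached; assumption: no overlap-only tail
        start += size - overlap  # stride of 462 tokens with the defaults
    return spans


# A 1,200-token document yields three windows; neighbours share 50 tokens.
spans = window_spans(1200)
```

Each window starts `size - overlap` tokens after the previous one, so a sentence that straddles one boundary still appears whole in the neighbouring chunk as long as it is shorter than the overlap.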

This implementation splits on token counts directly: it encodes the text with tiktoken and slides a fixed-size window over the token sequence, so no chunk can exceed the token limit. It does NOT use LangChain. The logic is about 40 lines of plain Python, easier to debug and faster to import.

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-3-small
CHUNK_SIZE = 512  # tokens
OVERLAP = 50      # tokens


def chunk_text(text: str) -> list[str]:
    """
    Split text into overlapping chunks of at most CHUNK_SIZE tokens.
    Slides a fixed-size window over the token sequence.
    """
    tokens = ENC.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + CHUNK_SIZE, len(tokens))
        chunks.append(ENC.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last window reached; skip the overlap-only tail chunk
        start += CHUNK_SIZE - OVERLAP
    return chunks


def chunk_documents(docs: list[dict]) -> list[dict]:
    """
    docs: list of {id: str, text: str, metadata: dict}
    Returns list of {id: str, text: str, metadata: dict} where metadata
    includes 'source_id' and 'chunk_index'.
    """
    results = []
    for doc in docs:
        for i, chunk in enumerate(chunk_text(doc["text"])):
            results.append({
                "id": f"{doc['id']}_chunk_{i}",
                "text": chunk,
                "metadata": {
                    **doc.get("metadata", {}),
                    "source_id": doc["id"],
                    "chunk_index": i,
                    "chunk_text": chunk,  # store text in metadata for retrieval
                },
            })
    return results
```

Storing `chunk_text` in metadata is the simplest way to retrieve the raw text alongside the vector match. Pinecone's `query()` response returns IDs, scores, and metadata, but not the source text unless you stored it, so the text has to live somewhere. Alternatives: a sidecar key-value store (Redis, DynamoDB) keyed on vector ID, or a Postgres table with `source_id` + `chunk_index` as the composite key. For most workloads under 10M chunks, metadata storage in Pinecone is the path of least operational complexity.

Phase 4: Batch embed and upsert

OpenAI's embedding endpoint accepts up to 2048 inputs per request (as of v1 API). Batch in groups of 100 to stay well under that limit and avoid oversized requests. Pinecone's upsert endpoint also caps each request by record count and payload size (see pinecone.io/docs/limits); 100 vectors of 1536 dimensions with chunk text in metadata stays within those caps in typical cases. Using the same batch size for both calls means one OpenAI batch maps to exactly one Pinecone upsert.

```python
import time
from itertools import islice


def batch(iterable, n: int):
    """Yield successive n-sized chunks from an iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def embed_and_upsert(
    chunks: list[dict],
    index,
    namespace: str = "",
    batch_size: int = 100,
) -> int:
    """
    Embed chunk texts and upsert to Pinecone.
    Returns total vectors upserted.
    """
    total = 0
    for chunk_batch in batch(chunks, batch_size):
        texts = [c["text"] for c in chunk_batch]

        # Embed with OpenAI: single API call per batch
        response = oai.embeddings.create(
            model=EMBED_MODEL,
            input=texts,
        )
        embeddings = [item.embedding for item in response.data]

        # Build Pinecone upsert payload
        vectors = [
            {
                "id": chunk_batch[i]["id"],
                "values": embeddings[i],
                "metadata": chunk_batch[i]["metadata"],
            }
            for i in range(len(chunk_batch))
        ]
        index.upsert(vectors=vectors, namespace=namespace)
        total += len(vectors)
        print(f"Upserted {total} vectors...")

        # Respect OpenAI rate limits: optional sleep for large ingestion jobs
        time.sleep(0.1)
    return total
```

The `namespace` parameter is the key to per-tenant isolation in Pinecone serverless. Each tenant's documents go into their own namespace (`namespace=f"tenant:{tenant_id}"`). Query time: pass the same namespace to `index.query()` and the search is automatically scoped to that tenant's data. Namespaces are free — you pay for stored vectors, not namespace count. This pattern eliminates the need for separate indexes per tenant, which is the naive (and expensive) approach.

Phase 5: Query embedding and topK retrieval

The retrieval step embeds the user's query with the same model used during ingestion, then calls Pinecone's `query()` method. `include_metadata=True` returns the stored chunk text alongside each match score.

```python
def retrieve(
    query: str,
    index,
    namespace: str = "",
    top_k: int = 5,
    filter: dict | None = None,
) -> list[dict]:
    """
    Embed query and retrieve top-K chunks from Pinecone.
    Returns list of {score: float, text: str, metadata: dict}.
    """
    # Embed the query
    q_response = oai.embeddings.create(
        model=EMBED_MODEL,
        input=[query],
    )
    q_vector = q_response.data[0].embedding

    # Query Pinecone
    results = index.query(
        vector=q_vector,
        top_k=top_k,
        namespace=namespace,
        include_metadata=True,
        filter=filter,  # e.g. {"document_type": {"$eq": "policy"}}
    )
    return [
        {
            "score": match.score,
            "text": match.metadata.get("chunk_text", ""),
            "metadata": match.metadata,
        }
        for match in results.matches
    ]
```

The `filter` parameter accepts a subset of MongoDB-style query operators: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`. Filters run on metadata fields you stored during upsert. Example: filter to only policy documents from 2025 or later: `filter={"document_type": {"$eq": "policy"}, "year": {"$gte": 2025}}`. Metadata filters are applied before ANN scoring — they reduce the candidate set, which can improve both relevance and cost.

Tune `top_k` based on your context window and answer quality. Start with 5. If answers lack detail, try 8-10. More chunks mean more tokens in the Claude context, which costs more. At Sonnet 4.6 pricing ($3/1M input), 5 chunks of 512 tokens each = 2,560 extra input tokens ≈ $0.0077/query. That is small for a single query but scales linearly with volume: at 100K queries/day it comes to about $768/day, so retrieved context is usually the dominant line item and `top_k` is a real cost lever.

Phase 6: Context assembly and Claude generation with streaming

Assemble retrieved chunks into an XML-tagged context block before the Claude call. XML tags are Anthropic's recommended structure for separating context from instructions — they improve grounding and reduce hallucination on documents with competing claims. Use `<document>` tags with `index` attributes so the model can cite sources.

```python
def assemble_context(chunks: list[dict]) -> str:
    """Build an XML-tagged context string from retrieved chunks."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["metadata"].get("source_id", "unknown")
        parts.append(
            f'<document index="{i}" source="{source}">\n'
            f'{chunk["text"]}\n'
            f'</document>'
        )
    return "\n\n".join(parts)


def ask(
    query: str,
    index,
    namespace: str = "",
    top_k: int = 5,
    filter: dict | None = None,
) -> str:
    """
    Full RAG pipeline: retrieve → assemble → generate.
    Streams the Claude response and returns the full text.
    """
    chunks = retrieve(query, index, namespace=namespace, top_k=top_k, filter=filter)
    context = assemble_context(chunks)

    system_prompt = """You are a precise research assistant.
Answer the user's question using only the documents provided in <context>.
If the answer is not in the documents, say "I don't have enough information in the provided documents to answer that."
Cite document index numbers inline, e.g. [1] or [2, 3]."""

    messages = [
        {
            "role": "user",
            "content": f"<context>\n{context}\n</context>\n\n<question>{query}</question>",
        }
    ]

    full_text = ""
    with anth.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)  # stream to terminal/websocket
            full_text += text
    print()  # newline after stream
    return full_text
```

The `<context>` + `<question>` XML wrapping is not cosmetic — it is the prompt pattern Anthropic's own RAG cookbook recommends (see docs.anthropic.com/en/docs/build-with-claude/prompt-engineering#long-document-qa). Claude is trained to understand these tag boundaries. Skipping the XML and pasting raw chunk text into the user message works but degrades answer quality on longer contexts with multiple competing documents.

Streaming is implemented via the `with anth.messages.stream(...) as stream:` context manager and the `stream.text_stream` iterator. In a web API, replace the `print()` call with a `yield` inside a generator passed to a FastAPI `StreamingResponse` or a Next.js edge stream. The `full_text` accumulator gives you the complete answer for logging and caching after the stream closes.
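The print-to-yield swap can be sketched as a small generator that forwards each text delta and hands the complete answer to a callback at the end. `relay_stream` and `on_complete` are hypothetical names; the function accepts any iterable of strings, so it can be exercised without an API call.

```python
from typing import Callable, Iterable, Iterator


def relay_stream(
    text_chunks: Iterable[str],
    on_complete: Callable[[str], None],
) -> Iterator[str]:
    """Yield each text delta as it arrives, then pass the full answer on."""
    parts: list[str] = []
    for piece in text_chunks:
        parts.append(piece)
        yield piece
    # Runs only after the consumer has drained the stream.
    on_complete("".join(parts))


# Stand-in for stream.text_stream; a real handler would pass that iterator.
logged: list[str] = []
received = list(relay_stream(["Retrieval ", "works."], logged.append))
```

In a real handler the generator has to be consumed while the Anthropic stream's `with` block is still open, so the context manager would normally live inside the generator that the web framework iterates.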

Phase 7: Production hardening — metadata filtering, namespaces, monitoring

The baseline pipeline above is correct but not production-hardened. Three additions matter most for production: namespace enforcement to prevent cross-tenant data leakage, index statistics monitoring to catch ingestion failures, and an optional hybrid retrieval pass for keyword-heavy queries.

**Namespace enforcement.** If your application is multi-tenant, make the namespace a required parameter and assert it is set before every `index.query()` call. The most common production bug in multi-tenant RAG is a forgotten namespace parameter that exposes Tenant A's documents to Tenant B's queries — there is no automatic cross-namespace isolation if you pass `namespace=""`.

```python
def safe_retrieve(
    query: str,
    index,
    tenant_id: str,  # required, no default
    top_k: int = 5,
    filter: dict | None = None,
) -> list[dict]:
    """Retrieve with mandatory namespace isolation."""
    if not tenant_id:  # rejects both None and the empty string
        raise ValueError("tenant_id must be a non-empty string for namespace isolation")
    namespace = f"tenant:{tenant_id}"
    return retrieve(query, index, namespace=namespace, top_k=top_k, filter=filter)
```

**Index statistics.** Pinecone's `describe_index_stats()` returns vector count per namespace. Run it after every ingestion batch and alert if the expected count doesn't match. Pinecone upserts are eventually consistent — a count check immediately after upsert may undercount by a few vectors. Wait 1-2 seconds before counting in tests.

```python
def check_index_health(index, expected_namespace_counts: dict[str, int]) -> None:
    """Compare per-namespace vector counts against expected values."""
    stats = index.describe_index_stats()
    for namespace, expected in expected_namespace_counts.items():
        actual = stats.namespaces.get(namespace, {}).get("vector_count", 0)
        if actual < expected:
            print(f"WARNING: namespace '{namespace}' has {actual} vectors, expected {expected}")
        else:
            print(f"OK: namespace '{namespace}' has {actual} vectors")
```

**Hybrid retrieval option.** For workloads where exact keyword matching matters (code snippets, product SKUs, proper nouns), augment the Pinecone dense retrieval with a BM25 pass. As of this guide's June 2026 check, Pinecone serverless indexes do not natively support sparse vectors (sparse+dense hybrid is available on pod-based indexes); confirm against the current Pinecone docs, since feature support shifts between releases. For serverless hybrid, run a separate BM25 index (Elasticsearch, Typesense, or the `rank_bm25` Python library in memory) and fuse results with Reciprocal Rank Fusion. See hybrid search BM25 + dense tutorial for the complete RRF implementation.
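As a rough illustration of the fusion step, here is a minimal Reciprocal Rank Fusion sketch. It takes ranked lists of document IDs (best first) and scores each ID as the sum of 1 / (k + rank) across lists. The function name and sample IDs are made up for this example, and k = 60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; returns (doc_id, score) sorted best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. IDs from the vector query
bm25_ranking = ["doc_c", "doc_a", "doc_d"]   # e.g. IDs from a keyword index
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only rank positions, so the dense similarity scores and BM25 scores never need to be put on a common scale.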

Phase 8: End-to-end script and cost estimate

Here is the complete script that wires all phases together for a local test with three sample documents.

```python
# rag_pinecone_demo.py: complete end-to-end demo
import os
import time

from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI
import anthropic

# --- init ---
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anth = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

INDEX_NAME = "rag-demo"
EMBED_MODEL = "text-embedding-3-small"
DIMENSION = 1536

# --- sample corpus ---
SAMPLE_DOCS = [
    {"id": "doc_001", "text": "Pinecone is a managed vector database optimized for similarity search at scale.", "metadata": {"category": "infra"}},
    {"id": "doc_002", "text": "Claude Sonnet 4.6 is Anthropic's mid-tier model, balancing capability and cost.", "metadata": {"category": "models"}},
    {"id": "doc_003", "text": "RAG combines retrieval from a knowledge base with large language model generation.", "metadata": {"category": "architecture"}},
]

# --- create index ---
existing = [idx.name for idx in pc.list_indexes()]
if INDEX_NAME not in existing:
    pc.create_index(
        name=INDEX_NAME,
        dimension=DIMENSION,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # create_index returns before the index is ready (see Phase 2)
    while not pc.describe_index(INDEX_NAME).status.ready:
        time.sleep(1)
index = pc.Index(INDEX_NAME)

# --- embed + upsert ---
texts = [d["text"] for d in SAMPLE_DOCS]
embeddings = oai.embeddings.create(model=EMBED_MODEL, input=texts)
vectors = [
    {
        "id": SAMPLE_DOCS[i]["id"],
        "values": embeddings.data[i].embedding,
        "metadata": {**SAMPLE_DOCS[i]["metadata"], "chunk_text": SAMPLE_DOCS[i]["text"]},
    }
    for i in range(len(SAMPLE_DOCS))
]
index.upsert(vectors=vectors)
print(f"Upserted {len(vectors)} vectors.")
time.sleep(2)  # upserts are eventually consistent; give them a moment to land

# --- query + generate ---
QUERY = "What is RAG and why does it use vector search?"
q_emb = oai.embeddings.create(model=EMBED_MODEL, input=[QUERY]).data[0].embedding
matches = index.query(vector=q_emb, top_k=3, include_metadata=True).matches
context = "\n\n".join(
    f'<document index="{i+1}">{m.metadata["chunk_text"]}</document>'
    for i, m in enumerate(matches)
)

print("\n--- Claude answer ---")
with anth.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="Answer using only the provided documents. Cite [index].",
    messages=[{"role": "user", "content": f"<context>{context}</context>\n<question>{QUERY}</question>"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```

Cost estimate for this demo: 3 documents embedded in 1 call (ingestion) + 1 embed call (query) = ~100 tokens of embedding × $0.02/1M = $0.000002. One Claude call of ~300 input tokens + 100 output tokens = $0.0009 + $0.0015 = $0.0024. Total demo cost: about a quarter of a cent. At 100K daily queries with 5 × 512-token chunks per query: embedding cost = 100K × ~15 tokens/query × $0.02/1M = $0.03/day. Claude input cost = 100K × ~2,800 tokens × $3/1M = $840/day. At ~100 output tokens per answer, output adds 100K × 100 × $15/1M = $150/day. Total: roughly $990/day for 100K RAG queries. Generation dominates; embedding and retrieval are rounding errors by comparison.
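Per-million-token pricing makes unit slips easy, so it can help to keep the arithmetic in code. The sketch below is a hypothetical helper; its default prices are the ones quoted in this article's stack table, and the token counts are the same rough assumptions used in the estimate above (about 15 tokens per query embedding, 2,800 input tokens, 100 output tokens).

```python
def daily_cost_usd(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    query_embed_tokens: int = 15,
    input_price_per_m: float = 3.00,    # USD per 1M input tokens
    output_price_per_m: float = 15.00,  # USD per 1M output tokens
    embed_price_per_m: float = 0.02,    # USD per 1M embedding tokens
) -> dict[str, float]:
    """Break a day's spend into embedding, LLM input, and LLM output."""
    per_m = 1_000_000
    embedding = queries_per_day * query_embed_tokens * embed_price_per_m / per_m
    llm_input = queries_per_day * input_tokens_per_query * input_price_per_m / per_m
    llm_output = queries_per_day * output_tokens_per_query * output_price_per_m / per_m
    return {
        "embedding": embedding,
        "llm_input": llm_input,
        "llm_output": llm_output,
        "total": embedding + llm_input + llm_output,
    }


# 100K queries/day, ~2,800 input tokens and ~100 output tokens per query.
estimate = daily_cost_usd(100_000, 2_800, 100)
for line_item, usd in estimate.items():
    print(f"{line_item:>10}: ${usd:,.2f}/day")
```

Splitting the result by line item makes it obvious which lever matters: halving `top_k` roughly halves the input term, while the embedding term barely registers.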

Production checklist

1. Pin SDK versions in requirements.txt
Lock `pinecone==6.0.0`, `openai==1.50.0`, `anthropic==0.40.0`, `tiktoken==0.8.0`. Pinecone and Anthropic release breaking changes regularly. A floating `pinecone` dependency broke a production ingestion job in a widely reported incident in early 2026 when v6 shipped without a deprecation period on the `pinecone-client` package name.

2. Use namespaces for every multi-tenant index
Make namespace a required argument in your retrieval function. Assert it is non-empty before every query. Test that Tenant A's documents are not retrievable when querying with Tenant B's namespace. This is a security boundary, not an optimization.

3. Store chunk text in metadata, or in a sidecar store
A Pinecone query returns IDs, scores, and metadata, not the source text, so the chunk text has to live somewhere. Metadata storage in Pinecone is simplest for small-to-medium indexes. For indexes over 10M chunks, a sidecar Redis or Postgres store (keyed on vector ID) keeps Pinecone metadata lean and reduces read-unit cost.

4. Monitor index health post-ingestion
Run `describe_index_stats()` after every ingestion job and alert on unexpected vector count drops. Set up a daily spot-check query against known documents and assert the top-1 result is the expected chunk. This catches silent embedding or upsert failures.

5. Add prompt caching on the Claude system prompt
If your RAG system prompt is stable across calls (it usually is), mark it with `cache_control: {type: 'ephemeral'}` in the Anthropic SDK. Cached reads are billed at a fraction of the normal input rate, so at 100K daily queries the saving on that portion is about 90%. Anthropic enforces a minimum cacheable prompt length, and a 100-300 token system prompt on its own may fall below it, so confirm the threshold in the prompt caching docs before counting on the saving. See the OpenAI to Claude migration guide for the exact cache breakpoint syntax.

6. Evaluate chunk size with your actual corpus
512 tokens with 50-token overlap is a starting point, not a universal answer. Run a held-out eval: 50 question-answer pairs from your corpus, measure retrieval recall@5 at chunk sizes 256, 512, 1024. The optimal size varies by document type: manuals tend to prefer larger chunks; Q&A databases prefer smaller ones.

7. Set deletion policies before launch
Pinecone does not auto-expire vectors. Define when document vectors should be deleted (doc update, user offboarding, data retention policy). Implement a soft-delete pattern: add `deleted: true` to metadata, filter it out at query time, and run a periodic hard-delete job with `index.delete(filter={'deleted': {'$eq': True}})`. Hard deletion is not transactionally atomic on eventually consistent indexes.
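For the chunk-size evaluation in checklist item 6, recall@k can be computed without any framework. The sketch below uses the hit-rate form (a question counts as a hit if at least one relevant chunk ID appears in its top k results); the function name and sample IDs are hypothetical, and other recall definitions exist.

```python
def recall_at_k(
    retrieved: list[list[str]],
    relevant: list[set[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose top-k results include a relevant chunk ID."""
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have the same length")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ids, gold in zip(retrieved, relevant) if gold & set(ids[:k])
    )
    return hits / len(retrieved)


# Three eval questions: ranked IDs returned by retrieval, and the gold IDs.
retrieved_ids = [
    ["d1_chunk_0", "d2_chunk_3"],
    ["d4_chunk_1", "d5_chunk_0"],
    ["d9_chunk_2"],
]
gold_ids = [{"d1_chunk_0"}, {"d7_chunk_4"}, {"d9_chunk_2"}]
score = recall_at_k(retrieved_ids, gold_ids, k=5)
```

Running the same question set against indexes built at 256, 512, and 1024 tokens gives one comparable number per chunk size. Gold labels keyed on chunk IDs change when the chunking changes, so labelling by `source_id` is often easier to maintain.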

Frequently Asked Questions

What dimension should I use for my Pinecone index?

It must match your embedding model. OpenAI text-embedding-3-small outputs 1536 dimensions (default; you can request fewer via the `dimensions` parameter for cost/performance tradeoffs). text-embedding-3-large outputs 3072. OpenAI ada-002 outputs 1536. Set the Pinecone index dimension at creation time — you cannot change it without re-creating the index and re-ingesting all vectors.

How many chunks should I retrieve (top_k)?

Start with 5. For short, precise answers (FAQ-style), 3 is often enough. For long-form synthesis over a large corpus, try 8-10. Each additional chunk adds ~512 tokens to your Claude context input, which costs $3/1M tokens on Sonnet 4.6: about $0.0015 per extra chunk per query, or roughly $154/day per extra chunk at 100K queries/day. That is small per query but adds up at volume, so raise `top_k` only when your evals show an answer-quality gain. Use RAG cost per query calculator to model your specific workload.

Should I use LangChain or LlamaIndex for Pinecone RAG?

Both are valid for prototyping. For production, direct SDK calls are preferable for most teams. LangChain and LlamaIndex add abstraction layers that make debugging harder when embeddings silently truncate, when Pinecone upserts fail partially, or when the LLM call format changes. The direct SDK pattern in this tutorial is 150 lines of straightforward Python — no framework required.

What is Pinecone serverless vs pod-based?

Serverless (2024 GA) eliminates pod sizing — you pay per query (read units) and per vector stored per hour, with no always-on pod cost. Pod-based is the older architecture where you choose pod type (s1, p1, p2) and pod count upfront. Serverless is the default recommendation for new projects. Pod-based is still available and preferred when you need sparse+dense hybrid search natively, or when your query latency SLA requires <5ms p99 (pod-based with p2 pods achieves this; serverless p99 is typically 10-30ms).

How do I handle document updates in a Pinecone RAG system?

Pinecone vectors are upserted by ID. For a document update: (1) re-chunk the updated document, (2) upsert the new chunks with new IDs (or the same IDs if the chunk count and positions are stable), (3) delete any old chunk IDs that no longer exist. The challenge is stale chunk IDs: if a 10-chunk document is re-edited to 8 chunks, the last two chunks (`_chunk_8` and `_chunk_9` under this tutorial's zero-based ID scheme) remain in the index. Keep a sidecar record of chunk IDs per source document and diff on update.
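The diff step can be as small as a set difference. In the sketch below, `plan_update` is a hypothetical helper and the ID sets stand in for whatever the sidecar record holds; the ID format follows the `{doc_id}_chunk_{i}` scheme from Phase 3.

```python
def plan_update(old_ids: set[str], new_ids: set[str]) -> dict[str, list[str]]:
    """Decide which chunk IDs to upsert and which stale ones to delete."""
    return {
        "upsert": sorted(new_ids),            # re-embed and overwrite by ID
        "delete": sorted(old_ids - new_ids),  # stale chunks left by the edit
    }


# A 10-chunk document re-edited down to 8 chunks.
old_chunk_ids = {f"doc_42_chunk_{i}" for i in range(10)}
new_chunk_ids = {f"doc_42_chunk_{i}" for i in range(8)}
plan = plan_update(old_chunk_ids, new_chunk_ids)
```

The `delete` list can then be passed to `index.delete(ids=plan["delete"], namespace=...)`, after which the sidecar record is replaced with the new ID set.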

Can I use Claude for embeddings instead of OpenAI?

Anthropic does not offer a dedicated embedding model. The standard production stack is OpenAI embeddings (text-embedding-3-small or text-embedding-3-large) for retrieval and Claude for generation. Cohere also offers competitive embedding models (embed-v4.0, with 1024-dim outputs) at similar pricing — valid alternative if you want to reduce vendor count. Do not mix embedding models between ingestion and query time; the embedding space changes, breaking retrieval.

What is a Pinecone read unit and how does it affect cost?

A read unit (RU) is Pinecone's billing primitive for queries on serverless indexes. One RU represents scanning roughly 1,000 vectors during the ANN search. A query against an index with 100K vectors costs roughly 100 RUs; 1M vectors costs ~1,000 RUs. At $0.04/1M RUs, a query against a 1M-vector index costs $0.00004. That is more than 100× the query-embedding cost ($0.02/1M tokens × ~15 tokens per query = $0.0000003) but still far below the Claude generation cost. Check the Pinecone serverless pricing calculator at pinecone.io/pricing for your expected index size.
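The rule of thumb above can be written down as a small estimator. This is a sketch of this article's approximation (about 1,000 vectors per RU at $0.04 per 1M RUs), not Pinecone's billing formula; the function name and defaults are assumptions, and actual read-unit usage depends on index configuration.

```python
import math


def query_read_cost_usd(
    index_vectors: int,
    vectors_per_ru: int = 1_000,   # article's rule of thumb, an assumption
    price_per_m_ru: float = 0.04,  # USD per 1M read units, from the table
) -> float:
    """Approximate the read-unit cost of one query against an index."""
    read_units = max(1, math.ceil(index_vectors / vectors_per_ru))
    return read_units * price_per_m_ru / 1_000_000


cost_100k = query_read_cost_usd(100_000)
cost_1m = query_read_cost_usd(1_000_000)
```

Under these assumptions cost grows linearly with index size, which is one reason per-tenant namespaces help: each query only reads the tenant's own slice.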

How does this Pinecone RAG stack compare to pgvector?

Pinecone is a purpose-built vector database with managed infrastructure, sub-100ms ANN at billion-vector scale, and serverless pricing. pgvector runs inside Postgres — no separate service, transactional consistency, SQL joins across relational and vector data, familiar ops. If you already run Postgres (e.g., via Supabase or Neon), pgvector is worth evaluating before adding a Pinecone dependency. See build RAG with pgvector for the full comparison and code.
