Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Build RAG with pgvector (2026): Postgres-Native Vector Search to Claude Generation

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

The most common objection to adding Pinecone or Qdrant to a stack is operational: one more service to provision, monitor, back up, and pay for separately. pgvector answers that objection by adding vector similarity search directly to Postgres. If you already run Postgres — whether self-hosted, on RDS, on Supabase, or on Neon — pgvector requires a one-line `CREATE EXTENSION` and you have a production-ready vector store with transactional consistency, mature backup/restore, RBAC, and SQL join capability across your relational and vector data.

This tutorial builds a complete RAG pipeline: install pgvector, create the documents table with a `vector(1536)` column, create an HNSW index with appropriate `m` and `ef_construction` parameters, insert embeddings with a parameterized INSERT, query with cosine distance (`<=>` operator), layer in BM25 hybrid search via `tsvector` and `ts_rank`, and generate answers with Claude Sonnet 4.6. Python (asyncpg + psycopg3) is used throughout. Managed Postgres options — Supabase and Neon — are covered in Phase 9; both have pgvector enabled by default and eliminate the extension installation step.

Related: Build RAG with Pinecone (if you want a separate managed vector DB) · Hybrid search BM25 + dense tutorial (for the RRF fusion algorithm in full detail) · Pinecone vs Weaviate vs Qdrant comparison · RAG cost per query calculator · RAG architecture decision tree. For GraphRAG on knowledge-graph-heavy corpora, see build GraphRAG 2026.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Stack components and cost (2026)

Feature
Component
What it does
Pricing
pgvector 0.7+Vector column type + HNSW/IVFFlat ANN indexesOpen-source (MIT), github.com/pgvector/pgvector
Postgres 16+Relational database with pgvector extensionOpen-source; managed: Supabase free tier, Neon free tier, RDS ~$0.017/hr db.t4g.micro
OpenAI text-embedding-3-small1536-dim embeddings$0.02/1M tokens (platform.openai.com/docs/models, June 2026)
Anthropic Claude Sonnet 4.6Answer generation$3/1M input, $15/1M output (docs.anthropic.com/en/docs/about-claude/pricing)
psycopg3 (psycopg[binary])Python Postgres driver, async supportLGPL, pip install psycopg[binary]
Supabase or Neon (optional)Managed Postgres with pgvector pre-enabledSupabase: free tier + $25/mo Pro; Neon: free tier + $19/mo Launch

Postgres and pgvector are open-source. Cloud pricing sourced from supabase.com/pricing, neon.tech/pricing, and aws.amazon.com/rds/postgresql/pricing — June 2026. Self-hosted Postgres on EC2 t4g.medium costs ~$25/mo; adds ops work but removes per-query billing.

Phase 1: Install pgvector and enable the extension

pgvector 0.7 added HNSW indexing (previously only IVFFlat was available). HNSW is the preferred index type for most RAG workloads — it does not require a training step on your data, supports incremental inserts without a rebuild, and achieves higher recall at lower query latency than IVFFlat at equivalent ef settings. Minimum Postgres version: 13. Recommended: 16+.

On a Debian/Ubuntu self-hosted Postgres 16:

```bash # Install pgvector from apt (available on Ubuntu 22.04+ via postgresql apt repo) sudo apt-get install postgresql-16-pgvector # Or build from source (requires build-essential, postgresql-server-dev-16) git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git cd pgvector make && sudo make install ```

Enable the extension in your database (run as superuser, once per database):

```sql CREATE EXTENSION IF NOT EXISTS vector; ```

On Supabase, the `vector` extension is already installed — navigate to Database → Extensions in the Supabase dashboard, search 'vector', and enable it. Or run `CREATE EXTENSION IF NOT EXISTS vector;` in the SQL editor. On Neon, same approach — the extension is available, not auto-enabled; run the `CREATE EXTENSION` statement once.


Phase 2: Schema — documents table and HNSW index

The documents table stores raw chunk text, the embedding vector, and metadata. The HNSW index is created separately after the table so it can be tuned with `m` and `ef_construction` parameters.

```sql -- Documents table CREATE TABLE docs ( id BIGSERIAL PRIMARY KEY, source_id TEXT NOT NULL, -- original document identifier chunk_index INTEGER NOT NULL, -- position within source document content TEXT NOT NULL, -- raw chunk text embedding vector(1536) NOT NULL, -- must match embedding model dimension metadata JSONB, -- arbitrary key-value pairs ts_content TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED, created_at TIMESTAMPTZ DEFAULT NOW(), UNIQUE (source_id, chunk_index) ); -- HNSW index for cosine distance (ANN search) CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64); -- Full-text search index for BM25 hybrid CREATE INDEX ON docs USING gin (ts_content); -- Index on source_id for efficient document-level deletes CREATE INDEX ON docs (source_id); ```

HNSW tuning parameters: `m` controls the number of bi-directional links per node in the graph. Higher `m` improves recall but increases memory usage and index build time. Default is 16; for high-precision RAG use cases, try 32 and benchmark recall. `ef_construction` controls the size of the dynamic candidate list during index build — higher improves recall at the cost of slower build time. Default is 64; leave at 64 unless your recall benchmarks are poor.

The `ts_content` generated column stores the pre-computed `tsvector` for BM25 hybrid search. Generating it as a stored column means the text is indexed once at insert time rather than at every query — this is the production pattern for hybrid Postgres RAG. The `GENERATED ALWAYS AS ... STORED` syntax requires Postgres 12+.

The `vector_cosine_ops` operator class instructs the HNSW index to optimize for cosine distance (the `<=>` operator). Other options: `vector_l2_ops` for Euclidean distance (`<->`), `vector_ip_ops` for negative inner product (`<#>`). Use cosine for OpenAI and most normalized embedding models. Use inner product only when you know your model produces unnormalized vectors.


Phase 3: Python setup — psycopg3 with async support

psycopg3 is the recommended driver for new Python Postgres projects. It supports async natively, has a clean parameter binding API, and supports the `register_vector` integration for pgvector column types.

```bash pip install psycopg[binary]==3.2.3 psycopg_pool==3.2.3 openai==1.50.0 anthropic==0.40.0 tiktoken==0.8.0 numpy==1.26.4 ```

```python import os import asyncio import numpy as np import psycopg from psycopg.rows import dict_row from openai import OpenAI import anthropic # Register pgvector type support from psycopg.adapt import Dumper, Loader # Clients oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"]) anth = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"]) DATABASE_URL = os.environ["DATABASE_URL"] # e.g. postgresql://user:pass@host:5432/dbname EMBED_MODEL = "text-embedding-3-small" async def get_conn(): return await psycopg.AsyncConnection.connect( DATABASE_URL, row_factory=dict_row, ) ```

For production, use `psycopg_pool.AsyncConnectionPool` rather than creating a new connection per request. Connection creation is slow (~50-100ms on a managed Postgres); a pool of 5-20 connections handles hundreds of concurrent queries.

```python from psycopg_pool import AsyncConnectionPool # Create pool at application startup, not per-request pool = AsyncConnectionPool( DATABASE_URL, min_size=5, max_size=20, kwargs={"row_factory": dict_row}, ) ```


Phase 4: Chunking, embedding, and parameterized INSERT

Chunking logic is identical to the Pinecone tutorial — recursive 512-token splitting with 50-token overlap using tiktoken. See build RAG with Pinecone Phase 3 for the full `chunk_text()` implementation.

The INSERT uses a parameterized query with the `::vector` cast. psycopg3 does not automatically cast Python lists to the pgvector `vector` type — you must either register a custom adapter or pass the embedding as a string with `::vector` cast syntax. The `::vector` approach is simpler and more explicit.

```python async def insert_chunks( conn, chunks: list[dict], batch_size: int = 100, ) -> int: """ Embed chunks and insert into docs table. chunks: list of {source_id, chunk_index, content, metadata} Returns total rows inserted. """ total = 0 for i in range(0, len(chunks), batch_size): batch = chunks[i : i + batch_size] texts = [c["content"] for c in batch] # Embed with OpenAI resp = oai.embeddings.create(model=EMBED_MODEL, input=texts) embeddings = [item.embedding for item in resp.data] # Bulk insert with executemany await conn.executemany( """ INSERT INTO docs (source_id, chunk_index, content, embedding, metadata) VALUES (%s, %s, %s, %s::vector, %s) ON CONFLICT (source_id, chunk_index) DO UPDATE SET content = EXCLUDED.content, embedding = EXCLUDED.embedding, metadata = EXCLUDED.metadata """, [ ( batch[j]["source_id"], batch[j]["chunk_index"], batch[j]["content"], str(embeddings[j]), # cast: '[0.12, -0.34, ...]'::vector psycopg.types.json.Jsonb(batch[j].get("metadata", {})), ) for j in range(len(batch)) ], ) total += len(batch) print(f"Inserted {total} chunks...") await conn.commit() return total ```

The `ON CONFLICT DO UPDATE` clause makes inserts idempotent — re-running ingestion on an updated document overwrites the old chunks rather than creating duplicates. The `UNIQUE (source_id, chunk_index)` constraint enforces this. If your update changes the chunk count (document got longer), old extra chunks remain and need explicit deletion — see the production checklist step on document updates.


Phase 5: Vector similarity query — cosine, L2, inner product

pgvector exposes three distance operators as SQL infix operators. Use the one that matches your embedding model's geometry and your index's operator class.

```sql -- Cosine distance (use with vector_cosine_ops index, normalized embeddings like OpenAI) SELECT id, source_id, content, 1 - (embedding <=> $1::vector) AS similarity FROM docs ORDER BY embedding <=> $1::vector LIMIT $2; -- L2 / Euclidean distance (use with vector_l2_ops index) SELECT id, source_id, content, embedding <-> $1::vector AS distance FROM docs ORDER BY embedding <-> $1::vector LIMIT $2; -- Inner product / dot product (use with vector_ip_ops index, unnormalized embeddings) SELECT id, source_id, content, (embedding <#> $1::vector) * -1 AS score FROM docs ORDER BY embedding <#> $1::vector LIMIT $2; ```

Python retrieval function:

```python async def retrieve_dense( query: str, conn, top_k: int = 5, metadata_filter: str | None = None, ) -> list[dict]: """ Embed query and retrieve top-K chunks by cosine similarity. metadata_filter: optional SQL fragment, e.g. "metadata->>'category' = 'policy'" """ q_emb = oai.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding q_str = str(q_emb) # '[0.12, -0.34, ...]' where_clause = f"AND {metadata_filter}" if metadata_filter else "" sql = f""" SELECT id, source_id, chunk_index, content, 1 - (embedding <=> $1::vector) AS similarity FROM docs WHERE TRUE {where_clause} ORDER BY embedding <=> $1::vector LIMIT $2 """ rows = await conn.fetch(sql, q_str, top_k) return [dict(row) for row in rows] ```

The HNSW index is used automatically when the query's `ORDER BY` expression matches the index's operator class and the `LIMIT` clause is present. Verify the index is used with `EXPLAIN (ANALYZE, BUFFERS)` — look for `Index Scan using docs_embedding_idx` in the plan. If you see `Seq Scan`, check that `enable_seqscan = off` is not set and that your `ef_search` setting is not too low (default 40 for HNSW; increase for higher recall: `SET hnsw.ef_search = 100;`).


Phase 6: BM25 hybrid search via tsvector and ts_rank

Pure dense retrieval misses exact-keyword, code snippet, and proper-noun queries. Postgres has native full-text search (`tsvector`, `tsquery`, `ts_rank`) that implements a BM25-like scoring model. Combining dense ANN scores with BM25 keyword scores using Reciprocal Rank Fusion (RRF) consistently outperforms either retrieval method alone. For the full RRF algorithm and BEIR benchmark comparisons, see hybrid search BM25 + dense tutorial.

The Postgres hybrid query runs both retrievals in a single SQL statement using a CTE and merges scores with RRF:

```sql WITH dense AS ( -- Dense retrieval: top 50 by cosine similarity SELECT id, 1 - (embedding <=> $1::vector) AS dense_score, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS dense_rank FROM docs ORDER BY embedding <=> $1::vector LIMIT 50 ), bm25 AS ( -- BM25 retrieval: top 50 by ts_rank SELECT id, ts_rank(ts_content, plainto_tsquery('english', $2)) AS bm25_score, ROW_NUMBER() OVER ( ORDER BY ts_rank(ts_content, plainto_tsquery('english', $2)) DESC ) AS bm25_rank FROM docs WHERE ts_content @@ plainto_tsquery('english', $2) LIMIT 50 ), fused AS ( -- Reciprocal Rank Fusion: score = 1/(60 + rank_dense) + 1/(60 + rank_bm25) SELECT COALESCE(d.id, b.id) AS id, COALESCE(1.0 / (60.0 + d.dense_rank), 0) + COALESCE(1.0 / (60.0 + b.bm25_rank), 0) AS rrf_score FROM dense d FULL OUTER JOIN bm25 b ON d.id = b.id ) SELECT docs.id, docs.source_id, docs.content, fused.rrf_score FROM fused JOIN docs ON docs.id = fused.id ORDER BY fused.rrf_score DESC LIMIT $3; ```

```python HYBRID_SQL = """ WITH dense AS ( SELECT id, 1 - (embedding <=> $1::vector) AS dense_score, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS dense_rank FROM docs ORDER BY embedding <=> $1::vector LIMIT 50 ), bm25 AS ( SELECT id, ts_rank(ts_content, plainto_tsquery('english', $2)) AS bm25_score, ROW_NUMBER() OVER ( ORDER BY ts_rank(ts_content, plainto_tsquery('english', $2)) DESC ) AS bm25_rank FROM docs WHERE ts_content @@ plainto_tsquery('english', $2) LIMIT 50 ), fused AS ( SELECT COALESCE(d.id, b.id) AS id, COALESCE(1.0 / (60.0 + d.dense_rank), 0) + COALESCE(1.0 / (60.0 + b.bm25_rank), 0) AS rrf_score FROM dense d FULL OUTER JOIN bm25 b ON d.id = b.id ) SELECT docs.id, docs.source_id, docs.content, fused.rrf_score FROM fused JOIN docs ON docs.id = fused.id ORDER BY fused.rrf_score DESC LIMIT $3; """ async def retrieve_hybrid( query: str, conn, top_k: int = 5, ) -> list[dict]: """Hybrid dense + BM25 retrieval via RRF in a single Postgres query.""" q_emb = oai.embeddings.create(model=EMBED_MODEL, input=[query]).data[0].embedding q_str = str(q_emb) rows = await conn.fetch(HYBRID_SQL, q_str, query, top_k) return [dict(row) for row in rows] ```

The `FULL OUTER JOIN` between dense and BM25 results ensures that documents appearing in only one retrieval set are still included in the fusion — a document that ranks 1st in BM25 but doesn't appear in dense top-50 still scores `1/(60+1) = 0.0164`, which often beats documents that rank 40th+ in both. The constant k=60 in the RRF denominator is the standard choice from the original RRF paper (Cormack et al., 2009); it de-emphasizes the very top ranks to smooth score variance. Use `plainto_tsquery` rather than `to_tsquery` for user input — `plainto_tsquery` handles arbitrary text safely, while `to_tsquery` requires AND/OR/NOT operator syntax and will error on natural-language queries.


Phase 7: Context assembly and Claude generation

Context assembly and generation are identical to the Pinecone tutorial pattern: XML-tagged `<document>` blocks, `<context>` + `<question>` wrapper, streaming Claude response.

```python def assemble_context(rows: list[dict]) -> str: parts = [] for i, row in enumerate(rows, start=1): parts.append( f'<document index="{i}" source="{row["source_id"]}">\n' f'{row["content"]}\n' f'</document>' ) return "\n\n".join(parts) async def ask( query: str, conn, top_k: int = 5, hybrid: bool = True, ) -> str: """ Full RAG pipeline with optional hybrid retrieval. Returns the complete generated answer. """ if hybrid: rows = await retrieve_hybrid(query, conn, top_k=top_k) else: rows = await retrieve_dense(query, conn, top_k=top_k) context = assemble_context(rows) system_prompt = ( "You are a precise research assistant. Answer using only the provided documents. " "If the answer is not in the documents, say so explicitly. " "Cite document index numbers inline as [1], [2, 3], etc." ) full_text = "" with anth.messages.stream( model="claude-sonnet-4-6", max_tokens=1024, system=system_prompt, messages=[{ "role": "user", "content": f"<context>\n{context}\n</context>\n\n<question>{query}</question>" }], ) as stream: for text in stream.text_stream: print(text, end="", flush=True) full_text += text print() return full_text ```


Phase 8: ef_search tuning and EXPLAIN verification

HNSW recall at query time is controlled by `hnsw.ef_search` (default 40). This parameter sets the size of the dynamic candidate list during search — higher ef_search means more candidates are evaluated, improving recall at the cost of query latency. The tradeoff is nonlinear: going from ef_search=40 to ef_search=100 typically costs 2x latency but recovers recall from ~95% to ~99%+ at top-K=5.

```sql -- Set ef_search for the current session (default 40) SET hnsw.ef_search = 100; -- Or set for the current transaction SET LOCAL hnsw.ef_search = 100; -- Verify the HNSW index is being used EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT id, content, 1 - (embedding <=> '[0.1,0.2,...]'::vector) AS sim FROM docs ORDER BY embedding <=> '[0.1,0.2,...]'::vector LIMIT 5; ```

In the EXPLAIN output, look for `Index Scan using docs_embedding_idx on docs`. If you see `Seq Scan`, the planner decided a full scan was cheaper — this is correct behavior for small tables (under ~1,000 rows) where the index overhead exceeds the scan cost. For large tables, if you're getting `Seq Scan` unexpectedly, run `ANALYZE docs;` to refresh table statistics and ensure `random_page_cost` is tuned for your storage (SSD: set `random_page_cost = 1.1`).

Parallel workers can speed up large sequential scans but do NOT parallelize HNSW index scans in Postgres 16 — HNSW queries are single-threaded. For workloads requiring very high throughput, run multiple query workers in parallel at the application layer rather than relying on Postgres parallelism for each individual query.


Phase 9: Managed Postgres — Supabase and Neon

Both Supabase and Neon offer managed Postgres with pgvector pre-installed, removing the extension installation and Postgres version management overhead. They are the fastest path to a production pgvector deployment.

**Supabase** (supabase.com): Postgres 15+, pgvector enabled by default. Free tier: 500MB database, 1GB file storage. Pro tier: $25/mo, 8GB database. Supabase also offers a JavaScript/Python client library with pgvector helpers and Row Level Security (RLS) for per-tenant access control — the latter eliminates the need for application-layer namespace enforcement in multi-tenant RAG.

```python # Supabase connection string format DATABASE_URL = "postgresql://postgres:{password}@db.{project_ref}.supabase.co:5432/postgres" # Connection pooling via Supabase's PgBouncer (recommended for high-concurrency) DATABASE_URL_POOLER = "postgresql://postgres.{project_ref}:{password}@aws-0-{region}.pooler.supabase.com:6543/postgres" ```

**Neon** (neon.tech): Postgres 16+, pgvector available. Neon's architecture separates compute from storage — compute scales to zero when idle (eliminating always-on cost for low-traffic dev instances). Free tier: 0.5 CPU, 2GB storage. Launch tier: $19/mo. Neon supports branching — you can create a point-in-time branch of your database for testing schema changes without affecting production.

```python # Neon connection string format DATABASE_URL = "postgresql://{user}:{password}@{endpoint}.neon.tech/neondb?sslmode=require" # Neon requires SSL; add sslmode=require or sslrootcert parameter ```

Both Supabase and Neon support connection pooling out of the box. For pgvector RAG workloads with bursty concurrency (many simultaneous user queries), use the pooler endpoint rather than the direct connection endpoint — pooler handles connection spikes without hitting Postgres's `max_connections` limit. Supabase's pooler is PgBouncer in transaction mode; Neon's is a custom pool. Both are transparent to psycopg3.

For a cost comparison of self-hosted Postgres vs Supabase vs Pinecone vs Qdrant at scale, the pgvector-on-managed-Postgres option is typically the lowest-cost path for indexes under 5M vectors — the fixed Postgres cost is already paid for most teams, so the incremental cost of adding vector search is just the embedding compute.


Phase 10: Document updates and deletion

Unlike Pinecone, Postgres gives you transactional document management. The ACID guarantees mean a document update either fully completes or fully rolls back — no partial states where half the old chunks are deleted and half the new chunks are inserted.

```python async def update_document( conn, source_id: str, new_text: str, new_metadata: dict, ) -> int: """ Replace all chunks for a document atomically. Returns new chunk count. """ from chunking import chunk_text # your chunk_text() function import tiktoken chunks_text = chunk_text(new_text) new_chunks = [ { "source_id": source_id, "chunk_index": i, "content": text, "metadata": {**new_metadata, "chunk_index": i}, } for i, text in enumerate(chunks_text) ] async with conn.transaction(): # Delete old chunks await conn.execute("DELETE FROM docs WHERE source_id = $1", source_id) # Insert new chunks (reuse insert_chunks but pass the conn directly) await insert_chunks(conn, new_chunks) return len(new_chunks) async def delete_document(conn, source_id: str) -> int: """Delete all chunks for a document. Returns deleted row count.""" result = await conn.execute("DELETE FROM docs WHERE source_id = $1", source_id) await conn.commit() return int(result.split()[-1]) # 'DELETE N' ```

The transactional delete-then-insert pattern is safe because Postgres serializes concurrent queries against the table — a query running during the update will either see the old state or the new state, never a partial mix. This is a fundamental advantage of pgvector over Pinecone for workloads with frequent document updates (e.g., a CMS where articles are edited daily).

Production checklist

  1. 1

    Enable pgvector and verify the version

    Run `SELECT extversion FROM pg_extension WHERE extname = 'vector';` after enabling. Must be 0.7.0 or higher for HNSW support. Earlier versions (0.5.x, 0.6.x) only have IVFFlat, which requires a training step and does not support incremental inserts.

  2. 2

    Create the HNSW index before ingesting large datasets

    HNSW index build time is O(n log n) — creating it on 1M rows takes longer than creating it on an empty table. Create the index on an empty table, then ingest. If you must create it post-ingestion, set `maintenance_work_mem = '4GB'` (or higher) in your session before the `CREATE INDEX` statement to speed up the build.

  3. 3

    Use connection pooling at application layer

    psycopg3 + psycopg_pool AsyncConnectionPool with min_size=5, max_size=20 handles hundreds of concurrent RAG queries without connection exhaustion. For managed Postgres (Supabase/Neon), use the pooler connection string to route through PgBouncer and avoid hitting `max_connections` on the Postgres server itself.

  4. 4

    Tune ef_search for your recall target

    Default ef_search=40 achieves ~95% recall@5 in typical workloads. For production RAG, benchmark your held-out eval at ef_search=40, 80, 100 and pick the lowest setting that meets your recall target. Set it per-session with `SET hnsw.ef_search = N;` rather than globally to allow different endpoints to use different settings.

  5. 5

    Add Row Level Security for multi-tenant deployments

    Supabase supports Postgres RLS natively. Add a `tenant_id` column to the docs table, enable RLS, and create a policy that restricts rows to the current session's tenant. This enforces isolation at the database level rather than relying on application-layer namespace checks — a more secure default.

    → Open the RAG cost per query calculator
  6. 6

    Vacuum and analyze after bulk ingestion

    After inserting millions of rows, run `ANALYZE docs;` to update table statistics. Without fresh statistics, the Postgres query planner may choose a sequential scan over the HNSW index for the first few queries after bulk ingestion. Run `VACUUM ANALYZE docs;` as part of your ingestion job completion hook.

Frequently Asked Questions

What version of pgvector adds HNSW support?

pgvector 0.5.0 added HNSW indexing (released October 2023). pgvector 0.7.0 (released June 2024) added iterative index scans and improved recall for HNSW at high recall settings. Install 0.7.0 or higher. Check your version: `SELECT extversion FROM pg_extension WHERE extname = 'vector';`

When should I use pgvector instead of Pinecone?

pgvector is the better choice when: (1) you already run Postgres and don't want another service to manage; (2) you need transactional consistency between relational data and vector data (e.g., DELETE a user and their vectors atomically); (3) your index is under 5M vectors and query latency of 20-50ms is acceptable; (4) you need SQL joins between vector results and relational tables. Pinecone wins when: (1) you need sub-10ms p99 at billion-vector scale; (2) your team has no Postgres expertise; (3) you need built-in serverless scaling without ops work. See Pinecone vs Weaviate vs Qdrant comparison for the full analysis.

What is the cosine distance operator in pgvector?

The `<=>` operator returns the cosine distance between two vectors. Cosine distance = 1 - cosine similarity, so `<=>` returns 0 for identical vectors and 2 for maximally dissimilar vectors. To get similarity as a score between 0 and 1, use `1 - (embedding <=> query_vector)`. This is the standard pattern when you want to display a relevance score. The ORDER BY clause should still use `embedding <=>` (smaller distance = more similar = better rank).

Does pgvector support hybrid BM25 + dense search natively?

Not as a single native index type. Hybrid search requires combining Postgres full-text search (`tsvector` + `ts_rank` for BM25-like scoring) with pgvector's HNSW dense search using Reciprocal Rank Fusion in a CTE query. This is a SQL-level fusion, not a native hybrid index. Qdrant and Weaviate have native hybrid indexes that may be faster for very high query volumes — see hybrid search BM25 + dense tutorial for the full comparison.

How does pgvector performance compare to Pinecone at 1M vectors?

On a Postgres instance with an NVMe SSD and adequate RAM to cache the HNSW index, pgvector achieves 10-30ms p95 query latency at 1M vectors with ef_search=100 and recall@10 above 99%. Pinecone serverless achieves 50-100ms p95 due to network overhead. Pinecone pod-based (p2 pods) achieves 5-10ms p95. For most RAG workloads where the user's perception of latency is dominated by LLM generation time (500ms-2s), the 20ms pgvector query vs 10ms Pinecone pod difference is not perceptible. Benchmark your specific workload before optimizing for vector search latency.

Can I use pgvector on AWS RDS?

Yes. AWS RDS for PostgreSQL supports pgvector as of Postgres 15.2+. Install with `CREATE EXTENSION vector;` (no superuser required on RDS as of pgvector 0.5.1+). Aurora PostgreSQL also supports pgvector on compatible versions. Note that Aurora PostgreSQL charges for I/O separately, which can make high-query-volume RAG workloads expensive — benchmark with your expected query volume before choosing Aurora over standard RDS.

What is the maximum number of dimensions pgvector supports?

pgvector 0.7.x supports up to 16,000 dimensions. In practice, the largest embedding models produce 3,072 dimensions (OpenAI text-embedding-3-large). HNSW indexes on high-dimensional vectors (above 2,000 dimensions) use more memory and have slower index build times. For most RAG workloads, text-embedding-3-small at 1,536 dimensions is the optimal cost-performance tradeoff — comparable recall to text-embedding-3-large at half the embedding cost and faster HNSW operations.

How do I migrate from Pinecone to pgvector?

The main data migration challenge is re-ingesting your documents — there is no direct Pinecone-to-pgvector vector export path. Export your source documents (not the vectors), re-chunk and re-embed them with the same embedding model, insert into pgvector. If you're switching embedding models, you must re-embed everything — embedding spaces are not compatible across models. Run both databases in parallel for 2-4 weeks during the migration, routing queries to pgvector incrementally as you validate recall quality.

Build better RAG generation prompts for Claude.

Retrieval quality gets the chunk to Claude. Your generation prompt determines what Claude does with it. Our AI Prompt Generator builds XML-structured, grounding-optimized Claude prompts for your RAG use case — citation format, fallback handling, multi-document synthesis. 14-day free trial, no card.

Browse all prompt tools →