Skip to article
AI/LLMs 10 Feb 2026 18 min read 2.9K views

Building Production RAG Pipelines with LangChain & Supabase

End-to-end guide to retrieval-augmented generation: vector embeddings, semantic search, and streaming responses.

Suboor Khan

Full-Stack Developer & Technical Writer

Retrieval-Augmented Generation (RAG) is the pattern that makes LLMs genuinely useful in production. Instead of relying on a model's training data—which is stale and hallucination-prone—you give it fresh, grounded context at inference time. The results are dramatically more accurate and trustworthy.

We'll build a complete production-grade RAG system: document ingestion, chunking, embedding, vector storage in Supabase (pgvector), semantic retrieval, and streaming generation—all wired together with LangChain.

Document Ingestion & Chunking

How you chunk documents is the single biggest lever on RAG quality. Chunks too large = noisy context. Too small = loss of coherence. We use RecursiveCharacterTextSplitter with overlap to preserve context at boundaries.

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';

const loader   = new PDFLoader('knowledge-base.pdf');
const rawDocs  = await loader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize:    1200,   // ~300 tokens — fits well in context window
  chunkOverlap: 200,    // overlap preserves sentence context at boundaries
  separators:   ['

', '
', '. ', ' ', ''],
});

const chunks = await splitter.splitDocuments(rawDocs);
console.log(`Produced ${chunks.length} chunks`);

Tip: For structured documents (contracts, technical docs) use semantic chunking based on headings rather than character count alone.

Embeddings & pgvector in Supabase

We embed each chunk with OpenAI's text-embedding-3-small (1536 dimensions, best cost/quality ratio as of 2026) and store them in Supabase with pgvector enabled.

// 1. Enable pgvector in Supabase
-- SQL migration
create extension if not exists vector;

create table documents (
  id        bigserial primary key,
  content   text not null,
  metadata  jsonb,
  embedding vector(1536)
);

create index on documents using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);  -- tune lists = sqrt(row_count)
// 2. Embed & insert with LangChain
import { OpenAIEmbeddings } from '@langchain/openai';
import { SupabaseVectorStore } from '@langchain/community/vectorstores/supabase';
import { createClient } from '@supabase/supabase-js';

const client   = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);
const embedder = new OpenAIEmbeddings({ model: 'text-embedding-3-small' });

await SupabaseVectorStore.fromDocuments(chunks, embedder, {
  client,
  tableName:        'documents',
  queryName:        'match_documents',
});

Semantic Retrieval

At query time we embed the user question and fetch the top-k most similar chunks using cosine similarity. We also add a Maximal Marginal Relevance (MMR) re-ranker to reduce redundant chunks.

const vectorStore = await SupabaseVectorStore.fromExistingIndex(embedder, {
  client,
  tableName:  'documents',
  queryName:  'match_documents',
});

// MMR retriever — balances relevance with diversity
const retriever = vectorStore.asRetriever({
  searchType:    'mmr',
  k:             6,
  fetchK:        20,  // fetch 20, re-rank to top 6
  lambda:        0.7, // 0 = max diversity, 1 = max relevance
});

LLM Generation with Context

We build a RetrievalQAChain with a custom prompt that instructs the model to answer only from the retrieved context and cite sources.

import { ChatOpenAI } from '@langchain/openai';
import { RetrievalQAChain, loadQAStuffChain } from 'langchain/chains';
import { PromptTemplate } from '@langchain/core/prompts';

const llm = new ChatOpenAI({ model: 'gpt-4o-mini', temperature: 0.2 });

const prompt = PromptTemplate.fromTemplate(`
You are a precise technical assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."
Always cite the source document when possible.

Context:
{context}

Question: {question}
Answer:`);

const chain = new RetrievalQAChain({
  combineDocumentsChain: loadQAStuffChain(llm, { prompt }),
  retriever,
  returnSourceDocuments: true,
});

Streaming Responses (Next.js App Router)

For a responsive UX, stream the LLM output token-by-token to the client using the Vercel AI SDK + ReadableStream.

// app/api/chat/route.ts
import { StreamingTextResponse, LangChainStream } from 'ai';

export async function POST(req: Request) {
  const { question } = await req.json();
  const { stream, handlers } = LangChainStream();

  chain.call({ query: question }, [handlers]);

  return new StreamingTextResponse(stream);
}

94%

Answer accuracy

340ms

Time to first token

2.1¢

Cost per query

Summary

  • Chunk with 200-token overlap using RecursiveCharacterTextSplitter
  • Use text-embedding-3-small for best cost/quality in 2026
  • Enable ivfflat index on pgvector for sub-millisecond retrieval at scale
  • Use MMR retrieval to reduce redundant context chunks
  • Stream responses with the Vercel AI SDK for instant perceived latency
  • Instruct the model to cite sources and refuse out-of-context questions

Stay Updated

Enjoyed this article?

Deep-dive articles on React, AI, WebGL, and software craft — twice a month. No spam, ever.