Building AI into your product in 2026 means navigating a complex landscape of
providers, frameworks, and deployment options. Do you call OpenAI directly? Use
a framework like LangChain? Self-host an open model? The right answer depends
on your constraints around cost, latency, privacy, and operational complexity.
This guide walks through the major architectural patterns, compares their tradeoffs,
and concludes with a complete design for a real-world Document Q&A system.
The Landscape at a Glance
┌─────────────────────────────────────────────────────────────────────┐
│ AI-Enabled Architectures │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ Direct │ │ Unified │ │ Framework │ │ Self- │ │
│ │ Provider │ │ API │ │ (Lang- │ │ Hosted │ │
│ │ APIs │ │ Gateway │ │ Chain) │ │ Models │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬─────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Your Application │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Let's examine each pattern in detail.
Pattern 1: Direct Provider APIs
The simplest approach: call OpenAI, Anthropic, or Google directly. You send
HTTP requests to their APIs and receive model outputs.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Your │────▶│ OpenAI/ │
│ (React, │ │ Backend │ │ Anthropic │
│ Mobile) │◀────│ (Node, │◀────│ API │
└─────────────┘ │ Python) │ └─────────────┘
└─────────────┘
Stores API key
Implementation Example
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
async function generateResponse(userMessage) {
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
messages: [
{ role: 'user', content: userMessage }
],
});
return response.content[0].text;
}
When to Use Direct APIs
| Pros | Cons |
|------|------|
| Simplest implementation | Vendor lock-in |
| First access to new features | Separate integrations per provider |
| Best latency (no intermediary) | No fallback if provider is down |
| Official SDKs well-maintained | Cost optimization is manual |
Best for: Teams committed to a single provider, applications
needing cutting-edge features, prototypes and MVPs.
Pattern 2: Unified API Gateways
Services like OpenRouter, Amazon Bedrock, and Azure AI Foundry provide a single API that routes to multiple model providers.
┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐
│ Your │────▶│ OpenRouter │────▶│ Anthropic (Claude) │
│ Backend │ │ or │────▶│ OpenAI (GPT-4) │
│ │◀────│ Bedrock │────▶│ Google (Gemini) │
└─────────────┘ └─────────────┘ │ Meta (Llama) │
└─────────────────────────┘
Single API,
multiple models
Key Benefits
- Model switching: Change models by updating a single parameter, no code changes required
- Automatic fallbacks: If Claude is down, route to GPT-4 automatically (a simple client-side version is sketched below)
- Unified billing: One invoice regardless of which providers you use
- A/B testing: Compare model performance in production
// OpenRouter uses OpenAI-compatible API format
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'anthropic/claude-sonnet-4', // Easy to switch!
messages: [{ role: 'user', content: 'Hello!' }],
}),
});
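If you prefer not to rely on gateway-side routing, a thin client-side wrapper gives you the same fallback behavior against any OpenAI-compatible gateway. A minimal Python sketch; the model slugs and timeout are illustrative, not recommendations:

import os
import requests

# Try each model slug in order through the same gateway endpoint.
FALLBACK_MODELS = ["anthropic/claude-sonnet-4", "openai/gpt-4o"]

def complete_with_fallback(prompt: str) -> str:
    for model in FALLBACK_MODELS:
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=30,
        )
        if resp.ok:
            # The gateway returns an OpenAI-style response body.
            return resp.json()["choices"][0]["message"]["content"]
    raise RuntimeError("All fallback models failed")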
Tradeoffs
| Consideration | Impact |
|---------------|--------|
| Latency | ~10-15% higher (additional hop) |
| Data privacy | Requests pass through intermediary |
| Feature parity | New provider features may lag |
| Cost | Small markup over direct pricing |
Best for: Teams wanting vendor flexibility, production systems
needing fallback reliability, organizations comparing multiple models.
Pattern 3: LLM Frameworks
Frameworks like LangChain, LlamaIndex, Haystack, and DSPy provide abstractions for common LLM patterns: chains, agents, retrieval, memory.
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────────────────────────┤
│ LangChain / LlamaIndex / Haystack │
├──────────────┬──────────────┬──────────────┬───────────────┤
│ Chains │ Agents │ Retrieval │ Memory │
├──────────────┴──────────────┴──────────────┴───────────────┤
│ LLM Providers │
│ (OpenAI, Anthropic, Local, etc.) │
└─────────────────────────────────────────────────────────────┘
Framework Comparison
| Framework | Strength | Best For |
|-----------|----------|----------|
| LangChain | Ecosystem (100+ integrations) | Complex, tool-augmented workflows |
| LlamaIndex | Data ingestion and indexing | RAG over large document sets |
| Haystack | Production-ready pipelines | Document QA in production |
| DSPy | Minimal overhead, optimized prompts | Performance-critical applications |
Performance Benchmarks
Recent benchmarks running identical agentic RAG workflows across frameworks
reveal significant differences in overhead:
Framework Overhead (milliseconds):
DSPy ████ ~3.5ms
Haystack ██████ ~5.9ms
LlamaIndex ██████ ~6.0ms
LangChain ██████████ ~10ms
LangGraph ██████████████ ~14ms
Token Usage (per query):
Haystack ████████ ~1,570 tokens
LlamaIndex ████████ ~1,600 tokens
DSPy ██████████ ~2,030 tokens
LangGraph ██████████ ~2,030 tokens
LangChain ████████████ ~2,400 tokens
LangChain Example
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import Qdrant
# Set up retriever
vectorstore = Qdrant.from_existing_collection(...)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Set up LLM
llm = ChatAnthropic(model="claude-sonnet-4-20250514")
# Build chain
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following context:
{context}
Question: {question}
""")
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Use it
answer = chain.invoke("What is the refund policy?")
Framework Tradeoffs
| Pros | Cons |
|------|------|
| Rapid prototyping | Abstraction overhead |
| Pre-built integrations | Harder to debug (magic) |
| Standard patterns (RAG, agents) | Version churn (breaking changes) |
| Active communities | Learning curve for framework idioms |
Best for: Teams building complex LLM pipelines, RAG applications,
agent-based systems, or anyone wanting proven patterns over custom code.
Pattern 4: Self-Hosted Models
Run open-weight models (Llama, Mistral, Qwen) on your own infrastructure.
This gives you complete control over data privacy and can reduce costs at scale.
┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐
│ Your │────▶│ vLLM / │────▶│ GPU Server │
│ Backend │ │ Ollama │ │ (A100, H100, or │
│ │◀────│ │◀────│ consumer RTX) │
└─────────────┘ └─────────────┘ └─────────────────────────┘
Inference server Your hardware or cloud
Inference Server Comparison
| Tool | Strength | Use Case |
|------|----------|----------|
| vLLM | Highest throughput (PagedAttention) | Production multi-user serving |
| Ollama | Simplest setup | Development, single-user |
| llama.cpp | CPU inference, low VRAM | Edge devices, limited GPU |
| TGI | Hugging Face ecosystem | HF model serving |
| LocalAI | OpenAI-compatible, Docker-native | Drop-in OpenAI replacement |
Performance: vLLM vs Ollama vs llama.cpp
Requests per second at peak load:
vLLM ████████████████████████████████████ ~120-160 req/s
llama.cpp █████ ~10-15 req/s (GPU mode)
Ollama ███ ~1-3 req/s (sequential)
Note: vLLM sustains roughly an order of magnitude higher throughput than llama.cpp
for multi-user workloads thanks to continuous batching, and far more than Ollama's
largely sequential serving.
Running Ollama Locally
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.2
# Run it
ollama run llama3.2
# Or use the API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain quantum computing"
}'
vLLM for Production
# Run vLLM server (requires GPU)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
# Compatible with OpenAI SDK
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
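Because the endpoint is OpenAI-compatible, the official openai Python client can be pointed at the local server unchanged. A minimal sketch (port and model name match the command above):

from openai import OpenAI

# The API key is required by the client but ignored by a default vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)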
Self-Hosted Tradeoffs
| Pros | Cons |
|------|------|
| Complete data privacy | GPU infrastructure costs (~$2-3K/mo cloud) |
| No per-token costs | Model quality gap vs GPT-4/Claude |
| Full customization (fine-tuning) | Operational complexity |
| Works offline | Keeping up with model releases |
Best for: High-volume applications where API costs would be
prohibitive, privacy-sensitive data, offline/air-gapped environments, teams
wanting full control.
Pattern 5: RAG (Retrieval-Augmented Generation)
RAG is less a deployment pattern and more an architectural pattern that can
be combined with any of the above. It augments LLM responses with retrieved
context from your own data.
┌───────────────────────────────────────┐
│ Document Ingestion │
│ ┌─────────┐ ┌──────────┐ ┌──────┐ │
│ │ Parse │─▶│ Chunk │─▶│Embed │ │
│ │ Docs │ │ Text │ │ │ │
│ └─────────┘ └──────────┘ └──┬───┘ │
└──────────────────────────────────┼────┘
│
▼
┌────────────┐ ┌─────────────────────────────────────────────────┐
│ User │───▶│ Query Flow │
│ Question │ │ ┌───────┐ ┌──────────┐ ┌────────────────┐ │
└────────────┘ │ │ Embed │──▶│ Vector │──▶│ Retrieve Top-K │ │
│ │ Query │ │ Search │ │ Documents │ │
│ └───────┘ └──────────┘ └───────┬────────┘ │
└─────────────────────────────────────┼───────────┘
│
┌───────────────────┘
▼
┌────────────────────────────────────────────────────────────────────┐
│ LLM Generation │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ System: You are a helpful assistant. Answer based on the │ │
│ │ provided context. │ │
│ │ │ │
│ │ Context: [Retrieved document chunks] │ │
│ │ │ │
│ │ Question: [User's question] │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ [Generated Answer] │
└────────────────────────────────────────────────────────────────────┘
Vector Database Options
| Database | Type | Best For |
|----------|------|----------|
| Pinecone | Managed | Zero-ops, serverless scale |
| Qdrant | Open source (Rust) | Complex filtering, self-hosted |
| Weaviate | Open source (Go) | Multi-modal, GraphQL API |
| pgvector | PostgreSQL extension | Teams already on PostgreSQL |
| Chroma | Open source (Python) | Rapid prototyping |
| Milvus | Open source | Billion-scale vectors |
pgvector vs Dedicated Vector DBs
Recent benchmarks show pgvectorscale (Timescale's enhanced pgvector)
achieving 471 QPS at 99% recall on 50M vectors—11x better than Qdrant and competitive
with Pinecone, at ~75% lower cost when self-hosted.
Choose pgvector when: You're already on PostgreSQL, want unified
data management, or need transactional consistency between metadata and vectors.
Choose dedicated vector DB when: You need specialized features
(billion-scale, complex filtering), want managed infrastructure, or have
extreme performance requirements.
Decision Framework
START
│
▼
┌─────────────────────────────────────────┐
│ Is data privacy critical? │
│ (medical, financial, government) │
└───────────────────┬─────────────────────┘
│
Yes ───────┼─────── No
│ │ │
▼ │ ▼
┌─────────────┐ │ ┌────────────────────────────────┐
│ Self-hosted │ │ │ High volume (>100K queries/mo)? │
│ or Browser │ │ └───────────────┬────────────────┘
│ (see our │ │ │
│ WebGPU post)│ │ Yes ───────┼─────── No
└─────────────┘ │ │ │ │
│ ▼ │ ▼
│ ┌─────────────┐ │ ┌─────────────────┐
│ │ Calculate: │ │ │ Direct API │
│ │ Self-host │ │ │ (simplest) │
│ │ vs API cost │ │ │ or │
│ │ │ │ │ OpenRouter │
│ └─────────────┘ │ │ (flexibility) │
│ │ └─────────────────┘
│ │
│ Need complex orchestration?
│ (RAG, agents, chains)
│ │
│ Yes ───────┼─────── No
│ │ │
│ ▼ ▼
│ ┌─────────────┐ ┌────────────────┐
│ │ Framework │ │ Direct API │
│ │ (LangChain, │ │ + custom code │
│ │ LlamaIndex) │ └────────────────┘
│ └─────────────┘
│
Canonical Architecture: Document Q&A Repository
Let's design a production system for a fictional company, Acme Corp,
that needs to let employees ask questions about their internal documentation:
policy manuals, technical docs, HR guides, and product specs.
Requirements
- ~50,000 documents (PDFs, Word docs, Markdown, HTML)
- ~500 daily active users, ~2,000 queries/day
- Must cite sources in answers
- Confidential data (cannot use third-party APIs without approval)
- Needs to be accurate—wrong answers are worse than no answer
- Budget: moderate (not unlimited, not shoestring)
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Document Ingestion Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │
│ │ Document │ │ Unstructured│ │ Chunk │ │ Embedding │ │
│ │ Sources │──▶│ .io │──▶│ (500 │──▶│ Model │ │
│ │ (S3, GDrive) │ (parse) │ │ tokens) │ │(text-embed) │ │
│ └───────────┘ └───────────┘ └───────────┘ └──────┬──────┘ │
│ │ │
│ ┌───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ PostgreSQL + pgvector │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │ │
│ │ │ documents │ │ chunks │ │ embeddings │ │ │
│ │ │ (metadata) │ │ (text, │ │ (vector, │ │ │
│ │ │ │ │ doc_id) │ │ chunk_id) │ │ │
│ │ └─────────────┘ └──────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Query Flow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────┐ ┌───────────────────────────────────────────────┐ │
│ │ User │ │ API Gateway (FastAPI) │ │
│ │ Query │──▶│ │ │
│ └───────────┘ └───────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Query Embedding │ │ Query Expansion │ │
│ │ (same model as │ │ (optional: │ │
│ │ ingestion) │ │ HyDE, multi- │ │
│ └────────┬─────────┘ │ query) │ │
│ │ └──────────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Hybrid Search │ │
│ │ (vector + BM25) │◀─── pgvector + pg_trgm │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Reranker │◀─── Cross-encoder model │
│ │ (top 20 → top 5)│ (bge-reranker or Cohere) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ LLM Generation │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ You are an Acme Corp assistant. Answer using │ │ │
│ │ │ ONLY the provided documents. If unsure, say so.│ │ │
│ │ │ │ │ │
│ │ │ Documents: │ │ │
│ │ │ [1] HR Policy v2.3: "Vacation days accrue..." │ │ │
│ │ │ [2] Employee Handbook: "Time off requests..." │ │ │
│ │ │ │ │ │
│ │ │ Question: How many vacation days do I get? │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ Claude Sonnet (via API) │ │ │
│ │ │ or vLLM (self-hosted Llama) │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Response with Citations │ │
│ │ "You receive 15 vacation days per year [1]. │ │
│ │ Requests should be submitted 2 weeks in │ │
│ │ advance through the HR portal [2]." │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Technology Choices
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Document Parsing | Unstructured.io | Handles PDFs, Word, HTML; extracts tables |
| Chunking | Semantic chunking | Respects document structure |
| Embeddings | text-embedding-3-small | Good quality, low cost ($0.02/1M tokens) |
| Vector Store | PostgreSQL + pgvector | Unified data layer, transactional consistency |
| Reranker | bge-reranker-v2-m3 | Open source, runs locally, high quality |
| LLM (Option A) | Claude Sonnet via API | Best quality, approved for internal data |
| LLM (Option B) | Llama 3.3 70B via vLLM | Self-hosted, no data leaves premises |
| Framework | LlamaIndex | Best for document indexing and retrieval |
| API Layer | FastAPI | Async, streaming support, OpenAPI docs |
Key Design Decisions
1. Hybrid Search
Vector search alone misses keyword matches. We combine:
- Vector similarity (semantic meaning)
- BM25 / full-text search (exact keywords)
PostgreSQL's pg_trgm extension provides fast trigram search alongside pgvector.
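A sketch of this retrieval step, assuming a chunks table with id, chunk_text, and embedding columns (the names are illustrative) and a query embedding produced by the same model used at ingestion; the two result lists are fused with reciprocal rank fusion:

import psycopg2
from pgvector.psycopg2 import register_vector

# Hypothetical schema: chunks(id, chunk_text, embedding vector(1536)).
conn = psycopg2.connect("postgresql://...")
register_vector(conn)  # lets us pass numpy arrays as vector parameters

def hybrid_search(query_text, query_embedding, k=20):
    with conn.cursor() as cur:
        # Semantic candidates: cosine distance via pgvector's <=> operator.
        cur.execute(
            "SELECT id FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (query_embedding, k),
        )
        vector_hits = [row[0] for row in cur.fetchall()]

        # Keyword candidates: trigram similarity via pg_trgm.
        cur.execute(
            "SELECT id FROM chunks ORDER BY similarity(chunk_text, %s) DESC LIMIT %s",
            (query_text, k),
        )
        keyword_hits = [row[0] for row in cur.fetchall()]

    # Reciprocal rank fusion: reward chunks ranked highly by either method.
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(hits):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]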
2. Reranking
Initial retrieval returns ~20 candidates. A cross-encoder reranker (which is
more accurate but slower than bi-encoders) re-scores these to pick the best 5.
This dramatically improves answer quality.
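One way to run that reranker locally is sentence-transformers' CrossEncoder wrapper around the bge-reranker-v2-m3 model chosen above; a minimal sketch:

from sentence_transformers import CrossEncoder

# Load once at startup; a cross-encoder scores each (query, passage) pair jointly.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]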
3. Citation Enforcement
The prompt explicitly requires citing sources by number. We validate that
all citation numbers correspond to retrieved documents.
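The validation itself is a simple check that every bracketed number in the answer maps to a retrieved source; a minimal sketch:

import re

def validate_citations(answer: str, num_sources: int) -> bool:
    """Return True only if every [n] citation refers to a retrieved document."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= num_sources for n in cited)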
4. Confidence / Abstention
The prompt instructs the model to say "I don't know" rather than hallucinate.
We also monitor retrieval scores—if all retrieved documents have low similarity,
we proactively warn the user.
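A sketch of that retrieval-score guard; the threshold is an assumed value you would calibrate against a labeled evaluation set:

LOW_CONFIDENCE_THRESHOLD = 0.35  # assumption; tune on your own queries

def low_confidence_warning(similarity_scores: list[float]) -> str | None:
    # If even the best-matching chunk scores poorly, flag the answer to the user.
    if not similarity_scores or max(similarity_scores) < LOW_CONFIDENCE_THRESHOLD:
        return "No closely matching documents were found; treat this answer with caution."
    return None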
Cost Estimate (2,000 queries/day)
Option A: Claude Sonnet API
─────────────────────────────────────────
Embeddings (OpenAI): ~$5/mo
LLM (Claude Sonnet): ~$150-300/mo
PostgreSQL: ~$50/mo (managed)
Reranker (local): $0 (runs on API server)
─────────────────────────────────────────
Total: ~$200-350/mo
Option B: Self-Hosted (vLLM + Llama 70B)
─────────────────────────────────────────
Embeddings (local): $0 (runs on same GPU)
LLM (vLLM on A100): ~$2,000-3,000/mo (cloud GPU)
PostgreSQL: ~$50/mo
─────────────────────────────────────────
Total: ~$2,000-3,000/mo
Breakeven: Self-hosting makes sense at ~10-15K queries/day
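The breakeven figure falls out of a quick back-of-envelope calculation; the per-query and fixed costs below are rough midpoints taken from the estimates above, not measured numbers:

api_llm_cost_per_query = 0.005      # ~$300/mo of Claude Sonnet usage / 60,000 queries
self_hosted_fixed_per_month = 2000  # GPU + hosting, roughly flat regardless of volume

breakeven_per_month = self_hosted_fixed_per_month / api_llm_cost_per_query
print(f"{breakeven_per_month / 30:,.0f} queries/day")  # ~13,000/day, in the 10-15K range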
Implementation Skeleton
from llama_index.core import VectorStoreIndex, Settings
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.anthropic import Anthropic
from llama_index.postprocessor.cohere_rerank import CohereRerank
# Configure components
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = Anthropic(model="claude-sonnet-4-20250514")
# Connect to PostgreSQL with pgvector
vector_store = PGVectorStore.from_params(
connection_string="postgresql://...",
table_name="document_embeddings",
embed_dim=1536,
)
# Build index
index = VectorStoreIndex.from_vector_store(vector_store)
# Query with reranking (CohereRerank shown here; for the fully self-hosted Option B,
# a local cross-encoder such as bge-reranker-v2-m3 fills the same postprocessor slot)
query_engine = index.as_query_engine(
similarity_top_k=20,
node_postprocessors=[
CohereRerank(top_n=5, model="rerank-english-v3.0")
],
)
# Ask a question
response = query_engine.query(
"What is the policy for requesting time off?"
)
print(response.response)
print("Sources:", [n.node.metadata for n in response.source_nodes])
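The skeleton above covers only the query path. The ingestion side looks roughly like the sketch below, using the same stack; SimpleDirectoryReader stands in for the Unstructured.io parsing stage, and the ./docs path and chunk sizes are placeholders:

from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.postgres import PGVectorStore

# Must match the embedding model used at query time.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Parse source documents from a local folder (swap in Unstructured.io for PDFs/Word).
documents = SimpleDirectoryReader("./docs").load_data()

vector_store = PGVectorStore.from_params(
    connection_string="postgresql://...",
    table_name="document_embeddings",
    embed_dim=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Chunk (~500 tokens with overlap), embed, and persist into PostgreSQL in one pass.
VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=500, chunk_overlap=50)],
)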
Conclusion
There's no single "best" AI architecture—only the best architecture for your
specific constraints. Here's my decision framework:
- Just getting started? Direct API (Anthropic or OpenAI). Simplest path to production.
- Want flexibility? OpenRouter or similar gateway. Easy model switching without code changes.
- Building RAG or agents? LlamaIndex for data-heavy apps, LangChain for complex orchestration, DSPy for performance.
- Privacy-critical or high-volume? Self-host with vLLM. Higher upfront cost, lower marginal cost.
- Client-side only? WebGPU + Transformers.js. See our browser AI post.
The AI infrastructure landscape is evolving rapidly. The patterns in this guide
will likely look different in a year. What won't change: the need to match your
architecture to your actual constraints, not hypothetical ones.
Build the simplest thing that works. Measure. Iterate.
Further Reading